Issue 62, December 22, 2008
Unicode: a Primer
A Few Basic Character Encoding Concepts
Before we get into Unicode itself, we need to understand the basics of how characters are represented in computer memory and on the screen. As you know, at the most basic level computers are nothing more than very powerful number-crunching machines. The computer's Central Processing Unit, or CPU, is composed of millions of tiny transistors, each of which can be set to one of two states: off or on. Because of this binary nature, CPUs manipulate numbers using base 2, or binary, arithmetic; in other words, all numbers are represented by 0s and 1s. That means that any time you press a key on your keyboard you are ultimately sending a binary code to the computer's CPU.
Each 0 or 1 stored in a computer's memory is known as a bit. Computers handle and store data in chunks of 8 bits called bytes. The earliest microcomputers used 8-bit processors: CPUs that could process one 8-bit byte of data at a time. In the early days of computing, one bit in every byte was used for internal "housekeeping" purposes, leaving only 7 bits available for storing data. For these and other historical reasons, characters of text could only be represented by a maximum of 7 bits.
Because computers needed to be able to pass text data to one another, in the early 1960s the American Standards Association developed a standard character encoding system known as ASCII, the American Standard Code for Information Interchange. Due to the 7-bit limitation, this standard only specified 128 (2^7) code points, numbered from 0 to 127. Upper and lower case Latin letters, numbers and common punctuation were assigned unique codes in this table. For instance, upper-case A is ASCII 65, the numeral 3 is ASCII 51, a space character is ASCII 32, and so forth. This standard is still widely used today and is consistent across all operating systems.
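To make the code-point idea concrete, here is a minimal sketch. It is written in Python rather than Revolution script, purely as a neutral way to inspect character codes; the helper name ascii_code is my own invention for illustration.

```python
# ASCII is a 7-bit code, so it has 2**7 = 128 code points (0 through 127).
ASCII_SIZE = 2 ** 7


def ascii_code(ch):
    """Return the ASCII code of one character, or raise if it is not ASCII."""
    code = ord(ch)
    if code >= ASCII_SIZE:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return code


# Print each character with its decimal code and its 7-bit binary form.
for ch in ("A", "a", "3", " "):
    print(repr(ch), ascii_code(ch), format(ascii_code(ch), "07b"))
```

Every code printed fits in seven binary digits, which is exactly the limitation described above.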
In the 1970s and 1980s several things happened that brought increasing complexity and inconsistency to the original ASCII standard. More efficient microprocessor designs freed up the 8th bit in the byte so that the size of the character encoding table could be doubled to 256 (2^8) code points. However, at the same time computer use outside of the U.S. and Europe skyrocketed, while intense competition emerged between various operating systems. This meant that scores of different uses for the upper 128 code points emerged. IBM, Apple Computer, and Microsoft each developed a standard character mapping for what is sometimes called "extended ASCII". The Microsoft version became the most commonly used and is closely related to the ISO-8859-1 (Latin 1) character set. Apple's extended ASCII is mapped to completely different characters and is known as Mac Roman.
At the same time, people who needed to express non-Latin character sets created scores of fonts that mapped the upper 128 code points to other alphabets, such as Cyrillic, Hebrew, Greek, or East European Latin alphabets. To further confuse matters, character-based writing systems such as Chinese and Japanese could not be expressed in only 256 code points, so systems were devised under which pairs of bytes were combined and mapped to large character lookup tables that could handle the thousands of characters needed for these writing systems.
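The disagreement over the upper 128 code points is easy to see in a short sketch. I am using Python here only because it ships with decoders for these legacy tables; the byte value 0xE9 and the three code pages (Microsoft's cp1252, Apple's mac_roman, and IBM's cp437) are my own choice of illustration.

```python
# One byte value from the "upper 128" range, interpreted under three
# different 8-bit code pages that ship with Python's standard codecs.
raw = bytes([0xE9])

decoded = {name: raw.decode(name) for name in ("cp1252", "mac_roman", "cp437")}

for name, ch in decoded.items():
    print(f"{name:10} byte 0xE9 -> {ch!r} (U+{ord(ch):04X})")
```

The same stored byte comes out as a different character under each vendor's table, which is the root of the confusion described above.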
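As a rough illustration of such a double-byte scheme, the sketch below uses Shift-JIS, one widely used Japanese encoding. Picking Shift-JIS is my assumption, since no particular scheme is named here, and Python is used only as a convenient inspection tool.

```python
# Shift-JIS keeps ASCII characters as single bytes and uses byte pairs
# for Japanese characters.
text = "A日本"

encoded = text.encode("shift_jis")
byte_counts = [len(ch.encode("shift_jis")) for ch in text]

for ch, count in zip(text, byte_counts):
    print(repr(ch), "takes", count, "byte(s)")
print("total bytes:", len(encoded))
```

A program that does not know the text is Shift-JIS would see five unrelated single-byte characters instead of three.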
During the 1980s and 1990s it was a common occurrence for people exchanging documents electronically internationally to find that what started as, say, Cyrillic text on one end came out as hopeless gobbledygook on the other end, because the recipient didn't have a compatible character font. Even relatively common punctuation and typographic characters, such as European currency symbols, "curled" quotes, and dashes could be rendered incorrectly on the recipient's system, due to incompatible encoding schemes between the sender and recipient systems.
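The gobbledygook effect can be reproduced in a few lines. This is a sketch in Python, not Revolution script, and the two Cyrillic code pages (KOI8-R on the sending side, Windows-1251 on the receiving side) are assumptions chosen for illustration.

```python
original = "Я люблю тебя"

# The sender stores the text using the KOI8-R code page...
sent_bytes = original.encode("koi8_r")

# ...but the recipient's system assumes Windows-1251.
garbled = sent_bytes.decode("cp1251")

# Decoding with the code page the sender actually used restores the text.
recovered = sent_bytes.decode("koi8_r")

print("sent:     ", original)
print("received: ", garbled)
print("recovered:", recovered)
```

Nothing was damaged in transit; the bytes are identical on both ends and only the lookup table differs.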
Unicode to the Rescue
In the 1980s the Unicode Consortium emerged to try to address this confusing situation. The character encoding standard it produced is designed to provide a way to display all of the world's languages by using a larger, 16-bit character table. The goal of Unicode was to assign each character in all the world's languages a unique code number. The Unicode Consortium's credo is:
Unicode provides a unique number for every character,
The road to the Unicode utopia, however, has been long and arduous, and even today many of the most common computing applications rely on ASCII text for storage and transmission of text data. Revolution is one of these, but fortunately for us, it also provides reasonably robust, if not 100% complete, support for Unicode.
In order to understand how to use Unicode text in Revolution, it is important to be familiar with a few terms and concepts. Unlike ASCII, Unicode is much more than simply a collection of character codes. It also defines things like sort order, writing direction, when a character is represented by a specific glyph, and much more. Therefore implementing Unicode is much more complex than implementing ASCII. You can read more about how Revolution implements Unicode in section 6.4 of the Revolution User Guide.
Unicode and Revolution
With this background we can finally talk about Unicode as it is implemented in Revolution. Under the Unicode standard, there are several encoding systems, most notably UTF-8, UTF-16, and UTF-32. As the names imply, they store text in units of 8, 16, and 32 bits respectively. Revolution uses the UTF-16 encoding. However, Revolution has the ability to transcode between UTF-16 and several other common encodings. Thus, the first Important Thing to remember when using Unicode in Revolution is:
1. Unicode in Revolution is always UTF-16.
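To get a feel for how the three encodings differ, the sketch below counts the bytes each one needs for a single character. It is Python, used only as a neutral inspection tool; the helper name encoded_sizes is hypothetical, and the "-le" suffix simply fixes a byte order so that no byte-order mark is added to the count.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Return the number of bytes each encoding needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in ("a", "Я", "開"):
    print(repr(ch), encoded_sizes(ch))
```

All three sample characters take exactly two bytes in UTF-16, which is why Revolution's Unicode text is so often described in terms of double-byte characters.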
Another complication comes from the fact that different CPUs store sequences of bytes in different orders. Since Unicode characters in UTF-16 typically are made up of two bytes each (often referred to as double-byte characters), that means that the order in which Unicode characters are stored can be different when comparing, say, PowerPC (PPC) processors with Intel processors. Motorola and PPC processors store the most significant byte first in the sequence and so are called "big endian"; Intel processors store the most significant byte last in the sequence and are called "little endian". The details of why and how this works aren't important here, but what you need to remember is Important Thing number 2:
2. Revolution uses the Unicode byte order determined by the host processor.
What does this mean for you as a developer? Let's consider the case of two users. User 1 works on an older Mac system that runs on a PPC processor. User 2 works in Windows on an Intel processor. If User 1 creates Unicode text, saves it to a file, and sends it to User 2, when User 2 tries to read the file it will come out scrambled because it came from a big-endian system. Even though it is possible to convert big-endian Unicode to little-endian, it adds pain, complexity and uncertainty. That's why I recommend Corollary 1 to Important Thing number 2:
Use UTF-8 to store and transfer Unicode text in Revolution.
It's not hard, and I explain how to do this later on.
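The byte-order problem can be sketched outside of Revolution. The Python snippet below is only an illustration of the concept, not of how Revolution stores text internally; it encodes one letter both ways and then misreads the big-endian bytes as little-endian.

```python
letter = "Я"  # code point U+042F (decimal 1071)

big = letter.encode("utf-16-be")     # most significant byte first (PPC style)
little = letter.encode("utf-16-le")  # least significant byte first (Intel style)

# What a little-endian reader sees when handed the big-endian bytes.
misread = big.decode("utf-16-le")

# UTF-8 is defined as a plain byte sequence, so it has no byte-order variants.
utf8 = letter.encode("utf-8")

print("big endian:   ", big.hex())
print("little endian:", little.hex())
print("misread as:    U+%04X" % ord(misread))
print("utf-8:        ", utf8.hex())
```

The two UTF-16 forms hold the same two bytes in opposite order, so reading one as the other silently yields a different, unrelated character.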
Tips for Using Unicode in Revolution
Typing Unicode text in fields.
This is a good place to start because it's the easiest. Revolution fields can handle Unicode text input without any intervention by the developer. That is because Revolution simply uses the text input methods supplied by the host operating system. So if you want to type Japanese characters into a field, you simply select the Japanese text input system you want to use and start typing. Revolution knows how to render it properly in the field, and it is then ready for use. Bottom line: if you want to learn how to select the text input method on your OS, see the help documentation for that OS.
However, Unicode text input in Rev is not perfect. Revolution still has trouble rendering right-to-left languages like Hebrew and Arabic while you are typing them. Specifically, it will properly render characters in a word from right to left, but when you type a space to begin a new word, the new word is inserted to the right of the previous word, not to the left as it should be. For this reason I recommend creating Hebrew and Arabic texts outside of Revolution and importing them, rather than trying to type them within Revolution.
put charToNum("a") -- returns 97
put numToChar(97) -- returns the letter 'a'
You can use the numToChar() function to create a rudimentary ASCII table. Just create a new field, name it "ascii" and run this routine:
put empty into fld "ascii"
repeat with i = 0 to 255
  -- append one line per code: the number, a tab, and the character
  put i & tab & numToChar(i) & return after fld "ascii"
end repeat
That's how these two functions work by default. But you can tell Revolution to expect Unicode values for these two functions by first setting the useUnicode property to true.
This brings up Important Thing number 3:
3. The useUnicode property only affects the charToNum() and numToChar() functions.
There is a common misconception among Rev developers who are new to Unicode that the useUnicode property is a kind of magic switch that will automatically change all of your text operations into Unicode. It's not. In fact, the useUnicode property might be more accurately named the useTwoByteCharsWithCharToNumAndNumToChar property. You can see why they went with useUnicode.
Let's look at how this works in practice. Let's say you have a field "russText" containing the sentence Я люблю тебя. The sentence begins with the upper case Russian letter 'Я'. If you wanted to find out which Unicode code point corresponds to that letter you would do this:
set the useUnicode to true
put charToNum(char 1 to 2 of fld "russText") -- returns 1071
Conversely, to render a Unicode character using its code point do this:
set the useUnicode to true
set the unicodeText of fld "russLetter" to numToChar(1071)
-- the letter 'Я' should appear in the field
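If you want to double-check a code point outside of Revolution, the relationship between the letter, the number 1071, and its two UTF-16 bytes can be sketched in Python. This is an illustration of the arithmetic only; the little-endian byte order is my assumption, standing in for an Intel machine.

```python
letter = "Я"

code_point = ord(letter)               # the character's Unicode code point
utf16_bytes = letter.encode("utf-16-le")

# The two bytes, read as one 16-bit little-endian number, give the code point.
from_bytes = int.from_bytes(utf16_bytes, "little")

print(code_point, len(utf16_bytes), from_bytes)
print(chr(1071))
```

This is why the Revolution examples ask for two chars at a time: the single letter occupies two bytes of UTF-16 text.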
The unicodeText property.
The previous example is a good way to introduce another important tool for using Unicode in Rev: the unicodeText property. If you want to move unicode text from field to field, you have to use this property. In the normal ASCII world you can just do this:
put field 1 into field 2
But in the brave new Unicode world if you want to put Unicode text into a field you have to set its unicodeText property:
set the unicodeText of fld "newPlace" to the unicodeText of fld \
    "oldPlace"
This all leads to Important Thing number 4:
4. The secret to manipulating Unicode text in fields lies in the unicodeText of the field.
So if you want to move chunks of text, you have to refer to chunks of the unicodeText:
-- Copying a Unicode character to another field
set the unicodeText of fld "letter" to \
    char 1 to 2 of the unicodeText of fld "sentence"
-- Moving words
set the unicodeText of fld "other" to \
    word 1 to 2 of the unicodeText of fld "this"
-- Inserting Unicode text from one field into another
get the unicodeText of fld "info"
set the unicodeText of fld "info" to \
    it && word 2 of line 2 of the unicodeText of fld "bottom"
Converting between single and double-byte encodings
When using Unicode text, especially if you are importing or exporting text from or to other systems or environments, you may need to convert your Unicode to a single-byte encoding system, or vice-versa. The most common reason for doing this is reading and writing UTF-8 files. As I mentioned above, I recommend storing your Unicode text in UTF-8 format if you are planning to share it with others or send it over the internet. UTF-8 is part of the Unicode standard, and is a way to store Unicode (double-byte) text as a sequence of single bytes, so that it can live in an ordinary single-byte text file and plain ASCII characters keep their usual codes. UTF-8 is especially important for encoding Unicode text for use in web browsers and email.
The keys to using UTF-8 text in Rev are the uniEncode() and uniDecode() functions. Let's say you've gotten some UTF-8 text from a web site and you want to display it in your Revolution stack. You store it in a file called myUniText.ut8. This is how you would read it in:
put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodetext of fld "display" to \
    uniencode(tRawTxt,"UTF8")
Conversely, to save Unicode text from Rev to a UTF-8 file, use uniDecode():
get the unicodeText of fld "myUniText"
put unidecode(it,"utf8") into url \
    "binfile:/path/to/file/myUniFile.ut8"
Here's Important Thing number 5:
5. For reliably transporting Unicode text, convert it and store it as UTF-8 text.
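The same save-and-reload round trip can be sketched outside of Revolution, which is handy for checking a file that another tool produced. The Python below is only an analogy to the uniDecode() and uniEncode() steps; the helper names save_utf8 and load_utf8 and the temporary directory are assumptions for illustration.

```python
import os
import tempfile


def save_utf8(path, text):
    """Write text to a file as raw UTF-8 bytes."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def load_utf8(path):
    """Read raw bytes from a file and decode them as UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


text = "Я люблю тебя"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myUniText.ut8")
    save_utf8(path, text)
    round_tripped = load_utf8(path)

print(round_tripped == text)
# Plain ASCII text is byte-for-byte identical in UTF-8.
print("abc".encode("utf-8") == b"abc")
```

Because the file holds a plain byte sequence, it reads back the same way on big-endian and little-endian machines.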
What about Unicode in buttons and menus?
So far, we've only been talking about Unicode text in fields. Almost none of that applies to buttons, primarily because buttons have no unicodeText property. Instead, the basic approach for displaying Unicode text in buttons and menus consists of two steps:
1. Set the textFont of the button to a Unicode font;
Unicode font names in Revolution take the form Font Name,language, where Font Name is the name of any font installed on the system, and language is the name of the language you want, or the term "unicode". For example, for Russian Cyrillic text I might use "Arial,Russian" as the font name; for Japanese, "Osaka,Japanese"; and for Greek, "Geneva,Unicode". Not every language can be used as the second part of a Unicode font name. For a complete list of valid language names see the Rev Dictionary entry for uniEncode.
2. Set the label of the button to the Unicode text.
One way to assign a Unicode label to a button is to reference some existing Unicode text in a hidden field. Let's say, for example, that we are making a stack for Mandarin Chinese speakers and we want to give our Start button a Chinese label, 開始. We could type or import the Unicode text to a field and use that field as the source text for the button label:
set the textFont of button "start" to "BiauKai,Chinese"
set the label of button "start" to the unicodeText of fld \
    "hiddenChinText"
One technique I often use for creating Unicode button labels is to store the Unicode label text in a custom property of the button. When I do this I store it as UTF-8 text to avoid the byte order problem when moving the stack from machine to machine. So first I would store the unicode text in a custom property:
set the chinLabel of button "start" \
    to unidecode(the unicodeText of fld "hiddenChinText","UTF8")
Once that was in place I would use the custom property as the source of the Unicode text:
set the textFont of button "start" to "BiauKai,Chinese"
set the label of button "start" \
    to uniencode(the chinLabel of btn "start","UTF8")
One more note on Unicode buttons: Because Unicode text doesn't always "travel" well from platform to platform, I usually set Unicode button labels and menu contents each time I go to the card, in a preOpenCard handler.
Unicode Ask and Answer dialogs
Ask and answer dialogs can display Unicode prompts, but you can't pass Unicode text directly in the ask and answer command arguments. Instead you use another handy technique for setting Unicode text: store the Unicode as entities in HTML text. Storing the htmlText of a field that contains Unicode text is another reliable way of keeping the Unicode text intact during transfers. It also is the only way to display Unicode text in ask and answer dialog prompts.
To see what I mean, let's look at the Chinese start button example above. In the first case I had the Unicode Chinese text 開始 in a text field "hiddenChinText". If I were to examine the htmlText of this field it would look something like this:
<font face="BiauKai" size="14" lang="zh-TW">&#38283;&#22987;</font>
Notice that the two Chinese characters are embedded in the htmlText as Unicode entities: &#38283; and &#22987;. HTML Unicode entities like this will reliably render as the proper Unicode characters in Revolution, regardless of the operating system the stack is running on. So to use Unicode characters in ask and answer prompts, do something like this:
put the htmlText of fld "hiddenChinText" into tChinPrompt
answer tChinPrompt with "Cancel" or "OK"
There is one other advantage of saving Unicode text as HTML entities—it is the best way to save Unicode text with text styles like bold and italic and font attributes like size and color.
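The numbers inside these entities are simply the decimal Unicode code points of the characters. The Python sketch below shows the conversion in both directions; it is an illustration only, and the helper name to_entities is hypothetical. It also does not escape characters such as & or <, which real HTML output would need.

```python
import html


def to_entities(text):
    """Replace every non-ASCII character with a decimal numeric entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


encoded = to_entities("開始")
decoded = html.unescape(encoded)

print(encoded)
print(decoded)
```

Since the entity form contains nothing but ASCII characters, it survives any transfer that plain ASCII text survives.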
Unicode stack title
I'll finish up this primer by mentioning one Unicode feature that is new in Revolution as of version 2.9: the ability to use Unicode text for the title of the stack window. Just set the unicodeTitle property of the stack to a valid Unicode string. Here's an example:
set the unicodeTitle of this stack to the unicodeText of fld \
    "russTitle"
In describing the state of Rev's Unicode implementation, I would say this: Unicode in Revolution is not perfect, but it is perfectly usable. If you master the basic concepts I've described here, and remember the Important Things I have listed, you will have the tools you need to diagnose and solve just about any problem that arises, and you'll be on your way to being able to produce Revolution applications for almost any language.
Gillam, Richard. Unicode Demystified. Addison-Wesley, 2003.
Revolution User Guide, section 6.4. Runtime Revolution, Ltd., 2008.
A stack with all of the examples in this article, along with many others, can be accessed at http://revolution.byu.edu/unicode/UnicodeInRev.rev.