Tuesday, March 27, 2007

Diversity of Characters

It looks like I will be spending the day trying to coordinate character sets for my database conversion. Programmers (and computer users, for that matter) should be aware of the encoding that they use for their writing. The encoding is the mapping between the characters that you see on your computer screen and the bits stored on the computer.

Joel on Software has an interesting read on Unicode. Joel seems to have an extremely low opinion of American programmers in general and of PHP in particular. (The core PHP engine is maintained by Zend, a company based in Israel.)

I happen to have a positive view of both PHP and American programmers.

If anything, I am more apt to question the people trying to stuff Unicode down our throats than to question the people who are in the trenches trying to make programs work.

Back to Unicode. In the early days of computer programming, processing and storage capacity were expensive. Programmers encoded data in all caps because computer space was too valuable to waste on inconsequential details like case.

In the early days, computers were so expensive that if a country wanted machines that handled its own character set, the natural choice was to design the machines and software from the ground up. It was not really until the '80s that the price of computing capacity dropped to the point where people could start thinking of cheap machines with the capacity to handle the complexity of different languages.

The natural impulse of computer science was to handle the diversity of languages through parallel evolution of different operating systems and character sets. This was accompanied by the coevolution of technologies to translate between the different operating systems.


As a student of languages and linguistics, I was actually hoping that different language groups in the world would end up developing their own approaches to software. I was hoping that the parallel evolution of computer science in different language groups would lead to a diversity of operating systems.

Of course, there were powerful interests who wanted to see one operating system dominate the entire world.

Since the existence of different character sets was leading to the parallel evolution of operating systems in different cultures, powerful multinationals wanting to dominate the world had to act. They did so by stuffing a new standard called Unicode down our collective gullets.

The goal of the Unicode effort was to stomp out this natural evolution of computer science by encoding the diversity of the known languages of the world into a single character set. The first hope was that this could be done with a 16-bit number. That turned out to be too small to encode all of the subtleties of Chinese and other ideographic languages. There was some hope that using 32 bits would suffice. Each character you typed on the screen would be a number between 0 and 4,294,967,295. With a 32-bit character set, the range of values available for each character would be roughly 33,818,640 times larger than the 127 or so values an English writer actually needs to record a character.
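For the curious, the arithmetic behind those numbers works out like this (a quick Python sketch, assuming English needs only the 127 ASCII code points):

ascii_codes = 127            # code points an English writer actually needs
utf16_space = 2 ** 16        # 65,536 possible values in a 16-bit character
utf32_space = 2 ** 32        # 4,294,967,296 possible values in a 32-bit character

print(utf16_space)                  # 65536
print(utf32_space)                  # 4294967296
print(utf32_space // ascii_codes)   # 33818640 -- the wasted ratio for English text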

A good writing application doesn't just record the written word. An application might also record revision history and other metadata. Implementing a universal character set means a great deal of wasted space for English writers.

A 32-bit character set is so overbearing that no one really wants to use it. The current groupthink is to push an encoding called UTF-8.

UTF-8 uses a single byte for the ASCII characters and multi-byte sequences for everything else. ASCII text translates directly into UTF-8; letters from other alphabets simply take more bytes.
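Here is a small Python sketch of how that plays out (the byte counts are standard UTF-8 behavior; the sample characters are just illustrations):

# UTF-8 spends one byte on ASCII and longer sequences on other scripts.
for ch in ["A", "é", "я", "中", "𐍈"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)

# A  1 byte   b'A'                  -- ASCII survives unchanged
# é  2 bytes  b'\xc3\xa9'           -- accented Latin letters take two bytes
# я  2 bytes  b'\xd1\x8f'           -- Cyrillic takes two bytes
# 中 3 bytes  b'\xe4\xb8\xad'       -- CJK ideographs take three bytes
# 𐍈 4 bytes  b'\xf0\x90\x8d\x88'   -- characters outside the BMP take four bytes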

Unfortunately, variable-length characters are problematic for the many languages and database applications that assume the binary representation of every character is the same size. By adopting UTF-8, you actually end up precluding the use of primitive fixed-length databases, which is sad because primitive fixed-length databases are fast and easy to program.
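A minimal sketch of why the fixed-width assumption breaks (the name and field width here are invented for illustration):

name = "Schrödinger"
raw = name.encode("utf-8")
print(len(name), len(raw))   # 11 12 -- character count no longer equals byte count

truncated = raw[:5]          # a fixed five-byte field chops the ö in half...
print(truncated.decode("utf-8", errors="replace"))   # 'Schr' plus a replacement character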

Since UTF-8 gives precedence to English by making our letters the smallest and placing them at the top of the chart, I actually see UTF-8 as more imperialistic than the paradigm in which Americans used ASCII and other linguistic groups evolved their own character sets.

In some ways, I see the debate over character sets as a reflection of the overall debate between the classical liberal world view and progressive world view. The classical liberal view would have Americans continuing to pursue the development of operating systems and character sets that best allow the expression of what Americans want to accomplish while people in other linguistic traditions develop character sets and operating systems that best express their desires.

Parallel evolution leads to greater diversity.

Since we are interested in communicating with the world, there would be a natural coevolution of schemas for translating ideas between cultures.

The classical liberal approach to the diversity of languages would be to allow for the parallel evolution of different ideas and character sets. The progressive approach is to try to create a single universal character set and to force everyone to use that one universal character set.

BTW, you may notice that writers favoring Unicode often take a very condescending attitude to traditional coding techniques.

My thoughts on this issue are that programmers should store information at a cardinality that best matches the data. For example, if you are making shoes, you might have 5 colors, 10 sizes, and 4 widths. There are only 200 permutations of this shoe description. Ideally, the character set in your shoe database would not waste much more space than what is needed to express these 200 permutations. The wasted space may not look like a lot when you are talking about one or two shoe orders, but when you are talking about a database recording millions of shoes, the inefficiencies add up.
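As a sketch of that kind of tight packing (using the hypothetical 5 x 10 x 4 shoe catalog above), the 200 combinations fit comfortably in a single byte:

# 5 colors x 10 sizes x 4 widths = 200 combinations, which fits in one byte (0-255).
def pack_shoe(color, size, width):
    # color in 0..4, size in 0..9, width in 0..3
    return (color * 10 + size) * 4 + width

def unpack_shoe(code):
    code, width = divmod(code, 4)
    color, size = divmod(code, 10)
    return color, size, width

code = pack_shoe(color=3, size=7, width=2)
print(code)                # 150 -- one byte per shoe description
print(unpack_shoe(code))   # (3, 7, 2)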

Storing this data in ASCII format is already inefficient. Storing data on shoe orders in a 32-bit Unicode encoding multiplies that inefficiency by four. When ordering one pair of shoes, the wasted space doesn't really matter. When you start talking about hundreds of millions of shoes, the space starts to matter.

I am not completely dismissive of Unicode. The shoe company may want to sell its shoes in every country. The sales department is likely to want a database that contains the names of their shoes (along with sales text) in every language (including Klingon for the big push at the Star Trek convention). A database might encode the attributes of the shoe in ASCII and the names of the shoes in Unicode.

Having a mix of character sets is both more efficient and allows for greater diversity than trying to force one universal character set at the operating system level.

In most cases, the cardinality of the information you are collecting is quite low, while the quantity of items that you are recording is large. For example, in DNA analysis there are only four nucleotide bases (and roughly twenty amino acids once you move up to proteins), yet human chromosome 1 alone has roughly 220 million base pairs. If you are doing DNA analysis, you will want to encode the nucleotides with the smallest symbol possible so that you can analyze the complexity of the DNA string.
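Since there are only four bases, each one needs just two bits. A minimal packing sketch (the helper names are my own):

# Four nucleotide bases fit in two bits each, so four bases pack into a single byte.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def pack_dna(sequence):
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def unpack_dna(value, length):
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

seq = "GATTACA"
packed = pack_dna(seq)
print(packed.bit_length(), "bits packed vs", len(seq) * 8, "bits as ASCII")
print(unpack_dna(packed, len(seq)))   # GATTACA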

In the analysis of things like protein folding, you add the complexity of space and time to the analysis. Much of the really interesting computer science these days pushes the limits of information theory.

Even though I've been condemned in life to work on less interesting programs, my sentiments lie with those programmers pushing computer science to its limits. The design of data should be driven by the structure of the data under analysis and not by the anti-American sentiments of the "progressives" in the sociology department of the university.

IMHO, real diversity comes by allowing the free evolution of different approaches to the problems of the world. The grand schemes that are supposed to force diversity upon us tend to be inefficient and become overbearing. Forced conformity does not create real diversity.

Joel on Software smugly notes at the end of his article that his company stores everything in two-byte UCS-2. I think that the better approach is to store data in the most compressed format possible and to have translation tables that let you expand as needed. Joel's programming style may be appropriate for small web publishing firms that are trying to reach a universal audience. However, it is not appropriate for the interesting programming problems that involve tons of data and computing capacity.
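A toy sketch of the compressed-plus-translation-table idea (the codes and table are invented for illustration):

# Store small integer codes in the big table; expand them to Unicode text only for display.
COLOR_NAMES = {0: "black", 1: "white", 2: "rouge écarlate", 3: "深蓝"}   # hypothetical lookup table

stored_orders = [(0, 42), (2, 7), (3, 9)]   # (color_code, quantity) -- compact on disk

for color_code, quantity in stored_orders:
    print(quantity, "pairs in", COLOR_NAMES[color_code])   # expanded only when needed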

2 comments:

Scott Hinrichs said...

Being a programmer, I found this post very interesting. My company grapples with the issues of producing a broad variety of data that works both for our vendors and customers. It is sometimes cheaper to buy more capacity to achieve uniformity than to try to translate everything.

You might find the writings of Paul Graham to be interesting. Torn between his love and training in art and in computer languages, he ended up making it big in the run-up to the dot com bubble burst. He's now a venture capitalist, but he has a deep love of computer languages and espouses more pure languages as being the most efficient in the long run.

y-intercept said...

"It is sometimes cheaper to buy more capacity to achieve uniformity than to try to translate everything."

I sympathize with the sentiment that it is just easier to use an inefficient protocol and to buy more equipment. Equipment is cheap.

For that matter, I agree with RFC 2277, which requires Internet protocols to identify the encoding used for character data and to support UTF-8.

I am not sure that taking the easy way out really is the best route. It seems to me that the approach where you simultaneously develop systems that work with different character sets and protocols will ultimately be more robust and faster than one developed with a one-size-fits-all protocol.