Characters of the world


The worldwide spread of computers and their increase in power mean that text-based computing is now a reality around the globe.

In particular, libraries are being automated and wish to access, and be accessed from, systems internationally. Similarly, general-purpose information systems must now incorporate text from foreign sources.

Thus it has become a requirement that computers, when handling text, be able to recognise and manipulate text in different languages. Different languages often mean different scripts, but not always: the European languages are all served by the Latin alphabet with the addition of diacritics.

However, most computers are used in one language, so a set of character sets was developed for MS-DOS, one per language. For Europe this was done by leaving the bottom 128 characters the same as for English (the ASCII set) and modifying the top 128. This is fine for single-language, single-machine use; as soon as information is sent between computers, however, a problem arises, because the same code point represents different characters in different character sets.
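The effect is easy to demonstrate. The sketch below (in Python, used here purely for illustration) decodes one and the same byte value under three different single-byte character sets and gets three unrelated characters:

```python
# One byte, three character sets, three different characters.
raw = bytes([0xE9])

print(raw.decode("latin-1"))    # 'é'  - Western European (ISO 8859-1)
print(raw.decode("cp437"))      # 'Θ'  - original IBM PC / MS-DOS code page
print(raw.decode("iso8859-5"))  # 'щ'  - Cyrillic (ISO 8859-5)
```

A file written on one machine and read on another with a different code page is silently corrupted in exactly this way, since nothing in the byte stream says which character set was intended.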

The problem intensifies for library and other text-database systems, where text in a number of languages must be mixed, correctly represented, and interfiled.

Another twist to the tale comes from the ideographic languages of the Far East, where the number of characters is large (Chinese has either 7,000 or 13,000 basic "characters", depending on how they are counted). These will obviously not fit into 128 code points, however clever the programmer; another method of encoding is necessary.
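In practice these methods are multi-byte encodings, where a single character occupies more than one byte. As a quick illustration (again a Python sketch), encoding a common Chinese character shows the byte cost in two such schemes:

```python
ch = "中"  # a common Chinese character, U+4E2D

# GB 2312, the Chinese national multi-byte encoding: two bytes per character.
print(len(ch.encode("gb2312")))  # 2

# UTF-8: three bytes for a character in this range.
print(len(ch.encode("utf-8")))   # 3
```

Software that assumes one byte per character (for string length, truncation, or sorting) breaks immediately on such text, which is why multi-byte support has to be designed in rather than bolted on.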

Thus the basic problems with current character encoding are:

- the same code point represents different characters in different character sets;
- text in several languages cannot reliably be mixed and interfiled in one document;
- ideographic scripts have far more characters than a single byte can encode.

Back to table of contents
To next section: 16-bits to the rescue
To previous section: The place of the character set in a document