History of character encoding

The history of encoding human languages is a long one, and it has become considerably more complex since the advent of computers.

Before computers, language was spoken and then written as shapes on paper. A reader who recognised these shapes could reproduce the sounds, and hence the meaning, of the message (a reader who could not was left with a mystery).

With the advent of computers it became necessary to convert characters into codes in the computer's memory so that text could be stored and reproduced. This was the step that divorced the rendering of a text element (its drawn shape) from what the character represented.

These character codes (code points) record the fact that a particular character occupies a particular place in a text, without saying anything about its shape, size, colour, etc.
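The separation of character from shape can be seen in a minimal Python sketch. The sample string and the variable names are assumptions made for illustration; `ord` and `chr` are Python built-ins that convert between a character and its code point.

```python
# Each character is stored as a number (its code point).
# Nothing in these numbers says how the character should be drawn.
text = "Hi!"
code_points = [ord(ch) for ch in text]
print(code_points)  # [72, 105, 33]

# The same numbers can be turned back into the original text.
restored = "".join(chr(n) for n in code_points)
print(restored)  # Hi!
```

Font, size and colour are decided later by whatever program displays the text; the stored numbers stay the same.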

Early computers structured their memory into chunks (now almost universally called "bytes") and one of these chunks was used to represent a character. On different computers these chunks were different sizes. This led to two problems:

different computers could store different numbers of code points
different computers would store different characters at the same code point.

Both of these made it difficult to exchange information between computers. The answer was to come up with a standard or two.

In the early days the computer industry settled on a de facto standard "chunk" length of 7 bits. Some computers had previously used 4 bits or 6 bits as their processing unit (chunk).

The 7-bit "byte" allowed 128 different characters to be encoded. This was certainly sufficient to handle upper-case English letters and even to extend to lower case as well. It also included some control codes, punctuation and digits. This became ASCII (American Standard Code for Information Interchange).
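The arithmetic behind the 128-character limit can be checked with a short sketch. The variable names and sample strings are assumptions for illustration; the `"ascii"` codec is part of the Python standard library.

```python
# Seven bits give 2**7 distinct values, numbered 0 to 127.
ASCII_BITS = 7
capacity = 2 ** ASCII_BITS
print(capacity)  # 128

# Plain English text fits: every code is below 128.
encoded = "Hello".encode("ascii")
print(list(encoded))  # [72, 101, 108, 108, 111]
all_fit = all(b < capacity for b in encoded)

# An accented letter has no slot in 7-bit ASCII, so encoding fails.
try:
    "caf\u00e9".encode("ascii")
    accented_fits = True
except UnicodeEncodeError:
    accented_fits = False
print(accented_fits)  # False
```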

It worked well for a number of years while computing was mainly an English-language preserve and text-based computing was relatively unimportant. But the advent of minicomputers, and especially of personal computers in the early '80s, changed all that. The size of the byte became 8 bits and now there was room to store 256 codes. So "extended ASCII" (often loosely called ANSI) was born.

Also born was the IBM PC, and it had an earth-shaking effect on character processing.

With the PC came MS-DOS and with it came the beginnings of internationalisation with the idea of "code pages".

Code pages are just alternative sets of 256 characters for use in different parts of the world or for different languages. By now the "code page" and the "character set" had become virtually synonymous.
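To see what "alternative sets of 256 characters" means in practice, the sketch below decodes one and the same byte under three code pages. The choice of byte (0xE9) and of code pages is an assumption made for illustration; `cp437`, `cp1252` and `cp866` are codec names shipped with Python's standard library.

```python
# One stored byte, three different code pages, three different characters.
raw = bytes([0xE9])

decoded = {name: raw.decode(name) for name in ("cp437", "cp1252", "cp866")}

for name, ch in decoded.items():
    # Print the Unicode code point as well, since the glyph may not
    # display correctly on every terminal.
    print(name, ch, f"U+{ord(ch):04X}")
```

Under these codecs the byte should come out as a Greek capital theta (cp437, the original IBM PC set), a Latin e with acute accent (cp1252, Western European Windows) and a Cyrillic shcha (cp866, DOS Cyrillic). This is why text exchanged between machines could turn into nonsense unless both sides agreed on the code page.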

And history has caught up to the current day.
