History
In the 1970s, when the C programming language was invented, EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and their code was ASCII, which represents each character using a number from 32 to 127.
‘A’ -> 65
SPACE -> 32
Seven bits (2^7 = 128) were enough to store those characters, and most computers at the time used 8-bit bytes. Codes below 32 were control codes, used for things like ejecting the current page on a printer (12) or making the computer beep (7). Manufacturers started using the upper 128 values (128–255) for their own purposes, and there was no consistency between them.
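If you want to poke at this mapping from code, here is a tiny sketch in Python (the language is only used for illustration; the mapping itself is plain ASCII):

# ord() and chr() expose the character-to-number mapping
print(ord("A"))      # 65
print(ord(" "))      # 32
print(chr(65))       # A
print(repr(chr(7)))  # '\x07', the BEL control code that makes a terminal beep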
Asian writing systems have thousands of characters, which made things even more complicated. This was mostly handled by DBCS (double-byte character sets), but most people still assumed that one byte was one character. As long as a string never moved to another computer or another language, that assumption held up. Once the internet made it routine to send text from one computer to another, though, the whole mess became visible. Luckily, Unicode was invented.
Unicode
Unicode is a character set that includes every reasonable writing system. In Unicode, every letter maps to a code point; how a code point is represented in memory is a different story. X is different from Y and from x, but the X in a Times New Roman font is the same character as the X in a Helvetica font. Every letter is assigned a magic number called a code point, written like U+0041, where U+ means Unicode and 0041 is hexadecimal. You can explore the complete set at the Unicode web site.
Unicode can define any number of letters, and in fact it already defines more than 65,536, so not every Unicode code point can be squeezed into two bytes.
For example, Hello corresponds to these Unicode code points:
U+0048 U+0065 U+006C U+006C U+006F.
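You can verify those code points with a short Python sketch (Python is only used here as a convenient calculator):

# Print each character of "Hello" in U+XXXX notation
print(" ".join(f"U+{ord(ch):04X}" for ch in "Hello"))
# Output: U+0048 U+0065 U+006C U+006C U+006F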
Again, storing this in memory is a totally different story. That's where encodings come in.
Encoding
The earliest idea for encoding Unicode, which led to the myth that every Unicode character fits in two bytes, was: hey, let's just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Because of all those 00 bytes, Unicode was ignored for a long time. That's when UTF-8 was invented. UTF-8 is another system for storing your string of Unicode code points. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, and up to 6 bytes (at most 4 for any code point defined today). This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong.
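To make the difference concrete, here is a small Python sketch (it assumes Python 3.8+ for the spaced hex output) encoding the same string both ways:

text = "Hello"

# The two-bytes-per-character idea (UCS-2 / UTF-16, big-endian)
print(text.encode("utf-16-be").hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f

# UTF-8 stores code points 0-127 in a single byte, so ASCII text is unchanged
print(text.encode("utf-8").hex(" "))      # 48 65 6c 6c 6f

# Code points above 127 take more than one byte in UTF-8
print("é".encode("utf-8").hex(" "))       # c3 a9  (U+00E9, two bytes)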
It does not make sense to have a string without knowing what encoding it uses. That's why it is important to carry the encoding along with the data, and that's where the Content-Type header comes in:
Content-Type: text/plain; charset=UTF-8
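To see why the declared charset matters, here is a small Python sketch (the string is just an example): decoding the same bytes with the wrong encoding produces garbage.

data = "naïve".encode("utf-8")   # the ï becomes two bytes: c3 af

print(data.decode("utf-8"))      # naïve  -- matches the declared charset
print(data.decode("latin-1"))    # naÃ¯ve -- wrong charset assumed, mojibake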