
bits, ASCII, Unicode

Encoding systems

Because a computer can only work with numbers, it cannot process letters or text directly. To work with text, characters need to be translated into numbers and back again. This translation is called text encoding.

Your first reaction might be that this shouldn't be so difficult. We could represent each letter in binary code: a encoded as 0, b as 1, c as 10, and so on. In fact, this is more or less how text encodings work. However, while computing was being developed, many different encodings emerged.
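The naive idea of numbering the letters and writing each number in binary can be tried out in a few lines of Python. This is only a toy illustration, not a real encoding; the names `alphabet` and `naive_encode` are made up for this sketch.

```python
import string

# Hypothetical toy scheme: number the letters a-z starting at 0
# and write each number in binary. This is NOT a real text encoding.
alphabet = string.ascii_lowercase

def naive_encode(text):
    """Return the binary form of each letter's position in the alphabet."""
    return [format(alphabet.index(ch), "b") for ch in text]

print(naive_encode("abc"))  # ['0', '1', '10']
```

Note that this toy scheme has no room for capitals, digits or punctuation, which is part of what a real encoding has to settle.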

ASCII encoding

The dominant encoding of that era became ASCII (American Standard Code for Information Interchange), which was created on behalf of the U.S. Government in 1963 to allow information interchange between its different computing systems.

The encoding uses a 7-bit system, which means it can only represent 128 (2^7 = 128) different numbers (0000 0000 to 0111 1111). The resulting encoding scheme assigned these 128 numbers to:

all the letters in the English alphabet
numbers from 0-9
punctuation marks
and control characters
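To see which numbers ASCII assigns, Python's built-in `ord()` and `chr()` convert between characters and their numbers. The helper `ascii_bits` below is a hypothetical name used only for this sketch.

```python
def ascii_bits(ch):
    """Return the 7-bit binary pattern that ASCII assigns to a character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")

print(2 ** 7)  # 128 possible values in a 7-bit system

# A letter, a lowercase letter, a digit and a punctuation mark
for ch in ["A", "a", "0", "!"]:
    print(ch, ord(ch), ascii_bits(ch))

# Number 10 is a control character: the line feed
print(repr(chr(10)))  # '\n'
```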

Thanks to the simplicity of the encoding it quickly became a standard for the American computing industry.

ASCII imperialism

Thanks to the power of the US military and US corporations, the American computing industry became the global computing industry. The computers we use today are rooted in American networking history, and so is the ASCII standard. However, ASCII can only represent the 26 Latin letters of the English alphabet, while computers are used all over the world by people speaking different languages. They would often end up with American computers that could not represent their language in ASCII. Think, for example, of scripts like Greek, Cyrillic and Arabic, or even Latin scripts that use accented or modified letters such as ü or ø. Although 128 might sound like a lot of characters, it is not enough to represent all the world's languages.
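This limitation is easy to reproduce: asking Python to encode text as ASCII fails as soon as a character falls outside the 128 assigned numbers. The function name `can_encode_ascii` and the sample words are assumptions chosen for this sketch.

```python
def can_encode_ascii(text):
    """Return True if every character in text has an ASCII number."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# English works; German, Danish, Greek and Russian words do not
for word in ["hello", "für", "smørrebrød", "αλφα", "привет"]:
    print(word, can_encode_ascii(word))
```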

ASCII flavours: PETSCII

Commodore 64 (1982)

The Commodore 64, also known as the C64 or the CBM 64, is an 8-bit home computer introduced in January 1982 by Commodore International (first shown at the Consumer Electronics Show, January 7–10, 1982, in Las Vegas). It has been listed in the Guinness World Records as the highest-selling single computer model of all time, with independent estimates placing the number sold between 12.5 and 17 million units.

Preceded by the Commodore VIC-20 and Commodore PET, the C64 took its name from its 64 kilobytes (65,536 bytes) of RAM. With support for multicolor sprites and a custom chip for waveform generation, the C64 could create superior visuals and audio compared to systems without such custom hardware.

https://en.wikipedia.org/wiki/Commodore_64

Part of the Commodore 64's success was its sale in regular retail stores instead of only electronics or computer hobbyist specialty stores. Commodore produced many of its parts in-house to control costs, including custom integrated circuit chips from MOS Technology. In the United States, it has been compared to the Ford Model T automobile for its role in bringing a new technology to middle-class households via creative and affordable mass-production.

Kahney, Leander (September 9, 2003). "Grandiose Price for a Modest PC". CondéNet, Inc. Archived from the original on September 14, 2008. Retrieved September 13, 2008.

PETSCII

The Commodore PET's lack of a programmable bitmap-mode for computer graphics, as well as it having no redefinable character set capability, may be one of the reasons PETSCII was developed; by creatively using the well-thought-out block graphics, a higher degree of sophistication in screen graphics is attainable than by using plain ASCII's letter/digit/punctuation characters. In addition to the relatively diverse set of geometrical shapes that can thus be produced, PETSCII allows for several grayscale levels by its provision of differently hatched checkerboard squares/half-squares. Finally, the reverse-video mode (see below) is used to complete the range of graphics characters, in that it provides mirrored half-square blocks.

https://en.wikipedia.org/wiki/PETSCII

Draw PETSCII art in the browser:

Use PETSCII as a font!

PETSCII bots!

Unicode universalism

As electronic text was increasingly being exchanged online and between language areas, issues emerged when text encoded in one language was shared and read on systems assuming an encoding in another language. Unicode was a response to the incompatible text encoding standards that were proliferating.

When different encodings assign the same binary numbers to different characters, this results in illegible documents. The solution, partly made possible by increased computing capacity, was to strive for a single universal encoding which would encompass all writing systems. 6

You can experience this by following this exercise.
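A minimal sketch of such a clash, assuming three of the older 8-bit encodings that ship with Python (`latin-1` for Western European, `cp1251` for Cyrillic, `iso8859-7` for Greek): the same byte is read as a different character depending on which encoding the receiving system assumes.

```python
# One and the same byte value...
data = bytes([0xE4])

# ...decoded under three different 8-bit encodings
for encoding in ["latin-1", "cp1251", "iso8859-7"]:
    print(encoding, data.decode(encoding))

# latin-1 ä
# cp1251 д
# iso8859-7 δ
```

A document written on one system and opened on another would therefore show the wrong letters wherever such bytes occur.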

So, to overcome the limitations of ASCII, the Unicode Consortium was founded to create a single universal character encoding:

The Unicode standards are designed to normalise the encoding of characters, to efficiently manage the way they are stored, referred to and displayed in order to facilitate cross-platform, multilingual and international text exchange. The Unicode Standard is mammoth in size and covers well over 110,000 characters, of which [..] 1,000 are [..] emoji. 7

In effect the Unicode Standard combined all the different national character encodings together into a single large ledger in order to try to represent all languages.

It is divided into so-called blocks, which are essentially number tables that describe which number is connected to which character.

The table starts counting at 0x0 and continues all the way up to 0x10FFFF.
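The size of that range can be checked directly, since Python's `chr()` accepts exactly these numbers. The constants `FIRST` and `LAST` and the helper `is_code_point` are names invented for this sketch.

```python
FIRST, LAST = 0x0, 0x10FFFF

def is_code_point(number):
    """Return True if the number falls inside the Unicode table."""
    return FIRST <= number <= LAST

print(LAST - FIRST + 1)  # 1114112 positions in the table
print(hex(ord("A")))     # 0x41
print(chr(0x41))         # A

# One step past the end of the table is rejected
try:
    chr(LAST + 1)
except ValueError as error:
    print("outside the table:", error)
```

Not every position in the table has a character assigned to it; the range only describes how far the numbering can go.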

The first block actually corresponds with ASCII:

https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
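As a quick check of that correspondence, the sketch below decodes each of the 128 ASCII numbers and compares the result with the Unicode character at the same position. The variable name is an assumption made for this example.

```python
# For every ASCII number, the ASCII character and the Unicode
# character with the same number should be identical.
ascii_matches_unicode = all(
    bytes([number]).decode("ascii") == chr(number)
    for number in range(128)
)

print(ascii_matches_unicode)    # True
print(ord("A"), hex(ord("A")))  # 65 0x41
```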

Unicode contains many different scripts, supporting both large and smaller language groups, including for example Ethiopic and Cherokee:

https://en.wikibooks.org/wiki/Unicode/Character_reference/1000-1FFF

However, there are also blocks that describe arrows and other symbols:

https://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF

Emoji are also part of the Unicode table.
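Python's standard `unicodedata` module can look up the official name of any character in the table, which makes it easy to explore scripts, symbols and emoji alike. The helper `describe` and the four sample characters are choices made for this sketch.

```python
import unicodedata

def describe(ch):
    """Return the character's number in U+ notation and its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# An Ethiopic syllable, a Cherokee letter, an arrow and an emoji
for ch in ["\u1200", "\u13A0", "\u2190", "\U0001F600"]:
    print(ch, describe(ch))

# ሀ U+1200 ETHIOPIC SYLLABLE HA
# Ꭰ U+13A0 CHEROKEE LETTER A
# ← U+2190 LEFTWARDS ARROW
# 😀 U+1F600 GRINNING FACE
```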
