ASCII – an Unhappy Legacy for Computers

In my writing, I’ve had to give up using diacritical marks, commonly called accents.  They are more accurate and show respect for foreign languages.  But when I have used them in the past for foreign words and names such as Schrödinger, I have then seen computer software turn them into something meaningless.  A document may look fine in Microsoft Word, and then the letters with diacritical marks turn into weird symbols when it is posted to the web.

This matter is another example of humans laying down inconsistent rules, decades ago.  A lot of early electronic systems stored letters in a code called ASCII, which stands for American Standard Code for Information Interchange.  ASCII was originally a set of 128 characters for telecommunication.  They included the digits and both upper-case and lower-case letters, but not the diacritical marks that written English mostly refuses to use to clarify its inconsistent and unpredictable spelling.  This also meant that computers could store letters as units of seven ‘bits’, a bit being something with two possible states, normally represented as 0 and 1.  Computers normally store all of their data that way, as binary numbers.  But computers were also standardised to work with eight-bit units called bytes.  I’m old enough to have actually used computers with six-bit bytes, made by a UK company called International Computers Limited, long since absorbed by Fujitsu.  But eight bits soon became the norm.
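To make the seven-bits-in-an-eight-bit-byte point concrete, here is a minimal Python sketch – my own illustration, not part of the original article – that prints a few ASCII codes in both seven-bit and eight-bit binary:

    # Each ASCII character has a code below 128, so it fits in 7 bits,
    # but it is normally stored in an 8-bit byte with a spare leading zero.
    for ch in "Aa7":
        code = ord(ch)  # e.g. 'A' has ASCII code 65
        print(ch, code, format(code, "07b"), format(code, "08b"))
    # prints, for example:  A 65 1000001 01000001

The eighth bit carries no information for plain ASCII text, which is exactly the spare capacity the next paragraph describes being put to use.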

So, computers could store seven-bit ASCII in eight-bit storage blocks.  But this was wasteful, particularly in the early years when memory was expensive – my first job was with a mainframe computer doing accounts for an electronics company, and it had 48K of memory, using an obsolete technology called core-store.  So to meet more complex needs, ASCII was extended into sets of 256 characters, each conveniently stored in one ‘byte’ of computer memory.  These included characters with the diacritical marks used by most European languages.  But sadly, this was done several times, and inconsistently.  The code for an accented letter in one version of extended ASCII can mean something completely different in another version.  This weakness survives whenever documents pass between pieces of software built around different versions, probably developed on different brands of computer.

There is a much superior system called Unicode, which uses extra computer memory to encode more than 128,000 characters covering some 135 modern and historic scripts, including all sorts of letters with diacritical marks.  Much of the early work went into Chinese ideograms, given the need to convert easily between the traditional forms and the simplified versions that are part of Mao’s legacy.  But the world of computing and the internet has been slow to standardise on Unicode, which can be slower to process and takes up more space when characters beyond basic ASCII are held electronically.
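Both points can be illustrated with a short Python sketch – again my own example, not from the original article, and the particular code pages chosen (Latin-1, the old IBM PC code page 437, ISO 8859-7 for Greek, KOI8-R for Russian) are simply well-known extensions used here for demonstration.  The same single byte value decodes to quite different characters under different 8-bit extensions, while UTF-8, the commonest way of storing Unicode, needs more than one byte for anything beyond basic ASCII:

    # The same byte means different things in different 8-bit 'extended ASCII' sets.
    b = bytes([0xE9])
    for codepage in ("latin-1", "cp437", "iso8859-7", "koi8-r"):
        print(codepage, b.decode(codepage))
    # latin-1 gives 'é', cp437 gives 'Θ', iso8859-7 gives 'ι', koi8-r gives 'И'

    # Unicode removes the ambiguity, but its common UTF-8 encoding uses
    # extra bytes for characters outside basic ASCII.
    for ch in ("A", "é", "ö", "中"):
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # A 1, é 2, ö 2, 中 3

Note that UTF-8 still stores plain ASCII text at one byte per character, so the space penalty applies only to the accented and non-Latin characters themselves.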

This is from a much longer and more diverse article, Physics and the Nature of Reality, which has relevant references.