Character entities


Character encoding

As in real life, the characters that make up written language differ from system to system. Ελληνικά characters differ from Русский, 汉语 and Latin characters. Fortunately these character sets have been standardized and are called alphabets. The same goes for character sets in the digital world. As computers can only process binary data, every character is mapped to a number. In the early days such a mapping was standardized for the Latin alphabet, along with digits, some other graphical ‘characters’ and control characters (e.g. escape, tab, line feed, carriage return). This standard is known as the American Standard Code for Information Interchange (ASCII) and was developed by the American Standards Association (currently: ANSI). This 7-bit encoding, with only 128 positions, lacked representations for the characters of many other scripts (such as the Greek, Russian and Chinese mentioned above), and accented letters like å, è, ï, ó and û were not in the set either. But as you can see in this paragraph, improvements have since been made to accommodate such ‘special’ characters.
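As a small illustration of the character-to-number mapping described above, here is a minimal Python sketch. The helper name to_ascii_codes is made up for this example; ord(), chr() and str.encode() are standard Python built-ins.

```python
def to_ascii_codes(text):
    """Return the ASCII code of each character, or None if ASCII has no slot for it."""
    # ASCII is a 7-bit code, so only the numbers 0..127 are available.
    return [ord(ch) if ord(ch) < 128 else None for ch in text]


print(to_ascii_codes("Aå"))   # 'A' has an ASCII code, 'å' does not
print(chr(65))                # the number 65 maps back to 'A'

try:
    "å".encode("ascii")
except UnicodeEncodeError as exc:
    # Python refuses to encode characters that ASCII cannot represent.
    print("not representable in ASCII:", exc.reason)
```

Control characters are part of the same table: a tab, for instance, is simply the number 9.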

Character sets

Other character sets have been defined since. The International Organization for Standardization came up with the ISO-8859 series to address this shortcoming and defined a number of different character sets using 8 bits per character. Vendors such as Microsoft (CP-1252, for instance) and IBM developed their own schemes too. Some local institutes also created encodings of their own to cover the needs of their native language, which the standardized sets still lacked. This introduced the problem of multiple interpretations of the same number. Which character is it mapped to? Which character set do I need to use to decode 228, for instance? Does 228 mean ä or δ or ф or…
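To make the ambiguity concrete, the sketch below decodes one and the same byte value under several 8-bit character sets. The value 228 and the helper name decode_everywhere are chosen purely for illustration; the codec names are the ones Python's standard library ships with. Values below 128 decode identically in all of these sets because they share the ASCII range, so the confusion only starts in the upper half.

```python
def decode_everywhere(value, encodings):
    """Decode a single byte value under each of the given 8-bit encodings."""
    raw = bytes([value])
    return {name: raw.decode(name) for name in encodings}


code_pages = ["iso-8859-1", "iso-8859-5", "iso-8859-7", "cp1252"]

# The same number, four code pages, three different characters.
for name, char in decode_everywhere(228, code_pages).items():
    print(name, "->", char)

# In the ASCII range (0..127) all of them agree.
print(decode_everywhere(65, code_pages))
```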

With the ISO 8-bit character sets, 256 characters were possible (about 190 once control characters and the like are excluded). Together the sets covered many of the most widely used languages, but Chinese and Japanese, for instance, were not covered at all. This is where Unicode, developed by the Unicode Consortium, and its Unicode Transformation Formats (UTF) come into play. Unicode assigns every character a unique number (a code point), and UTF-8 is a multi-byte encoding, which means that one, two, three or four 8-bit bytes are used per character to identify it. UTF-8 is nowadays the most commonly used encoding. It is backward compatible with ASCII, and with well over 100,000 characters defined, Unicode is able to represent practically all living languages with a single code. The Unicode standard also contains information on how to convert lowercase characters to uppercase and vice versa, which is not the same or even symmetric for every character, and it has sorting rules. E.g. traditional Spanish treats ch as a single letter that is sorted between c and d, and in Greek the lowercase of the uppercase Σ is σ, but if it is the last character of a word the lowercase of Σ is ς.
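A short Python sketch of the two points above: the variable number of bytes per character in UTF-8, and case conversion that is not symmetric. The sample characters and the helper name utf8_length are picked for illustration only; the final-sigma behaviour shown is that of CPython 3.3 and later, other languages and libraries may handle it differently.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One to four bytes, depending on the character.
for ch in ["A", "ø", "€", "汉", "\U0001D11E"]:
    print(repr(ch), utf8_length(ch), "byte(s)")

# Backward compatibility: plain ASCII text has identical bytes in both encodings.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))

# Case conversion is not always symmetric: ß uppercases to SS,
# but SS lowercases to ss, not back to ß.
print("ß".upper(), "SS".lower())

# Greek sigma: σ inside a word, ς at the end of a word.
print("ΟΔΟΣ".lower())
print("Σ".lower())
```

Sorting rules such as the traditional Spanish ch are locale-dependent and are not covered by this sketch.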

Problems or challenges

When programming web applications, for instance, a programmer often has to work with a database, a file system, a web server and one or several browsers. These systems and the data traffic between them (protocols) need to be tuned so that they all use the same character set. If this is not the case, errors can occur. For instance, if you send “Hellø world” from the database in UTF-8 to a browser that interprets the byte stream it receives as CP-1252, the string is displayed as “HellÃ¸ world”. In the opposite direction, CP-1252 bytes read as UTF-8, it shows up as “Hell� world”, and you have probably seen the � sign (or a plain ?) appear because of this kind of error. The solution to this problem, of course, is to keep all character sets the same when different systems (applications) communicate with each other.
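The mismatch is easy to reproduce. The following Python sketch writes a string in one encoding and reads it back in another; the helper name misread is hypothetical, and errors="replace" is used so that undecodable bytes become the replacement character U+FFFD instead of raising an exception, roughly what a browser does.

```python
def misread(text, written_as, read_as):
    """Encode text in one character set and decode the bytes as another."""
    raw = text.encode(written_as)
    # Undecodable bytes are turned into U+FFFD rather than raising an error.
    return raw.decode(read_as, errors="replace")


# UTF-8 bytes interpreted as CP-1252: ø turns into two unrelated characters.
print(misread("Hellø world", "utf-8", "cp1252"))

# CP-1252 bytes interpreted as UTF-8: ø turns into the replacement sign.
print(misread("Hellø world", "cp1252", "utf-8"))

# Both sides agree: the text survives the round trip.
print(misread("Hellø world", "utf-8", "utf-8"))
```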

Copyright sign, its Numeric Character Reference and alias

A way to prevent this problem from occurring is to obey the HTML standard, which requires characters that are not defined in the character set in use to be converted to their Numeric Character Reference (HTML entity). That this is mandatory for element attribute values as well is not widely known. A reference is written as &#[code]; where [code] is replaced by the number of the character's code point. For the most commonly used characters an alias is available too. E.g. & is &#38; with alias &amp;, ë is &#235; with alias &euml;, a non-breaking space is &#160; with alias &nbsp;, and the copyright sign is &#169; with alias &copy;. The codes (digits) and aliases themselves contain only characters from the ASCII set, meaning an entire SGML-family document (HTML, XML, etc.) can be composed of ASCII characters only. PHP has a built-in function, htmlentities(), that finds and converts such characters in its input. A nice tool to look up entities is also available at LeftLogic.com.
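The conversion can be sketched in a few lines of Python. This is not how PHP does it internally; to_ascii_html is a hypothetical helper written for this example, built on the alias table html.entities.codepoint2name and the html.unescape function from Python's standard library.

```python
from html import unescape
from html.entities import codepoint2name


def to_ascii_html(text, use_aliases=True):
    """Rewrite text so that only ASCII characters remain.

    Markup-significant and non-ASCII characters become an alias (&name;)
    when one is known, otherwise a Numeric Character Reference (&#code;).
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128 and ch not in '&<>"':
            out.append(ch)                                # plain ASCII stays as-is
        elif use_aliases and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # e.g. &copy;
        else:
            out.append("&#" + str(code) + ";")            # e.g. &#169;
    return "".join(out)


sample = "© 2009 Noël & co"
print(to_ascii_html(sample))
print(to_ascii_html(sample, use_aliases=False))

# The result is pure ASCII and decodes back to the original text.
print(unescape(to_ascii_html(sample)) == sample)
```

Characters without an alias, such as 汉, always fall back to the numeric form.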
