

UTF-8 - Wikipedia


UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode text as a stream of bytes.

Description

UTF-8 is currently standardized as RFC 2279 (UTF-8, a transformation format of ISO 10646), which is extensive and detailed. A short summary follows for readers interested only in a general overview.

Characters with code values below 128 are encoded as a single byte containing their value; these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes. The upper bit of each of these bytes is always 1, so that every such byte has a value of 128 or greater and cannot be mistaken for any of the 7-bit ASCII characters (particularly the ones used for control, e.g. Carriage Return). The bits of the character's code value are split into several groups, which are then placed in the lower bit positions of these bytes.
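The property described above can be checked with a short Python sketch. It relies only on the built-in `str.encode` method; the sample characters and the helper name `all_high_bit_set` are arbitrary choices made for illustration.

```python
def all_high_bit_set(ch):
    """Return True if every UTF-8 byte of ch has its upper bit set (value >= 0x80)."""
    return all(b & 0x80 for b in ch.encode("utf-8"))

# ASCII characters, including control characters such as Carriage Return,
# encode to a single byte that holds the character's own value.
ascii_ok = all(
    len(ch.encode("utf-8")) == 1 and ch.encode("utf-8")[0] == ord(ch)
    for ch in "A\r~"
)

# Non-ASCII samples: e-acute, Hebrew alef, euro sign, musical G clef.
samples = ["\u00e9", "\u05d0", "\u20ac", "\U0001d11e"]
multi_byte_ok = all(all_high_bit_set(ch) for ch in samples)

print(ascii_ok, multi_byte_ok)
```

Because no byte of a multi-byte sequence falls below 0x80, a program that scans a UTF-8 stream for an ASCII byte such as Carriage Return will not find a false match inside another character.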

Code range (hexadecimal)   UTF-16 (binary)       UTF-8 (binary)                        Notes
U00000 - U0007F            00000000 0xxxxxxx     0xxxxxxx                              ASCII equivalence range; byte begins with zero
U00080 - U007FF            00000xxx xxxxxxxx     110xxxxx 10xxxxxx                     first byte begins with 11, the following byte(s) begin with 10
U00800 - U0FFFF            xxxxxxxx xxxxxxxx     1110xxxx 10xxxxxx 10xxxxxx
U10000 - U10FFFF           110110xx xxxxxxxx     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                           110111xx xxxxxxxx*

* UTF-16 requires surrogate characters; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8.

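The bit layouts in the table can be turned into a small encoder. The following Python sketch is an illustration only, not a reference implementation: the function names `utf8_encode` and `utf16_surrogates` are made up for this example, and the upper limit of 0x10FFFF is assumed because that is the highest value a UTF-16 surrogate pair can express. Surrogate code points themselves (0xD800 to 0xDFFF) are not rejected here, which a strict encoder would do.

```python
def utf8_encode(cp):
    """Encode one code point (an int) as UTF-8 bytes using the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf16_surrogates(cp):
    """Split a code point of 0x10000 or above into a UTF-16 surrogate pair."""
    if cp < 0x10000 or cp > 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000               # the subtracted offset noted in the table
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)

# Compare against Python's built-in encoder for one value from each row.
for cp in (0x41, 0x5D0, 0x20AC, 0x1D11E):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The subtraction in `utf16_surrogates` is the reason the UTF-16 bits for the last row cannot be copied directly into the UTF-8 pattern.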
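For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way: it falls in the range 0x0080 to 0x07FF, so the table calls for two bytes of the form 110xxxxx 10xxxxxx. Hexadecimal 0x05D0 is binary 101 1101 0000. These eleven bits are placed, in order, into the positions marked x, giving 11010111 10010000, that is, the two bytes 0xD7 0x90.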
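The alef encoding can be traced bit by bit in a few lines of Python. This is a sketch for illustration; the variable names are arbitrary, and the two-byte pattern 110xxxxx 10xxxxxx is taken from the table.

```python
code_point = 0x05D0                      # alef

# The two-byte form carries 11 payload bits: 5 in the first byte, 6 in the second.
bits = format(code_point, "011b")        # zero-padded to 11 binary digits

first = int("110" + bits[:5], 2)         # fill 110xxxxx with the upper 5 bits
second = int("10" + bits[5:], 2)         # fill 10xxxxxx with the lower 6 bits
encoded = bytes([first, second])

# Cross-check against Python's built-in encoder.
assert encoded == chr(code_point).encode("utf-8")
print(encoded.hex())
```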

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode; these include Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. Representing the full 31-bit codespace of UCS-4 may require up to 6 bytes, but there are currently no plans to assign characters beyond the 1 million or so that can be represented in 4 bytes in both UTF-8 and UTF-16.
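These byte counts can be confirmed with Python's built-in encoder. The sketch below is illustrative; the sample characters are arbitrary picks from the scripts mentioned, and the count is restricted to code points below 0xD800 so that the surrogate range, which cannot be encoded on its own, is never touched.

```python
def utf8_length(cp):
    """Number of bytes Python's UTF-8 encoder produces for code point cp."""
    return len(chr(cp).encode("utf-8"))

# Count one-byte and two-byte characters among the code points below 0xD800.
one_byte_count = sum(1 for cp in range(0xD800) if utf8_length(cp) == 1)
two_byte_count = sum(1 for cp in range(0xD800) if utf8_length(cp) == 2)

# One sample per group: Latin A, Greek alpha, Cyrillic ya, Hebrew alef,
# Arabic ain, euro sign (three bytes), musical G clef (four bytes).
samples = [0x41, 0x3B1, 0x44F, 0x5D0, 0x639, 0x20AC, 0x1D11E]
lengths = [utf8_length(cp) for cp in samples]

print(one_byte_count, two_byte_count, lengths)
```

The two-byte count follows from the range boundaries: 0x800 - 0x80 = 1920.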

Advantages

Disadvantages

Example web pages written in UTF-8:

wikipedia.org dumped 2003-03-17 with terodump




 
 