UTF-8 encoding notes - rushrush - ChinaUnix blog

1999-08-11 - Global players need many languages and writing systems. For Chinese, Korean, or just Greek, we need a way to encode such non-ASCII characters.

For a historical perspective and a beginner's technical introduction, see Joel Spolsky's missive.

The encoding standard that covers all these writing systems is Unicode, a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts (as of Unicode 3.1). A script here means a writing system, such as Latin, Greek, or Han.

Tcl/Tk has supported Unicode since version 8.1, either as 16-bit characters or in the UTF-8 encoding, which is the internal representation for strings.

UTF-8 is designed to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes).

Characters are represented as sequences of 1..6 eight-bit bytes, termed octets in the character set business (for ASCII: 1, for Unicode: 2..3), as follows:

ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of the page number and bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
Unicode, pages 08..FF: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and everything else. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode (and, for Chinese, 50% longer than in GB2312, which also uses 2 bytes per character).
ISO 10646 codes beyond Unicode: 4..6 bytes. (Never seen one yet. RFC 3629 has since limited UTF-8 to sequences of at most 4 bytes.)
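
The 1, 2 and 3 byte layouts above can be checked against Python's built-in codec. The following is only a sketch: it assembles the bytes directly from the page byte (aaaaaaaa) and the second byte (bbbbbbbb) of a 16-bit code point. The helper name encode_bmp is made up for this illustration, and surrogate code points (D800..DFFF) are not treated specially.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode a 16-bit code point with the page/second-byte layouts listed above."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit code points are handled in this sketch")
    page, low = code_point >> 8, code_point & 0xFF    # aaaaaaaa, bbbbbbbb
    if code_point < 0x80:
        return bytes([code_point])                    # ASCII: nothing changed
    if page <= 0x07:
        # 110aaabb 10bbbbbb
        return bytes([0xC0 | (page << 2) | (low >> 6),
                      0x80 | (low & 0x3F)])
    # 1110aaaa 10aaaabb 10bbbbbb
    return bytes([0xE0 | (page >> 4),
                  0x80 | ((page & 0x0F) << 2) | (low >> 6),
                  0x80 | (low & 0x3F)])


# Greek alpha (page 03) takes 2 bytes, a Han character (page 4E) takes 3.
print(encode_bmp(0x03B1).hex(), encode_bmp(0x4E2D).hex())
# The size claim: 2 Han characters are 6 bytes in UTF-8 but 4 in 16-bit Unicode.
print(len("中文".encode("utf-8")), len("中文".encode("utf-16-be")))
```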

A general principle of UTF-8

A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates the length of a multi-byte code by the number of 1 bits before the first 0, and is then filled up with data bits.
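
As a small illustration of this rule, the length of a sequence can be read off its first byte by counting the leading 1 bits. This is a sketch for the original 1..6 byte scheme described here; the function name sequence_length is an assumption of this example, not something from the text.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a first byte (1..6 byte scheme)."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte < 0x80:
        return 1                       # single-byte character
    ones, mask = 0, 0x80
    while mask and first_byte & mask:  # count 1 bits before the first 0
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a non-initial byte, not a first byte")
    if ones > 6:
        raise ValueError("FE and FF never start a sequence")
    return ones


for b in (0x41, 0xC3, 0xE4, 0xF0, 0xFC):
    print(f"{b:02X} -> {sequence_length(b)} byte(s)")
```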

All other bytes start with the bits 10 and are then filled up with 6 data bits. A sequence of n bytes can hold

 b = 5n + 1  (1 < n < 7)

bits of "payload", so the maximum is 31 bits for a 6-byte sequence.
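
The formula can be rederived by counting bits. The first byte spends n ones and one zero on the length marker, leaving 7 - n data bits, and each of the n - 1 following bytes carries 6 data bits. A quick sketch (payload_bits is a name chosen for this example):

```python
def payload_bits(n: int) -> int:
    """Count the data bits in an n-byte sequence, for 2 <= n <= 6."""
    if not 2 <= n <= 6:
        raise ValueError("multi-byte sequences have 2..6 bytes in this scheme")
    first_byte_bits = 8 - (n + 1)      # n ones plus a zero mark the length
    following_bits = 6 * (n - 1)       # each 10xxxxxx byte carries 6 bits
    return first_byte_bits + following_bits


for n in range(2, 7):
    print(n, payload_bits(n), 5 * n + 1)   # the last two columns agree
```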

It follows from this that bytes in the UTF-8 encoding fall into distinct ranges:

   00..7F - plain old ASCII
   80..BF - non-initial bytes of multibyte code
   C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
   FE, FF - never used (so, free for byte-order marks).
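
The ranges above translate directly into a small classifier. This is a sketch with labels invented for the example; it follows the table as given, so F8..FD still count as initial bytes here.

```python
def classify(byte: int) -> str:
    """Name the range a single byte value falls into, per the table above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "non-initial"
    if byte <= 0xC1:
        return "illegal"        # C0 and C1
    if byte <= 0xFD:
        return "initial"
    return "never used"         # FE and FF


print([classify(b) for b in "aé中".encode("utf-8")])
```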

The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. For example, if a string loses one byte, GB2312 text can become garbled from that point on, whereas UTF-8 text does not.
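
Re-synchronizing can be sketched as skipping non-initial bytes (10xxxxxx) until the next byte that can start a character. The helper next_boundary is a hypothetical name for this example, and the damaged input is made up by dropping the first byte of a short string.

```python
def next_boundary(data: bytes, pos: int = 0) -> int:
    """Return the index of the first byte at or after pos that is not 10xxxxxx."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


original = "日本語".encode("utf-8")      # 3 characters, 3 bytes each
damaged = original[1:]                   # the first byte went missing
start = next_boundary(damaged)           # skip the 2 orphaned non-initial bytes
print(damaged[start:].decode("utf-8"))   # only the damaged character is lost
```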

Besides, UTF-8 is independent of byte order. 16-bit Unicode inherits the machine's byte order, so it has to express that with the magic value FEFF at the start of the text; should you read FFFE instead, you are to swap the bytes.
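
To make the byte-order point concrete, here is a minimal sketch that checks the first two bytes of 16-bit Unicode data against the byte-order mark constants from Python's standard codecs module. The function name detect_utf16_order and its return strings are assumptions of this example.

```python
import codecs


def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order of 16-bit Unicode data from a leading FEFF mark."""
    if data[:2] == codecs.BOM_UTF16_BE:    # bytes FE FF
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:    # bytes FF FE: swap to read FEFF
        return "little-endian"
    return "unknown"


# The same character has two 16-bit byte sequences but only one UTF-8 sequence.
print("中".encode("utf-16-be").hex(), "中".encode("utf-16-le").hex())
print("中".encode("utf-8").hex())
print(detect_utf16_order(codecs.BOM_UTF16_LE + "中".encode("utf-16-le")))
```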

If a UCS code fits 7 bits, it is coded as 0xxxxxxx. This makes ASCII characters represent themselves.
If a UCS code fits 11 bits, it is coded as 110xxxxx 10xxxxxx
If a UCS code fits 16 bits, it is coded as 1110xxxx 10xxxxxx 10xxxxxx
If a UCS code fits 21 bits, it is coded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
If a UCS code fits 26 bits, it is coded as 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
If a UCS code fits 31 bits, it is coded as 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
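
The six rules above can be folded into one encoder. What follows is a sketch of the original 31-bit scheme, with encode_ucs as a made-up name. Python's own "utf-8" codec stops at 4 bytes (U+10FFFF), so the 5 and 6 byte forms can only be produced by hand like this.

```python
def encode_ucs(code: int) -> bytes:
    """Encode a UCS code of up to 31 bits using the six patterns above."""
    if not 0 <= code < 1 << 31:
        raise ValueError("code does not fit 31 bits")
    if code < 0x80:
        return bytes([code])               # 0xxxxxxx
    n = 2
    while code >= 1 << (5 * n + 1):        # an n-byte sequence holds 5n+1 bits
        n += 1
    prefix = (0xFF << (8 - n)) & 0xFF      # n ones followed by a zero
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (code & 0x3F))   # 10xxxxxx, lowest 6 bits first
        code >>= 6
    out.append(prefix | code)              # remaining bits go in the first byte
    return bytes(reversed(out))


print(encode_ucs(0x20AC).hex())            # a 16-bit code, 3 bytes
print(encode_ucs(0x7FFFFFFF).hex())        # the largest 31-bit code, 6 bytes
```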