UTF-8 Encoding Notes

2010-12-21 rushrush
Category: LINUX

1999-08-11 - Global players need many languages and writing systems. For Chinese, Korean, or just Greek, we need a way to encode such non-ASCII characters.

For a historical perspective and a beginner's technical introduction, see Joel Spolsky's article on the subject.

The encoding standard that covers all these writing systems is Unicode, a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts (as of Unicode 3.1). A script here means the set of characters used by one writing system, such as Latin, Greek, or Han.

Tcl/Tk supports Unicode from version 8.1 on, either as 16-bit characters or in the UTF-8 encoding, which is the internal representation for strings.

UTF-8 is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes).

Characters are represented as sequences of 1..6 eight-bit bytes, termed octets in the character set business (for ASCII: 1, for 16-bit Unicode: 2..3). RFC 3629 later restricted UTF-8 to at most 4 bytes per character.
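As a quick illustration of these sequence lengths, the sketch below uses Python's built-in `str.encode` to list the octets of a few characters. The sample characters and the helper name `utf8_octets` are my own choices, not part of the original text, and a modern encoder like this one never emits more than 4 bytes.

```python
# Sample characters chosen for illustration (not from the original text):
# an ASCII letter, a Latin-1 letter, a Han character, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_octets(ch):
    """Return the UTF-8 octets of a single character as a list of integers."""
    return list(ch.encode("utf-8"))


for ch in samples:
    octets = utf8_octets(ch)
    bits = " ".join(f"{b:08b}" for b in octets)
    print(f"U+{ord(ch):04X} takes {len(octets)} octet(s): {bits}")
```

The binary column makes the pattern visible: only the first octet of a multi-byte sequence starts with `11`, and every following octet starts with `10`.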

A general principle of UTF-8

A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates the length of a multi-byte code by the number of 1 bits before the first 0, and is then filled up with data bits.
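The first-byte rule can be sketched as a small function that counts the leading 1 bits. The name `sequence_length` is hypothetical, and the sketch follows the original 1..6 byte scheme described here; it does not reject the overlong lead bytes C0 and C1.

```python
def sequence_length(first_byte):
    """Return the sequence length announced by a first byte (1..6 scheme).

    Raises ValueError for bytes that cannot start a sequence.
    Assumption: C0 and C1 are not rejected here, although they are not legal.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte < 0x80:
        return 1                      # single-byte (ASCII) character
    ones = 0
    mask = 0x80
    while first_byte & mask:          # count the 1 bits before the first 0
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a non-initial byte")
    if ones > 6:
        raise ValueError("FE and FF are never used")
    return ones


for b in (0x41, 0xC3, 0xE4, 0xF0, 0xFC):
    print(f"{b:02X} ({b:08b}) starts a sequence of {sequence_length(b)} byte(s)")
```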

All other bytes start with the bits 10 and are then filled up with 6 data bits. A sequence of n bytes can hold
 b = 5n + 1  (1 < n < 7)
bits of "payload", so the maximum is 31 bits for a 6-byte sequence.
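The payload formula can be checked by counting data bits directly: in an n-byte sequence the first byte spends n bits on 1s and one bit on the terminating 0, leaving 7 - n data bits, and each of the n - 1 remaining bytes carries 6. The function name `payload_bits` below is hypothetical; this is a sketch of that arithmetic, not code from the original text.

```python
def payload_bits(n):
    """Data bits carried by an n-byte sequence, for 1 < n < 7."""
    if not 2 <= n <= 6:
        raise ValueError("formula applies to multi-byte sequences of 2..6 bytes")
    first_byte_bits = 7 - n           # 8 bits minus n ones and one zero
    other_bytes_bits = 6 * (n - 1)    # each 10xxxxxx byte carries 6 bits
    return first_byte_bits + other_bytes_bits


for n in range(2, 7):
    # (7 - n) + 6(n - 1) simplifies to 5n + 1
    assert payload_bits(n) == 5 * n + 1
    print(n, "bytes carry", payload_bits(n), "bits")
```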

It follows from this that bytes in UTF-8 encoding fall in distinct ranges:
   00..7F - plain old ASCII
   80..BF - non-initial bytes of multibyte code
   C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
   FE, FF - never used (so, free for byte-order marks).
The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. For example, if one byte is lost from a string, GB2312-encoded text can become completely garbled, while UTF-8 text does not.
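Re-synchronization can be sketched as skipping non-initial bytes until the next byte that may start a character. The helper names `is_non_initial` and `resync` and the sample string are my own assumptions for illustration.

```python
def is_non_initial(byte):
    """True for bytes in 80..BF, which never start a character."""
    return 0x80 <= byte <= 0xBF


def resync(data, pos):
    """Return the first index >= pos whose byte is not a non-initial byte."""
    while pos < len(data) and is_non_initial(data[pos]):
        pos += 1
    return pos


# "a", a Han character, "b" encodes to 61 E4 B8 AD 62.
data = "a\u4e2db".encode("utf-8")

# Simulate data loss by dropping the initial byte E4: 61 B8 AD 62.
damaged = data[:1] + data[2:]

# Starting right after "a", skip the orphaned non-initial bytes.
start = resync(damaged, 1)
print(start, damaged[start:].decode("utf-8"))
```

Only the damaged character is lost; decoding resumes cleanly at the next initial byte.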

Besides, UTF-8 is independent of byte order. 16-bit Unicode (UTF-16) inherits the machine's byte order, so it has to express that with the magic byte-order mark FEFF; should you read FFFE, you are to swap the bytes.
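The byte-order point can be illustrated with Python's standard codecs. The function name `decode_utf16_with_bom` is hypothetical, and the sketch assumes the input starts with a byte-order mark.

```python
import codecs

ch = "\u4e2d"                          # sample character, chosen for illustration

utf16_be = ch.encode("utf-16-be")      # big-endian:    4E 2D
utf16_le = ch.encode("utf-16-le")      # little-endian: 2D 4E
utf8 = ch.encode("utf-8")              # same byte sequence on every machine


def decode_utf16_with_bom(data):
    """Decode 16-bit Unicode, using the leading byte-order mark to pick the order."""
    if data[:2] == codecs.BOM_UTF16_BE:        # read as FE FF: order as expected
        return data[2:].decode("utf-16-be")
    if data[:2] == codecs.BOM_UTF16_LE:        # read as FF FE: bytes are swapped
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte-order mark found")


print(utf16_be.hex(), utf16_le.hex(), utf8.hex())
print(decode_utf16_with_bom(codecs.BOM_UTF16_LE + utf16_le) == ch)
```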

