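Character encodings commonly used on computers fall into two main families: ASCII and Unicode.

1. ASCII

ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet. It is the standard single-byte encoding scheme for text data. Standard ASCII uses 7 bits to represent 128 possible characters; 8-bit extensions of it can represent 256.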
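As a small illustration of the 7-bit range described above, the following sketch encodes a string as ASCII and checks that no byte uses the 8th bit. It uses Python's built-in `ascii` codec rather than a .NET class, which is an assumption made only for illustration; the helper name `fits_in_7_bits` is hypothetical.

```python
# Standard ASCII assigns each character a code in the range 0..127 (7 bits).
text = "Hello, ASCII!"

codes = [ord(ch) for ch in text]   # numeric code of each character
encoded = text.encode("ascii")     # one byte per character


def fits_in_7_bits(data: bytes) -> bool:
    """Return True if every byte has its 8th (high) bit clear."""
    return all(b < 0x80 for b in data)


print(codes)
print(len(encoded) == len(text))   # single-byte encoding: lengths match
print(fits_in_7_bits(encoded))
```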
The following first explains what an encoding is, and then describes the advantages, disadvantages, and differences of each of these encodings in detail.

Understanding Encoding

Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.

The key sentence is: "An encoding describes the rules by which an encoder or decoder operates."
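The encoder/decoder relationship described above can be sketched in a few lines. This example uses Python's built-in codecs as a stand-in for the .NET Encoding classes (an assumption for illustration only), and the helper `is_valid_utf16le` is a hypothetical name. It shows that the same text yields different byte sequences under different encodings, and that a UTF-16 decoder can reject an unpaired surrogate.

```python
text = "h\u00e9llo"  # "héllo"

# Encoding: text -> bytes. The rules differ per encoding.
utf8_bytes = text.encode("utf-8")       # é takes two bytes
utf16_bytes = text.encode("utf-16-le")  # each character here is one 16-bit unit

# Decoding: bytes -> text, using the same rules in reverse.
print(utf8_bytes.decode("utf-8") == text)
print(utf16_bytes.decode("utf-16-le") == text)
print(len(utf8_bytes), len(utf16_bytes))


def is_valid_utf16le(data: bytes) -> bool:
    """Validation step: strict decoding fails on malformed surrogate pairs."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# U+D800 (a high surrogate) followed by 'A' instead of a low surrogate.
unpaired_surrogate = b"\x00\xd8\x41\x00"
print(is_valid_utf16le(utf16_bytes))
print(is_valid_utf16le(unpaired_surrogate))
```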
A UTF is a way of encoding Unicode code points into a binary representation. The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a way to encode that code point.

Selecting an Encoding Class

When you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported).

In particular, UTF8Encoding is preferred over ASCIIEncoding. If the content is ASCII, the two encodings are identical, but UTF8Encoding can also represent every Unicode character, while ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.

UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding are faster than operations performed with ASCIIEncoding.

You should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding might still be a better choice. Assuming default settings, the following scenarios can occur:

If your application has content that is not strictly ASCII and encodes it with ASCIIEncoding, each non-ASCII character encodes as a question mark ("?"). If the application then decodes this data, the information is lost.

If the application has content that is not strictly ASCII and encodes it with UTF8Encoding, the result seems unintelligible if interpreted as ASCII. However, if the application then decodes this data, the data performs a round trip successfully.
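The two scenarios above can be reproduced with a short sketch. It uses Python's codecs rather than ASCIIEncoding and UTF8Encoding, and passes `errors="replace"` to imitate the question-mark replacement described for the .NET defaults; both choices are assumptions of this example, since Python's own default is to raise an error instead.

```python
text = "na\u00efve caf\u00e9"  # "naïve café", not strictly ASCII

# Scenario 1: ASCII with replacement. Non-ASCII characters become "?".
ascii_bytes = text.encode("ascii", errors="replace")
ascii_round_trip = ascii_bytes.decode("ascii")
print(ascii_bytes)                 # the accented letters are gone
print(ascii_round_trip == text)    # information was lost

# Scenario 2: UTF-8. The bytes look wrong if read as ASCII...
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("ascii", errors="replace")
print(misread == text)

# ...but decoding with the matching encoding restores the original text.
utf8_round_trip = utf8_bytes.decode("utf-8")
print(utf8_round_trip == text)
```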
When selecting the ASCII encoding for your applications, consider the following:

The ASCII encoding is usually appropriate for protocols that require ASCII. If your application requires 8-bit encoding, the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable.

Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit. Previous versions of .NET Framework allowed spoofing by merely ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.

2. UTF7Encoding

Represents a UTF-7 encoding of Unicode characters. The UTF-7 encoding represents Unicode characters as sequences of 7-bit ASCII characters. This encoding supports certain protocols for which it is required, most often e-mail or newsgroup protocols. Since UTF-7 is not particularly secure or robust, and most modern systems allow 8-bit encodings, UTF-8 should normally be preferred to UTF-7.

UTF7Encoding does not provide error detection. For security reasons, the application should use UTF8Encoding, UnicodeEncoding, or UTF32Encoding and enable error detection.

In short, using UTF7Encoding is not recommended.
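To make the "sequences of 7-bit ASCII characters" property concrete, here is a minimal sketch using Python's built-in `utf-7` codec in place of UTF7Encoding (an assumption for illustration; the two implementations may differ in which characters they encode directly). It checks that the UTF-7 output never sets the 8th bit, while the UTF-8 output of the same text does.

```python
text = "Hi Mom \u263a!"  # contains U+263A, a non-ASCII character

utf7_bytes = text.encode("utf-7")
utf8_bytes = text.encode("utf-8")

# UTF-7 output consists only of bytes below 0x80.
utf7_is_7bit = all(b < 0x80 for b in utf7_bytes)
# UTF-8 uses bytes with the high bit set for non-ASCII characters.
utf8_is_7bit = all(b < 0x80 for b in utf8_bytes)

print(utf7_bytes)
print(utf7_is_7bit, utf8_is_7bit)
print(utf7_bytes.decode("utf-7") == text)  # still round-trips
```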
3. UTF8Encoding

UTF-8 encoding represents each code point as a sequence of one to four bytes. UTF8Encoding works byte by byte: characters in different code point ranges are encoded with different lengths, and the longest encoding is four bytes.

As noted above, UTF8Encoding has been tuned for speed and should be faster than any other encoding; even when the content is entirely ASCII, it is faster than ASCIIEncoding. It can also represent every Unicode character and supports error detection, so UTF8Encoding is recommended over ASCIIEncoding.
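The variable-length behaviour can be checked with a small sketch. It uses Python's `utf-8` codec as a stand-in for UTF8Encoding (an assumption for illustration), and the helper names are hypothetical. The range boundaries in `expected_utf8_length` are the standard UTF-8 ones.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the codec actually produces for one character."""
    return len(ch.encode("utf-8"))


def expected_utf8_length(code_point: int) -> int:
    """Byte count predicted from the code point range alone."""
    if code_point < 0x80:
        return 1       # U+0000..U+007F (the ASCII range)
    if code_point < 0x800:
        return 2       # U+0080..U+07FF
    if code_point < 0x10000:
        return 3       # U+0800..U+FFFF
    return 4           # U+10000..U+10FFFF


samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀
lengths = [utf8_length(ch) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```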
4. UnicodeEncoding

UnicodeEncoding uses 16-bit unsigned integers as its code unit and encodes each code point as one or two of them.

The Unicode Standard uses the following UTFs: UTF-8, which represents each code point as a sequence of one to four bytes; UTF-16, which represents each code point as a sequence of one to two 16-bit integers; and UTF-32, which represents each code point as a 32-bit integer.

UnicodeEncoding is not compatible with ASCII, because even ASCII characters occupy two bytes. It implements UTF-16, which is also the format C# (.NET) uses to store strings internally.

5. UTF32Encoding

UTF32Encoding uses 32-bit unsigned integers as its code unit and encodes each code point as a single 32-bit integer.
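The difference in code units between UTF-16 and UTF-32 can be sketched as follows. Python's `utf-16-le` and `utf-32-le` codecs stand in for UnicodeEncoding and UTF32Encoding here; that substitution, the little-endian byte order, and the helper names are assumptions of this example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units used for one character."""
    return len(ch.encode("utf-16-le")) // 2


def utf32_units(ch: str) -> int:
    """Number of 32-bit code units used for one character."""
    return len(ch.encode("utf-32-le")) // 4


samples = ["A", "\u4e2d", "\U0001F600"]  # A, 中, 😀

for ch in samples:
    print(f"U+{ord(ch):04X}: UTF-16 units={utf16_units(ch)}, "
          f"UTF-32 units={utf32_units(ch)}")

# Even an ASCII character takes two bytes in UTF-16, so the byte
# sequence differs from its ASCII encoding.
print("A".encode("utf-16-le") == "A".encode("ascii"))
```

Characters outside the Basic Multilingual Plane, such as U+1F600, are the ones that need two 16-bit units (a surrogate pair) in UTF-16, while UTF-32 always uses exactly one unit.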