
Date: 2010-06-30 09:00 Source: 蓝天飞行翻译 Author: admin

the 16-bit limitations. UTF-32 encodes each code position as a 32-bit binary integer, i.e. as four octets.
This is a very obvious and simple encoding. However, it is inefficient in terms of the number of octets
Edition Number: 4.5 47
AIXM PRIMER
needed. If we have normal English text or other text which contains ISO Latin 1 characters only, the
length of the UTF-32 encoded octet sequence is four times the length of the string in ISO 8859-1 encoding.
UTF-32 is rarely used, except perhaps in internal operations (since it is very simple for the purposes
of string processing). UTF-16 represents each code position in the Basic Multilingual Plane as two octets.
Other code positions are represented using so-called surrogate pairs, utilizing some code positions in the
BMP reserved for the purpose. This, too, is a very simple encoding when the data contains BMP characters
only. Unicode can be, and often is, encoded in other ways, too, such as the following encodings:
UTF-7: Each character code is presented as a sequence of one or more octets in the range 0 - 127 ("bytes
with most significant bit set to 0", or "seven-bit bytes", hence the name). Most ASCII characters are
presented as such, each as one octet, but for obvious reasons some octet values must be reserved for use
as "escape" octets, specifying that the octet, together with a certain number of subsequent octets, forms a
multi-octet encoded presentation of one character. There is an example of using UTF-7 later in this
document.
UTF-8: Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using
one octet for each code (character). All other codes are presented, according to a relatively complicated
method, so that one code (character) is presented as a sequence of two to four octets, each of which is
in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 ("bytes with
most significant bit set to 0") directly represent ASCII characters, whereas octets in the range 128 - 255
("bytes with most significant bit set to 1") are to be interpreted as parts of multi-octet encoded presentations of characters.
IETF Policy on Character Sets and Languages (RFC 2277) favors UTF-8. It requires support for it in
Internet protocols (and doesn't even mention UTF-7). Note that UTF-8 is efficient if the data consists
predominantly of ASCII characters with just a few "special characters" in addition to them, and reasonably
efficient for predominantly ISO Latin 1 text.
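The size trade-offs described above can be checked with a short Python sketch. The sample strings and the helper name octet_counts are illustrative assumptions, not part of the primer; the codec names are those of Python's standard library, and the big-endian variants are used so that no byte order mark is added to the counts.

```python
# Compare how many octets each Unicode encoding needs for the same text.
# The sample strings are illustrative assumptions.
samples = {
    "ascii": "runway",           # ASCII characters only
    "latin1": "a\u00e9roport",   # contains one ISO Latin 1 character (e with acute)
    "non_bmp": "\U0001F6EB",     # one code position outside the BMP
}

def octet_counts(text):
    """Return the encoded length in octets for several Unicode encodings."""
    return {
        name: len(text.encode(name))
        for name in ("utf-32-be", "utf-16-be", "utf-8", "utf-7")
    }

sizes = {label: octet_counts(text) for label, text in samples.items()}

for label, counts in sizes.items():
    print(label, counts)

# Every octet of a non-ASCII character in UTF-8 lies in the range 128 - 255.
multi_octet = "\u00e9".encode("utf-8")
print(all(octet >= 128 for octet in multi_octet))

# In UTF-16 a non-BMP code position becomes two 16-bit units (a surrogate pair).
pair = samples["non_bmp"].encode("utf-16-be")
high_unit = int.from_bytes(pair[:2], "big")
low_unit = int.from_bytes(pair[2:], "big")
print(hex(high_unit), hex(low_unit))
```

For the ISO Latin 1 sample, the UTF-32 length is four times the ISO 8859-1 length, while UTF-8 adds only one octet for the accented character.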
C.4. Why Character Encoding is Important
Problems with character encoding can arise when files generated on one system are transferred and interpreted
on another system - basically when data is exchanged. Because the upper 128 octet values (128 - 255) were
allocated differently between different encodings (usually along the same lines as language, but sometimes
there were even different encodings within a single country), data could be interpreted incorrectly. For
example, on some computers the character code 130 would display as é (e with an acute accent), but on
computers sold in Israel it was the Hebrew letter Gimel.
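The code 130 example can be reproduced in Python. The text does not name the encodings involved; the sketch below assumes the original IBM PC code page (cp437) and the Hebrew DOS code page (cp862), both of which ship with Python's standard codecs.

```python
import unicodedata

# A single octet with value 130 (0x82), as it might appear in an exchanged file.
raw = bytes([130])

# Assumption: cp437 stands in for "some computers" and cp862 for
# "computers sold in Israel"; the primer itself does not name the code pages.
as_western = raw.decode("cp437")
as_hebrew = raw.decode("cp862")

for label, char in (("cp437", as_western), ("cp862", as_hebrew)):
    print(label, hex(ord(char)), unicodedata.name(char))
```

The same octet yields two unrelated characters, and neither decoding raises an error, so the receiver gets no indication that anything is wrong.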
This misinterpretation could have a range of effects depending on the software. At one end of the scale
this could be manifested as an application incorrectly displaying or printing a character - this is a typical
problem for word processors. At the other end of the scale an unexpected character code may be sufficient
to crash an application. Although the latter may appear to be the more severe, in terms of data integrity
it is the potentially undetected error that can be more problematic. This can happen quite easily if data
is incorrectly interpreted and then rewritten or saved using another encoding. Equally problematic is the
case in which the declared encoding is different to the actual encoding.
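Both ends of the scale can be sketched in Python. The sample word and the variable names are illustrative assumptions. The first part shows the silent case, where UTF-8 data is misread as ISO 8859-1 and then saved again; the second shows the loud case, where a strict decoder rejects octets that do not match the declared encoding.

```python
# Illustrative sample text containing one non-ASCII character.
original = "A\u00e9rodrome"
stored = original.encode("utf-8")        # actual encoding of the data: UTF-8

# Silent corruption: the receiver wrongly assumes ISO 8859-1.
# Every octet is a valid ISO 8859-1 character, so no error is raised.
misread = stored.decode("iso-8859-1")

# Saving the misread text as UTF-8 writes the damage into the data.
resaved = misread.encode("utf-8")

print(repr(original), repr(misread))
print(len(stored), len(resaved))

# Loud failure: ISO 8859-1 octets declared as UTF-8 are rejected
# by a strict decoder instead of being silently altered.
try:
    original.encode("iso-8859-1").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)
```

The rejected decode is the easier problem to handle, because the receiving system knows the data is bad. The resaved octets are valid UTF-8 and will pass any later validity check, yet no longer represent the original text.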
Obviously this has major implications for aeronautical data exchange if ICAO standards for the integrity
of data are to be adhered to. Within AIXM, these kinds of problems could affect all elements which do
not restrict their content to simple characters. These are textual descriptions and textual remarks.
C.5. Specifying Character Encodings in XML
The solution to the problem, or at least the beginning of the solution, is to know which encoding a string
uses, whether it be in a file, in memory or in an e-mail. The remainder of the solution is for software to
be able to identify that encoding and to interpret it correctly, or handle failure gracefully. Whenever
textual data is exchanged between systems, the sender and the recipient should agree on the character
encoding used. The easiest way to do this is to specify the character encoding used in a header statement
 