How XML detects its character encoding

For a while I assumed that the encoding declaration at the head of an XML file, that is:

[html]
<?xml version="1.0" encoding="UTF-8"?>
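could be read by any parser without trouble, because it is guaranteed to be plain English made of basic ASCII characters, and every encoding represents those the same way.
After reading this post and the XML specification, I learned otherwise: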

1. Not every encoding is compatible with ASCII for the basic characters

[plain]
UTF-16 (LE/BE), UCS-2, UCS-4 and EBCDIC are all legal encodings that don't encode those basic characters the same way as ASCII.
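To make the point concrete, here is a minimal sketch that dumps the bytes of the opening <?xml under a few encodings. The class name DeclarationBytes and the helper hex are made up for illustration; only charsets from java.nio.charset.StandardCharsets are used, so EBCDIC and UCS-4 are left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DeclarationBytes {
    // Hex-dump the bytes that a string produces under the given charset.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String start = "<?xml";
        // ASCII and UTF-8 agree byte for byte on these characters.
        System.out.println("US-ASCII : " + hex(start, StandardCharsets.US_ASCII));
        System.out.println("UTF-8    : " + hex(start, StandardCharsets.UTF_8));
        // UTF-16 pads every character with a zero byte, on one side or the other.
        System.out.println("UTF-16BE : " + hex(start, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE : " + hex(start, StandardCharsets.UTF_16LE));
    }
}
```

The ASCII and UTF-8 lines come out as 3C 3F 78 6D 6C, while the two UTF-16 lines interleave 00 bytes, so a parser that assumes ASCII cannot read the declaration of a UTF-16 file directly.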

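2. An XML parser must support at least UTF-8 and UTF-16. It works out the encoding in roughly this order:
By the BOM
[plain]
(for UTF-16, UTF-32 and even UTF-8 with the dummy BOM)
By the bytes at the start of the file, where the XML declaration sits (its opening <?xml serves much the same purpose as a BOM)

[plain]
try UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings.
By the encoding attribute (part of the text content that the XML specification defines)
The encoding used by the external environment, such as the transport layer, also helps in inferring the XML encoding
Note: although the encoding attribute comes late in this order, it is still worth writing out explicitly, because a parser typically uses the earlier steps only to infer the encoding family and then reads the attribute to settle on the exact encoding.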
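The first two steps can be sketched as a check on the first four bytes of the file. This is only an illustration, not how any particular parser is written: the class EncodingSniffer, its method names and the returned labels are assumptions, and the byte patterns follow the autodetection appendix of the XML 1.0 specification.

```java
class EncodingSniffer {
    // True when head begins with the given byte values (each 0..255).
    static boolean startsWith(byte[] head, int... pattern) {
        if (head.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if ((head[i] & 0xFF) != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Guess the encoding family from the first bytes of an XML file.
    static String guessFamily(byte[] head) {
        // Step 1: byte order marks. The 4-byte UTF-32 marks are tested first,
        // because FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";

        // Step 2: no BOM, so look at how the leading "<?xm" is laid out.
        if (startsWith(head, 0x00, 0x00, 0x00, 0x3C)) return "UTF-32BE";
        if (startsWith(head, 0x3C, 0x00, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        if (startsWith(head, 0x4C, 0x6F, 0xA7, 0x94)) return "EBCDIC";
        // The declaration is readable as ASCII; the encoding attribute
        // must then be read to pick the exact encoding.
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible";

        // Neither a BOM nor a declaration: the XML default is UTF-8.
        return "UTF-8";
    }
}
```

For the EBCDIC and ASCII-compatible results the family alone is not enough, which is why the encoding attribute still matters even though it is consulted last.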

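UCS-4 and UTF-8
I had not come across UCS-4 before, so I took the chance to learn about it and to review the rest.

Unicode is a character set that assigns a code point to every character; the UTFs are concrete ways of encoding those code points as bytes.

A comparison of the three UTF encoding forms in use today: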

| Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
| Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
| Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits |
| Byte order | N/A | <BOM> | big-endian | little-endian | <BOM> | big-endian | little-endian |
| Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4 |
| Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
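As the comparison shows, UTF-8 and UTF-16 are variable-length; UTF-8 is the more widely used of the two and is also relatively compact, especially for text that is mostly ASCII.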
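The variable lengths are easy to observe from Java. A minimal sketch follows; the class name UtfSizes and the three sample characters are chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;

class UtfSizes {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, plain ASCII
            "\u015B",                               // U+015B, Latin small letter s with acute
            new String(Character.toChars(0x1F600))  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16, U+015B takes 2 bytes in both, and U+1F600 takes 4 bytes in both, which matches the fewest and most bytes per character listed in the comparison.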

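Conceptually, UCS-4 can be treated as equivalent to UTF-32; strictly speaking, UTF-32 is a subset of UCS-4, because it is restricted to code points up to 10FFFF.

The story behind the word endian is an entertaining one. Most such names are quite vivid (Eden in garbage collection is another); we simply do not share the cultural background they come from.

Why do Java, HTML, CSS and JS each escape Unicode differently?
In the FAQ referred to above I came across the following passage, which I quote here: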

Q: Is there a standard method to package a Unicode character so it fits an 8-Bit ASCII stream?

A: There are three or four options for making Unicode fit into an 8-bit format.

a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters, e.g. file name syntaxes, markup languages, etc., but where the all other characters may use arbitrary bytes.
Example: “Latin Small Letter s with Acute” (015B) would be encoded as two bytes: C5 9B.

b) Use Java or C style escapes, of the form \uXXXXX or \xXXXXX. This format is not standard for text files, but well defined in the framework of the languages in question, primarily for source files.
Example: The Polish word “wyjście” with character “Latin Small Letter s with Acute” (015B) in the middle (ś is one character) would look like: “wyj\u015Bcie".

c) Use the &#xXXXX; or &#DDDDD; numeric character escapes as in HTML or XML. Again, these are not standard for plain text files, but well defined within the framework of these markup languages.
Example: “wyjście” would look like “wyj&#x015B;cie"

d) Use SCSU. This format compresses Unicode into 8-bit format, preserving most of ASCII, but using some of the control codes as commands for the decoder. However, while ASCII text will look like ASCII text after being encoded in SCSU, other characters may occasionally be encoded with the same byte values, making SCSU unsuitable for 8-bit channels that blindly interpret any of the bytes as ASCII characters.
Example: “<SC2> wyjÛcie” where <SC2> indicates the byte 0x12 and “Û” corresponds to byte 0xDB. [AF]

Conclusion:

Java / JSON: \uXXXX
CSS: \XXXX
HTML / XML: &#xXXXX; (hexadecimal) or &#DDDDD; (decimal)
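The three escape styles can be produced from a single code point. The following is a rough sketch; the class UnicodeEscapes and its method names are invented for this example, and the CSS form appends a space because a CSS escape is terminated by whitespace when fewer than six hex digits are written.

```java
class UnicodeEscapes {
    // Java / JSON style: one escape per UTF-16 code unit, so a character
    // outside the Basic Multilingual Plane becomes a surrogate pair.
    static String javaEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // CSS style: a backslash, the hex digits, then a terminating space.
    static String cssEscape(int codePoint) {
        return String.format("\\%X ", codePoint);
    }

    // HTML / XML style: a hexadecimal numeric character reference.
    static String xmlEscape(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    public static void main(String[] args) {
        int sAcute = 0x015B; // Latin small letter s with acute, as in the FAQ example
        System.out.println("Java / JSON: wyj" + javaEscape(sAcute) + "cie");
        System.out.println("CSS        : wyj" + cssEscape(sAcute) + "cie");
        System.out.println("HTML / XML : wyj" + xmlEscape(sAcute) + "cie");
    }
}
```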
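Ways of parsing XML
There are two main approaches, DOM and SAX. DOM is document-oriented and builds the whole tree structure in memory; SAX is event-oriented, usually processes the input in a single pass, and is correspondingly more efficient.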
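The difference shows up even when counting elements in a tiny document. This is a minimal sketch using the JAXP classes that ship with the JDK; the class DomVsSax, the sample document and the element name book are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSax {
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><books><book/><book/></books>";

    static ByteArrayInputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // DOM: parse the whole document into an in-memory tree, then query it.
    static int countWithDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(input());
            return doc.getElementsByTagName("book").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: receive a callback per element while the parser streams through
    // the input once; no tree is kept.
    static int countWithSax() {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input(), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if ("book".equals(qName)) {
                        count[0]++;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("DOM count: " + countWithDom());
        System.out.println("SAX count: " + countWithSax());
    }
}
```

Both methods arrive at the same count; the DOM version can revisit the tree afterwards, whereas the SAX version has to collect whatever it needs during the single pass.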
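JAXP is the unified entry point for parsing XML in the Java world; its purpose is to avoid depending on any vendor-specific interface. It offers both DOM and SAX parsing, and the default implementation is based on Apache Xerces.
To replace the default implementation, the lookup order is described in this passage: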
Use the javax.xml.parsers.SAXParserFactory system property.
Use the properties file "lib/jaxp.properties" in the JRE directory. This configuration file is in standard java.util.Properties format and contains the fully qualified name of the implementation class with the key being the system property defined above. The jaxp.properties file is read only once by the JAXP implementation and it's values are then cached for future use. If the file does not exist when the first attempt is made to read from it, no further attempts are made to check for its existence. It is not possible to change the value of any property in jaxp.properties after it has been read for the first time.
Use the Services API (as detailed in the JAR specification), if available, to determine the classname. The Services API will look for a classname in the file META-INF/services/javax.xml.parsers.SAXParserFactory in jars available to the runtime.
Platform default SAXParserFactory instance.
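Of these, only the Services API looks reasonably general-purpose. I found an article about it and will read it properly when I get the chance.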
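To see which implementation the lookup above ends up choosing, something like the following sketch can be used. The class name JaxpLookup is made up; the property name is the one quoted in the passage.

```java
import javax.xml.parsers.SAXParserFactory;

class JaxpLookup {
    // System property consulted first by SAXParserFactory.newInstance().
    static final String PROPERTY = "javax.xml.parsers.SAXParserFactory";

    // Fully qualified class name of the factory that JAXP selected.
    static String activeFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Prints null unless the property was set, for example with -D on the command line.
        System.out.println("system property: " + System.getProperty(PROPERTY));
        System.out.println("factory in use : " + activeFactoryClass());
    }
}
```

Running it with -Djavax.xml.parsers.SAXParserFactory=com.example.MySaxFactory (a hypothetical class name) would exercise the first lookup step; the class has to be on the classpath, otherwise newInstance() fails with a FactoryConfigurationError.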
