Java处理UTF-8带BOM的文本的读写
什么是BOM
BOM(byte-order mark),即字节顺序标记,它是插入到以UTF-8、UTF16或UTF-32编码Unicode文件开头的特殊标记,用来识别Unicode文件的编码类型。对于UTF-8来说,BOM并不是必须的,因为BOM用来标记多字节编码文件的编码类型和字节顺序(big-endian或little- endian)。
BOMs 文件头:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
EF BB BF = UTF-8,
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
下面举个例子,针对UTF-8的文件BOM做个处理:
String xmla = StringFileToolkit.file2String(new File(“D:\\projects\\mailpost\\src\\a.xml”),“UTF-8”);
byte[] b = xmla.getBytes(“UTF-8”);
String xml = new String(b,3,b.length-3,“UTF-8”);
..............
思路是:先按照UTF-8编码读取文件后,跳过前三个字符,重新构建一个新的字符串,然后用Dom4j解析处理,这样就不会报错了。
其他编码的方式处理思路类似,其实可以写一个通用的自动识别的BOM的工具,去掉BOM信息,返回字符串。
不过这个处理过程已经有牛人解决过了:http://koti.mbnet.fi/akini/java/unicodereader/
Java代码
Example code using UnicodeReader class
Here is an example method to read text file. It will recognize bom marker and skip it while reading.
//import http://koti.mbnet.fi/akini/java/unicodereader/UnicodeReader.java.txt
public static char[] loadFile(String file) throws IOException {
// read text file, auto recognize bom marker or use
// system default if markers not found.
BufferedReader reader = null;
CharArrayWriter writer = null;
UnicodeReader r = new UnicodeReader(new FileInputStream(file), null);
char[] buffer = new char[16 * 1024]; // 16k buffer
int read;
try {
reader = new BufferedReader(r);
writer = new CharArrayWriter();
while( (read = reader.read(buffer)) != -1) {
writer.write(buffer, 0, read);
}
writer.flush();
return writer.toCharArray();
} catch (IOException ex) {
throw ex;
} finally {
try {
writer.close(); reader.close(); r.close();
} catch (Exception ex) { }
}
}
Java代码
Example code to write UTF-8 with bom marker
Write bom marker bytes to start of empty file and all proper text editors have no problems using a correct charset while reading files. Java's OutputStreamWriter does not write utf8 bom marker bytes.
public static void saveFile(String file, String data, boolean append) throws IOException {
BufferedWriter bw = null;
OutputStreamWriter osw = null;
File f = new File(file);
FileOutputStream fos = new FileOutputStream(f, append);
try {
// write UTF8 BOM mark if file is empty
if (f.length() < 1) {
final byte[] bom = new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF };
fos.write(bom);
}
osw = new OutputStreamWriter(fos, "UTF-8");
bw = new BufferedWriter(osw);
if (data != null) bw.write(data);
} catch (IOException ex) {
throw ex;
} finally {
try { bw.close(); fos.close(); } catch (Exception ex) { }
}
}
实际应用:
Java代码
package com.dayo.gerber;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.util.Properties;
/**
*
* @author 刘飞(liufei)
*
*/
public class Generate4YYQTPScript {
private static final String ENCODING = "UTF-8";
private static final String GERBER_CONFIG = "config/gerber4yy.properties";
private static Properties GERBER_CONFIG_PROPS = null;
private static final String GERBER_FORMAT_DIALOG_TITLE_SCRIPT = "{#GERBER_FORMAT_DIALOG_TITLE}";
private static String GERBER_FORMAT_DIALOG_TITLE = "";
/* gerber properties parmters keys config */
private static final String QTP_SCRIPT_IN = "script.in";
private static final String QTP_SCRIPT_OUT = "script.out";
private static final String QTP_SYSTEM_PATH = "QTP.system.path";
private static final String QTP_SYSTEM_PATH_S
补充:软件开发 , Java ,