Java: garbled text when fetching the source of a GBK-encoded web page. Help!
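Hi all. I adapted some code from the web that fetches a page's source, but for pages whose encoding is "GBK" the result always comes out garbled. The code is below; any suggestions would be much appreciated.
import java.io.BufferedReader;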
import java.io.InputStreamReader;
import java.net.URL;
public class WebCon {
    // Fetch the page source
    public String getWebCon(String pageURL, String encoding) {
        StringBuffer sb = new StringBuffer();
        try {
            URL url = new URL(pageURL);
            BufferedReader in = new BufferedReader(new InputStreamReader(url
                    .openStream(), encoding));
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
            in.close();
        } catch (Exception e) { // Report any errors that arise
            sb.append(e.toString());
            System.err.println(e);
            System.err.println("Usage: java HttpClient <URL> [<filename>]");
        }
        return sb.toString();
    }
    public static void main(String args[]) {
        WebCon wc = new WebCon();
        System.out.println(wc.getWebCon("http://roll.sohu.com/20110903/n318214096.shtml", "gbk"));
    }
}
-------------------- Programming Q&A --------------------
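import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;
public class WebCon {
    // Fetch the page source
    public String getWebCon(String pageURL, String encoding) {
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[1024];
        try {
            URL url = new URL(pageURL);
            // Content-Encoding is the compression applied to the body (e.g. gzip), not the character set
            URLConnection uc = url.openConnection();
            System.err.println("-------------- Content-Encoding: " + uc.getContentEncoding() + " --------");
            // Read from the same connection; url.openStream() would send a second request.
            // This assumes the body is gzip-compressed; GZIPInputStream throws on a plain body.
            // Decoding through a Reader keeps multi-byte characters intact across buffer boundaries.
            try (Reader reader = new InputStreamReader(new GZIPInputStream(
                    new BufferedInputStream(uc.getInputStream())), encoding)) {
                int num;
                while ((num = reader.read(buf, 0, buf.length)) != -1) {
                    // Append only the characters actually read in this pass
                    sb.append(buf, 0, num);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
    public static void main(String args[]) {
        WebCon wc = new WebCon();
        System.out.println(wc.getWebCon("http://roll.sohu.com/20110903/n318214096.shtml", "gbk"));
    }
}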
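The page body is gzip-compressed, so it has to be decompressed with GZIPInputStream first and only then decoded with the page's charset!
-------------------- Programming Q&A --------------------
OP, take a look at the fix for garbled text described here:
http://hpjianhua.iteye.com/blog/428585
-------------------- Programming Q&A --------------------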
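Er, thanks for the reply. But done this way, I can no longer fetch the source of pages encoded in gb2312 or utf-8. Is there a general way to fetch the source of pages in all three encodings?
-------------------- Programming Q&A --------------------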
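One possible general approach, sketched here under assumptions rather than taken from any reply in this thread: look at the Content-Encoding response header and wrap the stream in GZIPInputStream only when it says gzip, then decode the complete byte array once with whatever charset the caller passes in (GBK, GB2312 or UTF-8). The class name GzipAwareFetcher and the helpers unwrap, readAll and fetch are made-up names for illustration, and sending an Accept-Encoding request header is my own assumption about what the server expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
class GzipAwareFetcher {

    // Wrap the stream in GZIPInputStream only when the Content-Encoding header mentions gzip.
    static InputStream unwrap(InputStream in, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.toLowerCase().contains("gzip")) {
            return new GZIPInputStream(in);
        }
        return in;
    }

    // Collect all bytes first and decode once, so a multi-byte character
    // that straddles two reads is never cut in half.
    static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), Charset.forName(charsetName));
    }

    // Fetch a page; the charset is still chosen by the caller.
    static String fetch(String pageUrl, String charsetName) throws IOException {
        URLConnection uc = new URL(pageUrl).openConnection();
        // Assumption: tell the server that a gzip body is acceptable.
        uc.setRequestProperty("Accept-Encoding", "gzip");
        try (InputStream in = unwrap(uc.getInputStream(), uc.getContentEncoding())) {
            return readAll(in, charsetName);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java GzipAwareFetcher <URL> <charset>");
            return;
        }
        System.out.println(fetch(args[0], args[1]));
    }
}
```

For the page in this thread, the call would be fetch("http://roll.sohu.com/20110903/n318214096.shtml", "GBK"). The same method should also cope with uncompressed gb2312 or utf-8 pages, because the gzip step is skipped when the header is absent.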
-------------------- Programming Q&A --------------------
Still garbled. It doesn't work.
-------------------- Programming Q&A --------------------
I've run into the same problem. Any solution?
-------------------- Programming Q&A --------------------
// Apache HttpClient 4.x
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://localhost:8080/test/");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    System.out.println("entity.getContentEncoding(): " + entity.getContentEncoding());
    System.out.println("EntityUtils.toString(entity): " + EntityUtils.toString(entity));
}
URL url = new URL("http://roll.sohu.com/20110903/n318214096.shtml");
URLConnection connection = url.openConnection();
// The server sends a gzip-compressed body: decompress first, then decode the bytes as GBK
try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "GBK"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        System.out.println(line);
    }
}
$ curl -I http://roll.sohu.com/20110903/n318214096.shtml
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 20308
Connection: keep-alive
Date: Mon, 28 Nov 2011 07:29:19 GMT
Server: SWS
Vary: Accept-Encoding
Cache-Control: max-age=120
Expires: Mon, 28 Nov 2011 07:31:19 GMT
Last-Modified: Mon, 28 Nov 2011 07:27:30 GMT
Content-Encoding: gzip
FSS-Cache: EXPIRED from 30278270.36897598.41311203
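Note that the Content-Type header above is just text/html with no charset parameter, so for this page the charset cannot be read from the headers. A rough sketch of one way to guess it, under assumptions of my own: try the Content-Type header first, then scan the first couple of kilobytes of the (already decompressed) body for a charset=... declaration in a meta tag, and fall back to a caller-supplied default. CharsetSniffer, fromText and detect are made-up names, and the 2048-byte scan window is an arbitrary choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CharsetSniffer {

    // Matches charset=xxx, charset="xxx" or charset='xxx', ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)", Pattern.CASE_INSENSITIVE);

    // Return the declared charset name found in the text, or null if there is none.
    static String fromText(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Header first, then the start of the body, then the fallback.
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        String name = fromText(contentType);
        if (name == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII markup survives
            // regardless of the real encoding of the page.
            int len = Math.min(body.length, 2048);
            name = fromText(new String(body, 0, len, StandardCharsets.ISO_8859_1));
        }
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name in the page: ignore it and use the fallback.
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] head = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=GBK\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        // The header has no charset here, so the meta tag decides.
        System.out.println(detect("text/html", head, StandardCharsets.UTF_8));
    }
}
```

This is only a heuristic: a page can declare one charset and actually be saved in another, in which case the output will still be garbled.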