
Java: garbled text when fetching the source of a GBK-encoded web page. Help!

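Hi all, I based this on page-fetching code I found online, but whenever the page is encoded in GBK, the source I get back is garbled. The code is below. Any ideas would be much appreciated.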
import java.io.BufferedReader;  
import java.io.InputStreamReader;
import java.net.URL;
public class WebCon {  
    // Fetch the page source
    public String getWebCon(String pageURL,String encoding) {  
        StringBuffer sb = new StringBuffer();  
        try {  
            URL url = new URL(pageURL);  
            BufferedReader in = new BufferedReader(new InputStreamReader(url  
                    .openStream(),encoding));  
            String line;  
            while ((line = in.readLine()) != null) {  
                sb.append(line).append('\n'); // readLine() strips the line terminator, so restore it
            }  
            in.close();  
        } catch (Exception e) { // Report any errors that arise  
            sb.append(e.toString());  
            System.err.println(e);  
            System.err.println("Usage:   java   HttpClient   <URL>   [<filename>]");  
        }  
        return sb.toString();  
    }  
    public static void main(String[] args) {
        WebCon wc = new WebCon();
        System.out.println(wc.getWebCon("http://roll.sohu.com/20110903/n318214096.shtml", "gbk"));
    }
}
-------------------- Programming Q&A --------------------


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

public class WebCon {
    // Fetch the page source; assumes the response body is gzip-compressed
    public String getWebCon(String pageURL, String encoding) {
        StringBuilder sb = new StringBuilder();
        try {
            URL url = new URL(pageURL);
            URLConnection uc = url.openConnection();
            // Print the Content-Encoding of the response (e.g. gzip)
            System.err.println("-------------- Content-Encoding: " + uc.getContentEncoding() + " --------");
            // Decompress first, then decode through a Reader, so a multi-byte
            // character is never split across two buffer reads
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(uc.getInputStream()), encoding))) {
                char[] buf = new char[1024];
                int num;
                while ((num = in.read(buf, 0, buf.length)) != -1) {
                    sb.append(buf, 0, num); // append only the chars actually read
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WebCon wc = new WebCon();
        System.out.println(wc.getWebCon("http://roll.sohu.com/20110903/n318214096.shtml", "gbk"));
    }
}


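The page body is gzip-compressed, so it has to be decompressed with GZIPInputStream first and only then decoded with the character encoding!
-------------------- Programming Q&A --------------------
OP, have a look at the garbled-text fix described here:
http://hpjianhua.iteye.com/blog/428585
-------------------- Programming Q&A --------------------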
Quoting serbry0033's reply (#1):

import java.io.BufferedInputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;
public class WebCon {
    // Fetch the page source
    public String……


Uh, thanks for the reply, but done this way it can no longer fetch pages encoded in gb2312 or utf-8. Is there a general method that works for pages in all three encodings?
-------------------- Programming Q&A --------------------

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://localhost:8080/test/");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    // Content-Encoding header of the response, e.g. gzip (null if absent)
    System.out.println("entity.getContentEncoding(): " + entity.getContentEncoding());
    // Decodes with the charset from Content-Type; HttpClient 4.x falls back to ISO-8859-1 if none is declared
    System.out.println("EntityUtils.toString(entity): " + EntityUtils.toString(entity));
}

-------------------- Programming Q&A --------------------
Still garbled. That doesn't work.
Quoting huxiweng's reply (#4):

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://localhost:8080/test/");
HttpResponse response = httpclient.execute(httpget);……
-------------------- Programming Q&A --------------------
I've run into the same problem. Does anyone have a solution?
-------------------- Programming Q&A --------------------
URL url = new URL("http://roll.sohu.com/20110903/n318214096.shtml");
URLConnection connection = url.openConnection();
// The response is gzip-compressed (see the headers below), so gunzip first, then decode as GBK
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(connection.getInputStream()), "GBK"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        System.out.println(line);
    }
}

$ curl -I http://roll.sohu.com/20110903/n318214096.shtml
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 20308
Connection: keep-alive
Date: Mon, 28 Nov 2011 07:29:19 GMT
Server: SWS
Vary: Accept-Encoding
Cache-Control: max-age=120
Expires: Mon, 28 Nov 2011 07:31:19 GMT
Last-Modified: Mon, 28 Nov 2011 07:27:30 GMT
Content-Encoding: gzip
FSS-Cache: EXPIRED from 30278270.36897598.41311203
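
On the earlier question about one method that copes with gbk, gb2312 and utf-8 pages: the headers above show Content-Encoding: gzip and a Content-Type with no charset, so a general fetcher probably has to decide both things per response. Below is a rough sketch of one way to do that, not a tested solution for every site. The class PageDecoder and its methods readAll, findCharset, decode and fetch are names made up for this example. It assumes the charset appears either in the Content-Type header or in a meta tag within the first 2048 bytes, and it uses a caller-supplied fallback otherwise. The regex-based sniffing is deliberately naive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not from the thread): decodes a page whether or not it is gzipped.
class PageDecoder {
    // Matches "charset=gbk" in a Content-Type value or in an HTML meta tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Read a stream to the end and return all of its bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // write only the bytes actually read
        }
        return out.toByteArray();
    }

    // Return the charset named in the text, or null if none is found or it is unsupported.
    static Charset findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) { // illegal or unsupported charset name
            return null;
        }
    }

    static String decode(byte[] body, String contentEncoding, String contentType,
                         Charset fallback) throws IOException {
        // Step 1: gunzip only when the server declared gzip.
        if (contentEncoding != null && contentEncoding.toLowerCase().contains("gzip")) {
            try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(body))) {
                body = readAll(gz);
            }
        }
        // Step 2: choose a charset: header first, then a meta declaration, then the fallback.
        Charset cs = findCharset(contentType);
        if (cs == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII markup stays readable.
            int len = Math.min(body.length, 2048);
            cs = findCharset(new String(body, 0, len, StandardCharsets.ISO_8859_1));
        }
        if (cs == null) {
            cs = fallback;
        }
        // Step 3: decode the complete byte array in one go.
        return new String(body, cs);
    }

    static String fetch(String pageUrl, Charset fallback) throws IOException {
        URLConnection uc = new URL(pageUrl).openConnection();
        try (InputStream in = uc.getInputStream()) {
            return decode(readAll(in), uc.getContentEncoding(), uc.getContentType(), fallback);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PageDecoder <url>");
            return;
        }
        System.out.println(fetch(args[0], Charset.forName("GBK")));
    }
}
```

Reading the whole body into a byte array before decoding keeps the three decisions (gunzip or not, which charset, decode) separate, at the cost of holding the page in memory. For very large responses a streaming Reader would be the better fit.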