
Java HttpClient Usage Notes

Without further ado, here is a summary of the problems I ran into while building a demo:

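1. Use a connection pool
Although HTTP is described as a connectionless protocol, it runs on top of TCP, so the client still has to establish a connection to the server underneath. A program that fetches a large number of pages from the same site should use a connection pool. Otherwise every fetch has to connect to the web site, send the request, read the response and release the connection. That is inefficient, and it also makes it easy to forget to release some resource, which can lead the site to refuse connections (many sites reject a large number of connections from the same IP to guard against DoS attacks).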

A connection pool can be set up as follows:


[java]
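import org.apache.http.HttpHost;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

// Register the default port and socket factory for each scheme
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
cm.setMaxTotal(200);
cm.setDefaultMaxPerRoute(2);
// Raise the per-route limit for the two sites that are fetched heavily
HttpHost googleResearch = new HttpHost("research.google.com", 80);
HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);
cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);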

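SchemeRegistry registers the default port for each protocol. PoolingClientConnectionManager is the pooling connection manager, i.e. the connection pool. setMaxTotal sets the maximum number of connections in the pool, setDefaultMaxPerRoute sets the default maximum number of connections per route (http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum for one particular site.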
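
How the total limit and the per-route limits interact can be seen in a toy model. The sketch below is not how HttpClient implements its pool: `RoutePoolModel`, `tryLease` and `release` are hypothetical names, routes are plain strings, and a lease that would exceed a limit is simply refused, whereas a real pooling manager would normally make the caller wait. It only assumes that a lease must satisfy both the limit of its route and the overall limit.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a pool with a total cap and per-route caps (illustration only). */
final class RoutePoolModel {
    private final int maxTotal;
    private final int defaultMaxPerRoute;
    private final Map<String, Integer> maxPerRoute = new HashMap<>();
    private final Map<String, Integer> leasedPerRoute = new HashMap<>();
    private int leasedTotal = 0;

    RoutePoolModel(int maxTotal, int defaultMaxPerRoute) {
        this.maxTotal = maxTotal;
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    /** Overrides the default limit for one route. */
    synchronized void setMaxPerRoute(String route, int max) {
        maxPerRoute.put(route, max);
    }

    /** Leases a connection if both the route limit and the total limit allow it. */
    synchronized boolean tryLease(String route) {
        int routeLimit = maxPerRoute.getOrDefault(route, defaultMaxPerRoute);
        int routeLeased = leasedPerRoute.getOrDefault(route, 0);
        if (leasedTotal >= maxTotal || routeLeased >= routeLimit) {
            return false;
        }
        leasedPerRoute.put(route, routeLeased + 1);
        leasedTotal++;
        return true;
    }

    /** Returns a previously leased connection to the pool. */
    synchronized void release(String route) {
        int routeLeased = leasedPerRoute.getOrDefault(route, 0);
        if (routeLeased == 0) {
            throw new IllegalStateException("nothing leased for " + route);
        }
        leasedPerRoute.put(route, routeLeased - 1);
        leasedTotal--;
    }

    public static void main(String[] args) {
        RoutePoolModel pool = new RoutePoolModel(200, 2);
        pool.setMaxPerRoute("research.google.com:80", 30);
        // A route without an override falls back to the default limit of 2
        System.out.println(pool.tryLease("example.org:80"));
        System.out.println(pool.tryLease("example.org:80"));
        System.out.println(pool.tryLease("example.org:80"));
    }
}
```

In this model the third lease on `example.org:80` is refused because that route has no override and so uses the default limit of 2, while `research.google.com:80` could hold up to 30 leases as long as the total stays under 200.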

Getting an HTTP client backed by the connection pool is also straightforward:


[java]
DefaultHttpClient client = new DefaultHttpClient(cm); 


2. Set HttpClient parameters

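HttpClient works better with suitable parameters. The defaults can cope with a small amount of fetching, but finding a set of parameters that fits the workload often improves results in a specific situation. Parameters can be set as follows: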


[java]
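import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;
import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpRequest;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.protocol.ExecutionContext;
import org.apache.http.protocol.HttpContext;

DefaultHttpClient client = new DefaultHttpClient(cm);
Integer socketTimeout = 10000;      // ms to wait for data on an open connection
Integer connectionTimeout = 10000;  // ms to wait for a connection to be established
final int retryTime = 3;
client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
// false leaves Nagle's algorithm enabled
client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);
HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler()
{
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
    {
        if (executionCount >= retryTime)
        {
            // Do not retry if over max retry count
            return false;
        }
        if (exception instanceof InterruptedIOException)
        {
            // Timeout
            return false;
        }
        if (exception instanceof UnknownHostException)
        {
            // Unknown host
            return false;
        }
        if (exception instanceof ConnectException)
        {
            // Connection refused
            return false;
        }
        if (exception instanceof SSLException)
        {
            // SSL handshake exception
            return false;
        }
        HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
        // The source listing breaks off after the line above. The rest is an assumed
        // completion modelled on the retry-handler example in the HttpClient 4.x tutorial.
        boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
        // Retry only requests without a body, which are treated as idempotent
        return idempotent;
    }
};
client.setHttpRequestRetryHandler(myRetryHandler);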
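
The retry rules in the handler above depend only on the exception type and the attempt count, so they can be tried out without HttpClient on the classpath. The following is a minimal sketch using JDK exception types only. `RetryPolicy` and `shouldRetry` are hypothetical names, and the `idempotent` flag is an assumption standing in for a check on the request itself; it is not part of the HttpClient API.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Stand-alone version of the retry decision (illustration only). */
final class RetryPolicy {
    private final int maxRetries;

    RetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Decides whether a failed attempt should be repeated.
     * executionCount is the number of attempts made so far.
     */
    boolean shouldRetry(IOException exception, int executionCount, boolean idempotent) {
        if (executionCount >= maxRetries) {
            return false; // retry budget used up
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeouts; SocketTimeoutException is a subclass
        }
        if (exception instanceof UnknownHostException) {
            return false; // the host name does not resolve
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake failure
        }
        // Any other I/O failure: retry only if repeating the request is safe
        return idempotent;
    }

    public static void main(String[] args) {
        RetryPolicy policy = new RetryPolicy(3);
        System.out.println(policy.shouldRetry(new IOException("connection reset"), 1, true));
        System.out.println(policy.shouldRetry(new SocketTimeoutException("read timed out"), 1, true));
    }
}
```

Keeping the decision in a plain method like this makes each rule easy to check in isolation before wiring it into an `HttpRequestRetryHandler`.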
   
