Java HttpClient Usage Notes

I'll skip the preamble. These are notes on some problems I ran into while building a demo:

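1. Use a connection pool
Although HTTP is often described as a connectionless protocol, it runs over TCP, so the client still has to establish a connection to the server underneath. A program that fetches a large number of pages from the same site should use a connection pool. Otherwise every fetch has to open a connection to the web site, send the request, read the response, and release the connection. That is inefficient, and it is also easy to forget to release some resource, which can lead the site to refuse connections (many sites reject a large number of connections from the same IP to guard against DoS attacks).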

A connection pool can be set up as follows:

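[java]
import org.apache.http.HttpHost;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
cm.setMaxTotal(200);            // upper bound for the whole pool
cm.setDefaultMaxPerRoute(2);    // default upper bound for each route
HttpHost googleResearch = new HttpHost("research.google.com", 80);
HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);   // per-site overrides
cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);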

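SchemeRegistry registers the default port for each protocol scheme. PoolingClientConnectionManager is the pooling connection manager, i.e. the connection pool. setMaxTotal sets the maximum number of connections in the pool, setDefaultMaxPerRoute sets the default maximum number of connections per route (http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum for one specific site individually.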
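
To make the interaction between the total limit and the per-route limits concrete, here is a toy model built only on JDK semaphores. It is a sketch for illustration, not how HttpClient is implemented: the class RouteLimiter and its methods are hypothetical names, routes are reduced to plain strings, and a failed lease returns false immediately, whereas a real pool would normally make the caller wait for a free connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Toy model of pool limits; hypothetical class, not part of HttpClient. */
class RouteLimiter {
    private final Semaphore total;          // permits for the whole pool
    private final int defaultMaxPerRoute;   // limit for routes without an override
    private final Map<String, Semaphore> perRoute = new ConcurrentHashMap<>();

    RouteLimiter(int maxTotal, int defaultMaxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    /** Overrides the limit for one route; call before the route is first used. */
    void setMaxPerRoute(String route, int max) {
        perRoute.put(route, new Semaphore(max));
    }

    /** Tries to lease one slot; fails if the route limit or the total limit is exhausted. */
    boolean tryLease(String route) {
        Semaphore routeSlots =
                perRoute.computeIfAbsent(route, r -> new Semaphore(defaultMaxPerRoute));
        if (!routeSlots.tryAcquire()) {
            return false;                   // this route is already at its limit
        }
        if (!total.tryAcquire()) {
            routeSlots.release();           // roll back the per-route permit
            return false;                   // the pool as a whole is full
        }
        return true;
    }

    /** Returns a slot previously obtained with tryLease for the same route. */
    void release(String route) {
        Semaphore routeSlots = perRoute.get(route);
        if (routeSlots != null) {
            routeSlots.release();
            total.release();
        }
    }
}
```

With the numbers from the example above, new RouteLimiter(200, 2) followed by setMaxPerRoute("research.google.com", 30) would allow 30 concurrent leases for that host, 2 for any host without an override, and never more than 200 overall.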

Getting an HTTP client backed by the connection pool is also easy:

[java]
DefaultHttpClient client = new DefaultHttpClient(cm);

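2. Set HttpClient parameters

HttpClient works better when its parameters are tuned. The defaults can cope with a small amount of fetching, but a well-chosen set of parameters often improves results for a particular workload. Parameters can be set as follows: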

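[java]
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;
import org.apache.http.HttpRequest;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.protocol.ExecutionContext;
import org.apache.http.protocol.HttpContext;

DefaultHttpClient client = new DefaultHttpClient(cm);
Integer socketTimeout = 10000;       // read timeout in milliseconds
Integer connectionTimeout = 10000;   // connect timeout in milliseconds
final int retryTime = 3;
client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
// false leaves Nagle's algorithm enabled
client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
// socket buffer size in bytes (1 MiB)
client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);
HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler()
{
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
    {
        if (executionCount >= retryTime)
        {
            // Do not retry if over max retry count
            return false;
        }
        if (exception instanceof InterruptedIOException)
        {
            // Timeout
            return false;
        }
        if (exception instanceof UnknownHostException)
        {
            // Unknown host
            return false;
        }
        if (exception instanceof ConnectException)
        {
            // Connection refused
            return false;
        }
        if (exception instanceof SSLException)
        {
            // SSL handshake exception
            return false;
        }
        HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);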
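
The retry checks in the listing above can be tried out without HttpClient on the classpath. The following is a minimal sketch that uses only JDK exception types; RetryPolicy and shouldRetry are hypothetical names, not HttpClient API. The listing stops right after it looks up the request, so what it does with the request is not shown. The idempotent flag below is my assumption about a reasonable final check (retry only requests that are safe to repeat), not the author's code.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Hypothetical helper that mirrors the exception checks using only JDK types. */
class RetryPolicy {
    private final int maxAttempts;

    RetryPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Decides whether a failed request should be sent again.
     * The idempotent parameter is an assumption of this sketch.
     */
    boolean shouldRetry(IOException exception, int executionCount, boolean idempotent) {
        if (executionCount >= maxAttempts) {
            return false; // over the retry budget
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeout (SocketTimeoutException is a subclass)
        }
        if (exception instanceof UnknownHostException) {
            return false; // host name could not be resolved
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake problem
        }
        // Any other I/O failure: retry only if repeating the request is safe.
        return idempotent;
    }
}
```

For example, new RetryPolicy(3).shouldRetry(new IOException("connection reset"), 1, true) allows a retry, while the same call with a SocketTimeoutException or with executionCount 3 does not.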
