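Java HttpClient Usage Notes
I won't ramble here. This post summarizes some problems I ran into while building a demo:
1. Use a connection pool
Although HTTP is often described as a connectionless protocol, it runs on top of TCP, so the client still has to establish a connection to the server underneath. A program that fetches a large number of pages from the same site should use a connection pool. Otherwise every fetch has to open a connection to the web site, send the request, read the response, and release the connection. That is inefficient, and it also makes it easy to forget to release some resource, which can lead the site to refuse connections (many sites reject a large number of connections from the same IP to guard against DoS attacks).
A connection pool example: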
[java]
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
cm.setMaxTotal(200);
cm.setDefaultMaxPerRoute(2);
HttpHost googleResearch = new HttpHost("research.google.com", 80);
HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);
cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);
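SchemeRegistry registers each protocol scheme together with its default port. PoolingClientConnectionManager is the pooling connection manager, that is, the connection pool. setMaxTotal sets the maximum number of connections in the whole pool, setDefaultMaxPerRoute sets the default maximum number of connections per route (http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum number of connections for one particular site.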
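To make the interplay of the three limits concrete, here is a minimal bookkeeping sketch that uses only the JDK. It is a toy model, not how PoolingClientConnectionManager is implemented: the class `RouteLimitSketch`, its methods, and the use of plain strings as routes are all hypothetical, and `example.org` is just a placeholder host. The real manager also makes a caller wait for a free connection instead of refusing immediately.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of pool limits. Hypothetical class, not part of HttpClient.
class RouteLimitSketch {
    private final int maxTotal;           // plays the role of setMaxTotal
    private final int defaultMaxPerRoute; // plays the role of setDefaultMaxPerRoute
    private final Map<String, Integer> maxPerRoute = new HashMap<>();    // per-route overrides
    private final Map<String, Integer> leasedPerRoute = new HashMap<>(); // connections in use
    private int totalLeased = 0;

    RouteLimitSketch(int maxTotal, int defaultMaxPerRoute) {
        this.maxTotal = maxTotal;
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    // Override the limit for one route, like setMaxPerRoute.
    synchronized void setMaxPerRoute(String route, int max) {
        maxPerRoute.put(route, max);
    }

    // The route-specific limit if one was set, otherwise the default.
    synchronized int limitFor(String route) {
        return maxPerRoute.getOrDefault(route, defaultMaxPerRoute);
    }

    // Grants a connection only if both the total and the route limit allow it.
    synchronized boolean tryLease(String route) {
        int inUse = leasedPerRoute.getOrDefault(route, 0);
        if (totalLeased >= maxTotal || inUse >= limitFor(route)) {
            return false;
        }
        leasedPerRoute.put(route, inUse + 1);
        totalLeased++;
        return true;
    }

    // Returns a connection to the pool.
    synchronized void release(String route) {
        int inUse = leasedPerRoute.getOrDefault(route, 0);
        if (inUse > 0) {
            leasedPerRoute.put(route, inUse - 1);
            totalLeased--;
        }
    }

    public static void main(String[] args) {
        RouteLimitSketch pool = new RouteLimitSketch(200, 2);
        pool.setMaxPerRoute("research.google.com:80", 30);

        int granted = 0;
        for (int i = 0; i < 40; i++) {
            if (pool.tryLease("research.google.com:80")) {
                granted++;
            }
        }
        System.out.println(granted); // prints 30: the per-route override applies

        int other = 0;
        for (int i = 0; i < 3; i++) {
            if (pool.tryLease("example.org:80")) {
                other++;
            }
        }
        System.out.println(other); // prints 2: the default per-route limit applies
    }
}
```

With the values from the example above, a route without its own setting is capped at 2 concurrent connections, while the two named hosts may use 30 and 50, and the pool as a whole never exceeds 200.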
Creating an HTTP client that draws its connections from the pool is just as easy:
[java]
DefaultHttpClient client = new DefaultHttpClient(cm);
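2. Set HttpClient parameters
HttpClient needs suitable parameters to work well. The defaults can cope with a small amount of fetching, but finding a suitable set of parameters often improves the results in a particular situation. An example of setting parameters: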
[java]
DefaultHttpClient client = new DefaultHttpClient(cm);
Integer socketTimeout = 10000;
Integer connectionTimeout = 10000;
final int retryTime = 3;
client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);
HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler()
{
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context)
    {
        if (executionCount >= retryTime)
        {
            // Do not retry if over max retry count
            return false;
        }
        if (exception instanceof InterruptedIOException)
        {
            // Timeout
            return false;
        }
        if (exception instanceof UnknownHostException)
        {
            // Unknown host
            return false;
        }
        if (exception instanceof ConnectException)
        {
            // Connection refused
            return false;
        }
        if (exception instanceof SSLException)
        {
            // SSL handshake exception
            return false;
        }
        HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
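The listing above breaks off after fetching the request from the context, so the handler's final decision is not shown. As a rough illustration of the checks it does show, the following JDK-only sketch restates them as a plain function. The class `RetryPolicySketch`, the method `shouldRetry`, and in particular the `idempotent` flag used for the last step are assumptions made for this example: retrying only idempotent requests is a common choice, but it is not confirmed by the listing.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

// Hypothetical helper that mirrors the visible checks of the retry handler.
class RetryPolicySketch {
    static boolean shouldRetry(IOException exception, int executionCount,
                               int maxRetries, boolean idempotent) {
        if (executionCount >= maxRetries) {
            return false; // retry budget used up
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeout (SocketTimeoutException is a subclass)
        }
        if (exception instanceof UnknownHostException) {
            return false; // host name cannot be resolved
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake failure
        }
        // Assumption: anything else is retried only when repeating the request is safe.
        return idempotent;
    }

    public static void main(String[] args) {
        IOException reset = new SocketException("Connection reset");
        System.out.println(shouldRetry(reset, 1, 3, true));  // true
        System.out.println(shouldRetry(reset, 3, 3, true));  // false
        System.out.println(shouldRetry(new SocketTimeoutException("read timed out"), 1, 3, true)); // false
    }
}
```

Because the count check uses `>=`, a limit of 3 stops retrying once the third attempt has failed, assuming the execution count starts at 1 for the first attempt.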
 