Why urllib2 on Python 2.7 automatically closes the connection, and how to work around it

2016-05-15 zhongtang
Category: Python/Ruby

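The previous post covered solutions to common urllib2 problems. This one looks specifically at the short-connection problem in urllib2: the connection is closed after every single request.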

1. urllib2 code

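As the code below shows, the request sets a custom 'Connection': 'keep-alive' header, which asks the server not to close the connection once the exchange is finished, i.e. a persistent (long-lived) connection.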


  # Test 8: use urllib2 to test Connection=keep-alive
  import urllib2

  # debuglevel=1 makes the handlers print the raw request and response headers
  httpHandler = urllib2.HTTPHandler(debuglevel=1)
  httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
  opener = urllib2.build_opener(httpHandler, httpsHandler)
  urllib2.install_opener(opener)

  loginHeaders = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
      'Referer': '',
      'Connection': 'keep-alive'
  }
  request = urllib2.Request('', headers=loginHeaders)
  response = urllib2.urlopen(request)
  page = response.read()
  print response.info()
  print page

Output:
Note the highlighted lines in the log. The other request headers, such as User-Agent, were changed successfully, but Connection is still close:
  Connection: close
  header: Connection: close


  send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nReferer: \r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36\r\n\r\n'
  reply: 'HTTP/1.1 200 OK\r\n'
  header: Connection: close
  header: Transfer-Encoding: chunked
  header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
  header: Date: Sun, 15 May 2016 02:51:33 GMT
  header: Content-Type: text/html; charset=utf-8
  header: Server: nginx/1.2.9
  header: Vary: Accept-Encoding
  header: X-Powered-By: ThinkPHP
  header: Set-Cookie: PHPSESSID=q0pobulph34f8sum6akhpovkg1; path=/
  header: Cache-Control: private
  header: Pragma: no-cache
  Connection: close
  Transfer-Encoding: chunked
  Expires: Thu, 19 Nov 1981 08:52:00 GMT
  Date: Sun, 15 May 2016 02:51:33 GMT
  Content-Type: text/html; charset=utf-8
  Server: nginx/1.2.9
  Vary: Accept-Encoding
  X-Powered-By: ThinkPHP
  Set-Cookie: PHPSESSID=q0pobulph34f8sum6akhpovkg1; path=/
  Cache-Control: private
  Pragma: no-cache
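
Rather than scanning the debug log by eye, the "send:" line can be split into its headers programmatically. The following is only a minimal sketch: parse_request_headers is a hypothetical helper, not part of urllib2, and the sample string is a shortened copy of the request above (the User-Agent value is abbreviated for readability).

```python
def parse_request_headers(raw_request):
    """Split a raw HTTP request into its request line and a header dict.

    Hypothetical helper for illustration; header names are lower-cased
    so that lookups do not depend on the capitalisation used on the wire.
    """
    head = raw_request.split("\r\n\r\n", 1)[0]   # drop any body
    lines = head.split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


# Shortened copy of the "send:" line from the log above.
sent = ("GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\n"
        "Referer: \r\nConnection: close\r\n"
        "User-Agent: Mozilla/5.0 (Windows NT 5.1)\r\n\r\n")

request_line, headers = parse_request_headers(sent)
print(headers["connection"])  # close
```

The value printed is close, even though the script asked for keep-alive.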

2. httplib2 code

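Here is the same request rewritten with the httplib2 library. Switching to httplib2 is one workaround for urllib2's lack of keep-alive support; another is Requests.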


  # Test 8: use httplib2 to test Connection=keep-alive
  import httplib2

  ghttp = httplib2.Http()
  # Module-level switch: print the raw request and response headers
  httplib2.debuglevel = 1
  loginHeaders = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
      'Connection': 'keep-alive'
  }

  response, page = ghttp.request('', headers=loginHeaders)
  print page.decode('utf-8')

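The output shows that this time the request went out with connection: keep-alive and the server answered with Keep-Alive, so the persistent connection was set up successfully:
  header: Connection: Keep-Alive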

  connect: (www.suning.com.cn, 80) ************
  send: 'GET / HTTP/1.1\r\nHost: \r\nconnection: keep-alive\r\naccept-encoding: gzip, deflate\r\nuser-agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36\r\n\r\n'
  reply: 'HTTP/1.1 200 OK\r\n'
  header: Connection: Keep-Alive
  header: Transfer-Encoding: chunked
  header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
  header: Date: Sun, 15 May 2016 02:59:50 GMT
  header: Content-Type: text/html; charset=utf-8
  header: Server: nginx/1.2.9
  header: Vary: Accept-Encoding
  header: X-Powered-By: ThinkPHP
  header: Set-Cookie: PHPSESSID=egs5ef9dja68ti8u72v0hg5066; path=/
  header: Cache-Control: private
  header: Pragma: no-cache
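
A Keep-Alive response header only says the server is willing to keep the connection open. To check that a connection really is reused, one option is to send two requests over a single connection object and compare the client port the server sees. The sketch below is an assumption-laden illustration, not the author's test: it uses the standard-library httplib (http.client on Python 3) instead of httplib2, talks to a throwaway local server, and the names EchoPortHandler and fetch_twice_on_one_connection are hypothetical.

```python
import threading

try:  # Python 2.7, the version used in this article
    import httplib
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # the same modules under their Python 3 names
    import http.client as httplib
    from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    """Replies with the client's source port as the response body."""
    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # An explicit length tells the client where the body ends,
        # so the socket can stay open for the next request.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def fetch_twice_on_one_connection():
    """Send two GETs over one HTTPConnection; return the ports the server saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoPortHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    conn = httplib.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        ports = []
        for _ in range(2):
            conn.request("GET", "/", headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            # The body must be read completely before the connection is reused.
            ports.append(response.read())
        return ports
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


ports = fetch_twice_on_one_connection()
print(ports[0] == ports[1])
```

If both responses report the same source port, both requests travelled over the same TCP connection. A new connection would normally be assigned a different ephemeral port.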

3. Root cause analysis

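The urllib2 source gives the answer: in the core do_open method, the Connection header is hard-coded to close.
The reason is given in the comment just above that line. The addinfourl class is not prepared to deal with a persistent connection: it tries to read all remaining data from the socket, which would block while the server waits for the next request. To prevent this, urllib2 forces Connection to close, whatever the caller asked for.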

    def do_open(self, http_class, req, **http_conn_args):
        ……
        # We want to make an HTTP/1.1 request, but the addinfourl
        # class isn't prepared to deal with a persistent connection.
        # It will try to read all remaining data from the socket,
        # which will block while the server waits for the next request.
        # So make sure the connection gets closed after the (only)
        # request.
        headers["Connection"] = "close"

        headers = dict((name.title(), val) for name, val in headers.items())

        if req._tunnel_host:
            tunnel_headers = {}
            proxy_auth_hdr = "Proxy-Authorization"
            if proxy_auth_hdr in headers:
                tunnel_headers[proxy_auth_hdr] = headers[proxy_auth_hdr]
                # Proxy-Authorization should not be sent to origin
                # server.
                del headers[proxy_auth_hdr]
            h.set_tunnel(req._tunnel_host, headers=tunnel_headers)

        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
        ……
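
The effect of those two header lines can be reproduced in isolation. This is a minimal sketch, assuming only the behaviour visible in the excerpt above; finalize_headers is a hypothetical stand-in for that part of do_open, not a urllib2 function.

```python
def finalize_headers(user_headers):
    """Mimic the header handling shown in the do_open excerpt.

    Hypothetical helper: it copies the caller's headers, overwrites
    Connection unconditionally, then title-cases every header name.
    """
    headers = dict(user_headers)
    headers["Connection"] = "close"  # the caller's value is discarded here
    return dict((name.title(), val) for name, val in headers.items())


wire = finalize_headers({
    "User-agent": "Mozilla/5.0",
    "Connection": "keep-alive",  # what the script in section 1 asked for
})
print(wire["Connection"])  # close
print(sorted(wire))        # ['Connection', 'User-Agent']
```

Because the assignment happens after the caller's headers have been collected, no value passed through Request(headers=...) can survive it. That is why the log in section 1 shows Connection: close.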


