Common urllib2 usage patterns in Python 2.7

2930 reads, 0 comments, 2016-05-15, zhongtang
Category: Python/Ruby

urllib2 is Python's powerful HTTP client library, layered on top of httplib, and it is very widely used.

Newer packages such as requests and httplib2 now exist, but urllib2 still wins on its huge installed base and the wealth of material about it online.

Below is a summary of tips I ran into while learning it.

1. Basic usage: urllib2.Request(url, data, headers)

Here url is the address you want to access, data is the body to submit when making a POST request, and headers are the HTTP request headers.
Note in particular that Request starts with a capital letter.
Here is the constructor of urllib2's Request class from the source; it makes the parameters clear:
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False):

Also worth noting: in HTTP, both POST and GET travel over TCP; the difference is that a POST request carries a body while a GET request does not.
An excerpt from the source:
    def get_method(self):
        if self.has_data():
            return "POST"
        else:
            return "GET"
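This can be seen without any network traffic, since get_method() simply checks for a data payload. A minimal sketch (the URL is a placeholder, and the import fallback is only there so the snippet also runs outside Python 2):

```python
# The method is chosen by whether the Request carries a data payload.
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

# Placeholder URL; no request is actually sent here.
get_req = urllib2.Request('http://example.com/')
post_req = urllib2.Request('http://example.com/', data=b'user=zhongtang')

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```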

By the way, the urllib2.py source file lives in C:\Python27\Lib, where C:\Python27 is my Python 2.7 install directory.



# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import urllib2

request = urllib2.Request('')
response = urllib2.urlopen(request)
headdata = response.info()
print 'HTTP response headers\n==========\n'
print headdata
bodydata = response.read()
print 'HTTP response body\n==========\n'
print bodydata.decode('gbk')
print 'HTTP content type\n==========\n'
print headdata["Content-Type"]


The output is shown below; a few things are worth noting:

1. Content-Type: text/html means the response is an HTML document in text form; if you request an image instead, this becomes e.g. Content-Type: image/png for a PNG image.

2. Content-Length: 227 means the HTML body is 227 bytes long.

3. Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: __bsi=16631850667377372470_00_41_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 15:36:40 GMT; domain= path=/
means the server returned two cookies, BD_NOT_HTTPS and __bsi.
Note the cookie attributes path, domain, and expires:
path is the URL path the cookie applies to, domain is the site it applies to, and expires is its expiry time.
When we get to cookielib later, you will see that it matches the path information automatically. Look up a cookie reference for the details.

4. The header block behaves like a dictionary, so values can be extracted with dict["keyname"].
For example, headdata["Content-Type"] yields "text/html".
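The dict-style lookup can be tried offline by parsing a raw header block with the stdlib email parser, which behaves as a close stand-in for the message object that response.info() returns (the header text below is copied from the output; this only illustrates the lookup, not a live request):

```python
# Offline sketch: header blocks support dict-style access by header name.
from email import message_from_string

raw_headers = (
    "Server: bfe/1.0.8.14\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 227\r\n"
    "\r\n"
)
headdata = message_from_string(raw_headers)
print(headdata["Content-Type"])    # text/html
print(headdata["Content-Length"])  # 227
```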




HTTP response headers
==========

Server: bfe/1.0.8.14
Date: Sat, 14 May 2016 15:36:35 GMT
Content-Type: text/html
Content-Length: 227
Connection: close
Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
X-UA-Compatible: IE=Edge,chrome=1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Pragma: no-cache
Cache-control: no-cache
Accept-Ranges: bytes
Set-Cookie: __bsi=16631850667377372470_00_41_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 15:36:40 GMT; domain=www.baidu.com; path=/

HTTP response body
==========

<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url="></noscript>
</body>
</html>

HTTP content type
==========
text/html


2. Turning on urllib2's debug switch (debuglevel)

#HTTP debugging; the default is 0, which logs nothing
httpHandler = urllib2.HTTPHandler(debuglevel=1)
#HTTPS debugging
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('')
response=urllib2.urlopen(request)


Another excerpt from the urllib2 source:
    class HTTPSHandler(AbstractHTTPHandler):
        def __init__(self, debuglevel=0, context=None):
            AbstractHTTPHandler.__init__(self, debuglevel)
            self._context = context

3. Automatic redirect support

urllib2 follows redirects automatically: when the server responds with a redirect, urllib2 requests the new location on its own and returns the final result.

A comment from the urllib2 source:
The HTTPRedirectHandler automatically deals with HTTP 301, 302, 303 and 307
In other words, when the server returns a 301, 302, 303 or 307 redirect code, urllib2 issues a new request to the target on its own.


#HTTP debugging
httpHandler = urllib2.HTTPHandler(debuglevel=1)
#HTTPS debugging
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('')
response=urllib2.urlopen(request)
print "="*80
print response.geturl()

The debug output is below:
1. First request: a GET for /others/login.action; the site returns a 302 plus the redirect target Location: https://www.hicloud.com:443/wap

2. Second request: a GET for /wap; the site returns a 200 and the page body.
Note that the code only requests login.action; the second request for /wap is triggered by urllib2 automatically.

3. To tell whether a redirect happened, fetch the URL of the last request actually issued with response.geturl() and compare it against the original URL.
An excerpt from the source docstring on this point:
            - info(): return a mimetools.Message object for the headers
            - geturl(): return the original request URL
            - code: HTTP status code
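That comparison can be sketched offline; the stub class below is made up and only stands in for a real urllib2 response object so the check can run without a network (the URLs are taken from the debug output above):

```python
# Detect a redirect by comparing the final URL against the one requested.
class FakeResponse(object):
    """Hypothetical stand-in for a urllib2 response; only geturl() matters here."""
    def __init__(self, final_url):
        self._final_url = final_url

    def geturl(self):
        return self._final_url

def was_redirected(response, original_url):
    # geturl() reports the URL of the last request actually issued.
    return response.geturl() != original_url

orig = 'https://www.hicloud.com/others/login.action'
resp = FakeResponse('https://www.hicloud.com:443/wap')
print(was_redirected(resp, orig))  # True
```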


send: 'CONNECT HTTP/1.0\r\n'
send: '\r\n'
send: 'GET /others/login.action HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 302 Found\r\n'
header: Server: openresty
header: Date: Sat, 14 May 2016 16:36:58 GMT
header: Content-Length: 0
header: Connection: close
header: Set-Cookie: JSESSIONID=776F82D5E017CBBF256D0D0B88FE291E; Path=/; Secure; HttpOnly
header: X-Frame-Options: SAMEORIGIN
header: Location: https://www.hicloud.com:443/wap

send: 'CONNECT HTTP/1.0\r\n'
send: '\r\n'
send: 'GET /wap HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 200 OK\r\n'
header: Server: openresty
header: Date: Sat, 14 May 2016 16:36:59 GMT
header: Content-Type: text/html;charset=UTF-8
header: Content-Length: 2669
header: Connection: close
header: Vary: Accept-Encoding
header: Set-Cookie: JSESSIONID=81E6053875FCFDC781FFEA810AEACB3D; Path=/; Secure; HttpOnly
header: Set-Cookie: lang=en-us; Domain=.hicloud.com; Path=/; Secure; HttpOnly

================================================================================
https:///wap



4. Cookie support

cookielib manages a site's cookies automatically: when you interact with a site several times, the cookies returned by one response are attached automatically to the next request.



#cookie test
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
request = urllib2.Request('')
response = urllib2.urlopen(request)
print '='*80
for item in cookie:
    print item.name, item.value

Notes on the output below:
1. cookielib manages cookies automatically: the next request carries the cookies returned by the previous response. As everyone knows, many sites rely on cookies to decide whether a login succeeded.

2. cookielib matches cookie paths automatically. Suppose two cookies share the same name:
sessionid=1 path=/login
sessionid=2 path=/info
If the next request is for /login/login.action, only sessionid=1 is attached.

3. Given the two points above, once cookielib is installed you hardly need to worry about cookies when visiting sites that check them.
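The path matching in point 2 can be reproduced offline by placing two same-named cookies into a CookieJar by hand and letting the jar decide which one to attach. A sketch under stated assumptions: www.example.com and the make_cookie helper are made up for the illustration, and the import fallback only exists so the snippet also runs outside Python 2:

```python
try:
    import urllib2, cookielib         # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-ins, for illustration
    import http.cookiejar as cookielib

def make_cookie(name, value, path):
    # Hypothetical helper: build a session cookie by hand;
    # most constructor fields are boilerplate.
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='www.example.com', domain_specified=False,
        domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})

jar = cookielib.CookieJar()
jar.set_cookie(make_cookie('sessionid', '1', '/login'))
jar.set_cookie(make_cookie('sessionid', '2', '/info'))

# Request a path under /login: only the /login cookie should be attached.
request = urllib2.Request('http://www.example.com/login/login.action')
jar.add_cookie_header(request)
print(request.get_header('Cookie'))  # sessionid=1
```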


send: '\r\n'
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: bfe/1.0.8.14
header: Date: Sat, 14 May 2016 17:07:12 GMT
header: Content-Type: text/html
header: Content-Length: 227
header: Connection: close
header: Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
header: Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
header: Set-Cookie: BIDUPSID=F993DEECD112909EB7D02E3B0AE81B6F; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
header: Set-Cookie: PSTM=1463245632; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
header: P3P: CP=" OTI DSP COR IVA OUR IND COM "
header: X-UA-Compatible: IE=Edge,chrome=1
header: Pragma: no-cache
header: Cache-control: no-cache
header: Accept-Ranges: bytes
header: Set-Cookie: __bsi=14786677092151074740_00_24_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 17:07:17 GMT; domain=www.baidu.com; path=/
BIDUPSID 32BDDA8658FE6C716195136E37EEBF5E
PSTM 1463245823
__bsi 16825859069001358637_00_28_R_N_1_0303_C02F_N_I_I_0
BD_NOT_HTTPS 1

5. Customizing the User-Agent
Many sites inspect the HTTP request headers, so a custom User-Agent is needed to masquerade as a browser.


import urllib2
import cookielib

loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)
page=''
page= response.read()
print page.decode('utf-8')

The response is below:
1. Referer and User-Agent now carry the custom values, so masquerading as a browser is essentially solved.
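The custom headers can also be inspected before anything is sent: Request stores them (normalizing the name's capitalization internally) and get_header()/has_header() read them back. A small offline sketch, with a placeholder URL and an import fallback only so the snippet also runs outside Python 2:

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

loginHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
    'Referer': 'http://www.example.com/',  # placeholder value
}
request = urllib2.Request('http://www.example.com/', headers=loginHeaders)

# Header names are stored in "Capitalized" form internally, so the
# stored key is 'User-agent' rather than 'User-Agent'.
print(request.get_header('User-agent'))
print(request.has_header('Referer'))  # True
```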


send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nReferer: \r\nConnection: close\r\n\r\nUser-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Connection: close
header: Transfer-Encoding: chunked
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Date: Sat, 14 May 2016 17:49:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Server: nginx/1.2.9
header: Vary: Accept-Encoding
header: X-Powered-By: ThinkPHP
header: Set-Cookie: PHPSESSID=mkefijbt0o4v65oiboqqn803t2; path=/
header: Cache-Control: private
header: Pragma: no-cache
?<!doctype html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

6. Using cookies, debugging and a custom agent together
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
cookie = cookielib.CookieJar() 
cookieHandler=urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(cookieHandler,httpHandler, httpsHandler)
urllib2.install_opener(opener)
loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)

Note the call urllib2.build_opener(cookieHandler, httpHandler, httpsHandler).
These handlers have no required order, and you can even add a proxy handler as well. To see why this is supported, let's look at the source again (if it's hard to follow, don't worry; the nested loop at the end is the key part).

#handlers can essentially be treated as a sequence of arguments
def build_opener(*handlers):

    """Create an opener object from a list of handlers.
    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.
    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, (types.ClassType, type))

    opener = OpenerDirector()
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    for klass in default_classes:
        #iterate over the handlers passed in

        for check in handlers:
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
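The skip logic above can be observed directly: pass build_opener an instance of an HTTPHandler subclass and the default HTTPHandler is left out. A sketch (VerboseHTTPHandler is a made-up subclass name, and the import fallback is only for running outside Python 2):

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

class VerboseHTTPHandler(urllib2.HTTPHandler):
    """Hypothetical customized handler; stands in for any HTTPHandler subclass."""
    pass

# Because VerboseHTTPHandler subclasses HTTPHandler, build_opener adds
# HTTPHandler to its skip set and does not install the default.
opener = urllib2.build_opener(VerboseHTTPHandler(debuglevel=1))
handler_types = [type(h) for h in opener.handlers]

print(VerboseHTTPHandler in handler_types)   # True
print(urllib2.HTTPHandler in handler_types)  # False: the default was skipped
```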




#debug, cookie and a custom agent combined
import urllib2
import cookielib

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie),httpHandler, httpsHandler)
urllib2.install_opener(opener)

loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)
page=''
page= response.read()
print page

7. A detailed walk-through of the urllib2 source

Finally, here is an article explaining the urllib2 source. The author added plenty of comments, but it is still somewhat hard for beginners, especially readers who have not yet learned HTTP well.
http://xw2423.byr.edu.cn/blog/archives/794
 


