Common urllib2 usage patterns in Python 2.7

2930 reads, 0 comments, 2016-05-15, zhongtang
Category: Python/Ruby

urllib2 is Python's powerful HTTP client library, layered on top of httplib, and it is very widely used.

Newer packages such as requests and httplib2 now exist, but urllib2 still wins on its huge installed base and the wealth of material about it online.

Below is a summary of tips I ran into while learning it.

1. Basic usage: urllib2.Request(url, data, headers)

Here url is the address you want to access, data is the body to submit when making a POST request, and headers are the HTTP request headers.
Note in particular that Request starts with a capital letter.
Here is the constructor of urllib2's Request class from the source; it makes the parameters clear:
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False):

Also worth noting: in HTTP, both POST and GET travel over TCP; the difference is that a POST request carries a body while a GET request does not.
An excerpt from the source:
    def get_method(self):
        if self.has_data():
            return "POST"
        else:
            return "GET"
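This can be seen without any network traffic, since get_method() simply checks for a data payload. A minimal sketch (the URL is a placeholder, and the import fallback is only there so the snippet also runs outside Python 2):

```python
# The method is chosen by whether the Request carries a data payload.
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

# Placeholder URL; no request is actually sent here.
get_req = urllib2.Request('http://example.com/')
post_req = urllib2.Request('http://example.com/', data=b'user=zhongtang')

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```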

By the way, the urllib2.py source file lives in C:\Python27\Lib, where C:\Python27 is my Python 2.7 install directory.



# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import urllib2

request = urllib2.Request('')
response = urllib2.urlopen(request)
headdata = response.info()
print 'HTTP response headers\n==========\n'
print headdata
bodydata = response.read()
print 'HTTP response body\n==========\n'
print bodydata.decode('gbk')
print 'HTTP content type\n==========\n'
print headdata["Content-Type"]


The output is shown below; a few things are worth noting:

1. Content-Type: text/html means the response is an HTML document in text form; if you request an image instead, this becomes e.g. Content-Type: image/png for a PNG image.

2. Content-Length: 227 means the HTML body is 227 bytes long.

3. Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: __bsi=16631850667377372470_00_41_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 15:36:40 GMT; domain= path=/
means the server returned two cookies, BD_NOT_HTTPS and __bsi.
Note the cookie attributes path, domain, and expires:
path is the URL path the cookie applies to, domain is the site it applies to, and expires is its expiry time.
When we get to cookielib later, you will see that it matches the path information automatically. Look up a cookie reference for the details.

4. The header block behaves like a dictionary, so values can be extracted with dict["keyname"].
For example, headdata["Content-Type"] yields "text/html".
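The dict-style lookup can be tried offline by parsing a raw header block with the stdlib email parser, which behaves as a close stand-in for the message object that response.info() returns (the header text below is copied from the output; this only illustrates the lookup, not a live request):

```python
# Offline sketch: header blocks support dict-style access by header name.
from email import message_from_string

raw_headers = (
    "Server: bfe/1.0.8.14\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 227\r\n"
    "\r\n"
)
headdata = message_from_string(raw_headers)
print(headdata["Content-Type"])    # text/html
print(headdata["Content-Length"])  # 227
```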




HTTP response headers
==========

Server: bfe/1.0.8.14
Date: Sat, 14 May 2016 15:36:35 GMT
Content-Type: text/html
Content-Length: 227
Connection: close
Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
X-UA-Compatible: IE=Edge,chrome=1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Pragma: no-cache
Cache-control: no-cache
Accept-Ranges: bytes
Set-Cookie: __bsi=16631850667377372470_00_41_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 15:36:40 GMT; domain=www.baidu.com; path=/

HTTP response body
==========

<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url="></noscript>
</body>
</html>

HTTP content type
==========
text/html


2. Turning on urllib2's debug switch (debuglevel)

#HTTP debugging; the default is 0, which logs nothing
httpHandler = urllib2.HTTPHandler(debuglevel=1)
#HTTPS debugging
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('')
response=urllib2.urlopen(request)


Another excerpt from the urllib2 source:
    class HTTPSHandler(AbstractHTTPHandler):
        def __init__(self, debuglevel=0, context=None):
            AbstractHTTPHandler.__init__(self, debuglevel)
            self._context = context

3. Automatic redirect support

urllib2 follows redirects automatically: when the server responds with a redirect, urllib2 requests the new location on its own and returns the final result.

A comment from the urllib2 source:
The HTTPRedirectHandler automatically deals with HTTP 301, 302, 303 and 307
In other words, when the server returns a 301, 302, 303 or 307 redirect code, urllib2 issues a new request to the target on its own.


#HTTP debugging
httpHandler = urllib2.HTTPHandler(debuglevel=1)
#HTTPS debugging
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('')
response=urllib2.urlopen(request)
print "="*80
print response.geturl()

The debug output is below:
1. First request: a GET for /others/login.action; the site returns a 302 plus the redirect target Location: https://www.hicloud.com:443/wap

2. Second request: a GET for /wap; the site returns a 200 and the page body.
Note that the code only requests login.action; the second request for /wap is triggered by urllib2 automatically.

3. To tell whether a redirect happened, fetch the URL of the last request actually issued with response.geturl() and compare it against the original URL.
An excerpt from the source docstring on this point:
            - info(): return a mimetools.Message object for the headers
            - geturl(): return the original request URL
            - code: HTTP status code
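That comparison can be sketched offline; the stub class below is made up and only stands in for a real urllib2 response object so the check can run without a network (the URLs are taken from the debug output above):

```python
# Detect a redirect by comparing the final URL against the one requested.
class FakeResponse(object):
    """Hypothetical stand-in for a urllib2 response; only geturl() matters here."""
    def __init__(self, final_url):
        self._final_url = final_url

    def geturl(self):
        return self._final_url

def was_redirected(response, original_url):
    # geturl() reports the URL of the last request actually issued.
    return response.geturl() != original_url

orig = 'https://www.hicloud.com/others/login.action'
resp = FakeResponse('https://www.hicloud.com:443/wap')
print(was_redirected(resp, orig))  # True
```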


send: 'CONNECT HTTP/1.0\r\n'
send: '\r\n'
send: 'GET /others/login.action HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 302 Found\r\n'
header: Server: openresty
header: Date: Sat, 14 May 2016 16:36:58 GMT
header: Content-Length: 0
header: Connection: close
header: Set-Cookie: JSESSIONID=776F82D5E017CBBF256D0D0B88FE291E; Path=/; Secure; HttpOnly
header: X-Frame-Options: SAMEORIGIN
header: Location: https://www.hicloud.com:443/wap

send: 'CONNECT HTTP/1.0\r\n'
send: '\r\n'
send: 'GET /wap HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'

reply: 'HTTP/1.1 200 OK\r\n'
header: Server: openresty
header: Date: Sat, 14 May 2016 16:36:59 GMT
header: Content-Type: text/html;charset=UTF-8
header: Content-Length: 2669
header: Connection: close
header: Vary: Accept-Encoding
header: Set-Cookie: JSESSIONID=81E6053875FCFDC781FFEA810AEACB3D; Path=/; Secure; HttpOnly
header: Set-Cookie: lang=en-us; Domain=.hicloud.com; Path=/; Secure; HttpOnly

================================================================================
https:///wap



4. Cookie support

cookielib manages a site's cookies automatically: when you interact with a site several times, the cookies returned by one response are attached automatically to the next request.



#cookie test
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
request = urllib2.Request('')
response = urllib2.urlopen(request)
print '='*80
for item in cookie:
    print item.name, item.value

Notes on the output below:
1. cookielib manages cookies automatically: the next request carries the cookies returned by the previous response. As everyone knows, many sites rely on cookies to decide whether a login succeeded.

2. cookielib matches cookie paths automatically. Suppose two cookies share the same name:
sessionid=1 path=/login
sessionid=2 path=/info
If the next request is for /login/login.action, only sessionid=1 is attached.

3. Given the two points above, once cookielib is installed you hardly need to worry about cookies when visiting sites that check them.
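The path matching in point 2 can be reproduced offline by placing two same-named cookies into a CookieJar by hand and letting the jar decide which one to attach. A sketch under stated assumptions: www.example.com and the make_cookie helper are made up for the illustration, and the import fallback only exists so the snippet also runs outside Python 2:

```python
try:
    import urllib2, cookielib         # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-ins, for illustration
    import http.cookiejar as cookielib

def make_cookie(name, value, path):
    # Hypothetical helper: build a session cookie by hand;
    # most constructor fields are boilerplate.
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='www.example.com', domain_specified=False,
        domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})

jar = cookielib.CookieJar()
jar.set_cookie(make_cookie('sessionid', '1', '/login'))
jar.set_cookie(make_cookie('sessionid', '2', '/info'))

# Request a path under /login: only the /login cookie should be attached.
request = urllib2.Request('http://www.example.com/login/login.action')
jar.add_cookie_header(request)
print(request.get_header('Cookie'))  # sessionid=1
```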


send: '\r\n'
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: bfe/1.0.8.14
header: Date: Sat, 14 May 2016 17:07:12 GMT
header: Content-Type: text/html
header: Content-Length: 227
header: Connection: close
header: Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
header: Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
header: Set-Cookie: BIDUPSID=F993DEECD112909EB7D02E3B0AE81B6F; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
header: Set-Cookie: PSTM=1463245632; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
header: P3P: CP=" OTI DSP COR IVA OUR IND COM "
header: X-UA-Compatible: IE=Edge,chrome=1
header: Pragma: no-cache
header: Cache-control: no-cache
header: Accept-Ranges: bytes
header: Set-Cookie: __bsi=14786677092151074740_00_24_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 17:07:17 GMT; domain=www.baidu.com; path=/
BIDUPSID 32BDDA8658FE6C716195136E37EEBF5E
PSTM 1463245823
__bsi 16825859069001358637_00_28_R_N_1_0303_C02F_N_I_I_0
BD_NOT_HTTPS 1

5. Customizing the User-Agent
Many sites inspect the HTTP request headers, so a custom User-Agent is needed to masquerade as a browser.


import urllib2
import cookielib

loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)
page=''
page= response.read()
print page.decode('utf-8')

The response is below:
1. Referer and User-Agent now carry the custom values, so masquerading as a browser is essentially solved.
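The custom headers can also be inspected before anything is sent: Request stores them (normalizing the name's capitalization internally) and get_header()/has_header() read them back. A small offline sketch, with a placeholder URL and an import fallback only so the snippet also runs outside Python 2:

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

loginHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
    'Referer': 'http://www.example.com/',  # placeholder value
}
request = urllib2.Request('http://www.example.com/', headers=loginHeaders)

# Header names are stored in "Capitalized" form internally, so the
# stored key is 'User-agent' rather than 'User-Agent'.
print(request.get_header('User-agent'))
print(request.has_header('Referer'))  # True
```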


send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: \r\nReferer: \r\nConnection: close\r\n\r\nUser-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Connection: close
header: Transfer-Encoding: chunked
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Date: Sat, 14 May 2016 17:49:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Server: nginx/1.2.9
header: Vary: Accept-Encoding
header: X-Powered-By: ThinkPHP
header: Set-Cookie: PHPSESSID=mkefijbt0o4v65oiboqqn803t2; path=/
header: Cache-Control: private
header: Pragma: no-cache
?<!doctype html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

6. Using cookies, debugging and a custom agent together
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
cookie = cookielib.CookieJar() 
cookieHandler=urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(cookieHandler,httpHandler, httpsHandler)
urllib2.install_opener(opener)
loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)

Note the call urllib2.build_opener(cookieHandler, httpHandler, httpsHandler).
These handlers have no required order, and you can even add a proxy handler as well. To see why this is supported, let's look at the source again (if it's hard to follow, don't worry; the nested loop at the end is the key part).

#handlers can essentially be treated as a sequence of arguments
def build_opener(*handlers):

    """Create an opener object from a list of handlers.
    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.
    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, (types.ClassType, type))

    opener = OpenerDirector()
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    for klass in default_classes:
        #iterate over the handlers passed in

        for check in handlers:
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
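The skip logic above can be observed directly: pass build_opener an instance of an HTTPHandler subclass and the default HTTPHandler is left out. A sketch (VerboseHTTPHandler is a made-up subclass name, and the import fallback is only for running outside Python 2):

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 stand-in, for illustration

class VerboseHTTPHandler(urllib2.HTTPHandler):
    """Hypothetical customized handler; stands in for any HTTPHandler subclass."""
    pass

# Because VerboseHTTPHandler subclasses HTTPHandler, build_opener adds
# HTTPHandler to its skip set and does not install the default.
opener = urllib2.build_opener(VerboseHTTPHandler(debuglevel=1))
handler_types = [type(h) for h in opener.handlers]

print(VerboseHTTPHandler in handler_types)   # True
print(urllib2.HTTPHandler in handler_types)  # False: the default was skipped
```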




#debug, cookie and a custom agent combined
import urllib2
import cookielib

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie),httpHandler, httpsHandler)
urllib2.install_opener(opener)

loginHeaders={
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
'Referer': ''
}
request=urllib2.Request('',headers=loginHeaders)
response = urllib2.urlopen(request)
page=''
page= response.read()
print page

7. A detailed walk-through of the urllib2 source

Finally, here is an article explaining the urllib2 source. The author added plenty of comments, but it is still somewhat hard for beginners, especially readers who have not yet learned HTTP well.
http://xw2423.byr.edu.cn/blog/archives/794
 


