Using urllib2 in Python

Date: 2010-09-05 Source: Done

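An HTTP exchange is a simple round trip: a request goes out and a response comes back. Python's urllib2 makes it easy to send the request:
* urllib2.urlopen(url)
url is a complete URL string
* urllib2.urlopen(request)
request is an instance of urllib2.Request

Either call sends the HTTP request.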

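Most websites today take some precautions against automated scripts, for example by expecting a cookie in the request headers, or by expecting a string of key=value pairs in the data sent with a POST request.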

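I. Handling request headers
In Python, the headers are represented as a dict, for example:
{'cookie': '111111111111111', 'Accept-Encoding': 'gzip,deflate'}
Usage:
request = urllib2.Request(url, headers=headers) # headers is a dict; pass it by keyword, because the second positional argument is data
retval = urllib2.urlopen(request) # the request is sent here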
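
The header handling above can be sketched without touching the network, by building the Request and inspecting it before it is sent. This is only a minimal sketch: the helper name build_request and the example.com URL are assumptions made for illustration, and the import fallback is there so the same code also runs on Python 3, where these classes live in urllib.request.

```python
try:
    import urllib2  # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same classes


def build_request(url, headers):
    """Hypothetical helper: attach a dict of headers to a Request."""
    # Pass headers by keyword; the second positional argument is data,
    # and supplying data would turn the request into a POST.
    return urllib2.Request(url, headers=headers)


headers = {'Cookie': '111111111111111', 'Accept-Encoding': 'gzip,deflate'}
request = build_request('http://www.example.com/', headers)  # placeholder URL

# Request normalizes header names with str.capitalize(),
# so 'Accept-Encoding' is stored as 'Accept-encoding'.
print(request.header_items())
print(request.get_method())
```

Nothing is sent until request is passed to urllib2.urlopen, so this is a convenient place to check what a script is about to transmit.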

II. Handling POST data
In Python, the POST data is represented as a str, for example:
'person=jessinio&gender=male'

There are two ways to send it:
1.
retval = urllib2.urlopen(url='http://www.google.com', data='person=jessinio&gender=male') # passing data makes this a POST request

2.
request = urllib2.Request(url, data='person=jessinio&gender=male') # attach the data string to the Request instance
retval = urllib2.urlopen(request) # the request is sent here
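
Writing the key=value string by hand works for simple values, but values containing spaces or '&' need escaping. A minimal sketch of building the same body with urlencode follows; the helper name build_post_request and the example.com URL are assumptions for illustration, and the bytes conversion is only needed on Python 3, where Request expects bytes as data.

```python
try:
    import urllib2  # Python 2, as used in this post
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalents
    from urllib.parse import urlencode


def build_post_request(url, fields):
    """Hypothetical helper: encode (name, value) pairs into a POST body."""
    body = urlencode(fields)  # escapes values and joins pairs with '&'
    if not isinstance(body, bytes):
        body = body.encode('ascii')  # Python 3 needs bytes for data
    return urllib2.Request(url, data=body)


fields = [('person', 'jessinio'), ('gender', 'male')]
request = build_post_request('http://www.example.com/', fields)  # placeholder URL

print(request.data)
print(request.get_method())  # supplying data switches the method to POST
```

A list of pairs is used rather than a dict so the order of the fields in the body is predictable.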

* Once you know which Python types the headers (dict) and the POST data (str) map to, urllib2 is easy to use.

The request has been sent, which raises the next question: what exactly comes back?

Everyone knows the answer is a stream of bytes~~~(-_-)

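One problem comes up often in practice: you request an HTML file, but what comes back is not plain text, for example because the body is gzip-compressed. In that case it needs one more processing step:
    import gzip
    import StringIO

    # header lookup is case-insensitive; only gunzip when the server says gzip
    if retval.headers.get('content-encoding', '').lower() == 'gzip':
        # wrap the compressed bytes in a file-like object for GzipFile
        fileobj = StringIO.StringIO(retval.read())
        gzip_file = gzip.GzipFile(fileobj=fileobj)
        context = gzip_file.read()
    else:
        context = retval.read()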
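
The decompression step can also be pulled into a small function and checked offline against locally compressed data. This is a sketch under assumptions: the name decode_body is hypothetical, and io.BytesIO is used in place of StringIO so the code runs on both Python 2.7 and Python 3.

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Hypothetical helper: gunzip raw bytes when the header says gzip."""
    if (content_encoding or '').strip().lower() == 'gzip':
        # GzipFile needs a file-like object, so wrap the bytes in BytesIO.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw  # no (or another) encoding: hand the bytes back untouched


# Build a gzip-compressed body locally, standing in for a server response.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'<html>hello</html>')
compressed = buf.getvalue()

print(decode_body(compressed, 'gzip'))
print(decode_body(b'plain text', None))
```

With a real response it would be called as decode_body(retval.read(), retval.headers.get('Content-Encoding')).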

That makes it easy to get at the text data.