Learning Python 3: urllib usage examples

2023-09-20 07:22:08

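urllib is Python's standard-library package for fetching URLs (Uniform Resource Locators). It can be used to download remote data and save it. This article collects notes on handling headers, proxies, timeouts, authentication, and exceptions when using urllib.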

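1. Basic usage

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the address to open
data: the data to submit with a POST request (must be bytes)
timeout: the timeout for the request, in seconds

Note: cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; pass an ssl.SSLContext through context instead.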
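As a rough illustration of the data parameter, the sketch below builds a POST request without sending it, so it runs offline. The example.com URL and the form fields are placeholders chosen for this example and are not part of the original article.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
form = {"q": "urllib", "page": 1}

# data must be bytes: urlencode() builds the query string, encode() converts it.
payload = parse.urlencode(form).encode("utf-8")

# Supplying data turns the request into a POST; without it the request is a GET.
post_req = request.Request("http://example.com/search", data=payload)
get_req = request.Request("http://example.com/search")

print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
print(payload)                # b'q=urllib&page=1'

# Sending it would look like this (needs network access, so it is commented out):
# with request.urlopen(post_req, timeout=5) as response:
#     page = response.read().decode("utf-8")
```

Passing timeout on each call, as in the commented lines, limits how long that single request may block.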

Fetch the page directly with urlopen() from the urllib.request module. The page data comes back as bytes and must be decoded with decode() to get a str.

from urllib import request

response = request.urlopen(r'http://python.org/')  # an HTTPResponse object
page = response.read()
page = page.decode('utf-8')

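The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
geturl(): returns the URL that was retrieved (the final URL if a redirect was followed)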
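To try these methods without depending on an external site, the following sketch starts a throwaway local server and inspects the response. The HelloHandler class and the page it serves are invented for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short HTML page."""

    def do_GET(self):
        body = "<h1>hello</h1>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler bypasses any system proxy for this local request.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.getcode()                         # HTTP status code
    final_url = response.geturl()                       # URL that was retrieved
    content_type = response.info().get("Content-Type")  # one response header
    page = response.read().decode("utf-8")              # bytes -> str

server.shutdown()
server.server_close()

print(status, final_url, content_type)
print(page)
```

Using the response as a context manager closes the connection automatically, which replaces an explicit close() call.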

1. Reading a web page

import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()

2. Using Request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Wrap the request in a Request object, then pass it to urlopen() to fetch the page.

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
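A Request object can also carry custom headers and an explicit HTTP method. The sketch below only constructs the object and inspects it, so no network access is needed; the User-Agent value is a made-up placeholder.

```python
import urllib.request

req = urllib.request.Request(
    "http://python.org/",
    headers={"User-Agent": "Mozilla/5.0 (example)"},  # hypothetical value
    method="HEAD",
)

print(req.get_method())              # HEAD
print(req.full_url)                  # http://python.org/
print(req.host)                      # python.org
# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))  # Mozilla/5.0 (example)
```

Passing this req to urllib.request.urlopen() would then send a HEAD request with that header.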

3. Sending data, using a Zhihu login as an example

'''
Created on May 31, 2016

@author: gionee
'''
import gzip
import re
import urllib.request
import urllib.parse
import http.cookiejar

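def ungzip(data):
    """Decompress gzip data; return the input unchanged if it is not gzip."""
    try:
        print("Trying to decompress...")
        data = gzip.decompress(data)
        print("Decompression finished")
    except OSError:
        # gzip.decompress raises OSError when the data is not gzip-compressed
        print("Data is not compressed, nothing to decompress")

    return data

def getXSRF(data):
    # Extract the value of the hidden _xsrf form field from the page HTML
    cer = re.compile('name=\"_xsrf\" value=\"(.*?)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

def getOpener(head):
    # Cookie handling: the opener keeps cookies between requests
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener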
# Header values can be captured with Firebug
header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html,application/xhtml+xml,*/*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0',
    'Accept-Encoding': 'gzip,deflate',
    'Host': 'www.zhihu.com',
    'DNT': '1'
}

# Fetch the home page first to obtain the _xsrf token
url = 'http://www.zhihu.com/'
opener = getOpener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)
_xsrf = getXSRF(data.decode())

# Build the login form and POST it with the same opener (keeps the cookies)
url += "login/email"
email = "your login account"
password = "your login password"
postDict = {
    '_xsrf': _xsrf,
    'email': email,
    'password': password,
    'rememberme': 'y'
}
postData = urllib.parse.urlencode(postDict).encode()
op = opener.open(url, postData)
data = op.read()
data = ungzip(data)

print(data.decode())
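The individual steps of the login example (decompressing the body, extracting the token, encoding the form) can be checked offline. The sketch below is a simplified variant under stated assumptions: the sample HTML, the token value, and the email address are all made up, and the helper names ungzip and get_xsrf are illustrative rather than the article's exact functions.

```python
import gzip
import re
import urllib.parse


def ungzip(data):
    """Return decompressed bytes, or the input unchanged if it is not gzip."""
    try:
        return gzip.decompress(data)
    except OSError:  # raised when the data is not in gzip format
        return data


def get_xsrf(html):
    """Return the value of the hidden _xsrf field, or None if it is absent."""
    match = re.search(r'name="_xsrf"\s+value="([^"]*)"', html)
    return match.group(1) if match else None


# Made-up page fragment standing in for the real login page.
sample_html = '<input type="hidden" name="_xsrf" value="abc123"/>'
compressed = gzip.compress(sample_html.encode("utf-8"))

# Both compressed and plain bodies end up as the same text.
from_gzip = ungzip(compressed).decode("utf-8")
from_plain = ungzip(sample_html.encode("utf-8")).decode("utf-8")

token = get_xsrf(from_gzip)

# Encode the form the same way the login request does.
post_data = urllib.parse.urlencode(
    {"_xsrf": token, "email": "user@example.com"}
).encode("utf-8")

print(token)      # abc123
print(post_data)  # b'_xsrf=abc123&email=user%40example.com'
```

Returning None when the field is missing avoids the IndexError that indexing an empty findall() result would raise.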

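4. HTTP errors

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.lz881228.blog.163.com')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # e.code is the HTTP status code; the error body can be read like a response
    print(e.code)
    print(e.read().decode("utf8"))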
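To see HTTPError in action without relying on a remote site, the sketch below serves a 404 from a local throwaway server. NotFoundHandler and its response text are assumptions made for this example.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a 404."""

    def do_GET(self):
        body = b"no such page"
        self.send_response(404)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/missing" % server.server_port

# An empty ProxyHandler bypasses any system proxy for this local request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
code = None
error_body = None
try:
    opener.open(url, timeout=5)
except urllib.error.HTTPError as e:
    code = e.code                         # numeric status code
    error_body = e.read().decode("utf-8")  # body of the error page
finally:
    server.shutdown()
    server.server_close()

print(code, error_body)
```

The HTTPError object behaves like a response, so the server's error page stays available through read().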

5. Exception handling

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://www.abc.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print("The server couldn't fulfill the request.")
    print('Error code:', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason:', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
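The same pattern can be wrapped in a small helper. The sketch below is one possible shape, not a canonical one: the fetch function and its return values are invented here, and the demonstration targets a local port with nothing listening on it so that the URLError branch runs without any external network.

```python
import socket
import urllib.request
from urllib.error import HTTPError, URLError

# Ignore any system proxy so the local address is contacted directly.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, timeout=5):
    """Return ("ok", text), ("http_error", code) or ("url_error", reason)."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return "ok", response.read().decode("utf-8", errors="replace")
    except HTTPError as e:   # the server answered with an error status
        return "http_error", e.code
    except URLError as e:    # the server could not be reached at all
        return "url_error", e.reason


# Find a local port with nothing listening on it: bind, note the port, close.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

kind, detail = fetch("http://127.0.0.1:%d/" % closed_port)
print(kind, detail)
```

Here e.reason is the underlying exception (a refused connection in this case), which is usually more informative than the URLError message alone.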

6. HTTP authentication

import urllib.request

# Create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://www.nhooo.com/"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# Use the opener to fetch a URL
a_url = "https://www.nhooo.com/"
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')

print(a)
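The Basic authentication flow can be exercised end to end against a local server. In the sketch below, the AuthHandler class, the realm name, and the demo_user / demo_pass credentials are all assumptions made for illustration.

```python
import base64
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

USER, PASSWORD = "demo_user", "demo_pass"  # hypothetical credentials
EXPECTED = "Basic " + base64.b64encode(
    ("%s:%s" % (USER, PASSWORD)).encode("utf-8")
).decode("ascii")


class AuthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that demands HTTP Basic authentication."""

    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            body = b"welcome"
            self.send_response(200)
        else:
            body = b"login required"
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="demo"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), AuthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# Realm None means "use these credentials for any realm under this URL".
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, USER, PASSWORD)
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# The empty ProxyHandler bypasses any system proxy for this local request.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}), auth_handler
)
try:
    with opener.open(url, timeout=5) as response:
        page = response.read().decode("utf-8")
finally:
    server.shutdown()
    server.server_close()

print(page)  # welcome
```

The first request receives the 401 challenge; the handler then retries automatically with the Authorization header, so the caller only sees the final response.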

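7. Using a proxy

import urllib.request

# ProxyHandler maps URL schemes ('http', 'https') to proxy addresses.
# urllib has no built-in SOCKS support, so a key such as 'sock5' matches no
# request and is silently ignored; this assumes an HTTP proxy on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://www.baidu.com").read().decode("utf8")
print(a)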
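How ProxyHandler interprets its mapping can be inspected without sending any request. The sketch below relies on the documented behavior that the handler gets a <protocol>_open method for every protocol in the mapping; the localhost:1080 address is an assumed proxy location.

```python
import urllib.request

# Assumed local HTTP proxy address, for illustration only.
good = urllib.request.ProxyHandler({"http": "http://localhost:1080"})

# A key that matches no URL scheme: plain http requests never use this proxy.
unused = urllib.request.ProxyHandler({"sock5": "localhost:1080"})

# An empty mapping disables proxies, including those set in the environment.
direct = urllib.request.ProxyHandler({})

print(hasattr(good, "http_open"))    # True
print(hasattr(unused, "http_open"))  # False
print(direct.proxies)                # {}

# Proxies that urllib would pick up by default (environment or system settings).
print(urllib.request.getproxies())
```

Checking for http_open this way is a quick sanity test that a proxy mapping will actually apply to http:// URLs.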

8. Timeouts

import socket
import urllib.request

# Timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)

# This call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('https://www.nhooo.com/')
a = urllib.request.urlopen(req).read()
print(a)
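A timeout can also be set per call through the timeout argument, which avoids changing the process-wide socket default. The sketch below demonstrates this against a deliberately slow local server; SlowHandler and the chosen delays are assumptions made for the example.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that waits one second before answering."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client has already given up and closed the connection

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler bypasses any system proxy for this local request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
try:
    # The per-call timeout (0.2 s) is shorter than the server delay (1 s).
    opener.open(url, timeout=0.2)
    outcome = "finished"
except (socket.timeout, urllib.error.URLError):
    # A timeout can surface directly or wrapped in URLError,
    # depending on where in the request it occurs.
    outcome = "timed out"
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```

Catching both exception types covers timeouts raised while connecting as well as those raised while waiting for the response.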

That is all for this article. I hope it helps with your learning.
