Python Crawler: Extracting Every URL Link from a Page, Step by Step

2023-07-23 17:03:04

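How do you collect every URL link on a page? In Python, you can fetch the page with urllib, parse the downloaded HTML with BeautifulSoup, and then extract all of the URLs from the parsed document.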

What is BeautifulSoup?

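BeautifulSoup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.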

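BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and BeautifulSoup cannot detect it. In that case, you only need to state the original encoding yourself.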
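The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original article: the GBK-encoded sample markup is made up, and `from_encoding` is used on the assumption that the page's real encoding is already known.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: bytes encoded as GBK with no <meta charset> declaration.
raw = "<html><body><p>中文标题</p></body></html>".encode("gbk")

# Tell BeautifulSoup the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()          # text is a Unicode str
print(text)
print(soup.original_encoding)     # the encoding BeautifulSoup used to decode the input

utf8_bytes = soup.p.encode("utf-8")  # output serialized as UTF-8 bytes
print(utf8_bytes)
```

If the page does declare its encoding, `from_encoding` can usually be left out.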

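BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers, such as lxml. If lxml is not installed, the standard-library parser (html.parser) is used instead. The lxml parser is more powerful and faster.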
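One way to make the parser choice explicit is sketched below. The helper name `make_soup` is hypothetical. It tries lxml first and falls back to the standard-library parser when lxml is not installed, which bs4 signals by raising `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<p>See <a href="/news/">news</a></p>')
print(soup.a["href"])
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.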

Full code:

from bs4 import BeautifulSoup
import time
import urllib.request

t = time.time()
websiteurls = {}  # URLs already recorded for the whole site


def scanpage(url):
    websiteurl = url
    n = 0
    html = urllib.request.urlopen(websiteurl).read()
    soup = BeautifulSoup(html, "html.parser")
    Upageurls = {}  # unique links found on this page -> HTTP status code
    # Collect every <a> tag that carries an href attribute.
    pageurls = soup.find_all("a", href=True)
    for links in pageurls:
        href = links.get("href")
        # Keep links that contain the site URL and have not been seen yet.
        if websiteurl in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0
    for links in list(Upageurls.keys()):
        t2 = time.time()
        try:
            # Request each link once and remember its status code.
            code = urllib.request.urlopen(links).getcode()
        except Exception:
            print("connect failed")
        else:
            Upageurls[links] = code
            t1 = time.time()
            print(n, links, code)
            print(t1 - t2)  # seconds spent on this request
            n += 1
    print("total is " + repr(n) + " links")
    print(time.time() - t)


scanpage("http://news.163.com/")
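The listing above keeps only links whose href contains the site URL as a substring, so relative links such as /sports/ are skipped. A possible refinement, sketched here with the standard library only, resolves every href against the page URL first and then compares host names. The function name `same_site_links` and the sample hrefs are illustrative assumptions, not part of the original code.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # turns "/sports/" into a full URL
        parts = urlparse(absolute)
        if (parts.scheme in ("http", "https")
                and parts.netloc == base_host
                and absolute not in result):
            result.append(absolute)
    return result


links = same_site_links(
    "http://news.163.com/",
    ["/sports/", "http://news.163.com/world/", "http://example.com/",
     "javascript:void(0)", "/sports/"],
)
print(links)
# ['http://news.163.com/sports/', 'http://news.163.com/world/']
```

The hrefs passed in would typically come from `soup.find_all("a", href=True)`, as in the full listing.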

BeautifulSoup can also fetch links selectively: when crawling the links on a page, use BeautifulSoup's find_all() method to search only for the tags that match your criteria.
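As a rough illustration of targeted matching with `find_all()`, the sketch below runs against a made-up HTML snippet (the markup and the `id` values are assumptions). It shows two filters: a regular expression on the `href` attribute, and a search limited to one container element.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for this example.
html = """
<div id="nav"><a href="http://news.163.com/">Home</a></div>
<div id="content">
  <a href="http://news.163.com/a.html">A</a>
  <a href="https://news.163.com/b.html">B</a>
  <a href="/c.html">C</a>
  <a name="anchor-only">no href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1) Only <a> tags whose href starts with http:// or https://
absolute = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]

# 2) Only <a> tags with an href inside the element whose id is "content"
content = soup.find("div", id="content")
content_links = [a["href"] for a in content.find_all("a", href=True)]

print(absolute)
print(content_links)
```

Passing `href=True` matches any tag that has the attribute at all, while a compiled pattern restricts matches to attribute values the pattern finds.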

That is all for this article. We hope it helps with your studies, and we hope you will continue to support 毛票票.
