Parsing HTML with BeautifulSoup in Python

2023-08-02 23:27:04 304

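Abstract

BeautifulSoup is a Python library for extracting data from HTML or XML files. It parses HTML or XML data into Python objects so that the data can be processed conveniently in Python code.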

Environment

CentOS 7.5
Python 2.7
BeautifulSoup 4

Using BeautifulSoup

The basic job of BeautifulSoup is to search for and edit HTML tags.

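Basic concepts: object types

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. These objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Object type	Description
BeautifulSoup	The document as a whole
Tag	An HTML tag
NavigableString	The text contained in a tag
Comment	A special kind of NavigableString, used when the text inside a tag is an HTML comment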
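
The four object types can be seen side by side in a short sketch. The markup below is hypothetical and chosen only for illustration; it uses Python's built-in "html.parser" so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag holding a text string and an HTML comment.
markup = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p
text, note = tag.contents  # a NavigableString followed by a Comment

# Collect the class name of each object to show the four types.
type_names = [type(obj).__name__ for obj in (soup, tag, text, note)]
print(type_names)
```

Because Comment is a subclass of NavigableString, code that only wants visible text usually needs to check for Comment explicitly.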

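Installation and import

# BeautifulSoup
pip install bs4

# Parsers
pip install lxml
pip install html5lib

# Initialization
from bs4 import BeautifulSoup

# Option 1: parse an open file directly
soup = BeautifulSoup(open("index.html"), 'lxml')

# Option 2: parse data held in a string
resp = "data"
soup = BeautifulSoup(resp, 'lxml')

# soup is a BeautifulSoup object
print(type(soup))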

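Searching and filtering tags

Basic methods

Two basic methods search for tags: find_all() and find(). find_all() returns a list of every tag that matches the filter, while find() returns only the first match.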

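import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first tag named "a"
soup.find("a")

# Return a list of all "a" tags
soup.find_all("a")

# Calling the soup object directly is shorthand for find_all()
soup("a")

# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Find all tags whose names appear in the list
soup.find_all(["a", "p"])

# Find <p> tags whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute, whatever its value
soup.find_all(id=True)

# Combine filters: href contains "elsie" and id is "link1"
soup.find_all(href=re.compile("elsie"), id='link1')

# Use attrs for attribute names that are not valid keyword arguments, such as data-*
soup.find_all(attrs={"data-foo": "value"})

# Find the first text string containing "sisters" (returns a NavigableString, not a Tag)
soup.find(string=re.compile("sisters"))

# Limit the number of results
soup.find_all("a", limit=2)

# Custom filter function applied to each tag
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# Custom filter function applied to a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() on a tag searches all of that tag's descendants. To search only its direct children, pass recursive=False.

soup.find_all("title", recursive=False)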
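
To see what recursive=False changes, here is a minimal sketch on a hypothetical document (the markup is an assumption for illustration), parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
markup = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# <title> is a grandchild of <html>, so the default recursive search finds it.
deep = soup.html.find_all("title")

# Restricting the search to direct children of <html> misses <title>.
shallow = soup.html.find_all("title", recursive=False)

# <head> is a direct child of <html>, so it is still found.
heads = soup.html.find_all("head", recursive=False)

print(len(deep), len(shallow), len(heads))
```

The same reasoning explains why calling find_all("title", recursive=False) on the soup object itself usually returns an empty list: the only direct child of the document is the html tag.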

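Extended methods

find_parents()	All ancestor nodes
find_parent()	The nearest ancestor node
find_next_siblings()	All following sibling nodes
find_next_sibling()	The first following sibling node
find_previous_siblings()	All preceding sibling nodes
find_previous_sibling()	The first preceding sibling node
find_all_next()	All elements that come after
find_next()	The first element that comes after
find_all_previous()	All elements that come before
find_previous()	The first element that comes before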
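
A few of the methods in the table can be tried out as follows. This is only a sketch: the markup, the ids, and the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links inside a paragraph inside a div.
markup = (
    '<div id="box"><p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a>, <a id="link3">Tillie</a>'
    '</p></div>'
)
soup = BeautifulSoup(markup, "html.parser")
link2 = soup.find(id="link2")

# Nearest ancestor that matches the filter.
parent_id = link2.find_parent("div")["id"]

# All matching ancestors, nearest first.
ancestor_names = [t.name for t in link2.find_parents(["p", "div"])]

# Siblings skip over the text nodes (", ") because the filter asks for <a> tags.
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]

# Everything after the current tag in document order, filtered to <a> tags.
later_ids = [t["id"] for t in link2.find_all_next("a")]

print(parent_id, ancestor_names, next_id, prev_id, later_ids)
```

These methods accept the same filters as find_all() and find(), so a tag name, an attribute filter, or a function can be passed to any of them.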

CSS selectors

BeautifulSoup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

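# Assumed markup: the tags below follow the "three sisters" sample that the selectors in this section refer to
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')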

# All <a> tags
soup.select("a")

# Descendants, level by level
soup.select("body a")
soup.select("html head title")

# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")

# All following siblings of the matched tag
soup.select("#link1 ~ .sister")

# The first sibling immediately after the matched tag
soup.select("#link1 + .sister")

# By class name
soup.select(".sister")
soup.select("[class~=sister]")

# By id
soup.select("#link1")
soup.select("a#link1")

# By any of several ids
soup.select("#link1,#link2")

# By presence of an attribute
soup.select('a[href]')

# By attribute value (prefix, suffix, substring)
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Return a list containing at most one match
soup.select(".sister", limit=1)

# Return the first match itself, or None if nothing matches
soup.select_one(".sister")
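
The sibling and attribute selectors are the easiest to confuse, so here is a small runnable sketch. The shortened markup is an assumption modelled on the sample document, and ".brother" is a made-up class used only to show the no-match case.

```python
from bs4 import BeautifulSoup

# Assumed markup: a shortened version of the three-sisters sample.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# "~" matches every later sibling; "+" matches only the adjacent one.
after_first = [a["id"] for a in soup.select("#link1 ~ .sister")]
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]

# "$=" matches on the end of the attribute value.
by_suffix = [a.get_text() for a in soup.select('a[href$="tillie"]')]

# select_one() returns a single Tag, or None when nothing matches.
first = soup.select_one(".sister")
missing = soup.select_one(".brother")

print(after_first, adjacent, by_suffix, first["id"], missing)
```

Note that select() always returns a list, even with limit=1, whereas select_one() returns the tag itself.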

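Working with Tag objects

Tag attributes

# Assumed markup: two <p> tags, each with a class attribute
soup = BeautifulSoup('<p class="boldest">Extremely bold</p><p class="boldest2">Extremely bold2</p>', 'lxml')
# Get all <p> tag objects
tags = soup.find_all("p")
# Get the first <p> tag object
tag = soup.p
# Show the object's type
type(tag)
# Tag name
tag.name
# Tag attributes, as a dictionary
tag.attrs
# Value of the class attribute (a list, because class can hold several values)
tag['class']
# The text inside the tag, as a NavigableString
tag.string

# Iterate over every string inside the tag
for string in tag.strings:
    print(repr(string))

# Same, with surrounding whitespace stripped and whitespace-only strings skipped
for string in tag.stripped_strings:
    print(repr(string))

# Get all text inside the tag, including text in descendant tags, as one Unicode string
tag.get_text()
## Join the pieces with "|"
tag.get_text("|")
## Join the pieces with "|", stripping whitespace and skipping empty strings
tag.get_text("|", strip=True)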
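
The difference between .strings, .stripped_strings, and the get_text() variants shows up most clearly on markup with stray whitespace. The following is a sketch on a hypothetical fragment, parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup with leading whitespace and a trailing newline.
markup = "<p> Hello <b>world</b>\n</p>"
soup = BeautifulSoup(markup, "html.parser")
tag = soup.p

# Every string inside the tag, whitespace included.
raw_strings = [str(s) for s in tag.strings]

# Whitespace-only strings are dropped and the rest are stripped.
clean_strings = list(tag.stripped_strings)

joined = tag.get_text()                      # plain concatenation
separated = tag.get_text("|")                # separator between every piece
stripped = tag.get_text("|", strip=True)     # separator between stripped pieces

print(raw_strings, clean_strings, repr(joined), repr(separated), repr(stripped))
```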

Getting child nodes

tag.contents  # list of the tag's direct children
tag.children  # iterator over the tag's direct children
for child in tag.children:
    print(child)

tag.descendants  # iterator over all descendants, recursively
for child in tag.descendants:
    print(child)
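
As a quick illustration of how .children and .descendants differ, consider this sketch. The nested markup is hypothetical, and Tag is imported only to tell tags apart from strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested markup.
markup = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# Only the first level below <div>.
child_names = [c.name for c in div.children]

# Every level, in document order; strings are shown as plain text.
described = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]

print(child_names)
print(described)
```

The strings appear among the descendants because text nodes are part of the tree, not just tags.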

Getting parent nodes

tag.parent  # the tag's immediate parent
tag.parents  # iterator over all of the tag's ancestors, nearest first

for parent in tag.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Getting sibling nodes

# The next sibling
tag.next_sibling

# All siblings after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
    print(repr(sibling))

# The previous sibling
tag.previous_sibling

# All siblings before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
    print(repr(sibling))

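Traversing elements

BeautifulSoup treats every tag and string in the document as an "element", ordered as it appears in the HTML from top to bottom. The attributes below walk through these elements one at a time.

# The element right after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
    print(repr(element))

# The element right before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
    print(repr(element))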
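
Element traversal is easy to mix up with sibling traversal, so a small comparison may help. This is a sketch on hypothetical markup: next_sibling stays on the same level of the tree, while next_element follows document order and therefore steps into the tag first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two sibling tags, each holding one string.
markup = "<p><b>one</b><i>two</i></p>"
soup = BeautifulSoup(markup, "html.parser")
b = soup.b

sibling = b.next_sibling   # the <i> tag, on the same level as <b>
element = b.next_element   # the string inside <b>, which was parsed next

# Everything parsed after <b>, in document order.
following = [str(e) for e in b.next_elements]

print(sibling.name, repr(str(element)), following)
```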

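Modifying tag attributes

# Assumed markup: a <b> tag with a class attribute
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "New link text."
print(tag)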

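Modifying tag content (NavigableString)

# Assumed markup: a <b> tag with a class attribute
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
tag.string = "New link text."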

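Appending tag content (NavigableString)

from bs4 import NavigableString

# Assumed markup: an <a> tag containing the text "Foo"
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
tag.append("Bar")
tag.contents

# Or

new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)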

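Adding a comment (Comment)

A comment is a special NavigableString object, so it can also be added with append().

from bs4 import Comment
# Assumed markup: an <a> tag containing the text "Foo"
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)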

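Adding a tag (Tag)

There are two ways to add a tag: inside a given tag, with append(), or at a specific position, with insert(), insert_before(), or insert_after().

The append() method

# Assumed markup: an empty <b> tag
soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com", rel="external nofollow")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)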

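The insert() method inserts an object (a Tag or NavigableString) at a given position in the current tag's list of children.

# Assumed markup: a link containing text and an <i> tag
html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents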

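The insert_before() and insert_after() methods insert an element immediately before or after the current tag, as its sibling.

# Assumed markup: a <b> tag wrapping a link
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
# The new <i> tag now sits just before <b>
print(soup)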

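wrap() wraps an element in the tag you pass in and returns the new wrapper. unwrap() does the opposite: it replaces a tag with its contents and returns the tag that was removed.

```python
# Wrap
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'lxml')
soup.p.string.wrap(soup.new_tag("b"))
# Returns <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Returns <div><p><b>I wish I was bold.</b></p></div>

# Unwrap
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Shows <a href="http://example.com/">I linked to example.com</a>
```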

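Removing tags

# Assumed markup: a <b> tag wrapping a link; it is parsed again before each step so every call has a <b> to work on
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'

# Remove all children of the current tag
soup = BeautifulSoup(html, 'lxml')
soup.b.clear()

# Remove the current tag and all its children from the soup, and return the removed tag
soup = BeautifulSoup(html, 'lxml')
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all its children from the soup and destroy them; nothing is returned
soup = BeautifulSoup(html, 'lxml')
soup.b.decompose()

# Replace the current tag with another element
soup = BeautifulSoup(html, 'lxml')
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)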
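
The practical difference between extract() and decompose() is whether the removed tag is still needed afterwards. A minimal sketch, on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = "<p>Keep <b>bold</b> and <i>italic</i> text</p>"
soup = BeautifulSoup(markup, "html.parser")

# extract() detaches the tag and hands it back, so it can be reused elsewhere.
removed = soup.b.extract()
after_extract = str(soup)

# decompose() removes the tag and destroys it; there is nothing to keep.
soup.i.decompose()
after_decompose = str(soup)

print(removed, removed.parent)
print(after_extract)
print(after_decompose)
```

Use extract() when the tag will be moved or inspected later, and decompose() when it is simply unwanted.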

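Other methods

Output

# Pretty-printed output
tag.prettify()
# Pretty-printed output, encoded as latin-1 bytes
tag.prettify("latin-1")

BeautifulSoup converts the parsed document to Unicode, and HTML entities are converted to the corresponding Unicode characters as well. When the document is turned back into a string, those characters are encoded as UTF-8, so the original HTML entities are not restored.
When converting input to Unicode, BeautifulSoup (through its UnicodeDammit helper) can also turn Microsoft "smart quotes" into HTML or XML entities.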
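
The entity behaviour can be checked with a short sketch. The markup is hypothetical; formatter="html" is the output formatter that asks BeautifulSoup to write named HTML entities where it can, and the exact entity names it picks may vary between versions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing HTML entities.
soup = BeautifulSoup("<p>&ldquo;Hello&rdquo; &amp; caf&eacute;</p>", "html.parser")

# After parsing, the entities are ordinary Unicode characters.
text = str(soup.p.string)

# Default output: only characters that would break the markup are escaped.
default_out = soup.p.decode()

# The "html" formatter converts Unicode characters back to entities where possible.
html_out = soup.p.decode(formatter="html")

# Encoding to bytes uses UTF-8 unless another encoding is requested.
utf8_bytes = soup.p.encode("utf-8")

print(repr(text))
print(default_out)
print(html_out)
```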

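Document encoding

After parsing, BeautifulSoup holds the document as Unicode. It uses an "encoding auto-detection" sub-library to identify the document's encoding and convert it to Unicode.

soup = BeautifulSoup(html)
soup.original_encoding

# The document's encoding can also be specified by hand
soup = BeautifulSoup(html, from_encoding="iso-8859-8")
soup.original_encoding

# To make auto-detection more efficient, some encodings can be ruled out in advance
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
When BeautifulSoup writes a document out, the default output encoding is UTF-8, whatever the encoding of the input was.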
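
The round trip from a non-UTF-8 input to UTF-8 output can be sketched like this. The input bytes are a made-up example; from_encoding only has an effect when the input is bytes rather than an already decoded string.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the word "café" encoded as ISO-8859-1 bytes.
raw = "<p>caf\u00e9</p>".encode("iso-8859-1")

soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

# Inside the soup, the text is Unicode.
text = str(soup.p.string)

# With no argument, encode() writes UTF-8 regardless of the input encoding.
utf8_out = soup.encode()

# Another output encoding can be requested explicitly.
latin1_out = soup.encode("iso-8859-1")

print(soup.original_encoding)
print(utf8_out)
print(latin1_out)
```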
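
Document parsers

BeautifulSoup currently supports the "lxml", "html5lib", and "html.parser" parsers.

# Assumed markup: the invalid fragment "<a></p>", which each parser repairs differently
soup = BeautifulSoup("<a></p>")
soup
# Output: depends on which parsers are installed; lxml is preferred when available
soup = BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>
soup = BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>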

References
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh
