A Basic Tutorial on Parsing HTML in Python with the BeautifulSoup Library

2024-03-20 08:36:03 487

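BeautifulSoup is a third-party Python library for parsing HTML/XML content in order to scrape specific information from web pages. The latest release is v4; this article mainly summarizes common methods for parsing HTML with v3, the version I use.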

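Preparation

1. Installing BeautifulSoup

This article uses BeautifulSoup to parse page content. The example here is simple enough that plain string processing would also do the job.

Run

sudo easy_install beautifulsoup4

to install it. Note that this command installs v4, which is imported as bs4 and is what the example at the end of this article uses; the v3 snippets in the Basics section import from the separate BeautifulSoup package.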

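2. Installing the requests module

The requests module is used to load the web page being requested.

Type import requests at the Python prompt; if it raises an error, the requests module is not installed.

I planned to install it online with easy_install, but the easy_install command did not exist on my system, so I ran sudo apt-get install python-setuptools to install the easy_install tool.

Then run sudo easy_install requests to install the requests module.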

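Basics

1. Initialization
Import the module

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup  # process HTML
# from BeautifulSoup import BeautifulStoneSoup  # process XML
# import BeautifulSoup  # everything

Create the object: initialize it from a str. A BeautifulSoup object is usually initialized from the HTML returned by urllib2 or a browser.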

doc = ['<html><head><title>hello</title></head>',
       '<body><p>This is paragraph one of pythonclub.org.</p>',
       '<p>This is paragraph two of pythonclub.org.</p>',
       '</body></html>']
soup = BeautifulSoup(''.join(doc))

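Specifying the encoding: when the HTML uses some other encoding (neither UTF-8 nor ASCII), such as GB2312, you need to specify the character encoding so that BeautifulSoup can parse it correctly.

htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)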
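For readers on BeautifulSoup 4, the same idea can be sketched as follows. This assumes Python 3 and the bs4 package rather than the v3 API used above; the sample page and the name raw_html are made up for illustration. In bs4 the keyword is spelled from_encoding.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4), not the v3 API above.
from bs4 import BeautifulSoup

# Hypothetical sample: a small page held as GB2312-encoded bytes,
# the way a server might deliver it.
raw_html = (
    "<html><head><title>测试</title></head>"
    "<body><p>你好</p></body></html>"
).encode("gb2312")

# bs4 spells the keyword from_encoding (v3 used fromEncoding).
soup = BeautifulSoup(raw_html, "html.parser", from_encoding="gb2312")

# The parsed tree holds ordinary unicode strings.
paragraph_text = soup.p.get_text()
title_text = soup.title.string
print(paragraph_text)
print(title_text)
```

Passing the bytes together with the encoding lets the library do the decoding; decoding the bytes yourself and passing a str would work as well.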

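2. Getting tag content
Find the tag block you are interested in; the call returns the parse tree for that tag block.

head = soup.find('head')
# head = soup.head
# head = soup.contents[0].contents[0]
print head

Output: <head><title>hello</title></head>
Note that the contents attribute is a list holding the direct children of that parse-tree node.

html = soup.contents[0]  # <html>...</html>
head = html.contents[0]  # <head>...</head>
body = html.contents[1]  # <body>...</body>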
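As a rough illustration of find and contents, here is a small sketch. It assumes Python 3 and BeautifulSoup 4 (bs4) with the built-in html.parser, not the v3 module used in this section; the sample document is a simplified stand-in for doc.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
from bs4 import BeautifulSoup

# Simplified stand-in for the sample document.
doc = [
    "<html><head><title>hello</title></head>",
    "<body><p>This is paragraph one.</p>",
    "<p>This is paragraph two.</p></body></html>",
]
soup = BeautifulSoup("".join(doc), "html.parser")

head = soup.find("head")   # the same node as soup.head
html = soup.contents[0]    # the <html> element is the only top-level node here

# contents lists direct children only, in document order.
child_names = [child.name for child in html.contents]

print(head)
print(child_names)
```

Index-based access through contents depends on there being no whitespace between the tags; with pretty-printed HTML the whitespace text nodes shift the indexes.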

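3. Getting related nodes
Use parent to get the parent node

body = soup.body
html = body.parent  # html is the parent of body

Use nextSibling and previousSibling to get the next and previous siblings

head = body.previousSibling  # head and body are on the same level; head is the sibling just before body
p1 = body.contents[0]  # p1 and p2 are both children of body; contents[0] gives p1
p2 = p1.nextSibling  # p2 is on the same level as p1 and is its next sibling; body.contents[1] returns it too

contents[] can also be used flexibly to reach related nodes. To search among ancestors or siblings, use findParent(s), findNextSibling(s), and findPreviousSibling(s).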
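The navigation attributes above can be tried out with a sketch like the one below. It assumes Python 3 and BeautifulSoup 4 (bs4), where the names are written in snake_case (next_sibling, previous_sibling, find_parent, find_next_sibling, find_previous_siblings); the sample markup is invented for illustration.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
from bs4 import BeautifulSoup

html_text = (
    "<html><head><title>hello</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)
soup = BeautifulSoup(html_text, "html.parser")

body = soup.body
p1 = body.contents[0]
p2 = p1.next_sibling                    # sibling right after p1

parent_name = body.parent.name          # body's parent is <html>
prev_name = body.previous_sibling.name  # the sibling before body is <head>

# Search-style navigation: these take the same matchers as find().
ancestor = p1.find_parent("html")
later = p1.find_next_sibling("p")
earlier_list = p2.find_previous_siblings("p")

print(parent_name, prev_name, p2.get_text())
```

The plural methods return lists, while the singular ones return the first match or None.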

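4. find/findAll in detail
Function prototype: find(name=None, attrs={}, recursive=True, text=None, **kwargs). find returns the first match, while findAll returns every match as a list.
Searching by tag

find(tagname)  # search for the tag named tagname, e.g. find('head')
find(list)  # search for any tag named in the list, e.g. find(['head', 'body'])
find(dict)  # search for any tag named by a key of the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))  # search for tags whose name matches the regex, e.g. find(re.compile('^p')) finds tags whose names start with p
find(lambda)  # search for tags for which the function returns true, e.g. find(lambda tag: len(tag.name) == 1) finds tags with one-character names
find(True)  # match every tag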
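The different kinds of name matcher can be compared side by side in a sketch like this one. It assumes Python 3 and BeautifulSoup 4 (bs4), where findAll is spelled find_all; the dict form is left out because this sketch only uses forms I am sure bs4 accepts. The sample markup and the helper names() are hypothetical.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
import re
from bs4 import BeautifulSoup

html_text = (
    "<html><head><title>hello</title></head>"
    "<body><p>one</p><b>bold</b><pre>code</pre></body></html>"
)
soup = BeautifulSoup(html_text, "html.parser")

def names(tags):
    """Hypothetical helper: reduce a result list to tag names for easy printing."""
    return [tag.name for tag in tags]

by_string = names(soup.find_all("p"))                 # exact tag name
by_list = names(soup.find_all(["head", "body"]))      # any name in the list
by_regex = names(soup.find_all(re.compile("^p")))     # names starting with p
by_function = names(soup.find_all(lambda tag: len(tag.name) == 1))  # one-letter names
all_tags = names(soup.find_all(True))                 # every tag

print(by_string, by_list, by_regex, by_function, all_tags)
```

Results come back in document order, so the regex search lists p before pre.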

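Searching by attrs

find(id='xxx')  # find the tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # find the tag whose id matches the regex and whose align attribute is xxx
find(attrs={'id': True, 'align': None})  # find the tag that has an id attribute but no align attribute


# match1, match2 and match3 stand for any matcher: a string, a compiled regex, True/None, or a function
resp1 = soup.findAll('a', attrs={'href': match1})
resp2 = soup.findAll('h1', attrs={'class': match2})
resp3 = soup.findAll('img', attrs={'id': match3})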
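Attribute searches can be illustrated with the sketch below, which assumes Python 3 and BeautifulSoup 4 (bs4). The markup, ids, and URLs are invented. For the "has id but no align" case the sketch uses a function with has_attr instead of a None value, simply because that form is unambiguous in bs4.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html_text = (
    '<div><p id="firstpara" align="center">one</p>'
    '<p id="secondpara">two</p><p>three</p>'
    '<a href="/post/cgdb">cgdb</a><a href="http://example.com/">ext</a></div>'
)
soup = BeautifulSoup(html_text, "html.parser")

# Keyword form: match on a single attribute value.
by_id = soup.find(id="firstpara")

# attrs dict: every listed attribute must match.
regex_and_value = soup.find_all(attrs={"id": re.compile("para$"), "align": "center"})

# Tags that have an id but no align attribute.
has_id_no_align = soup.find_all(
    lambda tag: tag.has_attr("id") and not tag.has_attr("align")
)

# Tag name and attribute matcher combined.
local_links = soup.find_all("a", attrs={"href": re.compile("^/post/")})

print(by_id.get_text(), [a["href"] for a in local_links])
```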

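Searching by text
A text search causes the other search arguments, such as the tag name and attrs, to be ignored. The matching rules are the same as for tag searches.

print p1.text
# u'This is paragraph one of pythonclub.org.'
print p2.text
# u'This is paragraph two of pythonclub.org.'
# Note: (1) a tag's text includes its own text and the text of all its descendants. (2) All text has already been converted to unicode; encode it yourself with encode(xxx) if needed.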
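To make the difference between reading a tag's text and searching by text concrete, here is a small sketch. It assumes Python 3 and a reasonably recent BeautifulSoup 4 (bs4), where the search argument is named string (the older name text is still understood) and get_text() plays the role of the text attribute. The sample markup is made up.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4) 4.4 or later.
import re
from bs4 import BeautifulSoup

html_text = (
    "<body><p>This is paragraph <b>one</b>.</p>"
    "<p>This is paragraph <b>two</b>.</p></body>"
)
soup = BeautifulSoup(html_text, "html.parser")

p1, p2 = soup.find_all("p")

# A tag's text includes the text of all its descendants.
p1_text = p1.get_text()

# Searching by string returns the text nodes themselves, not tags.
matches = soup.find_all(string=re.compile("paragraph"))
exact = soup.find(string="two")

print(p1_text)
print(len(matches), exact.parent.name)
```

From a matched text node, parent leads back to the enclosing tag.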

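The recursive and limit arguments
recursive=False searches only the direct children; otherwise the whole subtree is searched. The default is True. With findAll or any similar method that returns a list, the limit argument caps the number of results; for example, findAll('p', limit=2) returns the first two tags found.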
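The effect of the two arguments is easiest to see on a document with a nested paragraph. The sketch below assumes Python 3 and BeautifulSoup 4 (bs4), where the method is find_all; the markup is invented.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
from bs4 import BeautifulSoup

# "two" is nested one level deeper than "one" and "three".
html_text = "<html><body><p>one</p><div><p>two</p></div><p>three</p></body></html>"
soup = BeautifulSoup(html_text, "html.parser")
body = soup.body

# Default: search the whole subtree.
all_p = [p.get_text() for p in body.find_all("p")]

# recursive=False: only direct children of body are considered.
direct_p = [p.get_text() for p in body.find_all("p", recursive=False)]

# limit=2: stop after the first two matches in document order.
first_two = [p.get_text() for p in body.find_all("p", limit=2)]

print(all_p, direct_p, first_two)
```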

Example
This article uses a blog's article list page as the example and extracts the article titles from the page with Python.

The HTML of the article list portion of the page looks like this:

<ul class="listing">
<li class="listing-item"><span class="date">2014-12-03</span><a href="/post/linux_funtion_advance_feature" title="Linux函数高级特性">Linux函数高级特性</a>
</li>
<li class="listing-item"><span class="date">2014-12-02</span><a href="/post/cgdb" title="cgdb的使用">cgdb的使用</a>
</li>
...
</ul>

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'a http parse test program'

__author__ = 'kuringlv'


import requests
import bs4

archives_url = "http://kuring.me/archive"

def start_parse(url):
    print "Fetching (%s)" % url
    response = requests.get(url)
    print "Finished fetching the page"

    soup = bs4.BeautifulSoup(response.content.decode("utf-8"), "html.parser")
    # soup = bs4.BeautifulSoup(response.text, "html.parser")

    # Use a with statement so that close() is never forgotten
    # The file is written in utf-8
    with open('archives.txt', 'w') as f:
        # Select every <a> inside an <li class="listing-item">
        for archive in soup.select("li.listing-item a"):
            f.write(archive.get_text().encode('utf-8') + "\n")
            print archive.get_text().encode('utf-8')

# When this module is run from the command line, __name__ equals '__main__'
# When another module imports it, __name__ equals 'parse_html'
if __name__ == '__main__':
    start_parse(archives_url)
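The listing above is written for Python 2. As a rough Python 3 counterpart, the sketch below runs the same CSS selector against the HTML fragment shown earlier instead of fetching the live page, so it works offline. The names sample_html and extract_titles are made up for illustration, and select() relies on the soupsieve package that recent bs4 releases install alongside.

```python
# A minimal sketch, assuming Python 3 and BeautifulSoup 4 (bs4).
from bs4 import BeautifulSoup

# The article-list fragment from above, used in place of a network request.
sample_html = """<ul class="listing">
<li class="listing-item"><span class="date">2014-12-03</span><a href="/post/linux_funtion_advance_feature" title="Linux函数高级特性">Linux函数高级特性</a>
</li>
<li class="listing-item"><span class="date">2014-12-02</span><a href="/post/cgdb" title="cgdb的使用">cgdb的使用</a>
</li>
</ul>"""

def extract_titles(html_text):
    """Hypothetical helper: return the link text of every article in the list."""
    soup = BeautifulSoup(html_text, "html.parser")
    # "li.listing-item a" selects each <a> nested inside <li class="listing-item">.
    return [link.get_text() for link in soup.select("li.listing-item a")]

titles = extract_titles(sample_html)
for title in titles:
    print(title)
```

With requests, the fetched page could be passed in as extract_titles(response.text). In Python 3 the titles are already str, so writing them out would use open('archives.txt', 'w', encoding='utf-8') without a manual encode step.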
