A Simple Method for Parsing HTML Tables in Python

2024-03-30 13:40:04 324

This article presents a simple method for parsing HTML tables in Python, shared here for reference. The details are as follows:

This relies on libxml2dom, so make sure it is installed first! Import the code into your script and call the parse_tables() function.

1. source = a string containing the source code. You can pass in just the table or the entire page code.

2. headers = a list of ints OR a list of strings
If the headers are ints, this is for tables with no header row: just list the 0-based indexes of the cells in each row from which you want to extract data.
If the headers are strings, this is for tables with header columns (with the th tags): it will pull the information from the specified columns.

3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the table you want to parse is the third table in the code, then pass in the number 2 here.

It will return a list of lists. Each inner list will contain the parsed information from one row.
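
To make the two headers modes concrete, here is a minimal sketch that applies them to rows that have already been extracted as plain lists of strings. The helpers select_by_index and select_by_name and the sample table are hypothetical and are not part of the article's code; they only mirror the selection logic described above, without any HTML parsing.

```python
# Hypothetical helpers (not part of the article's code) that mimic the two
# `headers` modes on rows that are already plain lists of strings.

def select_by_index(rows, indexes):
    """Int mode: keep the cells at the given 0-based positions in every row."""
    result = []
    for row in rows:
        # Positions that a short row does not have are skipped.
        result.append([row[i] for i in indexes if i < len(row)])
    return result


def select_by_name(header_row, data_rows, names):
    """String mode: locate the wanted header names, then keep those columns."""
    wanted = [i for i, name in enumerate(header_row) if name in names]
    return [[row[i] for i in wanted] for row in data_rows]


# Made-up sample data: one header row followed by two data rows.
table = [
    ["Name", "Price", "Stock"],
    ["Apple", "1.20", "40"],
    ["Pear", "0.95", "12"],
]

by_index = select_by_index(table[1:], [0, 2])
by_name = select_by_name(table[0], table[1:], ["Name", "Price"])
print(by_index)  # [['Apple', '40'], ['Pear', '12']]
print(by_name)   # [['Apple', '1.20'], ['Pear', '0.95']]
```

In string mode the columns come back in the order they appear in the table, not in the order the names were passed in.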

The complete code is as follows:

# The goal of the table parser is to get specific information from specific
# columns in a table.
# Input: source code from a typical website
# Arguments: a list of headers the user wants to return
# Output: a list of lists of the data in each row
import libxml2dom


def parse_tables(source, headers, table_index):
    """parse_tables(string source, list headers, table_index)
    headers may be a list of strings if the table has headers defined, or
    headers may be a list of ints if no headers are defined; in that case
    the data is taken from the cells at those indexes in each row.
    This method returns a list of lists.
    """
    # Debug output (Python 2 print statement)
    print 'Printing headers: ', headers
    # Route to the correct function based on the type of the first header;
    # the remaining entries are assumed to be of the same type
    # if the header type is int
    if isinstance(headers[0], int):
        # run the no_header function
        return no_header(source, headers, table_index)
    # if the header type is string
    elif isinstance(headers[0], str):
        # run the header_given function
        return header_given(source, headers, table_index)
    else:
        # return None if the headers aren't correct
        return None


# This function takes in the source code of the whole page, a string list of
# headers and the index number of the table on the page. It returns a list of
# lists with the scraped information
def header_given(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # initiate a list to hold the index numbers of the data in the rows
    header_index = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except Exception:
        # if the table doesn't exist then return an error
        return ['The table index was not found']
    # get a list of headers in the table
    table_headers = main_table.getElementsByTagName('th')
    # need a sentry value for the header loop
    loop_sentry = 0
    # loop through each header looking for matches
    for header in table_headers:
        # if the header is in the desired headers list
        if header.textContent in headers:
            # add its position to the header_index
            header_index.append(loop_sentry)
        # add one to the loop_sentry
        loop_sentry += 1
    # get the rows from the table
    rows = main_table.getElementsByTagName('tr')
    # sentry value detecting if the first row is being viewed
    row_sentry = 0
    # loop through the rows in the table, skipping the first row
    for row in rows:
        # if row_sentry is 0 this is our first row
        if row_sentry == 0:
            # make the row_sentry not 0
            row_sentry = 1337
            continue
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # iterate through all of the header indexes
        for i in header_index:
            # append the cell's text content to the cell_list
            cell_list.append(cells[i].textContent)
        # append the cell_list to the return_list
        return_list.append(cell_list)
    # return the return_list
    return return_list


# This function takes in the source code of the whole page, an int list of
# headers indicating the index number of the needed item and the index number
# of the table on the page. It returns a list of lists with the scraped info
def no_header(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except Exception:
        # if the table doesn't exist then return an error
        return ['The table index was not found']
    # get all of the rows out of the main_table
    rows = main_table.getElementsByTagName('tr')
    # loop through each row
    for row in rows:
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # loop through the list of desired headers
        for i in headers:
            try:
                # try to add text from the cell into the cell_list
                cell_list.append(cells[i].textContent)
            except Exception:
                # if there is an error, usually an index error, just continue
                continue
        # append the data scraped into the return_list
        return_list.append(cell_list)
    # return the return list
    return return_list
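
For readers who cannot install libxml2dom, the same idea can be sketched with the Python 3 standard library alone. The example below swaps in html.parser for libxml2dom, so it is a different technique from the code above, not a port of it. The names TableCollector and parse_tables_stdlib and the sample page are hypothetical. It assumes well-formed markup with explicit closing tags and no nested tables, and unlike header_given it skips cells that a short row does not have instead of raising an error.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each table as a list of rows; a row is a list of
    (tag, text) pairs where tag is 'th' or 'td'."""

    def __init__(self):
        super().__init__()
        self.tables = []   # every table seen so far
        self._row = None   # row currently being filled, if any
        self._cell = None  # [tag, text parts] of the open cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self.tables[-1].append(self._row)
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell[1]).strip()
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr":
            self._row = None
            self._cell = None


def parse_tables_stdlib(source, headers, table_index):
    """Hypothetical stand-in for parse_tables() built on html.parser."""
    if not headers or not isinstance(headers[0], (int, str)):
        return None
    parser = TableCollector()
    parser.feed(source)
    parser.close()
    try:
        rows = parser.tables[table_index]
    except IndexError:
        return ['The table index was not found']
    if isinstance(headers[0], int):
        # Int mode: use the given cell positions in every row.
        wanted, data_rows = headers, rows
    else:
        # String mode: match header names against all th cells of the
        # table, then skip the first row when collecting data.
        names = [text for row in rows for tag, text in row if tag == "th"]
        wanted = [i for i, name in enumerate(names) if name in headers]
        data_rows = rows[1:]
    result = []
    for row in data_rows:
        cells = [text for tag, text in row if tag == "td"]
        result.append([cells[i] for i in wanted if i < len(cells)])
    return result


# Made-up page with two tables; the second one is the one we want.
page = """
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
  <tr><td>Apple</td><td>1.20</td><td>40</td></tr>
  <tr><td>Pear</td><td>0.95</td><td>12</td></tr>
</table>
"""

print(parse_tables_stdlib(page, ["Name", "Stock"], 1))
# [['Apple', '40'], ['Pear', '12']]
print(parse_tables_stdlib(page, [1], 1))
# [[], ['1.20'], ['0.95']]
```

In int mode the header row contributes an empty list because it contains only th cells, which matches what no_header does for such a row.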

I hope this article is helpful for your Python programming.
