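A custom Python parser for simple XML files
This article walks through a custom Python approach to parsing simple XML files, shared here for reference. The details are as follows:
Our company's internal interface returns strings in two formats: PHP arrays and XML. Python cannot consume the PHP array form directly, and the XML strings are not standard-conforming (some node names begin with a digit), so the standard modules cannot parse them either. I therefore wrote a simple script to parse the responses for interface testing.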
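To see why the standard modules are ruled out, here is a minimal sketch, assuming Python 3 and the standard `xml.etree.ElementTree` module. The element names `item1` and `1item` and the helper `try_parse` are made up for illustration; they are not taken from the real interface.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return True if the standard parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


ok_doc = '<result><item1>x</item1></result>'    # name starts with a letter
bad_doc = '<result><1item>x</1item></result>'   # name starts with a digit

print(try_parse(ok_doc))    # True
print(try_parse(bad_doc))   # False
```

XML element names may not begin with a digit, so a conforming parser rejects the second document outright.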
#!/usr/bin/env python
# encoding: utf-8
import re


class xmlparse:
    def __init__(self, xmlstr):
        # xmlstr may be raw bytes from the interface or an already-decoded str
        self.xmlstr = xmlstr
        self.xmldom = self.__convert2utf8()
        self.xmlnodelist = []  # every match found for the last path step
        self.xpath = ''

    def __convert2utf8(self):
        # Return the document body as text, with the <?xml ...?> declaration removed.
        xmldomstr = self.xmlstr
        if isinstance(xmldomstr, bytes):
            # The declaration itself is ASCII, so latin-1 can read it safely.
            headstr = self.__get_head(xmldomstr.decode('latin-1'))
            if 'gbk' in headstr:
                encoding = 'gbk'
            elif 'gb2312' in headstr:
                encoding = 'gb2312'
            else:
                encoding = 'utf-8'
            xmldomstr = xmldomstr.decode(encoding)
        return xmldomstr.replace(self.__get_head(xmldomstr), '')

    def __get_head(self, text):
        # Match the XML declaration at the very start of the text, if there is one.
        headregobj = re.match(r'<\?xml.*?\?>', text)
        if headregobj:
            return headregobj.group()
        else:
            return ''

    def parse(self, xpath):
        self.xpath = xpath
        # Turn each path step into a (regex, index) pair.
        xpatlist = []
        for xnode in self.xpath.split('/'):
            if xnode:
                spcindex = xnode.find('[')
                if spcindex > -1:
                    # 'name[1]' picks the match at index 1 (0-based, unlike real XPath)
                    index = int(xnode[spcindex + 1:-1])
                    xnode = xnode[:spcindex]
                else:
                    index = 0
                name = re.escape(xnode)
                temppat = ('<%s>(.*?)</%s>' % (name, name), index)
                xpatlist.append(temppat)
        # Apply the patterns in order, narrowing the text at every step.
        xmlnodestr = self.xmldom
        xmlnodelist = []
        for xpat, index in xpatlist:
            xmlnodelist = re.findall(xpat, xmlnodestr)
            xmlnodestr = xmlnodelist[index]
            if xmlnodestr.startswith('<![CDATA['):
                # Strip the CDATA wrapper: the prefix and the trailing ']]>'
                xmlnodestr = xmlnodestr.replace('<![CDATA[', '')[:-3]
        self.xmlnodelist = xmlnodelist
        return xmlnodestr


if '__main__' == __name__:
    xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>'
    xpath1 = '/product_id'
    xpath2 = '/product_id[1]'
    xpath3 = '/a/product_id'
    xp = xmlparse(xmlstr)
    print('xmlstr:', xp.xmlstr)
    print('xmldom:', xp.xmldom)
    print('------------------------------')
    getstr = xp.parse(xpath1)
    print('xpath:', xp.xpath)
    print('getlist:', xp.xmlnodelist)
    print('getstring:', getstr)
    print('------------------------------')
    getstr = xp.parse(xpath2)
    print('xpath:', xp.xpath)
    print('getlist:', xp.xmlnodelist)
    print('getstring:', getstr)
    print('------------------------------')
    getstr = xp.parse(xpath3)
    print('xpath:', xp.xpath)
    print('getlist:', xp.xmlnodelist)
    print('getstring:', getstr)
Output:
xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes"?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
xmldom: <resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
------------------------------
xpath: /product_id
getlist: ['aaaaa', 'bbbbb']
getstring: aaaaa
------------------------------
xpath: /product_id[1]
getlist: ['aaaaa', 'bbbbb']
getstring: bbbbb
------------------------------
xpath: /a/product_id
getlist: ['aaaaa']
getstring: aaaaa
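The second result above shows that a bracketed index is 0-based here, so `/product_id[1]` selects the second match. The sketch below isolates how one path step becomes a regex plus an index, and applies it to a node name that starts with a digit. It assumes Python 3; the helper `step_to_pattern` and the sample names `r` and `1item` are hypothetical and only serve the illustration.

```python
import re


def step_to_pattern(step):
    """Split a step such as 'name[1]' into (regex, index); index defaults to 0."""
    name, bracket, rest = step.partition('[')
    index = int(rest.rstrip(']')) if bracket else 0
    escaped = re.escape(name)
    return '<%s>(.*?)</%s>' % (escaped, escaped), index


pattern, index = step_to_pattern('1item[1]')
sample = '<r><1item>first</1item><1item>second</1item></r>'
matches = re.findall(pattern, sample)

print(pattern, index)    # <1item>(.*?)</1item> 1
print(matches[index])    # second
```

Because the lookup is plain text matching, the digit-leading name is no obstacle, which is the property the script relies on.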
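Because the returned XML is simple and none of its nodes carry attributes, it is easy to process. Testing did turn up one bug, though: when nodes with the same name are nested, the regex match goes wrong, because the non-greedy pattern pairs the outer opening tag with the first closing tag it meets, which belongs to the inner node. The problem can be worked around by keeping nested node names out of the xpath; otherwise the only option is to rewrite the parser with a more complex mechanism.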
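The nesting bug is easy to reproduce, and one possible shape for the "more complex mechanism" is a scanner that counts tag depth. The following is a sketch under stated assumptions, not the author's implementation: it assumes Python 3 and attribute-free tags, and the helper `find_balanced` and the sample string are invented for illustration.

```python
import re

nested = '<a><a>inner</a>tail</a>'

# The non-greedy pattern stops at the first </a>, which closes the inner node.
naive = re.findall('<a>(.*?)</a>', nested)
print(naive)    # ['<a>inner']


def find_balanced(xml_text, name):
    """Return the inner text of each outermost <name>...</name> element."""
    token = re.compile('<(/?)%s>' % re.escape(name))
    results = []
    depth = 0
    start = None
    for m in token.finditer(xml_text):
        if not m.group(1):              # opening tag
            if depth == 0:
                start = m.end()         # content begins after the outermost opener
            depth += 1
        elif depth > 0:                 # closing tag that matches an open element
            depth -= 1
            if depth == 0:
                results.append(xml_text[start:m.start()])
    return results


print(find_balanced(nested, 'a'))    # ['<a>inner</a>tail']
```

This sketch still treats tag-like text inside a CDATA section as real tags, so it remains suitable only for simple responses of the kind described here.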
I hope this article is helpful for your Python programming.