A custom Python approach to parsing simple XML-format files
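This article walks through an example of parsing simple XML-format files with custom Python code. It is shared here for reference. The details are as follows: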
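The strings returned by our company's internal API come in two forms: PHP arrays and XML. Python cannot use a PHP array directly, and the XML strings are not standard XML (some node names begin with a digit), so the standard modules cannot parse them either. I therefore wrote a simple script that parses these strings, for use in API testing.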
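To illustrate why the standard modules are ruled out, here is a minimal Python 3 sketch. The sample string and the tag name `1st_item` are hypothetical, made up for this illustration; the only assumption taken from the text is that some node names start with a digit, which the XML specification does not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment: the tag name "1st_item" starts with a digit,
# which is not a legal XML name.
SAMPLE = '<resultObject><1st_item>aaaaa</1st_item></resultObject>'


def parses_with_stdlib(xmlstr):
    """Return True if ElementTree accepts the string, False otherwise."""
    try:
        ET.fromstring(xmlstr)
        return True
    except ET.ParseError:
        return False


print(parses_with_stdlib(SAMPLE))
print(parses_with_stdlib('<resultObject><item>aaaaa</item></resultObject>'))
```

The first call is rejected by the parser and the second is accepted, which is why a regex-based workaround is used for these responses.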
#!/usr/bin/env python
# encoding: utf-8
# Python 2 code: relies on the print statement and str.decode()
import re


class xmlparse:

    def __init__(self, xmlstr):
        self.xmlstr = xmlstr
        self.xmldom = self.__convet2utf8()
        self.xmlnodelist = []
        self.xpath = ''

    def __convet2utf8(self):
        # Strip the <?xml ...?> declaration and convert the body to UTF-8
        headstr = self.__get_head()
        xmldomstr = self.xmlstr.replace(headstr, '')
        if 'gbk' in headstr:
            xmldomstr = xmldomstr.decode('gbk').encode('utf-8')
        elif 'gb2312' in headstr:
            # Decode the stripped body, not self.xmlstr, so the declaration stays removed
            xmldomstr = xmldomstr.decode('gb2312').encode('utf-8')
        return xmldomstr

    def __get_head(self):
        # Non-greedy match, so only the declaration itself is captured
        headpat = r'<\?xml.*?\?>'
        headpatobj = re.compile(headpat)
        headregobj = headpatobj.match(self.xmlstr)
        if headregobj:
            headstr = headregobj.group()
            return headstr
        else:
            return ''

    def parse(self, xpath):
        self.xpath = xpath
        xpatlist = []
        xpatharr = self.xpath.split('/')
        # Turn each path segment into a (regex, index) pair
        for xnode in xpatharr:
            if xnode:
                spcindex = xnode.find('[')
                if spcindex > -1:
                    index = int(xnode[spcindex + 1:-1])
                    xnode = xnode[:spcindex]
                else:
                    index = 0
                temppat = ('<%s>(.*?)</%s>' % (xnode, xnode), index)
                xpatlist.append(temppat)
        # Narrow the text down one segment at a time
        xmlnodestr = self.xmldom
        for xpat, index in xpatlist:
            xmlnodelist = re.findall(xpat, xmlnodestr)
            xmlnodestr = xmlnodelist[index]
            if xmlnodestr.startswith(r'<![CDATA['):
                # Drop the CDATA wrapper: the opening marker and the trailing ]]>
                xmlnodestr = xmlnodestr.replace(r'<![CDATA[', '')[:-3]
        self.xmlnodelist = xmlnodelist
        return xmlnodestr


if '__main__' == __name__:
    xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>'
    xpath1 = '/product_id'
    xpath2 = '/product_id[1]'
    xpath3 = '/a/product_id'
    xp = xmlparse(xmlstr)
    print 'xmlstr:', xp.xmlstr
    print 'xmldom:', xp.xmldom
    print '------------------------------'
    getstr = xp.parse(xpath1)
    print 'xpath:', xp.xpath
    print 'getlist:', xp.xmlnodelist
    print 'getstring:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath2)
    print 'xpath:', xp.xpath
    print 'getlist:', xp.xmlnodelist
    print 'getstring:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath3)
    print 'xpath:', xp.xpath
    print 'getlist:', xp.xmlnodelist
    print 'getstring:', getstr
Output:
xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes"?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
xmldom: <resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
------------------------------
xpath: /product_id
getlist: ['aaaaa', 'bbbbb']
getstring: aaaaa
------------------------------
xpath: /product_id[1]
getlist: ['aaaaa', 'bbbbb']
getstring: bbbbb
------------------------------
xpath: /a/product_id
getlist: ['aaaaa']
getstring: aaaaa
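For readers on Python 3, the path lookup performed by parse() can be sketched as a single function. This is a rough port under stated assumptions, not the author's code: the name `find_by_path` is hypothetical, the input is assumed to be an already decoded `str` with the XML declaration removed, and `re.escape` is an addition that the original does not have.

```python
import re


def find_by_path(xmldom, xpath):
    """Resolve a path such as '/a/product_id[1]' with one regex per segment.

    Returns (last_match_list, selected_text), mirroring xmlnodelist and the
    return value of parse() in the class above.
    """
    text = xmldom
    matches = []
    for segment in xpath.split('/'):
        if not segment:
            continue  # skip the empty piece before the leading '/'
        bracket = segment.find('[')
        if bracket > -1:
            index = int(segment[bracket + 1:-1])
            name = segment[:bracket]
        else:
            index, name = 0, segment
        name = re.escape(name)  # assumption: treat the node name literally
        matches = re.findall('<%s>(.*?)</%s>' % (name, name), text)
        text = matches[index]  # raises IndexError if the node is missing
        if text.startswith('<![CDATA['):
            text = text[len('<![CDATA['):-3]  # drop the wrapper and trailing ]]>
    return matches, text


# The body of the sample string from the article, without the declaration
DOM = ('<resultObject><a><product_id>aaaaa</product_id><product_name>'
       '<![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name>'
       '<![CDATA[bbbbb]]></b></product_name></resultObject>')

for path in ('/product_id', '/product_id[1]', '/a/product_id'):
    print(path, find_by_path(DOM, path))
```

As in the original, a missing node is not handled gracefully; a caller that needs that would have to catch the `IndexError` itself.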
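Because the returned XML is fairly simple and contains no nodes with attributes, it is easy to handle. Testing still turned up one bug, though: when a node is nested inside another node of the same name, the regular expression matches the wrong span. The workaround is to avoid using such nested node names in the xpath; otherwise the only option is to rewrite the parser with a more complex mechanism.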
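The nesting problem can be reproduced in a few lines. The input below is a hypothetical string made up for this illustration; the pattern has the same shape as the one built in parse().

```python
import re

# Hypothetical input: an <item> element nested inside another <item>.
NESTED = '<item><item>inner</item><name>outer</name></item>'

# The lazy group stops at the first closing tag, which belongs to the
# inner element, so the outer element is cut short.
matches = re.findall(r'<item>(.*?)</item>', NESTED)
print(matches)  # ['<item>inner']
```

The captured text is neither the full outer element nor the inner one, which is why the xpath should steer clear of node names that nest inside themselves.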
I hope this article is helpful for your Python programming.