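I have been trying to parse a very large XML file with PHP and XMLReader, but I can't seem to get the results I'm looking for. Basically, I'm searching through a large amount of information, and if a group contains a certain zip code, I want to return that bit of XML, or keep searching until that zip code is found. In essence, I'll be breaking this big file down into just a few small chunks, so instead of looking through thousands or millions of groups of information, it might be 10 or 20.
Here is some of the XML, with comments describing what I'd like to do: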
//search through xml
<lineups country="USA">
    //cache TX02217 as a variable
    <headend headendId="TX02217">
        //cache Grande Gables at The Terrace as a variable
        <name>Grande Gables at The Terrace</name>
        //cache Grande Communications as a variable
        <mso msoId="17541">Grande Communications</mso>
        <marketIds>
            <marketId type="DMA">635</marketId>
        </marketIds>
        //check to see if any of the postal codes are equal to $pc variable that will be set in the PHP
        <postalCodes>
            <postalCode>11111</postalCode>
            <postalCode>22222</postalCode>
            <postalCode>33333</postalCode>
            <postalCode>78746</postalCode>
        </postalCodes>
        //cache Austin to a variable
        <location>Austin</location>
        <lineup>
            //cache all prgSvcID's to an array i.e. 20014,10722
            <station prgSvcId="20014">
                //cache all channels to an array i.e. 002,003
                <chan effDate="2006-01-16" tier="1">002</chan>
            </station>
            <station prgSvcId="10722">
                <chan effDate="2006-01-16" tier="1">003</chan>
            </station>
        </lineup>
        <areasServed>
            <area>
                //cache community to a variable $community
                <community>ThornDale</community>
                <county code="45331" size="D">Milam</county>
                //cache state to a variable i.e. TX
                <state>TX</state>
            </area>
            <area>
                <community>Thrall</community>
                <county code="45491" size="B">Williamson</county>
                <state>TX</state>
            </area>
        </areasServed>
    </headend>
    //if any of the postal codes matched $pc
    //echo back the xml from <headend> to </headend>
    //if none of the postal codes matched $pc
    //clear variables and move to next <headend>
    <headend>
        etc etc etc
    </headend>
    <headend>
        etc etc etc
    </headend>
    <headend>
        etc etc etc
    </headend>
</lineups>
PHP:
<?php
$pc = "78746";
$xmlfile = "myFile.xml";
$reader = new XMLReader();
$reader->open($xmlfile);
while ($reader->read()) {
    //search to see if groups contain $pc and echo info
}
I know I'm making this harder than it should be, but I'm a bit overwhelmed trying to manipulate such a large file. Any help is appreciated.
Solution
To gain more flexibility with XMLReader, I normally create my own iterators that work on the XMLReader object and provide the steps I need.
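This starts with a simple iteration over all nodes and goes up to an iteration over elements, optionally restricted to a specific name. Let's call that last one XMLElementIterator; it takes the reader and the element name as parameters.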
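To make the idea concrete, here is a minimal sketch of such an element iteration written as a generator. It is not the implementation from the gist used below: `iterateElements` is a made-up helper name and the inline XML is invented sample data. It only illustrates the principle of reading node by node and stopping on elements, optionally filtered by name.

```php
<?php
// Hypothetical helper, not part of the xmlreader-iterators gist.
// Yields the reader each time it is positioned on an element node,
// optionally restricted to elements with the given name.
function iterateElements(XMLReader $reader, ?string $name = null): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue; // skip text, whitespace, comments and end tags
        }
        if ($name !== null && $reader->name !== $name) {
            continue; // an element, but not the one we are looking for
        }
        yield $reader;
    }
}

// Usage on a small in-memory document (invented sample data).
$reader = new XMLReader();
$reader->XML('<lineups><headend headendId="A"><name>x</name></headend><headend headendId="B"/></lineups>');

$ids = [];
foreach (iterateElements($reader, 'headend') as $node) {
    $ids[] = $node->getAttribute('headendId');
}
$reader->close();
```

Because the generator hands out the same reader object on every step, anything needed from the current element has to be read before the loop advances.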
In your scenario, I would create an iterator that returns a SimpleXMLElement for the current element and only considers <headend> elements:
require('xmlreader-iterators.php'); // https://gist.github.com/hakre/5147685

class HeadendIterator extends XMLElementIterator
{
    const ELEMENT_NAME = 'headend';

    public function __construct(XMLReader $reader)
    {
        parent::__construct($reader, self::ELEMENT_NAME);
    }

    /**
     * @return SimpleXMLElement
     */
    public function current()
    {
        // Parse only the current <headend> element, not the whole document.
        return simplexml_load_string($this->reader->readOuterXml());
    }
}
Equipped with this iterator, the rest of your job is mostly a piece of cake. First open the 10 gigabyte file:
$pc = "78746";
$xmlfile = '../data/lineups.xml';
$reader = new XMLReader();
$reader->open($xmlfile);
Then check whether each <headend> element contains the postal code and, if so, display the data / XML:
foreach (new HeadendIterator($reader) as $headend) {
    /* @var $headend SimpleXMLElement */
    // Skip every headend that does not list the postal code.
    if (!$headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        continue;
    }

    echo 'Found,name: ', $headend->name, "\n";
    echo "==========================================\n";
    $headend->asXML('php://stdout');
}
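This does exactly what you are trying to achieve: it iterates over the large document in a memory-friendly way until it finds the element(s) you are interested in. You then process only that concrete element and its XML; XMLReader::readOuterXml() is a nice tool for that.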
Example output:
Found,name: Grande Gables at The Terrace
==========================================
<?xml version="1.0"?>
<headend headendId="TX02217">
    <name>Grande Gables at The Terrace</name>
    <mso msoId="17541">Grande Communications</mso>
    <marketIds>
        <marketId type="DMA">635</marketId>
    </marketIds>
    <postalCodes>
        <postalCode>11111</postalCode>
        <postalCode>22222</postalCode>
        <postalCode>33333</postalCode>
        <postalCode>78746</postalCode>
    </postalCodes>
    <location>Austin</location>
    <lineup>
        <station prgSvcId="20014">
            <chan effDate="2006-01-16" tier="1">002</chan>
        </station>
        <station prgSvcId="10722">
            <chan effDate="2006-01-16" tier="1">003</chan>
        </station>
    </lineup>
    <areasServed>
        <area>
            <community>ThornDale</community>
            <county code="45331" size="D">Milam</county>
            <state>TX</state>
        </area>
        <area>
            <community>Thrall</community>
            <county code="45491" size="B">Williamson</county>
            <state>TX</state>
        </area>
    </areasServed>
</headend>
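
If you would rather not depend on the gist, roughly the same search can be sketched with XMLReader and SimpleXML alone. The following is only a sketch under assumptions, not the code from the answer above: `findHeadendsByPostalCode` and `summarizeHeadend` are made-up names, and it assumes every <headend> is a direct child of the root element, as in the sample XML. It uses XMLReader::next() to jump from one <headend> to its next sibling, so the children of non-matching elements are never visited individually.

```php
<?php
// Hypothetical function: streams $file and yields every <headend>
// that lists $postalCode, as a SimpleXMLElement.
function findHeadendsByPostalCode(string $file, string $postalCode): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    try {
        // Advance to the first <headend> element.
        while ($reader->read() && $reader->name !== 'headend') {
            // keep reading
        }

        while ($reader->name === 'headend') {
            // Only this single element is parsed into memory.
            $headend = simplexml_load_string($reader->readOuterXml());

            foreach ($headend->xpath('postalCodes/postalCode') as $code) {
                if ((string) $code === $postalCode) {
                    yield $headend;
                    break;
                }
            }

            // Jump to the next sibling <headend>, skipping this subtree.
            $reader->next('headend');
        }
    } finally {
        $reader->close();
    }
}

// Hypothetical helper: copies the values the question wants to cache
// into a plain array.
function summarizeHeadend(SimpleXMLElement $headend): array
{
    $stations = [];
    foreach ($headend->xpath('lineup/station') as $station) {
        $stations[] = [
            'prgSvcId' => (string) $station['prgSvcId'],
            'chan'     => (string) $station->chan,
        ];
    }

    return [
        'headendId' => (string) $headend['headendId'],
        'name'      => (string) $headend->name,
        'mso'       => (string) $headend->mso,
        'location'  => (string) $headend->location,
        'stations'  => $stations,
    ];
}

// Example call (path taken from the answer above):
// foreach (findHeadendsByPostalCode('../data/lineups.xml', '78746') as $headend) {
//     print_r(summarizeHeadend($headend));
// }
```

Comparing the postal code in PHP rather than interpolating it into an XPath string avoids quoting problems if the value ever comes from user input.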