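Problem description
I have the following XML file named Comments.xml, which is 15 GB in size. I want to build a dictionary with 2 keys, UserId and Text. Note that many UserId and Text values are missing in the file. I tried the code below, but it crashed because the file is too large for my RAM (13 GB). Is there an efficient way to get the data out of the XML file for data analysis?
Part of the XML file Comments.xml: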
<comments>
<row Id = '1' UserId = '143' Text = 'Hello World'>
<row Id = '2' UserId = '183' Text = 'Trigonometry is important.'>
<row Id = '3' UserId = '5645' Text = 'Mathematics is best.'>
<row Id = '4' UserId = '143' Text = 'Hello stack overflow'>
<row Id = '5' UserId = '143' Text = 'Hello'>
Code
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.iterparse('Comments.xml')
comments = {}  # Dictionary to store the required data
for event, root in tree:
    if ('Text' in root.attrib) and ('UserId' in root.attrib):  # Skip rows with missing values
        Text = root.attrib['Text']
        UserId = root.attrib['UserId']
        comments.update({UserId: Text})  # Add the data to the dictionary
    root.clear()
Expected output
{'143':'Hello World','183':'Trigonometry is important.','5645':'Mathematics is best.','143':'Hello stack overflow','143':'Hello'}
OR
{'UserId':['143','183','5645','143','143'],'Text':['Hello World','Trigonometry is important.','Mathematics is best.','Hello stack overflow','Hello']}
Solution
Another approach.
import io
from simplified_scrapy import SimplifiedDoc

def getComments(fileName):
    comments = {'UserId': [], 'Text': []}
    with io.open(fileName, "r", encoding='utf-8') as file:
        line = file.readline()  # Read data line by line
        while line != '':
            doc = SimplifiedDoc(line)  # Instantiate a doc
            row = doc.getElement('row')  # Get row
            if row:
                comments['UserId'].append(row['UserId'])
                comments['Text'].append(row['Text'])
            line = file.readline()
    return comments

comments = getComments('Comments.xml')  # This dictionary will be very large, too
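More examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
See below. Note that this will not solve your RAM problem, because ET.fromstring loads the whole document into memory. To solve the RAM problem, you need to use SAX:
Simple API for XML (SAX): you register callbacks for the events you are interested in and then let the parser work through the document. This is useful when the document is large or memory is limited, because the file is parsed as it is read from disk and the entire file is never held in memory.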
import xml.etree.ElementTree as ET
from collections import defaultdict

data = defaultdict(list)
xml = '''<comments>
<row Id = "1" UserId = "143" Text = "Hello World"/>
<row Id = "2" UserId = "183" Text = "Trigonometry is important."/>
<row Id = "3" UserId = "5645" Text = "Mathematics is best."/>
<row Id = "4" UserId = "143" Text = "Hello stack overflow"/>
<row Id = "5" UserId = "143" Text = "Hello"/></comments>'''
root = ET.fromstring(xml)
for row in root.findall('.//row'):
    user_id = row.attrib.get('UserId')
    text = row.attrib.get('Text')
    if user_id is not None and text is not None:
        data[user_id].append(text)
print(data)
Output
defaultdict(<class 'list'>,{'143': ['Hello World','Hello stack overflow','Hello'],'183': ['Trigonometry is important.'],'5645': ['Mathematics is best.']})
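The SAX approach described above is not shown in the answer's own code, so here is a minimal sketch of what it might look like with the standard library's xml.sax module. The names RowHandler and parse_comments are hypothetical, chosen only for illustration, and the sketch assumes the rows are well-formed, self-closing <row .../> elements as in the sample above.

```python
import xml.sax
from collections import defaultdict


class RowHandler(xml.sax.ContentHandler):
    """Collects the Text of every <row> that has both UserId and Text."""

    def __init__(self):
        super().__init__()
        self.data = defaultdict(list)

    def startElement(self, name, attrs):
        # Called once per opening tag; attrs behaves like a read-only mapping.
        if name != 'row':
            return
        user_id = attrs.get('UserId')
        text = attrs.get('Text')
        # Skip rows where either attribute is missing.
        if user_id is not None and text is not None:
            self.data[user_id].append(text)


def parse_comments(source):
    """Stream-parse source, which may be a file name or a file-like object."""
    handler = RowHandler()
    xml.sax.parse(source, handler)  # reads incrementally, builds no tree
    return handler.data
```

It could then be called as data = parse_comments('Comments.xml'). The parser itself keeps no tree in memory, but the resulting dictionary still grows with the amount of text collected, so for a 15 GB file it may be necessary to write the rows out (for example to a CSV file or a database) inside startElement instead of appending them to a dictionary.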