Question
A shell script that I run in IPython returns the following object:
results = ['{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\nmorecontent\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}',...]
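where ... stands for more entries similar to the one shown. According to type(), this is an SList. According to the documentation of the shell script, the output is a jsonlines file.
Ultimately, I want to convert this into a CSV in which the keys are the columns and the values are the cell values, with each entry (like the one shown above) forming one row. Like this: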
url date content ...
https://url.com 2020-10-02T21:25:20+00:00 mycontent ...
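I have tried the solution proposed here, but the DataFrame I get back still has the key-value pairs as cell contents. This is what I ran: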
import pandas as pd
df = pd.DataFrame(data=results)
df = df[0].str.split(',',expand=True)
df = df.rename(columns=df.iloc[0])
Answer
The sample data contains a few problems, but once those are fixed, this works:
import json
import pandas as pd
fragment = '{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\\\\nmorecontent\\\\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com\\" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}'
data = json.loads(fragment)
df = pd.DataFrame([data])
df.to_csv('test_out.csv')
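Note: the sample data has been fixed in this example. The changes are:
- the " characters in "source" are now all properly escaped (the quote after mysource.com was missing its backslash)
- the \n in "content" has been escaped as \\\\n; it could also have been \\n, but I assumed you don't want line breaks in the CSV
In a regular Python string literal, \\\\n reaches the JSON parser as an escaped backslash followed by n and decodes to those two literal characters, whereas \\n decodes to a real line break.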
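To make the difference between the escaping variants concrete, here is a minimal sketch using only the standard library. The tiny one-key documents are made up for illustration and are not part of the original data.

```python
import json

# Four backslashes in the Python literal -> the JSON text contains \\n
# -> json.loads yields a backslash followed by the letter n (no line break).
escaped = json.loads('{"content": "a\\\\nb"}')["content"]

# Two backslashes in the Python literal -> the JSON text contains \n
# -> json.loads yields a real line break.
newline = json.loads('{"content": "a\\nb"}')["content"]

# A raw line break inside a JSON string is a control character,
# which the parser rejects by default.
try:
    json.loads('{"content": "a\nb"}')
    raw_accepted = True
except json.JSONDecodeError:
    raw_accepted = False

print(repr(escaped), repr(newline), raw_accepted)
```

The third case is what happens with the unfixed sample data, which is why it cannot be parsed as it stands.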
If results is a list of such strings:
import json
import pandas as pd
results = get_results_somewhere()
df = pd.DataFrame([json.loads(r) for r in results])
df.to_csv('test_out.csv')
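For readers who want to see the record-to-row mapping without pandas, the following is a rough standard-library sketch of the same idea: each JSON line becomes one row and the keys become the header. The two short records are invented stand-ins for the real entries, and this is an illustrative alternative, not the approach used in the answer.

```python
import csv
import io
import json

# Hypothetical, shortened stand-ins for the real jsonlines entries.
lines = [
    '{"url": "https://url.com", "id": 123, "lang": "en"}',
    '{"url": "https://url.com/2", "id": 124, "lang": "de"}',
]
records = [json.loads(line) for line in lines]

# Collect the column names as the union of all keys, in first-seen order.
fieldnames = []
for record in records:
    for key in record:
        if key not in fieldnames:
            fieldnames.append(key)

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Nested values such as "user" would be written as the text of a Python dict here, so flattening them is a separate decision.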
If the errors in your input are limited to the ones mentioned above, you can fix them like this:
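def fix_input(s):
    s = regex.sub('\n', '\\\\\\\\n', s)  # raw line break -> escaped backslash + n
    s = regex.sub(r'(?<=<[^>]*?)(\\)', '', s)  # drop backslashes inside <...>
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', s)  # escape every " inside <...>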
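This "fixes" the line breaks, un-escapes any previously escaped \" inside <>, and then replaces every " inside <> with \". If you have trouble understanding why the regular expressions work the way they do, that is probably a separate question.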
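The same three steps can be tried on a small input without installing anything. The sketch below is a swapped-in variant that uses the standard `re` module with a replacement callback instead of a variable-length lookbehind; `fix_input_re` and the shortened `broken` string are hypothetical names and data for illustration. It only touches tags that are closed with `>`, so it is not guaranteed to behave identically on every input.

```python
import json
import re

def fix_input_re(s):
    """Hypothetical stdlib-only variant of the repair steps."""
    def fix_tag(match):
        # Drop backslashes already present in the tag, then escape each quote.
        tag = match.group(0).replace('\\', '')
        return tag.replace('"', '\\"')

    # Turn each raw line break into an escaped backslash followed by n.
    s = s.replace('\n', '\\\\n')
    # Repair the quotes inside every <...> span.
    return re.sub(r'<[^>]*>', fix_tag, s)

# Shortened record with the same two defects as the sample data:
# a raw line break and an unescaped quote inside the tag.
broken = '{"content": "a\nb", "source": "<a href=\\"x.com" rel=\\"y\\">t</a>"}'
fixed = json.loads(fix_input_re(broken))
print(fixed)
```

After the repair, "content" holds a literal backslash followed by n rather than a line break, matching the choice made for the fixed sample data.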
Putting it all together:
import json
import regex
import pandas as pd
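def fix_input(s):
    s = regex.sub('\n', '\\\\\\\\n', s)  # raw line break -> escaped backslash + n
    s = regex.sub(r'(?<=<[^>]*?)(\\)', '', s)  # drop backslashes inside <...>
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', s)  # escape every " inside <...>
results = get_results_somewhere()
# fix_input expects a single string, so apply it to each entry
fixed_results = [fix_input(r) for r in results]
df = pd.DataFrame([json.loads(r) for r in fixed_results])
df.to_csv('test_out.csv')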
Note: because the patterns use a variable-length lookbehind, this uses the third-party regex module instead of re.
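As a quick check of why the standard module is not enough, this small sketch tries to compile one of the patterns with `re`, which only accepts fixed-width lookbehinds.

```python
import re

# The lookbehind contains [^>]*?, which can match a varying number of
# characters, so the standard library refuses to compile the pattern.
try:
    re.compile(r'(?<=<[^>]*?)(")')
    compiled_with_re = True
except re.error:
    compiled_with_re = False

print(compiled_with_re)
```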