Converting an slist to CSV

Question

A shell script that I run in IPython returns the following object:

results = ['{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\nmorecontent\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}',...]

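Here ... stands for more entries similar to the one shown. According to type(), this is an SList. According to the documentation of the shell script, its output is a JSON Lines file.

Ultimately, I want to convert this to CSV, where the keys become the columns, the values become the cell values, and each entry (like the one shown above) becomes one row. Like this: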

url              date                       content   ...
https://url.com  2020-10-02T21:25:20+00:00  mycontent ...

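I have already tried the solution proposed here, using the code below, but the DataFrame I get back contains key-value pairs in its cells: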

import pandas as pd
df = pd.DataFrame(data=results)
df = df[0].str.split(',',expand=True)
df = df.rename(columns=df.iloc[0])

Answer

The example data has several problems, but once those are fixed, this works:

import json
import pandas as pd

fragment = '{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\\\\nmorecontent\\\\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com\\" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}'

data = json.loads(fragment)
df = pd.DataFrame([data])
df.to_csv('test_out.csv')

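Note: the sample data has been fixed in this example. The changes are:

the " characters inside the "source" value are now correctly escaped
\n is escaped as \\\\n; \\n would also work, but I assume you don't want line breaks in the CSV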
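To make the difference between the two escapings concrete, here is a minimal sketch using only the standard-library json module. The one-key JSON strings are made up for illustration and are not the real data.

```python
import json

# Source literal '\\\\n' holds the JSON text \\n, which decodes to a
# backslash followed by the letter n (no line break in the value).
kept_escaped = json.loads('{"content": "a\\\\nb"}')["content"]

# Source literal '\\n' holds the JSON text \n, which decodes to a real
# line break inside the value.
real_newline = json.loads('{"content": "a\\nb"}')["content"]

# A raw, unescaped line break inside a JSON string is rejected outright,
# which is why the unmodified sample data cannot be parsed.
try:
    json.loads('{"content": "a\nb"}')
    raw_parses = True
except json.JSONDecodeError:
    raw_parses = False
```

With the first form the CSV cell stays on one line. With the second, the cell contains a line break, which pandas quotes when writing the file.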

If the results are a list of such strings:

import json
import pandas as pd

results = get_results_somewhere()

df = pd.DataFrame([json.loads(r) for r in results])
df.to_csv('test_out.csv')
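One thing to be aware of: nested objects such as "user" end up as whole dictionaries inside a single cell. If separate columns are preferred, pandas.json_normalize may help. The following is only a sketch: the two shortened JSON lines in sample_results are made up for illustration, and whether flattened column names like user.username suit your CSV is an assumption.

```python
import json
import pandas as pd

# Two shortened, made-up JSON lines standing in for the real results.
sample_results = [
    '{"url": "https://url.com", "id": 123, "user": {"username": "somename", "verified": false}}',
    '{"url": "https://url.com/2", "id": 124, "user": {"username": "othername", "verified": true}}',
]

records = [json.loads(line) for line in sample_results]

# json_normalize expands nested dicts into dotted column names.
flat = pd.json_normalize(records)

# With no path given, to_csv returns the CSV as a string;
# index=False keeps the row index out of the output.
csv_text = flat.to_csv(index=False)
```

Values that are lists (such as "media" or "mentionedUsers" in the real data) are not expanded this way and stay as list objects in their cells.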

If the errors in your input are limited to the ones described above, you can fix them as follows:

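def fix_input(s):
    s = regex.sub('\n', '\\\\\\\\n', s)              # line breaks -> \\n in the JSON text
    s = regex.sub(r'(?<=<[^>]*?)(\\)', '', s)        # drop backslashes inside <...>
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', s)   # escape every " inside <...>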

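This "fixes" the line breaks, then removes the backslash from any previously escaped \" inside <>, and finally replaces every " inside <> with \". If it is unclear why the regular expressions work this way, that is probably best asked as a separate question.

The whole thing: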
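If installing a third-party package is not an option, roughly the same repair can be sketched with the standard-library re module by matching whole <...> tags and fixing each one in a callback. This is a different technique from the lookbehind-based one above, not the answer's code. The name fix_input_stdlib and the shortened broken string are made up for illustration. It assumes every tag is closed with >, whereas the lookbehind version also touches text after an unclosed <.

```python
import json
import re


def fix_input_stdlib(s):
    """Hypothetical stdlib-only variant of fix_input."""
    def fix_tag(match):
        tag = match.group(0)
        # Drop any existing backslashes, then escape every double quote.
        return tag.replace('\\', '').replace('"', '\\"')

    # Replace each real line break with two backslashes plus n, so that
    # JSON decodes it to the text \n rather than to a line break.
    s = s.replace('\n', '\\\\n')
    # Repair the quoting inside each <...> tag.
    return re.sub(r'<[^>]*>', fix_tag, s)


# Shortened stand-in for one broken entry: a raw line break in "content"
# and an unescaped quote after mysource.com in "source".
broken = '{"content": "a\nb", "source": "<a href=\\"mysource.com" rel=\\"y\\">t</a>"}'
row = json.loads(fix_input_stdlib(broken))
```

After the repair, json.loads accepts the string, and row["source"] holds the anchor tag with plain double quotes.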

import json
import regex
import pandas as pd


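def fix_input(s):
    # Innermost call runs first: fix line breaks, strip backslashes inside <...>,
    # then escape every double quote inside <...>.
    return regex.sub('(?<=<[^>]*?)(")',r'\\"',regex.sub(r'(?<=<[^>]*?)(\\)','',regex.sub('\n','\\\\\\\\n',s)))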


results = get_results_somewhere()
# fix_input works on a single string, so apply it to each entry in turn
fixed_results = [fix_input(r) for r in results]

df = pd.DataFrame([json.loads(r) for r in fixed_results])
df.to_csv('test_out.csv')

Note: this uses the third-party regex module rather than re, because the patterns rely on variable-length lookbehind.

jsonlines python