Converting an slist to csv

Problem description

A shell script that I run in IPython returns the following object:

results = ['{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\nmorecontent\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}',...]

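The ... stands for more entries similar to the one shown. According to type(), this is an slist. According to the documentation of the shell script, it is a jsonlines file.

Ultimately, I want to convert it into a csv object in which the keys are the columns and the values are the cell values, with each entry (like the one shown above) forming one row. Something like this: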

url              date                       content   ...
https://url.com  2020-10-02T21:25:20+00:00  mycontent ...

I have tried the solution proposed here, as shown below, but the dataframe I get back has key-value pairs in its cells:

import pandas as pd
df = pd.DataFrame(data=results)
df = df[0].str.split(',',expand=True)
df = df.rename(columns=df.iloc[0]) 

Solution

The example data contains several problems, but once those are fixed the task itself is straightforward:

import json
import pandas as pd

fragment = '{"url": "https://url.com","date": "2020-10-02T21:25:20+00:00","content": "mycontent\\\\nmorecontent\\\\nmorecontent","renderedContent": "myrenderedcontent","id": 123,"username": "somename","user": {"username": "somename","displayname": "some name","description": "my description","rawDescription": "my description","descriptionUrls": [],"verified": false,"created": "2020-02-00T02:00:00+00:00","followersCount": 1,"friendsCount": 1,"statusesCount": 1,"favouritesCount": 1,"listedCount": 1,"mediaCount": 1,"location": "","protected": false,"linkUrl": null,"linkTcourl": null,"profileImageUrl": "https://myprofile.com/mypic.jpg","profileBannerUrl": "https://myprofile.com/mypic.jpg"},"outlinks": [],"outlinks2": "","outlinks3": [],"outlinks4": "","replyCount": 0,"retweetCount": 0,"likeCount": 0,"quoteCount": 0,"conversationId": 123,"lang": "en","source": "<a href=\\"mysource.com\\" rel=\\"something\\">Sometext</a>","media": [{"previewUrl": "smallpic.jpg","fullUrl": "largepic.jpg","type": "photo"}],"forwarded": null,"quoted": null,"mentionedUsers": [{"username": "name1","displayname": "name 1","id": 345,"description": null,"rawDescription": null,"descriptionUrls": null,"verified": null,"created": null,"followersCount": null,"friendsCount": null,"statusesCount": null,"favouritesCount": null,"listedCount": null,"mediaCount": null,"location": null,"protected": null,"link2url": null,"profileImageUrl": null,"profileBannerUrl": null}]}'

data = json.loads(fragment)
df = pd.DataFrame([data])
df.to_csv('test_out.csv')

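Note: the example data has been fixed in this example, with the following changes:

  • the " characters in "source" are now properly escaped
  • \n was escaped as \\\\n; \\n would also work, but I assume you don't want newlines in the csv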
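The effect of the two newline escapes mentioned above can be checked with the standard json module alone. This is a minimal sketch on a shortened, hypothetical one-key record, not the full sample data:

```python
import json

# Shortened, hypothetical records; only the "content" key is kept.
raw = '{"content": "mycontent\nmorecontent"}'         # a real newline inside the JSON string
escaped = '{"content": "mycontent\\nmorecontent"}'    # JSON sees \n and decodes it to a newline
doubled = '{"content": "mycontent\\\\nmorecontent"}'  # JSON sees \\n and decodes it to backslash + n

try:
    json.loads(raw)
    raw_is_valid = True
except json.JSONDecodeError:
    # Unescaped control characters are not allowed inside JSON strings
    raw_is_valid = False

with_newline = json.loads(escaped)["content"]
without_newline = json.loads(doubled)["content"]

print(raw_is_valid)
print(repr(with_newline))
print(repr(without_newline))
```

The doubled form keeps each record on a single csv line, which is why it was chosen for the fixed fragment.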

If results is a list of such strings:

import json
import pandas as pd

results = get_results_somewhere()

df = pd.DataFrame([json.loads(r) for r in results])
df.to_csv('test_out.csv')
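With pd.DataFrame, nested values such as "user" end up as a dict inside a single cell. If one column per nested key is preferred, the records can be flattened first. The following is only a sketch of that idea using the standard library; the helper names flatten and jsonlines_to_csv and the sample lines are made up for illustration, and it assumes every line is already valid JSON:

```python
import csv
import io
import json


def flatten(record, prefix=''):
    """Flatten nested dicts into dotted keys; lists are kept as JSON text."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def jsonlines_to_csv(lines):
    rows = [flatten(json.loads(line)) for line in lines]
    # Collect column names in first-seen order, since entries may differ in keys
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical, already-valid sample lines
sample = [
    '{"url": "https://url.com", "id": 123, "user": {"username": "somename"}, "outlinks": []}',
    '{"url": "https://url.com/2", "id": 124, "user": {"username": "other"}, "outlinks": []}',
]
csv_text = jsonlines_to_csv(sample)
print(csv_text)
```

pandas also offers pd.json_normalize, which flattens nested dicts into dotted column names in a similar way, if staying within pandas is preferred.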

If the errors in your input are limited to the ones described above, you can fix them like this:

def fix_input(s):
    return regex.sub('(?<=<[^>]*?)(")',r'\\"',regex.sub(r'(?<=<[^>]*?)(\\)','',regex.sub('\n','\\\\\\\\n',s)))

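This first strips the backslashes from any previously escaped \" inside <>, then replaces every " inside <> with \", and "fixes" the newlines. If you have trouble understanding why the regular expressions work the way they do, that is probably a separate question.

The whole thing: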

import json
import regex
import pandas as pd


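def fix_input(s):
    # innermost call: replace raw newlines with an escaped backslash followed by n
    # middle call: drop backslashes that appear inside <...>
    # outermost call: escape every " that appears inside <...>
    return regex.sub('(?<=<[^>]*?)(")',r'\\"',regex.sub(r'(?<=<[^>]*?)(\\)','',regex.sub('\n','\\\\\\\\n',s)))


results = get_results_somewhere()
# results is a list of strings, so each entry is fixed separately
fixed_results = [fix_input(r) for r in results]

df = pd.DataFrame([json.loads(r) for r in fixed_results])
df.to_csv('test_out.csv')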

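Note: this uses the third-party regex module instead of re because the patterns rely on a variable-length lookbehind, which re does not support.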
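If installing regex is not an option, roughly the same repair can be sketched with the standard re module by matching each whole <...> tag and fixing it in a callback, which avoids the lookbehind. The name fix_input_stdlib and the shortened sample are hypothetical. Unlike the version above, this assumes every tag is closed by a > and contains no > of its own:

```python
import json
import re


def fix_input_stdlib(s):
    # Replace raw newlines with two backslashes followed by n
    s = s.replace('\n', '\\\\n')

    def fix_tag(match):
        # Drop any existing backslashes inside the tag, then escape every quote
        tag = match.group(0).replace('\\', '')
        return tag.replace('"', '\\"')

    # A function replacement is inserted literally, so no template escaping applies
    return re.sub(r'<[^>]*>', fix_tag, s)


# Shortened, hypothetical record with the same two problems as the sample data
broken = '{"content": "a\nb", "source": "<a href=\\"x.com" rel=\\"y\\">t</a>"}'
parsed = json.loads(fix_input_stdlib(broken))
print(parsed)
```

Matching the whole tag keeps the quote handling in plain string methods, at the cost of not repairing a tag that is never closed.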