Get item values from a nested dictionary inside Pandas df rows, then drop the rest

Problem description

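I implemented allennlp's OIE (open information extraction), which extracts subject, predicate, and object information (in the form of ARG0, V, ARG1, etc.) embedded in nested strings. However, I need to make sure that each output stays linked to the ID of the original sentence.

I generated the following Pandas dataframe, where OIE output contains the raw output of the allennlp algorithm.

Current output: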

sentence	ID	OIE output
'The girl went to the cinema'	'abcd'	{'verbs':[{'verb': 'went','description':'[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]}
'He is right and he is an engineer'	'efgh'	{'verbs':[{'verb': 'is','description':'[ARG0: He] [V: is] [ARG1:right]'},{'verb': 'is','description':'[ARG0: He] [V: is] [ARG1: an Engineer]'}]}

The code that produces the table above:

oie_l = []

for sent in sentences:
    oie_pred = predictor_oie.predict(sentence=sent)  # allennlp OIE predictor
    for d in oie_pred['verbs']:  # get to the nested info
        d.pop('tags')  # remove unnecessary info
    oie_l.append(oie_pred)

df['OIE output'] = oie_l  # add new column to df (named to match the table above)

Desired output:

sentence	ID	OIE Triples
'The girl went to the cinema'	'abcd'	'[ARG0: The girl] [V: went] [ARG1:to the cinema]'
'He is right and he is an engineer'	'efgh'	'[ARG0: He] [V: is] [ARG1:right]'
'He is right and he is an engineer'	'efgh'	'[ARG0: He] [V: is] [ARG1: an Engineer]'

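Approach:

To get the desired 'OIE Triples' output, I considered converting the initial 'OIE output' to a string and then using a regular expression to extract the ARGs. However, I am not sure this is the best solution, since the "ARG" labels can vary. Another approach would be to iterate down to the nested values under description:, replace what is currently in OIE output with those values as a list, and then apply df.explode() to expand it, so that the correct sentence and ID columns stay linked to each triple after the "explosion".

Any suggestions are appreciated.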

Solution

Your second idea should do the trick:

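import ast

# Only needed when the column holds string representations of dicts
df["OIE Triples"] = df["OIE output"].apply(ast.literal_eval)

# Collect the "description" of every entry under "verbs" into a list
df["OIE Triples"] = df["OIE Triples"].apply(lambda val: [a_dict["description"]
                                                         for a_dict in val["verbs"]])
# One row per description; the raw column is no longer needed
df = df.explode("OIE Triples").drop(columns="OIE output")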

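In case the "OIE output" values are not truly dicts but strings, we convert them to dicts via ast.literal_eval. (So if they are already dicts, you can skip the import and the ast.literal_eval line; the next step should then read from df["OIE output"] instead of df["OIE Triples"].)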
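If you are not sure which of the two cases applies, one option is a small guard that parses only strings. This is a minimal sketch; `to_dict` is a hypothetical helper name, not something from pandas or allennlp, and the two sample values are adapted from the question.

```python
import ast


def to_dict(value):
    """Return value as a dict, parsing it first if it is a string."""
    if isinstance(value, str):
        # literal_eval only accepts Python literals, so it does not run arbitrary code
        return ast.literal_eval(value)
    return value


# The same cell, once stored as a string and once as a real dict
as_string = "{'verbs': [{'verb': 'went', 'description': '[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]}"
as_dict = {"verbs": [{"verb": "went", "description": "[ARG0: The girl] [V: went] [ARG1:to the cinema]"}]}

parsed = [to_dict(as_string), to_dict(as_dict)]
```

On the dataframe this would be used as df["OIE output"].apply(to_dict) in place of the plain ast.literal_eval call.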

Then, for each value of the series, we build a list made up of the "description" entries of the dicts stored under the outermost dict's "verbs" key.
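To see what that comprehension does to a single cell, here is a minimal sketch on the second row's value from the question, outside of pandas. The variable names `val` and `descriptions` are only for illustration.

```python
# One cell of the "OIE output" column, copied from the question
val = {"verbs": [
    {"verb": "is", "description": "[ARG0: He] [V: is] [ARG1:right]"},
    {"verb": "is", "description": "[ARG0: He] [V: is] [ARG1: an Engineer]"},
]}

# Same comprehension as inside the lambda: one description per verb entry
descriptions = [a_dict["description"] for a_dict in val["verbs"]]
print(descriptions)
```

A sentence with two extracted verbs therefore yields a two-element list, which is what explode later turns into two rows.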

Finally, we explode this list of descriptions and drop the "OIE output" column, since it is no longer needed.

This gives:

                              sentence      ID                                      OIE Triples
0        'The girl went to the cinema'  'abcd'  [ARG0: The girl] [V: went] [ARG1:to the cinema]
1  'He is right and he is an engineer'  'efgh'                  [ARG0: He] [V: is] [ARG1:right]
1  'He is right and he is an engineer'  'efgh'            [ARG0: He] [V: is] [ARG1:an engineer]
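For anyone who wants to try this without running allennlp, here is a minimal self-contained sketch. The dataframe is built by hand from the two example rows, assuming the column already holds real dicts (so the ast.literal_eval step is skipped). The final reset_index call is an optional addition, not part of the steps described here, for the case where the repeated index label 1 is unwanted.

```python
import pandas as pd

# Hand-built stand-in for the dataframe in the question (values are real dicts)
df = pd.DataFrame({
    "sentence": ["The girl went to the cinema", "He is right and he is an engineer"],
    "ID": ["abcd", "efgh"],
    "OIE output": [
        {"verbs": [{"verb": "went",
                    "description": "[ARG0: The girl] [V: went] [ARG1:to the cinema]"}]},
        {"verbs": [{"verb": "is", "description": "[ARG0: He] [V: is] [ARG1:right]"},
                   {"verb": "is", "description": "[ARG0: He] [V: is] [ARG1:an engineer]"}]},
    ],
})

# One list of descriptions per row
df["OIE Triples"] = df["OIE output"].apply(
    lambda val: [a_dict["description"] for a_dict in val["verbs"]]
)

# One row per description; sentence and ID are repeated alongside each triple
out = df.explode("OIE Triples").drop(columns="OIE output")

# explode keeps the source row's index label, so renumber the rows if needed
out = out.reset_index(drop=True)
print(out)
```

Since explode repeats the other columns for every list element, each triple stays attached to its sentence and ID without any extra bookkeeping.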

