Problem description
I have a dataset of articles from PubMed. The DataFrame looks like this:
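import pandas as pd

df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "discussion", "other section", "one more section", "conclusion"]],
    # Assumed values: one list of sentences per section name
    # (the original literal is incomplete).
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text"], ["some text"], ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text"], ["some text"], ["some text", "some text"]]]
})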
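So basically, the section_names column holds the names of all sections in an article. The sections column holds the actual text, as one list per section name in section_names. As a first step, I wanted to put every section in its own column. So I did this: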
df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
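The NaN values in the result make sense: a NaN means that section does not exist in that particular article, and every column has at least one non-NaN value. For articles with many different section names, the number of columns grows sharply. In the original dataset, I actually have around 10,000 columns.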
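To make the column growth concrete, here is a small pure-Python sketch of what the dict(zip(*pair)) step does for each article. The two toy articles and their texts are assumed values for illustration, not the real dataset.

```python
# Toy stand-ins for two rows of the DataFrame (assumed values).
section_names = [
    ["introduction", "methods", "section1"],
    ["introduction", "discussion", "conclusion"],
]
sections = [
    [["intro text"], ["some text"], ["more text"]],
    [["intro text"], ["some text"], ["some text"]],
]

# Each pair is (names, texts); zip(*pair) lines a name up with its text list,
# and dict(...) turns that into one record per article.
records = [dict(zip(*pair)) for pair in zip(section_names, sections)]

# Building a DataFrame from such records creates one column per distinct key,
# so the column set is the union of all section names seen in any article.
all_columns = []
for record in records:
    for name in record:
        if name not in all_columns:
            all_columns.append(name)

print(records[0])
print(all_columns)
```

With only two articles the union already has five names, which is how a large corpus with free-form section titles can end up with thousands of columns.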
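What I want now is to merge the columns and end up with at most 4 columns (introduction, methods, discussion, conclusion). The rule I have in mind is this: every section that comes after methods, up to discussion, is merged into methods, and every section that comes after discussion, up to conclusion, is merged into discussion.
Applying this rule to our df: for the first article, section1 and another section are merged into methods. For the second entry, other section and one more section should be merged into discussion.
How can I do this?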
Solution
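One option is to build a column index from the positions of the desired columns, then aggregate each group's values, row by row, into a single list: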
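import itertools
import numpy as np

# df here is the expanded frame (df_col in the question), one column per section name.
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# Note: groupby(..., axis=1) is deprecated since pandas 2.1.
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        # Concatenate the non-NaN lists in this row; an empty result becomes NaN.
        lambda r: list(itertools.chain.from_iterable(r.dropna())) or np.nan,
        axis=1)
)
new_df.columns = desired_columns
new_df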
Output:
   introduction                    methods                                                  discussion                                    conclusion
0  [intro text, another sentence]  [some text, some text, more text, some text, some text]  [some text, some text]                        NaN
1  [intro text, another sentence]  NaN                                                      [some text, some text, some text, some text]  [some text, some text]
The column index is created with:
df.columns.isin(desired_columns).cumsum()
which produces the following groups:
[1 2 2 2 3 3 3 4]
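The grouping step can be checked in isolation. The sketch below rebuilds only the column labels of the example frame (typed out by hand here, so the frame itself is not needed) and shows the boolean mask next to its cumulative sum.

```python
import pandas as pd

desired_columns = ["introduction", "methods", "discussion", "conclusion"]
columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one more section", "conclusion",
])

# True where a column is one of the four target sections.
is_target = columns.isin(desired_columns)

# The running count of True values goes up by one at each target column,
# so every other column inherits the number of the target before it.
groups = is_target.cumsum()

print(is_target.tolist())
print(groups.tolist())
```

Note that the numbering depends on column order: a column that appears before the first target column would get group 0, a case the example data does not contain.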
Complete working example:
import itertools
import numpy as np
import pandas as pd
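df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "discussion", "other section", "one more section", "conclusion"]],
    # Assumed values: one list of sentences per section name.
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text"], ["some text"], ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text"], ["some text"], ["some text", "some text"]]]
})
# Expand to one column per section name.
df = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# Note: groupby(..., axis=1) is deprecated since pandas 2.1.
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna())) or np.nan,
        axis=1)
)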
new_df.columns = desired_columns
print(new_df.to_string())
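As a possible alternative, the same merge rule can be applied to each article's own lists before any expansion into columns. This is only a sketch: merge_sections is a hypothetical helper, the sample texts are made up, and dropping sections that precede the first target section is an assumption rather than something the question specifies.

```python
def merge_sections(names, texts, targets):
    """Fold each section's text into the most recent target section."""
    merged = {}
    current = None  # target section that following sections are folded into
    for name, text in zip(names, texts):
        if name in targets:
            current = name
        if current is None:
            continue  # assumption: drop sections that precede any target
        merged.setdefault(current, []).extend(text)
    return merged


targets = ["introduction", "methods", "discussion", "conclusion"]
names = ["introduction", "methods", "section1", "another section", "discussion"]
texts = [["intro text"], ["some text", "more text"], ["s1 text"],
         ["other text"], ["final text"]]

row = merge_sections(names, texts, targets)
print(row)
```

Because this works from each article's own section order, it does not rely on the column order of the expanded frame and never builds the wide intermediate table. Collecting one such dict per article and passing the list to pd.DataFrame with columns=targets should give the four columns, with NaN where an article lacks a section.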