Rule-based column merging in pandas

Problem description

I have a dataset of articles from PubMed. The data frame looks like this:

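df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "discussion", "other section", "one  more section", "conclusion"],
    ],
    # one list of sentences per section name (placeholder texts)
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"], ["some text"], ["some text"], ["some text", "some text"]],
        [["intro text", "more text", "some text"], ["some text"], ["some text"], ["some text"], ["some text", "some text"]],
    ],
})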

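So basically, the section_names column holds the names of all sections in an article, and the sections column holds the actual text for each of those names, as one list per section. As a first step, I wanted each section in its own column, so I did this:

df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])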

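The NaN values make sense: a NaN means that section does not exist in that article, and every column has at least one non-NaN value. For articles with many different section names, the number of columns grows dramatically; in the original dataset I actually have about 10,000 columns.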
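To make the shape of df_col concrete, here is a minimal sketch of that step. The section names are the ones from the question; the texts are short placeholder strings assumed purely for illustration.

```python
import pandas as pd

# Placeholder texts (assumed); only the section names come from the question.
df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "discussion", "other section", "one  more section", "conclusion"],
    ],
    "sections": [
        [["intro a"], ["methods a"], ["section1 a"], ["another a"], ["discussion a"]],
        [["intro b"], ["discussion b"], ["other b"], ["one more b"], ["conclusion b"]],
    ],
})

# One dict per article (name -> text), then one column per distinct name.
df_col = pd.DataFrame(
    [dict(zip(names, texts)) for names, texts in zip(df["section_names"], df["sections"])]
)

print(df_col.shape)         # 2 articles, 8 distinct section names
print(df_col.isna().sum())  # NaN count per column
```

Every distinct section name in the corpus adds a column, which is how the real dataset ends up with roughly 10,000 of them.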

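What I want now is to merge the columns and end up with at most 4 columns (introduction, methods, discussion, conclusion). I want to say something like:

After the section name methods, merge all other sections up to discussion into methods, and merge all sections after discussion up to conclusion into discussion.

Applying this rule to our df, for the first article section1 and another section would be merged into methods. For the second entry, other section and one more section should be merged into discussion.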
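One way to read the rule is "every section belongs to the most recent main section seen before it". The plain-Python sketch below applies that reading to one article at a time. The helper name `merge_sections` and the placeholder texts are assumptions made for illustration, and this is not the approach used in the answer; it only shows what the rule is expected to produce.

```python
MAIN_SECTIONS = ["introduction", "methods", "discussion", "conclusion"]


def merge_sections(names, texts, main_sections=MAIN_SECTIONS):
    """Fold each section's sentences into the most recent main section."""
    merged = {}
    current = None  # main section that is currently collecting text
    for name, sentences in zip(names, texts):
        if name in main_sections:
            current = name
        if current is None:
            continue  # text before the first main section is skipped in this sketch
        merged.setdefault(current, []).extend(sentences)
    return merged


# Placeholder texts (assumed), section names as in the question.
first = merge_sections(
    ["introduction", "methods", "section1", "another section", "discussion"],
    [["intro a"], ["methods a"], ["section1 a"], ["another a"], ["discussion a"]],
)
second = merge_sections(
    ["introduction", "discussion", "other section", "one  more section", "conclusion"],
    [["intro b"], ["discussion b"], ["other b"], ["one more b"], ["conclusion b"]],
)
print(first)
print(second)
```

In the first article section1 and another section end up under methods; in the second, other section and one more section end up under discussion.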

How can I do this?

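Solution

One option is to build column group labels from the positions of the desired columns, then, for each row, aggregate the values in each group into a single list: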

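# df is the wide frame with one column per section name (df_col in the question)
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    # per group of columns: flatten each row's non-NaN lists into one list;
    # an empty result (no section present in that row) becomes NaN
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan, axis=1)
)
new_df.columns = desired_columns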

new_df:

                         introduction                                                  methods                         discussion              conclusion
0      [intro text, another sentence]  [some text, some text, more text, some text, some text]             [some text, some text]                     NaN
1  [intro text, more text, some text]                                                      NaN  [some text, some text, some text]  [some text, some text]

The column group labels are created with:

df.columns.isin(desired_columns).cumsum()

which yields the following groups:

[1 2 2 2 3 3 3 4]
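To see why this gives one label per target column, the sketch below recomputes the labels from the eight column names alone, assuming the columns appear in the order shown (the order in which the section names first occur across the articles).

```python
import pandas as pd

columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one  more section", "conclusion",
])
desired_columns = ["introduction", "methods", "discussion", "conclusion"]

is_main = columns.isin(desired_columns)  # True where a new group starts
labels = is_main.cumsum()                # running count of group starts

print(is_main.astype(int))  # [1 1 0 0 1 0 0 1]
print(labels)               # [1 2 2 2 3 3 3 4]
```

Each desired column bumps the running count by one, so every column after it inherits its label until the next desired column. This depends on column order: a column that appears before the first desired column would get label 0, and assigning `desired_columns` as the new column names assumes all four groups are present in that order.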

Complete working example:

import itertools

import numpy as np
import pandas as pd

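df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "discussion", "other section", "one  more section", "conclusion"],
    ],
    # one list of sentences per section name (placeholder texts)
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"], ["some text"], ["some text"], ["some text", "some text"]],
        [["intro text", "more text", "some text"], ["some text"], ["some text"], ["some text"], ["some text", "some text"]],
    ],
})

df = pd.DataFrame(
    [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])

desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan, axis=1)
)
new_df.columns = desired_columns
print(new_df.to_string())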
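Note that `groupby(..., axis=1)` is deprecated in newer pandas releases (2.1 and later, as far as I know), so the example above may emit a warning or stop working there. Below is a minimal sketch of the same grouping idea without `axis=1`. The helper name `merge_column_groups` and the placeholder texts are assumptions for illustration and are not part of the original answer.

```python
import itertools

import numpy as np
import pandas as pd


def merge_column_groups(wide, desired_columns):
    """Merge each run of columns into the desired column that starts it."""
    labels = wide.columns.isin(desired_columns).cumsum()
    merged = {}
    for label in np.unique(labels[labels > 0]):
        mask = labels == label
        target = wide.columns[mask][0]  # the desired column that opens this run
        merged[target] = [
            # Keep only real section texts (lists); NaN marks a missing section.
            list(itertools.chain.from_iterable(v for v in row if isinstance(v, list)))
            or np.nan
            for row in wide.loc[:, mask].to_numpy()
        ]
    return pd.DataFrame(merged, index=wide.index)


# Placeholder data (assumed): section names from the question, made-up texts.
names = [
    ["introduction", "methods", "section1", "another section", "discussion"],
    ["introduction", "discussion", "other section", "one  more section", "conclusion"],
]
texts = [
    [["intro a"], ["methods a"], ["section1 a"], ["another a"], ["discussion a"]],
    [["intro b"], ["discussion b"], ["other b"], ["one more b"], ["conclusion b"]],
]
wide = pd.DataFrame([dict(zip(n, t)) for n, t in zip(names, texts)])

new_df = merge_column_groups(
    wide, ["introduction", "methods", "discussion", "conclusion"]
)
print(new_df.to_string())
```

Looping over the group labels keeps the result columns named after the desired column that actually opens each run, so the sketch does not rely on all four desired columns being present.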

