Recursive function on a pandas DataFrame

Problem description

I created the following DataFrame:

import pandas as pd
df = pd.DataFrame({'parent': ['AC1','AC2','AC3','AC1','AC11','AC5','AC5','AC6','AC8','AC9'],
                   'child': ['AC2','AC3','AC4','AC11','AC12','AC2','AC6','AC7','AC9','AC10']})

which prints the following:

    parent  child
0   AC1     AC2
1   AC2     AC3
2   AC3     AC4
3   AC1     AC11
4   AC11    AC12
5   AC5     AC2
6   AC5     AC6
7   AC6     AC7
8   AC8     AC9
9   AC9     AC10

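I want to create a result DataFrame in which each top-level parent (meaning a value that does not appear in the child column) is listed together with its last children, i.e. the descendants that have no children of their own.

df_result = pd.DataFrame({'parent': ['AC1','AC1','AC5','AC5','AC8','AC2'],
                          'child': ['AC4','AC12','AC4','AC7','AC10','AC4']})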
    parent  child
0   AC1     AC4
1   AC1     AC12
2   AC5     AC4
3   AC5     AC7
4   AC8     AC10
5   AC2     AC4

I have started the following function, but I am not sure how to finish it:

def get_child(df):
    result = {}
    if df['parent'] not in df['child']:
        return result[df['parent']]

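Solution

This is a tree structure, a special kind of graph. A DataFrame is not a particularly convenient way to represent a tree. I suggest switching to networkx or another graph-based package, then looking up how to do a simple path traversal; the graph package's documentation will show direct support for it.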
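
As a rough sketch of the networkx route suggested above (assuming networkx is installed, and using the ten parent/child pairs from the question's table): sources are nodes with in-degree 0, leaves are nodes with out-degree 0, and nx.descendants returns everything reachable from a node. The helper name sources_to_leaves is made up for this illustration.

```python
import networkx as nx

# Edges copied from the question's parent/child table.
edges = [
    ("AC1", "AC2"), ("AC2", "AC3"), ("AC3", "AC4"), ("AC1", "AC11"),
    ("AC11", "AC12"), ("AC5", "AC2"), ("AC5", "AC6"), ("AC6", "AC7"),
    ("AC8", "AC9"), ("AC9", "AC10"),
]

def sources_to_leaves(edges):
    """Hypothetical helper: map each source node to the sorted leaves reachable from it."""
    g = nx.DiGraph(edges)
    # A leaf has no outgoing edges.
    leaves = {n for n in g.nodes if g.out_degree(n) == 0}
    return {
        src: sorted(nx.descendants(g, src) & leaves)
        for src in g.nodes
        if g.in_degree(src) == 0  # a source has no incoming edges
    }

result = sources_to_leaves(edges)
```

Here result maps AC1, AC5 and AC8 to their leaves; AC2 is not a key because it appears as a child of both AC1 and AC5.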

If you insist on doing this yourself (which is a reasonable programming exercise), you only need something like this pseudocode:

for each parent not in "child" column:
    here = parent
    while here in parent column:
        here = here["child"]

    record (parent,here) pair
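
One possible way to turn the pseudocode above into runnable Python is sketched here. Because a parent can have several children (AC1 has AC2 and AC11), this version keeps a stack of nodes still to visit instead of a single "here" variable. The helper name last_children is hypothetical, the data is the question's table, and the sketch assumes the graph has no cycles.

```python
import pandas as pd

df = pd.DataFrame({
    "parent": ["AC1", "AC2", "AC3", "AC1", "AC11", "AC5", "AC5", "AC6", "AC8", "AC9"],
    "child":  ["AC2", "AC3", "AC4", "AC11", "AC12", "AC2", "AC6", "AC7", "AC9", "AC10"],
})

def last_children(df):
    """Hypothetical helper: pair each root with every leaf reachable from it."""
    # Map each parent to the list of its direct children, in table order.
    children_of = {}
    for parent, child in zip(df["parent"], df["child"]):
        children_of.setdefault(parent, []).append(child)
    all_children = set(df["child"])

    pairs = []
    for root in children_of:
        if root in all_children:
            continue  # appears in the child column, so it is not a root
        stack = [root]
        while stack:
            here = stack.pop()
            if here in children_of:
                # Push in reverse so children are visited in table order.
                stack.extend(reversed(children_of[here]))
            else:
                pairs.append((root, here))  # no children: record the pair
    return pd.DataFrame(pairs, columns=["parent", "child"])

result = last_children(df)
```

The result has one row per (root, leaf) pair; AC2 does not show up as a parent because it is itself a child of AC1 and AC5.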

Although your expected output seems inconsistent with your description (AC2 should not be treated as a parent, because it is not a source node), I am fairly confident that you want to run a traversal from each source node to locate all of its leaves. Doing this inside the DataFrame is not convenient, so we can iterate over df.values and build a dictionary that maps each parent to its children to represent the graph. I assume the graph has no cycles.

import pandas as pd
from collections import defaultdict

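def find_leaves(graph,src):
    # Depth-first walk: a node with no outgoing edges is a leaf.
    if src in graph:
        for neighbor in graph[src]:
            yield from find_leaves(graph,neighbor)
    else:
        yield src

def pair_sources_to_leaves(df):
    graph = defaultdict(list)  # parent -> list of direct children
    children = set()           # every node that appears as a child

    for parent,child in df.values:
        graph[parent].append(child)
        children.add(child)

    # Sources are parents that never appear as a child.
    leaves = [[x,list(find_leaves(graph,x))]
               for x in graph if x not in children]
    # One row per source holding a list of leaves, exploded to one row per pair.
    return (pd.DataFrame(leaves,columns=df.columns)
              .explode(df.columns[-1])
              .reset_index(drop=True))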

if __name__ == "__main__":
    df = pd.DataFrame({
        "parent": ["AC1","AC2","AC3","AC1","AC11","AC5","AC5","AC6","AC8","AC9"],
        "child": ["AC2","AC3","AC4","AC11","AC12","AC2","AC6","AC7","AC9","AC10"]
    })
    print(pair_sources_to_leaves(df))
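
The recursive find_leaves above relies on the no-cycles assumption: with a cycle in the data it would recurse without end. If that assumption might not hold, a cycle-tolerant variant could look roughly like the sketch below. The name find_leaves_safe and the toy graph are made up for illustration and are not part of the answer's code.

```python
from collections import defaultdict

def find_leaves_safe(graph, src):
    """Hypothetical cycle-tolerant variant: return the leaves reachable from src."""
    seen = set()
    leaves = []
    stack = [src]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded; this is what breaks cycles
        seen.add(node)
        if node in graph:
            # Push in reverse so children are visited in insertion order.
            stack.extend(reversed(graph[node]))
        else:
            leaves.append(node)  # no outgoing edges: a leaf
    return leaves

# Toy graph (invented for this example): A -> B -> C -> A is a cycle, C -> D leaves it.
graph = defaultdict(list)
for parent, child in [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]:
    graph[parent].append(child)

leaves = find_leaves_safe(graph, "A")
```

One behavioral difference to keep in mind: because visited nodes are skipped, a leaf that is reachable along two different paths is reported once, whereas the recursive generator yields it once per path.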