Grouping similar items in pandas

Question

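I'm trying to do something and would like to know whether it can be done in pandas, or whether there is a better tool for the job (at the moment I'm just using plain Python). Here is the starting data: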

# We have a listing of files for the movie Titanic
# and we want to break them into groups of similar titles
# to see which of them are possible duplicates.
import pandas as pd

titanic_files = [
    {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
    {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
    {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
]
df = pd.DataFrame(titanic_files)

I'd like to group these files by similar data without collapsing the row-level data, for example:

Step 1 - Group by Resolution


---- HD ----
File               Resolution             FrameRate              RunTime
Titanic_HD2398.mov HD                     23.98                  102
Titanic1.mov       HD                     23.98                  102
Titanic.mov        HD                     24.00                  103
MY_HD2398.mov      HD                     23.98                  102

---- SD ----
File               Resolution             FrameRate              RunTime
Titanic1.mov       SD                     23.98                  102

Step 2 - Group by FrameRate

---- HD -----------------------
 +----------- 23.98 ------------
File               Resolution             FrameRate              RunTime
Titanic_HD2398.mov HD                     23.98                  102
Titanic1.mov       HD                     23.98                  102
MY_HD2398.mov      HD                     23.98                  102

 +----------- 24.00 ------------
File               Resolution             FrameRate              RunTime
Titanic.mov        HD                     24.00                  103


---- SD -----------------------
 +----------- 23.98 ------------
File               Resolution             FrameRate              RunTime
Titanic1.mov       SD                     23.98                  102

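In the end, I'd basically like to have a separate DataFrame for each of the smallest groupings. In Python, I'm currently doing this with the following data structure: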

{
   'GroupingKeys': [{File1WithinThatBucket},{File2WithinThatBucket},...]
}

For example:

{
   'HD+23.98': [{'File': ...}],
   'HD+24.00': [{'File': ...}]
}

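Also, keep in mind that there are about 10-15 fields I'm grouping on; I've only included two in the question above, so the approach needs to be fairly general. In addition, some of the match criteria are inexact: for example, the runtime might need to be bucketed to within +/- 2 seconds, some values may be null, and so on.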

Back to the original question: is it possible to do something like this in pandas? If so, how?

Answer

pandas' groupby looks like the tool to use. It accepts as many groupers as you need, and they can be of any type: a list, a Series, a column name, an index level, a callable... you name it.
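To make the different grouper kinds concrete, here is a minimal sketch using four of the rows from the question. The boolean "above 23.99" split and the index-parity callable are arbitrary illustrations, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
        {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    ]
)

# 1. A column name
by_name = df.groupby("Resolution").size()

# 2. A Series aligned with df's index (here a boolean derived from FrameRate)
by_series = df.groupby(df["FrameRate"] > 23.99).size()

# 3. A callable, which pandas calls on each index label (here 0, 1, 2, 3)
by_callable = df.groupby(lambda label: label % 2).size()

# 4. A list mixing several kinds of groupers
by_mixed = df.groupby(["Resolution", df["FrameRate"] > 23.99]).size()

print(by_mixed)
```

Each result is a Series of group sizes; with a list of groupers, its index is a MultiIndex with one level per grouper.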

For example, you could do this:

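# Store the result under a new name so the original df stays available.
files_by_group = df.groupby(
    [
        'Resolution',                                           # exact match on a column
        df.FrameRate // 0.02 * 0.02,                            # frame rates in 0.02-wide buckets
        pd.cut(df.Runtime, bins=[45, 90, 95, 100, 120]),        # runtimes in explicit bins
    ],
    observed=True,  # keep only the runtime bins that actually occur
).File.apply(list)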

This returns a Series with a unique three-level MultiIndex, in which each value is the list of file names belonging to that group.
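The question also mentions inexact matches and null values. The following is a rough sketch of one way to handle both, not the answerer's code. It assumes 5-second-wide runtime buckets (a hypothetical width) and adds a made-up row, `Unknown.mov`, with a missing frame rate. It relies on the `dropna=False` option of `groupby`, which is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
        {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
        # Hypothetical extra row with a missing frame rate, added for illustration
        {"File": "Unknown.mov", "Resolution": "HD", "FrameRate": np.nan, "Runtime": 101},
    ]
)

# Floor each runtime to the start of its 5-second bucket, so 101, 102 and 103
# all map to 100. The bucket width is an assumption; tune it per field.
runtime_bucket = (df["Runtime"] // 5 * 5).rename("RuntimeBucket")

files_per_group = df.groupby(
    ["Resolution", "FrameRate", runtime_bucket],
    dropna=False,  # keep rows whose grouping key is null as their own group
)["File"].apply(list)

print(files_per_group)
```

Fixed-width buckets only approximate a "+/- 2 seconds" tolerance: two runtimes that straddle a bucket edge (99 and 101, say) end up in different groups, so a stricter tolerance would need a different technique.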

If, for whatever reason, you'd rather split the df into several DataFrames and keep them that way, you can also get the full rows of each group:

for group_id, group_rows in df.groupby(...):
    # group_id is a tuple holding one unique combination of the grouping values
    # group_rows is a DataFrame of the matching rows, with the same columns as df
    print(group_id, len(group_rows))
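To get the "one DataFrame per smallest grouping" structure described in the question, the loop can be turned into a dict comprehension. This is a minimal sketch that groups on just the two fields shown in the question; the variable name `frames_by_group` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"File": "Titanic_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic1.mov", "Resolution": "SD", "FrameRate": 23.98, "Runtime": 102},
        {"File": "Titanic.mov", "Resolution": "HD", "FrameRate": 24.00, "Runtime": 103},
        {"File": "MY_HD2398.mov", "Resolution": "HD", "FrameRate": 23.98, "Runtime": 102},
    ]
)

# Map each (Resolution, FrameRate) combination to a DataFrame of its rows.
frames_by_group = {
    key: rows.reset_index(drop=True)
    for key, rows in df.groupby(["Resolution", "FrameRate"])
}

for key, frame in frames_by_group.items():
    print(key)
    print(frame, end="\n\n")
```

The keys are tuples such as `("HD", 23.98)` rather than concatenated strings like `'HD+23.98'`, which avoids having to format and parse the key by hand.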
