Fast row-wise boolean operations in sparse matrices

Problem description

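I have a dataframe of about 4.4M rows of purchase orders. I am interested in a column that lists the items present in each purchase order. It is structured like this: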

df['item_arr'].head()
1   [a1,a2,a5]
2   [b1,b2,c3...
3   [b3]
4   [a2]

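There are 4k distinct items, and every row always contains at least one. I generated another 4.4M x 4k dataframe, df_te_sub, with a sparse structure that represents the same arrays as boolean indicators, i.e.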

c = df_te_sub.columns[[10,20,30]]
df_te_sub[c].head()
>>a10   b8  c1
0   0   0   0
1   0   0   0
2   0   0   0
3   0   0   0
4   0   0   0

The column names themselves are irrelevant, although they are in alphabetical order, for what it's worth.
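For readers who want to reproduce the setup, here is a minimal sketch of how an indicator frame like df_te_sub might be built from item lists. The toy data and the helper name build_indicator are assumptions made for illustration; the question does not say how df_te_sub was actually generated.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def build_indicator(item_lists):
    """Hypothetical helper: one row per order, one sparse 0/1 column per item.

    Assumes each item appears at most once per order.
    """
    # Alphabetically sorted column labels
    items = sorted({item for row in item_lists for item in row})
    col_of = {item: j for j, item in enumerate(items)}
    # Row and column coordinates of every (order, item) pair
    rows = np.repeat(np.arange(len(item_lists)), [len(r) for r in item_lists])
    cols = np.array([col_of[item] for r in item_lists for item in r])
    data = np.ones(len(rows), dtype=np.int8)
    mat = csr_matrix((data, (rows, cols)), shape=(len(item_lists), len(items)))
    return pd.DataFrame.sparse.from_spmatrix(mat, columns=items)

# Toy orders, invented for illustration
toy_items = [["a1", "a2", "a5"], ["b1", "b2", "c3"], ["b3"], ["a2"]]
df_toy = build_indicator(toy_items)
```

Calling df_toy.sparse.to_coo() on the result gives back a SciPy COO matrix, which is the form the snippets in this thread work with.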

Given a subset g of items, I am trying to extract the orders (rows) for two different cases:

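1. At least one of the items in g is present in the row
2. The items present in the row are all in g, i.e. the row's items are a subset of g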
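To pin down the two conditions, here is a small pure-Python sketch. The orders and the subset g below are invented for illustration and are not the real data.

```python
# Toy orders and subset, invented for illustration
orders = [["a1", "a2", "a5"], ["b1", "b2", "c3"], ["b3"], ["a2"]]
g = {"a2", "b3", "a5"}

# Case 1: the order shares at least one item with g
at_least_one = [not g.isdisjoint(order) for order in orders]

# Case 2: every item in the order belongs to g
all_in_g = [g.issuperset(order) for order in orders]

print(at_least_one)  # [True, False, True, True]
print(all_in_g)      # [False, False, True, True]
```

Because every order contains at least one item, any row that satisfies case 2 also satisfies case 1.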

I think the best approach for the first rule is:

c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)

The second rule is the challenge. I have tried different things, but they are all slow:

# using set
s = set(g)
df['item_arr'].apply(s.issuperset)

# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub),dtype='bool')
x[c.row] = False

# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)

Any ideas for improving performance? I need to do this for several subsets.

The output can be given as row indices (e.g. [0,2,3]) or as a boolean mask (e.g. True False True True ....), since both can be used to slice the orders dataframe.
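The two output forms are interchangeable. A minimal NumPy sketch of converting between them, using the example values from the sentence above:

```python
import numpy as np

mask = np.array([True, False, True, True])  # boolean-mask form
rows = np.flatnonzero(mask)                 # row-position form

# Going back from row positions to a mask of known length
mask_again = np.zeros(len(mask), dtype=bool)
mask_again[rows] = True
```

With pandas, df[mask] and df.iloc[rows] would then select the same orders, provided mask has one entry per row of df.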

Solution

I think you are overthinking this. If you have a boolean membership array, you have already done 90% of the work.

from scipy.sparse import csc_matrix

# Turn your sparse data into a sparse array
arr = csc_matrix(df_te_sub.sparse.to_coo())

# Get the number of items per row
row_len = arr.sum(axis=1).A.flatten()

# Get the column index of each item in g
# (the item columns live on df_te_sub, not on the orders frame df)
arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]

# Sum the number of items in g in the slice per row
arr_g = arr[:,arr_col_idx].sum(axis=1).A.flatten()

# Find all the rows with at least one thing in g
arr_one_g = arr_g > 0

# Find all the rows whose items are a subset of g
# This assumes row_len is always greater than 0; if it isn't, add a test for that
arr_subset_g = (row_len - arr_g) == 0

arr_one_g and arr_subset_g are 1-D boolean arrays that you can use to index the rows you want.
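Since the question mentions several subsets, here is a self-contained sketch of the same counting idea on a toy matrix, with the per-row item count computed once and reused. The toy data, the col_of lookup, and the helper name masks_for are assumptions for illustration. It also uses np.asarray(...).ravel() in place of .A.flatten() to turn the summed matrix into a 1-D array.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy 4 x 5 indicator matrix, invented for illustration (rows = orders)
columns = ["a1", "a2", "a5", "b1", "b3"]
dense = np.array([
    [1, 1, 1, 0, 0],  # order 0: a1, a2, a5
    [0, 0, 0, 1, 0],  # order 1: b1
    [0, 0, 0, 0, 1],  # order 2: b3
    [0, 1, 0, 0, 0],  # order 3: a2
])
arr = csc_matrix(dense)
col_of = {name: j for j, name in enumerate(columns)}

# Items per row depends only on the data, so compute it once
row_len = np.asarray(arr.sum(axis=1)).ravel()

def masks_for(g):
    """Hypothetical helper: (at least one item in g, all items in g) per row."""
    idx = [col_of[item] for item in g]
    # Number of items from g present in each row
    in_g = np.asarray(arr[:, idx].sum(axis=1)).ravel()
    # Like the snippet above, the subset test assumes no row is empty
    return in_g > 0, in_g == row_len

for g in (["a2", "b3"], ["a1", "a2", "a5"]):
    one_g, subset_g = masks_for(g)
    print(g, one_g.tolist(), subset_g.tolist())
```

On the real frame, df_te_sub.columns.get_indexer(g) could probably stand in for the col_of dictionary. CSC is a reasonable storage choice here because the per-subset work is a column selection.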
