Problem description
I have a DataFrame of 3-axis data with a membership label used for grouping:
df = pd.DataFrame( [[0,1,2,0],[-1,[-2,3,1],[1,2],[6,5],[-4,-1,6],[0,6]],columns = ['x','y','z','member'])
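My goal is somewhat contrived: I want to find the pairwise point distances between each group and the next n_skip groups, going from the smallest member label to the largest. n_skip is what I mean by staggered. For example, with n_skip=2 I want the following distances:
- rows with member == 0 -> paired with rows with member == 1,2
- rows with member == 1 -> paired with rows with member == 2,5
- rows with member == 2 -> paired with rows with member == 5,6
- rows with member == 5 -> paired with rows with member == 6
- no computation for member == 6.
Is there an efficient way to do this without nested for loops? Intuitively, and as hinted at in a related question and answer, I cannot use a traditional apply to parallelize a function over pandas DataFrames. What is a fast way to apply a function to a staggered set of groups?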
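To make the staggered pairing concrete, here is a minimal sketch that lists, for each group, the next n_skip groups in sorted label order. The member labels are assumed from the example above (0, 1, 2, 5, 6), and the names `labels` and `staggered` are illustrative only.

```python
import pandas as pd

# Hypothetical member labels matching the example in the question.
members = pd.Series([0, 0, 1, 1, 2, 2, 5, 5, 6, 6])
n_skip = 2

# Sorted unique labels; the position in this list acts like a dense rank.
labels = sorted(members.unique())

# For each group, the targets are the next n_skip labels in sorted order.
staggered = {
    int(lab): [int(t) for t in labels[i + 1 : i + 1 + n_skip]]
    for i, lab in enumerate(labels)
}
print(staggered)
# {0: [1, 2], 1: [2, 5], 2: [5, 6], 5: [6], 6: []}
```

Note that the targets follow the order of the labels that actually occur, not label arithmetic, so group 1 pairs with 2 and 5 rather than 2 and 3.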
import numpy as np
from scipy.spatial.distance import cdist

# Organize by group membership
groups = df.groupby('member')

# Define constants
max_member = 6
n_skip = 2
start_row = 0
matrix = np.zeros((df.shape[0], df.shape[0]))

# Iterate over each source group
for i in range(max_member):
    try:
        pts_curr = groups.get_group(i)
    except KeyError:
        continue
    # Save end row index
    end_row = start_row + pts_curr.shape[0]
    # Save start col index
    start_col = end_row
    # Grab the destination group nodes
    for j in range(i + 1, min(i + n_skip + 1, max_member)):
        try:
            pts_next = groups.get_group(j)
        except KeyError:
            continue
        # Save end col index
        end_col = start_col + pts_next.shape[0]
        # Calculate cdist (on the z column only)
        z_sq = cdist(pts_curr[['z']], pts_next[['z']])
        # Save results in matrix at right positions
        matrix[start_row:end_row, start_col:end_col] = z_sq
        # Update col index
        start_col = end_col
    # Update row index
    start_row = end_row
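For readers unfamiliar with `scipy.spatial.distance.cdist`, the following minimal sketch shows what it returns for two small point sets. The arrays `pts_a` and `pts_b` are made up for illustration and are not taken from the question's data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up point sets (rows are points, columns are x, y, z).
pts_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pts_b = np.array([[0.0, 3.0, 4.0]])

# The result has shape (len(pts_a), len(pts_b)); entry [i, j] is the
# Euclidean distance (the default metric) between pts_a[i] and pts_b[j].
d = cdist(pts_a, pts_b)
print(d.shape)  # (2, 1)
print(d[0, 0])  # 5.0
```

Each call therefore yields one rectangular block of distances, which is what the loop above writes into `matrix`.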
Solution
A cross merge on 4K rows is not too bad (about 16 million rows). Let's try a cross merge followed by a query:
import pandas as pd

n = 2
# dummy key used to build the cross product
df['dummy'] = 1
# this is the member group number (dense rank of the member label)
df['rank'] = df['member'].rank(method='dense')
# cross merge, then keep only pairs whose group is among the next n groups
new_df = (df.merge(df, on='dummy')
            .query('rank_x<rank_y<=rank_x+@n')
         )
# euclidean distance
dist = (new_df[['x_x','y_x','z_x']].sub(new_df[['x_y','y_y','z_y']].values)**2).sum(1)**.5
# output dataframe with member label
pd.DataFrame({'member1':new_df['member_x'],'member2':new_df['member_y'],'dist':dist})

Output:

    member1  member2      dist
2         0        1  2.449490
3         0        1  1.414214
4         0        2  1.414214
5         0        2  1.732051
12        0        1  2.236068
13        0        1  3.000000
14        0        2  2.236068
15        0        2  2.828427
24        1        2  3.162278
25        1        2  3.000000
26        1        5  8.485281
27        1        5  4.690416
34        1        2  1.414214
35        1        2  1.000000
36        1        5  5.477226
37        1        5  6.164414
46        2        5  5.477226
47        2        5  6.164414
48        2        6  3.000000
49        2        6  1.414214
56        2        5  5.744563
57        2        5  6.557439
58        2        6  4.000000
59        2        6  1.000000
68        5        6  5.744563
69        5        6  6.633250
78        5        6  5.916080
79        5        6  5.830952

Option 2: if the DataFrame is larger, a loop may not be too bad.
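A loop along the lines of Option 2 could be sketched as follows. This is a minimal sketch, not necessarily the answerer's code: it assumes the groups are compared in sorted label order, uses `scipy.spatial.distance.cdist` for each pair of groups, and runs on a small made-up DataFrame rather than the data from the question. The names `coords`, `pieces`, and `result` are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Made-up data for illustration only; not the DataFrame from the question.
df = pd.DataFrame(
    [[0, 0, 0, 0], [1, 0, 0, 1], [0, 2, 0, 2], [0, 0, 3, 5]],
    columns=["x", "y", "z", "member"],
)
n = 2  # number of following groups to compare against

coords = ["x", "y", "z"]
# (label, sub-frame) pairs, one per member group, sorted by label.
groups = list(df.groupby("member"))

pieces = []
for i, (m1, g1) in enumerate(groups):
    # Only the next n groups in sorted label order.
    for m2, g2 in groups[i + 1 : i + 1 + n]:
        d = cdist(g1[coords], g2[coords])  # shape (len(g1), len(g2))
        pieces.append(
            pd.DataFrame({"member1": m1, "member2": m2, "dist": d.ravel()})
        )

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Unlike the cross merge, this never materializes the full cross product: memory use is bounded by the largest pair of groups plus the accumulated results, at the cost of one Python-level iteration per group pair.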