Randomly split a dataframe into groups with an even distribution of values

Problem description

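I have a dataframe that is divided into two groups (A and B), and within each group there are six subgroups (a, b, c, d, e and f). Example data:

index  group  subgroup  value
0      A      a         1
1      A      b         1
2      A      c         1
3      A      d         1
4      A      e         1
5      A      f         1
6      B      a         1
7      B      b         1
8      B      c         1
9      B      d         1
10     B      e         1
11     B      f         1
...    ...    ...       ...

I have only listed 12 rows here, all with value equal to 1, but the real dataset has 300 rows (with values equal to 2, 3, and so on). I am trying to randomly split the dataframe into 6 batches of 50 rows each. However, I want each batch to contain an even distribution of group values (so 25 A and 25 B) and an approximately even distribution of subgroup values.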

For example, batch_1 might contain:

25 A rows, made up of 4 a, 5 b, 4 c, 4 d, 5 e and 3 f, plus 25 B rows, made up of 5 a, 4 b, 3 c, 5 d, 4 e and 4 f.

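These 6 batches will be assigned to one user. (So in practice I need to randomly split the dataframe into a different set of 6 batches for each additional user.) However, I can't work out whether this is a problem of randomly splitting the dataframe or of sampling from it. Does anyone have a suggestion for how to achieve this?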

This may help, but it does not ensure an even distribution of values: https://www.geeksforgeeks.org/break-list-chunks-size-n-python/

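Solution

This uses a few tricks:

Use pd.factorize() to convert each categorical column into an integer code per category
Compute a value f that represents the group/subgroup combination
Randomise f slightly by multiplying with np.random.uniform(), using a minimum and maximum close to 1
With a single value representing the grouping, use sort_values() and reset_index() to get a clean, ordered index
Finally, assign each row to a batch using the integer remainder of the index divided by the number of batches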
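
To make the first two steps concrete, here is a minimal sketch on a four-row toy frame. The names `toy` and `key` are illustrative only and do not appear in the answer; the sketch leaves out the random jitter so that the result is deterministic.

```python
import pandas as pd

# Hypothetical toy data: two groups, two subgroups
toy = pd.DataFrame({
    "group":    ["A", "A", "B", "B"],
    "subgroup": ["a", "b", "a", "b"],
})

# pd.factorize returns (codes, uniques); codes number the categories
# in order of first appearance, starting at 0
g_codes, g_uniques = pd.factorize(toy["group"])
s_codes, s_uniques = pd.factorize(toy["subgroup"])

# The tens digit encodes the group, the units digit encodes the subgroup
key = (g_codes + 1) * 10 + (s_codes + 1)
print(key.tolist())  # [11, 12, 21, 22]
```

Sorting by such a key places all rows of one group/subgroup combination next to each other, which is what lets the later remainder step deal them out evenly.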

import random

import numpy as np
import pandas as pd

# build 300 rows of random example data
group = list("ABCD")
subgroup = list("abcdef")
df = pd.DataFrame([
    {
        "group": random.choice(group),
        "subgroup": random.choice(subgroup),
        "value": random.randint(1, 3),
    }
    for _ in range(300)
])

bins=6
dfc = df.assign(
    # take into account concentration of group and subgroup
    # randomise a bit....
    f = ((pd.factorize(df["group"])[0] +1)*10 + 
            (pd.factorize(df["subgroup"])[0] +1) 
            *np.random.uniform(0.99,1.01,len(df))
        ),).sort_values("f").reset_index(drop=True).assign(
    gc=lambda dfa: dfa.index%(bins)
).drop(columns="f")

# check distribution ... used plot for SO
dfc.groupby(["gc","group","subgroup"]).count().unstack(0).plot(kind="barh")
# every group same size...
# dfc.groupby("gc").count()

# now it's easy to get each of the cuts.... 0 through 5
# dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
# dfcut0
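
For the multi-user case in the question, one possible variant is sketched below. It is an assumption-laden sketch rather than the answer's own code: the helper `make_batches`, its `seed` parameter and the temporary column `_r` are hypothetical names, and a random tie-break column replaces the factorize-and-jitter key. The example data is built to be balanced like the question's (2 groups, 6 subgroups, 25 rows per combination).

```python
import numpy as np
import pandas as pd

def make_batches(df, bins=6, seed=None):
    """Return a copy of df with a 'gc' column assigning each row to a batch."""
    rng = np.random.default_rng(seed)
    ordered = (
        df.assign(_r=rng.random(len(df)))          # random tie-break per row
          .sort_values(["group", "subgroup", "_r"])  # random order inside each cell
          .drop(columns="_r")
          .reset_index(drop=True)
    )
    # deal the sorted rows out round-robin: row i goes to batch i % bins
    return ordered.assign(gc=np.arange(len(ordered)) % bins)

# Balanced example data: 2 groups x 6 subgroups x 25 rows = 300 rows
data_rng = np.random.default_rng(0)
df = pd.DataFrame(
    [(g, s) for g in "AB" for s in "abcdef" for _ in range(25)],
    columns=["group", "subgroup"],
)
df["value"] = data_rng.integers(1, 4, len(df))  # values 1, 2 or 3

# One independent split per user, using the user number as the seed
user_batches = {user: make_batches(df, bins=6, seed=user) for user in range(3)}

# Check the split for user 0: rows per batch and group, and per batch and cell
first = user_batches[0]
by_group = pd.crosstab(first["gc"], first["group"])
by_cell = pd.crosstab(first["gc"], [first["group"], first["subgroup"]])
print(by_group)
```

With balanced input like this, each batch should hold 25 A and 25 B rows, and each group/subgroup combination contributes 4 or 5 rows per batch. With unbalanced input the batches still have near-equal sizes, but the per-group counts simply follow the proportions in the data.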

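Output: a horizontal bar chart of row counts for each group and subgroup combination, with one bar per batch (gc).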
