Is there a way to create bins in Python without listing every bin edge as in the code below, and perhaps without using np.digitize?

Problem description

In my code I create 10 bins (the specific bin ranges are listed below):

  1. 4100000-4155304

  2. 4155304-4210608

  3. 4210608-4321216

  4. 4321216-4542432

  5. 4542432-4984865

  6. 4984865-5327533

  7. 5327533-5670201

  8. 5670201-5746217

  9. 5746217-5873109

  10. 5873109-6000000

    bins = [4100000,4155304,4210608,4321216,4542432,4984865,5327533,5670201,5746217,5873109,6000000]
    bin_indices = np.digitize(bins_array,bins)
    

Is there a way to do this without having to list all the bin edges (bins = [bin numbers]), and perhaps without using np.digitize either? Thanks a lot!

Solution

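If evenly spaced bin edges are acceptable, just use numpy.arange. Note that a step of 55304 produces 34 equal-width bins, not the ten uneven bins listed in the question: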

import numpy as np
bins = np.arange(4100000,6000000,55304)
bins

Output

array([4100000,4155304,4210608,4265912,4321216,4376520,4431824,4487128,4542432,4597736,4653040,4708344,4763648,4818952,4874256,4929560,4984864,5040168,5095472,5150776,5206080,5261384,5316688,5371992,5427296,5482600,5537904,5593208,5648512,5703816,5759120,5814424,5869728,5925032,5980336])
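As a rough sketch of the same idea with a fixed number of bins: np.linspace can generate the edges from just the start, the end, and a bin count, and np.searchsorted (or plain arithmetic for equal-width bins) can stand in for np.digitize. The sample values below are hypothetical, since the question does not show bins_array, and the ten equal-width bins are an assumption, not the question's uneven ranges.

```python
import numpy as np

# Hypothetical sample values; the question's bins_array is not shown.
values = np.array([4100000, 4289999, 4290000, 5000000, 5999999])

start, stop, n_bins = 4100000, 6000000, 10

# 11 evenly spaced edges give exactly 10 equal-width bins over [start, stop].
edges = np.linspace(start, stop, n_bins + 1)

# Same result as np.digitize(values, edges): 1-based bin numbers,
# where bin i covers edges[i-1] <= x < edges[i].
idx_searchsorted = np.searchsorted(edges, values, side="right")

# For equal-width bins, plain arithmetic also works, with no edge lookup.
width = (stop - start) / n_bins
idx_arithmetic = ((values - start) // width).astype(int) + 1

print(idx_searchsorted)
print(idx_arithmetic)
```

The arithmetic version only applies when all bins have the same width; for uneven edges such as those in the question, the edges still have to be supplied somehow.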

Cheers


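I can't track down the author of the SO post I originally got this pandas approach from, but you could try something like the following; I put it together quickly as an idea. The DataFrame is just NumPy random integers, used to generate fake data in the range you are looking for.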

import pandas as pd
import numpy as np

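# Create category labels and bin edges for the data ranges.
cats = ['4100000_4155303','4155304_4210608','4210608_4321215','4321216_4542431','4542432_4984864','4984865_5327532','5327533_5670200','5670201_5746216','5746217_5873108','5873109_6000000']

# pd.cut needs one more edge than labels: 11 edges for the 10 labels above.
# Intervals are right-closed, so each edge is the upper bound of its label.
bins = [4100000,4155303,4210608,4321215,4542431,4984864,5327532,5670200,5746216,5873108,6000000]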


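def binn(df):
    # Count, per row, which labelled bin the value in column 'A' falls into,
    # then pivot so every label in cats becomes a column (0 where unused).
    df = (df.groupby([df.index,pd.cut(df['A'],bins,labels=cats)])
                .size()
                .unstack(fill_value=0)
                .reindex(columns=cats,fill_value=0))
    return df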


rng = np.random.default_rng()
# Fake data drawn from [4155304, 6000000), so it falls inside the bin range.
df = pd.DataFrame(rng.integers(4155304,6000000,size=(1000,1)),columns=list('A'))

dfBinned = binn(df)

print('All data binned in column A of the df')
print(dfBinned.sum(axis = 0))

This prints:

All data binned in column A of the df
A
4100000_4155303      0
4155304_4210608     35
4210608_4321215     42
4321216_4542431    130
4542432_4984864    239
4984865_5327532    174
5327533_5670200    205
5670201_5746216     37
5746217_5873108     63
5873109_6000000     75
dtype: int64
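If only the per-bin counts are needed, a shorter variant may be enough. This is a minimal sketch, not the approach above: it assumes the bin edges from the question, lets pd.cut generate interval labels instead of hand-written ones, and uses hypothetical fake data with a fixed seed.

```python
import numpy as np
import pandas as pd

# Bin edges copied from the question.
edges = [4100000, 4155304, 4210608, 4321216, 4542432, 4984865,
         5327533, 5670201, 5746217, 5873109, 6000000]

# Hypothetical fake data; the fixed seed keeps the run repeatable.
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(4100000, 6000000, size=1000))

# right=False makes each bin [lower, upper), so no value is counted twice.
binned = pd.cut(values, bins=edges, right=False)

# One count per bin, sorted into bin order.
counts = binned.value_counts().sort_index()
print(counts)
```

This avoids np.digitize and the hand-written labels, but the uneven edges themselves still have to be listed once.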