How to randomly remove 10% of the attribute values from a Pandas DataFrame

Problem description

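I have a dataset with 30 columns. The last column is the target variable for classification.

I need to randomly remove 10% of the attribute values, so 10% of the values in columns 0-29 should be NA. I also want the removal itself to be random across columns: I don't want every column to lose the same percentage of values, so each column should end up with a different removal percentage. Taken over all columns together, though, the removed values should add up to 10% of the original attribute values.

Any help is much appreciated.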

Solution

Something like this might be what you're looking for.

import numpy as np

# get dimensions of df
nrows,ncols = len(df.index),30          

volume = nrows * ncols                    # total number of entries in df
volume_to_be_nan = int(volume * 0.1)      # number of entries to turn to NaN (10 %)

# randomly pick distinct flat positions for the new NaNs; sampling without
# replacement means no cell is chosen twice, so the NaN count is exact
indices = np.random.choice(volume, size=volume_to_be_nan, replace=False)
row_indices = indices % nrows
col_indices = indices // nrows

# assign NaN to each of the indices in df
for ri,ci in zip(row_indices,col_indices):
  df.iloc[ri,ci] = np.nan

For example, if df is:

   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29
0  19  52  65  85  76  79  99  85  53  20  35   2  66  58  51  56  63  46   0  63  14  27  79  45  30  83  35  32  45  16
1  37  16  75  28  23  77  19  99  34  70  31  74  59  85  90  83  85   2  16  12   6  18   2  16  42  95  54   4  57  23
2  54  54  99  96  64  43  65  17  72  82  19  26  10  64  82  18  72  53  49  76  90  29   6  40  80  57  48  60  75  17
3  57  33  82  28  14  29   2  69   4  67  23  87  31  34  12  86  74  67  32  69  43  19  63   6  78  31  12  16  60  60
4  10  82  26  62  22  21  37  17  33  20  40  93  50  75  24  91  41  79  56  24   5  89  95  59  80  36  23  38  41  79

then the code above returns df as:

     0     1   2     3   4     5   6   7     8   9   10  11    12    13    14    15  16  17    18  19  20  21  22  23    24  25  26  27    28  29
0   NaN  52.0  65  85.0  76  79.0  99  85  53.0  20  35   2   NaN  58.0  51.0  56.0  63  46   0.0  63  14  27  79  45  30.0  83  35  32   NaN  16
1  37.0   NaN  75  28.0  23  77.0  19  99  34.0  70  31  74  59.0   NaN   NaN   NaN  85   2  16.0  12   6  18   2  16  42.0  95  54   4  57.0  23
2  54.0  54.0  99  96.0  64   NaN  65  17  72.0  82  19  26  10.0  64.0  82.0  18.0  72  53   NaN  76  90  29   6  40  80.0  57  48  60   NaN  17
3  57.0  33.0  82  28.0  14   NaN   2  69   NaN  67  23  87   NaN  34.0  12.0  86.0  74  67  32.0  69  43  19  63   6   NaN  31  12  16  60.0  60
4  10.0  82.0  26   NaN  22  21.0  37  17  33.0  20  40  93  50.0  75.0  24.0  91.0  41  79  56.0  24   5  89  95  59  80.0  36  23  38  41.0  79

Here the total number of entries is 150, and the number of NaNs, randomly distributed across the DataFrame, is 15 (that is, 10% of 150).
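The question also says the last column is the target variable, which presumably should not lose values. Here is a minimal vectorized sketch of the same idea that only touches the attribute columns. The function name `remove_random_values` and its parameters `feature_cols`, `frac` and `seed` are hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd


def remove_random_values(df, feature_cols, frac=0.1, seed=None):
    """Return a copy of df with `frac` of the cells in feature_cols set to NaN.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = np.random.default_rng(seed)
    # Work on a float copy of the attribute block so NaN can be stored.
    values = df[feature_cols].to_numpy(dtype=float, copy=True)
    n_missing = int(values.size * frac)
    # Draw flat cell positions without replacement: no cell is hit twice.
    flat_idx = rng.choice(values.size, size=n_missing, replace=False)
    rows, cols = np.unravel_index(flat_idx, values.shape)
    values[rows, cols] = np.nan
    out = df.copy()
    out[feature_cols] = values
    return out


# Example: 5 rows, 30 columns, the last column treated as the target.
df = pd.DataFrame(np.arange(150).reshape(5, 30))
feature_cols = list(df.columns[:-1])
df_missing = remove_random_values(df, feature_cols, frac=0.1, seed=0)
```

Because the positions are drawn uniformly over the whole attribute block, the number of NaNs per column varies by chance while the overall total stays fixed at `int(frac * number_of_attribute_cells)`.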

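Jaeden's solution certainly works, but you can get the desired result with pandas functions alone, without any complicated programming. Basically, you melt() all the columns into a single column, then randomly sample the number of rows you want to keep, and finally pivot() back to the original DataFrame shape. It is a good idea to count the NaNs at the end to make sure everything worked.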

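import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,99,size=(4,30)),columns=list(range(0,30)))

# keep the row labels as a regular column so they survive the melt
df = df.reset_index()
# long format: one row per (index, variable, value) cell
df_onecolumn = pd.melt(df,id_vars=['index'])
# keep a random 90 % of the cells; the other 10 % will become NaN
df_sampled = df_onecolumn.sample(frac=0.9).reset_index(drop=True)
# back to wide format; cells that were not sampled are filled with NaN
df_fraction = df_sampled.pivot(index='index',columns='variable',values='value')

# total number of NaNs in the result
df_fraction.isna().sum().sum()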
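One edge case of the melt/pivot approach: if every cell of some row or column happens to be dropped by the sampling, pivot() leaves that row or column out entirely. The sketch below adds a reindex() step as a guard and also looks at the per-column missing fraction. The variable names and the fixed seeds are assumptions made for reproducibility, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
original = pd.DataFrame(rng.integers(1, 99, size=(4, 30)), columns=list(range(30)))

# Long format, random 90 % of the cells kept, then back to wide format.
long_form = original.reset_index().melt(id_vars=["index"])
kept = long_form.sample(frac=0.9, random_state=0)
wide = kept.pivot(index="index", columns="variable", values="value")

# Restore any row or column that lost all of its values during sampling.
wide = wide.reindex(index=original.index, columns=original.columns)

n_missing = int(wide.isna().sum().sum())   # total NaNs over the whole frame
per_column_fraction = wide.isna().mean()   # share of NaNs in each column
```

With 120 cells and `frac=0.9`, 108 cells are kept, so `n_missing` comes out as 12. The entries of `per_column_fraction` differ from column to column, which is the uneven per-column removal the question asks for.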
