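I am using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large DataFrame that lists the species and gender of individual animals. It is a real-world dataset, so it inevitably has missing values, represented by NaN. A simplified version of the data can be generated with: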
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})
Printing the DataFrame gives:
gender id species
0 male 1 dog
1 female 2 dog
2 female 3 NaN
3 male 4 dog
4 male 5 dog
5 female 6 cat
6 female 7 cat
7 NaN 8 cat
8 male 9 dog
9 male 10 cat
10 female 11 cat
11 male 12 dog
12 female 13 dog
13 female 14 dog
14 male 15 dog
15 female 16 NaN
16 male 17 cat
17 female 18 cat
18 NaN 19 dog
19 male 20 dog
I want to generate a cross table showing the number of males and females in each species, using the following:
pd.crosstab(tempDF['species'],tempDF['gender'])
This produces the following table:
gender female male
species
cat 4 2
dog 3 7
This is what I expected. However, if I include the margins=True option, I get:
pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)
gender female male All
species
cat 4 2 7
dog 3 7 11
All 9 9 20
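As you can see, the marginal totals appear to be incorrect, presumably because of the missing data in the DataFrame. For example, the cat row shows 4 + 2 = 6, but its All entry is 7. Is this the intended behaviour? To me it seems confusing. Surely the marginal totals should be the sums of the rows and columns as they appear in the table, and should not include any missing data that is not represented in the table. Including dropna=False does not affect the result.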
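To make the arithmetic concrete, here is a minimal sketch of the totals I would expect, built by hand from the table without margins. The variable name `table` is only for illustration, and this is my expectation of the margins, not what margins=True returns in pandas 0.17.0:

```python
import numpy as np
import pandas as pd

# Same simplified data as above.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

# Cross table without margins: rows with a NaN in either column are left out.
table = pd.crosstab(tempDF['species'], tempDF['gender'])

# Add the totals by hand so they only count what is shown in the table.
table['All'] = table.sum(axis=1)      # row totals
table.loc['All'] = table.sum(axis=0)  # column totals, including the grand total

print(table)
```

With the data above, the hand-built totals are 6 for cat, 10 for dog and 16 overall, compared with the 7, 11 and 20 reported with margins=True.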
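I could delete any rows containing NaN before creating the table, but that seems like a lot of extra work, and one more thing to keep in mind while doing an analysis. Should I report this as a bug?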
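For reference, dropping the incomplete rows first would look roughly like this. It is only a sketch: `clean` and `clean_table` are names I made up, and it assumes that only NaNs in the species and gender columns matter:

```python
import numpy as np
import pandas as pd

# Same simplified data as above.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

# Keep only the rows where both species and gender are known.
clean = tempDF.dropna(subset=['species', 'gender'])

# With no NaNs left, the margins can only count rows that appear in the table.
clean_table = pd.crosstab(clean['species'], clean['gender'], margins=True)

print(clean_table)
```

This works, but it has to be repeated for every pair of columns being tabulated, which is the extra bookkeeping I would like to avoid.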
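Workaround:
I suppose one workaround is to convert the NaNs to 'missing' before creating the table. The crosstab will then include a column and a row dedicated to missing values: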
pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)
gender female male missing All
species
cat 4 2 1 7
dog 3 7 1 11
missing 2 0 0 2
All 9 9 2 20
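Personally, I would like this to be the default behaviour, so that I don't have to remember to replace all the NaNs in every crosstab calculation.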
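In the meantime, the workaround could be wrapped in a small helper so the replacement is not forgotten. This is only a sketch: `crosstab_with_missing` and its `label` parameter are hypothetical names of my own, not part of pandas:

```python
import numpy as np
import pandas as pd


def crosstab_with_missing(index, columns, label='missing', **kwargs):
    """Hypothetical helper: replace NaN with `label` in both Series,
    then pass everything on to pd.crosstab."""
    return pd.crosstab(index.fillna(label), columns.fillna(label), **kwargs)


# Same simplified data as above.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

result = crosstab_with_missing(tempDF['species'], tempDF['gender'], margins=True)
print(result)
```

The helper assumes that the label does not already occur as a real value in either column; if it did, the missing values would be merged with that category.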