python - Fastest way to filter a pandas DataFrame by category

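I have a very large DataFrame with 100 million rows and a categorical column. I would like to know whether there is a faster way to select rows by category than the .isin() method or the .join() method mentioned here.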

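Since the data is already categorical, I expected selecting by category to be fast, but the tests I ran were disappointing. The only other solution I found is from here, but it does not seem to work with pandas 0.20.2.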

Here is a sample dataset.

import pandas as pd
import random
import string
df = pd.DataFrame({'categories': [random.choice(string.ascii_letters) 
                                  for _ in range(1000000)]*100,
                   'values': [random.choice([0,1]) 
                              for _ in range(1000000)]*100})
df['categories'] = df['categories'].astype('category')

Testing with .isin():

%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
44 s ± 894 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using .join():

%timeit df.set_index('categories').join(
    pd.Series(index=list(string.ascii_lowercase), name='temp'), 
    how='inner').rename_axis('categories').reset_index().drop('temp', axis=1)
24.7 s ± 1.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer:

Here is a similar but different approach that compares the values directly instead of using isin.

Basic map/lambda comparison:

%timeit df[df['categories'].map(lambda x: x in string.ascii_lowercase)]
> 1 loop, best of 3: 12.3 s per loop

Using isin:

%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
> 1 loop, best of 3: 55.1 s per loop

Versions: Py 3.5.1 / IPython 5.1.0 / Pandas 0.20.3

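Background: in one of the SO posts you linked, a commenter mentioned that isin needs to build a set() during execution, so skipping that step and doing a plain membership test (x in string.ascii_lowercase) seems to be where the speedup comes from here.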
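To make the comparison concrete, here is a minimal sketch that checks the map/lambda filter and the isin filter select the same rows. It uses a scaled-down sample (100,000 rows instead of 100 million, an assumption so it runs quickly) and does not reproduce the timings above. The names small, mask_map and mask_isin are illustrative and not from the original post. As far as I know, Series.map on a categorical column evaluates the function once per category rather than once per row, which may account for part of the difference; treat that as something to verify on your pandas version.

```python
import random
import string

import pandas as pd

random.seed(0)  # arbitrary seed, only so the sketch is reproducible

# Scaled-down version of the sample data from the question (assumption: 100,000 rows).
small = pd.DataFrame({
    'categories': [random.choice(string.ascii_letters) for _ in range(100000)],
    'values': [random.choice([0, 1]) for _ in range(100000)],
})
small['categories'] = small['categories'].astype('category')

lowercase = set(string.ascii_lowercase)

# map/lambda: plain membership test; cast to bool so the mask dtype is explicit.
mask_map = small['categories'].map(lambda x: x in lowercase).astype(bool)

# isin: the approach timed in the question.
mask_isin = small['categories'].isin(list(string.ascii_lowercase))

filtered_map = small[mask_map]
filtered_isin = small[mask_isin]

# Both filters should keep exactly the same rows.
print(filtered_map.equals(filtered_isin))
```

Wrapping either mask expression in %timeit on the full-size frame is the way to see how the two compare on a given machine and pandas version.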

Disclaimer: this is not the scale of data I usually work with, so there may be faster options.
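One option that might be worth timing, though it was not tested in this answer, is to filter on the integer category codes instead of the labels. The sketch below is a guess at how that could look, again on a scaled-down sample. It looks up the codes of the wanted labels with Index.get_indexer and compares them to the column's codes with numpy.isin. The names small, wanted_codes and mask_codes are hypothetical, and whether this is actually faster on 100 million rows is an assumption to measure, not a result.

```python
import random
import string

import numpy as np
import pandas as pd

random.seed(0)  # arbitrary seed, only so the sketch is reproducible

# Scaled-down version of the sample data from the question (assumption: 100,000 rows).
small = pd.DataFrame({
    'categories': [random.choice(string.ascii_letters) for _ in range(100000)],
    'values': [random.choice([0, 1]) for _ in range(100000)],
})
small['categories'] = small['categories'].astype('category')

wanted = list(string.ascii_lowercase)
col = small['categories']

# Position of each wanted label in the categories index; -1 means the label is absent.
wanted_codes = col.cat.categories.get_indexer(wanted)
wanted_codes = wanted_codes[wanted_codes >= 0]

# Compare the per-row integer codes rather than the labels themselves.
# Missing values carry code -1 and are therefore never selected.
mask_codes = np.isin(col.cat.codes.values, wanted_codes)

filtered_codes = small[mask_codes]
```

The result should match small[small['categories'].isin(wanted)], so the two can be swapped and timed against each other.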

Edit: more details, as requested in JohnGalt's comment:

df.shape
> (100000000, 2)
df.dtypes
> categories    category
 values           int64
 dtype: object

To create the sample data, I copied and pasted the sample DF from the original question. Run on an early-2015 MBP.
