What is the fastest way in Python to find the average of a list of tuples, where each tuple contains a pair of namedtuples?

Problem description

import numpy as np
from collections import namedtuple
from random import random

Smoker    = namedtuple("Smoker",["Female","Male"])
Nonsmoker = namedtuple("Nonsmoker",["Female","Male"])

LST = [(Smoker(random(),random()),Nonsmoker(random(),random())) for i in range(100)]

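So I have a long list whose elements are tuples. Each tuple contains a pair of namedtuples. What is the fastest way to find the average of this list? Ideally the result keeps the same structure, i.e. (Smoker(Female=w,Male=x),Nonsmoker(Female=y,Male=z)). What I have so far: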

grizzly = Smoker(np.mean([a.Female for a,b in LST]),np.mean([a.Male for a,b in LST]))
panda = Nonsmoker(np.mean([b.Female for a,b in LST]),np.mean([b.Male for a,b in LST]))
result = (grizzly,panda)

Solution

np.mean has to convert each list to an array first, which takes time. The built-in Python sum avoids that conversion and is faster here:

In [6]: %%timeit
   ...: grizzly = Smoker(np.mean([a.Female for a,b in LST]),np.mean([a.Male for
   ...: a,b in LST]))
   ...: panda = Nonsmoker(np.mean([b.Female for a,b in LST]),np.mean([b.Male for
   ...:  a,b in LST]))
   ...: result = (grizzly,panda)
   ...: 
   ...: 
158 µs ± 597 ns per loop (mean ± std. dev. of 7 runs,10000 loops each)

In [9]: %%timeit
   ...: n=len(LST)
   ...: grizzly = Smoker(sum([a.Female for a,b in LST])/n,sum([a.Male for a,b in
   ...:  LST])/n)
   ...: panda = Nonsmoker(sum([b.Female for a,b in LST])/n,sum([b.Male for a,b in
   ...:  LST])/n)
   ...: result = (grizzly,panda)
   ...: 
   ...: 
46.2 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs,10000 loops each)

Both produce the same result (to within a small epsilon):

In [8]: result
Out[8]: 
(Smoker(Female=0.5383695316982974,Male=0.5493854404111675),Nonsmoker(Female=0.4913454565011218,Male=0.47143788469638825))
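The sum-based version above can be wrapped in a small helper. This is only a sketch: `mean_pairs` is a hypothetical name, not something from the question, and it assumes a non-empty list (an empty list would fail on the division).

```python
from collections import namedtuple
from random import random

Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])


def mean_pairs(pairs):
    """Average a non-empty list of (Smoker, Nonsmoker) pairs field by field."""
    n = len(pairs)
    # Flatten each pair to four numbers, then zip(*...) transposes rows to columns.
    columns = zip(*((s.Female, s.Male, ns.Female, ns.Male) for s, ns in pairs))
    sf, sm, nf, nm = (sum(col) / n for col in columns)
    # Rebuild the original structure from the four column means.
    return Smoker(sf, sm), Nonsmoker(nf, nm)


LST = [(Smoker(random(), random()), Nonsmoker(random(), random())) for _ in range(100)]
result = mean_pairs(LST)
```

This walks the list once instead of four times. Whether that is actually faster than the four separate comprehensions was not measured here, so time it on your own data.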

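If you can collect the values into a single array, say of shape (n,4), then taking the mean is fast. For a one-time calculation, though, building the array may not be worth it: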

In [11]: M = np.array([(a.Female,a.Male,b.Female,b.Male) for a,b in LST])
In [12]: np.mean(M,axis=0)
Out[12]: array([0.53836953,0.54938544,0.49134546,0.47143788])

In [13]: timeit M = np.array([(a.Female,a.Male,b.Female,b.Male) for a,b in LST])
128 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs,10000 loops each)
In [14]: timeit np.mean(M,axis=0)
21.9 µs ± 371 ns per loop (mean ± std. dev. of 7 runs,10000 loops each)
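To get back to the structure the question asks for, the four column means can be unpacked into the namedtuples again. A minimal sketch, assuming the same `Smoker`/`Nonsmoker` definitions and a randomly generated `LST` as in the question:

```python
import numpy as np
from collections import namedtuple
from random import random

Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])
LST = [(Smoker(random(), random()), Nonsmoker(random(), random())) for _ in range(100)]

# Build the (n, 4) array once; per the timings above this is the expensive step.
M = np.array([(a.Female, a.Male, b.Female, b.Male) for a, b in LST])

# Column-wise mean gives a length-4 vector in the same field order as above.
means = M.mean(axis=0)

# The first two entries belong to Smoker, the last two to Nonsmoker.
result = (Smoker(*means[:2]), Nonsmoker(*means[2:]))
```

The fields of `result` are NumPy floats rather than plain Python floats. If `M` is reused for several statistics, the one-time cost of building it is spread over all of them.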

Since namedtuples can be accessed like regular tuples, we can create an array directly from LST:

In [16]: np.array(LST).shape
Out[16]: (100,2,2)
In [17]: np.array(LST).mean(axis=0)
Out[17]: 
array([[0.53836953,0.54938544],[0.49134546,0.47143788]])

But the timing is not encouraging:

In [18]: timeit np.array(LST).mean(axis=0)
1.26 ms ± 7.92 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

I can also create a structured array from your list, with a nested dtype:

In [26]: dt = np.dtype([('Smoker',[('Female','f'),('Male','f')]),('Nonsmoker',[
    ...: ('Female','f'),('Male','f')])])
In [27]: M=np.array(LST,dt)
In [28]: M['Smoker']['Female'].mean()
Out[28]: 0.53836954

Curiously, the timing for this is relatively good:

In [29]: timeit M=np.array(LST,dt)
40.6 µs ± 243 ns per loop (mean ± std. dev. of 7 runs,10000 loops each)

But I would then have to take each mean separately, or else convert it to an unstructured array first.
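Taking each mean separately could look something like the following. It is a sketch that reuses the nested dtype from above and loops over the namedtuples' `_fields`; note that `'f'` is 32-bit float, so the means agree with the pure-Python ones only to about single precision.

```python
import numpy as np
from collections import namedtuple
from random import random

Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])
LST = [(Smoker(random(), random()), Nonsmoker(random(), random())) for _ in range(100)]

# Nested structured dtype mirroring the (Smoker, Nonsmoker) pair layout.
dt = np.dtype([
    ("Smoker", [("Female", "f"), ("Male", "f")]),
    ("Nonsmoker", [("Female", "f"), ("Male", "f")]),
])
M = np.array(LST, dt)  # shape (n,), one record per pair

# One mean per leaf field, reassembled into the original structure.
result = (
    Smoker(*(M["Smoker"][name].mean() for name in Smoker._fields)),
    Nonsmoker(*(M["Nonsmoker"][name].mean() for name in Nonsmoker._fields)),
)
```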

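I can build an (n,4) float array from the structured array using either view or a recfunctions utility (rf here is numpy.lib.recfunctions, i.e. import numpy.lib.recfunctions as rf):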

In [53]: M1 = M.view([('f0','f',(4,))])['f0']
In [54]: M1.shape
Out[54]: (100,4)
In [55]: M2=rf.structured_to_unstructured(M)
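Putting the last step together with the mean, a sketch under the same assumptions (the nested `'f'` dtype from above, a random `LST`) might be:

```python
import numpy as np
import numpy.lib.recfunctions as rf
from collections import namedtuple
from random import random

Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])
LST = [(Smoker(random(), random()), Nonsmoker(random(), random())) for _ in range(100)]

dt = np.dtype([
    ("Smoker", [("Female", "f"), ("Male", "f")]),
    ("Nonsmoker", [("Female", "f"), ("Male", "f")]),
])
M = np.array(LST, dt)

# Flatten the nested fields into a plain (n, 4) float32 array.
flat = rf.structured_to_unstructured(M)

# Column means, then back to the (Smoker, Nonsmoker) structure.
means = flat.mean(axis=0)
result = (Smoker(*means[:2]), Nonsmoker(*means[2:]))
```

This combines the relatively cheap structured-array construction with a single vectorized mean. It has not been timed against the sum version here, so treat any speed advantage as something to verify.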
