Why is list(x for x in a) faster for a = [0] than for a = []?

Question

I tested list(x for x in a) with three different CPython versions. On a = [0] it is significantly faster than on a = []:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

465 ns  412 ns     543 ns  515 ns     513 ns  457 ns   
450 ns  406 ns     544 ns  515 ns     506 ns  491 ns   
456 ns  408 ns     551 ns  513 ns     515 ns  487 ns   
455 ns  413 ns     548 ns  516 ns     513 ns  491 ns   
452 ns  404 ns     549 ns  511 ns     508 ns  486 ns

With tuple instead of list, it is the other way around, as expected:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

354 ns  405 ns     467 ns  514 ns     421 ns  465 ns   
364 ns  407 ns     467 ns  527 ns     425 ns  464 ns   
353 ns  399 ns     490 ns  549 ns     419 ns  465 ns   
352 ns  400 ns     500 ns  556 ns     414 ns  474 ns   
354 ns  405 ns     494 ns  560 ns     420 ns  474 ns

So why is list faster when it (and the underlying generator iterator) has to do more?

Tested on Windows 10 Pro 2004 64-bit.

Benchmark code:

from timeit import repeat

setups = 'a = []','a = [0]'
number = 10**6

print(*setups,sep='   ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)',setup,number=number)) / number
        print('%d ns' % (t * 1e9),end='   ')
    print()

Byte sizes, showing that list does not over-allocate for input [] but does for input [0]:

>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40

>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72
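The byte sizes above can be turned into a capacity measured in element slots. This is a minimal sketch, assuming (as on CPython) that list.__sizeof__() is a fixed header plus one pointer per allocated slot; the helper name allocated_slots is made up for illustration.

```python
import struct

POINTER_SIZE = struct.calcsize("P")   # 8 on a 64-bit build, 4 on a 32-bit build
HEADER_SIZE = [].__sizeof__()         # an empty list literal has no item buffer

def allocated_slots(lst):
    """Estimate the allocated element slots of a list (CPython-specific assumption)."""
    return (lst.__sizeof__() - HEADER_SIZE) // POINTER_SIZE

print(allocated_slots([]))
print(allocated_slots([0]))
print(allocated_slots(list(x for x in [])))
print(allocated_slots(list(x for x in [0])))
```

Applied to the 64-bit sizes reported above (40, 48, 40 and 72 bytes), this corresponds to 0, 1, 0 and 4 slots: the list built from a one-element generator keeps room for four elements.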

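Answer

What you observe is that pymalloc (the Python memory manager) is faster than the memory manager provided by your C runtime.

It is easy to see in a profiler that the main difference between the two versions is that list_resize and _PyObject_Realloc need more time in the a=[] case. But why?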

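When a new list is created from an iterable, the list tries to get a hint of how many elements the iterator holds:

n = PyObject_LengthHint(iterable, 8);

However, this doesn't work for generators, so the hint is the default value 8.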
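The same behavior can be observed from Python with operator.length_hint, the Python-level counterpart of PyObject_LengthHint. A small sketch; the variable names are only illustrative:

```python
from operator import length_hint

a = [0, 1, 2]

# A list iterator knows how many items remain, so the default is ignored.
hint_from_list_iter = length_hint(iter(a), 8)

# A generator has neither __len__ nor __length_hint__, so the default is returned.
hint_from_generator = length_hint((x for x in a), 8)

print(hint_from_list_iter, hint_from_generator)  # 3 8
```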

After the iterator is exhausted, the list tries to shrink, because it holds only 0 or 1 elements (and not the capacity originally allocated for the too-large size hint). For 1 element this leads, because of over-allocation, to a capacity of 4 elements. For 0 elements, however, there is special handling: the list will not be over-allocated:

// ...
if (newsize == 0)
    new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...

So in the "empty" case, PyMem_Realloc is asked for 0 bytes. This call is passed via _PyObject_Malloc down to pymalloc_alloc, which returns NULL for 0 bytes:

if (UNLIKELY(nbytes == 0)) {
    return NULL;
}

However, if pymalloc returns NULL, _PyObject_Malloc falls back to the "raw" malloc:

static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}

As is easy to see in the definition of _PyMem_RawMalloc, that raw allocator is the C runtime's malloc:

static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}

Thus the a=[0] case uses pymalloc, while the a=[] case uses the memory manager of the underlying C runtime, which explains the observed difference.
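The dispatch described above can be condensed into a toy model. This is only a sketch of the decision, not CPython's code: the function name allocator_for is invented, a 64-bit pointer size is assumed, and the 512-byte limit is the pymalloc threshold this answer mentions further down.

```python
SMALL_REQUEST_THRESHOLD = 512   # bytes; above this pymalloc hands over to malloc
POINTER_SIZE = 8                # assumption: 64-bit build

def allocator_for(nbytes):
    """Toy model: which allocator ends up serving a request of nbytes."""
    if nbytes == 0 or nbytes > SMALL_REQUEST_THRESHOLD:
        return "C runtime malloc"
    return "pymalloc"

# Capacities after the final shrink, as described above: 0 slots for an
# empty result, 4 slots for a single element.
for label, capacity in (("a=[]", 0), ("a=[0]", 4)):
    nbytes = capacity * POINTER_SIZE
    print(label, nbytes, "bytes ->", allocator_for(nbytes))
```

Under these assumptions the empty case asks for 0 bytes and lands in the C runtime, while the one-element case asks for 32 bytes and stays inside pymalloc.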

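Now, all of this can be seen as a missed optimization, because for newsize=0 we could simply set ob_item to NULL, adjust the other members and return.

Let's try it out:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}

With this fix, the empty case becomes slightly faster (about 10%) than the a=[0] case, as expected.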

My claim that pymalloc is faster than the C runtime's memory manager for small sizes can easily be tested with bytes: if more than 512 bytes need to be allocated, pymalloc falls back to plain malloc:

print(bytes(479).__sizeof__())   #  512
%timeit bytes(479)               # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__())   #  513
%timeit bytes(480)               # 296 ns ± 24.8 ns

The actual difference is larger than the 50% shown (a jump that cannot be explained by a size change of one byte alone), because at least part of the time goes into initializing the bytes object and so on.

Here is a more direct comparison with the help of Cython:

%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))

def with_cmalloc(int size):
    cdef int i
    for i in range(1000):
        free(malloc(size))

and now

%timeit with_pymalloc(1)   #  15.8 µs ±  566 ns
%timeit with_cmalloc(1)    #  51.9 µs ± 2.17 µs

i.e. pymalloc is about 3 times faster (or about 35 ns per allocation). Note: some compilers would optimize free(malloc(size)) out, but MSVC doesn't.
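The %timeit lines need IPython. A plain-Python sketch of the bytes experiment, using only the standard timeit module, might look as follows; the helper name best_ns is made up, and the timings will of course depend on the machine and the C runtime.

```python
from timeit import repeat

def best_ns(stmt, number=200_000):
    """Best-of-5 time per call in nanoseconds."""
    return min(repeat(stmt, number=number)) / number * 1e9

for n in (479, 480):
    size = bytes(n).__sizeof__()
    print(f"bytes({n}): {size} bytes, {best_ns(f'bytes({n})'):.0f} ns")
```

If the explanation holds on your platform, the second line should show a jump in time that is out of proportion to the one extra byte.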

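As another example: some time ago I replaced the default allocator of a C++ std::map with pymalloc, which led to a speed-up of factor 4.

For profiling, the following script was used:

a=[0] # or a=[]
for _ in range(10000000):
    list(x for x in a)

together with Visual Studio's built-in performance profiler in release mode.

The a=[0] version needed 6.6 seconds (in the profiler), while the a=[] version needed 6.9 seconds (i.e. about 5% slower). After the "fix", a=[] needed only 5.8 seconds.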

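The share of time spent in list_resize and _PyObject_Realloc:

                     a=[0]     a=[]     a=[], fixed
list_resize           3.5%     10.2%        3%
_PyObject_Realloc     3.2%      9.3%        1%

Obviously there is variance from run to run, but the differences in running times are significant and can explain the lion's share of the observed time difference.

Note: the difference of 0.3 seconds for 10^7 allocations is about 30 ns per allocation, a number similar to the one we get for the difference between pymalloc's and the C runtime's allocations.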
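The two per-allocation estimates can be checked with a few lines of arithmetic. The inputs are the figures quoted in this answer (6.6 s and 6.9 s from the profiler, 15.8 µs and 51.9 µs from the Cython timings); the variable names are only illustrative.

```python
# Profiler runs: a=[] took 6.9 s, a=[0] took 6.6 s, for 10**7 iterations each.
profiler_gap_s = 6.9 - 6.6
iterations = 10**7
per_call_profiler_ns = profiler_gap_s / iterations * 1e9

# Cython timings: 1000 allocations per call, 51.9 µs (malloc) vs 15.8 µs (pymalloc).
cython_gap_us = 51.9 - 15.8
allocations_per_call = 1000
per_alloc_cython_ns = cython_gap_us * 1000 / allocations_per_call

print(f"{per_call_profiler_ns:.1f} ns, {per_alloc_cython_ns:.1f} ns")  # 30.0 ns, 36.1 ns
```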

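When verifying the above with a debugger, one must be aware that in debug mode Python uses a debug version of pymalloc, which appends additional data to the requested memory. In a debug build, pymalloc is therefore never asked to allocate 0 bytes, but 0 bytes plus the debug overhead, and there is no fallback to malloc. Thus one should either debug a release build, or switch to the release pymalloc in a debug build (there is probably an option for it, I just don't know it; the relevant parts of the code are here and here).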
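One knob that may be relevant here, although the answer itself does not name it: CPython documents a PYTHONMALLOC environment variable (available since 3.6) that selects the allocator, with values such as malloc, pymalloc and pymalloc_debug. The sketch below reruns the question's benchmark in a child interpreter under a chosen setting. The helper name time_case is hypothetical, and the expectation that forcing malloc narrows the gap between a = [] and a = [0] is a prediction from the explanation above, not a measured result.

```python
import os
import subprocess
import sys

# Script executed by the child interpreter: argv[1] is the setup, argv[2] the loop count.
BENCH = (
    "from timeit import repeat\n"
    "import sys\n"
    "n = int(sys.argv[2])\n"
    "t = min(repeat('list(x for x in a)', sys.argv[1], number=n)) / n\n"
    "print(t * 1e9)\n"
)

def time_case(setup, allocator, number=10**5):
    """Time list(x for x in a) in a child interpreter run with PYTHONMALLOC=allocator."""
    env = dict(os.environ, PYTHONMALLOC=allocator)
    out = subprocess.run(
        [sys.executable, "-c", BENCH, setup, str(number)],
        env=env, capture_output=True, text=True, check=True,
    )
    return float(out.stdout)

if __name__ == "__main__":
    for allocator in ("pymalloc", "malloc"):
        for setup in ("a = []", "a = [0]"):
            print(allocator, setup, f"{time_case(setup, allocator):.0f} ns")
```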
