How do I get the indices of lil_matrix elements in a for loop?

Problem description

I created a sparse matrix with scipy.sparse.lil_matrix:

import scipy.sparse as sp
test = sp.lil_matrix((3,3))
test[0,0]=1

I can iterate over it and print the nonzero elements like this:

for el in test:
    print(el)

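This prints (0,0) 1.0. How can I access those two pieces of information without printing them? In other words, what is the proper way to get the index and the value of each element of a lil_matrix? Calling el.data returns array([list([])],dtype=object).

Note that I am using lil_matrix because I need to assign nonzero values to it inside a very large double for loop.

Solution

The display you are after closely resembles the str display of a sparse matrix in coo format.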

In [216]: M = (sparse.random(5,5,.2)*10).astype(int)
In [217]: M
Out[217]: 
<5x5 sparse matrix of type '<class 'numpy.int64'>'
    with 5 stored elements in COOrdinate format>
In [218]: print(M)   # str(M)
  (0,0)    0
  (0,2)    8
  (1,3)    8
  (1,4)    8
  (4,4)    4

Sparse matrices have a nonzero method that returns the coordinates of the nonzero elements.

In [219]: M.nonzero()
Out[219]: (array([0, 1, 1, 4], dtype=int32), array([2, 3, 4, 4], dtype=int32))

In coo format the values are stored in three arrays:

In [220]: M.data,M.row,M.col
Out[220]: 
(array([0, 8, 8, 8, 4]),
 array([0, 0, 1, 1, 4], dtype=int32),
 array([0, 2, 3, 4, 4], dtype=int32))
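Since the three arrays line up entry by entry, one way to get (row, column, value) triplets without printing is to zip them together. The following is a minimal sketch; the small matrix is made up for illustration and is not the M from the session above.

```python
import numpy as np
from scipy import sparse

# Small illustrative matrix (values are arbitrary, chosen for this sketch).
M = sparse.coo_matrix(np.array([[0, 0, 8],
                                [0, 5, 0],
                                [4, 0, 0]]))

# M.row[k], M.col[k] and M.data[k] together describe the k-th stored entry.
triplets = [(int(i), int(j), int(v)) for i, j, v in zip(M.row, M.col, M.data)]
print(triplets)
```

The int() calls only convert NumPy scalars to plain Python integers for easier reading; they are not required.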

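The coo format places no restriction on the order of these elements. Duplicates are even allowed, although they are summed when the matrix is converted for display or to csr format.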
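The duplicate handling can be checked with a small sketch. The indices and values here are invented for the example; the point is only that two entries at the same coordinate are kept separately in coo and merged by tocsr.

```python
import numpy as np
from scipy import sparse

# Two entries share the coordinate (0, 1); coo stores both as given.
row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([3, 4, 5])

M = sparse.coo_matrix((data, (row, col)), shape=(2, 3))
stored_before = M.nnz      # counts every stored entry, duplicates included

C = M.tocsr()              # conversion sums the duplicates
stored_after = C.nnz
value = C[0, 1]            # 3 + 4

print(stored_before, stored_after, value)
```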

When we convert it to lil format, the data is stored in two object arrays of lists, with one list per row:

In [221]: Ml = M.tolil()
In [222]: Ml.data
Out[222]: 
array([list([0, 8]), list([8, 8]), list([]), list([]), list([4])], dtype=object)
In [223]: Ml.rows
Out[223]: 
array([list([0, 2]), list([3, 4]), list([]), list([]), list([4])], dtype=object)

It has nonzero as well, but look at the code (it goes through the coo format):

In [224]: Ml.nonzero()
Out[224]: (array([0, 1, 1, 4], dtype=int32), array([2, 3, 4, 4], dtype=int32))
In [225]: Ml.nonzero??
Signature: Ml.nonzero()
Source:   
    def nonzero(self):
         ...
        # convert to COOrdinate format
        A = self.tocoo()
        nz_mask = A.data != 0
        return (A.row[nz_mask],A.col[nz_mask])
File:      /usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py
Type:      method

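In fact, this is the generic nonzero shared by all sparse formats. The nz_mask step accounts for the fact that a matrix may hold explicit 0 values that have not yet been cleaned out.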
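A short sketch of what such an explicit zero looks like. It assumes a csr copy whose data array is overwritten by hand, which is just one convenient way to produce a stored zero for demonstration.

```python
from scipy import sparse

A = sparse.lil_matrix((3, 3))
A[0, 0] = 1.0
A[1, 2] = 2.0

C = A.tocsr()
C.data[0] = 0.0              # overwrite the stored value at (0, 0) with an explicit zero

stored = C.nnz               # the zero still occupies a storage slot
rows, cols = C.nonzero()     # the mask inside nonzero() filters it out

C.eliminate_zeros()          # drop explicit zeros from storage, in place
stored_clean = C.nnz

print(stored, rows, cols, stored_clean)
```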

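Although lil is designed to make element-by-element updates easy, we generally recommend building a matrix from coo-style input arrays whenever possible. Those arrays can usually be created more efficiently. Even list append or extend can be faster.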
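As a rough sketch of the list-append approach: collect row indices, column indices and values in three plain Python lists inside the double loop, then build the matrix once at the end. The size n and the rule deciding which entries are nonzero are hypothetical stand-ins for whatever the real loop computes.

```python
from scipy import sparse

n = 4                          # hypothetical matrix size
rows, cols, vals = [], [], []

for i in range(n):
    for j in range(n):
        if (i + j) % 3 == 0:   # hypothetical rule selecting the nonzero entries
            rows.append(i)
            cols.append(j)
            vals.append(float(i * n + j + 1))

# Build once from the coo-style triplets, then convert to csr for arithmetic.
M = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
print(M.nnz)
```

Passing shape explicitly avoids the matrix being sized by the largest index that happened to be appended.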

Looking at iteration over the Ml matrix: it creates one lil matrix per row:

In [230]: [x for x in Ml]
Out[230]: 
[<1x5 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in List of Lists format>,
 <1x5 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in List of Lists format>,
 <1x5 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in List of Lists format>,
 <1x5 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in List of Lists format>,
 <1x5 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in List of Lists format>]

We can display the data for each row:

In [231]: [((i,x.rows[0]),x.data[0]) for i,x in enumerate(Ml)]
Out[231]: 
[((0, [0, 2]), [0, 8]),
 ((1, [3, 4]), [8, 8]),
 ((2, []), []),
 ((3, []), []),
 ((4, [4]), [4])]

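Or filter out the empty rows:

In [232]: [((i,x.rows[0]),x.data[0]) for i,x in enumerate(Ml) if x.data[0]]
Out[232]: [((0, [0, 2]), [0, 8]), ((1, [3, 4]), [8, 8]), ((4, [4]), [4])]

We need to iterate again to separate out the elements within each row.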
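That second level of iteration can be written directly against the .rows and .data attributes, without creating a 1-row matrix per iteration. A minimal sketch follows; iter_lil is a helper name invented here, not part of scipy, and the extra entry at (2, 1) is added only to make the output more interesting.

```python
from scipy import sparse

def iter_lil(mat):
    """Yield (row, col, value) for every stored entry of a lil_matrix."""
    # mat.rows[i] holds the column indices of row i, mat.data[i] the matching values.
    for i, (cols, vals) in enumerate(zip(mat.rows, mat.data)):
        for j, v in zip(cols, vals):
            yield i, j, v

test = sparse.lil_matrix((3, 3))
test[0, 0] = 1
test[2, 1] = 5

entries = list(iter_lil(test))
print(entries)
```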

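When choosing between sparse and dense arrays, the rule of thumb is that the sparsity (here meaning the percentage of nonzero elements) should be below 10% for a sparse format to be worthwhile. But that depends heavily on your use case and concerns.

From a pure data-storage point of view, note that the coo format needs three numbers for each nonzero item, where a dense array needs only one number per element. Sparse matrix multiplication is relatively good in csr format. Other calculations that only touch the data values (such as sin) are also fairly efficient. But when the math has to compare the sparsity patterns of two matrices (as in addition and element-wise multiplication), sparse fares worse.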
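To make the storage argument concrete, the sketch below compares the bytes held by the three coo arrays with the bytes of the equivalent dense array. The 1000x1000 shape and 1% density are arbitrary choices for illustration, and the exact byte counts depend on the index dtype your scipy build uses.

```python
from scipy import sparse

# Arbitrary illustrative size and density.
M = sparse.random(1000, 1000, density=0.01, format='coo')

# coo keeps one value plus one row index and one column index per stored entry.
sparse_bytes = M.data.nbytes + M.row.nbytes + M.col.nbytes

# A dense array keeps one value for every element, zero or not.
dense_bytes = M.toarray().nbytes

print(M.nnz, sparse_bytes, dense_bytes)
```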

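Indexing, slicing, and summation may actually use matrix multiplication. The coo format does not implement these. lil handles some row-oriented operations well. The basic operation of creating a sparse matrix takes time.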

Everything is in .data and .rows.

from scipy import sparse
arr = sparse.random(10, 5, format='lil', density=0.5)

For a 10x5 array with 25 elements:

>>> arr
<10x5 sparse matrix of type '<class 'numpy.float64'>'
    with 25 stored elements in List of Lists format>

>>> arr.data.shape
(10,)

>>> arr.data
array([list([0.7656088763162588,0.7262695483137545]),list([0.5229054168281109,0.6329489698531673,0.9090750679268123]),list([0.3285250285217297,0.12678874412598085,0.49074613569184733]),list([0.9376762935882884]),list([0.7783159122917774]),list([0.8750078624527947,0.017065437987856757,0.7161352157970525]),list([0.6849637433019786,0.05732598765212671,0.09948536587262824]),list([0.5683250727980487,0.960851197599538,0.7540173942047833]),list([0.5891879469424754,0.7901005027272154,0.5829700379167293]),list([0.6266097436787399,0.8843420498719459,0.9040791506861361])],dtype=object)

Each element of the .data array is a list holding the values of that row.

>>> arr.rows
array([list([0, ...]), list([0, ...]), list([1, ...]), list([1]), list([3]), ...], dtype=object)

Each element of the .rows array is a list of the column indices of the nonzero values stored in .data for that row.
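Because the two arrays are parallel, they can also be flattened into coo-style index and value arrays in one pass. This is a sketch under the assumption that a NumPy-based flattening is acceptable; the variable names are made up for the example, and the random matrix will differ from the one shown above.

```python
import itertools
import numpy as np
from scipy import sparse

arr = sparse.random(10, 5, density=0.5, format='lil')

# Number of stored entries in each row, used to repeat the row index.
counts = [len(r) for r in arr.rows]
row_idx = np.repeat(np.arange(arr.shape[0]), counts)

# Chain the per-row lists into flat arrays of column indices and values.
col_idx = np.fromiter(itertools.chain.from_iterable(arr.rows), dtype=int)
values = np.fromiter(itertools.chain.from_iterable(arr.data), dtype=float)

# Rebuilding from the triplets gives back the same matrix.
rebuilt = sparse.coo_matrix((values, (row_idx, col_idx)), shape=arr.shape)
print(len(values), (rebuilt.toarray() == arr.toarray()).all())
```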

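"Note that I am using lil_matrix because I need to assign nonzero values to it inside a very large double for loop."

This is almost certainly not a good idea. The overhead of lil_matrix means that unless the matrix is less than 5% sparse (fewer than 5% of its entries nonzero), you are almost certainly better off filling a dense array. Even then it is iffy. It is a really terrible data storage format.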

Edit:

>>> for r in arr:
...     print(r.data)
...
[list([0.7656088763162588, 0.7262695483137545])]
[list([0.5229054168281109, 0.6329489698531673, 0.9090750679268123])]
[list([0.3285250285217297, 0.12678874412598085, 0.49074613569184733])]
[list([0.9376762935882884])]
[list([0.7783159122917774])]
[list([0.8750078624527947, 0.017065437987856757, 0.7161352157970525])]
[list([0.6849637433019786, 0.05732598765212671, 0.09948536587262824])]
[list([0.5683250727980487, 0.960851197599538, 0.7540173942047833])]
[list([0.5891879469424754, 0.7901005027272154, 0.5829700379167293])]
[list([0.6266097436787399, 0.8843420498719459, 0.9040791506861361])]

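Edit 2:

I don't know what your actual function or goal is, but if you know how many nonzero items there will be, you can preallocate the arrays you need and skip the whole lil thing.

import numpy as np
from scipy import sparse

# Hypothetical placeholders: substitute your own index sequences and value function.
row_ids = range(100)
col_ids = range(100)

def some_data_function():
    return 1.0

N = len(row_ids) * len(col_ids)    # number of entries to store (10000 here)
data = np.zeros(N)
rows = np.zeros(N, dtype=int)      # index arrays must be integer typed
cols = np.zeros(N, dtype=int)

for i, r in enumerate(row_ids):
    for j, c in enumerate(col_ids):
        idx = i * len(col_ids) + j # flat position of entry (i, j)
        data[idx] = some_data_function()
        rows[idx] = r
        cols[idx] = c

arr = sparse.csr_matrix((data, (rows, cols)))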
