"Inconsistent shapes" error when replacing a row of a csr_matrix after computing a bigram BOW

Problem description

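I am trying to compute the bigram BOW for each file and, on each iteration, replace one row of a scipy csr_matrix with it. Since I have 10868 files and at most 66049 BOW features, I defined the final matrix as bytebigram_vect = scipy.sparse.csr_matrix((10868,66049)). Each iteration should produce a result of shape (1, 66049), which I want to store as one row of bytebigram_vect. My code raises an "inconsistent shapes" error: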

vector = CountVectorizer(lowercase=False,ngram_range=(2,2),vocabulary=byte_bigram_vocab)
bytebigram_vect = scipy.sparse.csr_matrix((10868,66049))
for i in range(len(byte_file_name)):
  # Downloading only a single byte file from train.7z
  
  file_name=byte_file_name[i]
  !7z e train.7z -o/content/bytefiles *$file_name -r
  f = open('bytefiles/' + file_name)
  bytebigram_vect[i:]+= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))
  # Deleting the file
  os.remove(file_name)
  f.close() 

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-61-352aa3807cfd> in <module>()
      8   get_ipython().system('7z e train.7z -o/content/bytefiles *$file_name -r')
      9   f = open('bytefiles/' + file_name)
---> 10   bytebigram_vect[i:]+= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))
     11   # Deleting the file
     12   os.remove(file_name)

/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __add__(self,other)
    416         elif isspmatrix(other):
    417             if other.shape != self.shape:
--> 418                 raise ValueError("inconsistent shapes")
    419             return self._add_sparse(other)
    420         elif isdense(other):

ValueError: inconsistent shapes

Solution

bytebigram_vect[i]= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))

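Replacing just that one line inside the loop fixed it for me. The slice bytebigram_vect[i:] selects every row from i to the end, so its shape is (10868 - i, 66049) and it cannot be added to the (1, 66049) result. bytebigram_vect[i] selects only row i, whose shape matches.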
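
To make the shape difference concrete, here is a minimal sketch on a tiny matrix. The sizes and the fake_row helper are hypothetical stand-ins for the real 10868 x 66049 matrix and the CountVectorizer output, not part of the original code. It also shows one possible alternative, collecting the rows and stacking them once with scipy.sparse.vstack, which avoids the SparseEfficiencyWarning that row assignment into a csr_matrix usually triggers.

```python
import numpy as np
import scipy.sparse as sp

# Small stand-ins for the 10868 files and 66049 features in the question.
n_files, n_features = 4, 6

def fake_row(i):
    """Hypothetical stand-in for the vectorizer output: one 1 x n_features sparse row."""
    dense = np.zeros((1, n_features))
    dense[0, i % n_features] = i + 1
    return sp.csr_matrix(dense)

full = sp.csr_matrix((n_files, n_features))

# [i:] keeps every row from i to the end; [i] keeps a single row.
slice_shape = full[1:].shape  # rows 1, 2, 3
row_shape = full[1].shape     # row 1 only

# The fix from the answer: assign each result to exactly one row.
for i in range(n_files):
    full[i] = fake_row(i)
by_assignment = full

# Alternative (an assumption, not from the original post):
# build all rows first, then stack them into one CSR matrix.
rows = [fake_row(i) for i in range(n_files)]
by_stacking = sp.vstack(rows, format="csr")

print(slice_shape, row_shape, by_stacking.shape)
```

Both approaches should give the same matrix. For thousands of files, stacking once at the end is likely to be faster than modifying a csr_matrix row by row.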