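Problem description
I am trying to compute a bigram bag-of-words (BOW) vector for each file and replace one row of a scipy csr_matrix on each iteration. Since I have 10868 files and at most 66049 BOW features, I defined the final matrix as bytebigram_vect = scipy.sparse.csr_matrix((10868,66049)).
On each iteration I should get a vector of shape (1, 66049), which I want to write into the corresponding row of bytebigram_vect. My code raises an "inconsistent shapes" error.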
vector = CountVectorizer(lowercase=False,ngram_range=(2,2),vocabulary=byte_bigram_vocab)
bytebigram_vect = scipy.sparse.csr_matrix((10868,66049))
for i in range(len(byte_file_name)):
    # Downloading only a single byte file from train.7z
    file_name=byte_file_name[i]
    !7z e train.7z -o/content/bytefiles *$file_name -r
    f = open('bytefiles/' + file_name)
    bytebigram_vect[i:]+= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))
    # Deleting the file
    os.remove(file_name)
    f.close()
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-61-352aa3807cfd> in <module>()
8 get_ipython().system('7z e train.7z -o/content/bytefiles *$file_name -r')
9 f = open('bytefiles/' + file_name)
---> 10 bytebigram_vect[i:]+= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))
11 # Deleting the file
12 os.remove(file_name)
/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __add__(self,other)
416 elif isspmatrix(other):
417 if other.shape != self.shape:
--> 418 raise ValueError("inconsistent shapes")
419 return self._add_sparse(other)
420 elif isdense(other):
ValueError: inconsistent shapes
Solution
bytebigram_vect[i]= scipy.sparse.csr_matrix(vector.fit_transform([f.read().replace('\n',' ').lower()]))
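Replacing just that line fixed it for me. The slice bytebigram_vect[i:] selects every row from i to the end, so its shape is (10868 - i, 66049) and it cannot be added to the (1, 66049) result. bytebigram_vect[i] selects only row i, which matches the shape of the new vector.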
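For anyone hitting the same error, here is a minimal sketch of the shape mismatch, plus one alternative that avoids writing into a csr_matrix row by row: collect the per-file rows and stack them once with scipy.sparse.vstack. The sizes are shrunk from (10868, 66049) for illustration, and fake_row is a hypothetical stand-in for the CountVectorizer call, not part of the original code.

```python
import numpy as np
import scipy.sparse as sp

# Small stand-ins for the 10868 files and 66049 features in the question.
n_files, n_features = 4, 6


def fake_row(i):
    """Hypothetical stand-in for the vectorizer output: a 1 x n_features sparse row."""
    dense = np.zeros((1, n_features))
    dense[0, i] = i + 1
    return sp.csr_matrix(dense)


target = sp.csr_matrix((n_files, n_features))

# [i:] keeps every row from i to the end, so the shapes cannot match.
slice_shape = target[1:].shape   # (3, 6)
row_shape = fake_row(1).shape    # (1, 6)
print(slice_shape, row_shape)

# Alternative to row assignment: build all rows first, then stack once.
rows = [fake_row(i) for i in range(n_files)]
stacked = sp.vstack(rows, format="csr")
print(stacked.shape)             # (4, 6)
```

In the original loop this would mean appending each vectorizer result to a list and calling vstack after the loop. I have not benchmarked this on the full dataset, so treat it as an option to try rather than a guaranteed speedup.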