Problem description
I am trying to compute cosine_similarity between documents from two different datasets. Each set has 30 documents, and I am interested in matching similar documents between documents and documents2.
My approach so far looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
import glob
import codecs
from collections import defaultdict
from collections import Counter
from nltk import word_tokenize
import nltk
import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.metrics.pairwise import cosine_similarity
from contextlib import ExitStack
#max_df = 29 set to ignore terms that appear in more than 29 documents
#TFIDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=29)
#Create list of documents to work with
path = "C:\\Users\\path\\Desktop\\research\\dataset\\1"
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
documents = [os.path.join(path,name) for name in text_files]
with ExitStack() as stack:
    files = [stack.enter_context(open(filename, encoding="utf-8")).read() for filename in documents]
X = tfidf_vectorizer.fit_transform(files)
path2 = "C:\\Users\\path\\Desktop\\research\\dataset\\2"
text_files2 = [f for f in os.listdir(path2) if f.endswith('.txt')]
documents2 = [os.path.join(path2,name) for name in text_files2]
with ExitStack() as stack:
    files2 = [stack.enter_context(open(filename, encoding="utf-8")).read() for filename in documents2]
X2 = tfidf_vectorizer.fit_transform(files2)
#X = X.reshape(-1,1)
#X2 = X2.reshape(-1,1)
sm = cosine_similarity(X,X2)
I get the following error: ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 5068 while Y.shape[1] == 4479.
If I uncomment the reshape statements, I instead get numpy.core._exceptions.MemoryError: Unable to allocate 152. GiB for an array with shape (152040, 134370) and data type float64.
Any ideas?
Solution
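The ValueError most likely comes from calling fit_transform twice: each call learns a fresh vocabulary, so X has 5068 features while X2 has 4479, and cosine_similarity cannot compare them. The reshape(-1, 1) workaround only makes things worse, since it flattens each matrix into one giant column vector and cosine_similarity then tries to build a 152040×134370 result. A common fix is to fit the vectorizer once on the combined corpus and then transform each set separately, so both land in the same feature space. Below is a minimal sketch of that idea; the two-document lists are placeholders standing in for the files read from disk, and note that max_df=29 would then apply to the combined 60 documents, so the threshold may need revisiting:

```python
# Fit one shared vocabulary on both corpora, then transform each
# set separately so X and X2 have the same number of columns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; in the original question these come from
# reading the .txt files in the two dataset folders.
files = ["the cat sat on the mat", "dogs chase cats"]
files2 = ["a cat on a mat", "the dog sleeps"]

vectorizer = TfidfVectorizer()
vectorizer.fit(files + files2)       # learn one vocabulary over both sets
X = vectorizer.transform(files)      # shape: (len(files), n_terms)
X2 = vectorizer.transform(files2)    # shape: (len(files2), n_terms)

# Now the feature dimensions match, so this no longer raises ValueError.
sm = cosine_similarity(X, X2)        # shape: (len(files), len(files2))
print(sm.shape)
```

sm[i, j] is then the similarity between document i of the first set and document j of the second; for the 30-document sets in the question it would be a 30×30 matrix, and sm.argmax(axis=1) would give the best match in documents2 for each document in documents. An equivalent alternative is vectorizer.fit(files) followed by vectorizer.transform(files2), which scores the second set against the first set's vocabulary only.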