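Problem description
I cannot apply CountVectorizer to a dataset imported from Excel. I tried converting all the integers in the data to strings, but CountVectorizer still picks up integers.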
import numpy as np
import sklearn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
pos = pd.read_excel("/content/drive/My Drive/Polarity_pos.xlsx",header = None,names=None)
neg = pos = pd.read_excel("/content/drive/My Drive/Polarity_neg.xlsx",names=None)
merged_train = pd.merge(pos,neg)
string = merged_train.astype('str')
train=pd.DataFrame(data=string).replace('\d+','NUM',regex=True)
print(train.loc[19,:])
#analyzer='word',stop_words=None,analyzer = 'word'
vectorizer = cv()
count_vector = vectorizer.fit_transform(train)
This raises the following error:
AttributeError Traceback (most recent call last)
<ipython-input-116-adcd263d8e89> in <module>()
26 #analyzer='word',analyzer = 'word'
27 vectorizer = cv()
---> 28 count_vector = vectorizer.fit_transform(train)
29
30
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc,accent_function,lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'int' object has no attribute 'lower'
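Solution
You are probably passing the wrong input to CountVectorizer's fit_transform. It does not expect a DataFrame but "an iterable of raw text documents" (see the docs). Iterating over a DataFrame yields its column labels rather than its cell values, and those labels are presumably integers here, which would explain why the error still mentions 'int' after the cells were converted to strings. You can try flattening the DataFrame and then applying the vectorizer, but make sure that what you are doing still fits your problem. Try this: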
count_vector = vectorizer.fit_transform(train.stack())
train.stack() converts your DataFrame into a Series, so every cell becomes a separate document.
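To make the difference concrete, here is a minimal sketch that replaces the Excel files with a small made-up DataFrame (the sample sentences and the variable names labels, cells, rows, cell_vectorizer and row_vectorizer are assumptions for illustration, not part of the question). It shows what iterating over the DataFrame actually yields, then two possible ways to build the documents: one per cell via stack(), or one per row by joining the columns. Which of the two is appropriate depends on how your spreadsheet is laid out.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the Excel data: two text columns with integer labels.
train = pd.DataFrame([
    ["good movie", "would watch 2 times"],
    ["bad plot", "left after 10 minutes"],
])
train = train.astype(str).replace(r"\d+", "NUM", regex=True)

# Iterating over a DataFrame yields its column labels, not its cells.
# These integers are what CountVectorizer ends up calling .lower() on.
labels = list(train)

# Option 1: flatten the frame so that every cell is its own document.
cells = train.stack()
cell_vectorizer = CountVectorizer()
cell_counts = cell_vectorizer.fit_transform(cells)

# Option 2: join the columns of each row so that every row is one document.
rows = train.apply(" ".join, axis=1)
row_vectorizer = CountVectorizer()
row_counts = row_vectorizer.fit_transform(rows)

print(labels)
print(cell_counts.shape, row_counts.shape)
```

With stack() the matrix has one row per cell, so any labels you train against must be expanded to match. If each spreadsheet row is a single review, joining the row is likely closer to what a classifier such as Perceptron expects.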