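Problem description
I want to run a multinomial Naive Bayes algorithm to predict how many thumbs-up a review posted on Google Play will receive. The data was scraped from reviews of offline navigation apps.
I tried running the algorithm on lists, but that did not help; the input has to be strings. When I look at the lengths of the datasets, they appear equal.
When I copy and paste from Excel to Notepad and back, both documents appear to have the same number of rows.
Python, however, disagrees. This is the error I get: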
<ipython-input-15-e013b17d1a55> in <module>
18
19 #Split as training and testing sets
---> 20 xtrain,xtest,ytrain,ytest = train_test_split(tfidf,int_classes,test_size=0.2,random_state=0,stratify=thumbs)
21
22 #Build the model
~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays,**options)
2125 raise TypeError("Invalid parameters passed: %s" % str(options))
2126
-> 2127 arrays = indexable(*arrays)
2128
2129 n_samples = _num_samples(arrays[0])
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
291 """
292 result = [_make_indexable(X) for X in iterables]
--> 293 check_consistent_length(*result)
294 return result
295
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
255 if len(uniques) > 1:
256 raise ValueError("Found input variables with inconsistent numbers of"
--> 257 " samples: %r" % [int(l) for l in lengths])
258
259
ValueError: Found input variables with inconsistent numbers of samples: [139736,134145]
Here is the code:
import os
import nltk
import nltk.corpus
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel("All_Apps_unicode.xlsx",parse_dates=['date'])
df.head()
df.info()
#Read reviews
with open("app_reviews.txt","r",encoding="utf8") as reviews:
    descriptions = reviews.read().splitlines()
print("Sample review description :",descriptions[:2])
#Setup stopwords
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
#setup wordnet for lemmatization
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from sklearn.feature_extraction.text import TfidfVectorizer
#Custom tokenizer that will perform tokenization,stopword removal and lemmatization
def customtokenize(str):
    tokens=nltk.word_tokenize(str)
    nostop = list(filter(lambda token: token not in stopwords.words('english'),tokens))
    lemmatized=[lemmatizer.lemmatize(word) for word in nostop ]
    return lemmatized
#Generate TFIDF matrix
vectorizer = TfidfVectorizer(tokenizer=customtokenize)
tfidf=vectorizer.fit_transform(descriptions)
print("\nSample feature names identified : ",vectorizer.get_feature_names()[:25])
print("\nSize of TFIDF matrix : ",tfidf.shape)
#Loading the pre-built classifications for training
with open("app_thumbs.txt",'r',encoding="utf8") as thumbs:
    classifications = thumbs.read().splitlines()
#Create Labels and integer classes
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(classifications)
print("Classes found : ",le.classes_)
#Convert classes to integers for use with ML
int_classes = le.transform(classifications)
print("\nClasses converted to integers :",int_classes)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
#Split as training and testing sets
xtrain,xtest,ytrain,ytest = train_test_split(tfidf,int_classes,test_size=0.2,random_state=0,stratify=thumbs)
#Build the model
classifier= MultinomialNB().fit(xtrain,ytrain)
Solution
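Your df has an equal number of values across its columns. However, pandas assigns NaN to values that are missing from the data. I believe you have not handled the missing values, which is why the counts differ.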
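To illustrate how a DataFrame can look equally long in every column and still hold different numbers of usable values, here is a minimal sketch on toy data. The column names `review` and `thumbs` are assumptions made for the example; they are not taken from the original spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy data; the column names "review" and "thumbs" are hypothetical.
df = pd.DataFrame({
    "review": ["great maps", np.nan, "crashes offline", "ok"],
    "thumbs": ["3", "0", np.nan, "1"],
})

# len() counts rows, including rows that contain NaN.
n_rows = len(df)

# count() gives the number of non-missing values per column,
# isna().sum() the number of missing ones.
non_missing = df.count()
missing = df.isna().sum()

print("rows:", n_rows)
print("non-missing per column:")
print(non_missing)
print("missing per column:")
print(missing)
```

If the two text files were exported from columns like these, any step that skips empty cells in one column but not in the other could leave the lists with different lengths.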
You can read more about missing values and how to handle them here:
https://towardsdatascience.com/handling-missing-values-with-pandas-b876bf6f008f
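One possible way to keep the reviews and their labels aligned is to drop incomplete rows once and then take both inputs from the same cleaned DataFrame, instead of reading them from two separately exported text files. The sketch below assumes hypothetical `review` and `thumbs` columns and uses toy data; adapt the names to the actual spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy data; the column names "review" and "thumbs" are hypothetical.
df = pd.DataFrame({
    "review": ["great maps", np.nan, "crashes offline", "ok"],
    "thumbs": ["3", "0", np.nan, "1"],
})

# Drop every row in which the text or the label is missing,
# so both columns lose exactly the same rows.
clean = df.dropna(subset=["review", "thumbs"])

# Both lists now come from the same rows, in the same order.
descriptions = clean["review"].astype(str).tolist()
classifications = clean["thumbs"].astype(str).tolist()

# Make the length requirement of train_test_split explicit before calling it.
if len(descriptions) != len(classifications):
    raise ValueError("texts and labels are misaligned")

print(len(descriptions), len(classifications))
```

With the lists built this way, `descriptions` can be passed to the vectorizer and `classifications` to the label encoder as in the question's code.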