Preparing data for Multinomial Naive Bayes and running the algorithm - ValueError: Found input variables with inconsistent numbers of samples

Problem description

I want to run the Multinomial NB algorithm to predict how many thumbs-up a review posted on Google Play will receive. The data was scraped from reviews of an offline navigation app.

I tried to run the algorithm using lists, but that was no use; the input must be strings. When I check the lengths of the datasets, they look equal:
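For reference, TfidfVectorizer (used in the code below) expects an iterable of raw strings, one per document, rather than lists of tokens. A minimal sketch of coercing a DataFrame column, where the column name 'review' is only a hypothetical:

#Hypothetical column name; TfidfVectorizer wants one raw string per document
descriptions = df['review'].astype(str).tolist()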

(screenshot: the two datasets appear to have the same length)

When I copy-paste from Excel to Notepad and back between the two documents, they both appear to have the same number of rows.
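Rather than eyeballing the files in Notepad, here is a minimal sketch for comparing the line counts programmatically (file names taken from the code below):

with open("app_reviews.txt", "r", encoding="utf8") as f:
    n_reviews = len(f.read().splitlines())
with open("app_thumbs.txt", "r", encoding="utf8") as f:
    n_thumbs = len(f.read().splitlines())
#Should expose the 139736 vs 134145 mismatch reported below
print(n_reviews, n_thumbs)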

But Python doesn't agree; this is the error I get:

<ipython-input-15-e013b17d1a55> in <module>
     18 
     19 #Split as training and testing sets
---> 20 xtrain,xtest,ytrain,ytest = train_test_split(tfidf,int_classes,test_size=0.2,random_state=0,stratify=thumbs)
     21 
     22 #Build the model

~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays,**options)
   2125         raise TypeError("Invalid parameters passed: %s" % str(options))
   2126 
-> 2127     arrays = indexable(*arrays)
   2128 
   2129     n_samples = _num_samples(arrays[0])

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    291     """
    292     result = [_make_indexable(X) for X in iterables]
--> 293     check_consistent_length(*result)
    294     return result
    295 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    255     if len(uniques) > 1:
    256         raise ValueError("Found input variables with inconsistent numbers of"
--> 257                          " samples: %r" % [int(l) for l in lengths])
    258 
    259 

ValueError: Found input variables with inconsistent numbers of samples: [139736,134145]

Here is the code:

import os
import nltk
import nltk.corpus 
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_excel("All_Apps_unicode.xlsx",parse_dates=['date'])

df.head()

df.info()

#Read reviews
with open ("app_reviews.txt","r",encoding="utf8") as reviews:  
    descriptions = reviews.read().splitlines()
print("Sample review description :",descriptions[:2])

#Setup stopwords (nltk and stopwords are already imported above)
#nltk.download('stopwords')

#setup wordnet for lemmatization
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from sklearn.feature_extraction.text import TfidfVectorizer

#Custom tokenizer that performs tokenization, stopword removal and lemmatization
def customtokenize(text):  #renamed from 'str' to avoid shadowing the built-in
    tokens = nltk.word_tokenize(text)
    nostop = [token for token in tokens if token not in stopwords.words('english')]
    lemmatized = [lemmatizer.lemmatize(word) for word in nostop]
    return lemmatized

#Generate TFIDF matrix
vectorizer = TfidfVectorizer(tokenizer=customtokenize)
tfidf=vectorizer.fit_transform(descriptions)

print("\nSample feature names identified : ",vectorizer.get_feature_names()[:25])
print("\nSize of TFIDF matrix : ",tfidf.shape)

#Loading the pre-built classifications for training
with open("app_thumbs.txt",'r',encoding="utf8") as thumbs:  
    classifications = thumbs.read().splitlines()
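#Sanity check (suggested addition): there must be exactly one label line
#per review line, otherwise train_test_split raises the ValueError above
print("Reviews:", len(descriptions), " Labels:", len(classifications))
assert len(descriptions) == len(classifications), "reviews and labels are misaligned"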

#Create Labels and integer classes
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(classifications)
print("Classes found : ",le.classes_)

#Convert classes to integers for use with ML
int_classes = le.transform(classifications)
print("\nClasses converted to integers :",int_classes)

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#Split as training and testing sets
xtrain,xtest,ytrain,ytest = train_test_split(tfidf,int_classes,test_size=0.2,random_state=0,stratify=thumbs)

#Build the model
classifier = MultinomialNB().fit(xtrain, ytrain)

Solution

Your df has the same number of values in every column, but pandas fills positions that are missing in the data with NaN. I believe you have not handled those missing values yet, which is why your two inputs end up with different sample counts (139736 reviews vs. 134145 labels).
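A minimal sketch of how to inspect and drop the missing rows before exporting the two text files; the column names 'review' and 'thumbs' are assumptions, substitute whatever your DataFrame actually uses:

import pandas as pd

df = pd.read_excel("All_Apps_unicode.xlsx", parse_dates=['date'])

#Count missing values per column to locate the 5,591-row gap
#(139736 - 134145) between the two inputs
print(df.isna().sum())

#Drop rows where either the review text or the label is missing, so the
#two exported files stay aligned row-for-row
df = df.dropna(subset=['review', 'thumbs'])

After dropping those rows, both files contain the same number of lines, len(descriptions) equals len(classifications), and the train_test_split call goes through.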

Read more about missing values and how to handle them here:

https://towardsdatascience.com/handling-missing-values-with-pandas-b876bf6f008f