Problem description
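from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, articles):
        # Tokenize the text, then lemmatize each token (lemmatize() treats tokens as nouns by default).
        result = [self.wnl.lemmatize(t) for t in word_tokenize(articles)]
        # print(result)
        return result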
and after some preprocessing steps:
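# Replace missing values with empty strings.
descript_data = descript_data.fillna('')
# Replace digits, 1-2 character words, and punctuation with spaces.
# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching.
descript_data = descript_data.str.replace(r'\d+', ' ', regex=True)
descript_data = descript_data.str.replace(r'(\b\w{1,2}\b)', ' ', regex=True)
descript_data = descript_data.str.replace(r'[^\w\s]', ' ', regex=True)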
I ran the following:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(strip_accents='ascii', stop_words='english', lowercase=True,
                       max_df=0.8, min_df=10, analyzer='word', tokenizer=LemmaTokenizer())
final = vect.fit_transform(descript_data)
print(vect.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0
where descript_data is a column of text data.
The resulting feature names still contain both the base words and their suffixed forms, ending in "s", "ly", and so on.
How can I fix this?
Solution
Running this code on an example sentence works without any problem, so the issue probably lies in the preprocessing steps.