Why do the resulting features show that the text was not lemmatized?

Problem description

Using the following custom tokenizer

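from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, articles):
        # Split the document into tokens, then lemmatize each token
        result = [self.wnl.lemmatize(t) for t in word_tokenize(articles)]
        return result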

and after some preprocessing steps

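# Replace missing values with empty strings
descript_data = descript_data.fillna('')
# regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default
descript_data = descript_data.str.replace(r'\d+', ' ', regex=True)            # digit runs
descript_data = descript_data.str.replace(r'\b\w{1,2}\b', ' ', regex=True)    # words of 1-2 characters
descript_data = descript_data.str.replace(r'[^\w\s]', ' ', regex=True)        # punctuation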

I ran the following

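from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(strip_accents='ascii', stop_words='english', lowercase=True,
                       max_df=0.8, min_df=10, analyzer='word', tokenizer=LemmaTokenizer())

final = vect.fit_transform(descript_data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
print(vect.get_feature_names_out())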

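where descript_data is the column of text data. The resulting features still contain both the original words and their suffixed forms, ending in "s", "ly", and so on. How can I fix this?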

Solution

Running the code from the question on an example sentence, I see no problem. The issue may lie in the preprocessing steps.
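One way to check the preprocessing is to apply the same three substitutions to a single sentence and look at exactly what the tokenizer would receive. The following is a minimal sketch using only Python's `re` module rather than pandas. The `preprocess` helper and the sample sentence are hypothetical, introduced here for illustration only.

```python
import re

def preprocess(text):
    """Apply the question's three substitutions, in the same order, to one string.

    Hypothetical helper: it mirrors the pandas .str.replace calls with re.sub.
    """
    text = re.sub(r'\d+', ' ', text)            # drop runs of digits
    text = re.sub(r'\b\w{1,2}\b', ' ', text)    # drop whole words of one or two characters
    text = re.sub(r'[^\w\s]', ' ', text)        # drop punctuation
    return text

# Made-up sample sentence, not taken from the original data
sample = "The 3 cats' toys were quickly moved to 2 boxes."
cleaned = preprocess(sample)
print(cleaned.split())
# ['The', 'cats', 'toys', 'were', 'quickly', 'moved', 'boxes']
```

In this sketch, words such as `cats` and `quickly` pass through preprocessing unchanged, so whether they are reduced depends entirely on the lemmatizer. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is given, so it may be worth checking whether the suffixed forms that remain are verbs, adjectives, or adverbs.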
