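Working on a sentiment analyzer using Naive Bayes on the IMDB dataset; the get_reviews function below, which reads reviews from the dataset, raises an error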

Problem description

When I call the get_reviews function, it raises an error at .decode(), saying that f.read().decode() has no decode function. When I remove the str, it gives me a different error about specifying an error.

Code that produces the error

    import os
    import re

    # token_regex is defined elsewhere in the asker's code (not shown here)
    def get_reviews(dirname, positive=True):
        label = 1 if positive else 0
        reviews = []
        for filename in os.listdir(dirname):
            if filename.endswith(".txt"):
                with open(dirname + filename, "r+") as f:
                    review = f.read().decode('utf-8')  # decode the text as UTF-8 (this call raises the error)
                    review = review.lower().replace("<br />", " ")
                    review = re.sub(token_regex, "", review)
                    # store the review text together with a label saying
                    # whether it is a positive or a negative review
                    reviews.append([review, label])
        return reviews

This is the second error, the one I get when I try removing .decode():

AttributeError                            Traceback (most recent call last)
<ipython-input-6-92e2ebb79bdf> in <module>()
----> 1 positive_reviews,negative_reviews=extract_reviews()

<ipython-input-5-233b24b569a3> in extract_reviews()
     22             tar.extractall()
     23             tar.close()
---> 24     positive_reviews = get_reviews("aclimdb/train/pos/",positive = True)
     25     negative_reviews = get_reviews("aclimdb/train/neg/",positive=False)
     26 

<ipython-input-5-233b24b569a3> in get_reviews(dirname,positive)
      7             with open(dirname + filename,"r+") as f:
      8 
----> 9                 review = f.read().decode('utf-8')#we decoding text as utf 8
     10                 review = review.lower().replace("<br />"," ")#converting it to lower case and removing spaces
     11                 review = re.sub(token_regex,review) #and surbbing the sentenses having special characters

AttributeError: 'str' object has no attribute 'decode'

Solution

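When reading the file, use utf-8-sig instead of utf-8. In Python 3, a file opened in text mode already returns str from f.read(), and str has no .decode() method, so pass the encoding to open() and drop the .decode() call. That should solve the problem.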
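
For illustration, here is a minimal sketch of how get_reviews might look with that change applied. It assumes Python 3. The TOKEN_REGEX pattern is hypothetical, since the original token_regex is not shown in the question, and os.path.join and sorted() are small additions that are not part of the original code.

```python
import os
import re

# Hypothetical pattern: the asker's token_regex is not shown in the question.
# This one removes everything except lowercase letters, digits and whitespace.
TOKEN_REGEX = re.compile(r"[^a-z0-9\s]")


def get_reviews(dirname, positive=True):
    """Return a list of [review_text, label] pairs for each .txt file in dirname."""
    label = 1 if positive else 0
    reviews = []
    for filename in sorted(os.listdir(dirname)):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(dirname, filename)
        # Text mode with an explicit encoding: f.read() already returns str,
        # so no .decode() call is needed. utf-8-sig also skips a leading BOM.
        with open(path, "r", encoding="utf-8-sig") as f:
            review = f.read()
        review = review.lower().replace("<br />", " ")
        review = TOKEN_REGEX.sub("", review)
        reviews.append([review, label])
    return reviews
```

It would be called the same way as in the traceback, for example get_reviews("aclimdb/train/pos/", positive=True).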