Streaming with Tweepy: converting Unicode escape sequences to letters

Problem description

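The tweets I capture while streaming with Tweepy contain Unicode escape sequences in place of special characters, and I need them as the actual letters. I found many solutions on this site, but since I am collecting tweets in real time, none of them seems to apply to my case. Can anyone help?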

Here is my code:

from urllib3.exceptions import ProtocolError
from tweepy import Stream
from tweepy.auth import OAuthHandler
from tweepy.streaming import StreamListener
import time

ckey = 'your code here'
csecret = 'your code here'
atoken = 'your code here'
asecret = 'your code here'

class listener(StreamListener):
    
    def on_data(self,data):
        while True:
            try:
                #print (data)
                tweet = data.split(',"text":"')[1].split('","')[0]
                tweet2 = data.split(',"screen_name":"')[1].split('","location')[0]
                print (tweet2,tweet)
                saveFile = open ('test.csv','a')
                saveFile.write('@')
                saveFile.write(tweet2)
                saveFile.write(';')
                saveFile.write(tweet)
                saveFile.write('\n')
                saveFile.close()
                return True
        
            except ProtocolError:
                continue
            except BaseException as e:
                print ('Failed on data',str(e))
                break
    
    def on_error(self,status):
        print (status)

auth = OAuthHandler(ckey,csecret)
auth.set_access_token(atoken,asecret)
twitterStream = Stream(auth,listener())
twitterStream.filter(track=['keyword'])

This is the output for the keyword "fluminense":

adrianabpadilha Impressionante como mesmo com poucas op\u00e7\u00f5es para o banco o Burro s\u00f3 me sobe o Wisney e o Higor! Pq n\u00e3o levar o Pato\u2026 https:\/\/t.co\/lO4CJJsaaP
Miguel_Aalmeida RT @pulligffc: O Fluminense em dia de jogo olha pra mim e faz isso
TRANquilINHO3 Time fdpt \ud83d\ude20
LeleoCasttroo @jrmenini @FFvinho Palmeiras e Fluminense ainda tiveram a base como fonte de renda,atl\u00e9tico n\u00e3o revela um jogador\u2026 https:\/\/t.co\/ZF8awS6pDt
SouzaArthur6 @CezarSabia @andreisilvasoar @ndrzej87 @futebol_info C\u00e9zar,existe um tempo certo de testagem,q se d\u00e1 no 5\u00b0 da doe\u2026 https:\/\/t.co\/zmBlBzafdo
Thomasrodrigue_ @renatojr_07 \u00c9 o mesmo exemplo da final da ta\u00e7a rio,a \u00fanica coisa que muda \u00e9 que na final n\u00e3o tinha jogador contam\u2026 https:\/\/t.co\/3Q2nCBw9XS

As you can see, some characters such as "ç" and "õ" show up as "\u00e7" and "\u00f5", respectively.

Thanks!

Solution

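This happens because of how the characters are encoded: the string holds escape sequences such as \u00e7 rather than the characters themselves. You can decode the string with the unicode_escape encoding.

For example: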

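s = r'\u00e7'
print(s)
# output: \u00e7
# Python 3: str has no decode(), so go through bytes first
# (in Python 2, s.decode('unicode-escape') works directly)
print(s.encode('ascii').decode('unicode-escape'))
# output: ç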
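
An alternative worth considering: the data passed to on_data is raw JSON text, and the \uXXXX sequences are JSON escapes, so parsing it with the standard json module decodes them in one step. That also handles cases where unicode-escape falls short on Python 3, such as the \ud83d\ude20 surrogate pair in the output above (which it may leave as two separate surrogates) and the "\/" in URLs. The sketch below is only an illustration under assumptions: extract_tweet and write_row are hypothetical helper names, the two sample payloads are heavily trimmed stand-ins modeled on the question's output, and it assumes status messages carry "text" and "user"/"screen_name" keys.

```python
import csv
import io
import json


def extract_tweet(raw):
    """Return (screen_name, text) for a status payload, or None otherwise."""
    message = json.loads(raw)  # also decodes \uXXXX escapes and "\/"
    if "text" not in message or "user" not in message:
        return None  # assumed: non-status messages (e.g. limit notices) lack these keys
    return message["user"]["screen_name"], message["text"]


def write_row(fileobj, screen_name, text):
    """Write one '@name;text' row; csv quotes any ';' or newline inside the text."""
    csv.writer(fileobj, delimiter=";").writerow(["@" + screen_name, text])


# Hypothetical, heavily trimmed payloads modeled on the output in the question.
SAMPLE_ACCENTS = r'{"text":"Pq n\u00e3o levar o Pato\u2026 https:\/\/t.co\/lO4CJJsaaP","user":{"screen_name":"adrianabpadilha"}}'
SAMPLE_EMOJI = r'{"text":"Time fdpt \ud83d\ude20","user":{"screen_name":"TRANquilINHO3"}}'

# StringIO stands in for open('test.csv', 'a', encoding='utf-8', newline='')
buffer = io.StringIO()
for raw in (SAMPLE_ACCENTS, SAMPLE_EMOJI):
    parsed = extract_tweet(raw)
    if parsed is not None:
        write_row(buffer, *parsed)

rows = buffer.getvalue().splitlines()
# rows[0] is now '@adrianabpadilha;Pq não levar o Pato… https://t.co/lO4CJJsaaP'
```

Inside the listener, on_data could call extract_tweet(data) in place of the two data.split(...) lines and skip the message when it returns None. Opening the output file with an explicit encoding='utf-8' keeps the accented characters intact regardless of the platform default.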