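Problem description
I am trying to clean my data by removing URLs with regular expressions, along with several other things. For that I have the following function: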
import re

def depure(data):
    '''
    input : data
    output: data without URLs, emails, newline characters and single quotes
    '''
    # Remove URLs with a regular expression (not sure if they exist)
    regex = r'https?://\S+|www\.\S+'
    url_pattern = re.compile(regex)
    data = url_pattern.sub(r'', data)
    # Remove emails
    data = re.sub(r'\S*@\S*\s?', '', data)
    # Remove new line characters
    data = re.sub(r'\s+', ' ', data)
    # Remove distracting single quotes
    data = re.sub(r"'", "", data)
    return data
train_temp = []
test_temp = []
# transform data sequences to lists
train_to_list = train_data.tolist()
test_to_list = test_data.tolist()
# for train data
for i in range(len(train_data)):
    train_temp.append(depure(train_data[i]))
train_words = list(sent_to_words(train_temp))
new_train = []
for i in range(len(train_words)):
    new_train.append(detokenize(train_data[i]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-1e848ac6a862> in <module>()
      7 #for train data
      8 for i in range(len(train_data)):
----> 9     train_temp.append(depure(train_data[i]))
     10 train_words = list(sent_to_words(train_temp))
     11 new_train = []

1 frames
/usr/lib/python3.7/re.py in sub(pattern, repl, string, count, flags)
    192     a callable, it's passed the Match object and must return
    193     a replacement string to be used."""
--> 194     return _compile(pattern, flags).sub(repl, string, count)
    195
    196 def subn(pattern, repl, string, count=0, flags=0):

TypeError: cannot use a string pattern on a bytes-like object
Can anyone help me?
Solution
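The traceback says that a str pattern was applied to a bytes-like object, which suggests that the elements of train_data are bytes rather than str. One possible way around this is to decode each element before running the substitutions. The following is a minimal sketch, not a confirmed fix: it assumes the bytes are UTF-8 encoded text, the helper name to_text is hypothetical, and the sample input is made up for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def to_text(data, encoding='utf-8'):
    """Hypothetical helper: return data as str, decoding bytes-like input.

    Assumes the bytes hold text in the given encoding (UTF-8 by default);
    undecodable bytes are replaced instead of raising an error.
    """
    if isinstance(data, (bytes, bytearray)):
        return bytes(data).decode(encoding, errors='replace')
    return str(data)


def depure(data):
    """Remove URLs, emails, extra whitespace and single quotes from data."""
    # Make sure we work on str so that the str patterns below are valid
    data = to_text(data)
    # Remove URLs
    data = URL_PATTERN.sub('', data)
    # Remove emails
    data = re.sub(r'\S*@\S*\s?', '', data)
    # Collapse newlines and repeated whitespace into a single space
    data = re.sub(r'\s+', ' ', data)
    # Remove single quotes
    data = re.sub(r"'", '', data)
    return data


# Made-up sample, given as bytes to mimic the failing case
sample = b"Visit https://example.com now,\nit's great! mail me@example.com bye"
cleaned = depure(sample)
print(cleaned)
```

An alternative would be to keep the data as bytes and use bytes patterns such as rb'\s+' with bytes replacements, but decoding once up front keeps the rest of the pipeline working with ordinary strings.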