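Problem description
I want to catch groups of text that form whole words in natural language (groups of text separated by whitespace are considered words). For example, when I search a text file for the word "is", the "is" inside "sister" is detected even though the file does not contain the word "is". I know a little about lexical analysis but could not apply it to my project. Can someone provide Python code for this case?
Here is the code I am using, which causes the problem described above.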
    words_to_find = ("test1","test2","test3")
    line = 0
    # User_Input.txt is a file saved on my computer which I used as the input of the system
    with open("User_Input.txt","r") as f:
        txt = f.readline()
        line += 1
        for word in words_to_find:
            if word in txt:
                print(F"Word: '{word}' found at line {line},"
                      F"pos: {txt.index(word)}")
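Solutions
You should use spaCy to tokenize your text, because natural language tends to be tricky, with exceptions and exceptions to the exceptions: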
    from spacy.lang.en import English

    words_to_find = ("test1", "test2", "test3")
    nlp = English()
    # Use the tokenizer of the blank English pipeline,
    # including its punctuation rules and exceptions
    tokenizer = nlp.tokenizer
    with open("User_Input.txt", "r") as f:
        for line, txt_line in enumerate(f, start=1):
            for token in tokenizer(txt_line):
                # token.idx is the character offset of the token within the line
                if token.text in words_to_find:
                    print(f"Word {token.text} found at line {line}; pos: {token.idx}")
Alternatively, you can use TextBlob as follows:
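    from textblob import TextBlob

    with open("User_Input.txt", "r") as f:
        for line, txt_line in enumerate(f, start=1):
            blob = TextBlob(txt_line)
            # blob.words is the list of word tokens with punctuation removed
            for index, word in enumerate(blob.words):
                print(f"Word {word} found in position {index} at line {line}")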
Use nltk to tokenize your text in a robust way. Also keep in mind that words in the text may be in mixed case, so convert them to lowercase before searching.
import nltk
words = nltk.word_tokenize(txt.lower())
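As a rough illustration of the lowercase-then-compare idea, here is a minimal sketch that does not require NLTK. It assumes words are separated by whitespace, and it uses str.split together with str.strip(string.punctuation) as a simple stand-in for nltk.word_tokenize, which handles many more cases. The helper name find_words_ci is hypothetical.

```python
import string


def find_words_ci(text, words_to_find):
    """Return the search words that occur as whole words in text, ignoring case."""
    # Lowercase, split on whitespace, and trim leading/trailing punctuation.
    tokens = {tok.strip(string.punctuation) for tok in text.lower().split()}
    # Compare each lowercased search word against the set of tokens.
    return [w for w in words_to_find if w.lower() in tokens]


print(find_words_ci("My Sister said: this IS a test.", ["is", "test", "sis"]))
# prints ['is', 'test']
```

Because the comparison is against whole tokens, "sis" is not reported even though it appears inside "Sister".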
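I usually use a regular expression, specifically the \b term, which means "word boundary", to separate words from other arbitrary characters. Here is an example: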
    import re
    # words with arbitrary characters in between
    data = """now is; the time for,all-good-men
    to come\t to the,aid of
    their... country"""
    exp = re.compile(r"\b\w+")
    pos = 0
    while True:
        m = exp.search(data,pos)
        if not m:
            break
        print(m.group(0))
        pos = m.end(0)
Result:
now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
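The same \b idea can be applied directly to the search in the question. The following is a minimal sketch, assuming the search terms start and end with word characters (as test1, test2, and test3 do); the helper name find_whole_words and the sample lines are hypothetical.

```python
import re


def find_whole_words(lines, words_to_find):
    """Yield (word, line_number, position) for each whole-word match."""
    # \b on both sides keeps "is" from matching inside "sister".
    patterns = {w: re.compile(r"\b" + re.escape(w) + r"\b") for w in words_to_find}
    for line_no, text in enumerate(lines, start=1):
        for word, pattern in patterns.items():
            for m in pattern.finditer(text):
                yield word, line_no, m.start()


sample = ["my sister is here", "this is it"]
for word, line_no, pos in find_whole_words(sample, ["is"]):
    print(f"Word: '{word}' found at line {line_no}, pos: {pos}")
# prints:
# Word: 'is' found at line 1, pos: 10
# Word: 'is' found at line 2, pos: 5
```

To run it on a file, pass the open file object as lines, since iterating over a file yields one line at a time.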
You can use RegEx:
    import re
    words_to_find = ["test1","test2","test3"] # converted this to a list to use `in`
    line = 0
    with open("User_Input.txt","r") as f:
        txt = f.readline()
        line += 1
        rx = re.findall(r'(\w+)',txt) # rx will be a list containing all the words in `txt`
        # you can iterate for every word in a line
        for word in rx: # for every word in the RegEx list
            if word in words_to_find: print(word)
        # or you can iterate through your search case only
        # note that this will find only the first occurrence of each word in `words_to_find`
        for word in words_to_find: # `test1`,`test2`,`test3`...
            if word in rx: print(word) # if `test1` is present in this line's list of words...
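The code above applies the RegEx (\w+) to your text string and returns a list of matches. Here the RegEx matches every run of word characters (letters, digits, and underscores), so words are separated at whitespace and at punctuation.
Useful resources: Debuggex for testing RegExes, and the Python RegEx documentation and RegExr for learning more about regular expressions.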
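To make the difference concrete, here is a small sketch contrasting a substring test on the raw string with a membership test on the list returned by re.findall. The sample sentence is an assumption chosen to mirror the "sister" example from the question.

```python
import re

txt = "my sister was here"

# Substring test on the raw string: "is" is found inside "sister".
substring_hit = "is" in txt
print(substring_hit)  # True

# Membership test on the list of words: only whole words count.
rx = re.findall(r"\w+", txt)
word_hit = "is" in rx
print(rx)        # ['my', 'sister', 'was', 'here']
print(word_hit)  # False
```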
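If you are trying to find the words test1, test2, or test3 in a text file, you do not need to increment the line value manually. Assuming the text file contains each word on a separate line, the following code works: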
    words_to_find = ("test1","test3")
    with open("User_Input.txt","r") as f:
        # enumerate gives the correct line number even when lines repeat
        for line_no, line in enumerate(f, start=1):
            txt = line.strip('\n')
            for word in words_to_find:
                if word in txt:
                    print(f"Word: '{word}' found at line {line_no}, pos: {txt.index(word)}")
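I am not sure what the position is supposed to represent.
I think you just need to put spaces in the string argument, for example by searching for " is " instead of "is".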
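One possible reading of this suggestion is sketched below. It assumes that a word is anything delimited by whitespace, so the line is also padded with spaces to catch words at its start or end; the helper name contains_word is hypothetical.

```python
def contains_word(txt, word):
    """Return True if word appears in txt with whitespace on both sides."""
    # Collapse all whitespace runs (tabs, newlines) to single spaces and pad
    # both ends so that words at the start or end of the line also match.
    padded = " " + " ".join(txt.split()) + " "
    return f" {word} " in padded


print(contains_word("my sister was here", "is"))  # False
print(contains_word("this is it\n", "is"))        # True
print(contains_word("it is, I think", "is"))      # False: "is," keeps its comma
```

The last call shows the limitation of this approach: punctuation attached to a word prevents the match, which is where a tokenizer or the \b pattern is more robust.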