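What is the Python code to read only whole words from a text file (lexical analysis that detects only whole words)?

Problem description

I want to match groups of text that form whole words in natural language (a group of text separated by spaces is treated as a word). For example, when I search a text file for the word "is", the "is" inside the word "sister" is detected even though the file does not contain the word "is" on its own. I know a little about lexical analysis but have not been able to apply it to my project. Could someone provide Python code for this case?

This is the code I am using, but it causes the problem described above.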

words_to_find = ("test1","test2","test3")
line = 0
# User_Input.txt is a file saved on my computer, which I use as the input to the system
with open("User_Input.txt","r") as f:
    txt = f.readline()
    line += 1
    for word in words_to_find:
        if word in txt:
            print(F"Word: '{word}' found at line {line},"
                  F"pos: {txt.index(word)}")

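Solutions

You should use spaCy to tokenize the text, because natural language tends to be tricky, with exceptions and exceptions to the exceptions: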

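from spacy.lang.en import English

nlp = English()
# English() comes with a tokenizer using the default settings for English,
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer
with open("User_Input.txt", "r") as f:
    for line, txt_line in enumerate(f, start=1):
        for token in tokenizer(txt_line):
            if not token.is_space:  # skip the newline token at the end of each line
                # token.idx is the character offset of the token within the line
                print(f'Word {token.text} found at line {line}; pos: {token.idx}')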

Alternatively, you can use TextBlob as follows:

from textblob import TextBlob

with open("User_Input.txt", "r") as f:
    for line, txt_line in enumerate(f, start=1):
        blob = TextBlob(txt_line)  # TextBlob expects a string, so process one line at a time
        for index, word in enumerate(blob.words):
            print(f'Word {word} found in position {index} at line {line}')

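Use NLTK to tokenize your text in a robust way. Also, keep in mind that words in the text may be in mixed case. Convert them to lowercase before searching.

import nltk
# txt is the text to search, as a single string;
# word_tokenize needs the NLTK Punkt tokenizer data to be downloaded once
words = nltk.word_tokenize(txt.lower())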

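I usually use regular expressions for this, in particular the \b token, which means "word boundary", to separate words from other arbitrary characters. Here is an example: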

import re
 
# words with arbitrary characters in between
data = """now is;  the time for,all-good-men
to come\t to the,aid of 
their... country"""

exp = re.compile(r"\b\w+")

pos = 0
while True:
    m = exp.search(data,pos)
    if not m:
        break
    print(m.group(0))
    pos = m.end(0)

Result:

now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
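The same word-boundary idea can be turned around to search for specific words, which is what the question asks for. The following is only a minimal sketch: the helper name find_whole_words and the sample lines are made up for illustration, and the text is assumed to be available as a list of lines.

```python
import re

def find_whole_words(lines, words_to_find):
    """Return (word, line_number, position) for every whole-word match."""
    # re.escape guards against regex metacharacters in the search words;
    # \b on both sides rejects matches inside longer words such as "sister".
    alternatives = "|".join(re.escape(w) for w in words_to_find)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    hits = []
    for line_no, text in enumerate(lines, start=1):
        for m in pattern.finditer(text):
            hits.append((m.group(0), line_no, m.start()))
    return hits

sample = ["my sister is here", "this is it"]
for word, line_no, pos in find_whole_words(sample, ["is"]):
    print(f"Word: '{word}' found at line {line_no}, pos: {pos}")
```

Neither "sister" nor "this" produces a hit here, because there is no word boundary directly before the embedded "is". Note that \b relies on the search word starting and ending with a word character, so this sketch would not suit search terms such as "c++".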

You can use a RegEx:

import re

words_to_find = ["test1","test2","test3"] # the words to search for
line = 0
with open("User_Input.txt","r") as f:
  txt = f.readline()
  line += 1
  rx = re.findall(r'(\w+)',txt) # rx will be a list containing all the words in `txt`

  # you can iterate over every word in the line
  for word in rx: # for every word in the RegEx list
    if word in words_to_find: print(word)

  # or you can iterate through your search terms only
  # note that this reports each word in `words_to_find` at most once per line
  for word in words_to_find: # `test1`,`test2`,`test3`...
    if word in rx: print(word) # if `test1` is present in this line's list of words...

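The code above applies the (\w+) RegEx to your text string and returns a list of the matches. Here the RegEx matches every run of word characters (letters, digits and underscores), so words separated by spaces or punctuation come back as separate list items.

Useful resources: Debuggex for testing RegExes, and the Python RegEx documentation and RegExr for learning more about regular expressions.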
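To illustrate why testing against the list of matches behaves differently from testing against the raw string, here is a small sketch. The sample sentence and variable names are made up for illustration.

```python
import re

txt = "My sister arrived, test1 passed."

# Every run of word characters becomes one list item.
words = re.findall(r"\w+", txt)
print(words)

# Substring test on the raw string: "is" is found inside "sister".
substring_hit = "is" in txt
# Membership test on the word list: only whole words count.
whole_word_hit = "is" in words

print(substring_hit, whole_word_hit)
```

The substring test succeeds because of "sister", while the list membership test fails, which is the behaviour the question is after.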

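If you are trying to find the words test1, test2 or test3 in a text file, you do not need to increment the line value manually. Assuming the text file contains each word on a separate line, the following code works: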

words_to_find = ("test1","test3")
with open("User_Input.txt","r") as f:
    # enumerate numbers the lines, which stays correct even when a line is repeated
    for line_no, line in enumerate(f, start=1):
        txt = line.strip('\n')
        for word in words_to_find:
            if word in txt:
                print(F"Word: '{word}' found at line {line_no}, pos: {txt.index(word)}")
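If words are defined strictly as groups of characters separated by whitespace, as in the question, str.split() with an exact comparison may be enough and does not depend on each word sitting on its own line. A minimal sketch, using io.StringIO as a stand-in for the opened file; the sample content and the helper name whole_word_hits are assumptions for illustration.

```python
import io

def whole_word_hits(file_obj, words_to_find):
    """Yield (word, line_number, word_index) for exact whitespace-delimited matches."""
    wanted = set(words_to_find)
    for line_no, line in enumerate(file_obj, start=1):
        # split() with no argument splits on any run of whitespace
        for index, token in enumerate(line.split()):
            if token in wanted:
                yield token, line_no, index

fake_file = io.StringIO("test1 test10\nmytest3 test3\n")
for word, line_no, index in whole_word_hits(fake_file, ("test1", "test3")):
    print(f"Word: '{word}' found at line {line_no}, word index: {index}")
```

Here "test10" and "mytest3" are ignored because the comparison is against whole tokens. Punctuation stays attached to a token with this approach, so "test1," would not match "test1".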

I am not sure what the position is supposed to represent.

I think you just need to put spaces in the string argument.
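One way to read this suggestion is to pad both the search word and the line with spaces before using `in`. The sketch below shows that idea with a hypothetical helper, contains_word. It treats only the space character as a separator, so a word followed by punctuation or a newline is missed.

```python
def contains_word(text, word):
    """True if `word` occurs in `text` with a space (or the string edge) on both sides."""
    # Padding both strings lets words at the very start or end of the text match too.
    return f" {word} " in f" {text} "

print(contains_word("my sister is here", "is"))  # standalone "is" is present
print(contains_word("sister", "is"))             # "is" only inside a longer word
print(contains_word("here it is.", "is"))        # "is." has trailing punctuation
```

The first call finds the word, the second does not, and the third also does not because of the trailing full stop, which is the main limitation of this approach compared with a tokenizer or a \b-based RegEx.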

lexical-analysis python