How do I search a web page for occurrences of a word or phrase?

Problem description

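My end goal is to build a crude plagiarism checker for a given text file. I plan to first split the data into sentences, search Google for each sentence, and finally search each of the first few URLs that Google returns for occurrences of the sentence or a substring of it. That last step is where I'm stuck.

While iterating over each URL in a for loop, I first read the contents of the URL with urlopen() from urllib.request, but I'm not sure what to do after that. The code is attached below, with some of the approaches I tried commented out. I have imported the googlesearch, urllib.request and re libraries.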

def plagCheck():

    global inpFile

    with open(inpFile) as data:
        sentences = data.read().split(".")

    for sentence in sentences:
        for url in search(sentence,tld='com',lang='en',num=5,start=0,stop=5,pause=2.0):
            content = urlopen(url).read()

            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")

            # matches = findall(sentence,content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")

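Solution

If I understand your code correctly, you now have two Python lists of sentences. It looks like you split them on periods only. That produces long run-on sentences whenever the text uses other terminal punctuation (? or !), because a sentence ending in ? or ! stays glued to the one that follows it.

I would consider using a similarity-checking library. difflib has a SequenceMatcher class for this. Then decide on a percentage at which to flag a sentence, for example 40% similar. That reduces the amount of content you have to check by hand.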
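As a rough sketch of that comparison step: urlopen(url).read() returns bytes, so a str sentence cannot be tested against it until the bytes are decoded, and the HTML markup should be dropped before comparing. The helper names below (page_text, best_similarity) are hypothetical, UTF-8 is an assumed encoding, and the sliding-window scan is just one simple way to apply SequenceMatcher to a long page, not the only one.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def page_text(raw, encoding="utf-8"):
    """Decode the bytes from urlopen(url).read() and strip the markup.

    The utf-8 default is an assumption; real pages may declare another charset.
    """
    parser = _TextExtractor()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    # Collapse whitespace so line breaks in the HTML do not affect matching.
    return " ".join(" ".join(parser.parts).split())


def best_similarity(sentence, text):
    """Best SequenceMatcher ratio between the sentence and any window of text."""
    sentence = " ".join(sentence.split()).lower()
    text = text.lower()
    if not sentence:
        return 0.0
    if sentence in text:
        return 1.0  # exact substring match
    width = len(sentence)
    step = max(1, width // 4)  # overlapping windows, a quarter-sentence apart
    best = 0.0
    for start in range(0, max(1, len(text) - width + 1), step):
        window = text[start:start + width]
        best = max(best, SequenceMatcher(None, sentence, window).ratio())
    return best
```

Inside the loop over URLs this would be used as best_similarity(sentence, page_text(content)), flagging the sentence when the score reaches the chosen threshold such as 0.4.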

Expand the set of punctuation marks you split on. That might look like this:

with open(inpFile) as data:
    # Replace every "!" and "?" with "." so a single split handles all three
    sentences = data.read().replace("!", ".").replace("?", ".").split(".")
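The same idea can be sketched with a regular expression, which also makes it easy to drop the empty pieces that splitting leaves behind (for example after the final period). The function name split_sentences is hypothetical, and this is still a crude splitter: abbreviations such as "e.g." will be cut as well.

```python
import re


def split_sentences(text):
    """Split on runs of '.', '!' or '?' and drop empty or whitespace-only pieces."""
    pieces = re.split(r"[.!?]+", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Dropping empty pieces matters here because an empty string would otherwise be sent to Google as a search query.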

Then I would write the results for the file out to a new output file, like this:

# Loop over each sentence and run it through Google.
# Compare each sentence with what was found using difflib's SequenceMatcher (mentioned above).
# Store each hit in a dictionary with the percent, URL, and sentence in question.
# Sample result
results = {
    "sentence_num": 0,
    "percent": 0.8,
    "url": "the google url found on",
    "original_sentence": "Red green fox over the wall",
}
outputStr = "<html>"
# Loop over the results and format each dictionary so that you can read it,
# ideally as an HTML table whose columns are the keys above.
outputStr += "<table>"  # etc.
# Open in write mode, and give the file handle its own name so it does not shadow results.
with open(outputFile, "w") as outFile:
    outFile.write(outputStr)
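To make the table step concrete, here is a minimal sketch that assumes the results are kept as a list of dictionaries with the four keys from the sample above. The names COLUMNS, results_to_html and write_report are hypothetical, and html.escape is used so that text scraped from a page cannot break the markup.

```python
from html import escape

# Column order for the report; these are the keys of the sample result above.
COLUMNS = ("sentence_num", "percent", "url", "original_sentence")


def results_to_html(results):
    """Render a list of result dictionaries as one HTML table."""
    header = "".join(f"<th>{escape(name)}</th>" for name in COLUMNS)
    rows = []
    for result in results:
        cells = "".join(
            f"<td>{escape(str(result[name]))}</td>" for name in COLUMNS
        )
        rows.append(f"<tr>{cells}</tr>")
    return (
        "<html><body><table><tr>" + header + "</tr>"
        + "".join(rows)
        + "</table></body></html>"
    )


def write_report(results, path):
    """Write the rendered table to path, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(results_to_html(results))
```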

You could even highlight the table rows according to the percentage, for example:

80% and above red, 61-79% orange, 40-60% yellow, 39% and below green
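These bands could be mapped to a color with a small helper like the one below. The name row_color is hypothetical, and it assumes the score is a fraction between 0 and 1, as in the sample result. The bands above are stated for whole percentages, so where fractional scores such as 60.5% fall is a choice made here, not something the bands specify.

```python
def row_color(percent):
    """Map a similarity score in the range 0..1 to a highlight color name."""
    if percent >= 0.80:
        return "red"
    if percent > 0.60:
        return "orange"
    if percent >= 0.40:
        return "yellow"
    return "green"
```

The returned name can then go into the row markup, for example as a background-color style on the tr element.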

plagiarism-detection python urllib