How do I build a Markov model for text?

Problem description

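I'm just starting to learn how to implement Markov models, and I'm trying to write code that automatically predicts the word that follows a given word. I want to use these randomly chosen words to generate a 100-word composition (I hope that makes sense).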

However, my code only returns a 100-word composition made up of one word repeated over and over!

I'm confused. I think I'm missing something crucial, but I can't wrap my head around what it is. I'd appreciate some help.

Here is my code.

from bs4 import BeautifulSoup
from random import randint
from urllib.request import urlopen

#calculating the total sum of the counts in a word dictionary

def summ(wordlist):
    sump=0
    for word,value in wordlist.items():
        sump+=value
    return sump

def random_index(wordlist):
    randomindex=randint(1,summ(wordlist))
    for word,value in wordlist.items():
        randomindex-=value
        if randomindex<=0:
            return word
    
def clean_text(text):
    text=text.replace('\n',' ')
    text=text.replace('"','')

    symbols=['.',',',';',':']
    for symbol in symbols:
        text=text.replace(symbol,' {} '.format(symbol))
    words=text.split(' ')
    words=[word for word in words if len(word) != 0]

    #creating a dictionary of dictionaries: word -> {following word: count}
    wordict={}

    for i in range(1,len(words)):
        if words[i-1] not in wordict:
            wordict[words[i-1]]={}
        if words[i] not in wordict[words[i-1]]:
            wordict[words[i-1]][words[1]]=0
        wordict[words[i-1]][words[1]]+=1
    return wordict

text=str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(),'UTF-8')

wordict=clean_text(text)

length=100
chain=['I']

for i in range(0,length):
    newWord= random_index(wordict[chain[-1]])
    chain.append(newWord)
print(' '.join(chain))

Feel free to ask me any questions about the code.

Solution

Since nobody answered this question, I kept debugging the code for a while and finally found the bug.

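As you can see, the code is meant to pick random words from the text it downloads (the text file at the URL in the code above) and then use those random words to build a random 100-word composition. The relevant part of the code is this: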

def clean_text(text):
    text=text.replace('\n',' ')
    text=text.replace('"','')

    symbols=['.',',',';',':']
    for symbol in symbols:
        text=text.replace(symbol,' {} '.format(symbol))
    words=text.split(' ')
    words=[word for word in words if len(word) != 0]

    #creating a dictionary of dictionaries: word -> {following word: count}
    wordict={}

    for i in range(1,len(words)):
        if words[i-1] not in wordict:
            wordict[words[i-1]]={}
        if words[i] not in wordict[words[i-1]]:
            wordict[words[i-1]][words[1]]=0
        wordict[words[i-1]][words[1]]+=1
    return wordict

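The script works with a dictionary of dictionaries. Every word in the text is added as a key to the dictionary wordict, and the counting is done by the following lines: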

    if words[i-1] not in wordict:
        wordict[words[i-1]]={}
    if words[i] not in wordict[words[i-1]]:
        wordict[words[i-1]][words[1]]=0
    wordict[words[i-1]][words[1]]+=1
return wordict

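These lines are meant to add, for each word in wordict, the word that comes right after it to that word's own inner dictionary. This is how the dictionary of dictionaries is formed.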
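To make the intended structure concrete, here is a minimal sketch that builds the same kind of dictionary of dictionaries on a made-up sentence. The function name `build_transitions` and the sample text are my own assumptions for illustration, and the `zip`-based pairing is an alternative formulation, not the code from the question.

```python
def build_transitions(words):
    """Map each word to a dict counting the words seen right after it."""
    transitions = {}
    # Pair every word with its successor:
    # (words[0], words[1]), (words[1], words[2]), ...
    for prev, nxt in zip(words, words[1:]):
        followers = transitions.setdefault(prev, {})
        followers[nxt] = followers.get(nxt, 0) + 1
    return transitions


# Hypothetical sample text, chosen so that "the" has two different followers.
sample = "the cat sat on the mat and the cat slept".split()
table = build_transitions(sample)

print(table["the"])  # {'cat': 2, 'mat': 1}
print(table["cat"])  # {'sat': 1, 'slept': 1}
```

The outer key is the current word, and the inner dictionary records how often each other word came right after it. The last word of the text ("slept" here) gets no entry of its own unless it also appears earlier.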

The cause of my bug was that I used the integer 1 as the index instead of the letter i. I used the lines:

    if words[i-1] not in wordict:
        wordict[words[i-1]]={}
    if words[i] not in wordict[words[i-1]]:
        wordict[words[i-1]][words[1]]=0
    wordict[words[i-1]][words[1]]+=1
return wordict

instead of the lines:

    if words[i-1] not in wordict:
        wordict[words[i-1]]={}
    if words[i] not in wordict[words[i-1]]:
        wordict[words[i-1]][words[i]]=0
    wordict[words[i-1]][words[i]]+=1
return wordict

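to create the dictionary of dictionaries (note the difference between 1 and i). With words[1], every inner dictionary only ever gets the second word of the text as its key, which is why the output was a single word repeated.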
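The effect of that one-character difference is easy to see on a tiny input. The sketch below runs both versions of the loop side by side; the function names `count_buggy` and `count_fixed` and the sample sentence are assumptions made for illustration.

```python
def count_buggy(words):
    wordict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordict:
            wordict[words[i-1]] = {}
        if words[i] not in wordict[words[i-1]]:
            wordict[words[i-1]][words[1]] = 0   # bug: index 1, not i
        wordict[words[i-1]][words[1]] += 1      # bug: index 1, not i
    return wordict


def count_fixed(words):
    wordict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordict:
            wordict[words[i-1]] = {}
        if words[i] not in wordict[words[i-1]]:
            wordict[words[i-1]][words[i]] = 0
        wordict[words[i-1]][words[i]] += 1
    return wordict


# Hypothetical sample; words[1] is "have".
sample = "I have a plan and I have a dream".split()
buggy = count_buggy(sample)
fixed = count_fixed(sample)

# In the buggy table, the only possible follower of any word is words[1].
print({word: list(followers) for word, followers in buggy.items()})
# In the fixed table, each word lists the words that really came after it.
print(fixed["a"])  # {'plan': 1, 'dream': 1}
```

Because every inner dictionary in the buggy table has "have" as its only key, any random pick from it returns "have", no matter how the random index is computed.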

If you need any further clarification, feel free to leave a comment.
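For anyone who wants to try the generation step without downloading the speech, here is a minimal sketch of walking a corrected table to build a chain. It swaps the hand-written cumulative-count loop from the question for the standard library's `random.choices`, which does the same weighted pick. The names `next_word` and `generate`, the small hand-written table, and the dead-end check are my own assumptions, not part of the original code.

```python
import random


def next_word(followers, rng):
    """Pick one follower at random, weighted by how often it was seen."""
    candidates = list(followers)
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(transitions, start, length, rng):
    """Build a chain of up to length + 1 words, starting from start."""
    chain = [start]
    for _ in range(length):
        followers = transitions.get(chain[-1])
        if not followers:
            # Assumption: stop early if the current word was never followed
            # by anything (for example, the last word of the text).
            break
        chain.append(next_word(followers, rng))
    return chain


# Hypothetical table of the shape the corrected loop produces.
transitions = {
    "I": {"have": 2},
    "have": {"a": 2},
    "a": {"plan": 1, "dream": 1},
    "plan": {"and": 1},
    "and": {"I": 1},
}

rng = random.Random(0)  # seeded only so repeated runs give the same chain
chain = generate(transitions, "I", 10, rng)
print(" ".join(chain))
```

The early exit matters because a plain `wordict[chain[-1]]` lookup raises a KeyError if the chain happens to reach a word that has no entry in the table.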
