Python POS tagging: return a word's most common part of speech after training

Problem description

I am trying to write code that builds a dictionary whose keys are words and whose values are the POS tags that word appears with, together with the corresponding counts. The end goal is to find the most common POS tag for a given word.

An example: most_common({"NOUN": 2, "DET": 5, "ADP": 1}) returns "DET", because the given word occurs most often as a determiner.

First, I want to train my code on a small annotated corpus. This is what I have so far:

import pprint

trainfile = open("small_train.connlu")
list_of_lists = []
for line in trainfile:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    list_of_lists.append(line_list)

list_keys = [] #a list that will contain all the keys (including duplicates)
list_values = [] #a list that will contain all the values

for line in list_of_lists:
    if line != []: # skip the empty lines that separate sentences
        list_keys.append(line[1]) # second column of the file contains all words
        list_values.append(line[3]) # fourth column of the file contains all POS tags (see below)

list_keys = [key.lower() for key in list_keys] #lowercase all keys - 'The' and 'the' should be assigned the same POS

This is where I am stuck. I now need to create a dictionary containing every word that appears in the corpus, followed by the POS tags it occurs with (and the number of times each word occurs with a given POS tag). This is the closest I have gotten:

dict = {}

for key in range(len(list_keys)):
    dict[list_keys[key]] = list_values[key]

pprint.pprint(dict)

This returns the keys with a correct POS tag; however, I don't know how to implement the counting. Everything I have tried has resulted in an error.

This is the format of the training data (small_train.connlu):

1   The       _   DET   _ _ _ _ _ _
2   hottest   _   ADJ   _ _ _ _ _ _
3   item      _   NOUN  _ _ _ _ _ _
4   on        _   ADP   _ _ _ _ _ _
5   Christmas _   PROPN _ _ _ _ _ _
6   wish      _   NOUN  _ _ _ _ _ _
7   lists     _   NOUN  _ _ _ _ _ _
8   this      _   DET   _ _ _ _ _ _
9   year      _   NOUN  _ _ _ _ _ _
10  is        _   AUX   _ _ _ _ _ _
11  nuclear   _   ADJ   _ _ _ _ _ _
12  weapons   _   NOUN  _ _ _ _ _ _
13  .         _   PUNCT _ _ _ _ _ _

1   I         _   PRON  _ _ _ _ _ _
2   wish      _   VERB  _ _ _ _ _ _
3   you       _   PRON  _ _ _ _ _ _
4   all       _   DET   _ _ _ _ _ _
5   of        _   ADP   _ _ _ _ _ _
6   the       _   DET   _ _ _ _ _ _
7   best      _   ADJ   _ _ _ _ _ _

I would be grateful if anyone could help. Thanks a lot :)

Solution

Assuming your list_of_lists looks like this:

list_of_lists = [["", "The", "", "DET"], ["", "hottest", "", "ADJ"],
                 ["", "the", "", "NOUN"], ["", "the", "", "DET"]]

You can create a dictionary whose keys are words and whose values are inner dictionaries mapping each POS tag to the frequency with which it occurs for that word.

# A dictionary of dictionaries
word_map = {}

# Iterate through the list of lists that has been prepared after parsing the file.
for line in list_of_lists:
    # Convert the word to lowercase
    word = line[1].lower()
    pos_tag = line[3]

    # If we have seen the word before.
    if word in word_map:
        # Get the inner dictionary for this word, which contains the frequency of the POS tags for this word
        pos_tags_for_current_word = word_map[word]

        # If this POS tag has already been seen for this word, increment its frequency by 1
        if pos_tag in pos_tags_for_current_word:
            pos_tags_for_current_word[pos_tag] += 1
        # If the POS tag is seen for the first time, add a new entry with a frequency of 1
        else:
            pos_tags_for_current_word[pos_tag] = 1

    # If the word is seen for the first time, add a new entry whose value is a dictionary containing the current POS tag with frequency 1.
    else:
        word_map[word] = {pos_tag: 1}

Printing word_map gives us:

{'hottest': {'ADJ': 1}, 'the': {'DET': 2, 'NOUN': 1}}
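The same nested counts can be built more compactly with the standard library's collections module. This is an alternative sketch, not part of the original answer, using hypothetical toy data in the same (word, POS) shape:

```python
from collections import defaultdict, Counter

# toy data: (word, POS) pairs as they would come out of the parsed file
pairs = [("The", "DET"), ("hottest", "ADJ"), ("the", "NOUN"), ("the", "DET")]

# defaultdict(Counter) creates the inner counter automatically on first access,
# so no "seen before?" branching is needed
word_map = defaultdict(Counter)
for word, pos_tag in pairs:
    word_map[word.lower()][pos_tag] += 1

print(dict(word_map))
# {'the': Counter({'DET': 2, 'NOUN': 1}), 'hottest': Counter({'ADJ': 1})}

# Counter.most_common(1) gives the top tag directly
print(word_map["the"].most_common(1)[0][0])  # DET
```

Counter also handles ties and sorting for you, which keeps the per-word lookup to a single method call.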

Now iterate over this dictionary to find the most frequent POS tag for each word:

for word in word_map:
    print(word, max(word_map[word].items(), key=lambda entry: entry[1]))

This is the output of the for loop above:

the ('DET', 2)
hottest ('ADJ', 1)

This is just a shorthand way of finding the entry with the maximum value in any dictionary:

max(word_map[word].items(), key=lambda entry: entry[1])
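As a standalone illustration, the same pattern also answers the most_common example from the question (reusing the count dictionary given there as toy data):

```python
scores = {"NOUN": 2, "DET": 5, "ADP": 1}

# max over (key, value) pairs, compared by value
best = max(scores.items(), key=lambda entry: entry[1])
print(best)     # ('DET', 5)
print(best[0])  # DET

# an equivalent shorthand when only the key is needed
print(max(scores, key=scores.get))  # DET
```

The key= argument tells max how to rank the items; without it, tuples would be compared by their first element (the tag name) instead of the count.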

Here is a complete, more general approach that parses the file into a format you can do other things with (a list of dicts) and finds the most common word type:

def parse_connlu(lines: list) -> list:
    # return a list of dicts for the connlu file
    items = [[item for item in line.split()
              if item != '']
             for line in lines
             if line != '']
    return [{'pos': int(item[0]), 'word': item[1].lower(), 'type': item[3]}
            for item in items]

def read_connlu(path: str) -> list:
    # read the connlu file (will return list[dict])
    with open(path, 'r') as file:
        lines = file.read().splitlines()
        return parse_connlu(lines)

def count(items: list) -> dict:
    # count unique elements in a list
    unique = set(items)
    return {i: items.count(i) for i in unique}

def most_common(items: list):
    # get the most common item
    counts = count(items)
    # {'NOUN': 5, 'PROPN': 1, 'VERB': 1, 'ADP': 2, 'DET': 4, 'PRON': 2, 'AUX': 1, 'PUNCT': 1, 'ADJ': 3}
    return max(counts, key=counts.get)

words = read_connlu('./small_train.connlu')
# gives us:
# [{'pos': 1, 'type': 'DET', 'word': 'the'},
#  {'pos': 2, 'type': 'ADJ', 'word': 'hottest'},
#  ...]

answer = most_common([i['type'] for i in words])
print(answer)
# gives us: NOUN
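Note that most_common over all tags gives the most frequent tag in the whole corpus, while the original question asks for the most frequent tag per word. The helpers above can be combined for that by grouping tags per word first. A sketch, with most_common redefined via Counter and hypothetical inline data in the same shape read_connlu produces, so the snippet runs on its own:

```python
from collections import Counter, defaultdict

def most_common(items: list) -> str:
    # most frequent element of a list (same behaviour as the helper above)
    return Counter(items).most_common(1)[0][0]

# hypothetical parsed output, same shape as read_connlu() returns
words = [
    {'pos': 1, 'word': 'the', 'type': 'DET'},
    {'pos': 2, 'word': 'wish', 'type': 'NOUN'},
    {'pos': 3, 'word': 'wish', 'type': 'VERB'},
    {'pos': 4, 'word': 'wish', 'type': 'VERB'},
]

# group the observed tags per word
tags_by_word = defaultdict(list)
for entry in words:
    tags_by_word[entry['word']].append(entry['type'])

# then take the most common tag of each group
per_word = {word: most_common(tags) for word, tags in tags_by_word.items()}
print(per_word)  # {'the': 'DET', 'wish': 'VERB'}
```

Replacing the inline words list with the real read_connlu('./small_train.connlu') output gives the per-word answer the question asked for.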