Problem description
I'm trying to write code that builds a dictionary with words as keys and, as values, the POS tags that word occurs with plus the corresponding counts. The end goal is to find the most common POS tag for a given word.
An example:
most_common({"NOUN": 2, "DET": 5, "ADP": 1}) returns "DET"
because the given word occurs most often as a determiner.
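A minimal sketch of such a helper (assuming it takes the tag-count dictionary shown in the example above):

```python
def most_common(tag_counts: dict) -> str:
    # Return the POS tag with the highest count;
    # max with key=tag_counts.get compares the counts, not the tag names.
    return max(tag_counts, key=tag_counts.get)

print(most_common({"NOUN": 2, "DET": 5, "ADP": 1}))  # → DET
```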
First, I want to train my code on a small annotated corpus. Here's what I have so far:
import pprint

trainfile = open("small_train.connlu")
list_of_lists = []
for line in trainfile:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    list_of_lists.append(line_list)
list_keys = []   # a list that will contain all the keys (including duplicates)
list_values = [] # a list that will contain all the values
for line in list_of_lists:
    if line == []:
        pass
    elif line != []:
        list_keys.append(line[1])   # second column of the file contains all words
        list_values.append(line[3]) # fourth column of the file contains all POS tags (see below)
list_keys = [key.lower() for key in list_keys] # lowercase all keys - 'The' and 'the' should be assigned the same POS
This is where I'm stuck. I now need to create a dictionary containing every word that appears in the corpus, followed by the POS tags it occurs with (and the number of times each word occurs with a given POS tag). This is the closest I've gotten:
dict = {}
for key in range(len(list_keys)):
    dict[list_keys[key]] = list_values[key]
pprint.pprint(dict)
This returns the keys with the correct POS tags, but I don't know how to implement the counting. Everything I've tried has resulted in errors.
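One standard-library way to get those counts (a sketch using `collections`, not the asker's code) is to map each word to a `Counter` of its tags; the sample lists below are hypothetical stand-ins for `list_keys` / `list_values`:

```python
from collections import defaultdict, Counter

# Hypothetical parallel lists, in the same shape as list_keys / list_values above.
list_keys = ["the", "hottest", "the", "the"]
list_values = ["DET", "ADJ", "NOUN", "DET"]

# word -> Counter of POS tags seen with that word
tag_counts = defaultdict(Counter)
for word, tag in zip(list_keys, list_values):
    tag_counts[word][tag] += 1

print(dict(tag_counts["the"]))  # → {'DET': 2, 'NOUN': 1}
```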
This is the format of the training data (small_train.connlu):
1 The _ DET _ _ _ _ _ _
2 hottest _ ADJ _ _ _ _ _ _
3 item _ NOUN _ _ _ _ _ _
4 on _ ADP _ _ _ _ _ _
5 Christmas _ PROPN _ _ _ _ _ _
6 wish _ NOUN _ _ _ _ _ _
7 lists _ NOUN _ _ _ _ _ _
8 this _ DET _ _ _ _ _ _
9 year _ NOUN _ _ _ _ _ _
10 is _ AUX _ _ _ _ _ _
11 nuclear _ ADJ _ _ _ _ _ _
12 weapons _ NOUN _ _ _ _ _ _
13 . _ PUNCT _ _ _ _ _ _
1 I _ PRON _ _ _ _ _ _
2 wish _ VERB _ _ _ _ _ _
3 you _ PRON _ _ _ _ _ _
4 all _ DET _ _ _ _ _ _
5 of _ ADP _ _ _ _ _ _
6 the _ DET _ _ _ _ _ _
7 best _ ADJ _ _ _ _ _ _
I'd be grateful if anyone could help. Thank you very much :)
Solution
Assuming your list_of_lists is formatted like this:
list_of_lists = [["1", "The", "_", "DET"], ["2", "hottest", "_", "ADJ"], ["3", "the", "_", "NOUN"], ["4", "the", "_", "DET"]]
You can create a dictionary where each key is a word and each value is another dictionary, whose keys are POS tags and whose values are the frequencies of those tags for that word.
# A dictionary of dictionaries
word_map = {}
# Iterate through the list of lists that has been prepared after parsing the file.
for line in list_of_lists:
    # Convert the word to lowercase
    word = line[1].lower()
    pos_tag = line[3]
    # If we have seen the word before.
    if word in word_map:
        # Get the inner dictionary for this word, which contains the frequency of the pos tags for this word
        pos_tags_for_current_word = word_map[word]
        # If this pos tag has already been seen for this word, increment the frequency by 1
        if pos_tag in pos_tags_for_current_word:
            pos_tags_for_current_word[pos_tag] += 1
        # If the pos tag is being seen for the first time, add a new entry to the dictionary with frequency 1
        else:
            pos_tags_for_current_word[pos_tag] = 1
    # If the word is being seen for the first time, add a new entry for the word whose value is a dictionary holding the current pos tag with frequency 1.
    else:
        word_map[word] = {pos_tag: 1}
Printing word_map gives us:
{'hottest': {'ADJ': 1}, 'the': {'DET': 2, 'NOUN': 1}}
Now iterate over this dictionary to find the most frequent pos tag for each word.
for word in word_map:
    print(word, max(word_map[word].items(), key=lambda entry: entry[1]))
This is the output of the for loop above:
the ('DET', 2)
hottest ('ADJ', 1)
This is just a shorthand way of finding the entry with the largest value in any dictionary:
max(word_map[word].items(), key=lambda entry: entry[1])
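For instance, applied to a small stand-alone dictionary:

```python
counts = {'DET': 2, 'NOUN': 1}
# max over the (key, value) pairs, compared by the value
best = max(counts.items(), key=lambda entry: entry[1])
print(best)  # → ('DET', 2)
```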
Here is a complete, more functional approach that parses the file into a format you can do other operations on (a list of dicts) and finds the most common word type:
def parse_connlu(lines: list) -> list:
    # return a list of dicts built from the connlu file
    items = [[item for item in line.split() if item != '']
             for line in lines
             if line != '']
    return [{'pos': int(item[0]), 'word': item[1].lower(), 'type': item[3]}
            for item in items]

def read_connlu(path: str) -> list:
    # read the connlu file (returns a list of dicts)
    with open(path, 'r') as file:
        lines = file.read().splitlines()
    return parse_connlu(lines)
def count(items: list) -> dict:
    # count unique elements in the list
    unique = set(items)
    return {i: items.count(i) for i in unique}

def most_common(items: list):
    # get the most common item
    counts = count(items)
    # e.g. {'NOUN': 5, 'PROPN': 1, 'VERB': 1, 'ADP': 2, 'DET': 4, 'PRON': 2, 'AUX': 1, 'PUNCT': 1, 'ADJ': 3}
    return max(counts, key=counts.get)
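As an aside (a standard-library alternative, not part of the answer above), collections.Counter covers both the count and most_common steps in one go:

```python
from collections import Counter

tags = ['NOUN', 'DET', 'DET', 'ADJ', 'DET']
counts = Counter(tags)              # Counter({'DET': 3, 'NOUN': 1, 'ADJ': 1})
print(counts.most_common(1)[0][0])  # → DET
```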
words = read_connlu('./small_train.connlu')
# gives us:
# [{'pos': 1, 'type': 'DET', 'word': 'the'},
#  {'pos': 2, 'type': 'ADJ', 'word': 'hottest'},
#  ...]

answer = most_common([i['type'] for i in words])
print(answer)
# gives us: NOUN
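To get back to the original per-word question with this list-of-dicts format, one possible sketch (the sample data here is hypothetical, in the same shape read_connlu returns) groups the tags by word and takes the most frequent one for each:

```python
from collections import defaultdict, Counter

# Hypothetical sample in the same shape read_connlu returns.
words = [{'pos': 1, 'word': 'the',  'type': 'DET'},
         {'pos': 2, 'word': 'wish', 'type': 'NOUN'},
         {'pos': 3, 'word': 'the',  'type': 'DET'},
         {'pos': 4, 'word': 'wish', 'type': 'VERB'},
         {'pos': 5, 'word': 'wish', 'type': 'NOUN'},
         {'pos': 6, 'word': 'the',  'type': 'NOUN'}]

# Group all tags seen for each word.
by_word = defaultdict(list)
for entry in words:
    by_word[entry['word']].append(entry['type'])

# Most frequent tag per word.
most_common_tag = {w: Counter(tags).most_common(1)[0][0]
                   for w, tags in by_word.items()}
print(most_common_tag)  # → {'the': 'DET', 'wish': 'NOUN'}
```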