Problem description
I have a Python string and a selected substring of its text. For example, the string could be
stringy = "the bee buzzed loudly"
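I want to select the text "bee buzzed" in this string. I have the character offsets for this particular selection, i.e. 4-14, which are the character-level indices that bound the selected text.
What is the simplest way to convert these to word-level indices, i.e. 1-2, since the second and third words are being selected? I have many strings labeled this way, and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like this: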
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form:
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thanks!
Solution
Here is a simple approach using list indexing:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ")  # get the selected words using the character indices
stringLst = string.split(" ")  # split the whole string into words
dictionary = {"string": "", "start_word": 0, "end_word": 0}
# process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0])  # index of the first selected word
dictionary["end_word"] = stringLst.index(words[-1])  # index of the last selected word
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
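Note that this assumes the selected words are taken in the order in which they appear in the string. Because list.index returns the first match, a word that also occurs earlier in the string would be reported at its earlier position.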
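If repeated words are a concern, one possible variation is to count the words that precede each offset instead of looking the words up by value. The following is only a sketch: the function name char_to_word_span is made up for illustration, and it assumes that words are separated by whitespace and that both offsets fall on word boundaries.

```python
def char_to_word_span(text, start_char, end_char):
    """Map character offsets [start_char, end_char) to 0-based word indices.

    Assumption: words are whitespace-separated and the offsets
    land on word boundaries (not in the middle of a word).
    """
    # Number of complete words before the selection starts.
    start_word = len(text[:start_char].split())
    # Number of words up to the end of the selection, minus one
    # to turn the count into a 0-based index.
    end_word = len(text[:end_char].split()) - 1
    return start_word, end_word


print(char_to_word_span("the bee buzzed loudly", 4, 14))
```

Because the position is derived from the text before the selection, the second "bee" in "the bee saw the bee" maps to word 4 rather than word 1.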
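This looks like a tokenization problem. My solution would be to use a span tokenizer and then search for the substring's span among the token spans. So, using the nltk library: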
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14  # substring begin and end
# indices of the tokens whose spans lie entirely inside the substring
word_idx = [i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
            if b >= sub_b and e <= sub_e]
print(word_idx)  # [1, 2]
This is a bit involved, though. tokenizer.span_tokenize(stringy) returns the (start, end) character span of each token/word it identifies.
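The same span idea can be sketched with only the standard library, producing the dictionary form asked for in the question. This is an illustration under stated assumptions, not the nltk behavior: the helper name convert_record is hypothetical, a "word" is taken to be any run of non-whitespace characters, and a word counts as selected if it overlaps the character range at all.

```python
import re


def convert_record(record):
    """Return a copy of the record with word-level instead of char-level indices.

    Assumptions: words are runs of non-whitespace characters, and
    end_char is exclusive (as in the slice string[4:14]).
    """
    text = record["string"]
    start, end = record["start_char"], record["end_char"]
    # re.finditer yields one match per word; m.start()/m.end() give its span.
    selected = [i for i, m in enumerate(re.finditer(r"\S+", text))
                if m.start() < end and m.end() > start]
    if not selected:
        raise ValueError("the character range does not cover any word")
    return {"string": text,
            "start_word": selected[0],
            "end_word": selected[-1]}


data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
print(convert_record(data))
```

Using an overlap test rather than strict containment means a selection that starts or ends in the middle of a word still maps to that word; whether that is desirable depends on how the offsets were produced.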
Please try this code:
def char_change(dic, start_char, end_char, *arg):
    # arg[0] and arg[1] are the dictionary keys to overwrite
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Default dictionary:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
Input:
Please enter your start character: 1
Please enter your end character: 2
Output dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}