Problem description
How do you modify the default spaCy (v3.0.5) tokenizer to correctly split English contractions when unicode apostrophes (rather than the ASCII ') are used?
import spacy

nlp = spacy.load('en_core_web_sm')

# Tokenize "don't" with the ASCII apostrophe and nine unicode look-alikes
apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
>>>
[do, n't]
[donʹt]
[donʻt]
[donʼt]
[donʽt]
[donˈt]
[donˊt]
[donˋt]
[don`t]
[don´t]
The desired output for every example is [do, n't].
My best guess was to extend the default tokenizer_exceptions with all possible apostrophe variants. But this doesn't work, because tokenizer special cases are not allowed to modify the text:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')
apostrophes = ['\u02B9', '\u00B4']

# Copy the default exceptions and duplicate every rule that contains an
# ASCII apostrophe, once per unicode variant
default_rules = nlp.Defaults.tokenizer_exceptions
extended_rules = default_rules.copy()
for key, val in default_rules.items():
    if "'" in key:
        for apo in apostrophes:
            extended_rules[key.replace("'", apo)] = val

# Rebuild the tokenizer with the extended rule set
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=extended_rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

apostrophes = ["'", '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
apostrophes = ["'",'\u00B4']
for apo in apostrophes:
text = f"don{apo}t"
print([t for t in nlp(text)])
>>> ValueError: [E997] Tokenizer special cases are not allowed to modify the text. This would map ':`(' to ':'(' given token attributes '[{65: ":'("}]'.
Solution
You just need to add a special case that does not change the text.
import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load('en_core_web_sm')

# ORTH keeps the original characters, so the text is unchanged;
# NORM records the normalized form of the token
case = [{ORTH: "do"}, {ORTH: "n`t", NORM: "not"}]
nlp.tokenizer.add_special_case("don`t", case)

doc = nlp("I don`t believe in bugs")
print(list(doc))
# => [I, do, n`t, believe, in, bugs]
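To cover all apostrophe variants without spelling out each contraction by hand, you can generate such special cases from the default exceptions. The sketch below is one possible approach, assuming the entries of nlp.Defaults.tokenizer_exceptions are lists of ORTH/NORM attribute dicts (as the E997 message above indicates): it swaps the apostrophe in both the matched string and each token's ORTH, so the pieces still concatenate to the input text and E997 is avoided.

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')
apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

for text_key, token_attrs in nlp.Defaults.tokenizer_exceptions.items():
    if "'" not in text_key:
        continue
    for apo in apostrophes:
        # Replace the apostrophe in the matched text AND in every token's
        # ORTH, so the special case reproduces the input exactly
        case = []
        for attrs in token_attrs:
            attrs = dict(attrs)
            if ORTH in attrs:
                attrs[ORTH] = attrs[ORTH].replace("'", apo)
            case.append(attrs)
        nlp.tokenizer.add_special_case(text_key.replace("'", apo), case)

print(list(nlp('don\u02BCt')))  # expected: [do, nʼt]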
If you want to actually change the text, you should do it as a preprocessing step.
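For example, a minimal preprocessing sketch (the normalize_apostrophes helper and its character class are illustrative, not part of spaCy) that maps the look-alike characters to the ASCII apostrophe before the text reaches the tokenizer:

import re
import spacy

nlp = spacy.load('en_core_web_sm')

# Illustrative: collapse unicode apostrophe look-alikes to the ASCII '
APOSTROPHE_RE = re.compile('[\u02B9\u02BB\u02BC\u02BD\u02C8\u02CA\u02CB\u0060\u00B4]')

def normalize_apostrophes(text):
    return APOSTROPHE_RE.sub("'", text)

doc = nlp(normalize_apostrophes('don\u02BCt'))
print(list(doc))  # => [do, n't]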