spaCy's default English tokenizer changes its behavior when re-assigned

Problem description

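When you assign a tokenizer to the spaCy (v3.0.5) English language model en_core_web_sm, even what looks like its own default tokenizer, the model's behavior changes.

You would expect no change at all, yet tokenization silently fails. Why is that?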

Code to reproduce

import spacy

text = "don't you're i'm we're he's"

# No tokenizer assignment: everything is fine
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([t.lemma_ for t in doc])
# Output: ['do', "n't", 'you', 'be', 'I', 'we', 'he', 'be']

# "Default" Tokenizer assignment: tokenization, and therefore lemmatization, fails
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
doc = nlp(text)
print([t.lemma_ for t in doc])
# Output: ["don't", "you're", "i'm", "we're", "he's"]

Solution

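A Tokenizer built from the vocab alone has no exception rules and no prefix, suffix, or infix patterns, so it only splits on whitespace. To create a true default tokenizer, you have to pass all of the defaults to the Tokenizer class, not just the vocab: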

from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

# Language defaults: special-case rules plus compiled prefix/suffix/infix patterns
rules = nlp.Defaults.tokenizer_exceptions
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

tokenizer = spacy.tokenizer.Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

# Assign the fully configured tokenizer back to the pipeline
nlp.tokenizer = tokenizer
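
To check the difference directly, the following sketch compares three tokenizers on the text from the question: the pipeline's own, a bare Tokenizer(nlp.vocab), and one rebuilt from the language defaults. It is a minimal illustration rather than a definitive test. It assumes spacy.blank("en") in place of en_core_web_sm so that no model download is needed (the tokenizer comes from the English language defaults either way), and the variable names are made up for this example.

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

# Assumption: a blank English pipeline stands in for en_core_web_sm here.
nlp = spacy.blank("en")
text = "don't you're i'm we're he's"

# 1. The tokenizer the pipeline ships with.
default_tokens = [t.text for t in nlp.tokenizer(text)]

# 2. A tokenizer built from the vocab only: no rules, no affix patterns.
bare = Tokenizer(nlp.vocab)
bare_tokens = [t.text for t in bare(text)]

# 3. A tokenizer rebuilt from the language defaults.
rebuilt = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
)
rebuilt_tokens = [t.text for t in rebuilt(text)]

print("default:", default_tokens)
print("bare:   ", bare_tokens)
print("rebuilt:", rebuilt_tokens)
print("rebuilt matches default:", rebuilt_tokens == default_tokens)
```

Note that spaCy's own default tokenizer also sets token_match and url_match, which this sketch leaves out, so results may still differ on text that contains URLs.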
