regex - How to detect sentence boundaries with OpenNLP and stringi?

I want to split the following string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

I want to demonstrate two different approaches. The first comes from the package openNLP:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")  
boundaries_sentences <- annotate(string, sentence_token_annotator)
string[boundaries_sentences]  

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

The second comes from the package stringi:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

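With the second approach I still have to post-process the result: strip the extra whitespace and split the remaining strings into sentences again. Can I tune the stringi function to improve the quality of the result?

On large data, openNLP is (much) slower than stringi.
Is there a way to combine stringi (-> fast) and openNLP (-> quality)?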

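Text boundary analysis (in this case, sentence boundary analysis) in ICU, and therefore in stringi, is governed by the rules described in Unicode UAX29; see also the ICU User Guide on the topic. There we read: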

[The Unicode rules] cannot detect cases such as “…Mr. Jones…”; more sophisticated tailoring would be required to detect such cases.

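In other words, this cannot be done without a custom dictionary of non-breaking words (abbreviations such as "Mr." that must not end a sentence), which is in fact what openNLP implements. A few possible ways of bringing stringi into this task would therefore be: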

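1. Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined back together.
2. Manually insert non-breaking spaces into the text (for example after the dot in "etc.", "Mr.", "i.e." and so on). Note that this is in fact required when preparing documents in LaTeX; otherwise you get spaces between words that are too large.
3. Incorporate a custom list of non-breaking words into a regex and apply stri_split_regex.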

And so on.
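As a rough illustration of the first option, here is a minimal sketch, assuming a small hand-written abbreviation list. The names non_breaking and split_sentences are hypothetical and not part of stringi; the only stringi functions used are stri_split_boundaries, stri_opts_brkiter, stri_extract_last_regex and stri_trim_both. It also trims the surrounding whitespace from each sentence.

```r
library(stringi)

# Hypothetical abbreviation list (an assumption); extend it for your own data.
non_breaking <- c("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.")

# Hypothetical helper: split with ICU, then re-join pieces that end in a
# listed abbreviation, trimming surrounding whitespace from each sentence.
split_sentences <- function(text, abbreviations = non_breaking) {
  pieces <- stri_split_boundaries(
    text, opts_brkiter = stri_opts_brkiter(type = "sentence")
  )[[1]]

  sentences <- character(0)
  buffer <- ""
  for (piece in pieces) {
    buffer <- paste0(buffer, piece)
    # Last whitespace-delimited token accumulated so far.
    last_token <- stri_extract_last_regex(buffer, "\\S+")
    if (is.na(last_token) || last_token %in% abbreviations) {
      next  # false break (or only whitespace so far): keep accumulating
    }
    sentences <- c(sentences, stri_trim_both(buffer))
    buffer <- ""
  }
  # Flush whatever is left if the text ended on an abbreviation.
  if (nzchar(stri_trim_both(buffer))) {
    sentences <- c(sentences, stri_trim_both(buffer))
  }
  sentences
}

split_sentences("Mr. Brown comes. He says hello. i give him coffee.")
```

This only repairs false breaks after the listed abbreviations. It does not add breaks that ICU misses, such as the one before the lowercase "i" in the example, so the result will still differ from the openNLP output there.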
