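Discard longer dictionary matches that contain a nested target word

Problem description

I am using tokens_lookup to check whether some texts contain words from my dictionary. Now I am trying to find a way to discard the matches that occur when the dictionary word is part of an ordered sequence of words. For example, suppose Ireland is in the dictionary. I would like to exclude the cases where, for instance, Northern Ireland is mentioned (or any fixed phrase that contains Britain). The only indirect solution I have come up with is to build another dictionary from these phrases (e.g. Great Britain). However, this solution would not work when both Britain and Great Britain are mentioned. Thank you.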

library("quanteda")

dict <- dictionary(list(IE = "Ireland"))

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum northern Ireland",
  doc3 = "Ireland lorem ipsum northern Ireland"
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = dict)

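Solution

You can do this by specifying another dictionary key for "Northern Ireland", whose value is also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), the longer phrase is matched first and only once, which separates "Ireland" from "Northern Ireland". By using a key that is identical to its value, you replace like for like (with the side benefit that the two tokens "Northern" and "Ireland" are now combined into a single token).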

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)

toks <- tokens(txt)

tokens_lookup(toks,
  dictionary = dict, exclusive = FALSE,
  nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"    "lorem" "ipsum"
## 
## doc2 :
## [1] "Lorem"            "ipsum"            "Northern Ireland"
## 
## doc3 :
## [1] "IE"               "lorem"            "ipsum"            "Northern Ireland"

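Here I used exclusive = FALSE for illustration, so that you can see what was looked up and replaced. You can drop it, along with the capkeys argument, when you run this yourself.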
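To see why nested_scope matters, it can help to compare the default nested_scope = "key" with "dictionary" on the same input. The following is only a sketch, assuming quanteda 3.x; the object names by_key and by_dict are illustrative and not part of the original answer.

```r
library("quanteda")

dict <- dictionary(list(IE = "Ireland",
                        "Northern Ireland" = "Northern Ireland"))
toks <- tokens(c(doc2 = "Lorem ipsum Northern Ireland"))

# Default scope: longest-match priority applies only within each key,
# so the "Ireland" inside "Northern Ireland" is still matched by the IE key.
by_key <- tokens_lookup(toks, dictionary = dict, nested_scope = "key")

# Dictionary scope: the longer phrase takes priority across keys,
# so the nested "Ireland" is not matched by the IE key.
by_dict <- tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary")

# Inspect the matched keys for the document under each setting
as.list(by_key)$doc2
as.list(by_dict)$doc2
```

With the default scope, the IE key should still appear for this document, which is exactly the unwanted match described in the question.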

If you want to discard the "Northern Ireland" tokens, just use

tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
  tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
## 
## doc2 :
## character(0)
## 
## doc3 :
## [1] "IE"
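
The same pattern should carry over to the "Britain" versus "Great Britain" case raised in the question, including documents that mention both. A minimal sketch, assuming quanteda 3.x; the key name GB, the object names, and the sample sentences are made up for illustration.

```r
library("quanteda")

# Hypothetical dictionary: GB is the target key, and the longer phrase
# gets its own key whose name is identical to its value
dict_gb <- dictionary(list(GB = "Britain",
                           "Great Britain" = "Great Britain"))

txt_gb <- c(
  doc1 = "Britain lorem ipsum",
  doc2 = "Lorem ipsum Great Britain",
  doc3 = "Britain lorem ipsum Great Britain"
)

toks_gb <- tokens(txt_gb)

# Match the longer phrase first across keys, then drop that phrase token
standalone <- tokens_remove(
  tokens_lookup(toks_gb, dictionary = dict_gb, nested_scope = "dictionary"),
  "Great Britain"
)

# Number of standalone "Britain" matches left in each document
ntoken(standalone)
```

Nesting the calls instead of piping avoids relying on %>% being available; the pipe form shown in the answer works the same way when it is.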
