Removing digits stuck to words in a quanteda object of class tokens

Problem description

A related question can be found here, but it does not directly solve the issue I discuss below.

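My goal is to remove any digits that appear inside a token. For example, I would like to get rid of the digits in cases such as 13f, 408-k, 10-k, etc. I use quanteda as my main tool. I have a classic corpus object, which I tokenized with the function tokens(). The argument remove_numbers = TRUE does not seem to work in these cases: it simply skips such tokens and leaves them where they are. If I use tokens_remove() with a suitable regex, the tokens are removed entirely, which I want to avoid because I am interested in the remaining textual content.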

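Below is a small reprex showing how I worked around the problem with the function str_remove_all() from stringr. It works, but it can be slow for large objects.

My question is: is there a way to get the same result without leaving quanteda (e.g., while working on an object of class tokens)?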

library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(stringr)

mytext = c( "This is a sentence with correctly spaced digits like K 16.","This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext,remove_punct = TRUE,remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "123asd"     
#> [11] "and"         "well101"

# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
# 
mytokens_wrong = tokens_remove( mytokens,pattern = "[[:digit:]]",valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "and"

# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
# 
mytokens_ok = lapply( mytokens,function(x) str_remove_all( x,"[[:digit:]]" ) )
mytokens_ok
#> $text1
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> $text2
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
#> [11] "and"         "well"

Created on 2021-02-15 by the reprex package (v0.3.0)

Solution

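The other answer is a clever use of tokens_split(), but it will not always work if you want to remove digits from the middle of a word, since it splits an original word containing inner digits into two.

Here is an efficient way to remove the digit characters from the types (the unique tokens/words):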
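To make the caveat about inner digits concrete, here is a minimal sketch comparing the two approaches on a single token. The sample sentence and the token "mid42dle" are made up for illustration and do not come from the question; the replacement step uses the same type-replacement idea that this answer describes.

```r
library(quanteda)

# hypothetical input: one token with digits in the middle of the word
toks <- tokens("A token such as mid42dle has digits inside it")

# tokens_split() cuts the token at the digits, so the word ends up in pieces
split_result <- tokens_split(toks, separator = "[[:digit:]]", valuetype = "regex")

# replacing the type strips the digits but keeps the word as one token
typesnum <- grep("[[:digit:]]", types(toks), value = TRUE)
replace_result <- tokens_replace(toks, typesnum, gsub("[[:digit:]]", "", typesnum))

split_result
replace_result
```

Comparing ntoken(split_result) with ntoken(replace_result) shows whether the word was broken apart.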

library("quanteda")
## Package version: 2.1.2

mytext <- c(
  "This is a sentence with correctly spaced digits like K 16.","This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext,remove_punct = TRUE,remove_numbers = TRUE)

# get all types with digits
typesnum <- grep("[[:digit:]]",types(toks),value = TRUE)
typesnum
## [1] "123asd"  "well101"

# replace the types with types without digits
tokens_replace(toks,typesnum,gsub("[[:digit:]]","",typesnum))
## Tokens consisting of 2 documents.
## text1 :
##  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
##  [7] "spaced"    "digits"    "like"      "K"        
## 
## text2 :
##  [1] "This"        "is"          "a"           "sentence"    "with"       
##  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
## [11] "and"         "well"

Note that I would normally recommend stringi for all regex operations, but base package functions are used here for simplicity.

Created on 2021-02-15 by the reprex package (v1.0.0)
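For readers who want to follow the stringi recommendation, the same type-replacement idea might look like the sketch below. It was not part of the original reprex: the names typesclean and toks_clean are introduced here for illustration, and valuetype = "fixed" with case_insensitive = FALSE is an added choice so that each type is matched literally.

```r
library(quanteda)
library(stringi)

mytext <- c(
  "This is a sentence with correctly spaced digits like K 16.",
  "This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)

# types (unique tokens) that contain at least one digit
typesnum <- stri_subset_regex(types(toks), "[[:digit:]]")

# the same types with every digit stripped out
typesclean <- stri_replace_all_regex(typesnum, "[[:digit:]]", "")

# swap each digit-bearing type for its cleaned form, matching literally
toks_clean <- tokens_replace(toks, typesnum, typesclean,
                             valuetype = "fixed", case_insensitive = FALSE)
toks_clean
```

Because the regex work is done once per type rather than once per token, the cost grows with the size of the vocabulary, not the size of the corpus.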


In this case you can (ab)use tokens_split(). You split the tokens on the digits, and by default tokens_split() removes the separator. That way you can do everything within quanteda.

library(quanteda)

mytext = c( "This is a sentence with correctly spaced digits like K 16.","This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext,remove_punct = TRUE,remove_numbers = TRUE)

tokens_split(mytokens,separator = "[[:digit:]]",valuetype = "regex")
Tokens consisting of 2 documents.
text1 :
 [1] "This"      "is"        "a"         "sentence"  "with"      "correctly" "spaced"    "digits"    "like"     
[10] "K"        

text2 :
 [1] "This"        "is"          "a"           "sentence"    "with"        "uncorrectly" "spaced"      "digits"     
 [9] "like"        "asd"         "and"         "well"       
