ruby＆nlp：如何删除停用词和非单词例如链接，表情符号并计算uniq单词？

问题描述

一些样本数据

NEW LISTING: @AaveAave AAVE ($LEND) will SOON be listed on @OKEx! Quiz & Net Buy ? up to 3,000 $LEND: ▶️ Follow us + @AaveAave ▶️ Join Quiz: https://t.co/{.....} ▶️ RT answer & @OKEx #OKExDeFi #OKExAave ▶️ Deposit + Net Buying rebate Listing details: https://t.co/{.....} hhttps://t.co/{.....}

? ? ? ? ? ? ? ? ? ? 100,000,000 #USDT (100,568,399 USD) transferred from Tether Treasury to #Binance Tx: https://t.co/{.....}

我打算做什么

删除所有停用词
删除所有非单词（链接和表情符号）
计算单词编号和唯一单词编号

我的问题：我知道如何在R中执行此操作（使用tidytext），但是在Ruby中执行上述操作的最佳实践是什么？我四处寻觅，但不知道任何与流行相关的宝石。

感谢您的帮助

如果为tidytext，则可以完成如下所示的操作

解决方法

采用增量方法对输入进行消毒

您的问题很重要，因为您尚未真正以编程方式定义“ stop words”或“非单词”，也没有这样的通用列表。甚至没有100％准确的方法来检测所有有效的web URIs，使用其他方案的URI则要少得多，尽管有一些“比大多数方法更好和更好”的方法。

但是，作为滚动自己的基本起点：

stop_words = /\b(a|an|the|and|or)\b/
non_words  = /\s+(#\p{ASCII}|https?:|@).*?\s+/

str = <<~'EOF'
  NEW LISTING: @AaveAave AAVE ($LEND) will SOON be listed on @OKEx! Quiz &amp; Net Buy ? up to 3,000 $LEND: ▶️ Follow us + @AaveAave ▶️ Join Quiz: https://t.co/{.....} ▶️ RT answer &amp; @OKEx #OKExDeFi #OKExAave ▶️ Deposit + Net Buying rebate Listing details: https://t.co/{.....} hhttps://t.co/{.....}

  ? ? ? ? ? ? ? ? ? ? 100,000,000 #USDT (100,568,399 USD) transferred from Tether Treasury to #Binance Tx: https://t.co/{.....} 
EOF

str.gsub! stop_words,""
str.gsub! non_words,""
str.gsub! /[[:^ascii:]]/,""
str.strip!
str.squeeze!

p "Chars: #{str.length}"
#=> "Chars: 246"

p "Words: #{str.split.uniq.count}"
#=> "Words: 35"

由于输入内容尚未标准化，因此您还需要做进一步的清理。尽管如此，这仍将为您提供一个良好的起点和一种合理的方法，以找到您自己的“足够好”的解决方案。

nlp ruby ruby ruby