I know how to remove punctuation while keeping apostrophes:
gsub( "[^[:alnum:]']"," ",db$text )
or how to keep intra-word dashes using the tm package:
removePunctuation(db$text,preserve_intra_word_dashes = TRUE)
But I can't find a way to do both at the same time. For example, if my original sentence is:
"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben,Nathan,Jenny,and Adam,y'all are sure to lead the club in a great direction next year! #obama #swag"
I would like it to become:
"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
Of course there will be extra whitespace, but I can remove that later.
I would greatly appreciate your help.
Solution
Use a negated character class that lists everything you want to keep (alphanumerics, the apostrophe and the hyphen) and replace every other character with a space:
gsub("[^[:alnum:]'-]", " ", db$text) ## gives the desired sentence, apart from runs of extra spaces
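For completeness, here is a minimal sketch of the whole clean-up on the example sentence from the question. The variable names `x` and `cleaned` are made up for illustration and stand in for `db$text`; the whitespace step is just one possible way to do the "remove it later" part, assuming base R 3.2.0 or newer for `trimws()`.

```r
# Example sentence from the question (x is a hypothetical stand-in for db$text).
x <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben,Nathan,Jenny,and Adam,y'all are sure to lead the club in a great direction next year! #obama #swag"

# Keep letters, digits, apostrophes and hyphens; turn everything else into a space.
# The hyphen is placed last in the class so it is read literally, not as a range.
cleaned <- gsub("[^[:alnum:]'-]", " ", x)

# Collapse runs of spaces left behind by the removed punctuation, then trim the ends.
cleaned <- trimws(gsub(" +", " ", cleaned))

print(cleaned)
```

Note that this keeps every hyphen and apostrophe, not only those inside words, so a stand-alone " - " would survive; that may or may not matter for your data.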