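Problem description
I want to compare two texts for similarity, so I need a simple function that clearly lists, in order of occurrence, the words and phrases that appear in both texts. These words/sentences should be highlighted or underlined for better visualization.
Building on @joris Meys' idea, I added a vector of separators to split the text into sentences and subordinate clauses.
This is what it looks like: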
textparts <- function (text){
  # separators: split on commas first, then on full stops
  textparts <- c("\\,", "\\.")
  i <- 1
  while(i <= length(textparts)){
    text <- unlist(strsplit(text, textparts[i]))
    i <- i+1
  }
  return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(", commonWords, ")\\>", sep="")
for(x in commonWords){
  textparts1 <- gsub(x, "\\1*", textparts1, ignore.case=TRUE)
  textparts2 <- gsub(x, "\\1*", textparts2, ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works and sometimes it doesn't.
I would like to get a result like this:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
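Instead, I get no result at all.
Solution
@Chase's answer has some problems:
Differences in capitalization are not taken into account.
Punctuation can mess up the results.
If there is more than one common word, you get a lot of warnings because of the gsub call.
Based on his idea, here is a solution that makes use of tolower() and some nice features of regular expressions: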
compareSentences <- function(sentence1, sentence2) {
  # split everything on "not a word" and put all to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
  commonWords <- intersect(x1, x2)
  # add word beginning and ending and put words between ()
  # to allow for match referencing in gsub
  commonWords <- paste("\\<(", commonWords, ")\\>", sep="")
  for(x in commonWords){
    # replace the match by the match with star added
    sentence1 <- gsub(x, "\\1*", sentence1, ignore.case=TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2, ignore.case=TRUE)
  }
  return(list(sentence1, sentence2))
}
This gives the following result:
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1, text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
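The question asks for matching at the level of sentences and clauses rather than single words. A minimal sketch of that variant follows, assuming that clauses are separated only by commas and full stops and that surrounding whitespace can be dropped. The function name mark_common_clauses and its marker argument are hypothetical, not part of either answer.

```r
# Sketch only: mark clauses that occur in both texts (case-insensitive).
# mark_common_clauses and marker are illustrative names, not from the answers.
mark_common_clauses <- function(text1, text2, marker = "*") {
  split_clauses <- function(text) {
    # split on commas and full stops, then drop surrounding whitespace
    parts <- trimws(unlist(strsplit(text, "[,.]")))
    parts[nzchar(parts)]
  }
  clauses1 <- split_clauses(text1)
  clauses2 <- split_clauses(text2)
  # compare on lowercased copies so the original capitalization is kept
  keys1 <- tolower(clauses1)
  keys2 <- tolower(clauses2)
  common <- intersect(keys1, keys2)
  flag <- function(clauses, keys) {
    hit <- keys %in% common
    clauses[hit] <- paste0(clauses[hit], marker)
    clauses
  }
  list(flag(clauses1, keys1), flag(clauses2, keys2))
}

mark_common_clauses(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works."
)
```

Because whole clauses are compared with %in% instead of being pasted into a regular expression, leading spaces and punctuation inside a clause cannot break the match. Unlike the output requested in the question, the clauses come back without their leading space.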
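I'm sure there are far more robust functions on the natural language processing page, but here is one solution that uses intersect() to find the common words. The approach is to read in the two sentences, identify the common words, and gsub() them with a combination of the word and a marker of our choice. Here I chose to use *, but you could easily change that or add something else.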
sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."
compareSentences <- function(sentence1, sentence2) {
  # split both sentences into words on single spaces
  sentence1 <- unlist(strsplit(sentence1, " "))
  sentence2 <- unlist(strsplit(sentence2, " "))
  commonWords <- intersect(sentence1, sentence2)
  # append "*" to the common word and glue the words back together
  return(list(
    sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " "),
    sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
  ))
}
> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."
$sentence2
[1] "Dick Cheney shot* a man."
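When the two sentences share more than one word, gsub() receives a pattern vector, uses only its first element, and issues a warning. One possible way around this, sketched below, is to collapse all common words into a single alternation pattern so that one gsub() call per sentence is enough. The name mark_common_words and the marker argument are hypothetical, and the sketch assumes that words are runs of regex word characters.

```r
# Sketch only: mark every common word with a single gsub() call per sentence.
# mark_common_words and marker are illustrative names, not from the answers.
mark_common_words <- function(sentence1, sentence2, marker = "*") {
  # split on runs of non-word characters and compare in lowercase
  words1 <- tolower(unlist(strsplit(sentence1, "\\W+")))
  words2 <- tolower(unlist(strsplit(sentence2, "\\W+")))
  common <- intersect(words1, words2)
  common <- common[nzchar(common)]  # drop empty tokens left by strsplit
  if (length(common) == 0) {
    return(list(sentence1, sentence2))  # nothing in common, nothing to mark
  }
  # one pattern of the form \b(word1|word2|...)\b
  pattern <- paste0("\\b(", paste(common, collapse = "|"), ")\\b")
  replacement <- paste0("\\1", marker)
  list(
    gsub(pattern, replacement, sentence1, ignore.case = TRUE, perl = TRUE),
    gsub(pattern, replacement, sentence2, ignore.case = TRUE, perl = TRUE)
  )
}

res <- mark_common_words("This is a test. Weather is fine",
                         "This text is a test. This weather is fine. This blabalba This ")
res[[1]]
# [1] "This* is* a* test*. Weather* is* fine*"
```

Since the words come from splitting on non-word characters, they contain no regex metacharacters and can be pasted into the pattern without escaping. A marker that contains a backslash would need extra escaping in the replacement string.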