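Problem description
How can I split English words into characters while keeping digraphs intact (for example "ch", "th", "gh")?
For example, given the string "that", I want to split it into "th", "a", "t" rather than "t", "h", "a", "t".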
Solution
Here is a function f that can help with the splitting:
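# Digraphs that must stay together
dg <- c("ch", "th", "gh", "ai")
v <- c("thanks", "chain", "banana", "that", "rain")
# Vectorize() lets f take a character vector and return one result per word
f <- Vectorize(function(s) {
  res <- c()
  while (nchar(s)) {
    # Take 2 characters if s starts with a digraph, otherwise 1
    k <- ifelse(substr(s, 1, 2) %in% dg, 2, 1)
    res <- c(res, substr(s, 1, k))
    # Drop the piece just taken and continue with the remainder
    s <- substr(s, k + 1, nchar(s))
  }
  res
})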
Running it gives:
> f(v)
$thanks
[1] "th" "a" "n" "k" "s"
$chain
[1] "ch" "ai" "n"
$banana
[1] "b" "a" "n" "a" "n" "a"
$that
[1] "th" "a" "t"
$rain
[1] "r" "ai" "n"
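Alternatively, strsplit with a Perl-style lookbehind can be used. Note that this pattern is written by hand for the letters in "that" only and does not generalize to other words:
strsplit("that",split = "(?<=t(?!h)|th|a)",perl = TRUE)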
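A more general regex route, offered as a sketch and not as part of the answers above, is to build an alternation from the digraph vector and extract the matches with gregexpr and regmatches, not split on them. The helper name split_digraphs is hypothetical, and the sketch assumes the digraphs contain no regex metacharacters.

```r
# Hypothetical helper: split words into pieces, keeping the listed digraphs whole.
split_digraphs <- function(words, digraphs) {
  # Try each digraph first, then fall back to any single character.
  # Assumption: digraphs is non-empty and contains no regex metacharacters.
  pattern <- paste0(paste(digraphs, collapse = "|"), "|.")
  pieces <- regmatches(words, gregexpr(pattern, words, perl = TRUE))
  # Name each result after its input word, like the output of f(v)
  names(pieces) <- words
  pieces
}

dg <- c("ch", "th", "gh", "ai")
res <- split_digraphs(c("thanks", "chain", "that"), dg)
```

Because perl = TRUE tries alternatives from left to right, a digraph is preferred over a single character wherever both could match, so res$that holds "th", "a", "t".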