Split a string into characters with strsplit, but keep digraphs together

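Question

How can I split an English word into characters while keeping digraphs (e.g. "ch", "th", "gh") intact?

For example, I want the string "that" to be split into "th", "a", "t" rather than "t", "h", "a", "t".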

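Answer

Here is a function f that can help with the split. It consumes the string from the left, taking two characters when they form one of the digraphs in dg and one character otherwise: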

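dg <- c("ch","th","gh","ai")   # digraphs to keep together
v <- c("thanks","chain","banana","that","rain")

f <- Vectorize(function(s) {
  res <- c()
  while (nchar(s) > 0) {
    # take two characters if they form a digraph, otherwise one
    k <- if (substr(s, 1, 2) %in% dg) 2 else 1
    res <- c(res, substr(s, 1, k))
    # drop the consumed characters and continue with the rest
    s <- substr(s, k + 1, nchar(s))
  }
  res
})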

which gives

> f(v)
$thanks
[1] "th" "a"  "n"  "k"  "s" 

$chain
[1] "ch" "ai" "n"

$banana
[1] "b" "a" "n" "a" "n" "a"

$that
[1] "th" "a"  "t"

$rain
[1] "r"  "ai" "n"
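
Another option is to pass a lookbehind-based regular expression to strsplit, although this pattern is written specifically for the string "that":

strsplit("that",split = "(?<=t(?!h)|th|a)",perl = TRUE)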
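
A more general regex-based variant might look like the sketch below. It is not taken from the answers above: the helper name split_digraphs is hypothetical, and the sketch assumes the digraphs contain no regex metacharacters (they would need escaping otherwise). Instead of splitting, it extracts tokens with gregexpr and regmatches, using an alternation that tries each digraph before falling back to any single character.

```r
# Digraphs to keep together and some example words (same values as in the answer).
dg <- c("ch", "th", "gh", "ai")
v <- c("thanks", "chain", "banana", "that", "rain")

# Hypothetical helper: extract tokens instead of splitting.
# Assumes the digraphs contain no regex metacharacters.
split_digraphs <- function(x, digraphs) {
  # e.g. "ch|th|gh|ai|." : digraphs are tried first, then any single character
  pattern <- paste(c(digraphs, "."), collapse = "|")
  tokens <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  setNames(tokens, x)
}

split_digraphs(v, dg)
```

With perl = TRUE the alternatives are tried in order from left to right, which is why the digraphs are listed before the catch-all ".".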