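Stata: removing whole words from a string

Problem description

I have a string variable from which I want to remove certain words. Many other words in the variable partially match those words, and I do not want them touched. A word should be removed if and only if it matches exactly.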

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* I create a copy to compare before/after the strip
gen strip_words = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each word in the list, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinstr(strip_words,"`w'","",.) 
}

list
     +---------------------------------------------------------------+
     | index                            words            strip_words |
     |---------------------------------------------------------------|
  1. |     1              more mor morph test            e ph test   |
  2. |     2   ten tennis tenner tenth keeper     nis ner th keeper  |
  3. |     3           badder baddy bad other         der dy other   |
     +---------------------------------------------------------------+

I tried padding with spaces using replace strip_words = " " + strip_words + " ", but that also removes the spaces separating the other words. My desired output is:

     +-------------------------------------------------------------------------+
     | index                            words                      strip_words |
     |-------------------------------------------------------------------------|
  1. |     1              more mor morph test              more  morph test    |
  2. |     2   ten tennis tenner tenth keeper    tennis tenner tenth keeper    |
  3. |     3           badder baddy bad other           badder baddy  other    |
     +-------------------------------------------------------------------------+

Solution

See help string functions for subinword(), which replaces only whole-word matches.

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* I create a copy to compare before/after the strip
gen strip_words = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each word in the list, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinword(strip_words,"`w'","",.) 
}

replace strip_words = itrim(strip_words) 

This can also be handled with regular expressions.

Stata's Unicode-based regular expression functions support \b to denote a word boundary.

clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

local rmv "(mor|ten|bad)"
gen wanted = ustrregexra(words,"\b`rmv'\b","")

list

     +----------------------------------------------------------------------+
     | index                            words                        wanted |
     |----------------------------------------------------------------------|
  1. |     1              more mor morph test              more  morph test |
  2. |     2   ten tennis tenner tenth keeper    tennis tenner tenth keeper |
  3. |     3           badder baddy bad other           badder baddy  other |
     +----------------------------------------------------------------------+

Judging from your example, you seem to want to keep the spaces as above. Otherwise, you can remove them with strtrim(), which drops leading and trailing blanks; stritrim() collapses runs of internal blanks into a single one.
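Since the question mentions that the real word list is fairly long, the alternation for the regular-expression approach does not have to be typed by hand. The following is a minimal sketch, not a definitive implementation. It assumes the words in the local are separated by single spaces and contain no regex metacharacters, and the macro names alts and pattern are made up for illustration.

```stata
clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* Word list as in the question (assumed single-spaced, no regex metacharacters)
local removs "mor ten bad"

* Turn "mor ten bad" into "mor|ten|bad", then wrap it in a group
local alts : subinstr local removs " " "|", all
local pattern "(`alts')"

* \b anchors every alternative at word boundaries (Unicode regex, Stata 14+)
gen wanted = ustrregexra(words, "\b`pattern'\b", "")

* Optional: collapse doubled blanks and trim both ends
replace wanted = strtrim(stritrim(wanted))

list
```

The extended macro function subinstr is used here instead of evaluating the string function in an expression, so the pattern is built by plain macro substitution. If a word could contain characters such as . or (, it would need escaping before it goes into the pattern.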


Using your example, but with subinword() instead of subinstr(), you get the desired output.

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

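* I create two copies to compare before/after the strip
gen strip_words = words
gen strip_words_2 = words

* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each word in the list, remove it from the string:
* subinstr() strips every substring match, subinword() only whole words
foreach w of local removs {
    replace strip_words   = subinstr(strip_words, "`w'", "", .)
    replace strip_words_2 = subinword(strip_words_2, "`w'", "", .)
}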

list
     

     +-------------------------------------------------------------------------------------------+
     | index                            words          strip_words                 strip_words_2 |
     |-------------------------------------------------------------------------------------------|
  1. |     1              more mor morph test           e  ph test              more  morph test |
  2. |     2   ten tennis tenner tenth keeper    nis ner th keeper    tennis tenner tenth keeper |
  3. |     3           badder baddy bad other        der dy  other           badder baddy  other |
     +-------------------------------------------------------------------------------------------+
     
     
     
