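Problem description
I have a string variable from which I want to remove certain words. Many other words contain those words as partial matches, and I do not want to touch them: a word should be removed if and only if it matches exactly.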
clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end
* I create a copy to compare before/after strip
gen strip_words = words
* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of the words, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinstr(strip_words, "`w'", "", .)
}
list
     +------------------------------------------------------------+
     | index                            words         strip_words |
     |------------------------------------------------------------|
  1. |     1              more mor morph test           e ph test |
  2. |     2   ten tennis tenner tenth keeper   nis ner th keeper |
  3. |     3           badder baddy bad other        der dy other |
     +------------------------------------------------------------+
I tried padding with some spaces using replace strip_words = " " + strip_words + " ", but that also removes the spaces that separate the other words. The output I want is:
     +---------------------------------------------------------------------+
     | index                            words                  strip_words |
     |---------------------------------------------------------------------|
  1. |     1              more mor morph test              more morph test |
  2. |     2   ten tennis tenner tenth keeper   tennis tenner tenth keeper |
  3. |     3           badder baddy bad other           badder baddy other |
     +---------------------------------------------------------------------+
Solution
See help string functions for details on subinword().
clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end
* I create a copy to compare before/after strip
gen strip_words = words
* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of the words, remove the complete word from the string
foreach w of local removs {
    replace strip_words = subinword(strip_words, "`w'", "", .)
}
* Collapse the repeated blanks left behind by the removed words
replace strip_words = itrim(strip_words)
This can be handled with regular expressions. Introduction: link
Stata's Unicode-based regular expression functions support \b to denote a word boundary.
clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end
* Alternation of the words to remove
local rmv "(mor|ten|bad)"
* \b on both sides restricts the match to complete words
gen wanted = ustrregexra(words, "\b`rmv'\b", "")
list
     +---------------------------------------------------------------------+
     | index                            words                       wanted |
     |---------------------------------------------------------------------|
  1. |     1              more mor morph test              more morph test |
  2. |     2   ten tennis tenner tenth keeper   tennis tenner tenth keeper |
  3. |     3           badder baddy bad other           badder baddy other |
     +---------------------------------------------------------------------+
From your example, it appears that you want to keep the whitespace as shown above. Otherwise, you can remove the extra blanks with strtrim(), together with a function that collapses internal blanks such as stritrim().
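The regular-expression approach can be sketched for a longer word list kept in a local, with the leftover blanks cleaned up afterwards. This is only a minimal sketch under a few assumptions: Stata 14 or later (for ustrregexra(), stritrim() and strtrim()), a word list separated by single spaces, and words that contain no regular-expression metacharacters. The local name removs is taken from the question; rmv and wanted follow the answer above.

```stata
clear
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* Word list kept in a local, as in the question
local removs "mor ten bad"

* Turn "mor ten bad" into the alternation "(mor|ten|bad)"
* (assumes single spaces between words and no regex metacharacters)
local rmv : subinstr local removs " " "|", all
local rmv "(`rmv')"

* Remove whole-word matches only; \b marks a word boundary
gen wanted = ustrregexra(words, "\b`rmv'\b", "")

* Collapse internal runs of blanks, then strip leading and trailing blanks
replace wanted = strtrim(stritrim(wanted))

list
```

Building the pattern from the local means the word list is written only once. If the blanks left behind by the removed words should be kept, the replace line can simply be dropped.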
Using your example, but with subinword instead of subinstr, you get the desired output.
clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end
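* I create two copies to compare subinstr with subinword
gen strip_words = words
gen strip_words_2 = words
* This is a list of words I want removed. In reality, this is a fairly long list
local removs "mor ten bad"
* For each of the words, remove it from the string
foreach w of local removs {
    * subinstr removes every occurrence, including partial matches
    replace strip_words = subinstr(strip_words, "`w'", "", .)
    * subinword removes only complete words
    replace strip_words_2 = subinword(strip_words_2, "`w'", "", .)
}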
list
     +-----------------------------------------------------------------------------------------+
     | index                            words         strip_words                strip_words_2 |
     |-----------------------------------------------------------------------------------------|
  1. |     1              more mor morph test           e ph test              more morph test |
  2. |     2   ten tennis tenner tenth keeper   nis ner th keeper   tennis tenner tenth keeper |
  3. |     3           badder baddy bad other        der dy other           badder baddy other |
     +-----------------------------------------------------------------------------------------+