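Problem description
I want to write a regular expression pattern that extracts an address or location from a narration string, to be applied to a data set of 350,000 records.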
txn_add <- data.frame(NARRATION=c(
  "$ $ $ +YBL PATAUDI CHOWK \\ $",
  "$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
  "$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
  "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
  "$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"))
I ran the following regular expression pattern:
gsub(".*[:|+]([^.]+)[$|\\|\\/].*","\\1",txn_add$NARRATION)
The output I got is:
[1] "YBL PATAUDI CHOWK "
[2] " MAIN BHAWANA ROAD NEW DELHI "
[3] "RAGHUBARPURA "
[4] "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $"
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION "
This output is incorrect, because I need to enforce several conditions. An address can start at:
1. '+'
2. '@'
3. ' AT '
4. ':'
5. <P|S><SBI><P|S> # EXACT TEXT PRECEDED AND FOLLOWED BY PUNCTUATION OR SPACE
6. <NNN> FOLLOWED BY <P|S|A> # 3 NUMBERS FOLLOWED BY EITHER PUNCTUATION OR SPACE OR ALPHA
and end at:
1. -
2. /
3. $
4. \
5. <NNNNNNN> # Combination of numbers
and may contain:
Letters, numbers, dot (.), dash (-), space ( ), comma (,), underscore (_), brackets (()), at (@), hash (#), ampersand (&), semicolon (;)
The desired output is:
[1] "YBL PATAUDI CHOWK"
[2] "MAIN BHAWANA ROAD NEW DELHI "
[3] "RAGHUBARPURA "
[4] "DELHIIN"
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN"
I can't get the desired output. What can I try next?
Solution
You could use a capture group:
(?:[+@:]|\bAT(?!M))\s*([A-Z]+(?:\s+[A-Z]+)*)
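Explanation
- (?:   Non-capturing group
  - [+@:]   Match one of +, @ or :
  - |   Or
  - \bAT(?!M)   Match AT at a word boundary, not followed by M (so ATM is skipped)
- )   Close the non-capturing group
- \s*   Match 0+ whitespace chars
- (   Capture group 1
  - [A-Z]+(?:\s+[A-Z]+)*   Match 1+ chars A-Z, optionally repeated with 1+ whitespace chars in between
- )   Close group 1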
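The \bAT(?!M) alternative is probably the least obvious part of the pattern, so here is a minimal sketch that tests it in isolation. The sample strings are hypothetical and were chosen only to exercise the word boundary and the negative lookahead.

```r
# Hypothetical sample strings, picked to exercise \b and (?!M).
samples <- c("ATM CASH", " ATDELHIIN-", "PATAUDI", "PAID AT DELHI")

# perl = TRUE is required: lookaheads such as (?!M) are PCRE syntax
# and are not supported by R's default regex engine.
at_hit <- grepl("\\bAT(?!M)", samples, perl = TRUE)

# "ATM CASH"      : AT is followed by M, so the lookahead rejects it
# " ATDELHIIN-"   : AT starts a word and is followed by D
# "PATAUDI"       : AT is inside a word, so \b does not match before it
# "PAID AT DELHI" : standalone AT
print(at_hit)
```

Under these assumptions the second and fourth strings match and the other two do not.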
With sub, match everything before and after the group as well, and keep only group 1:
sub(".*(?:[+@:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*).*","\\1",txn_add$NARRATION,perl=TRUE)
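One thing to keep in mind with this approach: sub returns the input unchanged when the pattern does not match, so a narration without any marker would come back whole. The sketch below reruns the call on the sample data and wraps it in a hypothetical helper, extract_location (not part of the original answer), that returns NA for non-matching rows. Note also that the pattern only covers the +, @, : and AT start markers with uppercase letters and whitespace; the SBI and three-digit start rules from the question are not handled here.

```r
txn_add <- data.frame(
  NARRATION = c(
    "$ $ $ +YBL PATAUDI CHOWK \\ $",
    "$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
    "$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
    "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
    "$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"
  ),
  stringsAsFactors = FALSE
)

# Core pattern from the answer, without the surrounding .*
pattern <- "(?:[+@:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*)"
full_pattern <- paste0(".*", pattern, ".*")

# Direct call: the leading .* is greedy, so the last marker in the string wins.
via_sub <- sub(full_pattern, "\\1", txn_add$NARRATION, perl = TRUE)
print(via_sub)

# Hypothetical helper: NA instead of the untouched input when nothing matches.
extract_location <- function(x) {
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(full_pattern, "\\1", x[hit], perl = TRUE)
  out
}

print(extract_location(c(txn_add$NARRATION, "NO MARKER HERE")))
```

On 350,000 records both versions are single vectorized calls, so no explicit loop over rows is needed.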