I want to capture named fields from a list of strings using a regular expression. In Matlab I did it like this:
    strings = {'sn555 ID_O20-5-684_N52_2_Subt2_01.', ...
               'sn555 ID_O20-5-984_S52_8_Subt10_11.'};
    pattern = ['sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])' ...
               '(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\.'];
    ParsedData = regexp(strings, pattern, 'names');
The result (converted to a dataset) is:
    ParsedData =
        serial_number    ID             Class    Sector    Point
        '555'            'O20-5-684'    'N'      '52'      '2'
        '555'            'O20-5-984'    'S'      '52'      '8'
Now I want to parse these strings in R and convert the result to a data frame.
I tried this:
    strings <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
                 "sn555 ID_O20-5-984_S52_8_Subt10_11.")
    pattern <- paste0('sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])',
                      '(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\\.')
    ParsedData <- gregexpr(pattern, strings, perl = TRUE)
    ParsedData
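Unfortunately, I am new to regular expressions in R, and the output (ParsedData) is confusing to me. Do you have any suggestions for how to convert the strings into a data frame?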
Solution
In the past I wrote a helper function, regcapturedmatches.R, that extracts the capture groups from a regular expression match.
You can use it with your data like this:
    rr <- regcapturedmatches(strings, ParsedData)
    rr
    # [[1]]
    #      serial_number X   ID          Class Sector Point X.1
    # [1,] "555"         "_" "O20-5-684" "N"   "52"   "2"   "2"
    #
    # [[2]]
    #      serial_number X   ID          Class Sector Point X.1
    # [1,] "555"         "_" "O20-5-984" "S"   "52"   "8"   "8"
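You get a list of character matrices with column names, one per input string. The X and X.1 columns come from the two unnamed groups in the pattern, (_) and the inner (.*) inside Point. You can convert the list to a data.frame: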
    do.call(rbind.data.frame, rr)
    #   serial_number X        ID Class Sector Point X.1
    # 1           555 _ O20-5-684     N     52     2   2
    # 2           555 _ O20-5-984     S     52     8   8
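If you would rather not depend on the helper, the same information can be read straight from the attributes that base R attaches to a `perl = TRUE` match. The following is only a minimal sketch, not the helper's actual implementation. It assumes each string matches at most once, so it uses `regexpr` instead of `gregexpr`, and it uses a slightly modified pattern with the two unnamed groups removed so that no `X` columns appear. The variable names `starts`, `lens`, `group_names`, `cols`, and `parsed` are made up for this illustration.

```r
strings <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
             "sn555 ID_O20-5-984_S52_8_Subt10_11.")

# Same pattern as in the question, minus the unnamed groups (an assumption
# made here so that every capture group has a name).
pattern <- paste0("sn(?<serial_number>.*) ID_(?<ID>.*)_(?<Class>[NS])",
                  "(?<Sector>.*)_(?<Point>.*)_[Ss]ubt.*\\.")

# regexpr() with perl = TRUE stores the capture-group positions as attributes.
m <- regexpr(pattern, strings, perl = TRUE)
starts      <- attr(m, "capture.start")   # matrix: one row per string
lens        <- attr(m, "capture.length")  # matrix: one row per string
group_names <- attr(m, "capture.names")   # character vector of group names

# Cut each group out of every string; substring() is vectorised over strings.
cols <- lapply(seq_along(group_names), function(j) {
  substring(strings, starts[, j], starts[, j] + lens[, j] - 1)
})
names(cols) <- group_names

parsed <- as.data.frame(cols, stringsAsFactors = FALSE)
parsed
```

All columns come back as character vectors. A string that does not match yields empty strings in its row, so you may want to check `m == -1` before relying on the result.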