Java / Groovy regex to parse key-value pairs without delimiters

Problem

I can't figure out how to extract key-value pairs with a regular expression.

Code so far

String raw = '''
MA1

D. Mueller Gießer

MA2 Peter

Mustermann 2. Mann


MA3 Ulrike Mastorius Schmelzer

MA4 Heiner Becker
s 3.Mann

MA5 Rudolf Peters

Gießer

'''

Map map = [:]

ArrayList<String> split = raw.findAll("(MA\\d)+(.*)"){ full,name,value ->  map[name] = value }


println map

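The output is: [MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]

In my case the keys are MA1, MA2, MA3, and so on, i.e. MA\d (MA followed by any single digit).

The value is everything up to the next key, including line breaks, tabs, spaces, etc.

Does anyone know how to do this?

Thanks in advance, Sebastian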

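Solutions

In the second group, you can capture everything that follows the key on the same line, plus all following lines that do not start with a key: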

^(MA\d+)(.*(?:\R(?!MA\d).*)*)

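Pattern breakdown

  • ^ Start of a line (this requires multiline mode; without it, ^ matches only at the start of the string)
  • (MA\d+) Capture group 1: match MA followed by one or more digits
  • ( Capture group 2
    • .* Match the rest of the line
    • (?:\R(?!MA\d).*)* Match all following lines that do not start with MA followed by a digit; \R matches any Unicode newline sequence
  • ) Close group 2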
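
The effect of the ^ anchor is easy to overlook, so here is a small sketch of it in Java. The class name MultilineAnchorDemo and the helper countMatches are made up for illustration, and the input is a shortened version of the sample data from the question. Because that data starts with an empty line, the pattern finds nothing unless multiline mode is enabled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchorDemo {
    // Hypothetical helper: counts how many times the pattern matches the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened sample data; note the leading line break.
        String raw = "\nMA2 Peter\n\nMustermann 2. Mann\n\n\nMA3 Ulrike Mastorius Schmelzer\n";
        String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";

        // Without MULTILINE, ^ matches only at the very start of the input.
        System.out.println(countMatches(Pattern.compile(regex), raw)); // prints 0

        // With MULTILINE, ^ also matches right after each line terminator.
        System.out.println(countMatches(Pattern.compile(regex, Pattern.MULTILINE), raw)); // prints 2
    }
}
```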


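In Java, double the backslashes. The pattern also needs multiline mode (Pattern.MULTILINE, or a leading (?m)) so that ^ matches at the start of each line: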

final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";
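
A minimal sketch of how this pattern could be used to fill a map in Java. The class name KeyValueParser, the parse method, and the trim() call are assumptions for illustration and not part of the original answer. Pattern.MULTILINE is passed so that ^ matches at the start of every line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // MULTILINE lets ^ match at the start of every line, not only at the start of the input.
    private static final Pattern ENTRY =
            Pattern.compile("^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    // Hypothetical helper: returns the key -> value pairs in order of appearance.
    static Map<String, String> parse(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // Group 1 is the key; group 2 is everything up to the next key.
            // trim() is an assumption: remove it to keep the surrounding whitespace.
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "\nMA2 Peter\n\nMustermann 2. Mann\n\n\n"
                + "MA3 Ulrike Mastorius Schmelzer\n\n"
                + "MA4 Heiner Becker\ns 3.Mann\n";
        parse(raw).forEach((key, value) -> System.out.println(key + " -> [" + value + "]"));
    }
}
```

Line breaks inside a value are preserved, so MA2 maps to "Peter", an empty line, and then "Mustermann 2. Mann".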

Alternatively, use a lazy match that stops at a lookahead for the next key or the end of the string:

(?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)


Explanation

                         EXPLANATION
--------------------------------------------------------------------------------
  (?ms)                    set flags for this block (with ^ and $
                           matching start and end of line) (with .
                           matching \n) (case-sensitive) (matching
                           whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \n                       '\n' (newline)
--------------------------------------------------------------------------------
    MA                       'MA'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
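
The lazy variant explained above can be sketched in Java as follows. The class name LazyKeyValueParser, the parse method, and the trim() call are assumptions for illustration. The inline flags (?ms) enable multiline and dot-all mode, so no extra flags are passed to Pattern.compile.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyKeyValueParser {
    // (?m): ^ matches at each line start; (?s): . also matches line breaks.
    // The lazy .*? stops as soon as the lookahead sees the next key or the end of the input.
    private static final Pattern ENTRY =
            Pattern.compile("(?ms)^(MA\\d+)(.*?)(?=\\nMA\\d|\\z)");

    // Hypothetical helper: returns the key -> value pairs in order of appearance.
    static Map<String, String> parse(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // trim() is an assumption: it drops the whitespace around each value.
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "\nMA4 Heiner Becker\ns 3.Mann\n\nMA5 Rudolf Peters\n";
        parse(raw).forEach((key, value) -> System.out.println(key + " -> [" + value + "]"));
    }
}
```

The lookahead only checks for \n before the next key. With \r\n line endings the trailing \r ends up in the value, which trim() removes here.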