Java regex to match multi-line sections and subsections

Problem description

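As an extension of a simpler StackOverflow question: is there a Java regex that can extract every section and subsection, in one pass, from a multi-line text document structured like the one below?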

<Irrelevant line>
...
<Irrelevant line>
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...

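section_title can be anything, and each subsection title (OVERVIEW, INTRODUCTION, DETAILS) is the only text on its line. All other lines can contain any text and can span multiple lines, from empty to thousands of characters.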

Of course, the document could also be processed with a BufferedReader, reading it line by line, but a regex would provide a more elegant solution.
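For comparison, the line-by-line alternative mentioned above might look roughly like the sketch below. The class name LineByLineSketch, the parse method and the String[] result layout are hypothetical choices made for illustration, and the sketch joins content lines with "\n" regardless of the original line separators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineByLineSketch {
    // Subsection titles taken from the question; each one is the only text on its line.
    private static final List<String> SUBSECTION_TITLES =
            Arrays.asList("OVERVIEW", "INTRODUCTION", "DETAILS");

    // Returns one {sectionTitle, subSectionTitle, content} triple per subsection.
    static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> result = new ArrayList<>();
        String section = null;
        String subSection = null;
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            boolean isSection = line.startsWith("####");
            boolean isSubSection = SUBSECTION_TITLES.contains(line);
            if (isSection || isSubSection) {
                // A new header closes the subsection that was being collected.
                if (subSection != null) {
                    result.add(new String[] {section, subSection, content.toString()});
                }
                content.setLength(0);
                if (isSection) {
                    section = line.substring(4);
                    subSection = null;
                } else {
                    subSection = line;
                }
            } else if (subSection != null) {
                // Lines before the first subsection header are irrelevant and skipped.
                content.append(line).append('\n');
            }
        }
        if (subSection != null) {
            result.add(new String[] {section, subSection, content.toString()});
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String input = "junk\n####S1\nOVERVIEW\na\nb\nDETAILS\nc\n####S2\nOVERVIEW\nd";
        for (String[] entry : parse(new BufferedReader(new StringReader(input)))) {
            System.out.println(Arrays.toString(entry));
        }
    }
}
```

The state that has to be tracked by hand here (current section, current subsection, content buffer) is what a single regex would express declaratively.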

Solution

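When iterated, the following regex returns one subsection at a time, optionally including the section header for the first subsection of each section.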

(?m)(?:^####(.*)\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\z)

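(?m) makes ^ and $ match at the start and end of each line throughout the rest of the regex, which is why \z is used to match the end of input, i.e. what $ would normally match (without (?m), $ also matches just before a final line terminator).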
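A small sketch of the difference between $ and \z under (?m). The class MultilineAnchorDemo and its helper matchOffsets are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchorDemo {
    // Returns the start offset of every (zero-length) match of regex in input.
    static List<Integer> matchOffsets(String regex, String input) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc";
        System.out.println(matchOffsets("$", input));       // [5]       end of input only
        System.out.println(matchOffsets("(?m)$", input));   // [1, 3, 5] end of every line
        System.out.println(matchOffsets("(?m)\\z", input)); // [5]       \z ignores (?m)
    }
}
```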

(?s:XXX) makes . match any character, including line separators (\r, \n), but only within the XXX pattern.
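The scoping of (?s:XXX) can be seen in a short sketch. The class DotallGroupDemo and the helper firstMatch are hypothetical names introduced for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallGroupDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "A\nB\nC";
        // Without DOTALL, . stops at the line terminator, so A.*B cannot match.
        System.out.println(firstMatch("A.*B", input) == null);              // true
        // DOTALL applies only inside the group; the trailing .* stays on one line.
        System.out.println(firstMatch("A(?s:.*)B.*", input).equals("A\nB")); // true
        // A leading (?s) applies to the whole pattern, so the trailing .* runs on.
        System.out.println(firstMatch("(?s)A.*B.*", input).equals("A\nB\nC")); // true
    }
}
```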

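\R matches \r\n, \n or \r (as well as the other Unicode line break characters), i.e. it matches a line separator regardless of the operating system (Windows vs. Linux).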
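As a quick illustration of \R treating all three line endings alike (the class LineBreakDemo and method splitLines are hypothetical names; \R requires Java 8 or later):

```java
import java.util.Arrays;

public class LineBreakDemo {
    // Splits text on any line break sequence; "\r\n" counts as a single break.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Windows, Unix and old Mac line endings mixed in one string.
        String mixed = "a\r\nb\nc\rd";
        System.out.println(Arrays.toString(splitLines(mixed))); // [a, b, c, d]
    }
}
```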

Using the reluctant quantifier .*? followed by the lookahead (?=XXX) makes the regex match text up to, but not including, the XXX pattern.
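The effect of pairing a reluctant quantifier with a lookahead can be sketched on a simplified input. The '#' delimiter, the class ReluctantLookaheadDemo and the helper captures are assumptions made for the example, not part of the answer's regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantLookaheadDemo {
    // Collects group 1 of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "#one#two#three";
        // Reluctant: stops at the first position where the lookahead succeeds.
        System.out.println(captures("#(.*?)(?=#|\\z)", input)); // [one, two, three]
        // Greedy: runs to the last position where the lookahead succeeds.
        System.out.println(captures("#(.*)(?=#|\\z)", input));  // [one#two#three]
    }
}
```

Because the lookahead consumes nothing, the next '#' is still available as the start of the following match, which is what lets the full regex pick up the next header on the next find().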

Demo
(also available on regex101.com)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String regex = "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)";

        String input = "<Irrelevant line>\r\n" +
                       "...\r\n" +
                       "<Irrelevant line>\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...";

        // Each find() yields one subsection.
        for (Matcher m = Pattern.compile(regex).matcher(input); m.find(); ) {
            String sectionTitle = m.group(1);    // null unless this is the first subsection of a section
            String subSectionTitle = m.group(2);
            String content = m.group(3);
            if (sectionTitle != null)
                System.out.println("sectionTitle: " + sectionTitle);
            System.out.println("subSectionTitle: " + subSectionTitle);
            // Indent continuation lines so they align under the first content line.
            System.out.println("content: " + content.replaceAll("(?ms)(?<=.)^","         "));
        }
    }
}

Output

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...
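Since group 1 is non-null only for the first subsection of each section, code that needs the section title for every subsection has to carry it forward between matches. A minimal sketch of that, assuming section titles are unique (the class SectionMapSketch, the parse method and the nested-map layout are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionMapSketch {
    // Same regex as in the demo, split across two literals for readability.
    private static final Pattern SUBSECTION = Pattern.compile(
            "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))"
            + "(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)");

    // Maps section title -> (subsection title -> content), in document order.
    // Assumption: titles are unique; sections sharing a title would be merged.
    static Map<String, Map<String, String>> parse(String input) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        String currentTitle = null;
        Matcher m = SUBSECTION.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                currentTitle = m.group(1); // a new section starts with this subsection
            }
            sections.computeIfAbsent(currentTitle, k -> new LinkedHashMap<>())
                    .put(m.group(2), m.group(3));
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "####A\nOVERVIEW\nx\nDETAILS\ny\n####B\nOVERVIEW\nz";
        Map<String, Map<String, String>> sections = parse(input);
        System.out.println(sections.keySet());          // [A, B]
        System.out.println(sections.get("A").keySet()); // [OVERVIEW, DETAILS]
    }
}
```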
