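How to fix parse errors with the ANTLR CPP14 grammar

Problem description

I am using the ANTLR grammar below to parse my code.

https://github.com/antlr/grammars-v4/tree/master/cpp

But I get parse errors with the following code: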

TEST_F(TestClass,false_positive__N)
{
  static constexpr char text[] =
    R"~~~(; ModuleID = 'a.cpp'
            source_filename = "a.cpp"

   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {
     ret i32 %arg1
   }

define i32 @main(i32 %arg1) {
   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)
   ret i32 %1
}
)~~~";

 NameMock ns(text);
 ASSERT_EQ(std::string(text),ns.getSeed());
}

Error details:

line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas', ';'}
line 15:4 token recognition error at: '";\n'

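What changes do the parser/lexer need in order to parse this input correctly? Any help would be appreciated. Thanks in advance.

Solution

Whenever some input does not parse correctly, I first display all the tokens the input generates. If you do that, you may see why things go wrong. Another approach is to remove most of the source and gradually add lines back: at some point the parser will fail, and you have a starting point for fixing it.

So, if you dump the tokens your input produces, you get the following: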

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
UserDefinedLiteral        `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden`
Directive                 `#100007_"(i32 %arg1)`
...

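You can see that the input R"~~~( ... )~~~" is not tokenized as a StringLiteral. Note that a StringLiteral will never be created anyway, because the following rule sits at the top of the lexer grammar: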

Literal:
    IntegerLiteral
    | CharacterLiteral
    | FloatingLiteral
    | StringLiteral
    | BooleanLiteral
    | PointerLiteral
    | UserDefinedLiteral;

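None of IntegerLiteral .. UserDefinedLiteral will ever be created: they will all become Literal tokens. It is far better to move this Literal rule to the parser. I must admit that, scrolling through the lexer grammar, it is a bit of a mess, and fixing R"~~~( ... )~~~" will only postpone the next lingering problem :). I am pretty sure this grammar has never been properly tested and is full of bugs.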
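To see why the position of the Literal rule matters, here is a small Java simulation. It is only a toy model, not ANTLR itself: it assumes the usual ANTLR lexer behaviour (longest match wins, and on a tie the rule defined first wins), and the rule bodies are simplified regular expressions invented for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleOrderDemo {
    // Toy stand-ins for the real lexer rules (assumed, heavily simplified).
    static Map<String, Pattern> rules(boolean literalOnTop) {
        Map<String, Pattern> rules = new LinkedHashMap<>(); // keeps definition order
        if (literalOnTop) {
            rules.put("Literal", Pattern.compile("\"[^\"]*\"|[0-9]+"));
        }
        rules.put("StringLiteral", Pattern.compile("\"[^\"]*\""));
        rules.put("IntegerLiteral", Pattern.compile("[0-9]+"));
        rules.put("Whitespace", Pattern.compile("\\s+"));
        return rules;
    }

    // Picks the longest match at each position; on a tie the earlier rule wins.
    static List<String> tokenTypes(Map<String, Pattern> rules, String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: an equally long later rule never replaces an earlier one.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            if (!bestRule.equals("Whitespace")) {
                types.add(bestRule);
            }
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        String input = "\"abc\" 42";
        System.out.println(tokenTypes(rules(true), input));  // [Literal, Literal]
        System.out.println(tokenTypes(rules(false), input)); // [StringLiteral, IntegerLiteral]
    }
}
```

With the catch-all rule on top, the more specific token types never appear, which is the situation described above. Removing it from the lexer (and expressing the same alternatives as a parser rule) lets the specific tokens through.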

If you look at the lexer definition of StringLiteral:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"' .*? '(' .*? ')' .*? '"'
 ;

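it becomes clear why '"' .*? '(' .*? ')' .*? '"' does not match the entire string literal. Each non-greedy .*? stops as early as it can, so the rule ends at the first quote that follows the first ')' after the opening '('. In this input that is the quote in @"__ir_hidden on the call line, in the middle of the raw string.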
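The early stop can be reproduced outside ANTLR. The sketch below uses a Java regular expression as an analogue of the Rawstring fragment. Java's regex engine is not ANTLR's lexer, so treat it as an illustration of non-greedy matching only; the sample input is made up and much shorter than the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyDemo {
    // Regex analogue of:  'R' '"' .*? '(' .*? ')' .*? '"'
    static final Pattern RAW =
            Pattern.compile("R\".*?\\(.*?\\).*?\"", Pattern.DOTALL);

    // Returns the first match, or null when nothing matches.
    static String firstMatch(String source) {
        Matcher m = RAW.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Intended literal: R"~~~(a "b"(c) d "e" )~~~"
        String source = "R\"~~~(a \"b\"(c) d \"e\" )~~~\";";
        // Prints: R"~~~(a "b"(c) d "
        System.out.println(firstMatch(source));
    }
}
```

The match ends at the first quote after the first ')' and never reaches the closing )~~~" sequence.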

The rule you need looks something like this:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
 ;

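But this causes ( . )* to consume too much: it grabs every character and then backtracks to the last quote in the character stream (not what you want).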
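The over-consumption can be sketched the same way with a Java regular expression standing in for the lexer rule. This is an analogy rather than ANTLR's actual matching algorithm, and the two-literal input is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Regex analogue of:  'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
    static final Pattern RAW =
            Pattern.compile("R\"[^(]*\\(.*\\)[^\"]*\"", Pattern.DOTALL);

    // Returns the first match, or null when nothing matches.
    static String firstMatch(String source) {
        Matcher m = RAW.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "a = R\"~~~(one)~~~\"; b = R\"~~~(two)~~~\";";
        // Prints: R"~~~(one)~~~"; b = R"~~~(two)~~~"
        System.out.println(firstMatch(source));
    }
}
```

Both raw strings, and everything between them, end up in a single match because the greedy loop only gives characters back until the last possible closing quote fits.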

What you really want is:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
 ;

The /* break out of this loop when we see `)~~~"` */ part can be done with a semantic predicate, as follows:

lexer grammar CPP14Lexer;

@members {
  private boolean closeDelimiterAhead(String matched) {
    // Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
    String delimiter = ")" + matched.substring(matched.indexOf('"') + 1,matched.indexOf('(')) + "\"";
    StringBuilder ahead = new StringBuilder();

    // Collect as many characters ahead as there are `delimiter`-chars
    for (int n = 1; n <= delimiter.length(); n++) {
      if (_input.LA(n) == CPP14Lexer.EOF) {
        throw new RuntimeException("Missing delimiter: " + delimiter);
      }
      ahead.append((char) _input.LA(n));
    }

    return delimiter.equals(ahead.toString());
  }
}

...

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
 ;

...

If you dump the tokens now, you will see the following:

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
Literal                   `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)\n   ret i32 %1\n}\n)~~~"`
Semi                      `;`
...

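There it is: R"~~~( ... )~~~" is properly tokenized as a single token (albeit as a Literal token rather than a StringLiteral...). It will throw an exception when the input looks like R"~~~( ... )~~" or R"~~~( ... )~~~~", and it will successfully tokenize input such as R"~~~( )~~" )~~~~" )~~~"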
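These three cases can be checked without generating a lexer. The following plain-Java sketch mirrors the idea of the predicate (build the closing sequence from the opening delimiter, then advance one character at a time until that sequence is ahead). The class and method names are hypothetical and it is not part of the grammar.

```java
final class RawStringScanner {
    /**
     * Returns the raw string literal that starts at `start`, which must
     * point at R". Throws IllegalStateException when the closing
     * delimiter never appears.
     */
    static String extract(String src, int start) {
        if (!src.startsWith("R\"", start)) {
            throw new IllegalArgumentException("Expected R\" at offset " + start);
        }
        int open = src.indexOf('(', start + 2);
        if (open < 0) {
            throw new IllegalStateException("Missing '(' after the delimiter");
        }
        // Same construction as in the predicate: ')' + delimiter + '"'
        String closing = ")" + src.substring(start + 2, open) + "\"";
        // Advance one character at a time until the closing sequence is ahead.
        for (int pos = open + 1; pos < src.length(); pos++) {
            if (src.startsWith(closing, pos)) {
                return src.substring(start, pos + closing.length());
            }
        }
        throw new IllegalStateException("Missing delimiter: " + closing);
    }

    public static void main(String[] args) {
        // Near-misses inside the body are skipped; only )~~~" closes the literal.
        String source = "R\"~~~( )~~\" )~~~~\" )~~~\"; rest";
        // Prints: R"~~~( )~~" )~~~~" )~~~"
        System.out.println(extract(source, 0));
    }
}
```

A delimiter that is too short or too long on the closing side leaves the loop without a match, which corresponds to the exception thrown by the predicate.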

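Looking quickly at the parser grammar, I see that it references tokens such as StringLiteral, but the lexer will never produce such tokens (as I mentioned earlier).

Proceed with caution when using this grammar. I would not recommend (blindly) using it for anything other than educational purposes. Do not use it in production!

The lexer changes below helped me resolve the raw string parsing problem: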

 Stringliteral
   : Encodingprefix? '"' Schar* '"'
   | Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
   | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"'              // Match Opening Double Quote
   ( /* Handle Empty D_CHAR_SEQ without Predicates
        This should also work
        '(' .*? ')'
      */
     '(' ( ~')' | ')'+ ~'"' )* (')'+)

   | D_CHAR_SEQ
         /*  // Limit D_CHAR_SEQ to 16 characters
            { ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
         */
     '('
     /* From Spec :
        Any member of the source character set, except
        a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
        ( which may be empty ) followed by a double quote ".

      - The following loop consumes characters until it matches the
        terminating sequence of characters for the RAW STRING
      - The options are mutually exclusive, so only one will
        ever execute in each loop pass
      - Each Option will execute at least once.  The first option needs to
        match the ')' character even if the D_CHAR_SEQ is empty. The second
        option needs to match the closing \" to fall out of the loop. Each
        option will only consume at most 1 character
      */
     (   //  Consume everything but the Double Quote
       ~'"'
     |   //  If text Does Not End with closing Delimiter, consume the Double Quote
       '"'
       {
            !getText().endsWith(
                 ")"
               + getText().substring( getText().indexOf( "\"" ) + 1,getText().indexOf( "(" )
                                    )
               + '\"'
             )
       }?
     )*
   )
   '"'              // Match Closing Double Quote

   /*
   // Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
   //  Send D_CHAR_SEQ <TAB> ... to Parser
   {
     setText( getText().substring( getText().indexOf("\"") + 1,getText().indexOf("(")
                                 )
            + "\t"
            + getText().substring( getText().indexOf("(") + 1,getText().lastIndexOf(")")
                                 )
            );
   }
    */
 ;

 fragment D_CHAR_SEQ     // Should be limited to 16 characters
    : D_CHAR+
 ;
 fragment D_CHAR
      /*  Any member of the basic source character set except
          space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing
           horizontal tab, vertical tab, form feed, and newline.
      */
    : '\u0021'..'\u0023'
    | '\u0025'..'\u0027'
    | '\u002a'..'\u003f'
    | '\u0041'..'\u005b'
    | '\u005d'..'\u005f'
    | '\u0061'..'\u007e'
 ;
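The grammar above leaves the 16-character limit on D_CHAR_SEQ commented out. If you would rather validate the delimiter after lexing, a check along the following lines could be used. It is a sketch with hypothetical names: the character ranges are copied from the D_CHAR fragment, and the limit of 16 comes from the comment in the grammar.

```java
final class DelimiterCheck {
    // Same ranges as the D_CHAR fragment, written as hex code points.
    static boolean isDChar(char c) {
        return (c >= 0x21 && c <= 0x23)
            || (c >= 0x25 && c <= 0x27)
            || (c >= 0x2a && c <= 0x3f)
            || (c >= 0x41 && c <= 0x5b)
            || (c >= 0x5d && c <= 0x5f)
            || (c >= 0x61 && c <= 0x7e);
    }

    // An empty delimiter is allowed; anything longer than 16 characters is not.
    static boolean isValidDelimiter(String delimiter) {
        if (delimiter.length() > 16) {
            return false;
        }
        for (int i = 0; i < delimiter.length(); i++) {
            if (!isDChar(delimiter.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDelimiter("~~~"));  // true
        System.out.println(isValidDelimiter("a b"));  // false: contains a space
    }
}
```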
