Problem description
I am using the ANTLR grammar below to parse my code.
https://github.com/antlr/grammars-v4/tree/master/cpp
TEST_F(TestClass,false_positive__N)
{
static constexpr char text[] =
R"~~~(; ModuleID = 'a.cpp'
source_filename = "a.cpp"
define private i32 @"__ir_hidden#100007_"(i32 %arg1) {
ret i32 %arg1
}
define i32 @main(i32 %arg1) {
%1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)
ret i32 %1
}
)~~~";
NameMock ns(text);
ASSERT_EQ(std::string(text),ns.getSeed());
}
Error details:
line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas','(','[','{','=',',';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas',';'}
line 15:4 token recognition error at: '";\n'
What changes are needed in the parser/lexer to parse this input correctly? Any help is appreciated. Thanks in advance.
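Solution
Whenever some input is not parsed properly, I first display all the tokens the input produces. If you do that, you will probably see why things go wrong. Another approach is to remove most of the source and gradually add lines back: at some point the parser will fail, and you have a starting point for fixing it.
So if you dump the tokens created from your input, you get these tokens: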
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
UserDefinedLiteral `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 @main(i32 %arg1) {\n %1 = call i32 @"__ir_hidden`
Directive `#100007_"(i32 %arg1)`
...
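You can see that the input `R"~~~( ... )~~~"` is not tokenized as a `StringLiteral`. Note that a `StringLiteral` will never be created, because at the top of the lexer grammar there is this rule: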
Literal:
IntegerLiteral
| CharacterLiteral
| FloatingLiteral
| StringLiteral
| BooleanLiteral
| PointerLiteral
| UserDefinedLiteral;
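None of `IntegerLiteral` .. `UserDefinedLiteral` will ever be created: they all become `Literal` tokens, because when two lexer rules match the same text, ANTLR picks the rule defined first. It is far better to move this `Literal` rule to the parser. I must admit that, scrolling through the lexer grammar, it looks like a bit of a mess, and fixing `R"~~~( ... )~~~"` will only postpone the next lingering problem :). I am pretty sure this grammar has never been properly tested and is full of bugs.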
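As a rough illustration of why the specific token types never show up, the toy Java sketch below mimics the two tie-breaking rules an ANTLR lexer applies: the longest match wins, and on equal length the rule declared first wins. The class name `RulePriorityDemo` and the simplified regex patterns are assumptions made for this example; they are not taken from the CPP14 grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RulePriorityDemo {
    // Toy stand-ins for lexer rules, kept in declaration order.
    // The patterns are simplified, hypothetical versions of the real rules.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Literal", Pattern.compile("[0-9]+|\"[^\"]*\""));
        RULES.put("IntegerLiteral", Pattern.compile("[0-9]+"));
        RULES.put("StringLiteral", Pattern.compile("\"[^\"]*\""));
        RULES.put("Identifier", Pattern.compile("[A-Za-z_][A-Za-z_0-9]*"));
    }

    // Returns the name of the rule that wins at the start of the input,
    // or null if no rule matches there.
    static String winningRule(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strict '>' keeps the earlier rule when two matches have equal length.
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winningRule("\"abc\";"));
        System.out.println(winningRule("42;"));
        System.out.println(winningRule("text"));
    }
}
```

In this sketch, as long as `Literal` is listed first, both the quoted string and the number are reported as `Literal`; the more specific rules can only win once `Literal` no longer competes with them in the lexer.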
If you look at the lexer definition of `StringLiteral`:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' .*? '(' .*? ')' .*? '"'
;
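it is clear why `'"' .*? '(' .*? ')' .*? '"'` does not match your entire string literal: each non-greedy `.*?` stops as early as it can, so the rule ends at the first `"` that follows the first `)`. In your input that is the quote in `@"__ir_hidden` on the `call` line, which is where the token in the dump above ends (the trailing `__ir_hidden` is presumably picked up as a user-defined-literal suffix).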
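The early stop can be reproduced outside ANTLR. The sketch below uses a Java regex as an analogue of the `Rawstring` fragment; Java's reluctant quantifiers are not identical to ANTLR's non-greedy subrules, so treat this only as an approximation. The class name `NonGreedyRawStringDemo` and the shortened sample input are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyRawStringDemo {
    // Regex analogue of: '"' .*? '(' .*? ')' .*? '"'
    // DOTALL lets '.' cross line breaks, as '.' does in an ANTLR lexer.
    static final Pattern RAW =
            Pattern.compile("\"(.*?)\\((.*?)\\)(.*?)\"", Pattern.DOTALL);

    // Returns the text matched at the start of the input, or null if nothing matches.
    static String match(String input) {
        Matcher m = RAW.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The part of a raw string literal that follows the leading R.
        String afterR = "\"~~~(call @\"f\"(x) ret \"g\" end)~~~\"";
        String matched = match(afterR);
        System.out.println(matched);
        System.out.println(matched.length() < afterR.length());
    }
}
```

The match ends at the quote in front of `g`, the first one after the first `)`, instead of at the closing `)~~~"`.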
What you need is a rule that looks like this:
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
;
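But that causes `( . )*` to consume too much: it grabs every character and then backtracks to the last quote in the character stream (not what you want).
What you really want is this: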
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
;
The `break out of this loop when we see )~~~"` part can be done with a semantic predicate, like this:
lexer grammar CPP14Lexer;
@members {
private boolean closeDelimiterAhead(String matched) {
// Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
String delimiter = ")" + matched.substring(matched.indexOf('"') + 1,matched.indexOf('(')) + "\"";
StringBuilder ahead = new StringBuilder();
// Collect as many characters ahead as there are characters in `delimiter`
for (int n = 1; n <= delimiter.length(); n++) {
if (_input.LA(n) == CPP14Lexer.EOF) {
throw new RuntimeException("Missing delimiter: " + delimiter);
}
ahead.append((char) _input.LA(n));
}
return delimiter.equals(ahead.toString());
}
}
...
StringLiteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
;
...
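To try the delimiter logic without generating a lexer, here is a minimal standalone sketch. It builds the closing sequence the same way the predicate does (`)` + delimiter + `"`) but simply searches for it with `String.indexOf` instead of looking ahead one character at a time, so it is an approximation of the predicate and not a copy of it. `RawStringScanner` and `endOfRawString` are hypothetical names, and encoding prefixes are ignored.

```java
final class RawStringScanner {
    // Returns the index just past the closing quote of the raw string literal
    // that starts at `start`, which must point at the R of R"delim( ... )delim".
    // Throws if the opening is malformed or the closing sequence never appears.
    static int endOfRawString(String src, int start) {
        if (!src.startsWith("R\"", start)) {
            throw new IllegalArgumentException("Not a raw string at index " + start);
        }
        int open = src.indexOf('(', start + 2);
        if (open < 0) {
            throw new IllegalArgumentException("Missing '(' after the delimiter");
        }
        // Everything between the opening quote and the first '(' is the delimiter.
        String closing = ")" + src.substring(start + 2, open) + "\"";
        int close = src.indexOf(closing, open + 1);
        if (close < 0) {
            throw new IllegalArgumentException("Missing delimiter: " + closing);
        }
        return close + closing.length();
    }

    public static void main(String[] args) {
        String src = "R\"~~~( )~~\" )~~~~\" )~~~\";";
        int end = endOfRawString(src, 0);
        // Prints the literal without the trailing semicolon.
        System.out.println(src.substring(0, end));
    }
}
```

Near-miss endings such as `)~~"` and `)~~~~"` are skipped because neither contains the exact closing sequence; if the exact sequence never appears, the method throws, much like the predicate does at end of input.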
If you now dump the tokens, you will see this:
Identifier `TEST_F`
LeftParen `(`
Identifier `TestClass`
Comma `,`
Identifier `false_positive__N`
RightParen `)`
LeftBrace `{`
Static `static`
Constexpr `constexpr`
Char `char`
Identifier `text`
LeftBracket `[`
RightBracket `]`
Assign `=`
Literal `R"~~~(; ModuleID = 'a.cpp'\n source_filename = "a.cpp"\n\n define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n ret i32 %arg1\n }\n\ndefine i32 @main(i32 %arg1) {\n %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)\n ret i32 %1\n}\n)~~~"`
Semi `;`
...
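That is: `R"~~~( ... )~~~"` is now properly tokenized as a single token (albeit as a `Literal` token instead of a `StringLiteral`...). It will throw an exception when the input looks like `R"~~~( ... )~~"` or `R"~~~( ... )~~~~"`, and it will successfully tokenize input like `R"~~~( )~~" )~~~~" )~~~"`.
Having a quick look at the parser grammar, I see tokens like `StringLiteral` being referenced, but the lexer will never produce such tokens (as I mentioned before).
Be careful with this grammar. I would not advise (blindly) using it for anything other than some sort of educational purpose. Do not use it in production!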
The following lexer changes helped me solve the raw string parsing problem:
Stringliteral
: Encodingprefix? '"' Schar* '"'
| Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
| Encodingprefix? 'R' Rawstring
;
fragment Rawstring
: '"' // Match Opening Double Quote
( /* Handle Empty D_CHAR_SEQ without Predicates
This should also work
'(' .*? ')'
*/
'(' ( ~')' | ')'+ ~'"' )* (')'+)
| D_CHAR_SEQ
/* // Limit D_CHAR_SEQ to 16 characters
{ ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
*/
'('
/* From Spec :
Any member of the source character set,except
a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
( which may be empty ) followed by a double quote ".
- The following loop consumes characters until it matches the
terminating sequence of characters for the RAW STRING
- The options are mutually exclusive,so Only one will
ever execute in each loop pass
- Each Option will execute at least once. The first option needs to
match the ')' character even if the D_CHAR_SEQ is empty. The second
option needs to match the closing \" to fall out of the loop. Each
option will only consume at most 1 character
*/
( // Consume everything but the double quote
~'"'
| // If the text does not end with the closing delimiter, consume the double quote
'"'
{
!getText().endsWith(
")"
+ getText().substring( getText().indexOf( "\"" ) + 1,getText().indexOf( "(" )
)
+ '\"'
)
}?
)*
)
'"' // Match Closing Double Quote
/*
// Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
// Send D_CHAR_SEQ <TAB> ... to Parser
{
setText( getText().substring( getText().indexOf("\"") + 1,getText().indexOf("(")
)
+ "\t"
+ getText().substring( getText().indexOf("(") + 1,getText().lastIndexOf(")")
)
);
}
*/
;
fragment D_CHAR_SEQ // Should be limited to 16 characters
: D_CHAR+
;
fragment D_CHAR
/* Any member of the basic source character set except
space,the left parenthesis (,the right parenthesis ),the backslash \,and the control characters representing
horizontal tab,vertical tab,form feed,and newline.
*/
: '\u0021'..'\u0023'
| '\u0025'..'\u0027'
| '\u002a'..'\u003f'
| '\u0041'..'\u005b'
| '\u005d'..'\u005f'
| '\u0061'..'\u007e'
;