Handling Grammar Ambiguity in ANTLR4

I have a grammar that should parse the following snippet (as an example):

    vmthread programm_start
    {
        CALL main
    }

    subcall main
    {
        // Declarations
        DATAF i
        CALL i
        // Statements
        MOVEF_F 3 i
    }

The problem is an ambiguity around the CALL statement. This op code is valid in the vmthread section (where it is the only op code allowed) but also in the subcall sections. If I define an OP_CODES token containing all op codes plus an additional OC_CALL token, both lexer rules match the text 'CALL', so the lexer can't handle the situation (obviously).

The following listings are snippets of my grammar (first lexer, second parser):

    VMTHREAD : 'vmthread' ;
    SUBCALL : 'subcall' ;
    CURLY_OPEN : '{' ;
    CURLY_CLOSE : '}' ;
    OP_CODES : 'DATA8' | 'DATAF' | 'MOVE8_8' | 'MOVEF_F' | 'CALL' ;
    OC_CALL : 'CALL' ;

    lms : vmthread subcalls+ ;
    vmthread : VMTHREAD name = ID CURLY_OPEN vmthreadCall CURLY_CLOSE ;
    vmthreadCall : oc = OC_CALL name = ID ;
    subcalls : SUBCALL name = ID CURLY_OPEN ins = instruction* CURLY_CLOSE ; //instruction+
    instruction : oc = OP_CODES args = argumentList ;
    argumentList : arguments+ ;
    arguments : INTEGER | NUMBER | TEXT | ID ;

To continue my work, I've replaced the OC_CALL token in the vmthreadCall parser rule with the OP_CODES token. That solves the problem for now, because the code is auto-generated. But a user could also type this code by hand, in which case any op code would be accepted in the vmthread section.

Is there a solution for this, or should I move the validation into the parser? There I can easily determine whether the statement in the vmthread section is just the CALL statement.

For clarification: in the vmthread only CALL is allowed. In the subcalls (there can be more than one) every op code is allowed (CALL plus every other op code defined). I do not want to distinguish between those different CALL statements; I know that's not possible in a context-free grammar, and I will handle it in the parser. I just want to restrict the vmthread to the one CALL statement and allow all statements (all op codes) in the subcalls. Hopefully that makes it clearer.

Accepted answer

Change your lexer rules like this:

    OP_CODES : 'DATA8' | 'DATAF' | 'MOVE8_8' | 'MOVEF_F' | OC_CALL ;
    OC_CALL : 'CALL' ;

or alternatively like this:

    OP_CODES : 'DATA8' | 'DATAF' | 'MOVE8_8' | 'MOVEF_F' | CALL ;
    OC_CALL : CALL ;
    fragment CALL : 'CALL' ;

Btw, I recommend that you create explicit lexer rules for your literals (like that CALL fragment), which will make later processing easier. ANTLR assigns generic names to implicitly created literals, which makes it hard to find out which token belongs to which literal.
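
One thing worth keeping in mind: the ANTLR4 lexer chooses a token type without knowing which parser rule is active. It takes the longest match, and when several rules match the same text, the rule defined first wins. A variant that sidesteps the overlap entirely is to let each literal belong to exactly one token rule and express the "CALL or any other op code" choice in the parser instead. The following is only a sketch under the assumption that ID, INTEGER, NUMBER, and TEXT are defined as in the original grammar and that ID comes after the keyword rules:

```antlr
// Lexer: 'CALL' is matched by OC_CALL only, so there is no tie to break.
// Assumption: ID is defined after these rules, otherwise ID would win for 'CALL'.
OC_CALL  : 'CALL' ;
OP_CODES : 'DATA8' | 'DATAF' | 'MOVE8_8' | 'MOVEF_F' ;

// Parser: the vmthread section accepts only the CALL op code ...
vmthreadCall : oc = OC_CALL name = ID ;

// ... while a subcall instruction accepts CALL as well as every other op code.
instruction  : oc = (OC_CALL | OP_CODES) args = argumentList ;
```

With this layout, a vmthread containing anything other than CALL is rejected by the parser itself, and the `oc` label still gives uniform access to the op code token inside instruction.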
