Parsing in Practice (5): A Complete Language
Core Concepts
The Grammatical Structure of a Complete Language
In the previous installments we implemented, separately:
- parsing arithmetic expressions
- parsing variable declarations
- parsing control structures
- parsing function definitions
Now we need to integrate these pieces into a parser for a complete language. A complete language typically has the following structure:
- Program: a sequence of function definitions
- Function definition: a return type, a function name, a parameter list, and a function body
- Function body: a compound statement
- Compound statement: a sequence of statements enclosed in braces {}
- Statement: a variable declaration, an assignment, a conditional, a loop, a return statement, and so on
- Expression: an arithmetic, relational, or logical expression, among others
Designing the Grammar
We can describe a complete language with a context-free grammar:
program → function_definition program | ε
function_definition → type_specifier IDENTIFIER '(' parameter_list ')' compound_statement
parameter_list → parameter_list ',' parameter | parameter | ε
parameter → type_specifier IDENTIFIER
compound_statement → '{' statement_list '}'
statement_list → statement statement_list | ε
statement → declaration_statement | expression_statement | if_statement | while_statement | for_statement | return_statement
declaration_statement → type_specifier IDENTIFIER ';' | type_specifier IDENTIFIER '=' expression ';'
expression_statement → expression ';'
if_statement → 'if' '(' expression ')' statement | 'if' '(' expression ')' statement 'else' statement
while_statement → 'while' '(' expression ')' statement
for_statement → 'for' '(' (declaration_statement | expression_statement) expression ';' expression ')' statement
return_statement → 'return' expression ';' | 'return' ';'
expression → assignment_expression
assignment_expression → logical_or_expression | IDENTIFIER '=' assignment_expression
logical_or_expression → logical_and_expression | logical_or_expression '||' logical_and_expression
logical_and_expression → equality_expression | logical_and_expression '&&' equality_expression
equality_expression → relational_expression | equality_expression '==' relational_expression | equality_expression '!=' relational_expression
relational_expression → additive_expression | relational_expression '<' additive_expression | relational_expression '>' additive_expression | relational_expression '<=' additive_expression | relational_expression '>=' additive_expression
additive_expression → multiplicative_expression | additive_expression '+' multiplicative_expression | additive_expression '-' multiplicative_expression
multiplicative_expression → primary_expression | multiplicative_expression '*' primary_expression | multiplicative_expression '/' primary_expression | multiplicative_expression '%' primary_expression
primary_expression → IDENTIFIER | NUMBER | STRING | '(' expression ')' | IDENTIFIER '(' argument_list ')'
argument_list → argument_list ',' expression | expression | ε
type_specifier → 'int' | 'void' | 'float' | 'double' | 'char'
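Note that the expression rules above are written with left recursion (e.g. additive_expression → additive_expression '+' multiplicative_expression), which a recursive descent parser cannot implement by direct recursion. The standalone sketch below (the function names and tuple representation are illustrative, not part of the parser built later) shows the standard transformation: the left-recursive rule becomes a loop that folds each new operand into a left-nested node, preserving left associativity.

```python
import re

def lex(s):
    # Split into number and operator/paren tokens; whitespace is ignored.
    return re.findall(r'\d+|[+\-*/()]', s)

def parse_additive(tokens):
    # additive → multiplicative ( ('+'|'-') multiplicative )*
    # The left-recursive grammar rule becomes a loop that folds each new
    # operand into a left-nested tuple, giving left associativity.
    node, rest = parse_primary(tokens)
    while rest and rest[0] in '+-':
        op = rest[0]
        right, rest = parse_primary(rest[1:])
        node = (op, node, right)
    return node, rest

def parse_primary(tokens):
    tok, rest = tokens[0], tokens[1:]
    if tok == '(':
        node, rest = parse_additive(rest)
        return node, rest[1:]  # drop the closing ')'
    return int(tok), rest

tree, _ = parse_additive(lex("1 - 2 - 3"))
print(tree)  # ('-', ('-', 1, 2), 3): (1 - 2) - 3, not 1 - (2 - 3)
```

The full parser below uses exactly this loop shape in its parse_additive_expression and related methods.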
Practical Examples
Example 1: A Complete Program
Let's build a recursive descent parser for the following program in our language:
int main() {
    int a = 10;
    int b = 20;
    int sum = add(a, b);
    if (sum > 25) {
        printf("Sum is greater than 25\n");
    } else {
        printf("Sum is less than or equal to 25\n");
    }
    return 0;
}

int add(int x, int y) {
    int result = x + y;
    return result;
}

void print_numbers() {
    int i = 0;
    while (i < 5) {
        printf("%d\n", i);
        i = i + 1;
    }
}

void print_range(int start, int end) {
    for (int i = start; i <= end; i = i + 1) {
        printf("%d\n", i);
    }
}
Code Implementation: The Complete-Language Parser
1. The Lexer
First, we need a more complete lexer:
# lexer.py
import re

class Token:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

    def __repr__(self):
        return f"Token({self.type}, {repr(self.value)})"

# Note: multi-character operators must be listed before their single-character
# prefixes ('==' before '=', '<=' before '<', '>=' before '>'), because
# tokenize takes the first pattern that matches.
token_specs = [
    ('TYPE', r'\b(int|void|float|double|char)\b'),
    ('KEYWORD', r'\b(if|else|while|for|return)\b'),
    ('IDENTIFIER', r'\b[a-zA-Z_][a-zA-Z0-9_]*\b'),
    ('NUMBER', r'\b\d+\b'),
    ('STRING', r'"([^"\\]|\\.)*"'),
    ('EQUAL', r'=='),
    ('NOT_EQUAL', r'!='),
    ('LESS_EQUAL', r'<='),
    ('GREATER_EQUAL', r'>='),
    ('LOGICAL_AND', r'&&'),
    ('LOGICAL_OR', r'\|\|'),
    ('ASSIGN', r'='),
    ('LESS', r'<'),
    ('GREATER', r'>'),
    ('PLUS', r'\+'),
    ('MINUS', r'-'),
    ('MULTIPLY', r'\*'),
    ('DIVIDE', r'/'),
    ('MODULO', r'%'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('LBRACE', r'\{'),
    ('RBRACE', r'\}'),
    ('SEMICOLON', r';'),
    ('COMMA', r','),
    ('WHITESPACE', r'\s+'),
    ('OTHER', r'.'),
]

def tokenize(code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = None
        for token_type, pattern in token_specs:
            regex = re.compile(pattern)
            match = regex.match(code, pos)
            if match:
                text = match.group(0)
                if token_type != 'WHITESPACE':
                    tokens.append(Token(token_type, text))
                pos = match.end(0)
                break
        if not match:
            raise ValueError(f"Invalid token at position {pos}: {code[pos]}")
    return tokens
2. Abstract Syntax Tree (AST) Nodes
Next, we define a fuller set of AST node types:
# ast.py
# Note: this file shadows Python's built-in ast module for code run from the
# same directory; rename it (e.g. to nodes.py) if that ever becomes a problem.

class ASTNode:
    pass

class Program(ASTNode):
    def __init__(self, function_definitions):
        self.function_definitions = function_definitions

class FunctionDefinition(ASTNode):
    def __init__(self, return_type, name, parameters, body):
        self.return_type = return_type
        self.name = name
        self.parameters = parameters
        self.body = body

class Parameter(ASTNode):
    def __init__(self, param_type, name):
        self.param_type = param_type
        self.name = name

class CompoundStatement(ASTNode):
    def __init__(self, statements):
        self.statements = statements

class DeclarationStatement(ASTNode):
    def __init__(self, var_type, var_name, initializer=None):
        self.var_type = var_type
        self.var_name = var_name
        self.initializer = initializer

class ExpressionStatement(ASTNode):
    def __init__(self, expression):
        self.expression = expression

class IfStatement(ASTNode):
    def __init__(self, condition, then_statement, else_statement=None):
        self.condition = condition
        self.then_statement = then_statement
        self.else_statement = else_statement

class WhileStatement(ASTNode):
    def __init__(self, condition, body):
        self.condition = condition
        self.body = body

class ForStatement(ASTNode):
    def __init__(self, init, condition, increment, body):
        self.init = init
        self.condition = condition
        self.increment = increment
        self.body = body

class ReturnStatement(ASTNode):
    def __init__(self, expression=None):
        self.expression = expression

class AssignmentExpression(ASTNode):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class BinaryExpression(ASTNode):
    def __init__(self, left, operator, right):
        self.left = left
        self.operator = operator
        self.right = right

class Identifier(ASTNode):
    def __init__(self, name):
        self.name = name

class NumberLiteral(ASTNode):
    def __init__(self, value):
        self.value = value

class StringLiteral(ASTNode):
    def __init__(self, value):
        self.value = value

class CallExpression(ASTNode):
    def __init__(self, function, arguments):
        self.function = function
        self.arguments = arguments
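To make the node classes concrete, here is a standalone sketch (it uses minimal stand-ins for two of the classes above plus a hypothetical to_sexpr helper, not anything the parser requires) that builds by hand the tree the parser should produce for 1 + 2 * 3, where * binds tighter than +:

```python
# Minimal stand-ins mirroring the node classes above.
class BinaryExpression:
    def __init__(self, left, operator, right):
        self.left, self.operator, self.right = left, operator, right

class NumberLiteral:
    def __init__(self, value):
        self.value = value

def to_sexpr(node):
    # Render a tree as a parenthesized prefix string for inspection.
    if isinstance(node, NumberLiteral):
        return str(node.value)
    return f"({node.operator} {to_sexpr(node.left)} {to_sexpr(node.right)})"

# The tree a precedence-respecting parser builds for 1 + 2 * 3:
tree = BinaryExpression(NumberLiteral(1), '+',
                        BinaryExpression(NumberLiteral(2), '*', NumberLiteral(3)))
print(to_sexpr(tree))  # (+ 1 (* 2 3))
```

A helper like this is handy for checking by eye that the parser groups operators the way the grammar intends.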
3. The Recursive Descent Parser
Now we implement the full recursive descent parser:
# parser.py
from lexer import tokenize
from ast import (
    Program, FunctionDefinition, Parameter, CompoundStatement,
    DeclarationStatement, ExpressionStatement, IfStatement, WhileStatement,
    ForStatement, ReturnStatement, AssignmentExpression, BinaryExpression,
    Identifier, NumberLiteral, StringLiteral, CallExpression
)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def current_token(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def eat(self, token_type):
        if self.current_token() and self.current_token().type == token_type:
            self.pos += 1
        else:
            raise SyntaxError(f"Expected {token_type}, got {self.current_token()}")

    def parse(self):
        function_definitions = []
        while self.current_token() and self.current_token().type == 'TYPE':
            function_definitions.append(self.parse_function_definition())
        return Program(function_definitions)

    def parse_function_definition(self):
        # Parse the return type
        return_type = self.current_token().value
        self.eat('TYPE')
        # Parse the function name
        function_name = self.current_token().value
        self.eat('IDENTIFIER')
        # Parse the parameter list
        self.eat('LPAREN')
        parameters = self.parse_parameter_list()
        self.eat('RPAREN')
        # Parse the function body
        body = self.parse_compound_statement()
        return FunctionDefinition(return_type, function_name, parameters, body)

    def parse_parameter_list(self):
        parameters = []
        if self.current_token() and self.current_token().type == 'TYPE':
            parameters.append(self.parse_parameter())
            while self.current_token() and self.current_token().type == 'COMMA':
                self.eat('COMMA')
                parameters.append(self.parse_parameter())
        return parameters

    def parse_parameter(self):
        param_type = self.current_token().value
        self.eat('TYPE')
        param_name = self.current_token().value
        self.eat('IDENTIFIER')
        return Parameter(param_type, param_name)

    def parse_compound_statement(self):
        self.eat('LBRACE')
        statements = self.parse_statement_list()
        self.eat('RBRACE')
        return CompoundStatement(statements)

    def parse_statement_list(self):
        statements = []
        while self.current_token() and self.current_token().type != 'RBRACE':
            statements.append(self.parse_statement())
        return statements

    def parse_statement(self):
        token = self.current_token()
        if token.type == 'TYPE':
            return self.parse_declaration_statement()
        elif token.type == 'KEYWORD' and token.value == 'if':
            return self.parse_if_statement()
        elif token.type == 'KEYWORD' and token.value == 'while':
            return self.parse_while_statement()
        elif token.type == 'KEYWORD' and token.value == 'for':
            return self.parse_for_statement()
        elif token.type == 'KEYWORD' and token.value == 'return':
            return self.parse_return_statement()
        else:
            return self.parse_expression_statement()

    def parse_declaration_statement(self):
        var_type = self.current_token().value
        self.eat('TYPE')
        var_name = self.current_token().value
        self.eat('IDENTIFIER')
        initializer = None
        if self.current_token() and self.current_token().type == 'ASSIGN':
            self.eat('ASSIGN')
            initializer = self.parse_expression()
        self.eat('SEMICOLON')
        return DeclarationStatement(var_type, var_name, initializer)

    def parse_expression_statement(self):
        expression = self.parse_expression()
        self.eat('SEMICOLON')
        return ExpressionStatement(expression)

    def parse_if_statement(self):
        self.eat('KEYWORD')  # 'if'
        self.eat('LPAREN')
        condition = self.parse_expression()
        self.eat('RPAREN')
        then_statement = self.parse_statement()
        else_statement = None
        if self.current_token() and self.current_token().type == 'KEYWORD' and self.current_token().value == 'else':
            self.eat('KEYWORD')  # 'else'
            else_statement = self.parse_statement()
        return IfStatement(condition, then_statement, else_statement)

    def parse_while_statement(self):
        self.eat('KEYWORD')  # 'while'
        self.eat('LPAREN')
        condition = self.parse_expression()
        self.eat('RPAREN')
        body = self.parse_statement()
        return WhileStatement(condition, body)

    def parse_for_statement(self):
        self.eat('KEYWORD')  # 'for'
        self.eat('LPAREN')
        # Parse the initializer: either a declaration (e.g. "int i = 0;") or an
        # expression statement; both consume the trailing ';'
        if self.current_token() and self.current_token().type == 'TYPE':
            init = self.parse_declaration_statement()
        else:
            init = self.parse_expression_statement()
        # Parse the loop condition
        condition = self.parse_expression()
        self.eat('SEMICOLON')
        # Parse the increment expression
        increment = self.parse_expression()
        self.eat('RPAREN')
        # Parse the loop body
        body = self.parse_statement()
        return ForStatement(init, condition, increment, body)

    def parse_return_statement(self):
        self.eat('KEYWORD')  # 'return'
        expression = None
        if self.current_token() and self.current_token().type != 'SEMICOLON':
            expression = self.parse_expression()
        self.eat('SEMICOLON')
        return ReturnStatement(expression)

    def parse_expression(self):
        return self.parse_assignment_expression()

    def parse_assignment_expression(self):
        left = self.parse_logical_or_expression()
        if self.current_token() and self.current_token().type == 'ASSIGN':
            self.eat('ASSIGN')
            right = self.parse_assignment_expression()
            return AssignmentExpression(left, right)
        return left

    def parse_logical_or_expression(self):
        left = self.parse_logical_and_expression()
        while self.current_token() and self.current_token().type == 'LOGICAL_OR':
            operator = self.current_token().value
            self.eat('LOGICAL_OR')
            right = self.parse_logical_and_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_logical_and_expression(self):
        left = self.parse_equality_expression()
        while self.current_token() and self.current_token().type == 'LOGICAL_AND':
            operator = self.current_token().value
            self.eat('LOGICAL_AND')
            right = self.parse_equality_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_equality_expression(self):
        left = self.parse_relational_expression()
        while self.current_token() and self.current_token().type in ('EQUAL', 'NOT_EQUAL'):
            operator = self.current_token().value
            self.eat(self.current_token().type)
            right = self.parse_relational_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_relational_expression(self):
        left = self.parse_additive_expression()
        while self.current_token() and self.current_token().type in ('LESS', 'LESS_EQUAL', 'GREATER', 'GREATER_EQUAL'):
            operator = self.current_token().value
            self.eat(self.current_token().type)
            right = self.parse_additive_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_additive_expression(self):
        left = self.parse_multiplicative_expression()
        while self.current_token() and self.current_token().type in ('PLUS', 'MINUS'):
            operator = self.current_token().value
            self.eat(self.current_token().type)
            right = self.parse_multiplicative_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_multiplicative_expression(self):
        left = self.parse_primary_expression()
        while self.current_token() and self.current_token().type in ('MULTIPLY', 'DIVIDE', 'MODULO'):
            operator = self.current_token().value
            self.eat(self.current_token().type)
            right = self.parse_primary_expression()
            left = BinaryExpression(left, operator, right)
        return left

    def parse_primary_expression(self):
        token = self.current_token()
        if token.type == 'IDENTIFIER':
            self.eat('IDENTIFIER')
            if self.current_token() and self.current_token().type == 'LPAREN':
                # Function call
                self.eat('LPAREN')
                arguments = []
                if self.current_token() and self.current_token().type != 'RPAREN':
                    arguments.append(self.parse_expression())
                    while self.current_token() and self.current_token().type == 'COMMA':
                        self.eat('COMMA')
                        arguments.append(self.parse_expression())
                self.eat('RPAREN')
                return CallExpression(Identifier(token.value), arguments)
            return Identifier(token.value)
        elif token.type == 'NUMBER':
            self.eat('NUMBER')
            return NumberLiteral(token.value)
        elif token.type == 'STRING':
            self.eat('STRING')
            return StringLiteral(token.value)
        elif token.type == 'LPAREN':
            self.eat('LPAREN')
            expression = self.parse_expression()
            self.eat('RPAREN')
            return expression
        else:
            raise SyntaxError(f"Unexpected token: {token}")
4. Testing the Parser
Let's create a test file to exercise the complete parser:
# test_parser.py
from lexer import tokenize
from parser import Parser

# Test program (a raw string, so the "\n" inside the printf strings reaches
# the lexer as an escape sequence rather than as a real newline)
test_code = r"""
int main() {
    int a = 10;
    int b = 20;
    int sum = add(a, b);
    if (sum > 25) {
        printf("Sum is greater than 25\n");
    } else {
        printf("Sum is less than or equal to 25\n");
    }
    return 0;
}

int add(int x, int y) {
    int result = x + y;
    return result;
}

void print_numbers() {
    int i = 0;
    while (i < 5) {
        printf("%d\n", i);
        i = i + 1;
    }
}

void print_range(int start, int end) {
    for (int i = start; i <= end; i = i + 1) {
        printf("%d\n", i);
    }
}
"""

# Tokenize
tokens = tokenize(test_code)
print("Tokens:")
for token in tokens[:20]:  # Print only the first 20 tokens to keep output short
    print(token)
print("...")

# Parse
parser = Parser(tokens)
program = parser.parse()
print("\nParsed Program:")
print(f"Number of functions: {len(program.function_definitions)}")
for i, func in enumerate(program.function_definitions):
    print(f"\nFunction {i+1}: {func.return_type} {func.name}")
    print(f"Parameters: {[(p.param_type, p.name) for p in func.parameters]}")
    print(f"Body statements: {len(func.body.statements)}")
Optimizations and Extensions
1. Improved Error Handling
We can improve error handling to report more detail, including the error position and a suggestion:
    def eat(self, token_type):
        if self.current_token() and self.current_token().type == token_type:
            self.pos += 1
        else:
            current = self.current_token()
            current_desc = f"{current.type}: {current.value}" if current else "end of input"
            raise SyntaxError(
                f"Syntax error at position {self.pos}: "
                f"Expected {token_type}, got {current_desc}. "
                f"Suggestion: Check your syntax around this position."
            )
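The position reported here is a token index; error messages are friendlier when they name a source line and column instead. A standalone sketch (the helper name line_col is illustrative, and the parser would need to record each token's character offset to use it) that derives both from a flat character offset:

```python
def line_col(code, offset):
    # 1-based line = number of newlines before the offset, plus one.
    line = code.count('\n', 0, offset) + 1
    # 1-based column = distance from the character after the last newline.
    last_nl = code.rfind('\n', 0, offset)
    col = offset - (last_nl + 1) + 1
    return line, col

src = "int a = 10;\nint b = ;"
print(line_col(src, 0))   # (1, 1): the 'i' of the first 'int'
print(line_col(src, 20))  # (2, 9): the stray ';' on the second line
```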
2. Supporting More Data Types
We can extend the type system to support more forms, such as pointer types:
    def parse_type_specifier(self):
        token = self.current_token()
        if token.type == 'TYPE':
            type_spec = token.value
            self.eat('TYPE')
            # Support pointer types: fold each trailing '*' into the type string
            while self.current_token() and self.current_token().type == 'MULTIPLY':
                type_spec += '*'
                self.eat('MULTIPLY')
            return type_spec
        else:
            raise SyntaxError(f"Expected type specifier, got {token}")
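The same folding logic can be exercised in isolation. This sketch (fold_pointer_type and its plain-string token list are illustrative, not the parser's Token objects) mirrors the loop above:

```python
def fold_pointer_type(tokens):
    # tokens: e.g. ['int', '*', '*', 'p'] — a base type followed by zero or
    # more '*' tokens; returns the folded type string and the remaining tokens.
    type_spec, rest = tokens[0], tokens[1:]
    while rest and rest[0] == '*':
        type_spec += '*'
        rest = rest[1:]
    return type_spec, rest

print(fold_pointer_type(['int', '*', '*', 'p']))  # ('int**', ['p'])
```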
3. Supporting Arrays and Structs
We can add support for arrays and structs; the sketch below handles array declarations (struct support would follow the same pattern):
    def parse_declaration_statement(self):
        var_type = self.current_token().value
        self.eat('TYPE')
        var_name = self.current_token().value
        self.eat('IDENTIFIER')
        # Support array declarations (this assumes LBRACKET/RBRACKET token
        # types for '[' and ']' have been added to token_specs in the lexer)
        if self.current_token() and self.current_token().type == 'LBRACKET':
            self.eat('LBRACKET')
            array_size = self.current_token().value
            self.eat('NUMBER')
            self.eat('RBRACKET')
            var_type += f"[{array_size}]"
        initializer = None
        if self.current_token() and self.current_token().type == 'ASSIGN':
            self.eat('ASSIGN')
            initializer = self.parse_expression()
        self.eat('SEMICOLON')
        return DeclarationStatement(var_type, var_name, initializer)
Summary
In this installment we learned how to build a parser for a complete language, covering:
- The grammatical structure of a complete language: programs, function definitions, function bodies, compound statements, statements, and expressions
- Grammar design: describing the whole language with a context-free grammar
- Implementing a recursive descent parser:
  - parsing the program structure
  - parsing function definitions
  - parsing the various statements (declarations, conditionals, loops, return statements, and so on)
  - parsing the various expressions (arithmetic, relational, logical, and assignment expressions)
  - building a complete abstract syntax tree
- Optimizations and extensions: better error handling, more data types, arrays and structs
With this installment you should understand how to assemble the pieces from the earlier episodes into a parser for a complete language. This lays a solid foundation for the semantic analysis and code generation to come.
In real compiler development, parsing a full language is considerably more involved, with many more grammar rules and edge cases to handle, but the underlying principles and techniques are the same. With steady practice you can gradually build up your compiler-development skills.