语法分析实战（四）—— 函数定义

核心知识点讲解

函数定义的语法结构

函数定义是编程语言中非常重要的语法结构，一个典型的函数定义包含以下几个部分：

函数返回类型：函数执行完毕后返回值的类型
函数名：函数的标识符
参数列表：函数接收的参数，包含参数类型和参数名
函数体：函数的具体实现，包含在花括号 {} 中

函数定义的文法设计

我们可以使用上下文无关文法来描述函数定义的语法：

function_def → type_specifier IDENTIFIER '(' parameter_list ')' compound_statement
parameter_list → parameter_list ',' parameter | parameter | ε
parameter → type_specifier IDENTIFIER
compound_statement → '{' statement_list '}'
statement_list → statement_list statement | ε

其中：

type_specifier 表示类型说明符，如 int、void 等
IDENTIFIER 表示标识符
parameter_list 表示参数列表，可以是空的
compound_statement 表示复合语句，即函数体

实用案例分析

示例 1：简单函数定义

我们来实现一个递归下降分析器，用于解析以下简单的函数定义：

int add(int a, int b) {
    return a + b;
}

void print_hello() {
    printf("Hello, World!\n");
}

代码实现：函数定义解析器

1. 词法分析器（Lexer）

首先，我们需要一个词法分析器来将输入的源代码转换为 tokens：

# lexer.py
import re

class Token:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value
    
    def __repr__(self):
        return f"Token({self.type}, {repr(self.value)})"

token_specs = [
    ('TYPE', r'\b(int|void|float|double|char)\b'),
    ('IDENTIFIER', r'\b[a-zA-Z_][a-zA-Z0-9_]*\b'),
    ('NUMBER', r'\b\d+\b'),
    ('STRING', r'"([^"\\]|\\.)*"'),
    ('PLUS', r'\+'),
    ('MINUS', r'-'),
    ('MULTIPLY', r'\*'),
    ('DIVIDE', r'/'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('LBRACE', r'\{'),
    ('RBRACE', r'\}'),
    ('SEMICOLON', r';'),
    ('COMMA', r','),
    ('WHITESPACE', r'\s+'),
    ('OTHER', r'.'),
]

def tokenize(code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = None
        for token_type, pattern in token_specs:
            regex = re.compile(pattern)
            match = regex.match(code, pos)
            if match:
                text = match.group(0)
                if token_type != 'WHITESPACE':
                    tokens.append(Token(token_type, text))
                pos = match.end(0)
                break
        if not match:
            raise ValueError(f"Invalid token at position {pos}: {code[pos]}")
    return tokens

2. 抽象语法树（AST）节点

接下来，我们定义抽象语法树的节点类型：

# ast.py
class ASTNode:
    pass

class FunctionDefinition(ASTNode):
    def __init__(self, return_type, name, parameters, body):
        self.return_type = return_type
        self.name = name
        self.parameters = parameters
        self.body = body

class Parameter(ASTNode):
    def __init__(self, param_type, name):
        self.param_type = param_type
        self.name = name

class CompoundStatement(ASTNode):
    def __init__(self, statements):
        self.statements = statements

class ReturnStatement(ASTNode):
    def __init__(self, expression):
        self.expression = expression

class ExpressionStatement(ASTNode):
    def __init__(self, expression):
        self.expression = expression

class CallExpression(ASTNode):
    def __init__(self, function, arguments):
        self.function = function
        self.arguments = arguments

class BinaryExpression(ASTNode):
    def __init__(self, left, operator, right):
        self.left = left
        self.operator = operator
        self.right = right

class Identifier(ASTNode):
    def __init__(self, name):
        self.name = name

class StringLiteral(ASTNode):
    def __init__(self, value):
        self.value = value

3. 递归下降分析器（Parser）

现在，我们实现递归下降分析器来解析函数定义：

# parser.py
from lexer import tokenize
from ast import (
    FunctionDefinition, Parameter, CompoundStatement, ReturnStatement,
    ExpressionStatement, CallExpression, BinaryExpression, Identifier, StringLiteral
)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
    
    def current_token(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None
    
    def eat(self, token_type):
        if self.current_token() and self.current_token().type == token_type:
            self.pos += 1
        else:
            raise SyntaxError(f"Expected {token_type}, got {self.current_token()}")
    
    def parse(self):
        functions = []
        while self.current_token():
            if self.current_token().type == 'TYPE':
                functions.append(self.parse_function_def())
            else:
                break
        return functions
    
    def parse_function_def(self):
        # 解析函数返回类型
        return_type = self.current_token().value
        self.eat('TYPE')
        
        # 解析函数名
        function_name = self.current_token().value
        self.eat('IDENTIFIER')
        
        # 解析参数列表
        self.eat('LPAREN')
        parameters = self.parse_parameter_list()
        self.eat('RPAREN')
        
        # 解析函数体
        body = self.parse_compound_statement()
        
        return FunctionDefinition(return_type, function_name, parameters, body)
    
    def parse_parameter_list(self):
        parameters = []
        if self.current_token() and self.current_token().type == 'TYPE':
            parameters.append(self.parse_parameter())
            while self.current_token() and self.current_token().type == 'COMMA':
                self.eat('COMMA')
                parameters.append(self.parse_parameter())
        return parameters
    
    def parse_parameter(self):
        param_type = self.current_token().value
        self.eat('TYPE')
        param_name = self.current_token().value
        self.eat('IDENTIFIER')
        return Parameter(param_type, param_name)
    
    def parse_compound_statement(self):
        self.eat('LBRACE')
        statements = []
        while self.current_token() and self.current_token().type != 'RBRACE':
            if self.current_token().value == 'return':
                statements.append(self.parse_return_statement())
            else:
                statements.append(self.parse_expression_statement())
        self.eat('RBRACE')
        return CompoundStatement(statements)
    
    def parse_return_statement(self):
        self.eat('IDENTIFIER')  # 'return'
        expression = self.parse_expression()
        self.eat('SEMICOLON')
        return ReturnStatement(expression)
    
    def parse_expression_statement(self):
        expression = self.parse_expression()
        self.eat('SEMICOLON')
        return ExpressionStatement(expression)
    
    def parse_expression(self):
        # 简化版表达式解析，仅支持二元表达式和函数调用
        left = self.parse_primary()
        while self.current_token() and self.current_token().type in ('PLUS', 'MINUS', 'MULTIPLY', 'DIVIDE'):
            operator = self.current_token().value
            self.eat(self.current_token().type)
            right = self.parse_primary()
            left = BinaryExpression(left, operator, right)
        return left
    
    def parse_primary(self):
        token = self.current_token()
        if token.type == 'IDENTIFIER':
            self.eat('IDENTIFIER')
            if self.current_token() and self.current_token().type == 'LPAREN':
                # 函数调用
                self.eat('LPAREN')
                arguments = []
                if self.current_token() and self.current_token().type != 'RPAREN':
                    arguments.append(self.parse_expression())
                    while self.current_token() and self.current_token().type == 'COMMA':
                        self.eat('COMMA')
                        arguments.append(self.parse_expression())
                self.eat('RPAREN')
                return CallExpression(Identifier(token.value), arguments)
            return Identifier(token.value)
        elif token.type == 'STRING':
            self.eat('STRING')
            return StringLiteral(token.value)
        elif token.type == 'NUMBER':
            self.eat('NUMBER')
            return Identifier(token.value)  # 简化处理，将数字视为标识符
        elif token.type == 'LPAREN':
            self.eat('LPAREN')
            expr = self.parse_expression()
            self.eat('RPAREN')
            return expr
        else:
            raise SyntaxError(f"Unexpected token: {token}")

4. 测试解析器

让我们创建一个测试文件来验证我们的解析器：

# test_parser.py
from lexer import tokenize
from parser import Parser

# 测试代码
test_code = """
int add(int a, int b) {
    return a + b;
}

void print_hello() {
    printf("Hello, World!\n");
}
"""

# 分词
tokens = tokenize(test_code)
print("Tokens:")
for token in tokens:
    print(token)

# 解析
parser = Parser(tokens)
functions = parser.parse()

print("\nParsed Functions:")
for func in functions:
    print(f"Function: {func.return_type} {func.name}")
    print(f"Parameters: {[(p.param_type, p.name) for p in func.parameters]}")
    print(f"Body statements: {len(func.body.statements)}")
    print()

代码优化与扩展

1. 支持更多类型的参数

我们可以扩展参数解析，支持更多类型的参数，如数组参数、指针参数等：

def parse_parameter(self):
    # 解析类型说明符
    param_type = []
    while self.current_token() and self.current_token().type in ('TYPE', 'MULTIPLY'):
        param_type.append(self.current_token().value)
        self.eat(self.current_token().type)
    param_type_str = ' '.join(param_type)
    
    # 解析参数名
    param_name = self.current_token().value
    self.eat('IDENTIFIER')
    
    return Parameter(param_type_str, param_name)

2. 支持函数声明

除了函数定义，我们还可以添加对函数声明的支持：

def parse_function_declaration(self):
    # 解析函数返回类型
    return_type = self.current_token().value
    self.eat('TYPE')
    
    # 解析函数名
    function_name = self.current_token().value
    self.eat('IDENTIFIER')
    
    # 解析参数列表
    self.eat('LPAREN')
    parameters = self.parse_parameter_list()
    self.eat('RPAREN')
    
    # 函数声明以分号结尾
    self.eat('SEMICOLON')
    
    return FunctionDeclaration(return_type, function_name, parameters)

3. 错误处理改进

我们可以改进错误处理，提供更详细的错误信息：

def eat(self, token_type):
    if self.current_token() and self.current_token().type == token_type:
        self.pos += 1
    else:
        current = self.current_token()
        current_desc = f"{current.type}: {current.value}" if current else "end of input"
        raise SyntaxError(
            f"Syntax error at position {self.pos}: "
            f"Expected {token_type}, got {current_desc}"
        )

总结

本集我们学习了如何实现函数定义的语法分析，包括：

函数定义的语法结构：返回类型、函数名、参数列表和函数体
函数定义的文法设计：使用上下文无关文法描述函数定义
递归下降分析器实现：
- 解析函数定义
- 解析参数列表
- 解析函数体
- 构建抽象语法树
代码优化与扩展：支持更多类型的参数、函数声明和改进错误处理

函数定义的解析是编译器前端的重要组成部分，掌握这部分内容对于理解整个编译过程至关重要。在实际的编译器开发中，函数定义的解析会更加复杂，需要考虑更多的语法规则和边界情况，但基本的原理和方法是相通的。

通过本集的学习，你应该能够理解如何在编译器中实现函数定义的语法分析，为后续的语义分析和代码生成打下基础。