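Episode 99: The re Module: Substitution and Splitting

Learning objectives

Use re.sub() to replace text in strings
Understand how re.subn() also returns the number of replacements
Use re.split() fluently to split strings
Learn advanced substitution and splitting techniques
Apply regular expressions to text processing in real projects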

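I. Substitution: re.sub()

re.sub() is the core function for regex-based replacement: it finds every match of a pattern in a string and replaces it.

1. Basic syntax

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: the regular expression, as a string or a compiled pattern object
repl: the replacement, either a string or a function
string: the original string to process
count: maximum number of replacements (the default 0 replaces every match)
flags: regex flags such as re.IGNORECASE or re.MULTILINE

Return value:

A new string with the replacements applied; the original string is not modified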

2. Basic substitution examples

import re

# Simple literal replacement
text = "Hello World! Hello Python!"
pattern = r"Hello"
new_text = re.sub(pattern, "Hi", text)
print(f"Original: {text}")
print(f"Replaced: {new_text}")  # "Hi World! Hi Python!"

# Replacement with a regex pattern
text = "My phone number is 13812345678, yours is 13987654321"
pattern = r"\d{11}"  # match 11 consecutive digits
new_text = re.sub(pattern, "[phone]", text)
print(f"Original: {text}")
print(f"Replaced: {new_text}")  # "My phone number is [phone], yours is [phone]"

# Limiting the number of replacements
text = "a b a b a b"
pattern = r"a"
new_text = re.sub(pattern, "x", text, count=2)
print(f"Original: {text}")
print(f"First two 'a' replaced: {new_text}")  # "x b x b a b"

3. Referencing groups

In the replacement string, \1, \2, and so on refer to the text captured by the corresponding group:

import re

# Reformat dates
text = "Today is 2023/10/05, yesterday was 2023/10/04"
pattern = r"(\d{4})/(\d{2})/(\d{2})"
# Convert YYYY/MM/DD to YYYY-MM-DD
new_text = re.sub(pattern, r"\1-\2-\3", text)
print(f"Original: {text}")
print(f"Replaced: {new_text}")  # "Today is 2023-10-05, yesterday was 2023-10-04"

# Extract names
text = "Name: Alice, age: 25; Name: Bob, age: 30"
pattern = r"Name: (\w+)(?:, age: \d+)"
new_text = re.sub(pattern, r"\1", text)
print(f"Original: {text}")
print(f"Names only: {new_text}")  # "Alice; Bob"
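Numbered references have two siblings that are worth knowing: \g<name> for named groups and \g<1> as an unambiguous form of \1. The following is a minimal sketch; the group names year, month, and day and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample text; the group names are illustrative.
text = "2023/10/05"
pattern = r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"

# \g<name> refers to a named group in the replacement string.
iso_date = re.sub(pattern, r"\g<year>-\g<month>-\g<day>", text)
print(iso_date)  # 2023-10-05

# \g<1> is the unambiguous form of \1: r"\10" would mean group 10,
# whereas r"\g<1>0" means group 1 followed by a literal "0".
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)  # a10b20
```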

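4. Using a function as the replacement

When the replacement logic is more involved, pass a function as repl. It is called with each match object and must return the replacement string:

import re

# Convert digits to Chinese numerals
text = "There are 123 apples and 456 oranges"

# Replacement function
def number_to_chinese(match):
    num_dict = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
    num_str = match.group()
    result = ""
    # Map each digit of the matched number to its Chinese numeral
    for char in num_str:
        result += num_dict[char]
    return result

pattern = r"\d+"
new_text = re.sub(pattern, number_to_chinese, text)
print(f"Original: {text}")
print(f"Replaced: {new_text}")  # "There are 一二三 apples and 四五六 oranges"

# Capitalize the first letter of every word
text = "hello world, this is python"

# Replacement function that capitalizes one word
def capitalize_word(match):
    return match.group().capitalize()

pattern = r"\b\w+\b"
new_text = re.sub(pattern, capitalize_word, text)
print(f"Original: {text}")
print(f"Replaced: {new_text}")  # "Hello World, This Is Python"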
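A replacement function is also handy when only part of each match should change. The sketch below masks the middle digits of 11-digit numbers; the helper name mask_phone, the sample text, and the masking rule (keep the first three and last four digits) are assumptions made for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "Call 13812345678 or 13987654321"

def mask_phone(match):
    # Keep the first 3 and last 4 digits, hide the 4 digits in between.
    number = match.group()
    return number[:3] + "****" + number[7:]

masked = re.sub(r"\d{11}", mask_phone, text)
print(masked)  # Call 138****5678 or 139****4321
```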

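II. Counting replacements: re.subn()

re.subn() works like re.sub(), but returns a tuple containing the new string and the number of replacements made.

1. Basic syntax

re.subn(pattern, repl, string, count=0, flags=0)

Parameters:

Same as re.sub()

Return value:

A tuple: (new string, number of replacements)

2. Examples

import re

# Basic usage
text = "Hello World! Hello Python! Hello Everyone!"
pattern = r"Hello"
result = re.subn(pattern, "Hi", text)
print(f"Result: {result[0]}")  # "Hi World! Hi Python! Hi Everyone!"
print(f"Count: {result[1]}")  # 3

# Limiting the number of replacements
result = re.subn(pattern, "Hi", text, count=2)
print(f"Result: {result[0]}")  # "Hi World! Hi Python! Hello Everyone!"
print(f"Count: {result[1]}")  # 2

# With a regex pattern
text = "My email addresses are user1@example.com and user2@test.com"
pattern = r"\w+@\w+\.\w+"
# re.ASCII limits \w to [A-Za-z0-9_]; without it, \w also matches non-ASCII
# letters, so text such as Chinese written directly next to an address
# would be swallowed into the match
result = re.subn(pattern, "[email]", text, flags=re.ASCII)
print(f"Result: {result[0]}")  # "My email addresses are [email] and [email]"
print(f"Count: {result[1]}")  # 2

re.subn() is a good fit whenever you need to know how many replacements were made, for example in log statistics or data cleaning.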
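As a rough illustration of the data-cleaning case, the sketch below unpacks the returned tuple and accumulates a total across several records. The records and the helper name collapse_whitespace are hypothetical.

```python
import re

# Hypothetical raw records for illustration.
records = ["a  b", "c d", "e \t f   g"]

def collapse_whitespace(line):
    # Replace each run of two or more whitespace characters with one space
    # and return (cleaned line, number of replacements).
    return re.subn(r"\s{2,}", " ", line)

total = 0
cleaned_records = []
for record in records:
    cleaned, n = collapse_whitespace(record)  # unpack the (string, count) tuple
    cleaned_records.append(cleaned)
    total += n

print(cleaned_records)  # ['a b', 'c d', 'e f g']
print(total)            # 3
```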

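III. Splitting: re.split()

re.split() splits a string wherever the pattern matches and returns the pieces as a list.

1. Basic syntax

re.split(pattern, string, maxsplit=0, flags=0)

Parameters:

pattern: the regular expression, as a string or a compiled pattern object
string: the original string to split
maxsplit: maximum number of splits (the default 0 splits at every match)
flags: regex flags

Return value:

A list of the resulting substrings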

2. Basic splitting examples

import re

# Simple split
text = "apple,banana,orange"
pattern = r","  # split on commas
result = re.split(pattern, text)
print(f"Original: {text}")
print(f"Split: {result}")  # ['apple', 'banana', 'orange']

# Split with a regex pattern
text = "python 2.x, python 3.x; python 3.10"
pattern = r"[,;\s]+"  # split on runs of commas, semicolons, or whitespace
result = re.split(pattern, text)
print(f"Original: {text}")
print(f"Split: {result}")  # ['python', '2.x', 'python', '3.x', 'python', '3.10']

# Limiting the number of splits
result = re.split(pattern, text, maxsplit=2)
print(f"At most 2 splits: {result}")  # ['python', '2.x', 'python 3.x; python 3.10']
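One behavior that is easy to overlook: when a separator sits at the start or end of the string, or two separators are adjacent, re.split() returns empty strings in those positions. A small sketch with made-up input:

```python
import re

# Hypothetical input with leading, trailing, and doubled separators.
text = ",apple,,banana,"
parts = re.split(r",", text)
print(parts)  # ['', 'apple', '', 'banana', '']

# Drop the empty strings if they carry no meaning for the task.
non_empty = [part for part in parts if part]
print(non_empty)  # ['apple', 'banana']
```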

3. Keeping separators with a capturing group

If the pattern contains a capturing group (parentheses), the matched separators are included in the result list as well:

import re

# Keep the separators
text = "apple, banana; orange"
pattern = r"([,;])"
result = re.split(pattern, text)
print(f"Original: {text}")
print(f"With separators: {result}")  # ['apple', ',', ' banana', ';', ' orange']

# Reassemble the string
recombined = "".join(result)
print(f"Reassembled: {recombined}")  # "apple, banana; orange"

# A more involved example
text = "2023-10-05 14:30:45"
pattern = r"([-:\s])"
result = re.split(pattern, text)
print(f"Original: {text}")
print(f"Split: {result}")  # ['2023', '-', '10', '-', '05', ' ', '14', ':', '30', ':', '45']
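With a single capturing group, fields and separators alternate in the result, so slicing can pull them apart again. This is a minimal sketch; the variable names fields and separators are illustrative.

```python
import re

result = re.split(r"([,;])", "apple, banana; orange")

# Even indexes hold the fields, odd indexes hold the captured separators.
fields = [field.strip() for field in result[0::2]]
separators = result[1::2]
print(fields)      # ['apple', 'banana', 'orange']
print(separators)  # [',', ';']
```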

IV. Advanced substitution and splitting

1. Using regex flags

import re

# Case-insensitive replacement
text = "Hello world! hello Python! HELLO everyone!"
pattern = r"hello"
new_text = re.sub(pattern, "Hi", text, flags=re.IGNORECASE)
print(f"Original: {text}")
print(f"Case-insensitive: {new_text}")  # "Hi world! Hi Python! Hi everyone!"

# Multiline mode: ^ matches at the start of every line
text = "line1: apple\nline2: banana\nline3: orange"
pattern = r"^line\d+"
new_text = re.sub(pattern, "item", text, flags=re.MULTILINE)
print(f"Original: {text}")
print(f"Multiline: {new_text}")  # "item: apple\nitem: banana\nitem: orange"

# DOTALL mode: the dot also matches newlines
text = "a\nb\nc"
pattern = r"a.*c"
new_text = re.sub(pattern, "xyz", text, flags=re.DOTALL)
print(f"Original: {text}")
print(f"DOTALL: {new_text}")  # "xyz"
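Flags can also be combined. The sketch below shows two equivalent ways of doing so, with the bitwise OR operator and with inline flags at the start of the pattern; the sample text is made up.

```python
import re

# Hypothetical sample text for illustration.
text = "Error: disk full\nerror: retry\nINFO: ok"

# Combine flags with the bitwise OR operator.
masked = re.sub(r"^error", "E", text, flags=re.IGNORECASE | re.MULTILINE)
print(masked)

# The same flags written inline at the start of the pattern.
masked_inline = re.sub(r"(?im)^error", "E", text)
print(masked == masked_inline)  # True
```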

2. Multi-step substitution

import re

# Replace digits first, then letters
text = "a1b2c3d4"

# Approach 1: chained calls to sub
step1 = re.sub(r"\d", "*", text)
step2 = re.sub(r"[a-z]", "#", step1)
print(f"Two-step: {step2}")  # "#*#*#*#*"

# Approach 2: one pass with a replacement function that branches on the match
text = "a1b2c3d4"
def complex_replace(match):
    char = match.group()
    if char.isdigit():
        return str(int(char) * 2)
    elif char.islower():
        return char.upper()
    return char

result = re.sub(r"[a-z0-9]", complex_replace, text)
print(f"Function-based: {result}")  # "A2B4C6D8"
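Chained calls are order-dependent, because each call also sees the output of the previous one. When that is a problem, a single pass with a lookup table avoids it. The sketch below uses a hypothetical swap mapping to show the difference.

```python
import re

text = "cat chases dog"
swap = {"cat": "dog", "dog": "cat"}  # hypothetical mapping for illustration

# Chained calls: the second call also rewrites the output of the first.
chained = re.sub(r"dog", "cat", re.sub(r"cat", "dog", text))
print(chained)  # cat chases cat

# Single pass: every match is looked up exactly once in the mapping.
pattern = re.compile("|".join(map(re.escape, swap)))
single_pass = pattern.sub(lambda m: swap[m.group()], text)
print(single_pass)  # dog chases cat
```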

3. Combining splitting and substitution

import re

# Split first, then transform each part
text = "apple,banana,orange"
pattern = r","  # split on commas
parts = re.split(pattern, text)
# Process each part
processed = [part.capitalize() + "s" for part in parts]
result = ", ".join(processed)
print(f"Result: {result}")  # "Apples, Bananas, Oranges"

# The same transformation done directly with sub
text = "apple,banana,orange"
pattern = r"\b(\w+)\b"
result = re.sub(pattern, lambda m: m.group().capitalize() + "s", text)
print(f"Direct substitution: {result}")  # "Apples,Bananas,Oranges"

V. Practical examples

1. Data cleaning

import re

# Strip HTML tags
html = "<div class='container'><h1>Title</h1><p>Content</p></div>"
def clean_html(text):
    # Replace each tag with a space so adjacent text nodes stay separated
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace
    text = re.sub(r"\s+", " ", text)
    # Trim leading and trailing whitespace
    return text.strip()

cleaned = clean_html(html)
print(f"Before: {html}")
print(f"After: {cleaned}")  # "Title Content"

# Normalize phone numbers
phone_numbers = ["138-1234-5678", "139 8765 4321", "158.1234.5678", "(021)12345678"]
def standardize_phone(phone):
    # Remove every non-digit character
    phone = re.sub(r"\D", "", phone)
    # Mobile numbers: 11 digits starting with 1
    if len(phone) == 11 and phone.startswith("1"):
        return f"{phone[:3]}-{phone[3:7]}-{phone[7:]}"
    # Landline numbers: assumes a 3-digit area code starting with 0 plus 8 digits
    elif len(phone) == 11 and phone.startswith("0"):
        return f"({phone[:3]}){phone[3:]}"
    return phone

for phone in phone_numbers:
    print(f"Raw: {phone} -> normalized: {standardize_phone(phone)}")
    # 138-1234-5678 -> 138-1234-5678
    # 139 8765 4321 -> 139-8765-4321
    # 158.1234.5678 -> 158-1234-5678
    # (021)12345678 -> (021)12345678

2. Text formatting

import re

# Reformat code comments
code = """def add(a, b):
    # This function adds two numbers
    return a + b

def multiply(a, b):
    # This function multiplies two numbers
    return a * b"""

def format_comments(code):
    # Turn the comment line right after a def into a one-line docstring
    def comment_replacer(match):
        func_name = match.group("func")
        params = match.group("params")
        comment = match.group("comment").strip()
        # End with a newline so the function body stays on its own line
        return f'def {func_name}({params}):\n    """{comment}"""\n'

    # No re.DOTALL here: the parameter list and the comment must each stay on one line
    pattern = r"def\s+(?P<func>\w+)\((?P<params>[^)]*)\):\s*\n\s*#\s*(?P<comment>.*?)\s*\n"
    return re.sub(pattern, comment_replacer, code)

formatted = format_comments(code)
print(f"Before: {code}")
print(f"After: {formatted}")

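3. Text extraction and conversion

import re

# Extract domain names from URLs
text = "Visit https://www.example.com or http://test.org for details"
def extract_domains(text):
    # Python's re only supports fixed-width lookbehind, so (?<=https?://) raises
    # an error. Match the scheme directly and capture the domain in a group;
    # findall then returns only the captured group.
    pattern = r"https?://(?:www\.)?([a-zA-Z0-9.-]+)"
    domains = re.findall(pattern, text)
    return domains

domains = extract_domains(text)
print(f"Domains: {domains}")  # ['example.com', 'test.org']

# Convert CamelCase to snake_case
def camel_to_snake(text):
    # Split an acronym from the following capitalized word: "HTTPResponse" -> "HTTP_Response"
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", text)
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)
    # Lowercase everything
    return text.lower()

print(f"camelCase -> {camel_to_snake('camelCase')}")  # "camel_case"
print(f"CamelCaseExample -> {camel_to_snake('CamelCaseExample')}")  # "camel_case_example"
print(f"HTTPResponse -> {camel_to_snake('HTTPResponse')}")  # "http_response"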

4. Log processing

import re

# Reformat log entries
logs = [
    "2023-10-05 14:30:45 INFO User login successful",
    "2023-10-05 14:31:23 ERROR Database connection failed",
    "2023-10-05 14:32:01 WARNING Memory usage high"
]

def simplify_log(log):
    # Wrap the timestamp and the level in brackets; keep the message unchanged
    pattern = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)"
    return re.sub(pattern, r"[\1] [\2] \3", log)

for log in logs:
    print(f"Before: {log}")
    print(f"After: {simplify_log(log)}")
    print()
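When the goal is to pull the fields of a log line apart rather than rewrite it, re.split() with maxsplit is a possible alternative. The sketch below assumes the same "date time level message" layout as the sample logs; the variable names are illustrative.

```python
import re

log = "2023-10-05 14:30:45 INFO User login successful"

# Split on whitespace at most three times so the message stays in one piece.
date, time, level, message = re.split(r"\s+", log, maxsplit=3)
print(level)    # INFO
print(message)  # User login successful
```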

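VI. Best practices and performance

1. Compile regular expressions

For a pattern that is used many times, compiling it once with re.compile() avoids repeated pattern lookups and keeps the pattern definition in one place. The re module already caches recently compiled patterns, so the speedup is usually modest:

import re

# Compile the pattern once
pattern = re.compile(r"\d+")

# Reuse it many times
text1 = "There are 123 apples"
text2 = "The price is 456 yuan"
text3 = "The quantity is 789"

print(f"Text 1: {pattern.sub('*', text1)}")  # "There are * apples"
print(f"Text 2: {pattern.sub('*', text2)}")  # "The price is * yuan"
print(f"Text 3: {pattern.sub('*', text3)}")  # "The quantity is *"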
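A compiled pattern object offers sub(), subn(), and split() as methods, with the same count and maxsplit parameters as the module-level functions. A brief sketch with a made-up input string:

```python
import re

digits = re.compile(r"\d+")
text = "a1b22c333"  # hypothetical input

replaced = digits.sub("#", text)
replaced_with_count = digits.subn("#", text)
pieces = digits.split(text)
first_only = digits.sub("#", text, count=1)

print(replaced)             # a#b#c#
print(replaced_with_count)  # ('a#b#c#', 3)
print(pieces)               # ['a', 'b', 'c', '']
print(first_only)           # a#b22c333
```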

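2. Prefer string methods for simple cases

For simple literal replacement and splitting, string methods are simpler and usually faster than regular expressions:

import re

# Preferred: string method
text = "Hello World"
new_text = text.replace("Hello", "Hi")

# Unnecessary: regular expression for a literal replacement
new_text = re.sub(r"Hello", "Hi", text)

# Preferred: string method for splitting
text = "apple,banana,orange"
parts = text.split(",")

# Unnecessary: regular expression for a literal separator
parts = re.split(r",", text)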

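3. Avoid overly complex patterns

Complex regular expressions hurt both performance and readability:

# A pattern with several groups
pattern = r"([a-z]+)(\d+)([a-z]+)"

# Consider combining several simpler patterns or string methods instead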
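If a longer pattern cannot be avoided, one way to keep it readable is the re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The sketch below rewrites the three-group pattern above in that style; the test string is made up.

```python
import re

# The same three-group pattern, written with re.VERBOSE so each part can be commented.
pattern = re.compile(
    r"""
    ([a-z]+)   # leading letters
    (\d+)      # digits in the middle
    ([a-z]+)   # trailing letters
    """,
    re.VERBOSE,
)

match = pattern.fullmatch("abc123def")
groups = match.groups()
print(groups)  # ('abc', '123', 'def')
```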

4. Use raw strings

Define regex patterns as raw strings so that backslashes do not have to be doubled:

# Recommended
pattern = r"\d+"

# Not recommended
pattern = "\\d+"
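The difference matters most for escapes that Python itself interprets. In a normal string, "\b" is the backspace character, so a word-boundary pattern silently matches nothing. A small sketch with made-up text:

```python
import re

text = "a cat sat"

# In a normal string, "\b" is the backspace character, not a word boundary.
without_raw = re.findall("\bcat\b", text)
print(without_raw)  # []

# In a raw string, the backslash reaches the regex engine unchanged.
with_raw = re.findall(r"\bcat\b", text)
print(with_raw)  # ['cat']
```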

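VII. Summary

In this episode we covered the substitution and splitting functions of the re module:

re.sub(): replaces matches in a string; the replacement can be a string or a function
re.subn(): like sub(), but returns both the new string and the number of replacements
re.split(): splits a string on a pattern; a capturing group keeps the separators in the result
Advanced usage: regex flags, multi-step substitution, and combining splitting with substitution
Practical applications: data cleaning, text formatting, information extraction, and log processing

Regex-based substitution and splitting are powerful tools for text processing and handle many complex text transformations efficiently. In practice, choose the method that fits the task, and keep both performance and readability in mind.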

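VIII. Exercises

Write a function that replaces every email address in a text with [email] and reports the number of replacements
Implement a simple HTML tag remover that can keep specified tags (such as <p> and <br>)
Write a function that doubles every digit in a string (for example: "a1b2" -> "a2b4")
Implement a function that converts camelCase to kebab-case (for example: "camelCase" -> "camel-case")
Use re.split() to split the string "apple, banana; orange" while keeping the separators; the separators are commas, semicolons, and spaces
Write a function that converts log timestamps from "YYYY-MM-DD HH:MM:SS" to "YYYY/MM/DD HH:MM:SS"

In the next episode (Episode 100), we will look at combined use of the standard library.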