Chapter 5: Basic Search Operations

5.1 Getting Started with Query Syntax

5.1.1 What Is the Query Syntax?

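Whoosh's query syntax is a small language for expressing search requests. It supports:

Simple keyword search
Boolean operators (AND, OR, NOT)
Wildcard matching, plus fuzzy matching when the optional FuzzyTermPlugin is added to the parser
Range queries
Phrase matching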

5.1.2 Basic Query Examples

1. Single-term query

python              # documents containing "python"
全文检索            # documents containing "全文检索" (full-text search)

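2. Multi-term query (AND by default)

python java         # contains both "python" and "java"
编程 开发           # contains both "编程" (programming) and "开发" (development)

QueryParser joins space-separated terms with AND unless it is created with group=OrGroup (imported from whoosh.qparser), in which case they are joined with OR.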

3. AND query

python AND 搜索     # contains both "python" and "搜索" (search)
教程 AND 入门       # contains both "教程" (tutorial) and "入门" (introduction)

4. OR query

python OR java      # contains "python" or "java"
数据 OR 分析        # contains "数据" (data) or "分析" (analysis)

5. NOT query

python NOT java     # contains "python" but not "java"
教程 NOT 入门       # contains "教程" but not "入门"

6. Grouped query

( python OR java ) AND 教程     # (python or java) and 教程
( 数据 OR 分析 ) AND ( python OR whoosh )

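5.1.3 Wildcard Queries

* wildcard (matches zero or more characters)

py*               # terms starting with "py" (e.g. python, pycharm)
*搜索             # terms ending with "搜索"
*检索*            # terms containing "检索" (retrieval)

? wildcard (matches exactly one character)

pyth?n            # matches python (? stands for one character)
教?               # "教" followed by one character (e.g. 教程)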
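
Wildcards are matched against individual indexed terms, not against whole documents. The following standard-library sketch imitates that behaviour with fnmatch over a hypothetical term list; it illustrates the pattern semantics only and is not how Whoosh implements them.

```python
from fnmatch import fnmatchcase

# Hypothetical terms, standing in for the tokens stored in an index.
terms = ["python", "pycharm", "pathon", "java", "教程", "教学", "全文检索"]

def expand(pattern):
    """Return the terms that a glob-style pattern matches."""
    return [t for t in terms if fnmatchcase(t, pattern)]

print(expand("py*"))     # terms starting with "py"
print(expand("pyth?n"))  # exactly one character between "pyth" and "n"
print(expand("教?"))     # "教" plus exactly one character
print(expand("*检索*"))  # terms containing "检索"
```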

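5.1.4 Escaping Special Characters

If the search text contains words or characters that the parser treats specially, quote it:

"python AND"       # phrase query; AND inside double quotes is not treated as an operator
"C#"               # phrase query for "C#"; whether "#" is kept depends on the field's analyzer

Whoosh's query language also accepts single quotes, which turn text containing special characters into one literal term, for example path:'MacHD:My Documents'.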

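5.2 Using the Query Parser (QueryParser)

5.2.1 QueryParser Basics

QueryParser converts a query string into a query object:

from whoosh.qparser import QueryParser

# Create a parser; ix is an index that has already been opened (e.g. with open_dir)
parser = QueryParser("content", ix.schema)  # first argument: default field to search
query = parser.parse("python 搜索")          # parse the query string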

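5.2.2 Specifying the Search Field

A query can name the field to search explicitly:

title:python       # search for "python" in the title field
content:搜索       # search for "搜索" in the content field
author:张三        # search for "张三" in the author field

Code example:

parser = QueryParser("content", ix.schema)
query = parser.parse("title:python")  # search for python in the title, overriding the default field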

5.2.3 Multi-Field Queries

MultifieldParser searches several fields at once:

from whoosh.qparser import MultifieldParser

parser = MultifieldParser(["title", "content"], ix.schema)
query = parser.parse("python 搜索")  # each term may match in either title or content

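5.2.4 Working with Results

Run the query and read the results:

with ix.searcher() as searcher:
    # Run the query
    results = searcher.search(query)

    # Iterate over the hits
    for hit in results:
        print(hit['title'])          # value of a stored field
        print(hit.score)             # relevance score
        print(hit.rank)              # position in the result list, starting at 0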

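5.2.5 Limiting the Number of Results

# search() returns at most 10 hits by default; pass limit to change that
results = searcher.search(query, limit=10)

# Return only the top 5 hits
results = searcher.search(query, limit=5)

# limit=None returns every matching document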

5.3 Common Query Types

5.3.1 Term Queries

Term is the most basic query type. It matches a single term exactly:

from whoosh.query import Term

# Build Term queries
query = Term("title", "python")
query = Term("content", "搜索")

Code example:

from whoosh.query import Term
from whoosh.index import open_dir

ix = open_dir("my_index")
with ix.searcher() as searcher:
    # Find documents whose title contains the term "python"
    query = Term("title", "python")
    results = searcher.search(query)

    for hit in results:
        print(f"Title: {hit['title']}, score: {hit.score}")

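Notes:

A Term query is not analyzed: its text is compared as-is with the terms stored in the index.
English text is therefore case-sensitive. If the field's analyzer lowercases tokens at index time (the default TEXT analyzer does), write the term in lowercase.
Chinese has no letter case, so it is unaffected, but the term must still be identical to a token the analyzer produced.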

5.3.2 Wildcard Queries

Wildcard queries match terms against a pattern containing * and ?:

from whoosh.query import Wildcard

# Basic wildcard queries
query = Wildcard("title", "py*")           # terms starting with "py"
query = Wildcard("content", "*搜索*")      # terms containing "搜索"
query = Wildcard("title", "pyth?n")        # "pyth", any one character, then "n" (python, pythxn)

Code example:

from whoosh.query import Wildcard
from whoosh.index import open_dir

ix = open_dir("my_index")
with ix.searcher() as searcher:
    # Titles containing a term that starts with "python"
    # (lowercase, assuming the analyzer lowercased the indexed terms)
    query = Wildcard("title", "python*")
    results = searcher.search(query)

    for hit in results:
        print(f"Title: {hit['title']}")

    # Content containing a term that includes "教程"
    query = Wildcard("content", "*教程*")
    results = searcher.search(query)

    for hit in results:
        print(f"Matched content: {hit.highlights('content')}")

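Performance tips:

Wildcard queries can be slower than Term queries, because the pattern has to be expanded against the terms of the field.
Avoid patterns with a leading wildcard such as *xxx*; without a fixed prefix, every term in the field has to be checked.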

5.3.3 Prefix Queries

Prefix queries match terms that start with the given string:

from whoosh.query import Prefix

# Prefix queries
query = Prefix("title", "python")      # title contains a term starting with "python"
query = Prefix("category", "数据")     # category contains a term starting with "数据"

Code example:

from whoosh.query import Prefix
from whoosh.index import open_dir

ix = open_dir("my_index")
with ix.searcher() as searcher:
    # Documents whose category starts with "Python"
    # (assumes category is indexed without lowercasing, e.g. as an ID field)
    query = Prefix("category", "Python")
    results = searcher.search(query)

    for hit in results:
        print(f"Category: {hit['category']}, title: {hit['title']}")

    # Search tags (a multi-valued field)
    query = Prefix("tags", "数据")
    results = searcher.search(query)

    for hit in results:
        print(f"Tags: {hit['tags']}, title: {hit['title']}")

Typical uses:

Search suggestions (autocomplete)
Category filtering
Tag navigation
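
Prefix lookups suit suggestions because, in a sorted term list, all terms sharing a prefix sit next to each other. The standard-library sketch below shows that idea on a hypothetical term list; the suggest helper is made up for illustration and is not part of Whoosh.

```python
from bisect import bisect_left

def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from a sorted list that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)  # first candidate position
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches

# Hypothetical vocabulary
terms = sorted(["pandas", "pycharm", "pypi", "python", "pytest", "java"])
print(suggest(terms, "py"))
print(suggest(terms, "py", limit=2))
```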

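5.3.4 Fuzzy Queries

FuzzyTerm queries tolerate a limited number of spelling errors:

from whoosh.query import FuzzyTerm

# Fuzzy queries
query = FuzzyTerm("title", "pyton", maxdist=1)   # also matches "python"
query = FuzzyTerm("content", "教程", maxdist=2)  # up to 2 edits; very permissive for a two-character term

Parameters:

maxdist: maximum edit distance (default 1)
prefixlength: number of leading characters that must match exactly (default 1); larger values improve performance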

Code example:

from whoosh.query import FuzzyTerm
from whoosh.index import open_dir

ix = open_dir("my_index")
with ix.searcher() as searcher:
    # Look for "python" while allowing one wrong character
    # (lowercase, assuming the analyzer lowercased the indexed terms)
    query = FuzzyTerm("title", "pythn", maxdist=1)
    results = searcher.search(query)

    print("Results for 'pythn':")
    for hit in results:
        print(f"  Match: {hit['title']}")

    # Chinese with one wrong character ("全问检索" instead of "全文检索");
    # this only matches if "全文检索" was indexed as a single term
    query = FuzzyTerm("title", "全问检索", maxdist=1)
    results = searcher.search(query)

    print("\nResults for '全问检索':")
    for hit in results:
        print(f"  Match: {hit['title']}")

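How edit distance is counted (each operation adds 1):

Insertion: python → pythona (+1)
Deletion: python → pyton (+1)
Substitution: python → pythin (+1)

Notes:

Fuzzy queries are comparatively slow; combine them with other conditions where possible.
For common misspellings, a synonym mapping can be maintained instead.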
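
The three operations above define the classic Levenshtein distance. As a worked illustration, here is a minimal dynamic-programming sketch of it; this is the textbook algorithm, and Whoosh's internal matching may differ in detail.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(edit_distance("python", "pythona"))    # one insertion
print(edit_distance("python", "pyton"))      # one deletion
print(edit_distance("python", "pythin"))     # one substitution
print(edit_distance("全文检索", "全问检索"))  # one substituted character
```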

5.3.5 Phrase Queries

Phrase queries require the terms to appear in the given order and, by default, directly next to each other. The field must be indexed with positions, which TEXT fields are by default:

from whoosh.query import Phrase

# Phrase queries
query = Phrase("content", ["python", "入门", "教程"])  # "python 入门 教程"
query = Phrase("title", ["数据", "分析", "实战"])      # "数据 分析 实战"

Code example:

from whoosh.query import Phrase
from whoosh.index import open_dir

ix = open_dir("my_index")
with ix.searcher() as searcher:
    # Search for the exact phrase "python 入门"
    # (lowercase, assuming the analyzer lowercased the indexed terms)
    query = Phrase("content", ["python", "入门"])
    results = searcher.search(query)

    print("Results for the phrase 'python 入门':")
    for hit in results:
        print(f"  Match: {hit['title']}")
        print(f"  Highlight: {hit.highlights('content')}")

    # Search for consecutive words
    query = Phrase("content", ["全文", "检索"])
    results = searcher.search(query)

    print("\nResults for the phrase '全文 检索':")
    for hit in results:
        print(f"  Match: {hit['title']}")

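The slop parameter (allowing gaps between words)

# slop is the maximum distance in positions between consecutive phrase terms;
# the default of 1 means the terms must be adjacent
query = Phrase("content", ["python", "教程"], slop=2)
# Matches "python 入门 教程" (one word between "python" and "教程"),
# but not "python 快速 入门 教程" (two words in between, which needs slop=3)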
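
The effect of slop is easiest to see on token positions. The sketch below is a simplified pure-Python model, assuming slop means the maximum position distance between consecutive phrase terms, with 1 meaning adjacent. The phrase_matches helper is hypothetical and only mirrors that reading; it is not Whoosh code.

```python
def phrase_matches(tokens, words, slop=1):
    """True if `words` occur in order in `tokens`, with each consecutive
    pair at most `slop` positions apart (slop=1 means directly adjacent)."""
    def search(start, remaining):
        if not remaining:
            return True
        last = min(start + slop, len(tokens) - 1)
        for pos in range(start + 1, last + 1):
            if tokens[pos] == remaining[0] and search(pos, remaining[1:]):
                return True
        return False

    return any(tokens[i] == words[0] and search(i, words[1:])
               for i in range(len(tokens)))

short_doc = ["python", "入门", "教程"]         # one word in between
long_doc = ["python", "快速", "入门", "教程"]  # two words in between

print(phrase_matches(short_doc, ["python", "教程"], slop=2))
print(phrase_matches(long_doc, ["python", "教程"], slop=2))
print(phrase_matches(long_doc, ["python", "教程"], slop=3))
```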

5.4 Working with Query Results

5.4.1 Accessing Results

with ix.searcher() as searcher:
    query = parser.parse("python")
    results = searcher.search(query)

    for hit in results:
        # Read stored fields
        print(hit['title'])
        print(hit['content'])

        # Read score and rank (rank starts at 0)
        print(f"Score: {hit.score}")
        print(f"Rank: {hit.rank}")

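5.4.2 Highlighting Results

from whoosh.highlight import HtmlFormatter

# Run this inside the "with ix.searcher()" block, while results is still open.
# Matched terms are wrapped in <strong> tags with the CSS class "highlight".
results.formatter = HtmlFormatter(tagname="strong", classname="highlight")

for hit in results:
    # Highlighted excerpt of the stored content field
    highlighted = hit.highlights("content")
    print(f"Title: {hit['title']}")
    print(f"Highlighted content: {highlighted}")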

5.4.3 Paginated Queries

# Paginated query
with ix.searcher() as searcher:
    query = parser.parse("python")

    # First page (10 hits per page); page numbers start at 1
    page_num = 1
    page_size = 10
    results = searcher.search_page(query, page_num, pagelen=page_size)

    for hit in results:
        print(f"{hit.rank + 1}. {hit['title']}")

    print(f"Page {page_num} of {results.pagecount}")
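
The arithmetic behind pagination is simple enough to check by hand. The sketch below uses a hypothetical page_bounds helper (not a Whoosh function) to compute the page count and the range of 0-based ranks that fall on a given page.

```python
import math

def page_bounds(total_hits, page_num, page_size=10):
    """Return (page_count, first_rank, end_rank) for a 1-based page number.

    first_rank is inclusive and end_rank exclusive, both 0-based.
    """
    page_count = math.ceil(total_hits / page_size)
    first = (page_num - 1) * page_size
    end = min(first + page_size, total_hits)
    return page_count, first, end

print(page_bounds(25, 1))  # first page of 25 hits
print(page_bounds(25, 3))  # last, partially filled page
```

Because ranks are 0-based, the listing above prints hit.rank + 1 to show human-friendly numbering.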

5.5 Putting It Together

5.5.1 A Complete Search Example

from whoosh.index import open_dir
from whoosh.qparser import MultifieldParser
from whoosh.query import Term, Phrase

ix = open_dir("my_index")

with ix.searcher() as searcher:
    # Example 1: Term query
    # (lowercase, assuming the analyzer lowercased the indexed terms)
    print("=== Term query ===")
    query = Term("title", "python")
    results = searcher.search(query)
    for hit in results:
        print(f"  {hit['title']}")

    # Example 2: multi-field query
    print("\n=== Multi-field query ===")
    parser = MultifieldParser(["title", "content"], ix.schema)
    query = parser.parse("python 搜索")
    results = searcher.search(query)
    for hit in results:
        print(f"  {hit['title']}")

    # Example 3: phrase query
    print("\n=== Phrase query ===")
    query = Phrase("content", ["全文", "检索"])
    results = searcher.search(query)
    for hit in results:
        print(f"  {hit['title']}")

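5.5.2 Exercises

Exercise 1: Implement search suggestions

Use Prefix queries to implement autocomplete
Limit the number of results returned

Exercise 2: Implement spelling correction

Use FuzzyTerm queries to handle mistakes in user input
Offer a "did you mean" hint

Exercise 3: Implement advanced search

Support multi-field queries
Support boolean operators
Show highlighted results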

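Chapter Summary

In this chapter we covered:

Simple query syntax: basic use of keywords, boolean operators and wildcards
Using QueryParser: creating a parser, specifying the search field, multi-field queries
Common query types:
- Term: exact term query
- Wildcard: wildcard query
- Prefix: prefix query
- FuzzyTerm: fuzzy query
- Phrase: phrase query
Result handling: accessing results, highlighting, pagination

After working through this chapter, you should be able to run all of these kinds of searches against a Whoosh index.

In the next chapter we will look at more advanced query techniques, including boolean combinations, range queries and sorting.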