Getting Started with Chroma

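Project Overview

Chroma is an open-source, lightweight vector database for storing and retrieving vector embeddings. It supports semantic search and recommendation workloads, exposes a small API that is easy to integrate into applications, and is particularly well suited to prototypes and small applications.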

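Main Features

  • Vector storage: stores vector embeddings alongside their documents and metadata
  • Semantic search: retrieves records by vector similarity
  • Recommendations: suggests related items by vector similarity
  • Lightweight: easy to deploy and use
  • Multiple languages: client libraries for Python, JavaScript, and other languages
  • In-memory and persistent storage: supports both modes
  • Machine learning integrations: works with frameworks such as Hugging Face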

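Project Characteristics

  • Open source and free: fully open source, free to use and customize
  • Lightweight: easy to install and deploy
  • Easy to use: a small, consistent API
  • Performance: optimized for vector search
  • Flexibility: supports several distance metrics (squared L2, inner product, cosine)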

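Installation and Configuration

Installation Steps

  1. Install Chroma
# Install the Chroma package
pip install chromadb

# Optional dependency, used to generate embeddings with Sentence Transformers models
pip install sentence-transformers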

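Basic Configuration

import chromadb

# Create a Chroma client
# In-memory storage: data is lost when the process exits
client = chromadb.Client()

# Or use persistent storage: data is written to the given directory
client = chromadb.PersistentClient(path="./chroma_db")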

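Core Concepts

1. Collection

A collection is the basic data structure in Chroma. It groups related vectors with their documents and metadata.

2. Document

A document is the text you store. Chroma converts it into a vector embedding.

3. Metadata

Metadata is additional information attached to a document, such as tags or categories.

4. Vector

A vector is the numeric representation of a document and is what similarity is computed on.

5. ID (Identifier)

An ID uniquely identifies a document within a collection.

6. Embedding Function

An embedding function converts documents into vector embeddings.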
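To make the link between vectors and similarity concrete, here is a minimal sketch in plain Python, with no Chroma calls. The three-dimensional vectors and their IDs are made up for illustration, and the brute-force ranking stands in for the indexed search a vector database performs.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings keyed by ID (real embeddings have hundreds of dimensions)
vectors = {
    "phone": [0.9, 0.1, 0.0],
    "laptop": [0.7, 0.3, 0.1],
    "headphones": [0.1, 0.2, 0.9],
}

def nearest(query, k=2):
    """Rank stored IDs by cosine similarity to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# A query vector pointing mostly along the third axis ranks "headphones" first
top = nearest([0.0, 0.1, 1.0])
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

The ranking depends only on the direction of the vectors, not their length, which is why cosine similarity is a common choice for text embeddings.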

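Basic Usage

Creating a Collection

import chromadb

# Create a client
client = chromadb.Client()

# Create a collection
collection = client.create_collection(name="products")

# Or use persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
# get_or_create_collection avoids an error when the collection already exists on disk
collection = client.get_or_create_collection(name="products")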

Adding Documents

# Add documents; embeddings are generated by the collection's embedding function
collection.add(
    documents=[
        "A high-end smartphone with advanced features",
        "A powerful laptop for gaming and productivity",
        "Noise-canceling headphones with excellent sound quality"
    ],
    metadatas=[
        {"name": "Smartphone", "price": 999.99},
        {"name": "Laptop", "price": 1499.99},
        {"name": "Headphones", "price": 299.99}
    ],
    ids=["1", "2", "3"]
)

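Vector Similarity Search

# Semantic search
results = collection.query(
    query_texts=["electronics with good sound quality"],
    n_results=2
)

print(results)

# Search with a custom vector
import numpy as np

# Generate a custom vector. Its dimension must match the embeddings stored in the
# collection; the default embedding model (all-MiniLM-L6-v2) produces 384 dimensions.
query_vector = np.random.rand(384).tolist()

results = collection.query(
    query_embeddings=[query_vector],
    n_results=2
)

print(results)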
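The printed result is a dictionary of nested lists: one inner list per query, holding the IDs, documents, metadata, and distances of that query's matches. The sketch below flattens that shape into rows that are easier to work with. The function name `flatten_results` is hypothetical, and `sample` is a hand-written stand-in with made-up distances so the example runs without a database.

```python
def flatten_results(results):
    """Turn a query-result dict into one list of row dicts per query."""
    rows_per_query = []
    for q, ids in enumerate(results["ids"]):
        rows = []
        for i, doc_id in enumerate(ids):
            rows.append({
                "id": doc_id,
                "document": results["documents"][q][i],
                "metadata": results["metadatas"][q][i],
                "distance": results["distances"][q][i],
            })
        rows_per_query.append(rows)
    return rows_per_query

# Stand-in for the value returned by collection.query (values are illustrative only)
sample = {
    "ids": [["3", "1"]],
    "documents": [[
        "Noise-canceling headphones with excellent sound quality",
        "A high-end smartphone with advanced features",
    ]],
    "metadatas": [[
        {"name": "Headphones", "price": 299.99},
        {"name": "Smartphone", "price": 999.99},
    ]],
    "distances": [[0.42, 0.87]],
}

rows = flatten_results(sample)
for row in rows[0]:
    print(row["id"], row["metadata"]["name"], row["distance"])
```

Iterating over the IDs that were actually returned, rather than over `n_results`, keeps the loop correct when a collection holds fewer records than requested.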

Updating and Deleting Documents

# Update a document; its embedding is recomputed from the new text
collection.update(
    ids=["1"],
    documents=["An updated smartphone with even better features"],
    metadatas=[{"name": "Smartphone", "price": 1099.99}]
)

# Delete a document
collection.delete(ids=["1"])

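Advanced Features

1. Using a Custom Embedding Function

from chromadb.utils import embedding_functions

# Use a Sentence Transformers model as the embedding function
# (requires the sentence-transformers package installed earlier)
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection that uses the custom embedding function
collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=sentence_transformer_ef
)

# Add documents (embeddings are generated automatically)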
collection.add(
    documents=["Document 1", "Document 2", "Document 3"],
    ids=["1", "2", "3"]
)
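Conceptually, an embedding function is a callable that takes a list of texts and returns one vector per text. The toy class below illustrates that contract in plain Python by hashing words into a fixed number of buckets. The class name `HashingEmbeddingFunction` is hypothetical, the vectors carry no semantic meaning, and the assumption that Chroma passes documents through a parameter named `input` (and what base class it expects) should be checked against the installed version before using anything like this with a real collection.

```python
import hashlib
import math

class HashingEmbeddingFunction:
    """Toy embedder: hashes each word into one of `dim` buckets, then L2-normalizes."""

    def __init__(self, dim=16):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # hashlib gives a hash that is stable across runs, unlike built-in hash()
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Leave an all-zero vector untouched to avoid dividing by zero
        return [v / norm for v in vec] if norm else vec

    def __call__(self, input):
        """Map a list of texts to a list of equal-length vectors."""
        return [self._embed(text) for text in input]

ef = HashingEmbeddingFunction(dim=8)
embeddings = ef(["Document 1", "Document 2", "Document 1"])
print(len(embeddings), len(embeddings[0]))
```

The same text always maps to the same vector, and every vector has the same length; those are the two properties a collection relies on regardless of which model produces the embeddings.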

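2. Filtered Search

# Filter search results by metadata
# (assumes `collection` holds records whose metadata includes "price",
# such as the "products" collection from Basic Usage)
results = collection.query(
    query_texts=["electronics"],
    n_results=2,
    where={"price": {"$lt": 500}}  # products priced below 500
)

print(results)

# A more complex filter: price between 300 and 1000 (exclusive)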
results = collection.query(
    query_texts=["electronics"],
    n_results=2,
    where={
        "$and": [
            {"price": {"$gt": 300}},
            {"price": {"$lt": 1000}}
        ]
    }
)

print(results)
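To show what such filters select, the sketch below evaluates a small subset of the operator syntax against plain metadata dictionaries. It is an illustration of the semantics only, not Chroma's implementation: the function name `matches` is hypothetical, and treating a missing key as a non-match is a simplifying assumption.

```python
import operator

# Comparison operators covered by this sketch
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        else:
            if key not in metadata:
                return False
            # A bare value is treated as shorthand for equality
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            for op, operand in condition.items():
                if not COMPARATORS[op](metadata[key], operand):
                    return False
    return True

products = [
    {"name": "Smartphone", "price": 999.99},
    {"name": "Laptop", "price": 1499.99},
    {"name": "Headphones", "price": 299.99},
]

cheap = [p["name"] for p in products if matches(p, {"price": {"$lt": 500}})]
mid_range = [
    p["name"] for p in products
    if matches(p, {"$and": [{"price": {"$gt": 300}}, {"price": {"$lt": 1000}}]})
]
print(cheap, mid_range)
```

In a real query the filter narrows the candidate set by metadata, and the similarity ranking then applies to the records that pass it.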

3. Batch Operations

# Add documents in bulk
documents = [f"Document {i}" for i in range(100)]
metadatas = [{"id": i} for i in range(100)]
ids = [str(i) for i in range(100)]

collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

# Query in bulk: each query text gets its own list of results
query_texts = ["Document 1", "Document 2"]
results = collection.query(
    query_texts=query_texts,
    n_results=5
)

print(results)
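For much larger datasets, inserts are usually split into several smaller calls, which bounds memory use and keeps each call within whatever batch-size limit applies. The sketch below shows one way to slice the parallel lists; `iter_batches` is a hypothetical helper, the batch size of 40 is arbitrary, and the call to the collection is replaced by a stand-in so the example runs on its own.

```python
def iter_batches(ids, documents, metadatas, size):
    """Yield aligned slices of the three parallel lists, `size` records at a time."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        end = start + size
        yield ids[start:end], documents[start:end], metadatas[start:end]

documents = [f"Document {i}" for i in range(100)]
metadatas = [{"id": i} for i in range(100)]
ids = [str(i) for i in range(100)]

batch_sizes = []
for batch_ids, batch_docs, batch_metas in iter_batches(ids, documents, metadatas, size=40):
    # Real code would call:
    # collection.add(ids=batch_ids, documents=batch_docs, metadatas=batch_metas)
    batch_sizes.append(len(batch_ids))

print(batch_sizes)
```

Slicing all three lists with the same bounds keeps each ID paired with its own document and metadata.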

Practical Use Cases

Case 1: Product Recommendation System

Scenario: recommend related products based on user preferences.

Implementation Steps

  1. Create a product collection
  2. Add product documents and metadata
  3. Recommend products based on a user query
  4. Deploy the recommendation system

Example

import chromadb

# Create a client
client = chromadb.PersistentClient(path="./product_recommendation")

# Create the product collection. Cosine distance is used so that
# "1 - distance" below can be read as a cosine similarity.
collection = client.get_or_create_collection(
    name="products",
    metadata={"hnsw:space": "cosine"}
)

# Product data
products = [
    {"name": "Smartphone", "description": "A high-end smartphone with advanced features", "price": 999.99},
    {"name": "Laptop", "description": "A powerful laptop for gaming and productivity", "price": 1499.99},
    {"name": "Headphones", "description": "Noise-canceling headphones with excellent sound quality", "price": 299.99},
    {"name": "Smartwatch", "description": "A smartwatch with health tracking features", "price": 399.99},
    {"name": "Tablet", "description": "A lightweight tablet for entertainment and productivity", "price": 699.99}
]

# Prepare the data
documents = [p["description"] for p in products]
metadatas = [{"name": p["name"], "price": p["price"]} for p in products]
ids = [str(i+1) for i in range(len(products))]

# Add to the collection (upsert keeps the script safe to re-run)
collection.upsert(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

# Recommend products
def recommend_products(query, n=3):
    results = collection.query(
        query_texts=[query],
        n_results=n
    )

    recommended = []
    # Iterate over the results actually returned, which may be fewer than n
    for i in range(len(results["ids"][0])):
        product = {
            "name": results["metadatas"][0][i]["name"],
            "description": results["documents"][0][i],
            "price": results["metadatas"][0][i]["price"],
            "distance": results["distances"][0][i]
        }
        recommended.append(product)

    return recommended

# Test the recommendations
query = "I need a device for listening to music"
recommendations = recommend_products(query)

print("Recommended products:")
for i, product in enumerate(recommendations, 1):
    print(f"{i}. {product['name']} - ${product['price']}")
    print(f"   Description: {product['description']}")
    print(f"   Similarity: {1 - product['distance']:.4f}")
    print()
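The "Similarity" value printed above is 1 minus the returned distance, so its meaning depends on the distance metric the collection uses. As a hedged illustration, assuming the two metrics are defined as squared L2 distance and cosine distance (1 minus cosine similarity), the plain-Python sketch below shows how they relate for unit-length vectors. The two vectors are made up.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

l2 = squared_l2(a, b)
cos_d = cosine_distance(a, b)

# For unit vectors: squared L2 distance = 2 * cosine distance
print(round(l2, 6), round(cos_d, 6))
```

Under these definitions, "1 - distance" equals cosine similarity only when the collection uses cosine distance. With squared L2 the value still orders results correctly, but it is not bounded like a cosine and can become negative for dissimilar items.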

Case 2: Semantic Search Engine

Scenario: build a document search engine that matches on meaning rather than keywords.

Implementation Steps

  1. Create a document collection
  2. Add document data
  3. Implement a semantic search interface
  4. Deploy the search engine

Example

import chromadb

# Create a client
client = chromadb.PersistentClient(path="./document_search")

# Create the document collection. Cosine distance is used so that
# "1 - distance" below can be read as a cosine similarity.
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Document data
documents = [
    "Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from data.",
    "Deep learning is a branch of machine learning that uses neural networks with many layers.",
    "Natural language processing is a field of AI that focuses on the interaction between computers and human language.",
    "Computer vision is a field of AI that focuses on enabling computers to interpret and understand visual information from the world.",
    "Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment."
]

# Prepare metadata and IDs
metadatas = [
    {"title": "Introduction to Machine Learning"},
    {"title": "Deep Learning Fundamentals"},
    {"title": "Natural Language Processing"},
    {"title": "Computer Vision Basics"},
    {"title": "Reinforcement Learning"}
]
ids = [str(i+1) for i in range(len(documents))]

# Add to the collection (upsert keeps the script safe to re-run)
collection.upsert(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

# Semantic search
def search_documents(query, n=2):
    results = collection.query(
        query_texts=[query],
        n_results=n
    )

    search_results = []
    # Iterate over the results actually returned, which may be fewer than n
    for i in range(len(results["ids"][0])):
        result = {
            "title": results["metadatas"][0][i]["title"],
            "content": results["documents"][0][i],
            "similarity": 1 - results["distances"][0][i]
        }
        search_results.append(result)

    return search_results

# Test the search
query = "What is deep learning?"
results = search_documents(query)

print("Search results:")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['title']}")
    print(f"   Content: {result['content']}")
    print(f"   Similarity: {result['similarity']:.4f}")
    print()

Case 3: Chatbot Knowledge Base

Scenario: build a knowledge base that a chatbot uses to answer user questions.

Implementation Steps

  1. Create a knowledge base collection
  2. Add frequently asked questions and answers
  3. Retrieve the most relevant answer for a user question
  4. Deploy the chatbot

Example

import chromadb

# Create a client
client = chromadb.PersistentClient(path="./chatbot_knowledgebase")

# Create the knowledge base collection. Cosine distance keeps distances
# on a fixed scale, which makes the threshold below easier to interpret.
collection = client.get_or_create_collection(
    name="knowledgebase",
    metadata={"hnsw:space": "cosine"}
)

# Frequently asked questions and answers
faqs = [
    {"question": "What are your business hours?", "answer": "We are open from 9 AM to 6 PM, Monday to Friday."},
    {"question": "How do I reset my password?", "answer": "You can reset your password by clicking on the 'Forgot Password' link on the login page."},
    {"question": "What payment methods do you accept?", "answer": "We accept credit cards, debit cards, and PayPal."},
    {"question": "How do I track my order?", "answer": "You can track your order by logging into your account and navigating to the 'Orders' section."},
    {"question": "Do you offer refunds?", "answer": "Yes, we offer refunds within 30 days of purchase, provided the item is in its original condition."}
]

# Prepare the data
documents = [f"Q: {faq['question']} A: {faq['answer']}" for faq in faqs]
metadatas = [{"question": faq['question'], "answer": faq['answer']} for faq in faqs]
ids = [str(i+1) for i in range(len(faqs))]

# Add to the collection (upsert keeps the script safe to re-run)
collection.upsert(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

# Answer a user question
def answer_question(question):
    results = collection.query(
        query_texts=[question],
        n_results=1
    )

    distances = results["distances"][0]
    # Distance threshold: a cosine distance below 0.5 means a cosine similarity
    # above 0.5. The value is a starting point and should be tuned on real questions.
    if distances and distances[0] < 0.5:
        return results["metadatas"][0][0]["answer"]
    else:
        return "I'm sorry, I don't have an answer to that question. Please contact our support team for assistance."

# Test the chatbot
questions = [
    "What time do you open?",
    "How can I reset my password?",
    "Do you have a refund policy?",
    "What's your return policy?"
]

print("Chatbot responses:")
for question in questions:
    answer = answer_question(question)
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()

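Summary and Outlook

Chroma is a lightweight vector database that offers a simple but capable toolset for storing and retrieving vector embeddings. This tutorial covered its core concepts, basic usage, and advanced features.

Key Advantages

  • Open source and free to use and customize
  • Lightweight, easy to install and deploy
  • Easy to use, with a small, consistent API
  • Optimized for vector search
  • Flexible, with support for several distance metrics

Application Areas

  • Product recommendation systems
  • Semantic search engines
  • Chatbot knowledge bases
  • Personalized recommendation systems
  • Content management and information retrieval

Future Development

The Chroma team continues to improve the database. Possible future directions include:

  • Support for more embedding models and vector types
  • More advanced search and filtering features
  • Deeper integration with machine learning frameworks
  • Better performance and scalability
  • More industry-specific solutions

With continued practice, you can use Chroma to build smarter and more efficient vector search and recommendation systems for a wide range of scenarios.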