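Chroma: A Getting-Started Tutorial
Project overview
Chroma is an open-source, lightweight vector database focused on storing and retrieving vector embeddings, which makes it a good fit for semantic search and recommendation systems. It offers a concise API that is easy to integrate into applications and is especially well suited to prototypes and small applications.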
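Main features
- Vector storage: stores vector embeddings efficiently
- Semantic search: search based on vector similarity
- Recommendations: recommendations based on vector similarity
- Lightweight: easy to deploy and use
- Multi-language support: clients for Python, JavaScript, and other languages
- In-memory and persistent storage: supports both modes
- Machine learning integration: works with frameworks such as Hugging Face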
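Project characteristics
- Free and open source: can be freely used and customized
- Lightweight: easy to install and deploy
- Easy to use: provides a concise API
- High performance: optimized for vector search
- Flexible: supports several vector similarity measures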
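Installation and configuration
Installation steps
- Install Chroma
# Install Chroma
pip install chromadb
# Install the optional dependency (used to generate embeddings)
pip install sentence-transformers
Basic configuration
import chromadb
# Create a Chroma client
# In-memory storage
client = chromadb.Client()
# Or use persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
Core concepts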
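1. Collection
A collection is the basic data structure in Chroma. It groups related vectors together with their documents and metadata.
2. Document
A document is the text content to be stored. It is converted into a vector embedding.
3. Metadata
Metadata is additional information attached to a document, such as tags or categories.
4. Vector
A vector is the numerical representation of a document and is used to compute similarity.
5. ID (Identifier)
An ID is the unique identifier of a document.
6. Embedding function
An embedding function converts documents into vector embeddings.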
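To make "computing similarity" concrete, the following minimal sketch implements three common distance measures in plain Python: squared L2, cosine distance, and inner-product distance. As far as the Chroma documentation describes, these correspond to the metrics a collection can be configured with, but the function names here are hypothetical and this is an illustration, not Chroma's implementation.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: sum of (a_i - b_i)^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def inner_product_distance(a, b):
    """1 - dot product; only meaningful as a distance for normalized vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Toy 2-dimensional "embeddings" (made up for illustration)
query = [1.0, 0.0]
near = [0.8, 0.6]   # unit length, close in direction to the query
far = [0.0, 1.0]    # orthogonal to the query

for name, vec in [("near", near), ("far", far)]:
    print(name, squared_l2(query, vec), cosine_distance(query, vec))
```

In every case a smaller value means a closer match, which is why query results are ordered by ascending distance.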
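Basic usage
Creating a collection
import chromadb
# Create a client
client = chromadb.Client()
# Create a collection
collection = client.create_collection(name="products")
# Or use persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
# get_or_create_collection avoids an error if "products" already exists on disk from an earlier run
collection = client.get_or_create_collection(name="products")
Adding documents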
# Add documents (embeddings are generated by the collection's embedding function)
collection.add(
    documents=[
        "A high-end smartphone with advanced features",
        "A powerful laptop for gaming and productivity",
        "Noise-canceling headphones with excellent sound quality"
    ],
    metadatas=[
        {"name": "Smartphone", "price": 999.99},
        {"name": "Laptop", "price": 1499.99},
        {"name": "Headphones", "price": 299.99}
    ],
    ids=["1", "2", "3"]
)
Vector similarity search
# Semantic search: the query text is embedded and compared with the stored vectors
results = collection.query(
    query_texts=["electronics with good sound quality"],
    n_results=2
)
print(results)
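The dictionary returned by a query is nested: each key such as "ids", "documents", "metadatas", and "distances" holds one inner list per query text. The sketch below flattens that shape into one record per hit. The helper name flatten_query_results is hypothetical, and the sample dictionary is hand-written with made-up distances purely to show the shape.

```python
def flatten_query_results(results, query_index=0):
    """Turn the nested per-query lists into one dict per hit."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": i, "document": doc, "metadata": meta, "distance": dist}
        for i, doc, meta, dist in zip(ids, documents, metadatas, distances)
    ]

# Hand-written sample shaped like a query result (distances are invented)
sample = {
    "ids": [["3", "1"]],
    "documents": [[
        "Noise-canceling headphones with excellent sound quality",
        "A high-end smartphone with advanced features",
    ]],
    "metadatas": [[
        {"name": "Headphones", "price": 299.99},
        {"name": "Smartphone", "price": 999.99},
    ]],
    "distances": [[0.9, 1.3]],
}

hits = flatten_query_results(sample)
for hit in hits:
    print(f"{hit['metadata']['name']}: distance {hit['distance']:.2f}")
```

Iterating over the zipped lists, rather than over range(n_results), also copes with queries that return fewer hits than requested.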
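# Search with a custom vector
import numpy as np
# Generate a custom query vector. Its dimension must match the embeddings already stored
# in the collection; the default embedding model (all-MiniLM-L6-v2) produces 384 dimensions.
query_vector = np.random.rand(384).tolist()  # 384-dimensional vector
results = collection.query(
    query_embeddings=[query_vector],
    n_results=2
)
print(results)
Updating and deleting documents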
# Update a document (its embedding is recomputed from the new text)
collection.update(
    ids=["1"],
    documents=["An updated smartphone with even better features"],
    metadatas=[{"name": "Smartphone", "price": 1099.99}]
)
# Delete a document
collection.delete(ids=["1"])
Advanced features
1. Using a custom embedding function
from chromadb.utils import embedding_functions
# Use a Sentence Transformers model (hosted on Hugging Face) as the embedding function
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
# Create a collection that uses the custom embedding function
collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=sentence_transformer_ef
)
# Add documents (embeddings are generated automatically)
collection.add(
    documents=["Document 1", "Document 2", "Document 3"],
    ids=["1", "2", "3"]
)
2. Filtered search
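# Filter search results by metadata
# The filter uses the "price" field, so switch back to the products collection
collection = client.get_collection(name="products")
results = collection.query(
    query_texts=["electronics"],
    n_results=2,
    where={"price": {"$lt": 500}}  # products priced below 500
)
print(results)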
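To make the filter semantics concrete, here is a minimal pure-Python sketch of how a where clause selects records by their metadata. The matches helper is hypothetical and only mimics the comparison operators $lt, $lte, $gt, $gte and the logical operators $and and $or; it is not how Chroma evaluates filters internally.

```python
import operator

# Comparison operators mimicked by this sketch
COMPARISONS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}

def matches(metadata, where):
    """Return True if the metadata dict satisfies the where clause."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # In this sketch a record without the field never matches
            if key not in metadata:
                return False
            for op, expected in condition.items():
                if not COMPARISONS[op](metadata[key], expected):
                    return False
    return True

products = [
    {"name": "Smartphone", "price": 999.99},
    {"name": "Laptop", "price": 1499.99},
    {"name": "Headphones", "price": 299.99},
    {"name": "Smartwatch", "price": 399.99},
    {"name": "Tablet", "price": 699.99},
]

cheap = [p["name"] for p in products if matches(p, {"price": {"$lt": 500}})]
mid_range = [
    p["name"] for p in products
    if matches(p, {"$and": [{"price": {"$gt": 300}}, {"price": {"$lt": 1000}}]})
]
print(cheap)
print(mid_range)
```

In a real query the filter narrows the candidate set, and the similarity ranking is then applied only to the records that pass it.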
# A more complex filter: price between 300 and 1000
results = collection.query(
    query_texts=["electronics"],
    n_results=2,
    where={
        "$and": [
            {"price": {"$gt": 300}},
            {"price": {"$lt": 1000}}
        ]
    }
)
print(results)
3. Batch operations
# Add documents in bulk
documents = [f"Document {i}" for i in range(100)]
metadatas = [{"id": i} for i in range(100)]
# Prefixed IDs avoid clashing with the IDs "1" to "3" used in earlier examples
ids = [f"doc-{i}" for i in range(100)]
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
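For much larger data sets, a single add call may exceed what the client accepts (the limit depends on the Chroma version and backend). A common workaround is to split the lists into aligned batches. The sketch below assumes a collection object with the add method used above; the helper names chunked and add_in_batches and the batch size of 40 are hypothetical choices for illustration.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_in_batches(collection, documents, metadatas, ids, batch_size=40):
    """Call collection.add once per batch, keeping the three lists aligned."""
    batches = zip(
        chunked(documents, batch_size),
        chunked(metadatas, batch_size),
        chunked(ids, batch_size),
    )
    calls = 0
    for doc_batch, meta_batch, id_batch in batches:
        collection.add(documents=doc_batch, metadatas=meta_batch, ids=id_batch)
        calls += 1
    return calls

# Show how 100 items are split with a batch size of 40
batch_sizes = [len(batch) for batch in chunked(list(range(100)), 40)]
print(batch_sizes)  # [40, 40, 20]
```

With a real collection the call would look like add_in_batches(collection, documents, metadatas, ids), replacing the single collection.add shown above.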
# Run several queries at once; the result holds one list of hits per query text
query_texts = ["Document 1", "Document 2"]
results = collection.query(
    query_texts=query_texts,
    n_results=5
)
print(results)
Practical use cases
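Case 1: Product recommendation system
Scenario: recommend relevant products based on user preferences.
Implementation steps:
- Create a product collection
- Add product documents and metadata
- Recommend products based on the user's query
- Deploy the recommendation system
Example: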
import chromadb
# Create a client
client = chromadb.PersistentClient(path="./product_recommendation")
# Create the product collection (get_or_create avoids an error if it already exists from an earlier run)
collection = client.get_or_create_collection(name="products")
# Product data
products = [
    {"name": "Smartphone", "description": "A high-end smartphone with advanced features", "price": 999.99},
    {"name": "Laptop", "description": "A powerful laptop for gaming and productivity", "price": 1499.99},
    {"name": "Headphones", "description": "Noise-canceling headphones with excellent sound quality", "price": 299.99},
    {"name": "Smartwatch", "description": "A smartwatch with health tracking features", "price": 399.99},
    {"name": "Tablet", "description": "A lightweight tablet for entertainment and productivity", "price": 699.99}
]
# Prepare the data: descriptions are embedded, name and price are kept as metadata
documents = [p["description"] for p in products]
metadatas = [{"name": p["name"], "price": p["price"]} for p in products]
ids = [str(i+1) for i in range(len(products))]
# Add to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
# Recommend products
def recommend_products(query, n=3):
    results = collection.query(
        query_texts=[query],
        n_results=n
    )
    recommended = []
    # Iterate over the hits actually returned, which may be fewer than n
    for i in range(len(results["ids"][0])):
        product = {
            "name": results["metadatas"][0][i]["name"],
            "description": results["documents"][0][i],
            "price": results["metadatas"][0][i]["price"],
            "distance": results["distances"][0][i]
        }
        recommended.append(product)
    return recommended
# Test the recommendations
query = "I need a device for listening to music"
recommendations = recommend_products(query)
print("Recommended products:")
for i, product in enumerate(recommendations, 1):
    print(f"{i}. {product['name']} - ${product['price']}")
    print(f"   Description: {product['description']}")
    # Chroma returns a distance (squared L2 by default), so smaller means more similar
    print(f"   Distance: {product['distance']:.4f}")
    print()
Case 2: Semantic search engine
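Scenario: build a document search engine based on semantics.
Implementation steps:
- Create a document collection
- Add the document data
- Implement a semantic search interface
- Deploy the search engine
Example: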
import chromadb
# Create a client
client = chromadb.PersistentClient(path="./document_search")
# Create the document collection (get_or_create avoids an error if it already exists from an earlier run)
collection = client.get_or_create_collection(name="documents")
# Document data
documents = [
    "Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from data.",
    "Deep learning is a branch of machine learning that uses neural networks with many layers.",
    "Natural language processing is a field of AI that focuses on the interaction between computers and human language.",
    "Computer vision is a field of AI that focuses on enabling computers to interpret and understand visual information from the world.",
    "Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment."
]
# Prepare the metadata (one title per document, in the same order)
metadatas = [
    {"title": "Introduction to Machine Learning"},
    {"title": "Deep Learning Fundamentals"},
    {"title": "Natural Language Processing"},
    {"title": "Computer Vision Basics"},
    {"title": "Reinforcement Learning"}
]
ids = [str(i+1) for i in range(len(documents))]
# Add to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
# Semantic search
def search_documents(query, n=2):
    results = collection.query(
        query_texts=[query],
        n_results=n
    )
    search_results = []
    # Iterate over the hits actually returned, which may be fewer than n
    for i in range(len(results["ids"][0])):
        result = {
            "title": results["metadatas"][0][i]["title"],
            "content": results["documents"][0][i],
            # Chroma returns a distance (squared L2 by default), so smaller means more similar
            "distance": results["distances"][0][i]
        }
        search_results.append(result)
    return search_results
# Test the search
query = "What is deep learning?"
results = search_documents(query)
print("Search results:")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['title']}")
    print(f"   Content: {result['content']}")
    print(f"   Distance: {result['distance']:.4f}")
    print()
Case 3: Chatbot knowledge base
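Scenario: build a knowledge base that a chatbot uses to answer user questions.
Implementation steps:
- Create a knowledge base collection
- Add frequently asked questions and their answers
- Retrieve the relevant answer for a user's question
- Deploy the chatbot
Example: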
import chromadb
# Create a client
client = chromadb.PersistentClient(path="./chatbot_knowledgebase")
# Create the knowledge base collection (get_or_create avoids an error if it already exists from an earlier run)
collection = client.get_or_create_collection(name="knowledgebase")
# Frequently asked questions and answers
faqs = [
    {"question": "What are your business hours?", "answer": "We are open from 9 AM to 6 PM, Monday to Friday."},
    {"question": "How do I reset my password?", "answer": "You can reset your password by clicking on the 'Forgot Password' link on the login page."},
    {"question": "What payment methods do you accept?", "answer": "We accept credit cards, debit cards, and PayPal."},
    {"question": "How do I track my order?", "answer": "You can track your order by logging into your account and navigating to the 'Orders' section."},
    {"question": "Do you offer refunds?", "answer": "Yes, we offer refunds within 30 days of purchase, provided the item is in its original condition."}
]
# Prepare the data: question and answer are embedded together, and also kept as metadata
documents = [f"Q: {faq['question']} A: {faq['answer']}" for faq in faqs]
metadatas = [{"question": faq['question'], "answer": faq['answer']} for faq in faqs]
ids = [str(i+1) for i in range(len(faqs))]
# Add to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
# Answer a user question
def answer_question(question):
    results = collection.query(
        query_texts=[question],
        n_results=1
    )
    distances = results["distances"][0]
    # Distance threshold (smaller is closer); tune it for your embedding model and distance metric
    if distances and distances[0] < 0.5:
        return results["metadatas"][0][0]["answer"]
    else:
        return "I'm sorry, I don't have an answer to that question. Please contact our support team for assistance."
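Choosing the cutoff is easier when it is expressed as a similarity. The sketch below assumes two things that should be checked for a given setup: the collection uses squared L2 distance, and the embeddings are unit-length. Under those assumptions the squared L2 distance d and the cosine similarity s are related by d = 2 - 2s. Both helper names are hypothetical.

```python
def cosine_similarity_from_squared_l2(distance):
    """For unit-length vectors: s = 1 - d / 2."""
    return 1.0 - distance / 2.0

def squared_l2_threshold(min_cosine_similarity):
    """Largest squared L2 distance that still meets the similarity floor."""
    return 2.0 - 2.0 * min_cosine_similarity

# The 0.5 distance cutoff used above corresponds to this cosine similarity
print(cosine_similarity_from_squared_l2(0.5))

# A looser requirement of 0.6 cosine similarity gives a larger distance cutoff
print(squared_l2_threshold(0.6))
```

If the embeddings are not normalized, or the collection uses a different metric, this relation does not hold and the cutoff is better tuned empirically on sample questions.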
# Test the chatbot
questions = [
    "What time do you open?",
    "How can I reset my password?",
    "Do you have a refund policy?",
    "What's your return policy?"
]
print("Chatbot responses:")
for question in questions:
    answer = answer_question(question)
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()
Summary and outlook
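As a lightweight vector database, Chroma provides a simple yet capable tool for storing and retrieving vector embeddings. After working through this tutorial, you should be familiar with Chroma's core concepts, basic usage, and advanced features.
Key advantages
- Free and open source: can be freely used and customized
- Lightweight: easy to install and deploy
- Easy to use: provides a concise API
- High performance: optimized for vector search
- Flexible: supports several vector similarity measures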
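Application areas
- Product recommendation systems
- Semantic search engines
- Chatbot knowledge bases
- Personalized recommendation systems
- Content management and information retrieval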
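Future development
The Chroma team continues to improve the database. In the future it may:
- Support more embedding models and vector types
- Offer more advanced search and filtering features
- Strengthen integration with machine learning frameworks
- Improve performance and scalability
- Provide more industry-specific solutions
With continued learning and practice, you can use Chroma to build smarter, more efficient vector search and recommendation systems across a range of scenarios.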