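Getting Started with Weaviate

Project Overview

Weaviate is an open-source vector database built for storing and retrieving vector embeddings, which makes it a good fit for semantic search and recommendation systems. It provides a rich set of APIs, client libraries for several programming languages, and integrations with machine learning frameworks and models.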

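Key Features

Vector storage and retrieval: stores and retrieves vector embeddings efficiently
Semantic search: search based on vector similarity
Recommendations: recommendations based on vector similarity
Multimodal support: handles text, images, audio, and other data types
Machine learning integration: works with frameworks such as TensorFlow and PyTorch
Scalability: scales horizontally to handle large datasets
Security: provides access control and encryption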

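Project Highlights

Free and open source: fully open source, free to deploy and customize
High performance: optimized for vector search
Easy to use: provides an intuitive API
Highly scalable: scales horizontally to handle large datasets
Multi-language support: clients for Python, JavaScript, Go, and other languages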

Installation and Configuration

Installation Steps

Install with Docker (recommended)

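# Start the Weaviate container (ENABLE_MODULES turns on the text2vec-openai module used in the examples below)
docker run -d -p 8080:8080 -p 50051:50051 -e ENABLE_MODULES=text2vec-openai --name weaviate semitechnologies/weaviate:latest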

Install with Docker Compose

# docker-compose.yml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      ENABLE_MODULES: 'text2vec-openai'  # needed for the text2vec-openai examples below

# Start the container
docker-compose up -d

Basic Configuration

Access the Weaviate API: the default address is http://localhost:8080
Check that Weaviate is running:

curl http://localhost:8080/v1/meta

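Install a client library:

# Python client (the examples below use the v3 API, i.e. weaviate.Client, so pin a version below 4)
pip install "weaviate-client<4"

# JavaScript client
npm install weaviate-client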

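Core Concepts

1. Class

A class is the basic data structure in Weaviate, similar to a table in a relational database. Each class defines a set of properties and how its objects are vectorized.

2. Object

An object is an instance of a class. It holds property values and a vector embedding.

3. Property

A property is an attribute of an object, such as a text, number, or boolean value.

4. Vector

A vector is the numerical representation of an object and is used to compute similarity.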
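
To make "computing similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. Cosine distance is Weaviate's default metric, but this is only an illustration of the idea, not how Weaviate implements it internally. The three toy vectors and their names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical); real embeddings have
# hundreds or thousands of dimensions.
headphones = [0.9, 0.1, 0.0]
earbuds = [0.8, 0.2, 0.1]
laptop = [0.1, 0.9, 0.3]

# Objects with similar meaning should end up with a higher similarity score.
print(cosine_similarity(headphones, earbuds) > cosine_similarity(headphones, laptop))
```

A score close to 1 means the two vectors point in nearly the same direction, which is what a vector search treats as "similar".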

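5. Schema

The schema defines the classes in a Weaviate instance and the structure of their properties.

6. Module

Modules extend Weaviate with additional functionality, such as text embedding and image processing.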

Basic Usage

Creating a Schema

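import os

import weaviate

# Connect to Weaviate. The text2vec-openai vectorizer needs an OpenAI API key,
# read here from the OPENAI_API_KEY environment variable.
client = weaviate.Client(
    "http://localhost:8080",
    additional_headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
)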

# Define the schema
schema = {
    "classes": [
        {
            "class": "Product",
            "description": "A product in the store",
            "properties": [
                {
                    "name": "name",
                    "dataType": ["string"],
                    "description": "The name of the product"
                },
                {
                    "name": "description",
                    "dataType": ["string"],
                    "description": "The description of the product"
                },
                {
                    "name": "price",
                    "dataType": ["number"],
                    "description": "The price of the product"
                }
            ],
            "vectorizer": "text2vec-openai"  # Use OpenAI's text embedding model (the module must be enabled on the server)
        }
    ]
}

# Create the schema
client.schema.create(schema)

Adding Objects

# Objects to add
products = [
    {
        "name": "Smartphone",
        "description": "A high-end smartphone with advanced features",
        "price": 999.99
    },
    {
        "name": "Laptop",
        "description": "A powerful laptop for gaming and productivity",
        "price": 1499.99
    },
    {
        "name": "Headphones",
        "description": "Noise-canceling headphones with excellent sound quality",
        "price": 299.99
    }
]

# Add the objects in a batch
with client.batch as batch:
    for product in products:
        batch.add_data_object(
            data_object=product,
            class_name="Product"
        )

Searching Objects

# Semantic search
response = client.query.get(
    "Product",
    ["name", "description", "price"]
).with_near_text(
    {
        "concepts": ["electronics with good sound quality"]
    }
).with_limit(2).do()

print(response)

# Property-based search
response = client.query.get(
    "Product",
    ["name", "description", "price"]
).with_where(
    {
        "path": ["price"],
        "operator": "LessThan",
        "valueNumber": 500
    }
).do()

print(response)
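
The property filter above uses a single condition. Conditions can also be combined with the And and Or operators through an operands list. The sketch below builds such a filter as a plain dictionary; the helper name price_between is hypothetical and exists only for this illustration.

```python
def price_between(low, high):
    """Build a where filter matching objects with low <= price < high."""
    return {
        "operator": "And",
        "operands": [
            {"path": ["price"], "operator": "GreaterThanEqual", "valueNumber": low},
            {"path": ["price"], "operator": "LessThan", "valueNumber": high},
        ],
    }


# The resulting dictionary has the same shape as the single-condition
# filter, with the individual conditions nested under "operands".
where_filter = price_between(200, 1000)
print(where_filter["operator"], len(where_filter["operands"]))
```

The returned dictionary can be passed to with_where() in place of the single-condition filter shown earlier.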

Updating and Deleting Objects

# Update an object (replace "object-uuid" with the UUID of a real object)
client.data_object.update(
    data_object={
        "price": 899.99
    },
    class_name="Product",
    uuid="object-uuid"
)

# Delete an object
client.data_object.delete(
    class_name="Product",
    uuid="object-uuid"
)
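
The update and delete calls need the UUID of the target object. One way to know it ahead of time, assuming product names are unique, is to derive the UUID deterministically from the name with the standard library and supply it when the object is created. The helper name product_uuid and the "product:" prefix are assumptions made for this sketch.

```python
import uuid


def product_uuid(name):
    """Derive a stable UUID (version 5) from a product name."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"product:{name}"))


# The same name always yields the same UUID, so it can be recomputed
# later instead of being stored or looked up.
laptop_id = product_uuid("Laptop")
print(laptop_id == product_uuid("Laptop"))
```

The value can then be passed as the uuid argument of client.data_object.create() or batch.add_data_object(), and reused in the update and delete calls.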

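Advanced Features

1. Custom Vectors

# Use a custom vector
import numpy as np

# Generate a custom vector (random values here, as a stand-in for a real embedding)
vector = np.random.rand(1536).tolist()  # 1536-dimensional vector

# Add an object with a custom vector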
client.data_object.create(
    data_object={
        "name": "Custom Product",
        "description": "A product with custom vector",
        "price": 599.99
    },
    class_name="Product",
    vector=vector
)

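2. Multimodal Search

# Image search. This requires a class vectorized by an image-capable module
# (for example img2vec-neural); the Product class above uses text2vec-openai.
response = client.query.get(
    "Product",
    ["name", "description", "price"]
).with_near_image(
    {
        "image": "base64-encoded-image"
    },
    encode=False  # The image is already base64-encoded, so skip client-side encoding
).with_limit(2).do()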

print(response)

3. Aggregate Queries

# Aggregate query: count products grouped by price
response = client.query.aggregate(
    "Product"
).with_group_by_filter(
    ["price"]
).with_fields(
    "groupedBy { value } meta { count }"
).do()

print(response)
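
A grouped Aggregate response is a nested dictionary, so a small helper can turn it into a simple mapping. The sketch below assumes each group carries a groupedBy value and a meta count; sample_response is hypothetical data written in that assumed shape, not output captured from a real instance, and the helper name counts_by_group is made up for the example.

```python
def counts_by_group(response, class_name):
    """Map each groupedBy value to its object count."""
    groups = response["data"]["Aggregate"][class_name]
    return {group["groupedBy"]["value"]: group["meta"]["count"] for group in groups}


# Hypothetical response in the assumed shape (group values arrive as strings).
sample_response = {
    "data": {
        "Aggregate": {
            "Product": [
                {"groupedBy": {"value": "299.99"}, "meta": {"count": 1}},
                {"groupedBy": {"value": "999.99"}, "meta": {"count": 2}},
            ]
        }
    }
}

print(counts_by_group(sample_response, "Product"))
```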

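Real-World Use Cases

Case 1: Product Recommendation System

Scenario: recommend relevant products based on a user's preferences.

Implementation steps:

Create a product class and a user class
Generate vector embeddings for products and users
Use vector similarity to find products to recommend
Deploy the recommendation system

Example:

# Define the User class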
schema = {
    "classes": [
        {
            "class": "User",
            "description": "A user of the system",
            "properties": [
                {
                    "name": "name",
                    "dataType": ["string"],
                    "description": "The name of the user"
                },
                {
                    "name": "preferences",
                    "dataType": ["string"],
                    "description": "The user's preferences"
                }
            ],
            "vectorizer": "text2vec-openai"
        }
    ]
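}

# Create the User class in Weaviate
client.schema.create(schema)

# Add a user; create() returns the UUID of the new object
user = {
    "name": "John Doe",
    "preferences": "I like high-quality electronics with good sound quality"
}

user_uuid = client.data_object.create(
    data_object=user,
    class_name="User"
)

# Fetch the user's vector (with_vector=True is required to include it in the result)
user_vector = client.data_object.get_by_id(
    uuid=user_uuid,
    class_name="User",
    with_vector=True
)["vector"]

# Recommend products based on the user's vector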
response = client.query.get(
    "Product",
    ["name", "description", "price"]
).with_near_vector(
    {
        "vector": user_vector
    }
).with_limit(3).do()

print("Recommended products:")
for product in response["data"]["Get"]["Product"]:
    print(f"- {product['name']}: ${product['price']}")

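Case 2: Semantic Search Engine

Scenario: build a document search engine based on semantic meaning.

Implementation steps:

Create a document class
Generate vector embeddings for the documents
Implement a semantic search interface
Deploy the search engine

Example:

# Define the Document class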
schema = {
    "classes": [
        {
            "class": "Document",
            "description": "A document in the system",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["string"],
                    "description": "The title of the document"
                },
                {
                    "name": "content",
                    "dataType": ["text"],
                    "description": "The content of the document"
                },
                {
                    "name": "author",
                    "dataType": ["string"],
                    "description": "The author of the document"
                }
            ],
            "vectorizer": "text2vec-openai"
        }
    ]
}

# Create the Document class in Weaviate
client.schema.create(schema)

# Add documents
documents = [
    {
        "title": "Introduction to Machine Learning",
        "content": "Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from data.",
        "author": "John Smith"
    },
    {
        "title": "Deep Learning Fundamentals",
        "content": "Deep learning is a branch of machine learning that uses neural networks with many layers.",
        "author": "Jane Doe"
    },
    {
        "title": "Natural Language Processing",
        "content": "Natural language processing is a field of AI that focuses on the interaction between computers and human language.",
        "author": "Bob Johnson"
    }
]

with client.batch as batch:
    for doc in documents:
        batch.add_data_object(
            data_object=doc,
            class_name="Document"
        )

# Semantic search
query = "What is deep learning?"
response = client.query.get(
    "Document",
    ["title", "content", "author"]
).with_near_text(
    {
        "concepts": [query]
    }
).with_limit(2).do()

print("Search results:")
for doc in response["data"]["Get"]["Document"]:
    print(f"Title: {doc['title']}")
    print(f"Author: {doc['author']}")
    print(f"Content: {doc['content'][:100]}...")
    print()
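
The result loop above indexes straight into response["data"]["Get"], which raises a KeyError or TypeError with little context if the query failed. A small guard, sketched below, checks for a top-level "errors" entry first. The helper name get_objects is hypothetical, and the sample dictionaries are made-up data in the shape the examples in this tutorial rely on.

```python
def get_objects(response, class_name):
    """Return the objects of a Get response, raising if the query reported errors."""
    if "errors" in response:
        messages = "; ".join(
            error.get("message", "unknown error") for error in response["errors"]
        )
        raise RuntimeError(f"Weaviate query failed: {messages}")
    # An empty result may come back as None, so fall back to an empty list.
    return response["data"]["Get"][class_name] or []


# Hypothetical responses used only to exercise the helper.
ok_response = {"data": {"Get": {"Document": [{"title": "Deep Learning Fundamentals"}]}}}
failed_response = {"errors": [{"message": "no such class"}]}

for doc in get_objects(ok_response, "Document"):
    print(doc["title"])
```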

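Case 3: Image Similarity Search

Scenario: find related images based on visual similarity.

Implementation steps:

Create an image class
Generate vector embeddings for the images
Implement an image similarity search interface
Deploy the image search system

Example: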

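# Define the Image class
schema = {
    "classes": [
        {
            "class": "Image",
            "description": "An image in the system",
            "properties": [
                {
                    "name": "name",
                    "dataType": ["string"],
                    "description": "The name of the image"
                },
                {
                    "name": "description",
                    "dataType": ["string"],
                    "description": "The description of the image"
                },
                {
                    "name": "image",
                    "dataType": ["blob"],
                    "description": "The base64-encoded image data"
                }
            ],
            "vectorizer": "img2vec-neural",  # Image vectorizer module (must be enabled on the server)
            "moduleConfig": {
                "img2vec-neural": {"imageFields": ["image"]}  # Vectorize the "image" property
            }
        }
    ]
}

# Create the Image class in Weaviate
client.schema.create(schema)

# Add an image (base64-encoded)
import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

client.data_object.create(
    data_object={
        "name": "Sample Image",
        "description": "A sample image for similarity search",
        "image": image_data  # Stored in the blob property; the module derives the vector from it
    },
    class_name="Image"
)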

# Image similarity search
with open("query_image.jpg", "rb") as f:
    query_image = base64.b64encode(f.read()).decode("utf-8")

response = client.query.get(
    "Image",
    ["name", "description"]
).with_near_image(
    {
        "image": query_image
    },
    encode=False  # query_image is already base64-encoded
).with_limit(3).do()

print("Similar images:")
for image in response["data"]["Get"]["Image"]:
    print(f"- {image['name']}: {image['description']}")

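Summary and Outlook

Weaviate is a capable vector database that provides the tools needed to build efficient semantic search and recommendation systems. This tutorial covered its core concepts, basic usage, and advanced features.

Key Advantages

Free and open source, free to deploy and customize
High performance, optimized for vector search
Easy to use, with an intuitive API
Highly scalable, with horizontal scaling for large datasets
Multi-language support, with clients for Python, JavaScript, Go, and other languages

Applications

Product recommendation systems
Semantic search engines
Image similarity search
Natural language processing applications
Personalized recommendation systems

Future Development

The Weaviate team continues to improve the database. Future releases may:

Support more vector models and embedding techniques
Offer more advanced search and recommendation algorithms
Strengthen multimodal support
Improve performance and scalability
Provide more industry-specific solutions

With continued learning and practice, you can use Weaviate to build smarter, more efficient vector search and recommendation systems for a wide range of scenarios.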