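Assignment Walkthrough: Implementing and Comparing Optimization Algorithms

Core Concepts

Basic Concepts of Optimization Algorithms

In deep learning, an optimization algorithm adjusts the model parameters to minimize the loss function. Optimizers can differ substantially in convergence speed, stability, and final performance. This section walks through the implementation of several commonly used optimizers and compares their behavior in small experiments.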

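Overview of Common Optimization Algorithms

Stochastic gradient descent (SGD): the most basic optimizer. In its strict form each update uses the gradient of a single sample; in practice a small mini-batch is usually used instead.
Momentum gradient descent (Momentum): adds a momentum term that accumulates past gradients, which speeds up convergence and damps oscillation.
RMSProp: an adaptive learning-rate method that scales each parameter's step by an exponentially decaying average of its squared gradients.
Adam: combines the strengths of momentum and RMSProp, using momentum information while adapting the learning rate per parameter.
AdamW: a variant of Adam that decouples weight decay from the gradient-based update, which makes the regularization more effective.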
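AdamW appears in the list above but is not implemented in the case studies. As a rough illustration of what "decoupled" weight decay means, the sketch below compares one Adam step with L2 regularization folded into the gradient against one AdamW-style step. The function names `adam_l2_step` and `adamw_step` and the hyperparameter values are assumptions made for this example, not part of the assignment.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
    """Adam with L2 regularization: the decay term is added to the gradient,
    so it is also rescaled by the adaptive denominator."""
    grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW-style step: the moments see only the loss gradient, and the
    weight decay is applied directly to the parameters."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

# With a zero loss gradient, only the weight decay moves the parameters
theta0 = np.array([1.0, -2.0])
zero_grad = np.zeros_like(theta0)
state = np.zeros_like(theta0)

theta_l2, _, _ = adam_l2_step(theta0, zero_grad, state, state, t=1)
theta_w, _, _ = adamw_step(theta0, zero_grad, state, state, t=1)
print("Adam + L2:", theta_l2)
print("AdamW:    ", theta_w)
```

In the L2 variant the decay term is normalized like any other gradient, so both parameters move by roughly the learning rate regardless of their size. In the AdamW-style variant each parameter shrinks in proportion to its own value.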

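Implementation Considerations

Parameter initialization: a sensible initialization has a significant effect on how well the optimizer performs.
Learning-rate scheduling: different optimizers may call for different learning-rate schedules.
Batch size: the batch size affects both convergence speed and stability.
Gradient clipping: guards against exploding gradients and makes training more stable.
Hyperparameter tuning: each optimizer has its own hyperparameters to tune.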
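Two of the points above, gradient clipping and learning-rate scheduling, can be sketched in a few lines of NumPy. The helper names `clip_by_global_norm` and `step_decay`, and the decay settings, are hypothetical choices for illustration. Clipping by global norm and step decay are each only one of several common variants.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Two gradient arrays whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print("norm before:", norm_before, "norm after:", norm_after)

# Learning rate at a few selected epochs
print([step_decay(0.1, e) for e in (0, 9, 10, 25)])
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its length is capped.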

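Case Studies

Case 1: Implementing Basic Stochastic Gradient Descent (SGD)

Problem

Implement basic stochastic gradient descent and test it on a simple linear regression problem.

Solution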

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 2 + 3x + Gaussian noise
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1

# Add the bias column x0 = 1
X = np.hstack([np.ones((100, 1)), x])

# Initialize parameters
theta = np.random.randn(2, 1)

# Hyperparameters
learning_rate = 0.01
n_epochs = 1000
batch_size = 1

# Average loss per epoch
losses = []

# SGD training loop
for epoch in range(n_epochs):
    # Shuffle the data at the start of each epoch
    indices = np.random.permutation(len(X))
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    
    epoch_loss = 0
    for i in range(0, len(X), batch_size):
        # Slice out one mini-batch
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        
        # Forward pass: predictions
        y_pred = X_batch.dot(theta)
        
        # Half mean squared error on this batch
        loss = 0.5 * np.mean((y_pred - y_batch) ** 2)
        epoch_loss += loss
        
        # Gradient of the loss with respect to theta
        gradient = X_batch.T.dot(y_pred - y_batch) / batch_size
        
        # Parameter update
        theta -= learning_rate * gradient
    
    losses.append(epoch_loss / (len(X) // batch_size))
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}, Loss: {losses[-1]:.4f}")

# Print the final parameters
print(f"\nFinal parameters: theta0={theta[0][0]:.4f}, theta1={theta[1][0]:.4f}")

# Plot the loss curve
plt.figure(figsize=(10, 6))
plt.plot(range(n_epochs), losses)
plt.title('SGD Loss Curve')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.grid(True)
plt.show()

# Plot the data and the fitted line
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Data')
plt.plot(x, X.dot(theta), color='red', label='SGD Fit')
plt.title('Linear Regression with SGD')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

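Analysis of Results

SGD fits the linear regression model: the loss decreases and then levels off, and the final parameters end up close to the true values (theta0=2, theta1=3). Because the learning rate is constant and each update uses a single sample, the parameters keep fluctuating slightly around the optimum instead of settling exactly.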
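One way to check how close "close" is: linear regression has a closed-form least-squares solution, which can serve as a reference for the SGD result. The sketch below regenerates the Case 1 data and solves it with `np.linalg.lstsq`. The variable names `theta_ls` and `loss_ls` are introduced here for illustration only.

```python
import numpy as np

# Same synthetic data as in Case 1
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1
X = np.hstack([np.ones((100, 1)), x])

# Closed-form least-squares solution, used as a reference for the SGD result
theta_ls, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"Least squares: theta0={theta_ls[0][0]:.4f}, theta1={theta_ls[1][0]:.4f}")

# Half mean squared error at the reference solution; no optimizer can go
# below this value on the training data
loss_ls = 0.5 * np.mean((X.dot(theta_ls) - y) ** 2)
print(f"Loss at the least-squares solution: {loss_ls:.4f}")
```

The SGD parameters should land near `theta_ls`, and the SGD loss curve should flatten out slightly above `loss_ls`.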

Case 2: Implementing Momentum Gradient Descent

Problem

Implement momentum gradient descent and compare its performance with basic SGD.

Solution

import numpy as np
import matplotlib.pyplot as plt

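# Generate synthetic data: y = 2 + 3x + Gaussian noise
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1

# Add the bias column x0 = 1
X = np.hstack([np.ones((100, 1)), x])

# Initialize parameters; both optimizers start from the same point
# so that the comparison is fair
theta_sgd = np.random.randn(2, 1)
theta_momentum = theta_sgd.copy()

# Hyperparameters
learning_rate = 0.01
n_epochs = 1000
batch_size = 1
momentum = 0.9  # momentum coefficient

# Average loss per epoch
losses_sgd = []
losses_momentum = []

# Initialize the velocity
v = np.zeros_like(theta_momentum)

# Training loop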
for epoch in range(n_epochs):
    # Shuffle the data at the start of each epoch
    indices = np.random.permutation(len(X))
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    
    epoch_loss_sgd = 0
    epoch_loss_momentum = 0
    
    for i in range(0, len(X), batch_size):
        # Slice out one mini-batch
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        
        # SGD update
        y_pred_sgd = X_batch.dot(theta_sgd)
        loss_sgd = 0.5 * np.mean((y_pred_sgd - y_batch) ** 2)
        epoch_loss_sgd += loss_sgd
        gradient_sgd = X_batch.T.dot(y_pred_sgd - y_batch) / batch_size
        theta_sgd -= learning_rate * gradient_sgd
        
        # Momentum SGD update: accumulate a velocity, then step along it
        y_pred_momentum = X_batch.dot(theta_momentum)
        loss_momentum = 0.5 * np.mean((y_pred_momentum - y_batch) ** 2)
        epoch_loss_momentum += loss_momentum
        gradient_momentum = X_batch.T.dot(y_pred_momentum - y_batch) / batch_size
        v = momentum * v + learning_rate * gradient_momentum
        theta_momentum -= v
    
    losses_sgd.append(epoch_loss_sgd / (len(X) // batch_size))
    losses_momentum.append(epoch_loss_momentum / (len(X) // batch_size))
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}, SGD Loss: {losses_sgd[-1]:.4f}, Momentum Loss: {losses_momentum[-1]:.4f}")

# Print the final parameters
print(f"\nSGD Final parameters: theta0={theta_sgd[0][0]:.4f}, theta1={theta_sgd[1][0]:.4f}")
print(f"Momentum Final parameters: theta0={theta_momentum[0][0]:.4f}, theta1={theta_momentum[1][0]:.4f}")

# Plot the loss curves side by side
plt.figure(figsize=(10, 6))
plt.plot(range(n_epochs), losses_sgd, label='SGD')
plt.plot(range(n_epochs), losses_momentum, label='Momentum SGD')
plt.title('SGD vs Momentum SGD Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot the data and the fitted lines
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Data')
plt.plot(x, X.dot(theta_sgd), color='red', label='SGD Fit')
plt.plot(x, X.dot(theta_momentum), color='green', label='Momentum SGD Fit')
plt.title('Linear Regression: SGD vs Momentum SGD')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

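Analysis of Results

Momentum gradient descent usually converges faster than basic SGD and gives a smoother loss curve. In this case both algorithms fit the model, and momentum SGD typically reaches a low loss in fewer epochs.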
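The benefit of momentum is easiest to see on a deterministic, badly conditioned problem, where the learning rate is limited by the steepest direction and plain gradient descent crawls along the flat one. The sketch below uses an assumed two-dimensional quadratic (the curvatures 1 and 100 are arbitrary illustrative values) and the same momentum update form as in Case 2.

```python
import numpy as np

# Ill-conditioned quadratic: f(theta) = 0.5 * (1 * theta_0^2 + 100 * theta_1^2)
curvatures = np.array([1.0, 100.0])

def loss_fn(theta):
    return 0.5 * np.sum(curvatures * theta ** 2)

def grad_fn(theta):
    return curvatures * theta

learning_rate = 0.01   # limited by the steep direction (must stay below 2 / 100)
momentum = 0.9
n_steps = 200

theta_gd = np.array([1.0, 1.0])
theta_mom = np.array([1.0, 1.0])
v = np.zeros_like(theta_mom)

for _ in range(n_steps):
    # Plain gradient descent
    theta_gd = theta_gd - learning_rate * grad_fn(theta_gd)
    # Momentum update, same form as in Case 2
    v = momentum * v + learning_rate * grad_fn(theta_mom)
    theta_mom = theta_mom - v

loss_gd = loss_fn(theta_gd)
loss_momentum = loss_fn(theta_mom)
print(f"GD loss after {n_steps} steps:       {loss_gd:.3e}")
print(f"Momentum loss after {n_steps} steps: {loss_momentum:.3e}")
```

On this example momentum ends at a much lower loss after the same number of steps. This illustrates the mechanism and does not guarantee the same ordering on every problem.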

Case 3: Implementing the RMSProp Optimizer

Problem

Implement the RMSProp optimizer and compare its performance with SGD and momentum SGD.

Solution

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 2 + 3x + Gaussian noise
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1

# Add the bias column x0 = 1
X = np.hstack([np.ones((100, 1)), x])

# Initialize parameters; all optimizers start from the same point
# so that the comparison is fair
theta_sgd = np.random.randn(2, 1)
theta_momentum = theta_sgd.copy()
theta_rmsprop = theta_sgd.copy()

# Hyperparameters
learning_rate = 0.01
n_epochs = 1000
batch_size = 1
momentum = 0.9  # momentum coefficient
beta = 0.9  # RMSProp decay rate
epsilon = 1e-8  # small constant to avoid division by zero

# Average loss per epoch
losses_sgd = []
losses_momentum = []
losses_rmsprop = []

# Optimizer state: momentum velocity and RMSProp running average
v = np.zeros_like(theta_momentum)
r = np.zeros_like(theta_rmsprop)

# Training loop
for epoch in range(n_epochs):
    # Shuffle the data at the start of each epoch
    indices = np.random.permutation(len(X))
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    
    epoch_loss_sgd = 0
    epoch_loss_momentum = 0
    epoch_loss_rmsprop = 0
    
    for i in range(0, len(X), batch_size):
        # Slice out one mini-batch
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        
        # SGD update
        y_pred_sgd = X_batch.dot(theta_sgd)
        loss_sgd = 0.5 * np.mean((y_pred_sgd - y_batch) ** 2)
        epoch_loss_sgd += loss_sgd
        gradient_sgd = X_batch.T.dot(y_pred_sgd - y_batch) / batch_size
        theta_sgd -= learning_rate * gradient_sgd
        
        # Momentum SGD update
        y_pred_momentum = X_batch.dot(theta_momentum)
        loss_momentum = 0.5 * np.mean((y_pred_momentum - y_batch) ** 2)
        epoch_loss_momentum += loss_momentum
        gradient_momentum = X_batch.T.dot(y_pred_momentum - y_batch) / batch_size
        v = momentum * v + learning_rate * gradient_momentum
        theta_momentum -= v
        
        # RMSProp update: divide the step by the running RMS of the gradient
        y_pred_rmsprop = X_batch.dot(theta_rmsprop)
        loss_rmsprop = 0.5 * np.mean((y_pred_rmsprop - y_batch) ** 2)
        epoch_loss_rmsprop += loss_rmsprop
        gradient_rmsprop = X_batch.T.dot(y_pred_rmsprop - y_batch) / batch_size
        r = beta * r + (1 - beta) * gradient_rmsprop ** 2
        theta_rmsprop -= learning_rate * gradient_rmsprop / (np.sqrt(r) + epsilon)
    
    losses_sgd.append(epoch_loss_sgd / (len(X) // batch_size))
    losses_momentum.append(epoch_loss_momentum / (len(X) // batch_size))
    losses_rmsprop.append(epoch_loss_rmsprop / (len(X) // batch_size))
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}, SGD Loss: {losses_sgd[-1]:.4f}, Momentum Loss: {losses_momentum[-1]:.4f}, RMSProp Loss: {losses_rmsprop[-1]:.4f}")

# Print the final parameters
print(f"\nSGD Final parameters: theta0={theta_sgd[0][0]:.4f}, theta1={theta_sgd[1][0]:.4f}")
print(f"Momentum Final parameters: theta0={theta_momentum[0][0]:.4f}, theta1={theta_momentum[1][0]:.4f}")
print(f"RMSProp Final parameters: theta0={theta_rmsprop[0][0]:.4f}, theta1={theta_rmsprop[1][0]:.4f}")

# Plot the loss curves side by side
plt.figure(figsize=(10, 6))
plt.plot(range(n_epochs), losses_sgd, label='SGD')
plt.plot(range(n_epochs), losses_momentum, label='Momentum SGD')
plt.plot(range(n_epochs), losses_rmsprop, label='RMSProp')
plt.title('Optimization Algorithms Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot the data and the fitted lines
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Data')
plt.plot(x, X.dot(theta_sgd), color='red', label='SGD Fit')
plt.plot(x, X.dot(theta_momentum), color='green', label='Momentum SGD Fit')
plt.plot(x, X.dot(theta_rmsprop), color='blue', label='RMSProp Fit')
plt.title('Linear Regression: Different Optimization Algorithms')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

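Analysis of Results

RMSProp adapts the step size of each parameter and therefore often converges to the optimum faster. In this case all three algorithms fit the model, and RMSProp tends to bring the loss down in the fewest epochs. With a constant learning rate and single-sample updates, its loss still fluctuates slightly near the optimum.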
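To make "adaptive learning rate" concrete, the sketch below computes a single update for two coordinates whose gradients differ by four orders of magnitude. It uses the same update rule and the same `beta` and `epsilon` values as Case 3. The gradient values themselves are made up for illustration.

```python
import numpy as np

learning_rate = 0.01
beta = 0.9
epsilon = 1e-8

# Two coordinates whose gradients differ by four orders of magnitude
gradient = np.array([100.0, 0.01])

# Plain SGD step: proportional to the raw gradient
sgd_step = learning_rate * gradient

# First RMSProp step, starting from r = 0 as in Case 3
r = np.zeros_like(gradient)
r = beta * r + (1 - beta) * gradient ** 2
rmsprop_step = learning_rate * gradient / (np.sqrt(r) + epsilon)

print("SGD step:    ", sgd_step)
print("RMSProp step:", rmsprop_step)
```

The SGD steps inherit the scale gap of the gradients, while the two RMSProp steps have nearly the same magnitude, learning_rate / sqrt(1 - beta). Because `r` starts at zero and is not bias-corrected, these first steps are larger than the nominal learning rate.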

Case 4: Implementing the Adam Optimizer

Problem

Implement the Adam optimizer and compare it against all of the previous optimizers.

Solution

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 2 + 3x + Gaussian noise
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1

# Add the bias column x0 = 1
X = np.hstack([np.ones((100, 1)), x])

# Initialize parameters; all optimizers start from the same point
# so that the comparison is fair
theta_sgd = np.random.randn(2, 1)
theta_momentum = theta_sgd.copy()
theta_rmsprop = theta_sgd.copy()
theta_adam = theta_sgd.copy()

# Hyperparameters
learning_rate = 0.01
n_epochs = 1000
batch_size = 1
momentum = 0.9  # momentum coefficient
beta = 0.9  # RMSProp decay rate (same value as in Case 3)
beta1 = 0.9  # Adam first-moment decay rate
beta2 = 0.999  # Adam second-moment decay rate
epsilon = 1e-8  # small constant to avoid division by zero

# Average loss per epoch
losses_sgd = []
losses_momentum = []
losses_rmsprop = []
losses_adam = []

# Optimizer state
v = np.zeros_like(theta_momentum)  # momentum velocity
r = np.zeros_like(theta_rmsprop)  # RMSProp running average of squared gradients
m = np.zeros_like(theta_adam)  # Adam first moment
v_adam = np.zeros_like(theta_adam)  # Adam second moment

# Training loop
for epoch in range(n_epochs):
    # Shuffle the data at the start of each epoch
    indices = np.random.permutation(len(X))
    X_shuffled = X[indices]
    y_shuffled = y[indices]
    
    epoch_loss_sgd = 0
    epoch_loss_momentum = 0
    epoch_loss_rmsprop = 0
    epoch_loss_adam = 0
    
    for i in range(0, len(X), batch_size):
        # Slice out one mini-batch
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        
        # SGD update
        y_pred_sgd = X_batch.dot(theta_sgd)
        loss_sgd = 0.5 * np.mean((y_pred_sgd - y_batch) ** 2)
        epoch_loss_sgd += loss_sgd
        gradient_sgd = X_batch.T.dot(y_pred_sgd - y_batch) / batch_size
        theta_sgd -= learning_rate * gradient_sgd
        
        # Momentum SGD update
        y_pred_momentum = X_batch.dot(theta_momentum)
        loss_momentum = 0.5 * np.mean((y_pred_momentum - y_batch) ** 2)
        epoch_loss_momentum += loss_momentum
        gradient_momentum = X_batch.T.dot(y_pred_momentum - y_batch) / batch_size
        v = momentum * v + learning_rate * gradient_momentum
        theta_momentum -= v
        
        # RMSProp update (uses its own decay rate beta, not Adam's beta2)
        y_pred_rmsprop = X_batch.dot(theta_rmsprop)
        loss_rmsprop = 0.5 * np.mean((y_pred_rmsprop - y_batch) ** 2)
        epoch_loss_rmsprop += loss_rmsprop
        gradient_rmsprop = X_batch.T.dot(y_pred_rmsprop - y_batch) / batch_size
        r = beta * r + (1 - beta) * gradient_rmsprop ** 2
        theta_rmsprop -= learning_rate * gradient_rmsprop / (np.sqrt(r) + epsilon)
        
        # Adam update
        y_pred_adam = X_batch.dot(theta_adam)
        loss_adam = 0.5 * np.mean((y_pred_adam - y_batch) ** 2)
        epoch_loss_adam += loss_adam
        gradient_adam = X_batch.T.dot(y_pred_adam - y_batch) / batch_size
        
        # Update the first- and second-moment estimates
        m = beta1 * m + (1 - beta1) * gradient_adam
        v_adam = beta2 * v_adam + (1 - beta2) * gradient_adam ** 2
        
        # Bias correction; t is the global step count, starting at 1
        t = epoch * (len(X) // batch_size) + i // batch_size + 1
        m_hat = m / (1 - beta1 ** t)
        v_hat = v_adam / (1 - beta2 ** t)
        
        # Parameter update
        theta_adam -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    
    losses_sgd.append(epoch_loss_sgd / (len(X) // batch_size))
    losses_momentum.append(epoch_loss_momentum / (len(X) // batch_size))
    losses_rmsprop.append(epoch_loss_rmsprop / (len(X) // batch_size))
    losses_adam.append(epoch_loss_adam / (len(X) // batch_size))
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}, SGD Loss: {losses_sgd[-1]:.4f}, Momentum Loss: {losses_momentum[-1]:.4f}, RMSProp Loss: {losses_rmsprop[-1]:.4f}, Adam Loss: {losses_adam[-1]:.4f}")

# Print the final parameters
print(f"\nSGD Final parameters: theta0={theta_sgd[0][0]:.4f}, theta1={theta_sgd[1][0]:.4f}")
print(f"Momentum Final parameters: theta0={theta_momentum[0][0]:.4f}, theta1={theta_momentum[1][0]:.4f}")
print(f"RMSProp Final parameters: theta0={theta_rmsprop[0][0]:.4f}, theta1={theta_rmsprop[1][0]:.4f}")
print(f"Adam Final parameters: theta0={theta_adam[0][0]:.4f}, theta1={theta_adam[1][0]:.4f}")

# Plot the loss curves side by side
plt.figure(figsize=(12, 8))
plt.plot(range(n_epochs), losses_sgd, label='SGD')
plt.plot(range(n_epochs), losses_momentum, label='Momentum SGD')
plt.plot(range(n_epochs), losses_rmsprop, label='RMSProp')
plt.plot(range(n_epochs), losses_adam, label='Adam')
plt.title('Optimization Algorithms Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot the data and the fitted lines
plt.figure(figsize=(12, 8))
plt.scatter(x, y, label='Data')
plt.plot(x, X.dot(theta_sgd), color='red', label='SGD Fit')
plt.plot(x, X.dot(theta_momentum), color='green', label='Momentum SGD Fit')
plt.plot(x, X.dot(theta_rmsprop), color='blue', label='RMSProp Fit')
plt.plot(x, X.dot(theta_adam), color='purple', label='Adam Fit')
plt.title('Linear Regression: Different Optimization Algorithms')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

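Analysis of Results

Adam combines the strengths of momentum and RMSProp. It usually converges to the optimum quickly and is relatively insensitive to its hyperparameters. In this case Adam tends to be among the fastest to converge, with a comparatively smooth loss curve.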
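The bias-correction step in the Adam update is easy to overlook, so here is a small sketch of what it does on the very first step (t = 1), using the same `beta1`, `beta2`, and `epsilon` as Case 4. The gradient values are made up for illustration.

```python
import numpy as np

learning_rate = 0.01
beta1, beta2, epsilon = 0.9, 0.999, 1e-8
gradient = np.array([0.5, -2.0])

# First step (t = 1), starting from zero moments as in Case 4
m = (1 - beta1) * gradient
v = (1 - beta2) * gradient ** 2

# Without bias correction the ratio m / sqrt(v) is inflated early on,
# because sqrt(v) is shrunk toward zero more strongly than m
step_uncorrected = learning_rate * m / (np.sqrt(v) + epsilon)

# With bias correction the first step has magnitude close to learning_rate
t = 1
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
step_corrected = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

print("Uncorrected step:", step_uncorrected)
print("Corrected step:  ", step_corrected)
```

With correction, the first step is close to the learning rate in each coordinate, independent of the gradient scale. Without it, the first step is larger by a factor of (1 - beta1) / sqrt(1 - beta2), roughly 3.2 for these defaults.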

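Implementing Optimization Algorithms in Deep Learning

Case 5: Using Different Optimizers in a Neural Network

Problem

Train a simple neural network with different optimizers and compare their performance.

Solution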

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing: scale pixels to [0, 1] and one-hot encode the labels
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Model factory: builds a fresh model for each optimizer
def create_model():
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu'),
        Dense(64, activation='relu'),
        Dense(10, activation='softmax')
    ])
    return model

# Training function
def train_model(optimizer, name):
    model = create_model()
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    history = model.fit(x_train, y_train,
                        validation_split=0.2,
                        epochs=20,
                        batch_size=32,
                        verbose=1)
    
    # Evaluate on the test set
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"\n{name} Test Accuracy: {test_acc:.4f}")
    
    return history, test_acc

# Define the optimizers. The string 'sgd' creates plain SGD with momentum 0,
# so the momentum variant needs an explicit SGD(momentum=...) instance.
from tensorflow.keras.optimizers import SGD

optimizers = [
    ('sgd', 'SGD'),
    (SGD(learning_rate=0.01, momentum=0.9), 'SGD with Momentum'),
    ('rmsprop', 'RMSProp'),
    ('adam', 'Adam')
]

# Train one model per optimizer
histories = {}
test_accs = {}

for optimizer, display_name in optimizers:
    print(f"\n{'='*60}")
    print(f"Training with {display_name}...")
    print(f"{'='*60}")
    
    history, acc = train_model(optimizer, display_name)
    
    histories[display_name] = history
    test_accs[display_name] = acc

# Plot the training and validation loss curves
plt.figure(figsize=(12, 8))
for name, history in histories.items():
    plt.plot(history.history['loss'], label=f'{name} - Train')
    plt.plot(history.history['val_loss'], label=f'{name} - Val')
plt.title('Optimization Algorithms Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot the training and validation accuracy curves
plt.figure(figsize=(12, 8))
for name, history in histories.items():
    plt.plot(history.history['accuracy'], label=f'{name} - Train')
    plt.plot(history.history['val_accuracy'], label=f'{name} - Val')
plt.title('Optimization Algorithms Accuracy Curves')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Bar chart of test accuracy per optimizer
plt.figure(figsize=(10, 6))
names = list(test_accs.keys())
accs = list(test_accs.values())
plt.bar(names, accs)
plt.title('Test Accuracy Comparison')
plt.xlabel('Optimizer')
plt.ylabel('Accuracy')
plt.ylim(0.9, 1.0)
for i, acc in enumerate(accs):
    plt.text(i, acc + 0.001, f'{acc:.4f}', ha='center')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

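Analysis of Results

In neural network training the differences between optimizers tend to be more pronounced. The typical pattern is:

SGD: converges more slowly and may need more epochs to reach good performance.
Momentum SGD: converges faster than basic SGD, especially in the early epochs.
RMSProp: usually converges faster than momentum SGD and reaches a low loss sooner.
Adam: often the best performer of the four, with fast convergence and good final performance.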
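Statements such as "converges faster" are easier to compare when reduced to a number, for example the first epoch at which validation accuracy reaches a target. The helper below, `epochs_to_reach`, is a hypothetical utility written for this purpose. The accuracy lists are made-up placeholder values, not measured results.

```python
def epochs_to_reach(values, threshold):
    """Return the 1-based epoch at which `values` first reaches `threshold`,
    or None if it never does."""
    for epoch, value in enumerate(values, start=1):
        if value >= threshold:
            return epoch
    return None

# Made-up validation accuracies, for illustration only (not measured results)
example_val_accuracy = {
    'optimizer_a': [0.80, 0.88, 0.93, 0.95, 0.96],
    'optimizer_b': [0.90, 0.95, 0.96, 0.97, 0.97],
    'optimizer_c': [0.70, 0.75, 0.80, 0.85, 0.90],
}

summary = {name: epochs_to_reach(acc, 0.95)
           for name, acc in example_val_accuracy.items()}
print(summary)
```

With the Case 5 code, the same helper could be applied to `history.history['val_accuracy']` for each entry in `histories`.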

Code Example: Implementing a Custom Optimizer

Case 6: A Custom Implementation of the Adam Optimizer

import numpy as np
import matplotlib.pyplot as plt  # needed for the plots at the end of the script

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # first-moment estimates, one array per parameter
        self.v = None  # second-moment estimates, one array per parameter
        self.t = 0  # time step
    
    def initialize(self, params):
        """Initialize the first- and second-moment estimates to zero."""
        self.m = {}
        self.v = {}
        for key in params:
            self.m[key] = np.zeros_like(params[key])
            self.v[key] = np.zeros_like(params[key])
    
    def update(self, params, grads):
        """Perform one Adam step and return the updated parameters."""
        if self.m is None:
            self.initialize(params)
        
        self.t += 1
        updated_params = {}
        
        for key in params:
            # Update the first moment
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # Update the second moment
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            # Parameter update
            updated_params[key] = params[key] - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return updated_params

# Test the custom Adam optimizer
if __name__ == "__main__":
    # Generate noisy data from a quadratic function
    np.random.seed(42)
    x = np.random.rand(100, 1)
    y = 3 * x ** 2 + 2 * x + 1 + np.random.randn(100, 1) * 0.1
    
    # Model parameters
    params = {
        'w1': np.random.randn(1, 1),
        'w2': np.random.randn(1, 1),
        'b': np.random.randn(1, 1)
    }
    
    # Prediction function
    def predict(params, x):
        return params['w1'] * x ** 2 + params['w2'] * x + params['b']
    
    # Loss function (mean squared error)
    def compute_loss(params, x, y):
        y_pred = predict(params, x)
        return np.mean((y_pred - y) ** 2)
    
    # Analytic gradients of the loss with respect to each parameter
    def compute_grads(params, x, y):
        y_pred = predict(params, x)
        grads = {
            'w1': np.mean(2 * (y_pred - y) * x ** 2),
            'w2': np.mean(2 * (y_pred - y) * x),
            'b': np.mean(2 * (y_pred - y))
        }
        return grads
    
    # Create the optimizer
    optimizer = AdamOptimizer(learning_rate=0.01)
    
    # Train the model
    losses = []
    n_epochs = 1000
    
    for epoch in range(n_epochs):
        # Compute gradients
        grads = compute_grads(params, x, y)
        # Update parameters
        params = optimizer.update(params, grads)
        # Compute the loss after the update
        loss = compute_loss(params, x, y)
        losses.append(loss)
        
        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch+1}, Loss: {loss:.4f}")
    
    # Print the final parameters
    print(f"\nFinal parameters:")
    print(f"w1: {params['w1'][0][0]:.4f}")
    print(f"w2: {params['w2'][0][0]:.4f}")
    print(f"b: {params['b'][0][0]:.4f}")
    
    # Plot the loss curve
    plt.figure(figsize=(10, 6))
    plt.plot(range(n_epochs), losses)
    plt.title('Adam Optimizer Loss Curve')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.grid(True)
    plt.show()
    
    # Plot the data and the fitted curve
    plt.figure(figsize=(10, 6))
    plt.scatter(x, y, label='Data')
    x_sorted = np.sort(x, axis=0)
    plt.plot(x_sorted, predict(params, x_sorted), color='red', label='Adam Fit')
    plt.title('Quadratic Regression with Custom Adam Optimizer')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True)
    plt.show()

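Analysis of Results

The custom Adam optimizer trains the quadratic regression model successfully: the loss decreases and then levels off, and the fitted curve follows the data. The parameters move toward the true values (w1=3, w2=2, b=1). Because x and x^2 are strongly correlated on [0, 1], the individual parameters may not match the true values exactly after 1000 steps even when the fit is good.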
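When an optimizer is written by hand, a wrong gradient formula is a common reason for poor convergence. A finite-difference check is a cheap safeguard. The sketch below repeats the Case 6 model and gradient formulas and compares them against central differences. The function `numerical_grads`, the step size `h`, and the test parameter values are assumptions for this example.

```python
import numpy as np

def predict(params, x):
    return params['w1'] * x ** 2 + params['w2'] * x + params['b']

def compute_loss(params, x, y):
    return np.mean((predict(params, x) - y) ** 2)

def compute_grads(params, x, y):
    # Analytic gradients, same formulas as in Case 6
    err = predict(params, x) - y
    return {'w1': np.mean(2 * err * x ** 2),
            'w2': np.mean(2 * err * x),
            'b': np.mean(2 * err)}

def numerical_grads(params, x, y, h=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grads = {}
    for key in params:
        plus = dict(params)
        plus[key] = params[key] + h
        minus = dict(params)
        minus[key] = params[key] - h
        grads[key] = (compute_loss(plus, x, y) - compute_loss(minus, x, y)) / (2 * h)
    return grads

rng = np.random.default_rng(0)
x = rng.random((100, 1))
y = 3 * x ** 2 + 2 * x + 1
params = {'w1': np.array([[0.5]]), 'w2': np.array([[-1.0]]), 'b': np.array([[0.2]])}

analytic = compute_grads(params, x, y)
numeric = numerical_grads(params, x, y)
max_diff = max(abs(analytic[k] - numeric[k]) for k in params)
print(f"Largest analytic/numeric difference: {max_diff:.2e}")
```

A difference many orders of magnitude below the gradient values themselves indicates that the analytic formulas agree with the loss function.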

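Choosing and Tuning Optimization Algorithms

Choosing an Optimizer for Different Scenarios

Large datasets: prefer Adam or RMSProp; their adaptive learning rates can speed up convergence.
Small-batch training: momentum SGD or Adam can dampen the noise introduced by small batches.
Fine-tuning pretrained models: Adam usually works well and allows fine parameter adjustments at a small learning rate.
Generative models: Adam or one of its variants (such as AdamW) is usually a good choice.
Reinforcement learning: RMSProp and Adam perform well in many reinforcement learning algorithms.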

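Hyperparameter Tuning for Optimizers

Learning rate:
- A common starting point is 0.001
- SGD may need a larger learning rate (e.g. 0.01)
- For Adam, a smaller learning rate (e.g. 0.0001) may be more stable
Batch size:
- Smaller batch sizes (32-128) often generalize better
- Larger batch sizes exploit hardware parallelism and speed up training
Momentum:
- The momentum coefficient is usually set to around 0.9
- A larger momentum coefficient can speed up convergence but may cause oscillation
Adam hyperparameters:
- beta1: usually set to 0.9
- beta2: usually set to 0.999
- epsilon: usually set to 1e-8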
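The starting values above are only defaults, and a small sweep is the usual way to pick a learning rate for a given problem. The sketch below runs full-batch gradient descent on the linear regression data from Case 1 for a few candidate learning rates. The candidate grid, the number of steps, and the helper name `final_loss` are assumptions for illustration.

```python
import numpy as np

# Same synthetic data as in Case 1
np.random.seed(42)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.randn(100, 1) * 0.1
X = np.hstack([np.ones((100, 1)), x])

def final_loss(learning_rate, n_steps=200):
    """Run full-batch gradient descent from theta = 0 and return the final loss."""
    theta = np.zeros((2, 1))
    for _ in range(n_steps):
        gradient = X.T.dot(X.dot(theta) - y) / len(X)
        theta -= learning_rate * gradient
    return 0.5 * np.mean((X.dot(theta) - y) ** 2)

candidates = [0.001, 0.01, 0.1, 1.0, 2.0]
results = {lr: final_loss(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr:<6} final loss={loss:.4e}")
print("Best learning rate in this grid:", best_lr)
```

Very small learning rates leave the loss far from its minimum after the fixed step budget, while the largest candidate exceeds the stability limit of this problem and diverges. The usable range sits in between and is specific to the problem, which is why the defaults above are only starting points.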

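Common Problems and Solutions

Slow training:
- Try Adam or RMSProp
- Adjust the learning rate
- Increase the batch size
Unstable training:
- Reduce the learning rate
- Use gradient clipping
- Check that the data is normalized
Overfitting:
- Add regularization
- Use AdamW instead of Adam
- Use early stopping
Exploding gradients:
- Use gradient clipping
- Review the network architecture
- Change the initialization method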
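Early stopping, listed above as a remedy for overfitting, can be sketched as a simple patience rule on the validation loss. The function `early_stopping_epoch` and its parameters are hypothetical. The loss values are made up for illustration, not measured.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based. Training stops once the
    validation loss has failed to improve by more than min_delta for
    `patience` consecutive epochs."""
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses, for illustration only
example_val_losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.42, 0.47, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(example_val_losses, patience=3)
print(f"Stop at epoch {stop_epoch}, best epoch was {best_epoch}")
```

In a real training loop the weights from `best_epoch` would be restored after stopping. Keras provides a built-in `EarlyStopping` callback that does this.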

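Summary and Practical Advice

Why the optimizer matters: choosing a suitable optimizer can noticeably improve both training speed and final model performance.
Choosing in practice:
- For most deep learning tasks, Adam is a good default choice
- When finer control is needed, try SGD with Momentum
- Some specific tasks may call for other optimizers
Experiment and compare:
- In real projects, try several optimizers and compare their performance
- Different model architectures and datasets may favor different optimizers
Keep monitoring:
- Watch the loss curve and validation metrics closely during training
- Adjust the optimization strategy according to the training dynamics
Consider the whole picture:
- The optimizer is only one part of model training; data quality, model architecture, and regularization matter as well
- Final model performance is the combined result of many factors

This section covered how several optimization algorithms are implemented and how they behave. In a real project, choose the optimizer to suit the task at hand, then find a good hyperparameter configuration through experiments.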