DNN
The figure above shows a four-layer neural network with three hidden layers. We use $L$ for the number of layers, $n^{[l]}$ for the number of neurons in layer $l$, and $a^{[l]}$ for the post-activation output of layer $l$. The activation function $g^{[l]}$ is applied to $z^{[l]}$.
Forward and backward propagation
The forward propagation step can be written as:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \tag{1}$$

$$A^{[l]} = g^{[l]}(Z^{[l]}) \tag{2}$$
Forward propagation is initialized by feeding in $A^{[0]}$, that is, $X$: the input to the first layer. Here $a^{[0]}$ is the input-feature vector of a single training example, while $A^{[0]}$ stacks the input features of the whole training set. This is the input to the first forward function in the chain; repeating the step computes forward propagation from left to right.
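As a concrete illustration, here is a minimal NumPy sketch of one vectorized forward step per equations (1) and (2), assuming a made-up layer of 3 units fed by 2 input features, with $m = 4$ examples and ReLU as the activation:

```python
import numpy as np

# A minimal sketch of one vectorized forward step; layer sizes,
# m, and the random seed are all made up for illustration.
np.random.seed(0)
m = 4
A0 = np.random.randn(2, m)           # A[0] = X, shape (n[0], m)
W1 = np.random.randn(3, 2) * 0.01    # shape (n[1], n[0])
b1 = np.zeros((3, 1))                # broadcasts across the m columns

Z1 = np.dot(W1, A0) + b1             # equation (1)
A1 = np.maximum(0, Z1)               # equation (2), with g = ReLU

print(Z1.shape, A1.shape)            # (3, 4) (3, 4)
```

Note how the bias of shape $(n^{[1]}, 1)$ broadcasts across all $m$ columns, which is exactly what makes the vectorized form work.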
The backward propagation step can be written as:
$$dz^{[l]} = da^{[l]} * g'^{[l]}(z^{[l]}) \tag{3}$$

$$dw^{[l]} = dz^{[l]} \cdot a^{[l-1]T} \tag{4}$$

$$db^{[l]} = dz^{[l]} \tag{5}$$

$$da^{[l-1]} = W^{[l]T} \cdot dz^{[l]} \tag{6}$$
Vectorized form:
$$dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]}) \tag{7}$$

$$dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T} \tag{8}$$

$$db^{[l]} = \frac{1}{m} \text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims=True}) \tag{9}$$

$$dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]} \tag{10}$$
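The vectorized backward formulas can be sanity-checked numerically. Below is a sketch for a single sigmoid layer under a toy loss $\mathcal{L} = \sum A$; all sizes and the seed are made up, and the $\frac{1}{m}$ factors are dropped because this toy loss is a plain sum rather than an average:

```python
import numpy as np

# Numerical sanity check of equations (7)-(10) for one sigmoid layer.
np.random.seed(1)
n_prev, n, m = 3, 2, 5
A_prev = np.random.randn(n_prev, m)
W = np.random.randn(n, n_prev)
b = np.random.randn(n, 1)

def forward(W, b):
    Z = np.dot(W, A_prev) + b
    return 1 / (1 + np.exp(-Z))          # sigmoid

A = forward(W, b)
dA = np.ones_like(A)                     # dL/dA for L = sum(A)
dZ = dA * A * (1 - A)                    # equation (7): g'(Z) = s(1 - s)
dW = np.dot(dZ, A_prev.T)                # equation (8), no 1/m (loss is a sum)
db = np.sum(dZ, axis=1, keepdims=True)   # equation (9), no 1/m
dA_prev = np.dot(W.T, dZ)                # equation (10)

# compare dW[0, 0] against a one-sided finite difference
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
numeric = (forward(W_plus, b).sum() - A.sum()) / eps
print(abs(dW[0, 0] - numeric) < 1e-4)    # True
```

The same finite-difference comparison can be repeated for any entry of $dW$, $db$, or $dA^{[l-1]}$; this is the standard way to catch a missing transpose or sign error.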
Forward propagation produces predictions; backward propagation computes gradients. The gradients feed an optimizer such as gradient descent (or Newton's method) to solve for the weights $w$ and $b$, which are then updated with $w^{[l]} = w^{[l]} - \alpha \, dw^{[l]}$ and $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$.
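A sketch of the update step itself, with made-up parameter values and gradients standing in for the output of backpropagation:

```python
import numpy as np

# One gradient-descent update on toy values; dW and db would
# normally come from backpropagation.
alpha = 0.1
W = np.ones((2, 2))
b = np.zeros((2, 1))
dW = np.full((2, 2), 0.5)
db = np.full((2, 1), 0.2)

W = W - alpha * dW   # W := W - alpha * dW
b = b - alpha * db   # b := b - alpha * db
print(W[0, 0], b[0, 0])
```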
Checking matrix dimensions
When implementing a deep neural network, walking through the dimensions of every matrix in the algorithm with pencil and paper is a useful way to check the code for errors.
The dimension of $w$ is (size of next layer, size of previous layer), i.e. $w^{[l]}: (n^{[l]}, n^{[l-1]})$

The dimension of $b$ is (size of next layer, 1), i.e. $b^{[l]}: (n^{[l]}, 1)$

$z^{[l]}, a^{[l]}: (n^{[l]}, 1)$

$dw^{[l]}$ has the same dimensions as $w^{[l]}$, and $db^{[l]}$ the same as $b^{[l]}$. Vectorization leaves the dimensions of $w$ and $b$ unchanged, but the dimensions of $z$, $a$, and $x$ do change after vectorization.
After vectorization:

$Z^{[l]}$ can be viewed as the individual column vectors stacked side by side: $Z^{[l]} = \left(z^{[l](1)}, z^{[l](2)}, z^{[l](3)}, \ldots, z^{[l](m)}\right)$,

where $m$ is the training-set size, so the dimension is no longer $(n^{[l]}, 1)$ but $(n^{[l]}, m)$.

$A^{[l]}: (n^{[l]}, m)$
When implementing backpropagation for a deep neural network, always confirm that all matrix dimensions are consistent; doing so greatly improves the odds that the code runs correctly.
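The paper-and-pencil check above can be automated. A sketch, assuming an illustrative `layer_dims = [5, 4, 3, 1]` and $m = 10$ examples:

```python
import numpy as np

# Walk the dimensions through every layer and assert they match
# the rules above; layer sizes and m are illustrative.
layer_dims = [5, 4, 3, 1]
m = 10
A = np.zeros((layer_dims[0], m))                       # A[0] = X: (n[0], m)

for l in range(1, len(layer_dims)):
    W = np.zeros((layer_dims[l], layer_dims[l - 1]))   # W[l]: (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                   # b[l]: (n[l], 1)
    Z = np.dot(W, A) + b
    assert Z.shape == (layer_dims[l], m)               # Z[l]: (n[l], m)
    A = Z                                              # pretend g is the identity

print(A.shape)  # (1, 10)
```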
Code implementation
The overall approach is to build from simple to complex: first work out the single-layer / two-layer cases, then use those routines as helper functions when extending to L layers.
Initialization
Two-layer neural network
```python
def initialize_parameters(n_x, n_h, n_y):
    """
    Arguments:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    parameters -- a Python dictionary containing your parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    W1 = np.random.randn(n_h, n_x) * 0.01
    W2 = np.random.randn(n_y, n_h) * 0.01
    b1 = np.zeros((n_h, 1))
    b2 = np.zeros((n_y, 1))

    # fail fast on shape errors, which makes later debugging easier
    assert W1.shape == (n_h, n_x)
    assert b1.shape == (n_h, 1)
    assert W2.shape == (n_y, n_h)
    assert b2.shape == (n_y, 1)

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters
```
L-layer neural network
```python
def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- Python array (list) containing the dimensions of each layer

    Returns:
    parameters -- a Python dictionary containing the parameters "W1", "b1", ..., "WL", "bL":
        Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
        bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)

    for i in range(1, L):
        parameters['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
        parameters['b' + str(i)] = np.zeros((layer_dims[i], 1))

        assert parameters['W' + str(i)].shape == (layer_dims[i], layer_dims[i-1])
        assert parameters['b' + str(i)].shape == (layer_dims[i], 1)

    return parameters
```
Forward propagation
- LINEAR
- LINEAR -> ACTIVATION, where ACTIVATION is either ReLU or sigmoid
- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (the whole model)
Linear forward
The linear forward module (vectorized over all examples) computes the following equation:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \tag{1}$$
```python
def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from the previous layer (or input data): (size of previous layer, number of examples)
    W -- weight matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector: numpy array of shape (size of current layer, 1)

    Returns:
    Z -- the input of the activation function (also called the pre-activation parameter)
    cache -- a Python tuple containing "A", "W" and "b"; stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A) + b

    assert Z.shape == (W.shape[0], A.shape[1])
    cache = (A, W, b)

    return Z, cache
```
Linear-activation forward
Two activation functions are used:
Sigmoid: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$. The sigmoid function is already implemented (see appendix). It returns two values: the activation "A" and a "cache" containing "Z" (which is fed into the corresponding backward function). It is used as follows:
```python
A, activation_cache = sigmoid(Z)
```
ReLU: the mathematical form is $A = \mathrm{ReLU}(Z) = \max(0, Z)$. The relu function is also already implemented (see appendix). It likewise returns two values: the activation "A" and a "cache" containing "Z" (which is fed into the corresponding backward function). It is used as follows:
```python
A, activation_cache = relu(Z)
```
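A tiny numeric illustration of the two activations on made-up inputs:

```python
import numpy as np

# Sigmoid squashes into (0, 1); ReLU zeroes out negatives.
Z = np.array([[-1.0, 0.0, 2.0]])
A_sigmoid = 1 / (1 + np.exp(-Z))   # approx. [[0.269, 0.5, 0.881]]
A_relu = np.maximum(0, Z)
print(A_relu)  # [[0. 0. 2.]]
```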
Forward propagation for the LINEAR -> ACTIVATION layer. The mathematical relation is:
$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]} A^{[l-1]} + b^{[l]}) \tag{2}$$
```python
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from the previous layer (or input data): (size of previous layer, number of examples)
    W -- weight matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector: numpy array of shape (size of current layer, 1)
    activation -- the activation to be used in this layer, stored as a string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function (also called the post-activation value)
    cache -- a Python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    linear_cache = (A_prev, W, b)

    if activation == "sigmoid":
        Z = np.dot(W, A_prev) + b
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        Z = np.dot(W, A_prev) + b
        A, activation_cache = relu(Z)

    assert A.shape == (W.shape[0], A_prev.shape[1])
    cache = (linear_cache, activation_cache)

    return A, cache
```
L-layer model
- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (the whole model)
In the code below, the variable AL denotes $A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})$ (sometimes also written $\hat{Y}$).
```python
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed 0 to L-2)
        the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2

    for i in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev,
                                             parameters['W' + str(i)],
                                             parameters['b' + str(i)],
                                             "relu")
        caches.append(cache)

    AL, cache = linear_activation_forward(A,
                                          parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          "sigmoid")
    caches.append(cache)

    assert AL.shape == (1, X.shape[1])

    return AL, caches
```
Cost function
Compute the cross-entropy cost $J$ with the following formula:
$$-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{3}$$
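Plugging made-up predictions and labels into equation (3) as a quick sanity check:

```python
import numpy as np

# Cross-entropy cost of equation (3) on three invented examples.
AL = np.array([[0.8, 0.9, 0.4]])   # predicted probabilities
Y = np.array([[1, 1, 0]])          # true labels
m = Y.shape[1]

cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
print(round(float(cost), 4))       # 0.2798
```

Confident correct predictions (e.g. 0.9 for a positive label) contribute a small cost; confident wrong ones would blow the cost up, since $\log$ diverges near 0.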
```python
def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (3).

    Arguments:
    AL -- probability vector corresponding to the label predictions, shape (1, number of examples)
    Y -- true "label" vector (e.g. 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    cost = np.squeeze(cost)  # turn e.g. [[17]] into 17

    assert cost.shape == ()

    return cost
```
Backward propagation
Linear backward
For layer $l$, the linear part is $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$.
Given the input $dZ^{[l]}$, we can compute the three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ using the following formulas:
$$dW^{[l]} = \frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1]T} \tag{4}$$

$$db^{[l]} = \frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \tag{5}$$

$$dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]} \tag{6}$$
```python
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- gradient of the cost with respect to the activation of the previous layer (l-1), same shape as A_prev
    dW -- gradient of the cost with respect to W (current layer l), same shape as W
    db -- gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dA_prev = np.dot(W.T, dZ)
    dW = 1 / m * np.dot(dZ, A_prev.T)
    db = 1 / m * np.sum(dZ, axis=1, keepdims=True)

    assert dA_prev.shape == A_prev.shape
    assert dW.shape == W.shape
    assert db.shape == b.shape

    return dA_prev, dW, db
```
Linear-activation backward
Two backward functions are provided in advance (see appendix):
sigmoid_backward: implements the backward propagation for the SIGMOID unit. Called as:
```python
dZ = sigmoid_backward(dA, activation_cache)
```
relu_backward: implements the backward propagation for the RELU unit. Called as:
```python
dZ = relu_backward(dA, activation_cache)
```
```python
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple (linear_cache, activation_cache) stored for computing the backward pass efficiently
    activation -- the activation used in this layer, stored as a string: "sigmoid" or "relu"

    Returns:
    dA_prev -- gradient of the cost with respect to the activation of the previous layer (l-1), same shape as A_prev
    dW -- gradient of the cost with respect to W (current layer l), same shape as W
    db -- gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db
```
L-layer model
```python
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (0 if non-cat, 1 if cat)
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (caches[l], for l = 0...L-2)
        the cache of linear_activation_forward() with "sigmoid" (caches[L-1])

    Returns:
    grads -- a dictionary with the gradients:
        grads["dA" + str(l)] = ...
        grads["dW" + str(l)] = ...
        grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # derivative of the cost with respect to AL
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # last layer: SIGMOID -> LINEAR
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="sigmoid")

    # remaining layers: RELU -> LINEAR, from layer L-1 down to 1
    for i in reversed(range(L - 1)):
        current_cache = caches[i]
        dA_prev_temp, dW_temp, db_temp = \
            linear_activation_backward(grads["dA" + str(i + 2)], current_cache, activation="relu")
        grads["dA" + str(i + 1)] = dA_prev_temp
        grads["dW" + str(i + 1)] = dW_temp
        grads["db" + str(i + 1)] = db_temp

    return grads
```
Update parameters
Update the model parameters using gradient descent:
$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{7}$$

$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{8}$$
```python
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- Python dictionary containing the parameters
    grads -- Python dictionary containing the gradients (output of L_model_backward)
    learning_rate -- the learning rate alpha used in the update

    Returns:
    parameters -- Python dictionary containing the updated parameters:
        parameters["W" + str(l)] = ...
        parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2

    for i in range(1, L + 1):
        parameters['W' + str(i)] = parameters['W' + str(i)] - learning_rate * grads["dW" + str(i)]
        parameters['b' + str(i)] = parameters['b' + str(i)] - learning_rate * grads["db" + str(i)]

    return parameters
```
Appendix
```python
import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy

    Arguments:
    Z -- numpy array of any shape

    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z

    return A, cache


def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well, stored for computing the backward pass efficiently
    """
    A = np.maximum(0, Z)

    assert A.shape == Z.shape
    cache = Z

    return A, cache


def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so dA is left untouched
    dZ[Z <= 0] = 0                # the gradient is 0 wherever z <= 0

    assert dZ.shape == Z.shape

    return dZ


def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)

    assert dZ.shape == Z.shape

    return dZ
```