DNN
The figure above shows a four-layer neural network with three hidden layers. We use $L$ for the number of layers, $n^{[l]}$ for the number of neurons in layer $l$, and $a^{[l]}$ for the post-activation output of layer $l$. The activation function $g^{[l]}$ is applied to $z^{[l]}$.
Forward and backward propagation
The forward propagation step can be written as:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \tag{1}$$

$$A^{[l]} = g^{[l]}(Z^{[l]}) \tag{2}$$
Forward propagation is initialized by feeding in $A^{[0]}$, that is, $X$: the input to the first layer. Here $a^{[0]}$ is the input-feature vector of a single training example, while $A^{[0]}$ stacks the input features of the whole training set. This is the input to the first forward function in the chain; repeating the step computes forward propagation from left to right.
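As a concrete illustration, here is a minimal NumPy sketch of one vectorized forward step per equations (1) and (2), assuming a made-up layer of 3 units fed by 2 input features, with $m = 4$ examples and ReLU as the activation:

```python
import numpy as np

# A minimal sketch of one vectorized forward step; layer sizes,
# m, and the random seed are all made up for illustration.
np.random.seed(0)
m = 4
A0 = np.random.randn(2, m)           # A[0] = X, shape (n[0], m)
W1 = np.random.randn(3, 2) * 0.01    # shape (n[1], n[0])
b1 = np.zeros((3, 1))                # broadcasts across the m columns

Z1 = np.dot(W1, A0) + b1             # equation (1)
A1 = np.maximum(0, Z1)               # equation (2), with g = ReLU

print(Z1.shape, A1.shape)            # (3, 4) (3, 4)
```

Note how the bias of shape $(n^{[1]}, 1)$ broadcasts across all $m$ columns, which is exactly what makes the vectorized form work.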
The backward propagation step can be written as:
$$dz^{[l]} = da^{[l]} * g'^{[l]}(z^{[l]}) \tag{3}$$

$$dw^{[l]} = dz^{[l]} \cdot a^{[l-1]T} \tag{4}$$

$$db^{[l]} = dz^{[l]} \tag{5}$$

$$da^{[l-1]} = W^{[l]T} \cdot dz^{[l]} \tag{6}$$
Vectorized form:
$$dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]}) \tag{7}$$

$$dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T} \tag{8}$$

$$db^{[l]} = \frac{1}{m} \text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims=True}) \tag{9}$$

$$dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]} \tag{10}$$
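The vectorized backward formulas can be sanity-checked numerically. Below is a sketch for a single sigmoid layer under a toy loss $\mathcal{L} = \sum A$; all sizes and the seed are made up, and the $\frac{1}{m}$ factors are dropped because this toy loss is a plain sum rather than an average:

```python
import numpy as np

# Numerical sanity check of equations (7)-(10) for one sigmoid layer.
np.random.seed(1)
n_prev, n, m = 3, 2, 5
A_prev = np.random.randn(n_prev, m)
W = np.random.randn(n, n_prev)
b = np.random.randn(n, 1)

def forward(W, b):
    Z = np.dot(W, A_prev) + b
    return 1 / (1 + np.exp(-Z))          # sigmoid

A = forward(W, b)
dA = np.ones_like(A)                     # dL/dA for L = sum(A)
dZ = dA * A * (1 - A)                    # equation (7): g'(Z) = s(1 - s)
dW = np.dot(dZ, A_prev.T)                # equation (8), no 1/m (loss is a sum)
db = np.sum(dZ, axis=1, keepdims=True)   # equation (9), no 1/m
dA_prev = np.dot(W.T, dZ)                # equation (10)

# compare dW[0, 0] against a one-sided finite difference
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
numeric = (forward(W_plus, b).sum() - A.sum()) / eps
print(abs(dW[0, 0] - numeric) < 1e-4)    # True
```

The same finite-difference comparison can be repeated for any entry of $dW$, $db$, or $dA^{[l-1]}$; this is the standard way to catch a missing transpose or sign error.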
Forward propagation produces predictions; backward propagation computes gradients. The gradients feed an optimizer such as gradient descent (or Newton's method) to solve for the weights $w$ and $b$, which are then updated with $w^{[l]} = w^{[l]} - \alpha \, dw^{[l]}$ and $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$.
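A sketch of the update step itself, with made-up parameter values and gradients standing in for the output of backpropagation:

```python
import numpy as np

# One gradient-descent update on toy values; dW and db would
# normally come from backpropagation.
alpha = 0.1
W = np.ones((2, 2))
b = np.zeros((2, 1))
dW = np.full((2, 2), 0.5)
db = np.full((2, 1), 0.2)

W = W - alpha * dW   # W := W - alpha * dW
b = b - alpha * db   # b := b - alpha * db
print(W[0, 0], b[0, 0])
```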
Checking matrix dimensions
When implementing a deep neural network, walking through the dimensions of every matrix in the algorithm with pencil and paper is a useful way to check the code for errors.
The dimension of $w$ is (size of next layer, size of previous layer), i.e. $w^{[l]}: (n^{[l]}, n^{[l-1]})$

The dimension of $b$ is (size of next layer, 1), i.e. $b^{[l]}: (n^{[l]}, 1)$

$z^{[l]}, a^{[l]}: (n^{[l]}, 1)$

$dw^{[l]}$ has the same dimensions as $w^{[l]}$, and $db^{[l]}$ the same as $b^{[l]}$. Vectorization leaves the dimensions of $w$ and $b$ unchanged, but the dimensions of $z$, $a$, and $x$ do change after vectorization.
After vectorization:

$Z^{[l]}$ can be viewed as the individual column vectors stacked side by side: $Z^{[l]} = \left(z^{[l](1)}, z^{[l](2)}, z^{[l](3)}, \ldots, z^{[l](m)}\right)$,

where $m$ is the training-set size, so the dimension is no longer $(n^{[l]}, 1)$ but $(n^{[l]}, m)$.

$A^{[l]}: (n^{[l]}, m)$
When implementing backpropagation for a deep neural network, always confirm that all matrix dimensions are consistent; doing so greatly improves the odds that the code runs correctly.
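The paper-and-pencil check above can be automated. A sketch, assuming an illustrative `layer_dims = [5, 4, 3, 1]` and $m = 10$ examples:

```python
import numpy as np

# Walk the dimensions through every layer and assert they match
# the rules above; layer sizes and m are illustrative.
layer_dims = [5, 4, 3, 1]
m = 10
A = np.zeros((layer_dims[0], m))                       # A[0] = X: (n[0], m)

for l in range(1, len(layer_dims)):
    W = np.zeros((layer_dims[l], layer_dims[l - 1]))   # W[l]: (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                   # b[l]: (n[l], 1)
    Z = np.dot(W, A) + b
    assert Z.shape == (layer_dims[l], m)               # Z[l]: (n[l], m)
    A = Z                                              # pretend g is the identity

print(A.shape)  # (1, 10)
```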
Code implementation
The overall approach is to build from simple to complex: first work out the single-layer / two-layer cases, then use those routines as helper functions when extending to L layers.
Initialization
Two-layer neural network
```python
def initialize_parameters(n_x, n_h, n_y):
    """
    Arguments:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    parameters -- a Python dictionary containing your parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    W1 = np.random.randn(n_h, n_x) * 0.01
    W2 = np.random.randn(n_y, n_h) * 0.01
    b1 = np.zeros((n_h, 1))
    b2 = np.zeros((n_y, 1))

    # fail fast on shape errors, which makes later debugging easier
    assert W1.shape == (n_h, n_x)
    assert b1.shape == (n_h, 1)
    assert W2.shape == (n_y, n_h)
    assert b2.shape == (n_y, 1)

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters
```
L-layer neural network
```python
def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- Python array (list) containing the dimensions of each layer

    Returns:
    parameters -- a Python dictionary containing the parameters "W1", "b1", ..., "WL", "bL":
        Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
        bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)

    for i in range(1, L):
        parameters['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
        parameters['b' + str(i)] = np.zeros((layer_dims[i], 1))

        assert parameters['W' + str(i)].shape == (layer_dims[i], layer_dims[i-1])
        assert parameters['b' + str(i)].shape == (layer_dims[i], 1)

    return parameters
```
Forward propagation
- LINEAR
- LINEAR -> ACTIVATION, where ACTIVATION is either ReLU or sigmoid
- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (the whole model)
Linear forward
The linear forward module (vectorized over all examples) computes the following equation:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \tag{1}$$
```python
def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from the previous layer (or input data): (size of previous layer, number of examples)
    W -- weight matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector: numpy array of shape (size of current layer, 1)

    Returns:
    Z -- the input of the activation function (also called the pre-activation parameter)
    cache -- a Python tuple containing "A", "W" and "b"; stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A) + b

    assert Z.shape == (W.shape[0], A.shape[1])
    cache = (A, W, b)

    return Z, cache
```
Linear-activation forward
Two activation functions are used:
Sigmoid: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$. The sigmoid function is already implemented (see appendix). It returns two values: the activation "A" and a "cache" containing "Z" (which is fed into the corresponding backward function). It is used as follows:
```python
A, activation_cache = sigmoid(Z)
```
ReLU: the mathematical form is $A = \mathrm{ReLU}(Z) = \max(0, Z)$. The relu function is also already implemented (see appendix). It likewise returns two values: the activation "A" and a "cache" containing "Z" (which is fed into the corresponding backward function). It is used as follows:
```python
A, activation_cache = relu(Z)
```
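A tiny numeric illustration of the two activations on made-up inputs:

```python
import numpy as np

# Sigmoid squashes into (0, 1); ReLU zeroes out negatives.
Z = np.array([[-1.0, 0.0, 2.0]])
A_sigmoid = 1 / (1 + np.exp(-Z))   # approx. [[0.269, 0.5, 0.881]]
A_relu = np.maximum(0, Z)
print(A_relu)  # [[0. 0. 2.]]
```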
Forward propagation for the LINEAR -> ACTIVATION layer. The mathematical relation is:
$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]} A^{[l-1]} + b^{[l]}) \tag{2}$$
```python
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from the previous layer (or input data): (size of previous layer, number of examples)
    W -- weight matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector: numpy array of shape (size of current layer, 1)
    activation -- the activation to be used in this layer, stored as a string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function (also called the post-activation value)
    cache -- a Python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    linear_cache = (A_prev, W, b)

    if activation == "sigmoid":
        Z = np.dot(W, A_prev) + b
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        Z = np.dot(W, A_prev) + b
        A, activation_cache = relu(Z)

    assert A.shape == (W.shape[0], A_prev.shape[1])
    cache = (linear_cache, activation_cache)

    return A, cache
```
L-layer model
- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (the whole model)
In the code below, the variable AL denotes $A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})$ (sometimes also written $\hat{Y}$).
```python
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed 0 to L-2)
        the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2

    for i in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev,
                                             parameters['W' + str(i)],
                                             parameters['b' + str(i)],
                                             "relu")
        caches.append(cache)

    AL, cache = linear_activation_forward(A,
                                          parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          "sigmoid")
    caches.append(cache)

    assert AL.shape == (1, X.shape[1])

    return AL, caches
```
Cost function
Compute the cross-entropy cost $J$ with the following formula:
$$-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{3}$$
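Plugging made-up predictions and labels into equation (3) as a quick sanity check:

```python
import numpy as np

# Cross-entropy cost of equation (3) on three invented examples.
AL = np.array([[0.8, 0.9, 0.4]])   # predicted probabilities
Y = np.array([[1, 1, 0]])          # true labels
m = Y.shape[1]

cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
print(round(float(cost), 4))       # 0.2798
```

Confident correct predictions (e.g. 0.9 for a positive label) contribute a small cost; confident wrong ones would blow the cost up, since $\log$ diverges near 0.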
```python
def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (3).

    Arguments:
    AL -- probability vector corresponding to the label predictions, shape (1, number of examples)
    Y -- true "label" vector (e.g. 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    cost = np.squeeze(cost)  # turn e.g. [[17]] into 17

    assert cost.shape == ()

    return cost
```
Backward propagation
Linear backward
For layer $l$, the linear part is $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$.
Given the input $dZ^{[l]}$, we can compute the three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ using the following formulas:
$$dW^{[l]} = \frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1]T} \tag{4}$$

$$db^{[l]} = \frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \tag{5}$$

$$dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]} \tag{6}$$
```python
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- gradient of the cost with respect to the activation of the previous layer (l-1), same shape as A_prev
    dW -- gradient of the cost with respect to W (current layer l), same shape as W
    db -- gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dA_prev = np.dot(W.T, dZ)
    dW = 1 / m * np.dot(dZ, A_prev.T)
    db = 1 / m * np.sum(dZ, axis=1, keepdims=True)

    assert dA_prev.shape == A_prev.shape
    assert dW.shape == W.shape
    assert db.shape == b.shape

    return dA_prev, dW, db
```
Linear-activation backward
Two backward functions are provided in advance (see appendix):
sigmoid_backward: implements the backward propagation for the SIGMOID unit. Called as:
```python
dZ = sigmoid_backward(dA, activation_cache)
```
relu_backward: implements the backward propagation for the RELU unit. Called as:
```python
dZ = relu_backward(dA, activation_cache)
```
```python
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple (linear_cache, activation_cache) stored for computing the backward pass efficiently
    activation -- the activation used in this layer, stored as a string: "sigmoid" or "relu"

    Returns:
    dA_prev -- gradient of the cost with respect to the activation of the previous layer (l-1), same shape as A_prev
    dW -- gradient of the cost with respect to W (current layer l), same shape as W
    db -- gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db
```
L-layer model
```python
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (0 if non-cat, 1 if cat)
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (caches[l], for l = 0...L-2)
        the cache of linear_activation_forward() with "sigmoid" (caches[L-1])

    Returns:
    grads -- a dictionary with the gradients:
        grads["dA" + str(l)] = ...
        grads["dW" + str(l)] = ...
        grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # derivative of the cost with respect to AL
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # last layer: SIGMOID -> LINEAR
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="sigmoid")

    # remaining layers: RELU -> LINEAR, from layer L-1 down to 1
    for i in reversed(range(L - 1)):
        current_cache = caches[i]
        dA_prev_temp, dW_temp, db_temp = \
            linear_activation_backward(grads["dA" + str(i + 2)], current_cache, activation="relu")
        grads["dA" + str(i + 1)] = dA_prev_temp
        grads["dW" + str(i + 1)] = dW_temp
        grads["db" + str(i + 1)] = db_temp

    return grads
```
Update parameters
Update the model parameters using gradient descent:
$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{7}$$

$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{8}$$
```python
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- Python dictionary containing the parameters
    grads -- Python dictionary containing the gradients (output of L_model_backward)
    learning_rate -- the learning rate alpha used in the update

    Returns:
    parameters -- Python dictionary containing the updated parameters:
        parameters["W" + str(l)] = ...
        parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2

    for i in range(1, L + 1):
        parameters['W' + str(i)] = parameters['W' + str(i)] - learning_rate * grads["dW" + str(i)]
        parameters['b' + str(i)] = parameters['b' + str(i)] - learning_rate * grads["db" + str(i)]

    return parameters
```
Appendix
```python
import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy

    Arguments:
    Z -- numpy array of any shape

    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z

    return A, cache


def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well, stored for computing the backward pass efficiently
    """
    A = np.maximum(0, Z)

    assert A.shape == Z.shape
    cache = Z

    return A, cache


def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so dA is left untouched
    dZ[Z <= 0] = 0                # the gradient is 0 wherever z <= 0

    assert dZ.shape == Z.shape

    return dZ


def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)

    assert dZ.shape == Z.shape

    return dZ
```