
Getting Started with Neural Networks

The article 零基础入门深度学习(3) - 神经网络和反向传播算法 gives a thorough, step-by-step account of the principles of deep learning. Combined with 李宏毅 (Hung-yi Lee)'s course, this post records my own study notes.

The activation function of a perceptron is the step function, whereas the activation function of a neuron is usually the sigmoid or tanh function, where

sigmoid(x)=\frac{1}{1+e^{-x}}, and the derivative of the sigmoid function can be expressed in terms of the function itself: if y=sigmoid(x), then y'=y(1-y).
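
Below is a minimal sketch (not from the original post) that checks this identity numerically against a finite-difference estimate; the sample point x = 0.5 is arbitrary.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative expressed through the function's own output: y' = y * (1 - y)
x = 0.5
y = sigmoid(x)
analytic = y * (1 - y)

# Finite-difference approximation for comparison
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(analytic, numeric)  # the two values should agree to about 6 decimal places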

Suppose there is a neural network with one input layer, one output layer, and three hidden layers, whose activation function f is the sigmoid function. Its weight matrices are W_1,W_2,W_3,W_4, the outputs of the hidden layers are \vec{a}_1,\vec{a}_2,\vec{a}_3, the network input is \vec{x}, and the output is \vec{y}, as shown in the figure below:

\vec{a}_1=f(W_1\centerdot\vec{x})\\ \vec{a}_2=f(W_2\centerdot\vec{a}_1)\\ \vec{a}_3=f(W_3\centerdot\vec{a}_2)\\ \vec{y}=f(W_4\centerdot\vec{a}_3)\\ \vec{y}=f(W_4\centerdot f(W_3\centerdot f(W_2\centerdot f(W_1\centerdot\vec{x}))))

As the composed expression above shows, without activation functions the output would be purely linear and could not model nonlinear curves, i.e.:

\vec{y}=W_4\centerdot W_3\centerdot W_2\centerdot W_1\centerdot\vec{x}=W\centerdot\vec{x}
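
A small illustrative sketch of the forward pass above and of this linear collapse; the layer sizes and random weights below are made up:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 3 inputs, three hidden layers of 4 nodes, 2 outputs
W1, W2, W3, W4 = (rng.normal(size=s) for s in [(4, 3), (4, 4), (4, 4), (2, 4)])
x = rng.normal(size=3)

# With a nonlinear activation, each layer output is f(W . previous output)
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
a3 = sigmoid(W3 @ a2)
y = sigmoid(W4 @ a3)

# Without an activation, the whole network collapses to a single linear map W = W4 W3 W2 W1
W = W4 @ W3 @ W2 @ W1
print(np.allclose(W4 @ (W3 @ (W2 @ (W1 @ x))), W @ x))  # True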

Backpropagation

The weight update rule is: w_{ji} += \eta\delta_jx_{ji}

where w_{ji} is the weight from node i to node j, \eta is the learning rate, \delta_j is the error term of node j, and x_{ji} is the output that node i passes to the next layer (for a bias term, the output is simply set to 1). The error terms are defined as follows:

  • Output-layer node
\delta_i=y_i(1-y_i)(t_i-y_i)

where y_i is the output value of node i and t_i is the target value of the sample at node i. For example, the error term of output node 8 is

\delta_8=y_8(1-y_8)(t_8-y_8)
  • Hidden-layer node
\delta_i=a_i(1-a_i)\sum_{k\in{outputs}}w_{ki}\delta_k

where a_i is the output value of node i, and \delta_k is the error term of node k in the next layer. For example, the error term of hidden-layer node 4 is:

\delta_4=a_4(1-a_4)(w_{84}\delta_8+w_{94}\delta_9)
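
A minimal numerical sketch of the two error-term formulas and the weight-update rule; the node outputs, targets, and weights below are made-up values, and the node indices 4, 8, 9 follow the example above.

eta = 0.2  # assumed learning rate

# Output-layer nodes 8 and 9 (illustrative values)
y8, t8 = 0.7, 1.0
y9, t9 = 0.2, 0.0
delta8 = y8 * (1 - y8) * (t8 - y8)
delta9 = y9 * (1 - y9) * (t9 - y9)

# Hidden-layer node 4 feeding into nodes 8 and 9
a4 = 0.6
w84, w94 = 0.3, -0.1
delta4 = a4 * (1 - a4) * (w84 * delta8 + w94 * delta9)

# Weight update w_ji += eta * delta_j * x_ji, e.g. the weight from node 4 to node 8;
# x_84 is node 4's output
w84 += eta * delta8 * a4
print(delta8, delta4, w84)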

Derivation

Take the (halved) sum of squared differences between the target values and the output values as the error of sample d (i.e., (\vec{x},\vec{t})):

E_d\equiv\frac{1}{2}\sum_{i\in outputs}(t_i-y_i)^2

From the forward-pass equations, E_d is a function of the weights w_{ji}. Since w_{ji} is used in the network only to compute node j's weighted input net_j, E_d is also a function of net_j. Optimizing with gradient descent:

w_{ji} -=\eta\frac{\partial{E_d}}{\partial{w_{ji}}}\\ \frac{\partial{E_d}}{\partial{w_{ji}}} =\frac{\partial{E_d}}{\partial{net_j}}\frac{\partial{net_j}}{\partial{w_{ji}}}\\ =\frac{\partial{E_d}}{\partial{net_j}}\frac{\partial{\sum_{i}{w_{ji}}x_{ji}}}{\partial{w_{ji}}}\\ =\frac{\partial{E_d}}{\partial{net_j}}x_{ji}

For \frac{\partial{E_d}}{\partial{net_j}}, two cases must be distinguished: the output layer and the hidden layers.

  • Output layer

From y_j=sigmoid(net_j) we get

\frac{\partial{E_d}}{\partial{net_j}} =\frac{\partial{E_d}}{\partial{y_j}}\frac{\partial{y_j}}{\partial{net_j}}\\

\frac{\partial{E_d}}{\partial{y_j}}=\frac{\partial}{\partial{y_j}}\frac{1}{2}\sum_{i\in outputs}(t_i-y_i)^2\\ =\frac{\partial}{\partial{y_j}}\frac{1}{2}(t_j-y_j)^2\\ =-(t_j-y_j)

\frac{\partial{y_j}}{\partial{net_j}}=\frac{\partial sigmoid(net_j)}{\partial{net_j}}\\ =y_j(1-y_j)

\frac{\partial{E_d}}{\partial{net_j}}=-(t_j-y_j)y_j(1-y_j)

Defining \delta_j=-\frac{\partial{E_d}}{\partial{net_j}}=(t_j-y_j)y_j(1-y_j), the output-layer update rule becomes

w_{ji} += \eta\delta_jx_{ji}

  • Hidden layer

First define Downstream(j), the set of all nodes immediately downstream of node j. Then

\frac{\partial{E_d}}{\partial{net_j}}=\sum_{k\in Downstream(j)}\frac{\partial{E_d}}{\partial{net_k}}\frac{\partial{net_k}}{\partial{net_j}}\\ =\sum_{k\in Downstream(j)}\frac{\partial{E_d}}{\partial{net_k}}\frac{\partial{net_k}}{\partial{a_j}}\frac{\partial{a_j}}{\partial{net_j}}\\ =\sum_{k\in Downstream(j)}\frac{\partial{E_d}}{\partial{net_k}}w_{kj}\frac{\partial{a_j}}{\partial{net_j}}\\ =\sum_{k\in Downstream(j)}\frac{\partial{E_d}}{\partial{net_k}}w_{kj}a_j(1-a_j)\\ =a_j(1-a_j)\sum_{k\in Downstream(j)}\frac{\partial{E_d}}{\partial{net_k}}w_{kj}

The original article's exposition here is slightly flawed. At this point only the downstream terms \frac{\partial{E_d}}{\partial{net_k}} are still undetermined, and they can be resolved as follows. Suppose the current layer is the last hidden layer, i.e., the layer just before the output layer; then \frac{\partial{E_d}}{\partial{net_k}} follows directly from the output-layer result derived above, which fixes \frac{\partial{E_d}}{\partial{net_j}} for the last hidden layer. If the current layer is the second-to-last hidden layer, the hidden-layer expression above applies again, and the downstream \frac{\partial{E_d}}{\partial{net_k}} values have already been computed, so this layer's values are determined as well, and so on. The values can thus be pushed backwards layer by layer from the output layer, which is where the backpropagation algorithm gets its name.

Similarly, to simplify the computation and notation, absorb the minus sign and let

\delta_j =-\frac{\partial{E_d}}{\partial{net_j}}\\ =a_j(1-a_j)\sum_{k\in Downstream(j)}-\frac{\partial{E_d}}{\partial{net_k}}w_{kj}\\ =a_j(1-a_j)\sum_{k\in Downstream(j)}\delta_kw_{kj}
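
As a sanity check on the derivation, here is a small sketch (not from the original post) that compares the backpropagated gradient \frac{\partial{E_d}}{\partial{w_{ji}}}=-\delta_j x_{ji} with a finite-difference estimate for one hidden-layer weight; the tiny network, input, and target below are made up.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))   # hidden layer: input [x, 1] -> 3 hidden nodes
W2 = rng.normal(size=3)        # output layer: 3 hidden nodes -> 1 output
x = np.array([0.8, 1.0])       # input with bias
t = 0.5                        # target value

def loss(W1, W2):
    a = sigmoid(W1 @ x)
    y = sigmoid(W2 @ a)
    return 0.5 * (t - y) ** 2

# Backpropagated gradient for the weight W1[0, 0]
a = sigmoid(W1 @ x)
y = sigmoid(W2 @ a)
delta_out = (t - y) * y * (1 - y)                 # output-layer error term
delta_h0 = a[0] * (1 - a[0]) * W2[0] * delta_out  # hidden node 0 error term
grad_bp = -delta_h0 * x[0]                        # dE/dw_ji = -delta_j * x_ji

# Finite-difference estimate of the same derivative
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
W1m = W1.copy(); W1m[0, 0] -= h
grad_fd = (loss(W1p, W2) - loss(W1m, W2)) / (2 * h)

print(grad_bp, grad_fd)  # the two values should agree closely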

Test demo

A feed-forward network with a single hidden layer is a universal function approximator; that is, with no restriction on the number of hidden nodes, a two-layer BP network (one hidden layer) can realize an arbitrary nonlinear mapping.

The goal is to fit the |sin(x)| curve with a BP network that contains a single hidden layer.

import numpy as np
import matplotlib.pyplot as plt
import random

data_num = 300
x_set = np.linspace(0, 10, num=data_num)
y_set = [np.abs(np.sin(i)) + np.random.random(1)[0] / 10 for i in x_set]
eta = 0.2
plt.plot(x_set, y_set)

# Initialize the weights randomly; they must not all be the same
W1 = np.random.random((8, 2))
W2 = np.random.randn(9)

def sigmoid(x):
    return 1 / (1 + np.e ** (-x))

def compute(vec_w, vec_x):
    return sigmoid(np.dot(vec_w, vec_x))

def forward(vec_x):
    vec_a = []
    for i in range(8):
        vec_a.append(compute(W1[i], vec_x))
    vec_a.append(1)
    pred_y = compute(W2, vec_a)
    return pred_y, vec_a

# Train for 2,000,000 iterations
for i in range(2000000):
    # Stochastic gradient descent (SGD)
    # j is the index of a randomly chosen sample
    j = random.choice(np.arange(data_num))
    # Forward pass: compute the predicted y
    vec_x = [x_set[j], 1]
    pred_y, vec_a = forward(vec_x)

    # Update the weights
    # First update W2

    delta_y = pred_y * (1 - pred_y) * (y_set[j] - pred_y)
    for k in range(9):
        W2[k] += eta * delta_y * vec_a[k]

    # Then update W1
    for l in range(8):
        delta_h = vec_a[l] * (1 - vec_a[l]) * W2[l] * delta_y
        for m in range(2):
            W1[l][m] += eta * delta_h * vec_x[m]

# Plot the trained model's predictions over the input range
y_set = []
for x in x_set:
    vec_x = [x, 1]
    pred_y, vec_a = forward(vec_x)
    y_set.append(pred_y)

plt.plot(x_set, y_set, color='coral',)
plt.show()

The hidden layer of this network has 8 neurons. After two million training iterations, the fitted curve is shown below:

The test demo leads to the following conclusions:

  1. The more training iterations, the better the fit.
  2. The learning rate \eta has to be tuned by hand through repeated trial; a small sweep of candidate values is sketched below.
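
A rough sketch (assumed, not from the original post) of such a manual sweep: a stripped-down version of the demo above is retrained with a few candidate values of \eta and the final mean squared error is compared.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_once(eta, steps=20000, seed=0):
    # Tiny 1-hidden-layer fit of |sin(x)|; a stand-in for the demo above,
    # used only to compare learning rates.
    rng = np.random.default_rng(seed)
    xs = np.linspace(0, 10, 100)
    ts = np.abs(np.sin(xs))
    W1 = rng.random((8, 2))
    W2 = rng.standard_normal(9)
    for _ in range(steps):
        j = rng.integers(len(xs))
        x = np.array([xs[j], 1.0])
        a = np.append(sigmoid(W1 @ x), 1.0)
        y = sigmoid(W2 @ a)
        d_y = y * (1 - y) * (ts[j] - y)
        d_h = a[:8] * (1 - a[:8]) * W2[:8] * d_y
        W2 += eta * d_y * a
        W1 += eta * np.outer(d_h, x)
    preds = [sigmoid(W2 @ np.append(sigmoid(W1 @ np.array([x, 1.0])), 1.0)) for x in xs]
    return float(np.mean((np.array(preds) - ts) ** 2))

# Try a few candidate learning rates and compare the final error
for eta in (0.05, 0.2, 0.5, 1.0):
    print(eta, train_once(eta))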

PyTorch framework

The test demo implemented with the PyTorch framework:

import numpy as np
from torch import nn
import torch
import matplotlib.pyplot as plt
import math

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            # Hidden layer 1 -> 8 with Sigmoid activation
            nn.Linear(1, 8), nn.Sigmoid(),
            # Output layer 8 -> 1 with no activation
            nn.Linear(8, 1)
        )

    def forward(self, input: torch.FloatTensor):
        return self.net(input)

# The GTX 1650 is no faster than the i5 9400F for this tiny model, so the CPU is used
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x_set = torch.linspace(-2 * math.pi, 2 * math.pi, 400)
# sin(x) plus a small amount of uniform noise in [-0.05, 0.05]
y_set = torch.sin(x_set) + (torch.rand_like(x_set) - 0.5) / 10

# 1-D => 2-D
# 400 => 400 x 1
x = x_set.unsqueeze(-1)
y = y_set.unsqueeze(-1)

net = Net()

# Adam works much better here
#optimizer = torch.optim.Adam(net.parameters(), lr=0.1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
loss_func = nn.MSELoss()

for i in range(20000):

    prediction = net(x)
    loss = loss_func(prediction, y)
    # Zero the gradients before backpropagation
    optimizer.zero_grad()
    # Backpropagate and update the weights
    loss.backward()
    optimizer.step()
    # Print the loss periodically
    if (i + 1) % 100 == 0:
        print("step: {0} , loss: {1}".format(i + 1, loss.item()))

predict = net(x)

plt.plot(x, y, label="fact")
plt.plot(x, predict.detach().numpy(), label="predict")
plt.title("sin function")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()

This version trains for only 20,000 iterations; the difference from the previous demo is that the output layer uses no activation function. The result is:

This leads to the following conclusions:

  1. The output layer should not use an activation function; otherwise (with a Sigmoid output) the part of the curve below the x-axis cannot be fitted.
  2. The number of hidden layers is positively correlated with fitting quality.
  3. ReLU performs better than Sigmoid, and its learning rate \eta is generally an order of magnitude smaller than with Sigmoid; see the sketch after this list.
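
A hedged sketch of what conclusion 3 refers to: the same network as before with the hidden activation swapped to ReLU; the smaller learning rate shown here is illustrative, not a tuned value.

from torch import nn
import torch

class ReluNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Same 1 -> 8 -> 1 structure, but with ReLU in the hidden layer
            nn.Linear(1, 8), nn.ReLU(),
            nn.Linear(8, 1)
        )

    def forward(self, input: torch.FloatTensor):
        return self.net(input)

net = ReluNet()
# With ReLU, a learning rate roughly an order of magnitude smaller than 0.5 is used here
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)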

FashionMNIST

Fashion-MNIST is meant as a drop-in replacement for the original MNIST dataset (the original handwritten digits are too easy, and recognition accuracy is already very high). torchvision.datasets already includes support for Fashion-MNIST.

The code below displays a few of the images. training_data contains 60,000 images; each element is an (image, label) pair, where the image holds 28*28 pixels and each pixel is a grayscale value in the range 0-1.

from torch import nn
import torch
import matplotlib.pyplot as plt
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="./data",
    train=True,
    download=False,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="./data",
    train=False,
    download=False,
    transform=ToTensor()
)

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
print(training_data[0][0].squeeze())  # inspect the 28 x 28 pixel tensor of the first image
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item() # random index
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i) # add a subplot (indices start at 1)
    plt.title(labels_map[label]) # class name for each image
    plt.axis("off") # hide the X and Y axis ticks
    plt.imshow(img.squeeze(), cmap="gray") # 1*28*28 => 28*28
plt.show()

The code for training on FashionMNIST with the same kind of fully connected network as before:

import numpy as np
from torch import nn
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="./data",
    train=True,
    download=False,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="./data",
    train=False,
    download=False,
    transform=ToTensor()
)

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}

train_dataloader = DataLoader(training_data, batch_size=100, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=100, shuffle=True)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 300), nn.ReLU(),
            nn.Linear(300, 600), nn.ReLU(),
            nn.Linear(600, 300), nn.ReLU(),
            nn.Linear(300, 10)
        )

    def forward(self, input: torch.FloatTensor):
        return self.net(input)

net = Net()

# Define the optimizer and loss function
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
loss_func = nn.MSELoss()

def customize_y(batch_y):
    # Convert a batch of integer class labels into one-hot vectors (100 => 100 x 10)
    # so that they match the network output for MSELoss
    length = len(batch_y)
    y = torch.zeros(length, 10)
    for row in range(length):
        column = batch_y[row]
        y[row][column] = 1
    return y

for epoch in range(10):
    loss = None
    for batch_pixel, batch_label in train_dataloader:
        # Each batch: 100 x 28 x 28 => 100 x 784
        batch_pixel = torch.flatten(batch_pixel, 1)
        batch_label = customize_y(batch_label)

        batch_prediction = net(batch_pixel)
        # print(batch_prediction.size())
        # print(batch_label.size())
        loss = loss_func(batch_prediction, batch_label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print("step: {0} , loss: {1}".format(epoch + 1, loss.item()))

# Evaluate accuracy on the test set
total_accuracy = []
for batch_pixel, batch_label in test_dataloader:
    batch_pixel = torch.flatten(batch_pixel, 1)
    batch_prediction = net(batch_pixel)

    batch_num = len(batch_prediction)
    correct = 0

    for i in range(batch_num):
        # The class with the largest score is the prediction
        max_index = batch_prediction[i].argmax().item()
        if max_index == batch_label[i].item():
            correct += 1

    accuracy = correct / batch_num
    total_accuracy.append(accuracy)
    print("accuracy:{:.2%}".format(accuracy))

print("total accuracy:{:.2%}".format(np.mean(total_accuracy)))

The demo above was trained for only 10 epochs and reached: total accuracy:88.51%
