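1. Overview of loss functions

PyTorch's loss functions live in torch.nn.functional, and torch.nn provides wrapped class versions of them. PyTorch ships roughly 18 loss functions (the exact count varies by version); the six most commonly used are:

Regression models: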

torch.nn.L1Loss
torch.nn.MSELoss

Classification models:

torch.nn.BCELoss
torch.nn.BCEWithLogitsLoss
torch.nn.CrossEntropyLoss
torch.nn.NLLLoss

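A loss function measures how far a single prediction of the model is from the true value:
$$Loss=f(\hat{y}-y)$$
Two related concepts are worth distinguishing. The cost function is the average of the loss over N predictions:
$$Cost=\frac{1}{N}\sum^N_{i=1}f(\hat{y}_i-y_i)$$
The objective function is the function that is ultimately optimized:
$$Obj=Cost+Regularization$$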
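
To make the three terms concrete, here is a minimal sketch. The data, the `weights` tensor, the strength `lam`, and the choice of squared error with an L2 penalty are all assumptions made for illustration; they do not come from any particular model.

```python
import torch

# Hypothetical toy data: three predictions and their true values.
y_hat = torch.tensor([1.0, 1.0, 1.0])
y = torch.tensor([1.1, 1.2, 1.3])

# Hypothetical model weights and regularization strength.
weights = torch.tensor([0.5, -0.5])
lam = 0.01

loss = (y_hat - y) ** 2                         # loss: one value per prediction
cost = loss.mean()                              # cost: average over the N predictions
objective = cost + lam * (weights ** 2).sum()   # objective: cost + L2 regularization
```

The objective is what an optimizer would minimize; the loss and the cost are intermediate quantities on the way there.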

PyTorch has other loss functions as well, but I do not understand them well enough yet and hope to cover them in the future.

2. Regression loss functions

Regression models are usually evaluated with one of two metrics: MAE (mean absolute error) and MSE (mean squared error).

torch.nn.L1Loss(reduction='mean')

This class implements the MAE loss:
$$\ell=L=\{l_1,\dots,l_N\},\quad l_n=|\hat{y}_n-y_n|$$

torch.nn.MSELoss(reduction='mean')

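This class implements the MSE loss:
$$\ell=L=\{l_1,\dots,l_N\},\quad l_n=(\hat{y}_n-y_n)^2$$
In both classes, reduction controls what happens after $\ell$ has been computed. It takes one of three values: none leaves $\ell$ untouched, sum adds up its elements, and mean averages them. The default is mean.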

>>> import torch
>>> from torch import nn

>>> y = torch.tensor([1.1, 1.2, 1.3])
>>> y_hat = torch.tensor([1., 1., 1.])

>>> criterion_none = nn.L1Loss(reduction='none') # no reduction
>>> criterion_none(y_hat, y)
tensor([0.1000, 0.2000, 0.3000])

>>> criterion_mean = nn.L1Loss(reduction='mean') # average
>>> criterion_mean(y_hat, y)
tensor(0.2000)

>>> criterion_sum = nn.L1Loss(reduction='sum') # sum
>>> criterion_sum(y_hat, y)
tensor(0.6000)
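
The same three reduction modes apply to nn.MSELoss. The sketch below reuses the toy tensors from the L1Loss example and only checks that sum and mean are what you get by reducing the per-element squared errors yourself.

```python
import torch
from torch import nn

y = torch.tensor([1.1, 1.2, 1.3])
y_hat = torch.tensor([1.0, 1.0, 1.0])

mse_none = nn.MSELoss(reduction='none')(y_hat, y)   # per-element squared errors
mse_mean = nn.MSELoss(reduction='mean')(y_hat, y)   # their average
mse_sum = nn.MSELoss(reduction='sum')(y_hat, y)     # their sum

print(mse_none, mse_mean, mse_sum)
```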

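3. Classification loss functions

3.1 Cross-entropy

Self-information is the negative logarithm of the probability that an event occurs:
$$I(x)=-\log p(x)$$
Information entropy describes the uncertainty of an event (more precisely, of a random variable with distribution $P$). It is the expected self-information:
$$H(P)=-\sum^N_{i=1}P(x_i)\log P(x_i)$$
A certain event has an entropy of 0, and the more uncertain an event is, the larger its entropy.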
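
A small sketch of the entropy formula in plain Python. The helper name `entropy` and the three example distributions are made up for illustration; the natural logarithm is assumed.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution p."""
    # Terms with p_i == 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

h_certain = entropy([1.0, 0.0])   # a certain event
h_biased = entropy([0.9, 0.1])    # a fairly predictable event
h_uniform = entropy([0.5, 0.5])   # the most uncertain case for two outcomes

print(h_certain, h_biased, h_uniform)
```

The certain event comes out at zero, and the value grows as the distribution gets closer to uniform.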

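Cross-entropy measures how much effort it takes to remove the system's uncertainty when, given the true distribution, we follow a strategy built on a different (non-true) distribution:
$$H(P,Q)=-\sum^N_{i=1}P(x_i)\log Q(x_i)$$
Relative entropy, also called K-L divergence, describes how far the predicted distribution deviates from the true one:
$$\begin{aligned}
D_{KL}(P,Q)&=E\bigg[\log\frac{P(x)}{Q(x)}\bigg]\\
&=E\bigg[\log P(x)-\log Q(x)\bigg]\\
&=\sum^N_{i=1}P(x_i)[\log P(x_i)-\log Q(x_i)]\\
&=\sum^N_{i=1}P(x_i)\log P(x_i)-\sum^N_{i=1}P(x_i)\log Q(x_i)\\
&=H(P,Q)-H(P)
\end{aligned}$$
The last step uses the cross-entropy expression above together with the definition of $H(P)$. Rearranging gives $H(P,Q)=H(P)+D_{KL}(P,Q)$: cross-entropy is the sum of the information entropy and the relative entropy. Here $P$ is the true distribution and $Q$ is the predicted one. Because $H(P)$ is fixed by the data, minimizing $H(P,Q)$ is equivalent to minimizing $D_{KL}(P,Q)$.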
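
The identity $H(P,Q)=H(P)+D_{KL}(P,Q)$ is easy to check numerically. In this sketch the distributions `p` and `q` are arbitrary made-up values, and the natural logarithm is assumed.

```python
import math

p = [0.7, 0.2, 0.1]   # hypothetical true distribution P
q = [0.5, 0.3, 0.2]   # hypothetical predicted distribution Q

h_p = -sum(pi * math.log(pi) for pi in p)                   # H(P)
h_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))      # H(P, Q)
d_kl = sum(pi * (math.log(pi) - math.log(qi))               # D_KL(P, Q)
           for pi, qi in zip(p, q))

print(h_pq, h_p + d_kl)   # the two values should agree up to rounding
```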

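3.2 Classification loss functions

The following covers the four most commonly used classification loss functions, along with KLDivLoss.

torch.nn.BCELoss(weight=None, reduction='mean')
This class implements binary cross-entropy.
$$l_n=-w_n[y_n\cdot \log x_n+(1-y_n)\cdot \log(1-x_n)]$$
When using this class, note that the input values (the predictions, not the class labels) must lie in $[0,1]$; otherwise an error is raised.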

>>> inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)
>>> target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)

>>> criterion = nn.BCELoss()
>>> criterion(inputs, target)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
...
RuntimeError: all elements of input should be between 0 and 1

The usual fix is to pass the data through a sigmoid first (F.sigmoid, or equivalently torch.sigmoid).
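
As a sketch of that fix, the same `inputs` and `target` as in the failing example go through torch.sigmoid before reaching nn.BCELoss. The `manual` line is only there to show that the result matches the formula for $l_n$ with all weights equal to 1.

```python
import torch
from torch import nn

inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)

probs = torch.sigmoid(inputs)        # squash the raw scores into (0, 1)
loss = nn.BCELoss()(probs, target)   # no error this time

# The same quantity written out from the formula, averaged over all elements.
manual = -(target * probs.log() + (1 - target) * (1 - probs).log()).mean()
```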

torch.nn.BCEWithLogitsLoss(weight=None, reduction='mean', pos_weight=None)
This is similar to torch.nn.BCELoss above, except that $x$ is passed through a sigmoid first, so there is no need to apply the sigmoid by hand.
$$l_n=-w_n[y_n\cdot \log\sigma(x_n)+(1-y_n)\cdot \log(1-\sigma(x_n))]$$
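
A quick way to convince yourself of this is to compare the two routes on the same data. This is a minimal sketch on toy tensors; the variable names are mine.

```python
import torch
from torch import nn

logits = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)

# One step: the sigmoid is applied inside the loss.
loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)

# Two steps: explicit sigmoid followed by BCELoss.
loss_two_steps = nn.BCELoss()(torch.sigmoid(logits), target)

print(loss_with_logits, loss_two_steps)   # should agree up to rounding
```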
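
torch.nn.NLLLoss(weight=None, ignore_index=-100, reduction='mean')
NLLLoss stands for "negative log likelihood loss". It expects log-probabilities as input (for example the output of nn.LogSoftmax); all it does is pick the entry for the target class and supply the minus sign of the negative log-likelihood.
$$\ell=L=\{l_1,\dots,l_N\},\quad l_n=-w_{y_n}x_{n,y_n}$$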
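
The formula says that, without weights, NLLLoss just selects $x_{n,y_n}$ and negates it. The sketch below checks this on toy data; the log-probabilities are produced with torch.log_softmax, which is my choice for the illustration.

```python
import torch
from torch import nn

logits = torch.tensor([[1, 1], [1, 2], [3, 3]], dtype=torch.float)
target = torch.tensor([0, 0, 1], dtype=torch.long)

log_probs = torch.log_softmax(logits, dim=1)   # NLLLoss expects log-probabilities
nll = nn.NLLLoss(reduction='none')(log_probs, target)

# By hand: take each row's log-probability at the target index and negate it.
manual = -log_probs[torch.arange(len(target)), target]
```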
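
torch.nn.CrossEntropyLoss(weight=None, ignore_index=-100, reduction='mean')
This class combines nn.LogSoftmax and nn.NLLLoss. Its computation can be written as:
$$\begin{aligned}
loss(class)&=weight[class]\bigg(-\log\bigg(\frac{\exp(x[class])}{\sum_j\exp(x[j])}\bigg)\bigg)\\
&=weight[class]\bigg(-x[class]+\log\bigg(\sum_j\exp(x[j])\bigg)\bigg)
\end{aligned}$$
Compare this with the formula for $H(P,Q)$ above. The true class of the sample is known, so the true distribution $P$ is one-hot: $P(x_i)$ is 1 for the true class and 0 for every other class. The sum over $i$ therefore collapses to a single term, and the expression reduces to $H(P,Q)=-\log Q(x)$. What remains is to turn the raw scores $x[j]$ into a probability distribution $Q$, which is what the softmax does.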
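
The claim that CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss can be checked directly. A minimal sketch on toy data, without class weights:

```python
import torch
from torch import nn

logits = torch.tensor([[1, 1], [1, 2], [3, 3]], dtype=torch.float)
target = torch.tensor([0, 0, 1], dtype=torch.long)

# One step: raw scores go straight into CrossEntropyLoss.
ce = nn.CrossEntropyLoss(reduction='none')(logits, target)

# Two steps: LogSoftmax over the class dimension, then NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
two_steps = nn.NLLLoss(reduction='none')(log_probs, target)

print(ce, two_steps)   # should agree up to rounding
```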

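torch.nn.KLDivLoss(reduction='mean')
This class implements the relative entropy described above. Here $x$ is expected to hold log-probabilities and $y$ probabilities.
$$l(x,y)=L=\{l_1,\dots,l_N\},\quad l_n=y_n\cdot(\log y_n-x_n)$$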
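
Here is a sketch of how the class lines up with the K-L divergence from section 3.1. The distributions are made up; note that the first argument is the log of the predicted distribution. I use reduction='batchmean' here because, as far as I know, the PyTorch documentation recommends it over 'mean' when you want the mathematical K-L value per sample.

```python
import torch
from torch import nn

p = torch.tensor([[0.7, 0.2, 0.1]])   # hypothetical true distribution (y)
q = torch.tensor([[0.5, 0.3, 0.2]])   # hypothetical predicted distribution

# Input: log-probabilities of the prediction. Target: probabilities.
kl = nn.KLDivLoss(reduction='batchmean')(q.log(), p)

# The same value from the definition of D_KL(P, Q).
manual = (p * (p.log() - q.log())).sum()
```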
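
These classes take similar parameters. Besides reduction, discussed above, there is weight; for NLLLoss and CrossEntropyLoss it holds one weight per class. The example below shows how cross-entropy and weight work. We first define some data and work through the computation by hand with NumPy: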

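import numpy as np
import torch
from torch import nn

inputs = torch.tensor([[1, 1], [1, 2], [3, 3]], dtype=torch.float)
target = torch.tensor([0, 0, 1], dtype=torch.long)

idx = 0                                    # index of the first sample

input_ = inputs.detach().numpy()[idx]      # [1., 1.]
target_ = target.numpy()[idx]              # 0, the target class of this sample

# First term: the score of the target class
x_class = input_[target_]

# Second term: log of the summed exponentials of all scores
sigma_exp_x = np.sum(np.exp(input_))
log_sigma_exp_x = np.log(sigma_exp_x)

# Resulting loss
loss_1 = -x_class + log_sigma_exp_x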

The result is:

>>> print("Loss of the first sample:", loss_1)
Loss of the first sample: 0.6931473

Now we compute the same thing with PyTorch:

>>> criterion_ce = nn.CrossEntropyLoss(reduction='none')
>>> criterion_ce(inputs, target)
tensor([0.6931, 1.3133, 0.6931])

As you can see, the results agree. Now let's look at weight:

>>> weight = torch.tensor([0.1, 0.9], dtype=torch.float)
>>> criterion_ce = nn.CrossEntropyLoss(weight=weight, reduction='none')
>>> criterion_ce(inputs, target)
tensor([0.0693, 0.1313, 0.6238])

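Comparing with the unweighted cross-entropy shows that each value has been multiplied by the weight of its target class, $w_{y_n}$ (0.1 for the first two samples and 0.9 for the third). With reduction set to sum, these weighted losses are simply added up. With mean, their sum is divided by the sum of the target-class weights, $\sum_n w_{y_n}$, rather than by the number of samples.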
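
To see how weight interacts with the sum and mean reductions, here is a minimal check on the same data. The name `target_weights` is mine; it holds the weight of each sample's target class.

```python
import torch
from torch import nn

inputs = torch.tensor([[1, 1], [1, 2], [3, 3]], dtype=torch.float)
target = torch.tensor([0, 0, 1], dtype=torch.long)
weight = torch.tensor([0.1, 0.9], dtype=torch.float)

per_sample = nn.CrossEntropyLoss(weight=weight, reduction='none')(inputs, target)
loss_sum = nn.CrossEntropyLoss(weight=weight, reduction='sum')(inputs, target)
loss_mean = nn.CrossEntropyLoss(weight=weight, reduction='mean')(inputs, target)

# Weight of each sample's own target class: 0.1, 0.1, 0.9.
target_weights = weight[target]

# 'sum' adds the weighted losses; 'mean' divides that sum by the summed target weights.
manual_sum = per_sample.sum()
manual_mean = per_sample.sum() / target_weights.sum()
```

So with class weights, mean is a weighted average and generally differs from `per_sample.mean()`.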

3.3 Summary

F.sigmoid + torch.nn.BCELoss = torch.nn.BCEWithLogitsLoss
nn.LogSoftmax + nn.NLLLoss = torch.nn.CrossEntropyLoss
