Figure 3: The derivative of a first-layer parameter with respect to the loss, calculated by chaining derivatives across all network branches.
SGD on deep neural networks typically uses a minibatch instead of a single observation to approximate
the derivative at each iteration. The learning rate, which controls how quickly a parameter descends toward its cost-minimizing value, may itself need gradual adjustment during training. “A learning rate that is too small leads to painfully
slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to
fluctuate around the minimum or even to diverge” [68]. This relates to three problems with the standard SGD
update rule, namely, (1) the learning rate is not adjusted at different stages of learning, (2) the same learning rate
applies to all parameters, and (3) the learning rate does not depend on the local cost-function surface (just the
derivative). Newton et al. [61] provide an overview of stochastic gradient descent improvements for
optimization. As an example, Byrd et al. [10] adjust sample size in every iteration to a minimum value such that
the standard error of the estimated gradient is small relative to its norm. Iterate-averaging methods allow long
steps within the basic SGD iteration but average the resulting iterates offline, to account for the increased noise
in the iterates.
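To make the standard update rule concrete, the sketch below draws a random minibatch at each iteration and applies the same fixed learning rate to every parameter. It is a minimal illustration in plain NumPy; the names `minibatch_sgd`, `sgd_step`, and the assumed gradient callback `loss_grad` are illustrative and not taken from the paper.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD update: theta <- theta - lr * (minibatch gradient estimate)."""
    return [p - lr * g for p, g in zip(params, grads)]

def minibatch_sgd(params, data, targets, loss_grad,
                  lr=0.01, batch_size=32, epochs=10, seed=0):
    """Approximate the full gradient with a random minibatch at each iteration.

    `loss_grad(params, x, y)` is assumed to return the gradient of the loss
    on the minibatch (x, y) with respect to each parameter array.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        for _ in range(max(1, n // batch_size)):
            idx = rng.choice(n, size=batch_size, replace=False)
            grads = loss_grad(params, data[idx], targets[idx])
            # The same fixed lr applies to every parameter and every stage of
            # training -- the limitations of the standard rule noted above.
            params = sgd_step(params, grads, lr)
    return params
```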
A few variants of gradient descent have been found to be particularly efficient for deep neural networks.
Momentum SGD [65], one of the most common variants, accumulates gradients over a few update iterations.
Nesterov’s accelerated gradient [60], Adagrad [73], Adadelta [88] and RMSProp are a few other improvements
that work well on a case-by-case basis. The Adam optimizer [38], likely the most widely used in deeper
architectures, is an improvement on RMSProp. It accumulates running averages of both the gradients $\nabla_\theta L$ and their second-order moments $(\nabla_\theta L)^2$, then performs updates in proportion to the ratio of these two quantities, $m/\sqrt{v}$:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_\theta L; \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,(\nabla_\theta L)^2.$$
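A minimal sketch of these Adam updates, assuming the standard hyperparameter defaults and the bias-correction step from Kingma and Ba [38], could look as follows; the function name `adam_step` and the variable names are illustrative, not the authors' own code.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array.

    m, v: running averages of the gradient and its elementwise square.
    t:    1-based iteration counter, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # update ~ m / sqrt(v)
    return param, m, v
```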
2.2 Optimization Challenges
SGD optimization faces two crucial challenges, namely, underfitting and overfitting. First, the gradient of the loss with respect to early-layer parameters is small [77]. This results from nonlinear activations (e.g., sigmoid, tanh) that repeatedly squash their inputs into a small range such as [0,1], so the eventual output loss is relatively insensitive to early-layer parameters. A very small gradient means that the parameter values descend towards their cost-minimizing optima only slowly. This is called the vanishing gradient problem. In
practice, a very slow optimization means that parameters are suboptimal (and underfit) even after an extremely