In low-rank representation, one often does not need to assume global strong convexity; it suffices for strong convexity to hold on the set of points whose sparsity or rank does not exceed a given value. This property is called restricted strong convexity [28].
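As a rough illustration, one common form of this definition (which may differ in minor details from the exact condition used in [28]) is the following: a function f is restricted strongly convex with modulus μ > 0 over a set S (for example, the set of vectors with at most s nonzero entries, or the set of matrices of rank at most r) if
\[
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\,\| y - x \|^2, \qquad \forall\, x, y \in S.
\]
That is, the usual strong-convexity inequality is only required to hold on the restricted set S rather than on the whole space, which is a much weaker assumption for sparse or low-rank problems.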
IV. Conclusion
Although the basic theory of optimization algorithms is already fairly complete, the demand at the application level for faster optimization algorithms with lower complexity is endless. This article has given only a very brief commentary on the recent progress of first-order algorithms. Future theoretical breakthroughs should come in nonconvex optimization, while on the technical side stochastic algorithms will have to be relied upon fully in order to handle high-dimensional and massive data effectively.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61625301 and 61731018). Cong Fang and Huan Li also contributed to this article.
References
[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and
Tengyu Ma. Finding approximate local minima faster than gradient descent.
In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of
Computing, pages 1195–1199. ACM, 2017.
[2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating
minimization and projection methods for nonconvex problems: An
approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of
Operations Research, 35:438–457, 2010.
[3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding
algorithm for linear inverse problems. SIAM Journal on Imaging Sciences,
2(1):183–202, 2009.
[4] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition,
1999.
[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice-Hall, 1989.
[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization.
Cambridge University Press, 2004.
[7] J. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm
for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982,
2010.
[8] E.J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by
reweighted ℓ1 minimization. Journal of Fourier Analysis and
Applications, 14(5-6):877–905, 2008.
[9] Tianyi Chen, Georgios B. Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily
aggregated gradient for communication-efficient distributed learning.
arXiv preprint, arXiv:1805.09965, 2018.
[10] Edwin K. P. Chong and Stanislaw H. Zak. An Introduction to Optimization.
John Wiley & Sons, Inc., 4th edition, 2013.
[11] P. Domingos. A few useful things to know about machine learning.
Communications of the ACM, 55(10):78–87, 2012.
[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods
for online learning and stochastic optimization. Journal of Machine
Learning Research, 12(July):2121–2159, 2011.
[13] John Duchi and Yoram Singer. Efficient online and batch learning using
forward backward splitting. Journal of Machine Learning Research,
10(December):2899–2934, 2009.
[14] Cong Fang, Feng Cheng, and Zhouchen Lin. Faster and non-ergodic O(1/K)
stochastic alternating direction method of multipliers. In Advances in
Neural Information Processing Systems, 2017.
[15] Cong Fang and Zhouchen Lin. Parallel asynchronous stochastic variance
reduction for nonconvex optimization. In AAAI Conference on Artificial
Intelligence, 2017.
[16] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive
Sensing. Springer, 2013.
[17] Marguerite Frank and Philip Wolfe. An algorithm for quadratic
programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[18] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle
points–online stochastic gradient for tensor decomposition. In Conference
on Learning Theory, 2015.
[19] Robert Hannah and Wotao Yin. On unbounded delays in asynchronous
parallel fixed-point algorithms. Journal of Scientific Computing,
76(1):299–326, 2018.
[20] Bingsheng He, Min Tao, and Xiaoming Yuan. Alternating direction method
with Gaussian back substitution for separable convex programming. SIAM
Journal on Optimization, 22(2):313–340, 2012.
[21] M. Hong, Z.Q. Luo, and M. Razaviyayn. Convergence analysis of
alternating direction method of multipliers for a family of nonconvex
problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
[22] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex
optimization. In International Conference on Machine Learning, pages
427–435, 2013.
[23] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm
regularized problems. In International Conference on Machine Learning,
pages 471–478, 2010.
[24] D. Jakovetic, J. Xavier, and J. Moura. Fast distributed gradient methods.
IEEE Transactions on Automatic Control, 59:1131–1146, 2014.
[25] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I
Jordan. How to escape saddle points efficiently. In International
Conference on Machine Learning, pages 1724–1732, 2017.
[26] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using
predictive variance reduction. In Advances in Neural Information
Processing Systems, 2013.
[27] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. In ICLR, 2015.
[28] M.J. Lai and W. Yin. Augmented ℓ1 and nuclear-norm models with a
globally linearly convergent algorithm. SIAM Journal on Imaging Sciences,
6(2):1059–1091, 2013.
[29] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient
algorithms for decentralized and stochastic optimization. arXiv preprint,
arXiv:1701.03961, 2017.
[30] Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for
nonconvex programming. In Advances in Neural Information Processing
Systems, pages 612–620, 2015.
[31] Huan Li and Zhouchen Lin. Accelerated alternating direction method of
multipliers: an optimal O(1/K) nonergodic analysis. arXiv preprint,
arXiv:1608.06366, 2017.
[32] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with
adaptive penalty for low rank representation. In Advances in Neural
Information Processing Systems, pages 612–620, 2011.
[33] Zhouchen Lin, Risheng Liu, and Huan Li. Linearized alternating direction
method with parallel splitting and adaptive penalty for separable convex
programs in machine learning. Machine Learning, 99(2):287–325, 2015.
[34] Zhouchen Lin and Hongyang Zhang. Low-Rank Models in Visual Analysis:
Theories, Algorithms, and Applications. Academic Press, 2017.
[35] C. Lu, Z. Lin, and S. Yan. Smoothed low rank and sparse matrix recovery
by iteratively reweighted least squared minimization. IEEE Transactions
on Image Processing, 24(2):646–654, 2015.
[36] Canyi Lu, Jiashi Feng, Shuicheng Yan, and Zhouchen Lin. A unified
alternating direction method of multipliers by majorization minimization.
submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2017.
[37] Canyi Lu, Huan Li, Zhouchen Lin, and Shuicheng Yan. Fast proximal
linearized alternating direction method of multiplier with parallel splitting.
In AAAI Conference on Artificial Intelligence, pages 739–745, 2016.
[38] Michael W. Mahoney. Randomized algorithms for matrices and data.
Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[39] Julien Mairal. Incremental majorization-minimization optimization with
application to large-scale machine learning. SIAM Journal on Optimization,
25(2):829–855, 2015.
[40] Tomoya Murata and Taiji Suzuki. Doubly accelerated stochastic variance
reduced dual averaging method for regularized empirical risk minimization.
In Advances in Neural Information Processing Systems, pages 608–617,
2017.
[41] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent
optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[42] Y. Nesterov. A method of solving a convex programming problem with
convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376,
1983.
[43] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic
Course. Springer, 2004.
[44] Hua Ouyang, Niao He, Long Q. Tran, and Alexander Gray. Stochastic
alternating direction method of multipliers. In International Conference on
Machine Learning, 2013.
[45] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized
alternating direction method of multipliers. SIAM Journal on Imaging