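Paper notes: Machine learning: Trends, perspectives, and prospects

Overview

The paper was published in Science in 2015 by M. I. Jordan and T. M. Mitchell. As the title suggests, it is a survey of ML that introduces the basic concepts of machine learning, the state of the field, its applications, and the common learning algorithms. A beginner can use it to build a basic concept map of ML. The structure and main points of the paper are as follows: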

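Introduction: presents the concept of machine learning and the current state of the field
Drivers of machine-learning progress: the impact of machine learning on data-intensive industries
Core methods and recent progress
- Supervised learning: many methods based on learning a function mapping have been developed, including decision trees, decision forests, logistic regression, support vector machines, neural networks, kernel machines, and Bayesian classifiers. Among them, deep learning has achieved strong results on images, video, and audio.
- Unsupervised learning: mainly used for dimension reduction and clustering.
- Reinforcement learning: the information available in its training data is intermediate between supervised and unsupervised learning. Instead of a label for each training example, the training data only indicate whether an action is correct; if an action is incorrect, the problem of finding the correct action remains.
- In addition, blends of supervised and unsupervised learning have been proposed, such as semisupervised learning: "semisupervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels."
Emerging trends
- Application platforms that are increasingly "large-scale", "parallel", and "distributed"
- Growing attention to "privacy"
Opportunities and challenges
- Transfer learning
- Making privately held data public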

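Thoughts and reflections

My advisor asked me to translate this paper. After reading it, I find it too high-level throughout: it offers only a glimpse of the field, and reading it alone is far from enough as an introduction. That said, it gave me a rough idea of reinforcement learning and semisupervised learning, which I had not studied before, and that is no small gain.

The directions the authors propose in the final part, transfer learning and making data public, also seem to have left some traces in today's research. So the biggest takeaway from this paper is probably its thinking about where machine learning is headed.

In machine learning, academia has long been working on the problems of industry (perhaps this is true of every field). Industry currently relies mostly on supervised models, whose weakness is that they need large amounts of real data to fit the actual distribution. On one side, industry is moving toward large applications that collect massive data; on the other, academia has proposed semisupervised learning to reduce labeling work and small-sample learning to reduce the amount of training data required (including synthetic data generation, transfer learning, self-supervised learning, and few-shot learning), and has even turned to reinforcement learning, which learns from reward signals rather than labeled examples. The apparent trend is that future models should need fewer and fewer samples.

As for society's expectations of machine learning, as the paper says, the public looks forward to the convenience that machine-learning applications bring while fearing that the data they require will intrude on personal privacy. Reducing and encrypting features while increasing the information density of features may therefore become another direction of development.

Original text and Chinese translation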

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.

机器学习研究的是如何构建能够通过经验自动改进的计算机。它是当今发展最迅速的技术领域之一，处于计算机科学和统计学的交叉点，并且是人工智能和数据科学的核心。机器学习的最新进展既得益于新学习算法和理论的发展，也得益于在线数据和低成本计算资源的持续爆炸式增长。数据密集型的机器学习方法已被科学、技术和商业领域广泛采用，使医疗保健、制造、教育、金融建模、警务和市场营销等各行各业有了更多基于证据的决策。

Machine learning is a discipline focused on two interrelated questions: How can one construct computer systems that automatically improve through experience? and What are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? The study of machine learning is important both for addressing these fundamental scientific and engineering questions and for the highly practical computer software it has produced and fielded across many applications.

机器学习是一门专注于两个相互关联的问题的学科：“如何构建能够通过经验自动改进的计算机系统？”以及“支配所有学习系统（包括计算机、人类和组织）的基本统计、计算和信息论规律是什么？”。机器学习研究之所以重要，既在于它回答了这些基本的科学和工程问题，也在于它所产生并已部署到许多应用中的高度实用的计算机软件。

Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. Within artificial intelligence (AI), machine learning has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs. The effect of machine learning has also been felt broadly across computer science and across a range of industries concerned with data-intensive issues, such as consumer services, the diagnosis of faults in complex systems, and the control of logistics chains. There has been a similarly broad range of effects across empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways. See Fig. 1 for a depiction of some recent areas of application of machine learning.

在过去的二十年间，机器学习取得了巨大进步，从实验室里的新奇事物发展为广泛商用的实用技术。在人工智能（AI）领域，机器学习已经成为开发计算机视觉、语音识别、自然语言处理、机器人控制和其他应用的实用软件的首选方法。现在许多AI系统的开发人员都认识到：对于许多应用，通过向系统展示期望的输入输出示例来训练它，比预先设想所有可能输入的期望响应并进行人工编程要容易得多。计算机科学以及涉及数据密集型问题的一系列行业也广泛地感受到了机器学习的影响，例如消费者服务、复杂系统中的故障诊断以及物流链控制。从生物学到宇宙学再到社会科学，各门经验科学也受到了类似的广泛影响，因为人们开发了机器学习方法，以全新的方式分析高通量实验数据。有关机器学习的一些最新应用领域的描述，请参见图1。

A learning problem can be defined as the problem of improving some measure of performance when executing some task, through some type of training experience. For example, in learning to detect credit-card fraud, the task is to assign a label of “fraud” or “not fraud” to any given credit-card transaction. The performance metric to be improved might be the accuracy of this fraud classifier, and the training experience might consist of a collection of historical credit-card transactions, each labeled in retrospect as fraudulent or not. Alternatively, one might define a different performance metric that assigns a higher penalty when “fraud” is labeled “not fraud” than when “not fraud” is incorrectly labeled “fraud.” One might also define a different type of training experience—for example, by including unlabeled credit-card transactions along with labeled examples.

学习问题可以定义为通过某种类型的训练经验来提高执行某项任务时的某种性能度量的问题。例如，在学习检测信用卡欺诈时，任务是为任何给定的信用卡交易分配“欺诈”或“非欺诈”标签。待改进的性能指标可以是该欺诈分类器的准确率，训练经验可以是一组历史信用卡交易，其中每笔交易都在事后被标记为欺诈或非欺诈。或者，也可以定义一个不同的性能指标：把“欺诈”标记为“非欺诈”时的惩罚，高于把“非欺诈”错误地标记为“欺诈”时的惩罚。还可以定义一种不同类型的训练经验，例如在已标记示例之外加入未标记的信用卡交易。
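The two performance metrics in the credit-card fraud example can be made concrete with a minimal Python sketch. The function names, the toy labels, and the cost values `fn_cost` and `fp_cost` are illustrative assumptions, not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of transactions labeled correctly (1 = fraud, 0 = not fraud)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Average penalty where a missed fraud costs more than a false alarm.

    fn_cost and fp_cost are hypothetical values chosen for illustration.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:      # fraud labeled "not fraud"
            total += fn_cost
        elif t == 0 and p == 1:    # "not fraud" labeled fraud
            total += fp_cost
    return total / len(y_true)


y_true = [1, 0, 0, 1, 0, 0, 0, 0]
model_a = [0, 0, 0, 1, 0, 0, 0, 0]  # misses one fraudulent transaction
model_b = [1, 1, 0, 1, 0, 0, 0, 0]  # raises one false alarm

print(accuracy(y_true, model_a), accuracy(y_true, model_b))
print(weighted_cost(y_true, model_a), weighted_cost(y_true, model_b))
```

Both toy classifiers make exactly one mistake, so accuracy cannot tell them apart, while the asymmetric cost prefers the one that raises a false alarm over the one that misses a fraud.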

A diverse array of machine-learning algorithms has been developed to cover the wide variety of data and problem types exhibited across different machine-learning problems. Conceptually, machine-learning algorithms can be viewed as searching through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric. Machine-learning algorithms vary greatly, in part by the way in which they represent candidate programs (e.g., decision trees, mathematical functions, and general programming languages) and in part by the way in which they search through this space of programs (e.g., optimization algorithms with well-understood convergence guarantees and evolutionary search methods that evaluate successive generations of randomly mutated programs). Here, we focus on approaches that have been particularly successful to date.

人们已经开发了各种各样的机器学习算法，以涵盖不同机器学习问题中出现的多种数据和问题类型。从概念上讲，机器学习算法可以看作是在训练经验的指导下，在庞大的候选程序空间中进行搜索，以找到能够优化性能指标的程序。机器学习算法之间差异很大，部分在于它们表示候选程序的方式（例如决策树、数学函数和通用编程语言），部分在于它们在该程序空间中进行搜索的方式（例如具有明确收敛性保证的优化算法，以及对随机变异程序的逐代结果进行评估的进化搜索方法）。这里我们重点讨论迄今为止特别成功的方法。

Many algorithms focus on function approximation problems, where the task is embodied in a function (e.g., given an input transaction, output a “fraud” or “not fraud” label), and the learning problem is to improve the accuracy of that function, with experience consisting of a sample of known input-output pairs of the function. In some cases, the function is represented explicitly as a parameterized functional form; in other cases, the function is implicit and obtained via a search process, a factorization, an optimization procedure, or a simulation-based procedure. Even when implicit, the function generally depends on parameters or other tunable degrees of freedom, and training corresponds to finding values for these parameters that optimize the performance metric.

许多算法关注函数逼近问题：任务体现为一个函数（例如给定一笔输入交易，输出“欺诈”或“非欺诈”标签），学习问题是提高该函数的准确性，而经验由该函数的一组已知输入输出对样本构成。在某些情况下，该函数被显式表示为参数化的函数形式；在另一些情况下，该函数是隐式的，通过搜索过程、分解、优化过程或基于模拟的过程获得。即使是隐式的，该函数通常也取决于参数或其他可调自由度，而训练就是寻找能够优化性能指标的参数值。
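As a rough illustration of "training corresponds to finding values for these parameters", the sketch below fits a one-dimensional logistic function by plain gradient descent. The model choice, the toy data, the learning rate, and the number of epochs are all assumptions made for this example; the paper does not prescribe any of them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit f(x) = sigmoid(w * x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy "experience": known input-output pairs of the target function.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)
```

Here the function is represented explicitly as a parameterized form with two parameters, `w` and `b`, and the performance metric being optimized is the log loss rather than accuracy itself.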

Whatever the learning algorithm, a key scientific and practical goal is to theoretically characterize the capabilities of specific learning algorithms and the inherent difficulty of any given learning problem: How accurately can the algorithm learn from a particular type and volume of training data? How robust is the algorithm to errors in its modeling assumptions or to errors in the training data? Given a learning problem with a given volume of training data, is it possible to design a successful algorithm or is this learning problem fundamentally intractable? Such theoretical characterizations of machine-learning algorithms and problems typically make use of the familiar frameworks of statistical decision theory and computational complexity theory. In fact, attempts to characterize machine-learning algorithms theoretically have led to blends of statistical and computational theory in which the goal is to simultaneously characterize the sample complexity (how much data are required to learn accurately) and the computational complexity (how much computation is required) and to specify how these depend on features of the learning algorithm such as the representation it uses for what it learns. A specific form of computational analysis that has proved particularly useful in recent years has been that of optimization theory, with upper and lower bounds on rates of convergence of optimization procedures merging well with the formulation of machine-learning problems as the optimization of a performance metric.

无论使用哪种学习算法，一个关键的科学和实践目标都是从理论上刻画特定学习算法的能力以及任何给定学习问题的内在难度：该算法从特定类型和数量的训练数据中学习能达到多高的准确度？该算法对其建模假设中的错误或训练数据中的错误有多强的鲁棒性？给定一个具有一定数量训练数据的学习问题，是有可能设计出成功的算法，还是这个学习问题从根本上就难解？对机器学习算法和问题的这类理论刻画通常利用统计决策理论和计算复杂性理论这些为人熟知的框架。实际上，从理论上刻画机器学习算法的尝试促成了统计理论和计算理论的融合，其目标是同时刻画样本复杂度（准确学习需要多少数据）和计算复杂度（需要多少计算量），并说明二者如何依赖于学习算法的特征，例如算法用来表示所学内容的表示形式。近年来被证明特别有用的一种计算分析形式是优化理论：优化过程收敛速度的上界和下界，与把机器学习问题表述为性能指标的优化这一做法很好地结合在一起。

As a field of study, machine learning sits at the crossroads of computer science, statistics and a variety of other disciplines concerned with automatic improvement over time, and inference and decision-making under uncertainty. Related disciplines include the psychological study of human learning, the study of evolution, adaptive control theory, the study of educational practices, neuroscience, organizational behavior, and economics. Although the past decade has seen increased crosstalk with these other fields, we are just beginning to tap the potential synergies and the diversity of formalisms and experimental methods used across these multiple fields for studying systems that improve with experience.

作为一个研究领域，机器学习处于计算机科学、统计学以及其他多种学科的交汇处，这些学科关注随时间推移的自动改进，以及不确定性下的推理和决策。相关学科包括人类学习的心理学研究、进化研究、自适应控制理论、教育实践研究、神经科学、组织行为学和经济学。尽管在过去的十年中，机器学习与这些领域的交流有所增加，但对于这些领域在研究“随经验而改进的系统”时所使用的多种形式化方法和实验方法，以及其中潜在的协同作用，我们才刚刚开始发掘。

Drivers of machine-learning progress 机器学习进展的驱动力

The past decade has seen rapid growth in the ability of networked and mobile computing systems to gather and transport vast amounts of data, a phenomenon often referred to as “Big Data.” The scientists and engineers who collect such data have often turned to machine learning for solutions to the problem of obtaining useful insights, predictions, and decisions from such data sets. Indeed, the sheer size of the data makes it essential to develop scalable procedures that blend computational and statistical considerations, but the issue is more than the mere size of modern data sets; it is the granular, personalized nature of much of these data. Mobile devices and embedded computing permit large amounts of data to be gathered about individual humans, and machine-learning algorithms can learn from these data to customize their services to the needs and circumstances of each individual. Moreover, these personalized services can be connected, so that an overall service emerges that takes advantage of the wealth and diversity of data from many individuals while still customizing to the needs and circumstances of each. Instances of this trend toward capturing and mining large quantities of data to improve services and productivity can be found across many fields of commerce, science, and government. Historical medical records are used to discover which patients will respond best to which treatments; historical traffic data are used to improve traffic control and reduce congestion; historical crime data are used to help allocate local police to specific locations at specific times; and large experimental data sets are captured and curated to accelerate progress in biology, astronomy, neuroscience, and other data-intensive empirical sciences. We appear to be at the beginning of a decades-long trend toward increasingly data-intensive, evidence-based decision-making across many aspects of science, commerce, and government.

过去的十年间，网络和移动计算系统收集和传输海量数据的能力迅速增长，这种现象通常被称为“大数据”。收集此类数据的科学家和工程师经常求助于机器学习，以解决从这些数据集中获得有用的见解、预测和决策的问题。的确，庞大的数据规模使得开发兼顾计算和统计因素的可扩展方法变得至关重要，但问题不仅仅在于现代数据集的规模，更在于其中许多数据细粒度、个性化的特性。移动设备和嵌入式计算使人们能够收集大量关于个人的数据，机器学习算法可以从这些数据中学习，根据每个人的需求和情况定制服务。此外，这些个性化服务可以连接在一起形成一个整体服务，它既能利用来自众多个体的丰富多样的数据，又仍然根据每个人的需求和情况进行定制。在商业、科学和政府的许多领域中，都可以找到这种通过采集和挖掘大量数据来改善服务和提高生产率的趋势的实例。历史病历被用来发现哪些患者对哪种治疗的反应最好；历史交通数据被用来改善交通控制和减少拥堵；历史犯罪数据被用来帮助在特定时间将当地警力分配到特定地点；大型实验数据集被采集和整理，以加快生物学、天文学、神经科学和其他数据密集型经验科学的进展。我们似乎正处于一个将持续数十年的趋势的开端：科学、商业和政府的许多方面都在走向越来越数据密集、基于证据的决策。

With the increasing prominence of large-scale data in all areas of human endeavor has come a wave of new demands on the underlying machine-learning algorithms. For example, huge data sets require computationally tractable algorithms, highly personal data raise the need for algorithms that minimize privacy effects, and the availability of huge quantities of unlabeled data raises the challenge of designing learning algorithms to take advantage of it. The next sections survey some of the effects of these demands on recent work in machine-learning algorithms, theory, and practice.

随着大规模数据在人类活动的各个领域中日益突出，底层的机器学习算法也面临一系列新的要求。例如，海量数据集需要计算上易于处理的算法；高度私人化的数据提出了对尽量减小隐私影响的算法的需求；而大量未标记数据的出现，则带来了如何设计学习算法来利用这些数据的挑战。接下来的几节将概述这些需求对机器学习算法、理论和实践方面近期工作的一些影响。

Core methods and recent progress 核心方法和最新进展

The most widely used machine-learning methods are supervised learning methods. Supervised learning systems, including spam classifiers of e-mail, face recognizers over images, and medical diagnosis systems for patients, all exemplify the function approximation problem discussed earlier, where the training data take the form of a collection of (x, y) pairs and the goal is to produce a prediction y* in response to a query x*. The inputs x may be classical vectors or they may be more complex objects such as documents, images, DNA sequences, or graphs. Similarly, many different kinds of output y have been studied. Much progress has been made by focusing on the simple binary classification problem in which y takes on one of two values (for example, “spam” or “not spam”), but there has also been abundant research on problems such as multiclass classification (where y takes on one of K labels), multilabel classification (where y is labeled simultaneously by several of the K labels), ranking problems (where y provides a partial order on some set), and general structured prediction problems (where y is a combinatorial object such as a graph, whose components may be required to satisfy some set of constraints). An example of the latter problem is part-of-speech tagging, where the goal is to simultaneously label every word in an input sentence x as being a noun, verb, or some other part of speech. Supervised learning also includes cases in which y has real-valued components or a mixture of discrete and real-valued components.

最广泛使用的机器学习方法是监督学习方法。监督学习系统包括电子邮件的垃圾邮件分类器、图像上的人脸识别器以及面向患者的医疗诊断系统，它们都是前面讨论的函数逼近问题的实例，其中训练数据采用(x, y)对的集合形式，目标是针对查询x*给出预测y*。输入x可以是经典的向量，也可以是更复杂的对象，例如文档、图像、DNA序列或图。类似地，人们也研究了许多不同种类的输出y。在y取两个值之一（例如“垃圾邮件”或“非垃圾邮件”）的简单二分类问题上已经取得了很大进展，但对于其他问题也有大量研究，例如多类分类（其中y取K个标签中的一个）、多标签分类（其中y同时被K个标签中的几个标记）、排序问题（其中y给出某个集合上的偏序）以及一般的结构化预测问题（其中y是一个组合对象，例如图，其组成部分可能需要满足某些约束）。最后一类问题的一个例子是词性标注，其目标是同时把输入句子x中的每个单词标记为名词、动词或其他词性。监督学习还包括y具有实值分量或离散分量与实值分量混合的情况。

Supervised learning systems generally form their predictions via a learned mapping f(x), which produces an output y for each input x (or a probability distribution over y given x). Many different forms of mapping f exist, including decision trees, decision forests, logistic regression, support vector machines, neural networks, kernel machines, and Bayesian classifiers. A variety of learning algorithms has been proposed to estimate these different types of mappings, and there are also generic procedures such as boosting and multiple kernel learning that combine the outputs of multiple learning algorithms. Procedures for learning f from data often make use of ideas from optimization theory or numerical analysis, with the specific form of machine-learning problems (e.g., that the objective function or function to be integrated is often the sum over a large number of terms) driving innovations. This diversity of learning architectures and algorithms reflects the diverse needs of applications, with different architectures capturing different kinds of mathematical structures, offering different levels of amenability to post-hoc visualization and explanation, and providing varying trade-offs between computational complexity, the amount of data, and performance.

监督学习系统通常通过学到的映射f(x)形成预测，该映射为每个输入x生成输出y（或给定x时y上的概率分布）。映射f有许多不同的形式，包括决策树、决策森林、逻辑回归、支持向量机、神经网络、核机器和贝叶斯分类器。人们已经提出了各种学习算法来估计这些不同类型的映射，此外还有诸如boosting和多核学习之类的通用方法，用来结合多种学习算法的输出。从数据中学习f的过程经常利用优化理论或数值分析中的思想，而机器学习问题的特定形式（例如目标函数或被积函数往往是大量项之和）推动了创新。学习架构和算法的多样性反映了应用的不同需求：不同的架构捕获不同类型的数学结构，在事后可视化和解释方面提供不同程度的便利，并在计算复杂度、数据量和性能之间提供不同的权衡。

One high-impact area of progress in supervised learning in recent years involves deep networks, which are multilayer networks of threshold units, each of which computes some simple parameterized function of its inputs (9, 10). Deep learning systems make use of gradient-based optimization algorithms to adjust parameters throughout such a multilayered network based on errors at its output. Exploiting modern parallel computing architectures, such as graphics processing units originally developed for video gaming, it has been possible to build deep learning systems that contain billions of parameters and that can be trained on the very large collections of images, videos, and speech samples available on the Internet. Such large-scale deep learning systems have had a major effect in recent years in computer vision (11) and speech recognition (12), where they have yielded major improvements in performance over previous approaches (see Fig. 2). Deep network methods are being actively pursued in a variety of additional applications from natural language translation to collaborative filtering.

近年来监督学习中一个具有重大影响的进展领域是深度网络，它是由阈值单元组成的多层网络，其中每个单元都计算其输入的某个简单的参数化函数（9、10）。深度学习系统利用基于梯度的优化算法，根据输出端的误差来调整整个多层网络中的参数。借助现代并行计算架构（例如最初为视频游戏开发的图形处理单元），人们已经能够构建包含数十亿参数的深度学习系统，并利用互联网上的海量图像、视频和语音样本对其进行训练。近年来，这样的大规模深度学习系统在计算机视觉（11）和语音识别（12）中产生了重大影响，与以前的方法相比在性能上有了大幅提升（见图2）。从自然语言翻译到协同过滤，人们正在各种其他应用中积极探索深度网络方法。

The internal layers of deep networks can be viewed as providing learned representations of the input data. While much of the practical success in deep learning has come from supervised learning methods for discovering such representations, efforts have also been made to develop deep learning algorithms that discover useful representations of the input without the need for labeled training data (13). The general problem is referred to as unsupervised learning, a second paradigm in machine-learning research (2).

可以将深度网络的内部层视为提供了对输入数据的学习得到的表示。虽然深度学习的许多实际成功都来自用于发现这类表示的监督学习方法，但人们也在努力开发无需标记训练数据就能发现输入的有用表示的深度学习算法（13）。这个一般性问题被称为无监督学习，它是机器学习研究中的第二种范式（2）。

Broadly, unsupervised learning generally involves the analysis of unlabeled data under assumptions about structural properties of the data (e.g., algebraic, combinatorial, or probabilistic). For example, one can assume that data lie on a low-dimensional manifold and aim to identify that manifold explicitly from data. Dimension reduction methods—including principal components analysis, manifold learning, factor analysis, random projections, and autoencoders (1, 2)—make different specific assumptions regarding the underlying manifold (e.g., that it is a linear subspace, a smooth nonlinear manifold, or a collection of submanifolds). Another example of dimension reduction is the topic modeling framework depicted in Fig. 3. A criterion function is defined that embodies these assumptions—often making use of general statistical principles such as maximum likelihood, the method of moments, or Bayesian integration—and optimization or sampling algorithms are developed to optimize the criterion. As another example, clustering is the problem of finding a partition of the observed data (and a rule for predicting future data) in the absence of explicit labels indicating a desired partition. A wide range of clustering procedures has been developed, all based on specific assumptions regarding the nature of a “cluster.” In both clustering and dimension reduction, the concern with computational complexity is paramount, given that the goal is to exploit the particularly large data sets that are available if one dispenses with supervised labels.

广义地讲，无监督学习通常是在关于数据结构特性（例如代数、组合或概率特性）的假设下对未标记数据进行分析。例如，可以假设数据位于低维流形上，并力图从数据中显式地识别该流形。降维方法（包括主成分分析、流形学习、因子分析、随机投影和自动编码器（1、2））对底层流形做出了不同的具体假设（例如它是线性子空间、平滑非线性流形或子流形的集合）。降维的另一个示例是图3中描述的主题建模框架。人们定义一个体现这些假设的准则函数（通常利用一般的统计原理，例如最大似然、矩方法或贝叶斯积分），并开发优化或采样算法来优化该准则。再举一例，聚类是在没有显式标签指明期望划分的情况下，找到观测数据的一个划分（以及用于预测未来数据的规则）的问题。人们已经开发了各种各样的聚类方法，它们都基于关于“簇”的性质的特定假设。在聚类和降维中，对计算复杂度的关注都至关重要，因为其目标正是利用在不需要监督标签时可以获得的特别庞大的数据集。
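To illustrate clustering as "finding a partition of the observed data" under an assumption about what a cluster is, here is a minimal k-means sketch. K-means is one common procedure chosen for this example (the paper does not single it out); it assumes clusters are compact groups around a centroid in Euclidean distance. The points and initial centers are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (6.0, 6.0)])
print(centers)
```

The returned centers also give the "rule for predicting future data" mentioned in the text: a new point is assigned to whichever center is nearest.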

A third major machine-learning paradigm is reinforcement learning (14, 15). Here, the information available in the training data is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data in reinforcement learning are assumed to provide only an indication as to whether an action is correct or not; if an action is incorrect, there remains the problem of finding the correct action. More generally, in the setting of sequences of inputs, it is assumed that reward signals refer to the entire sequence; the assignment of credit or blame to individual actions in the sequence is not directly provided. Indeed, although simplified versions of reinforcement learning known as bandit problems are studied, where it is assumed that rewards are provided after each action, reinforcement learning problems typically involve a general control-theoretic setting in which the learning task is to learn a control strategy (a “policy”) for an agent acting in an unknown dynamical environment, where that learned strategy is trained to choose actions for any given state, with the objective of maximizing its expected reward over time. The ties to research in control theory and operations research have increased over the years, with formulations such as Markov decision processes and partially observed Markov decision processes providing points of contact (15, 16). Reinforcement-learning algorithms generally make use of ideas that are familiar from the control-theory literature, such as policy iteration, value iteration, rollouts, and variance reduction, with innovations arising to address the specific needs of machine learning (e.g., large-scale problems, few assumptions about the unknown dynamical environment, and the use of supervised learning architectures to represent policies). It is also worth noting the strong ties between reinforcement learning and many decades of work on learning in psychology and neuroscience, one notable example being the use of reinforcement learning algorithms to predict the response of dopaminergic neurons in monkeys learning to associate a stimulus light with subsequent sugar reward (17).

机器学习的第三种主要范式是强化学习（14、15）。它的训练数据中可用的信息介于监督学习和无监督学习之间。强化学习中的训练数据不提供指明给定输入的正确输出的训练样本，而只提供某个动作是否正确的指示；如果某个动作不正确，仍然存在寻找正确动作的问题。更一般地，在输入序列的情形下，假设奖励信号针对的是整个序列，而不会直接给出序列中各个动作的功劳或责任分配。尽管人们确实研究了强化学习的简化形式（称为赌博机问题，bandit problems），其中假定每次动作后都会提供奖励，但强化学习问题通常涉及一般的控制理论设定，其中学习任务是为在未知动态环境中行动的智能体学习一个控制策略（“policy”），学到的策略经过训练后可以针对任何给定状态选择动作，目标是最大化其随时间推移的期望回报。多年来，强化学习与控制理论和运筹学研究之间的联系日益紧密，马尔可夫决策过程和部分可观测马尔可夫决策过程等表述提供了联系点（15、16）。强化学习算法通常利用控制理论文献中为人熟知的思想，例如策略迭代、值迭代、rollout和方差缩减，并出现了一些创新来满足机器学习的特定需求（例如大规模问题、对未知动态环境只做很少的假设，以及使用监督学习架构来表示策略）。同样值得注意的是强化学习与心理学和神经科学中数十年来关于学习的研究之间的紧密联系，一个著名的例子是使用强化学习算法来预测猴子在学习将刺激光与随后的糖奖励联系起来时多巴胺能神经元的反应（17）。
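The bandit setting mentioned above, where a reward arrives after each action, is small enough to sketch directly. The example below uses an epsilon-greedy strategy, which is one simple choice and not a method named in the paper; the arm reward probabilities, `epsilon`, the step count, and the seed are all assumptions for illustration.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Pull arms with Bernoulli rewards, mostly exploiting the best estimate."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The learner only sees the reward, never the "correct" arm.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)
print(estimates)
```

The learner is never told which arm is best; it only observes whether each pull paid off, which is the sense in which the feedback is weaker than a supervised label.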

Although these three learning paradigms help to organize ideas, much current research involves blends across these categories. For example, semisupervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels. Model selection is the broad activity of using training data not only to fit a model but also to select from a family of models, and the fact that training data do not directly indicate which model to use leads to the use of algorithms developed for bandit problems and to Bayesian optimization procedures. Active learning arises when the learner is allowed to choose data points and query the trainer to request targeted information, such as the label of an otherwise unlabeled example. Causal modeling is the effort to go beyond simply discovering predictive relations among variables, to distinguish which variables causally influence others (e.g., a high white-blood-cell count can predict the existence of an infection, but it is the infection that causes the high white-cell count). Many issues influence the design of learning algorithms across all of these paradigms, including whether data are available in batches or arrive sequentially over time, how data have been sampled, requirements that learned models be interpretable by users, and robustness issues that arise when data do not fit prior modeling assumptions.

尽管这三种学习范式有助于梳理思路，但当前许多研究涉及这些类别的融合。例如，半监督学习在监督学习的情境中利用未标记数据来扩充标记数据，而判别式训练则将为无监督学习开发的架构与利用标签的优化表述相结合。模型选择是一类宽泛的工作：训练数据不仅用于拟合模型，还用于从一族模型中进行选择；而训练数据并不直接指明应使用哪个模型，这促使人们使用为赌博机问题开发的算法以及贝叶斯优化方法。当允许学习者选择数据点并向训练者查询以获取有针对性的信息（例如某个原本未标记示例的标签）时，就出现了主动学习。因果建模力图超越仅仅发现变量之间的预测关系，进而区分哪些变量对其他变量有因果影响（例如高白细胞计数可以预测感染的存在，但导致白细胞计数升高的是感染）。许多问题都会影响所有这些范式中学习算法的设计，包括数据是成批提供还是随时间顺序到达、数据是如何采样的、学到的模型需要能被用户解释的要求，以及数据不符合先前建模假设时出现的鲁棒性问题。

Emerging trends 新兴趋势

The field of machine learning is sufficiently young that it is still rapidly expanding, often by inventing new formalizations of machine-learning problems driven by practical applications. (An example is the development of recommendation systems, as described in Fig. 4.) One major trend driving this expansion is a growing concern with the environment in which a machine-learning algorithm operates. The word “environment” here refers in part to the computing architecture; whereas a classical machine-learning system involved a single program running on a single machine, it is now common for machine-learning systems to be deployed in architectures that include many thousands or tens of thousands of processors, such that communication constraints and issues of parallelism and distributed processing take center stage. Indeed, as depicted in Fig. 5, machine-learning systems are increasingly taking the form of complex collections of software that run on large-scale parallel and distributed computing platforms and provide a range of algorithms and services to data analysts.

机器学习领域还很年轻，仍在迅速扩展，这种扩展常常是通过针对实际应用提出机器学习问题的新形式化来实现的。（例如推荐系统的发展，如图4所示。）推动这种扩展的一个主要趋势是人们越来越关注机器学习算法的运行环境。这里的“环境”一词部分是指计算架构：传统的机器学习系统是在单个机器上运行的单个程序，而现在把机器学习系统部署在包含数千乃至数万个处理器的架构中已很常见，这使通信约束以及并行和分布式处理问题成为焦点。的确，如图5所示，机器学习系统越来越多地采用复杂软件集合的形式，这些软件在大型并行和分布式计算平台上运行，并为数据分析人员提供一系列算法和服务。

The word “environment” also refers to the source of the data, which ranges from a set of people who may have privacy or ownership concerns, to the analyst or decision-maker who may have certain requirements on a machine-learning system (for example, that its output be visualizable), and to the social, legal, or political framework surrounding the deployment of a system. The environment also may include other machine learning systems or other agents, and the overall collection of systems may be cooperative or adversarial. Broadly speaking, environments provide various resources to a learning algorithm and place constraints on those resources. Increasingly, machine-learning researchers are formalizing these relationships, aiming to design algorithms that are provably effective in various environments and explicitly allow users to express and control trade-offs among resources.

“环境”一词也指数据的来源，其范围从可能有隐私或所有权顾虑的一群人，到可能对机器学习系统有特定要求（例如要求其输出可以可视化）的分析师或决策者，再到围绕系统部署的社会、法律或政治框架。环境还可以包括其他机器学习系统或其他智能体，并且这些系统整体上可以是合作的，也可以是对抗的。广义上讲，环境为学习算法提供各种资源，并对这些资源施加约束。机器学习研究人员正越来越多地将这些关系形式化，旨在设计在各种环境中可被证明有效的算法，并明确允许用户表达和控制资源之间的权衡。

As an example of resource constraints, let us suppose that the data are provided by a set of individuals who wish to retain a degree of privacy. Privacy can be formalized via the notion of “differential privacy,” which defines a probabilistic channel between the data and the outside world such that an observer of the output of the channel cannot infer reliably whether particular individuals have supplied data or not (18). Classical applications of differential privacy have involved insuring that queries (e.g., “what is the maximum balance across a set of accounts?”) to a privatized database return an answer that is close to that returned on the nonprivate data. Recent research has brought differential privacy into contact with machine learning, where queries involve predictions or other inferential assertions (e.g., “given the data I’ve seen so far, what is the probability that a new transaction is fraudulent?”) (19, 20). Placing the overall design of a privacy-enhancing machine-learning system within a decision-theoretic framework provides users with a tuning knob whereby they can choose a desired level of privacy that takes into account the kinds of questions that will be asked of the data and their own personal utility for the answers. For example, a person may be willing to reveal most of their genome in the context of research on a disease that runs in their family but may ask for more stringent protection if information about their genome is being used to set insurance rates.

作为资源约束的一个示例，让我们假设数据由一组希望保留一定隐私的个人提供。隐私可以通过“差分隐私”的概念来形式化，它在数据与外部世界之间定义了一个概率通道，使得观察该通道输出的人无法可靠地推断出特定个人是否提供了数据（18）。差分隐私的经典应用是确保对隐私化数据库的查询（例如“一组帐户中的最大余额是多少？”）所返回的答案与在非隐私数据上返回的答案相近。最近的研究已将差分隐私与机器学习联系起来，其中查询涉及预测或其他推断性断言（例如“根据我到目前为止所看到的数据，一笔新交易是欺诈的概率是多少？”）（19、20）。将隐私增强型机器学习系统的总体设计放在决策理论框架内，可以为用户提供一个调节旋钮，使他们能够选择所需的隐私级别，这一选择会考虑将对数据提出的问题的种类，以及答案对他们个人的效用。例如，一个人可能愿意在针对其家族遗传疾病的研究中公开自己的大部分基因组，但如果其基因组信息被用于设定保险费率，则可能要求更严格的保护。
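The "probabilistic channel" idea can be sketched with the Laplace mechanism, a standard way to achieve differential privacy for numeric queries. Note the swap: the paper's example query is a maximum balance, whereas this sketch uses a counting query because its sensitivity is exactly 1. The account balances, the threshold, and the value of `epsilon` are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Answer "how many values satisfy predicate?" with epsilon-differential privacy.

    Adding or removing one person's record changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


balances = [120, 5400, 980, 15000, 40, 2300, 760, 8800]  # hypothetical accounts
rng = random.Random(0)
noisy = private_count(balances, lambda b: b > 1000, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller `epsilon` means more noise and stronger privacy, which is one concrete form of the tuning knob the paragraph describes.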

Communication is another resource that needs to be managed within the overall context of a distributed learning system. For example, data may be distributed across distinct physical locations because their size does not allow them to be aggregated at a single site or because of administrative boundaries. In such a setting, we may wish to impose a bit-rate communication constraint on the machine-learning algorithm. Solving the design problem under such a constraint will generally show how the performance of the learning system degrades under decrease in communication bandwidth, but it can also reveal how the performance improves as the number of distributed sites (e.g., machines or processors) increases, trading off these quantities against the amount of data (21, 22). Much as in classical information theory, this line of research aims at fundamental lower bounds on achievable performance and specific algorithms that achieve those lower bounds.

通信是另一种需要在分布式学习系统的整体环境中加以管理的资源。例如，数据可能分布在不同的物理位置，原因可能是数据规模太大而无法汇集到单个站点，也可能是存在管理边界。在这种情况下，我们可能希望对机器学习算法施加比特率通信约束。在这种约束下求解设计问题，通常可以揭示学习系统的性能如何随通信带宽的降低而下降，也可以揭示性能如何随分布式站点（例如机器或处理器）数量的增加而提高，并在这些量与数据量之间进行权衡（21、22）。与经典信息论类似，这方面的研究旨在得到可达性能的基本下界，以及能够达到这些下界的具体算法。

A major goal of this general line of research is to bring the kinds of statistical resources studied in machine learning (e.g., number of data points, dimension of a parameter, and complexity of a hypothesis class) into contact with the classical computational resources of time and space. Such a bridge is present in the “probably approximately correct” (PAC) learning framework, which studies the effect of adding a polynomial-time computation constraint on this relationship among error rates, training data size, and other parameters of the learning algorithm (3). Recent advances in this line of research include various lower bounds that establish fundamental gaps in performance achievable in certain machine-learning problems (e.g., sparse regression and sparse principal components analysis) via polynomial-time and exponential-time algorithms (23). The core of the problem, however, involves time-data tradeoffs that are far from the polynomial/exponential boundary. The large data sets that are increasingly the norm require algorithms whose time and space requirements are linear or sublinear in the problem size (number of data points or number of dimensions). Recent research focuses on methods such as subsampling, random projections, and algorithm weakening to achieve scalability while retaining statistical control (24, 25). The ultimate goal is to be able to supply time and space budgets to machine-learning systems in addition to accuracy requirements, with the system finding an operating point that allows such requirements to be realized.

这一研究方向的一个主要目标，是把机器学习所研究的各类统计资源（例如数据点数量、参数维数和假设类的复杂度）与时间、空间这些经典计算资源联系起来。“可能近似正确”（PAC）学习框架中就存在这样的桥梁，该框架研究在错误率、训练数据规模和学习算法其他参数之间的关系上施加多项式时间计算约束所产生的影响（3）。该方向的最新进展包括各种下界，这些下界表明，在某些机器学习问题（例如稀疏回归和稀疏主成分分析）中，多项式时间算法与指数时间算法所能达到的性能之间存在根本性差距（23）。然而，问题的核心在于时间与数据之间的权衡，而这些权衡离多项式/指数的分界还很远。日益成为常态的大型数据集要求算法的时间和空间需求相对于问题规模（数据点数或维数）是线性或次线性的。最近的研究集中于子采样、随机投影和算法弱化（algorithm weakening）等方法，以便在保持统计控制的同时实现可扩展性（24，25）。最终目标是除了准确性要求之外，还能向机器学习系统提供时间和空间预算，由系统找到一个能够满足这些要求的工作点。
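上文提到的随机投影可以用下面的小例子来体会。这只是一个基于高斯随机矩阵的示意性草图，并非文献（24，25）中的具体算法；维数 `d`、目标维数 `k` 以及各函数名都是笔者假设的。思路是把 d 维数据乘以一个 k×d 的随机矩阵（元素服从 N(0, 1/k)），然后检查点对之间的距离在投影前后是否大致保持。

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """生成 k x d 的高斯随机矩阵，元素方差为 1/k，使平方距离在期望意义下保持不变。"""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """把 d 维向量 x 投影到 k 维。"""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def distance(a, b):
    """欧氏距离。"""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def distortion_ratios(points, matrix):
    """对每一对点，返回 投影后距离 / 原始距离。"""
    projected = [project(matrix, p) for p in points]
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratios.append(
                distance(projected[i], projected[j]) / distance(points[i], points[j])
            )
    return ratios


if __name__ == "__main__":
    rng = random.Random(0)
    d, k, n = 500, 200, 5  # 假设的示例规模
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    matrix = random_projection_matrix(d, k, rng)
    ratios = distortion_ratios(points, matrix)
    print(min(ratios), max(ratios))
```

比值越接近 1，说明投影对距离的扭曲越小。一般来说 `k` 越大扭曲越小，但计算和存储开销也越大，这正是一种用统计精度换取时间和空间的折衷。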

Opportunities and challenges 机遇与挑战

Despite its practical and commercial successes, machine learning remains a young field with many underexplored research opportunities. Some of these opportunities can be seen by contrasting current machine-learning approaches to the types of learning we observe in naturally occurring systems such as humans and other animals, organizations, economies, and biological evolution. For example, whereas most machine-learning algorithms are targeted to learn one specific function or data model from one single data source, humans clearly learn many different skills and types of knowledge, from years of diverse training experience, supervised and unsupervised, in a simple-to-more-difficult sequence (e.g., learning to crawl, then walk, then run). This has led some researchers to begin exploring the question of how to construct computer lifelong or never-ending learners that operate nonstop for years, learning thousands of interrelated skills or functions within an overall architecture that allows the system to improve its ability to learn one skill based on having learned another (26–28). Another aspect of the analogy to natural learning systems suggests the idea of team-based, mixed-initiative learning. For example, whereas current machine-learning systems typically operate in isolation to analyze the given data, people often work in teams to collect and analyze data (e.g., biologists have worked as teams to collect and analyze genomic data, bringing together diverse experiments and perspectives to make progress on this difficult problem). New machine-learning methods capable of working collaboratively with humans to jointly analyze complex data sets might bring together the abilities of machines to tease out subtle statistical regularities from massive data sets with the abilities of humans to draw on diverse background knowledge to generate plausible explanations and suggest new hypotheses.
Many theoretical results in machine learning apply to all learning systems, whether they are computer algorithms, animals, organizations, or natural evolution. As the field progresses, we may see machine-learning theory and algorithms increasingly providing models for understanding learning in neural systems, organizations, and biological evolution and see machine learning benefit from ongoing studies of these other types of learning systems.

尽管在实践和商业上取得了成功，机器学习仍然是一个年轻的领域，有许多尚未充分探索的研究机会。将当前的机器学习方法与我们在自然系统（例如人类和其他动物、组织、经济体和生物进化）中观察到的学习类型进行对比，就能发现其中一些机会。例如，大多数机器学习算法的目标是从单一数据源中学习某一个特定的函数或数据模型，而人类显然能够凭借多年有监督和无监督的多样化训练经验，按照从易到难的顺序（例如先学爬，再学走，然后学跑）学会许多不同的技能和各类知识。这促使一些研究人员开始探索如何构建终身学习或永不停止的计算机学习器：它们连续运行多年，在一个整体架构中学习数千种相互关联的技能或函数，并且该架构使系统能够在已学会一种技能的基础上提高学习另一种技能的能力（26-28）。与自然学习系统类比的另一个方面，引出了基于团队的混合主导（mixed-initiative）学习的思想。例如，当前的机器学习系统通常孤立地运行以分析给定的数据，而人们往往以团队形式收集和分析数据（例如，生物学家以团队形式收集和分析基因组数据，汇集各种实验和观点，从而在这一难题上取得进展）。能够与人类协作、共同分析复杂数据集的新机器学习方法，或许可以把机器从海量数据中梳理出细微统计规律的能力，与人类利用各种背景知识给出合理解释并提出新假设的能力结合起来。机器学习的许多理论结果适用于所有学习系统，无论它们是计算机算法、动物、组织还是自然进化。随着该领域的发展，我们可能会看到机器学习的理论和算法越来越多地为理解神经系统、组织和生物进化中的学习提供模型，也会看到机器学习从对这些其他类型学习系统的持续研究中受益。

As with any powerful technology, machine learning raises questions about which of its potential uses society should encourage and discourage. The push in recent years to collect new kinds of personal data, motivated by its economic value, leads to obvious privacy issues, as mentioned above. The increasing value of data also raises a second ethical issue: Who will have access to, and ownership of, online data, and who will reap its benefits? Currently, much data are collected by corporations for specific uses leading to improved profits, with little or no motive for data sharing. However, the potential benefits that society could realize, even from existing online data, would be considerable if those data were to be made available for public good.

与任何强大的技术一样，机器学习也引发了这样的问题：在它的各种潜在用途中，社会应当鼓励哪些、抑制哪些。如上所述，近年来在经济价值的驱动下，收集新型个人数据的势头带来了明显的隐私问题。数据价值的不断增长还引发了第二个伦理问题：谁将有权访问并拥有在线数据？谁将从中获益？目前，大量数据由公司为提高利润的特定用途而收集，它们几乎没有或完全没有共享数据的动机。然而，如果把这些数据用于公共利益，那么即便仅凭现有的在线数据，社会能够实现的潜在收益也将相当可观。

To illustrate, consider one simple example of how society could benefit from data that is already online today by using this data to decrease the risk of global pandemic spread from infectious diseases. By combining location data from online sources (e.g., location data from cell phones, from credit-card transactions at retail outlets, and from security cameras in public places and private buildings) with online medical data (e.g., emergency room admissions), it would be feasible today to implement a simple system to telephone individuals immediately if a person they were in close contact with yesterday was just admitted to the emergency room with an infectious disease, alerting them to the symptoms they should watch for and precautions they should take. Here, there is clearly a tension and trade-off between personal privacy and public health, and society at large needs to make the decision on how to make this trade-off. The larger point of this example, however, is that, although the data are already online, we do not currently have the laws, customs, culture, or mechanisms to enable society to benefit from them, if it wishes to do so. In fact, much of these data are privately held and owned, even though they are data about each of us. Considerations such as these suggest that machine learning is likely to be one of the most transformative technologies of the 21st century. Although it is impossible to predict the future, it appears essential that society begin now to consider how to maximize its benefits.

为了说明这一点，请考虑一个简单的例子：社会可以利用如今已经在线的数据来降低传染病在全球大流行的风险，并从中受益。把来自在线来源的位置数据（例如手机的位置数据、零售店信用卡交易的位置数据，以及公共场所和私人建筑中安全摄像头的位置数据）与在线医疗数据（例如急诊室收治记录）结合起来，如今就可以实现这样一个简单的系统：如果某人昨天密切接触过的人刚刚因传染病被急诊室收治，系统就立即给他打电话，提醒他应当留意的症状和应当采取的预防措施。在这里，个人隐私与公共卫生之间显然存在张力和取舍，整个社会需要决定如何做出这种取舍。不过，这个例子更重要的意义在于：尽管这些数据已经在线，我们目前却没有相应的法律、习俗、文化或机制，让社会在愿意的情况下能够从中受益。事实上，尽管这些数据与我们每个人有关，其中很大一部分却由私人持有和拥有。诸如此类的考虑表明，机器学习很可能成为21世纪最具变革性的技术之一。尽管未来无法预测，但社会现在就必须开始考虑如何使其收益最大化。

有趣的链接

论文PDF