Deep Learning

Ian Goodfellow, Yoshua Bengio and Aaron Courville

Contents

Website  vii
Acknowledgments  viii
Notation  xi

1 Introduction  1
  1.1 Who Should Read This Book?  8
  1.2 Historical Trends in Deep Learning  11

Part I: Applied Math and Machine Learning Basics  29

2 Linear Algebra  31
  2.1 Scalars, Vectors, Matrices and Tensors  31
  2.2 Multiplying Matrices and Vectors  34
  2.3 Identity and Inverse Matrices  36
  2.4 Linear Dependence and Span  37
  2.5 Norms  39
  2.6 Special Kinds of Matrices and Vectors  40
  2.7 Eigendecomposition  42
  2.8 Singular Value Decomposition  44
  2.9 The Moore-Penrose Pseudoinverse  45
  2.10 The Trace Operator  46
  2.11 The Determinant  47
  2.12 Example: Principal Components Analysis  48

3 Probability and Information Theory  53
  3.1 Why Probability?  54
  3.2 Random Variables  56
  3.3 Probability Distributions  56
  3.4 Marginal Probability  58
  3.5 Conditional Probability  59
  3.6 The Chain Rule of Conditional Probabilities  59
  3.7 Independence and Conditional Independence  60
  3.8 Expectation, Variance and Covariance  60
  3.9 Common Probability Distributions  62
  3.10 Useful Properties of Common Functions  67
  3.11 Bayes’ Rule  70
  3.12 Technical Details of Continuous Variables  71
  3.13 Information Theory  73
  3.14 Structured Probabilistic Models  75

4 Numerical Computation  80
  4.1 Overflow and Underflow  80
  4.2 Poor Conditioning  82
  4.3 Gradient-Based Optimization  82
  4.4 Constrained Optimization  93
  4.5 Example: Linear Least Squares  96

5 Machine Learning Basics  98
  5.1 Learning Algorithms  99
  5.2 Capacity, Overfitting and Underfitting  110
  5.3 Hyperparameters and Validation Sets  120
  5.4 Estimators, Bias and Variance  122
  5.5 Maximum Likelihood Estimation  131
  5.6 Bayesian Statistics  135
  5.7 Supervised Learning Algorithms  140
  5.8 Unsupervised Learning Algorithms  146
  5.9 Stochastic Gradient Descent  151
  5.10 Building a Machine Learning Algorithm  153
  5.11 Challenges Motivating Deep Learning  155

Part II: Deep Networks: Modern Practices  166

6 Deep Feedforward Networks  168
  6.1 Example: Learning XOR  171
  6.2 Gradient-Based Learning  177
  6.3 Hidden Units  191
  6.4 Architecture Design  197
  6.5 Back-Propagation and Other Differentiation Algorithms  204
  6.6 Historical Notes  224

7 Regularization for Deep Learning  228
  7.1 Parameter Norm Penalties  230
  7.2 Norm Penalties as Constrained Optimization  237
  7.3 Regularization and Under-Constrained Problems  239
  7.4 Dataset Augmentation  240
  7.5 Noise Robustness  242
  7.6 Semi-Supervised Learning  243
  7.7 Multi-Task Learning  244
  7.8 Early Stopping  246
  7.9 Parameter Tying and Parameter Sharing  253
  7.10 Sparse Representations  254
  7.11 Bagging and Other Ensemble Methods  256
  7.12 Dropout  258
  7.13 Adversarial Training  268
  7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier  270

8 Optimization for Training Deep Models  274
  8.1 How Learning Differs from Pure Optimization  275
  8.2 Challenges in Neural Network Optimization  282
  8.3 Basic Algorithms  294
  8.4 Parameter Initialization Strategies  301
  8.5 Algorithms with Adaptive Learning Rates  306
  8.6 Approximate Second-Order Methods  310
  8.7 Optimization Strategies and Meta-Algorithms  317

9 Convolutional Networks  330
  9.1 The Convolution Operation  331
  9.2 Motivation  335
  9.3 Pooling  339
  9.4 Convolution and Pooling as an Infinitely Strong Prior  345
  9.5 Variants of the Basic Convolution Function  347
  9.6 Structured Outputs  358
  9.7 Data Types  360
  9.8 Efficient Convolution Algorithms  362
  9.9 Random or Unsupervised Features  363
  9.10 The Neuroscientific Basis for Convolutional Networks  364
  9.11 Convolutional Networks and the History of Deep Learning  371

10 Sequence Modeling: Recurrent and Recursive Nets  373
  10.1 Unfolding Computational Graphs  375
  10.2 Recurrent Neural Networks  378
  10.3 Bidirectional RNNs  394
  10.4 Encoder-Decoder Sequence-to-Sequence Architectures  396
  10.5 Deep Recurrent Networks  398
  10.6 Recursive Neural Networks  400
  10.7 The Challenge of Long-Term Dependencies  401
  10.8 Echo State Networks  404
  10.9 Leaky Units and Other Strategies for Multiple Time Scales  406
  10.10 The Long Short-Term Memory and Other Gated RNNs  408
  10.11 Optimization for Long-Term Dependencies  413
  10.12 Explicit Memory  416

11 Practical Methodology  421
  11.1 Performance Metrics  422
  11.2 Default Baseline Models  425
  11.3 Determining Whether to Gather More Data  426
  11.4 Selecting Hyperparameters  427
  11.5 Debugging Strategies  436
  11.6 Example: Multi-Digit Number Recognition  440

12 Applications  443
  12.1 Large-Scale Deep Learning  443
  12.2 Computer Vision  452
  12.3 Speech Recognition  458
  12.4 Natural Language Processing  461
  12.5 Other Applications  478

Part III: Deep Learning Research  486

13 Linear Factor Models  489
  13.1 Probabilistic PCA and Factor Analysis  490
  13.2 Independent Component Analysis (ICA)  491
  13.3 Slow Feature Analysis  493
  13.4 Sparse Coding  496
  13.5 Manifold Interpretation of PCA  499

14 Autoencoders  502
  14.1 Undercomplete Autoencoders  503
  14.2 Regularized Autoencoders  504
  14.3 Representational Power, Layer Size and Depth  508
  14.4 Stochastic Encoders and Decoders  509
  14.5 Denoising Autoencoders  510
  14.6 Learning Manifolds with Autoencoders  515
  14.7 Contractive Autoencoders  521
  14.8 Predictive Sparse Decomposition  523
  14.9 Applications of Autoencoders  524

15 Representation Learning  526
  15.1 Greedy Layer-Wise Unsupervised Pretraining  528
  15.2 Transfer Learning and Domain Adaptation  536
  15.3 Semi-Supervised Disentangling of Causal Factors  541
  15.4 Distributed Representation  546
  15.5 Exponential Gains from Depth  553
  15.6 Providing Clues to Discover Underlying Causes  554

16 Structured Probabilistic Models for Deep Learning  558
  16.1 The Challenge of Unstructured Modeling  559
  16.2 Using Graphs to Describe Model Structure  563
  16.3 Sampling from Graphical Models  580
  16.4 Advantages of Structured Modeling  582
  16.5 Learning about Dependencies  582
  16.6 Inference and Approximate Inference  584
  16.7 The Deep Learning Approach to Structured Probabilistic Models  585

17 Monte Carlo Methods  590
  17.1 Sampling and Monte Carlo Methods  590
  17.2 Importance Sampling  592
  17.3 Markov Chain Monte Carlo Methods  595
  17.4 Gibbs Sampling  599
  17.5 The Challenge of Mixing between Separated Modes  599

18 Confronting the Partition Function  605
  18.1 The Log-Likelihood Gradient  606
  18.2 Stochastic Maximum Likelihood and Contrastive Divergence  607
  18.3 Pseudolikelihood  615
  18.4 Score Matching and Ratio Matching  617
  18.5 Denoising Score Matching  619
  18.6 Noise-Contrastive Estimation  620
  18.7 Estimating the Partition Function  623

19 Approximate Inference  631
  19.1 Inference as Optimization  633
  19.2 Expectation Maximization  634
  19.3 MAP Inference and Sparse Coding  635
  19.4 Variational Inference and Learning  638
  19.5 Learned Approximate Inference  651

20 Deep Generative Models  654
  20.1 Boltzmann Machines  654
  20.2 Restricted Boltzmann Machines  656
  20.3 Deep Belief Networks  660
  20.4 Deep Boltzmann Machines  663
  20.5 Boltzmann Machines for Real-Valued Data  676
  20.6 Convolutional Boltzmann Machines  683
  20.7 Boltzmann Machines for Structured or Sequential Outputs  685
  20.8 Other Boltzmann Machines  686
  20.9 Back-Propagation through Random Operations  687
  20.10 Directed Generative Nets  692
  20.11 Drawing Samples from Autoencoders  711
  20.12 Generative Stochastic Networks  714
  20.13 Other Generation Schemes  716
  20.14 Evaluating Generative Models  717
  20.15 Conclusion  720

Bibliography  721

Index  777

Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
