Tera-scale deep learning

Quoc V. Le
Stanford University and Google

Joint work with: Kai Chen, Greg Corrado, Rajat Monga, Andrew Ng

Additional thanks: Jeff Dean, Matthieu Devin, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang

Samy Bengio, Zhenghao Chen, Tom Dean, Pang Wei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou

Machine Learning successes

- Face recognition
- OCR
- Recommendation systems
- Autonomous cars
- Email classification
- Web page ranking

Feature Extraction

A typical pipeline: feature extraction (mostly hand-crafted features) feeding a classifier.

Hand-Crafted Features

Computer vision: SIFT/HOG, SURF, ...
Speech recognition: MFCC, spectrogram, ZCR, ...

New feature-designing paradigm

Unsupervised Feature Learning / Deep Learning, e.g. Reconstruction ICA. Expensive and typically applied to small problems.

The Trend of Big Data

Outline

- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results

Topographic Independent Component Analysis (TICA)

1. Feature computation: each pooled feature is the square root of a sum of squared filter responses over a local neighborhood of units,

   p_k(x) = \sqrt{ \sum_{j \in N(k)} (W_j^T x)^2 }

   e.g., one pooling unit combining the squared responses (W_1^T x)^2, ..., (W_9^T x)^2 of a 3x3 group of filters.

2. Learning: the filter matrix W = [W_1; W_2; ...; W_{10000}] is learned from input data x by minimizing the summed pooled responses subject to an orthonormality constraint:

   \min_W \; \sum_i \sum_k \sqrt{ \sum_{j \in N(k)} (W_j^T x^{(i)})^2 } \quad \text{s.t.} \; W W^T = I

Invariance explained

Suppose two features F1 and F2 detect the same edge at two different locations, Loc1 and Loc2, and share one pooling unit. Image1 (edge at Loc1) gives responses (1, 0); Image2 (edge at Loc2) gives responses (0, 1). The pooled feature is identical in both cases:

   sqrt(1^2 + 0^2) = 1    and    sqrt(0^2 + 1^2) = 1

Same value regardless of the location of the edge.
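To make the pooled computation concrete, here is a minimal NumPy sketch that reproduces the two-edge example (the 2x2 filter matrix and the single pool are illustrative, not from the talk):

```python
import numpy as np

def tica_features(W, pools, x):
    """Pooled TICA features: p_k = sqrt(sum_{j in pool k} (W_j^T x)^2).

    W     : (n_filters, n_pixels) filter matrix
    pools : list of index arrays, one per pooling unit
    x     : (n_pixels,) input patch
    """
    responses = W @ x  # linear filter responses W_j^T x
    return np.array([np.sqrt(np.sum(responses[p] ** 2)) for p in pools])

# Toy version of the invariance example: two filters that detect
# the same edge at two locations, sharing one pooling unit.
W = np.array([[1.0, 0.0],    # F1 responds to an edge at Loc1
              [0.0, 1.0]])   # F2 responds to an edge at Loc2
pools = [np.array([0, 1])]   # F1 and F2 pooled together

print(tica_features(W, pools, np.array([1.0, 0.0])))  # edge at Loc1 -> [1.0]
print(tica_features(W, pools, np.array([0.0, 1.0])))  # edge at Loc2 -> [1.0]
```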

TICA vs. Reconstruction ICA

TICA's hard orthonormality constraint W W^T = I makes optimization difficult and requires data whitening. Reconstruction ICA (RICA) replaces the constraint with a soft reconstruction cost:

   \min_W \; \frac{\lambda}{m} \sum_i \| W^T W x^{(i)} - x^{(i)} \|_2^2 \; + \; \sum_i \sum_k \sqrt{ \sum_{j \in N(k)} (W_j^T x^{(i)})^2 }

This view exposes an equivalence between sparse coding, autoencoders, RBMs, and ICA, and removes the need for explicit data whitening. A deep architecture is built by treating the output of one layer as input to another layer.

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
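A minimal NumPy sketch of this objective (the pool layout, lambda, and smoothing epsilon are illustrative; objectives like this are typically minimized with batch methods such as L-BFGS):

```python
import numpy as np

def rica_cost(W, X, pools, lam=0.1):
    """RICA cost on a batch X of shape (n_pixels, m):
    (lam/m) * ||W^T W X - X||_F^2  +  (1/m) * sum of pooled sqrt terms.

    The soft reconstruction penalty replaces TICA's W W^T = I constraint,
    so unwhitened data and overcomplete W are both fine.
    """
    m = X.shape[1]
    R = W @ X                                        # filter responses
    recon = (lam / m) * np.sum((W.T @ R - X) ** 2)   # reconstruction cost
    eps = 1e-8                                       # smooths sqrt at zero
    pooled = sum(np.sum(np.sqrt(np.sum(R[p] ** 2, axis=0) + eps))
                 for p in pools) / m
    return recon + pooled
```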

Why RICA?

Comparing algorithms along three axes (speed, ease of training, invariant features): sparse coding has expensive inference; RBMs and autoencoders are fast and easy to train but do not learn invariant features out of the box; TICA learns invariant features but its orthonormality constraint makes training hard. Reconstruction ICA aims for all three: fast, easy to train, and invariant features.

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011

Summary of RICA

- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features

Applications of RICA

Action recognition (example classes): sit up, eat, run, drive car, answer phone, stand up, get out of car, kiss, shake hands.

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

[Bar charts: classification accuracy on the KTH, UCF, Hollywood2, and YouTube action-recognition benchmarks. On all four, learned features match or beat engineered features such as Hessian/SURF, pLSA, HOF, HOG, HOG/HOF, HOG3D, GRBMs, 3DCNN, HMAX, and combined engineered features.]

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

Cancer classification

[Bar chart: per-category classification accuracy on tumor images (apoptotic, viable tumor region, necrosis). RICA features outperform hand-engineered features.]

Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012

Scaling up deep RICA networks

Scaling up Deep Learning

[Plot: "deep learning data" vs. "real data": the datasets deep learning has been trained on are small compared to the data actually available.]

It's better to have more features! No matter the algorithm, more features are always more successful.

Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS'11

Most are local features.

Local receptive field networks

[Diagram: the image is split into local regions; the RICA features over each region are trained on a different machine (Machine #1 through Machine #4), so no filter needs to see the whole image.]

Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
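A rough sketch of the partitioning idea, with an illustrative patch size, stride, and round-robin machine assignment (none of these numbers come from the talk):

```python
import numpy as np

def local_receptive_fields(image, field=18, stride=10):
    """Split an image into small overlapping patches ("receptive fields").

    Each filter connects only to one patch, so disjoint groups of filters
    (and their patches) can be trained on different machines.
    """
    H, W = image.shape
    patches, coords = [], []
    for r in range(0, H - field + 1, stride):
        for c in range(0, W - field + 1, stride):
            patches.append(image[r:r + field, c:c + field].ravel())
            coords.append((r, c))
    return np.stack(patches), coords

image = np.random.rand(200, 200)              # one 200x200 training image
patches, coords = local_receptive_fields(image)
machine_of = [i % 4 for i in range(len(patches))]  # assign fields to 4 machines
```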

Challenges with 1000s of machines

Asynchronous Parallel SGDs with a parameter server: each model replica fetches the current parameters, computes a gradient on its shard of data, and pushes the update back without waiting for the other replicas.

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
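A toy, single-process sketch of the parameter-server pattern (the real system shards parameters across many machines and communicates by RPC; the linear model, learning rate, and threading here are illustrative only):

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; workers fetch them and push gradients."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()  # keeps each numpy update atomic

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad  # apply each update as it arrives

def worker(server, data, labels, steps=100):
    for _ in range(steps):
        w = server.fetch()             # may be slightly stale: that's the point
        i = np.random.randint(len(data))
        err = data[i] @ w - labels[i]  # least-squares residual
        server.push(err * data[i])     # no synchronization barrier

# Toy run: 4 asynchronous workers sharing one linear model.
X, y = np.random.randn(1000, 10), np.random.randn(1000)
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=worker, args=(ps, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```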

Summary of Scaling up

- Local connectivity
- Asynchronous SGDs

... And more:
- RPC vs. MapReduce
- Prefetching (see the sketch after this list)
- Single vs. double precision
- Removing slow machines
- Optimized Softmax
- ...
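Of these, prefetching is easy to illustrate: overlap data loading with computation so the trainer never waits on I/O. A minimal sketch, assuming a user-supplied load_batch(i) function (hypothetical, not from the talk):

```python
import threading
import queue

def prefetching_batches(load_batch, n_batches, depth=2):
    """Keep up to `depth` minibatches loaded ahead of the consumer."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(n_batches):
            q.put(load_batch(i))   # blocks when the buffer is full
        q.put(None)                # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

# Usage (load_minibatch and train_step are hypothetical):
# for batch in prefetching_batches(load_minibatch, n_batches=100):
#     train_step(batch)
```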

Training

Dataset: 10 million 200x200 unlabeled images from YouTube/Web.
Trained on 2,000 machines (16,000 cores) for 1 week.
1.15 billion parameters:
- 100x larger than previously reported
- small compared to the visual cortex

[Architecture: a deep network built by stacking RICA layers on top of the image input.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

The face neuron

Top stimuli from the test set; optimal stimulus found by numerical optimization.

[Histogram: feature value vs. frequency for faces vs. random distractors: the neuron's responses separate the two populations.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
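The optimal stimulus above is found numerically: gradient ascent on the neuron's activation with respect to the input, under a norm constraint. A minimal sketch (the gradient callable and the linear toy "neuron" are stand-ins for the trained network):

```python
import numpy as np

def optimal_stimulus(activation_grad, dim, steps=200, lr=0.1):
    """Find x maximizing a neuron's activation f(x) subject to ||x|| <= 1
    by projected gradient ascent.

    activation_grad : callable returning df/dx at x (assumed given)
    """
    x = np.random.randn(dim) * 0.01
    for _ in range(steps):
        x += lr * activation_grad(x)
        norm = np.linalg.norm(x)
        if norm > 1.0:                 # project back onto the unit ball
            x /= norm
    return x

# Toy example: the "neuron" is a fixed template matcher f(x) = t . x,
# so the optimal stimulus aligns with the template t.
template = np.random.randn(64)
x_star = optimal_stimulus(lambda x: template, dim=64)
```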

Invariance properties

[Plots: the face neuron's feature response under horizontal shifts (0 to 20 pixels), vertical shifts (0 to 20 pixels), 3D rotation angle (0 to 90 degrees), and scale factor (0.4x to 1.6x): the response stays high across these transformations.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
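Curves like these are produced by transforming a test stimulus and recording the neuron's response at each transformation level; a sketch for horizontal shifts (face_neuron and face_image are placeholders for the trained network and a test image):

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def invariance_curve(neuron, image, pixel_shifts):
    """Response of `neuron` (a callable on a 2-D image) as the stimulus
    is translated horizontally by each amount in `pixel_shifts`."""
    return [neuron(nd_shift(image, (0, dx), mode="nearest"))
            for dx in pixel_shifts]

# e.g. responses for 0..20 pixel horizontal shifts of a face image:
# curve = invariance_curve(face_neuron, face_image, range(0, 21))
```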

The pedestrian neuron

Top stimuli from the test set; optimal stimulus found by numerical optimization.

[Histogram: feature value vs. frequency for pedestrians vs. random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

The cat face neuron

Top stimuli from the test set; optimal stimulus found by numerical optimization.

[Histogram: feature value vs. frequency for cat faces vs. random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

ImageNet classification

22,000 categories; 14,000,000 images.
Prior approaches: hand-engineered features (SIFT, HOG, LBP), spatial pyramids, sparse coding/compression.

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

22,000 is a lot of categories...

...
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obesus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
...

Stingray vs. Manta ray

[Best stimuli for Features 1 through 13: each learned feature responds to a visually coherent cluster of stingray/manta ray images.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

ImageNet 22,000-category results: random guess 0.005%; previous state of the art 9.5% (Weston, Bengio '11); our method, feature learning from raw pixels: 15.8%.

ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11); our method: 20%. Using only 1,000 categories, our method exceeds 50%.

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Other results

We also have great features for:
- Speech recognition
- Word-vector embeddings for NLP

Conclusions

• RICA learns invariant features
• A face neuron emerges from totally unlabeled data, with enough training and data
• State-of-the-art performance on:
  – Action recognition
  – Cancer image classification
  – ImageNet

[Recap slide: ImageNet (random guess 0.005%, best published result 9.5%, our method 15.8%); cancer classification; feature visualization; action recognition benchmarks; face neuron.]

References

• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features of Tumor Signatures. ISBI, 2012.
• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.

http://ai.stanford.edu/~quocle
