Tera-scale deep learning

Quoc V. Le
Stanford University and Google

Joint work with: Kai Chen, Greg Corrado, Rajat Monga, Andrew Ng

Additional thanks: Jeff Dean, Matthieu Devin, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang

Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou

Machine learning successes

- Face recognition
- OCR
- Recommendation systems
- Autonomous cars
- Email classification
- Web page ranking

Feature extraction → Classifier

Feature extraction: mostly hand-crafted features

Hand-crafted features

Computer vision: SIFT/HOG, SURF, …
Speech recognition: Spectrogram, MFCC, ZCR, …

New feature-designing paradigm

Unsupervised feature learning / deep learning (here: Reconstruction ICA) — expensive and typically applied to small problems

The Trend of Big Data

Outline

- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results

Topographic Independent Component Analysis (TICA)

1. Feature computation: square the first-layer filter responses, then pool with a square root. A pooled feature over a group of filters (e.g., filters 1 through 9) is

   $p(x) = \sqrt{(w_1^\top x)^2 + \cdots + (w_9^\top x)^2}$

2. Learning: learn the filter matrix $W = [w_1; w_2; \ldots; w_{10000}]$ from unlabeled input data $x$.
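A minimal NumPy sketch of this two-layer computation; the fixed 0/1 pooling matrix H and all names are illustrative assumptions, not from the slides:

    import numpy as np

    def tica_features(X, W, H, eps=1e-8):
        """Pooled TICA features: p_k = sqrt(eps + sum_j H[k, j] * (w_j^T x)^2).

        X: (n_samples, n_inputs) data
        W: (n_filters, n_inputs) learned first-layer filters
        H: (n_pools, n_filters) fixed 0/1 pooling (topography) matrix
        """
        Z = (X @ W.T) ** 2             # squared first-layer responses
        return np.sqrt(eps + Z @ H.T)  # square-root pooling over each group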

Invariance explained

Consider two features F1 and F2 and their pooled feature $\sqrt{F_1^2 + F_2^2}$. An edge at Loc1 (Image 1) activates F1 but not F2, giving responses (1, 0); the same edge at Loc2 (Image 2) gives (0, 1).

$\sqrt{1^2 + 0^2} = 1$
$\sqrt{0^2 + 1^2} = 1$

Same value regardless of the location of the edge.
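The same arithmetic as a tiny runnable check (toy responses from the example above):

    import numpy as np

    def pooled(f1, f2):
        return np.sqrt(f1**2 + f2**2)

    print(pooled(1, 0))  # edge at Loc1 -> 1.0
    print(pooled(0, 1))  # edge at Loc2 -> 1.0, same value: translation invariance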

TICA:

$\min_W \; \sum_{i=1}^{m} \sum_{k} \sqrt{\varepsilon + \sum_{j} H_{kj} \left(w_j^\top x^{(i)}\right)^2} \quad \text{s.t.} \quad WW^\top = I$

Reconstruction ICA replaces the hard orthonormality constraint, which requires data whitening and is expensive to enforce for overcomplete $W$, with a soft reconstruction cost:

$\min_W \; \frac{\lambda}{m} \sum_{i=1}^{m} \left\lVert W^\top W x^{(i)} - x^{(i)} \right\rVert_2^2 + \sum_{i=1}^{m} \sum_{k} g\left(w_k^\top x^{(i)}\right)$

where $g$ is a smooth sparsity penalty such as $g(\cdot) = \log\cosh(\cdot)$. With the reconstruction cost, data whitening is no longer required.

Equivalence between sparse coding, autoencoders, RBMs, and ICA. Build a deep architecture by treating the output of one layer as input to another layer.

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
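A small NumPy sketch of this objective, assuming the log-cosh penalty and an illustrative λ; because the cost is unconstrained, it can be handed directly to a batch optimizer such as L-BFGS, which is part of what makes RICA easy to train:

    import numpy as np

    def rica_cost(W, X, lam=0.1):
        """Sketch of the RICA objective: soft reconstruction penalty plus a
        smooth sparsity penalty g(.) = log(cosh(.)). The weighting and the
        per-example normalization are illustrative choices.

        W: (n_filters, n_inputs) filters; X: (n_inputs, m) column-wise data.
        """
        m = X.shape[1]
        WX = W @ X                      # filter responses, (n_filters, m)
        recon = W.T @ WX - X            # W^T W x - x, (n_inputs, m)
        reconstruction = (lam / m) * np.sum(recon ** 2)
        sparsity = np.sum(np.log(np.cosh(WX))) / m
        return reconstruction + sparsity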

Why RICA?

Comparing algorithms along three axes — speed, ease of training, and invariant features:

- Sparse coding
- RBMs/Autoencoders
- TICA
- Reconstruction ICA

Only Reconstruction ICA combines all three: it is fast, easy to train, and learns invariant features.

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011

Summary of RICA

- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features

Applications of RICA

Action recognition

Example classes: Sit up, Eat, Run, Drive car, Answer phone, Stand up, Get out of car, Kiss, Shake hands.

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

[Bar charts: classification accuracy on four action recognition benchmarks — KTH, UCF, Hollywood2, and YouTube. On each benchmark, the learned features match or beat hand-engineered baselines such as Hessian/SURF, pLSA, HOF, HOG, HOG/HOF, HOG3D, GRBMs, 3DCNN, HMAX, and combined engineered features.]

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

Cancer classification

[Bar chart: classification accuracy (84%–92%) for tumor image regions — apoptotic, viable tumor region, necrosis — comparing hand-engineered features against RICA; RICA scores highest.]

Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012

Scaling up deep RICA networks

Scaling up deep learning

[Chart: the scale of data used in deep learning experiments versus the scale of real-world data — deep learning datasets lag far behind.]

It's better to have more features! No matter the algorithm, more features are always more successful.

Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS 2011

Most learned features are local.

Local receptive field networks

[Diagram: the image is divided into patches; RICA features for each patch are computed locally, so the model partitions naturally across Machine #1 through Machine #4.]

Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
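A toy sketch of the partitioning, with illustrative sizes: each "machine" holds only the filters for its quadrant, and only the small feature vectors cross machine boundaries:

    import numpy as np

    image = np.random.rand(200, 200)
    patch = 100  # each machine covers one 100x100 quadrant

    def local_features(quadrant, W):
        """First-layer responses for one machine's local receptive field."""
        return W @ quadrant.reshape(-1)

    outputs = []
    for i in (0, 1):
        for j in (0, 1):
            quad = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            W = 0.01 * np.random.randn(64, patch * patch)  # this machine's filters
            outputs.append(local_features(quad, W))

    features = np.concatenate(outputs)  # only these small vectors are communicated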

Challenges with 1000s of machines

Asynchronous parallel SGDs

[Diagram: model replicas compute gradients in parallel and read/write parameters asynchronously through a sharded parameter server.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
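A minimal sketch of the idea, with worker threads standing in for machines and a linear least-squares model standing in for the network; all class and function names are illustrative assumptions:

    import threading
    import numpy as np

    class ParameterServer:
        def __init__(self, dim):
            self.w = np.zeros(dim)
            self.lock = threading.Lock()

        def pull(self):
            with self.lock:
                return self.w.copy()

        def push(self, grad, lr=0.01):
            with self.lock:
                self.w -= lr * grad      # apply a possibly stale gradient

    def worker(server, data, steps=1000):
        rng = np.random.default_rng()
        for _ in range(steps):
            w = server.pull()            # fetch current parameters
            x, y = data[rng.integers(len(data))]
            grad = 2.0 * (w @ x - y) * x # gradient of a squared loss
            server.push(grad)            # send the update back asynchronously

    server = ParameterServer(dim=10)
    data = [(np.random.randn(10), float(np.random.randn())) for _ in range(500)]
    threads = [threading.Thread(target=worker, args=(server, data)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Workers never wait for each other: each pulls the latest parameters, computes a gradient on its own data, and pushes it back, so a slow machine delays only its own (slightly stale) updates.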

Summary of scaling up

- Local connectivity
- Asynchronous SGDs

…And more:

- RPC vs MapReduce
- Prefetching
- Single vs. double precision
- Removing slow machines
- Optimized softmax
- …

Training

[Diagram: a deep network built by stacking RICA layers on top of the input image.]

Dataset: 10 million 200x200 unlabeled images from YouTube/the web
Train on 2,000 machines (16,000 cores) for 1 week
1.15 billion parameters

- 100x larger than previously reported
- Small compared to the visual cortex

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

The face neuron

[Figure: top stimuli from the test set, and the optimal stimulus found by numerical optimization.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
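A hedged sketch of how such an optimal stimulus can be found: gradient ascent on the input subject to a norm constraint. Here feature_fn is an assumed black-box activation, and the finite-difference gradient is purely illustrative (and slow):

    import numpy as np

    def optimal_stimulus(feature_fn, dim, steps=100, lr=0.1, eps=1e-4):
        x = 0.01 * np.random.randn(dim)
        for _ in range(steps):
            grad = np.zeros(dim)
            for i in range(dim):
                d = np.zeros(dim)
                d[i] = eps
                grad[i] = (feature_fn(x + d) - feature_fn(x - d)) / (2 * eps)
            x += lr * grad
            x /= max(1.0, np.linalg.norm(x))  # project back onto the unit ball
        return x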

[Histogram: frequency of feature values for faces versus random distractors; the face neuron's responses separate the two groups.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Invariance properties

[Four panels: feature response as a function of horizontal shift (0 to 20 pixels), vertical shift (0 to 20 pixels), 3D rotation angle (0° to 90°), and scale factor (0.4x to 1.6x).]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
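One such curve could be produced with a sweep like the following sketch, where feature_fn (an image-to-scalar activation) is an assumption:

    from scipy.ndimage import shift

    def translation_curve(feature_fn, image, max_shift=20):
        """Neuron response for horizontal translations of 0..max_shift pixels."""
        return [feature_fn(shift(image, (0, dx))) for dx in range(max_shift + 1)]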

[Figure: top stimuli from the test set and the optimal stimulus by numerical optimization for the pedestrian neuron; histogram of feature values for pedestrians versus random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

[Figure: top stimuli from the test set and the optimal stimulus by numerical optimization for the cat face neuron; histogram of feature values for cat faces versus random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

ImageNet classification

22,000 categories
14,000,000 images
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

22,000 is a lot of categories…

…
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obesus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

Stingray / Manta ray

Best stimuli

[Figure: best test-set stimuli for Features 1–13; without any labels, different features become selective for stingrays and manta rays.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: ?

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: 15.8%

ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11); our method 20%
Using only 1,000 categories, our method > 50%

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Other results

We also have great features for:
- Speech recognition
- Word-vector embeddings for NLP

Conclusions

• RICA learns invariant features
• A face neuron emerges from totally unlabeled data, with enough training and data
• State-of-the-art performance on:
  – Action recognition
  – Cancer image classification
  – ImageNet

[Summary slide: thumbnails of the main results — ImageNet (random guess 0.005%, best published result 9.5%, our method 15.8%), cancer classification, feature visualization, action recognition benchmarks, and the face neuron.]

References

• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features of Tumor Signatures. ISBI, 2012.
• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.

http://ai.stanford.edu/~quocle
