Aggregating Frame-level Features for Large-Scale Video Classification in the Google Cloud & YouTube-8M Video Understanding Challenge Shaoxiang Chen1, Xi Wang1, Yongyi Tang2, Xinpeng Chen3, Zuxuan Wu1, Yu-Gang Jiang1 1Fudan University 2Sun Yat-Sen University 3Wuhan University

Google Cloud & YouTube-8M Video Understanding Challenge
• Video multi-label classification
• 4,716 classes
• 1.8 classes per video on average
• Large amount of data (used in the challenge):

Partition    Number of Samples
Train        4,906,660 (70%)
Validate     1,401,828 (20%)
Test           700,640 (10%)
Total        7,009,128

• Audio and RGB features extracted from CNNs are provided at both the frame and video level.

Summary
• Our models are RNN variants, NetVLAD, and DBoF.
• By default we use MoE (Mixture of Experts) as the video-level classifier.
• Our solution is implemented in TensorFlow, based on the starter code.
• It takes 3-5 days to train our frame-level models on a single GPU.
• Achieved 0.84198 GAP on the public 50% of the test data and 4th place in the challenge.
• Paper: https://arxiv.org/pdf/1707.00803.pdf
• Code & Documentation: https://github.com/forwchen/yt8m
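To make the video-level classifier concrete, here is a minimal NumPy sketch of the per-class MoE idea: a softmax gate weights the probabilities of several logistic experts. Function and variable names are illustrative; biases and the starter code's extra "dummy" gate expert are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_predict(feature, gate_w, expert_w):
    """Probability for one class from a mixture of logistic experts.

    feature: (D,) video-level feature; gate_w, expert_w: (D, E) weight
    matrices for E experts (illustrative names, not the starter code's).
    """
    g = feature @ gate_w
    g = np.exp(g - g.max())
    g /= g.sum()                        # softmax gate over E experts
    p = sigmoid(feature @ expert_w)     # per-expert class probabilities
    return float(g @ p)                 # gate-weighted mixture
```

Since the output is a convex combination of sigmoids, it is always a valid probability in [0, 1].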

Model learning overview

Models & Training

Model      Variations
LSTM       -
LSTM       Layer normalization & recurrent dropout
RNN        Residual connections
GRU        -
GRU        Bi-directional
GRU        Recurrent dropout
GRU        Feature transformation
RWA        -
NetVLAD    -
DBoF       -
MoE        -

Training settings:
• Learning rate: 0.001, decays every epoch
• Batch size: 256 (RNNs), 1024 (NetVLAD)
• Adam optimizer
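The per-epoch decay schedule can be sketched as a one-liner; the slides only state that the rate decays every epoch, so the 0.95 decay factor below is an assumed example value.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.95):
    """Exponentially decayed learning rate, starting from 0.001.

    The decay factor is illustrative; the slides do not specify it.
    """
    return base_lr * decay ** epoch
```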

DBoF & MoE: https://github.com/google/youtube-8m
Recurrent Weighted Average: https://github.com/jostmey/rwa
RNN with residual connections: https://github.com/NickShahML/tensorflow_with_latest_papers

Model details

Structure of RWA (Recurrent Weighted Average) from [1]

Structure of RNN with residual connections from [2]
[1] Ostmeyer, Jared, and Lindsay Cowell. "Machine Learning on Sequential Data Using a Recurrent Weighted Average." arXiv 2017.
[2] Wu, Yonghui, et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." arXiv 2016.

Model details (NetVLAD)
[Diagram: the input frame sequence (T×D) is sampled down to S frames r_1, …, r_S (S×D); a 1d-conv & softmax produces soft assignments a_1, …, a_K (S×K) to K cluster means u_1, …, u_K; the aggregated residuals [v_1, …, v_K] (K×D) are flattened into a KD-dimensional descriptor, with output dimensionality L = 1024. Each cluster's descriptor is

    v_k = Σ_{i=1}^{S} a_ik (r_i − u_k)
]

Arandjelovic, Relja, et al. "NetVLAD: CNN architecture for weakly supervised place recognition." CVPR 2016.
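The aggregation step above can be sketched in a few lines of NumPy. This is a minimal illustration of the v_k equation, not the authors' TensorFlow implementation; the learned 1d-conv is reduced to a single projection matrix, and all names are illustrative.

```python
import numpy as np

def netvlad_aggregate(frames, cluster_weights, cluster_means):
    """Aggregate S frame features (S x D) into a K*D NetVLAD descriptor.

    cluster_weights: (D, K) projection producing soft assignments (the
    1d-conv & softmax in the slide); cluster_means: (K, D) learned centers.
    """
    logits = frames @ cluster_weights                # (S, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                # softmax over clusters
    # v_k = sum_i a_ik * (r_i - u_k), vectorized over all clusters
    residuals = frames[:, None, :] - cluster_means[None, :, :]  # (S, K, D)
    v = (a[:, :, None] * residuals).sum(axis=0)                 # (K, D)
    return v.reshape(-1)                                        # (K*D,)
```

In the full model, the flattened descriptor would then be projected to L = 1024 dimensions before the MoE classifier.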

Results (Single model)

Model                GAP@20
NetVLAD              0.79175
LSTM                 0.80907
GRU                  0.80688
RWA                  0.79622
RNN-Residual         0.81039
GRU-Dropout          0.81118
LSTM-Layernorm       0.80390
GRU-Bidirectional    0.80665
GRU-feature-trans    0.78644

Results (Ensemble)

Ensembles                 GAP@20
NetVLAD                   0.80895
LSTM                      0.81571
GRU                       0.81786
RWA                       0.81007
RNN-Residual              0.81510
GRU-Dropout               0.82523
Ensemble 1                0.83996
Ensemble 2                0.83481
Ensemble 3 (searching)    0.83581
Ensemble 4                0.84198

Fusion weights:
• Empirical, based on validation/test set performance
• Searching / learning over a small split of the validation set

Model ensembles: fusion of 3-5 checkpoint predictions per model
Ensemble 1: fusion of 9 model ensembles
Ensemble 2: fusion of another ~20 models with equal weights
Ensemble 3: fusion of the same models as Ensemble 2, with weights obtained by searching
Ensemble 4: fusion of Ensemble 1, Ensemble 3, and others
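The fusion itself is a weighted average of per-model score matrices. A minimal sketch (not the authors' actual fusion code): equal weights correspond to the Ensemble 2 setting, and searched weights to Ensemble 3.

```python
import numpy as np

def fuse(score_mats, weights=None):
    """Weighted fusion of per-model score matrices (videos x classes).

    score_mats: list of (N, C) arrays, one per model or checkpoint.
    weights: optional per-model fusion weights; equal weights if None.
    """
    stacked = np.stack(score_mats)              # (models, N, C)
    if weights is None:
        weights = np.ones(len(score_mats))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalize fusion weights
    return np.tensordot(w, stacked, axes=1)     # (N, C) fused scores
```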

Q&A
