Aggregating Frame-level Features for Large-Scale Video Classification in the Google Cloud & YouTube-8M Video Understanding Challenge
Shaoxiang Chen1, Xi Wang1, Yongyi Tang2, Xinpeng Chen3, Zuxuan Wu1, Yu-Gang Jiang1
1Fudan University  2Sun Yat-Sen University  3Wuhan University

Google Cloud & YouTube-8M Video Understanding Challenge
• Video multi-label classification
• 4,716 classes
• 1.8 classes per video on average
• Large amount of data (used in the challenge):

Partition   Number of Samples
Train       4,906,660 (70%)
Validate    1,401,828 (20%)
Test          700,640 (10%)
Total       7,009,128

• Audio and RGB features extracted from CNNs are provided at both the frame and the video level.

Summary
• Our models are RNN variants, NetVLAD, and DBoF.
• By default we used MoE (Mixture of Experts) as the video-level classifier (a sketch follows below).
• Our solution is implemented in TensorFlow based on the starter code.
• It takes 3-5 days to train our frame-level models on a single GPU.
• Achieved 0.84198 GAP on the public 50% test data and 4th place in the challenge.
• Paper: https://arxiv.org/pdf/1707.00803.pdf
• Code & Documentation: https://github.com/forwchen/yt8m
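As a hedged illustration of the default video-level classifier, here is a minimal TF 1.x sketch in the spirit of the starter code's MoE model; the function name, layer shapes, and the choice of two experts are our assumptions, not the exact starter-code implementation:

import tensorflow as tf

def moe_classifier(video_embedding, num_classes=4716, num_mixtures=2):
    # Gating network: a softmax over each class's experts plus one
    # dummy "no-prediction" expert.
    gate_logits = tf.layers.dense(video_embedding,
                                  num_classes * (num_mixtures + 1))
    gates = tf.nn.softmax(tf.reshape(gate_logits, [-1, num_mixtures + 1]))
    # Expert networks: independent logistic experts per class.
    expert_logits = tf.layers.dense(video_embedding,
                                    num_classes * num_mixtures)
    experts = tf.nn.sigmoid(tf.reshape(expert_logits, [-1, num_mixtures]))
    # Mix expert probabilities with their gates; the dummy expert
    # contributes probability zero.
    probs = tf.reduce_sum(gates[:, :num_mixtures] * experts, axis=1)
    return tf.reshape(probs, [-1, num_classes])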

Model learning overview

Models & Training

Model     Variations
LSTM      -
LSTM      Layer normalization & recurrent dropout
RNN       Residual connections
GRU       -
GRU       Bi-directional
GRU       Recurrent dropout
GRU       Feature transformation
RWA       -
NetVLAD   -
DBoF      -
MoE       -
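To illustrate how one of these variants can be assembled, a minimal TF 1.x sketch of the layer-normalized LSTM with recurrent dropout; the unit count and keep probability are our assumptions, not the tuned values:

import tensorflow as tf

def frame_level_lstm(frames, num_units=1024, keep_prob=0.8):
    # frames: [batch, time, dim] frame features.
    # LayerNormBasicLSTMCell applies layer normalization and
    # recurrent dropout, matching the variation listed above.
    cell = tf.contrib.rnn.LayerNormBasicLSTMCell(
        num_units, layer_norm=True, dropout_keep_prob=keep_prob)
    outputs, state = tf.nn.dynamic_rnn(cell, frames, dtype=tf.float32)
    # The final hidden state serves as the video-level representation,
    # which is then fed to the MoE classifier.
    return state.h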

Training settings:
• Learning rate: 0.001, decays every epoch
• Batch size: 256 (RNNs), 1024 (NetVLAD)
• Adam optimizer (see the sketch after the repository links below)

DBoF & MoE: https://github.com/google/youtube-8m
Recurrent Weighted Average: https://github.com/jostmey/rwa
RNN with residual connections: https://github.com/NickShahML/tensorflow_with_latest_papers
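A minimal TF 1.x sketch of the training settings above; the decay factor and the toy loss are illustrative assumptions, since the slides only give the initial rate and that it decays once per epoch:

import tensorflow as tf

# Toy stand-in tensors so the snippet is self-contained; in the real
# pipeline the logits come from a frame-level model over 4,716 classes.
logits = tf.get_variable("logits", [256, 4716])
labels = tf.zeros([256, 4716])
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

# Learning rate 0.001, decayed once per epoch (decay factor assumed).
steps_per_epoch = 4906660 // 256  # training samples / RNN batch size
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    0.001, global_step, decay_steps=steps_per_epoch,
    decay_rate=0.95, staircase=True)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(
    loss, global_step=global_step)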

Model details

Structure of RWA (Recurrent Weighted Average) from [1]

Structure of the RNN with residual connections from [2]

[1] Ostmeyer, Jared, and Lindsay Cowell. "Machine Learning on Sequential Data Using a Recurrent Weighted Average." arXiv 2017.
[2] Wu, Yonghui, et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv 2016.
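For reference, the RWA recurrence from [1] can be written as follows (our transcription of the paper's running-average formulation; the hidden state is an attention-weighted running average of per-step features):

% z_t: gated per-step features; n_t, d_t: running numerator/denominator
\begin{aligned}
z_t &= u(x_t) \odot \tanh\big(g(x_t, h_{t-1})\big) \\
n_t &= n_{t-1} + z_t \, e^{a(x_t, h_{t-1})} \\
d_t &= d_{t-1} + e^{a(x_t, h_{t-1})} \\
h_t &= f\!\left(\frac{n_t}{d_t}\right)
\end{aligned}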

Model details

[NetVLAD figure] The input sequence of frame features ($T \times D$) is sampled down to $S$ frames $\mathbf{r}_1, \dots, \mathbf{r}_S$ ($S \times D$). A 1d-conv followed by a softmax produces soft assignments $a_{ik}$ ($S \times K$) of each frame to the $K$ learned cluster means $\mathbf{u}_1, \dots, \mathbf{u}_K$ ($K \times D$). The aggregated descriptors $[\mathbf{v}_1, \dots, \mathbf{v}_K]$ are flattened into a $KD$-dimensional video representation, where

$\mathbf{v}_k = \sum_{i=1}^{S} a_{ik} \, (\mathbf{r}_i - \mathbf{u}_k)$

(The figure also notes $L = 1024$.)

Arandjelovic, Relja, et al. "NetVLAD: CNN architecture for weakly supervised place recognition." CVPR 2016.
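A hedged TF 1.x sketch of the aggregation in the figure; the cluster count, function name, and initialization are illustrative assumptions:

import tensorflow as tf

def netvlad(frames, num_clusters=64, dim=1024):
    # frames: [batch, S, dim] frame features, already sampled to S frames.
    clusters = tf.get_variable("cluster_means", [num_clusters, dim])
    # 1d-conv (a per-frame linear map) + softmax gives soft assignments
    # a_ik of each frame to each cluster: [batch, S, K].
    assign = tf.nn.softmax(
        tf.layers.conv1d(frames, num_clusters, kernel_size=1))
    # v_k = sum_i a_ik (r_i - u_k), computed for all k at once as
    # sum_i a_ik r_i - (sum_i a_ik) u_k.
    weighted_sum = tf.einsum('bsk,bsd->bkd', assign, frames)
    assign_sum = tf.reduce_sum(assign, axis=1)  # [batch, K]
    vlad = weighted_sum - assign_sum[:, :, None] * clusters[None]
    # Flatten to the KD-dimensional video descriptor.
    return tf.reshape(vlad, [-1, num_clusters * dim])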

Results (Single model)

Model               GAP@20
NetVLAD             0.79175
LSTM                0.80907
GRU                 0.80688
RWA                 0.79622
RNN-Residual        0.81039
GRU-Dropout         0.81118
LSTM-Layernorm      0.80390
GRU-Bidirectional   0.80665
GRU-feature-trans   0.78644
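GAP@20 is the challenge metric: global average precision computed over each video's top-20 predictions. A minimal NumPy sketch of the computation (our rendering of the public metric definition, not the official evaluation code):

import numpy as np

def gap_at_20(preds, labels, top_k=20):
    # preds, labels: [num_videos, num_classes]; labels are binary.
    # Pool each video's top-k (score, label) pairs, then compute
    # average precision over the globally sorted pool.
    scores, hits = [], []
    for p, l in zip(preds, labels):
        top = np.argsort(p)[::-1][:top_k]
        scores.extend(p[top])
        hits.extend(l[top])
    order = np.argsort(scores)[::-1]
    hits = np.asarray(hits, dtype=np.float64)[order]
    precisions = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    # Normalize by the total number of positive labels.
    return np.sum(precisions * hits) / max(np.sum(labels), 1)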

Results (Ensemble)

Ensemble                 GAP@20
NetVLAD                  0.80895
LSTM                     0.81571
GRU                      0.81786
RWA                      0.81007
RNN-Residual             0.81510
GRU-Dropout              0.82523
Ensemble 1               0.83996
Ensemble 2               0.83481
Ensemble 3 (searching)   0.83581
Ensemble 4               0.84198

Fusion weights:
• Empirical, based on validation/test set performance
• Searched / learned over a small split of the validation set

Model ensembles: fusion of 3-5 checkpoint predictions per model
Ensemble 1: fusion of 9 model ensembles
Ensemble 2: fusion of another ~20 models with equal weights
Ensemble 3: fusion of the same models as Ensemble 2, with weights obtained by searching
Ensemble 4: fusion of Ensembles 1, 3, and others (see the weighted-fusion sketch below)
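A minimal sketch of weighted prediction fusion with a random search over fusion weights on a held-out split; the Dirichlet random search is an illustrative assumption, as the slides do not specify the exact search procedure:

import numpy as np

def fuse(pred_list, weights):
    # Weighted average of per-model prediction matrices [N, C].
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, pred_list))

def search_weights(pred_list, labels, metric, num_trials=1000, seed=0):
    # Random search for fusion weights on a small validation split.
    rng = np.random.RandomState(seed)
    best_w, best_score = None, -1.0
    for _ in range(num_trials):
        w = rng.dirichlet(np.ones(len(pred_list)))
        score = metric(fuse(pred_list, w), labels)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

With metric=gap_at_20 from the earlier sketch, this reproduces the "searching" strategy of Ensemble 3 in spirit.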

Q&A
