Hierarchical Learning

Oren B. Yeshua (oby1)
CS 6253 – Spring 2007
Agenda

- Hierarchical Learning – what/why?
- A general model
- Constraining the model
- Two concerns
  - Learning structure
  - Learning from induced concepts
    - Small subconcepts – CSL algorithm
    - Larger subconcepts – hard
- Learner/Teacher tradeoff
- Open Questions
- Practical Considerations

Motivation

- General intelligence / cognitive computing
- Use learning as we know it as a building block
  - To learn intractable concept classes
  - To learn continuously from the environment
- The ML/Cognitive Computing community often makes assumptions along these lines
- Example: Utgoff / Stracuzzi
  - "[Labels for every subconcept with every example] may be more information than is strictly necessary…a matter of communication efficiency…not a major concern of ours…provide [all truth values] avoiding all the problems related to discourse."

Motivation (continued)

- Valiant – Neuroidal Model
  - "If the system consists of a chain of circuit units trained in sequence in the above manner, then the errors in one circuit need not propagate to the next. Each circuit will aim to be accurate in the PAC sense as a function of the external inputs–the fact that intermediate levels of gates only approximate the functions that the trainer intended is not necessarily harmful as long as each layer relates to the approximations at the previous layer in a robustly learnable manner. At each internal level, these internal feature sets may nevertheless permit accurate PAC learning at that next stage. That this is indeed possible for natural data sets remains to be proved. Some analysis of this issue of hierarchical learning has been attempted [cites Rivest and Sloan 1994]"
- For more: http://halcyon.googlepages.com/CLT

A Model for Hierarchical Learning

- CLT hardness results indicate that more information is necessary to learn hard concept classes
  - Provide more labels
- Un-learnable target concept c* ∈ C is broken into polynomially many learnable subconcepts y1,…,ys
- Example oracles EXD, EX1, EX2, EX3, …, EXs are provided to the learner
  - EXD: draws x ∈ X at random from distribution D
  - EXi: can be "chained" to EXD to compute yi(x)
- Learner must output h ∈ H that predicts c*
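For concreteness, here is a minimal Python sketch of this oracle interface. The helper names (make_EX_D, make_EX_i) and the toy subconcept are this sketch's assumptions, not part of the model definition:

```python
import random

def make_EX_D(sample_x):
    """EX_D: draws x in X at random from distribution D (here, any sampler)."""
    return sample_x

def make_EX_i(EX_D, y_i):
    """EX_i 'chained' to EX_D: draws x via EX_D and labels it with y_i(x)."""
    def EX_i():
        x = EX_D()
        return x, y_i(x)
    return EX_i

# Toy instance: X = {0,1}^3 under the uniform distribution,
# with subconcept y_1(x) = x_0 AND x_1.
EX_D = make_EX_D(lambda: tuple(random.randint(0, 1) for _ in range(3)))
EX_1 = make_EX_i(EX_D, lambda x: x[0] & x[1])

x, y = EX_1()  # one labeled draw, e.g. ((1, 1, 0), 1)
```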

Constraining the Model

- Learn in a sequence of L lessons, 1 ≤ L ≤ s
  - Lesson 1: EX1,…,EXd1 is provided
  - Lesson 2: EXd1+1,…,EXd2 is provided
  - …
  - Lesson L: EXd(L-1)+1,…,EXdL is provided
  - Learner decides when to advance to the next lesson
  - EX oracles from the previous lesson are recalled
    - Learner can request an example from a past lesson (penalty?)
- Can further constrain the setting: L = s
  - Learn one subconcept per lesson
- L may or may not be provided to the learner
- Constrain the size/type of the subconcept classes
- PAC – h must be an (ε,δ)-approximation of c*

Two Concerns

- Learning structure
  - How do we organize the concept hierarchy? What should we learn from what?
  - Nature/environment/inputs/learnability impose structure
  - Simple strategy: learn whatever you can from the current input and existing knowledge
    - Hypothesis testing
  - STL – Stream To Layers algorithm
    - Utgoff & Stracuzzi
    - Tested with 1-hidden-layer NN subconcept learners
    - Toy concept classes: card stackability, Two-Clumps
- Learning from induced concepts
  - Theoretically challenging
    - No simple strategy
  - Required for learning structure (but not vice versa)
  - Crucial for cognitive systems
  - CSL (Rivest & Sloan)
  - Big open question

The Challenge

- Lessons above the first have attributes that may not match the true values
  - Some attributes are approximations to the true attributes, computed by previously learned subconcepts
  - Attributes appear "noisy", with noise rate equal to the error rate of the subconcept hypothesis
  - Noise accumulates as the number of learned subconcepts increases
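The accumulation can be made concrete with a standard union-bound step (added here; not spelled out on the slide). If each learned subconcept hypothesis hj has error rate εj with respect to D, then for a fresh example x,

```latex
\Pr_{x \sim D}\bigl[\,\exists\, j < i : h_j(x) \neq y_j(x)\,\bigr] \;\le\; \sum_{j<i} \varepsilon_j
```

so an example reaching lesson i carries at least one corrupted induced attribute with probability up to (i−1)ε when each subconcept is learned to error ε.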

A Small Step

- When subconcept size is small (polynomial), hierarchical learning is possible/easy/robust to noise (Rivest & Sloan '94)
  - Strong result / constrained setting
- Constraints:
  - Subconcept classes Hi with |Hi| ≤ K = poly(n)
  - Limit to one subconcept per lesson
  - EX oracle recalled after its lesson
- Key insight
  - Maintain a version space of hypotheses Fi for each yi
    - Enables accurate examples at each lesson via "filtering"
    - Enables reliable and probably useful learning (stronger than PAC)

The CSL Algorithm

CSL(ε, δ, s)
    ε' ← ε/(sK)
    δ' ← δ/(2s)
    m ← ln(K/δ')/ε'
    for i ← 1 to s                      # advance to next lesson
        get EXi
        for j ← 1 to 2m                 # create filtered sample
            (x, y) ← Push-Button(EXi)   # draw example
            if F1,…,Fi-1 agree on x
                add (x, y) to sample
            if size(sample) = m
                break
        if size(sample) < m
            return FAIL                 # not enough examples
        else
            Fi ← all hypotheses consistent with sample
    return F1,…,Fs
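Below is a runnable Python sketch of CSL under the stated constraints (finite hypothesis classes of size at most K, one subconcept per lesson). The concrete representation, hypotheses as callables f(x, prior_vals) and oracles as callables standing in for Push-Button(EXi), is an assumption of this sketch, not the paper's:

```python
import math

def csl(epsilon, delta, s, K, hypothesis_classes, oracles):
    """CSL sketch: hypothesis_classes[i] is a finite list of candidate
    hypotheses for y_i, each callable as f(x, prior_vals), where prior_vals
    are the (filtered, hence correct) values of y_1..y_{i-1} on x;
    oracles[i]() plays the role of Push-Button(EX_i), returning (x, y_i(x))."""
    eps_p = epsilon / (s * K)
    delta_p = delta / (2 * s)
    m = math.ceil(math.log(K / delta_p) / eps_p)

    F = []                                    # F[i]: version space for y_i
    for i in range(s):                        # advance to next lesson
        sample = []
        for _ in range(2 * m):                # create filtered sample
            x, y = oracles[i]()               # draw example
            vals, agree = [], True
            for Fj in F:                      # do F_1,…,F_{i-1} agree on x?
                outs = {f(x, tuple(vals)) for f in Fj}
                if len(outs) != 1:
                    agree = False
                    break
                vals.append(outs.pop())
            if agree:
                sample.append((x, tuple(vals), y))
            if len(sample) == m:
                break
        if len(sample) < m:
            return None                       # FAIL: not enough examples
        F.append([f for f in hypothesis_classes[i]
                  if all(f(x, v) == y for x, v, y in sample)])
    return F                                  # F_1,…,F_s
```

Filtering is what keeps later lessons noise-free: an example is used only when every earlier version space is unanimous on it, so the induced attribute values handed to lesson i are guaranteed correct.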

CSL Intuition

- Since the target concept must be in F1, the common output must be correct for a given example whenever all f ∈ F1 agree on x
- Example: every hypothesis remaining in F1 outputs the same value on x

      f ∈ F1    f(x)
      f1        0
      f2        0
      f5        0
      f9        0

  F1 agrees on x, so y1(x) = F1(x) = 0

CSL Intuition (continued)

- If F1,…,Fi-1 agree on x, then F1(x),…,Fi-1(x) = y1(x),…,yi-1(x)
  - Can learn Fi from x, y1(x),…,yi-1(x)
- h(x, F1,…,Fs) is reliable:
  - If CSL returns FAIL, abstain
  - If F1,…,Fs agree on x, output Fs(x) = ys(x) = c*(x)
  - Else abstain
- h is probably useful: with probability 1−δ,
  - CSL doesn't FAIL
  - Pr_{x∼D}[F1,…,Fs agree on x] > 1−ε
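Read as code, the reliable predictor is just unanimity checking over CSL's output. A minimal sketch, consistent with the csl sketch above (function and variable names are this example's, not the paper's):

```python
def h(x, F):
    """Predict c*(x) from CSL's output F, abstaining (None) when unsure."""
    if F is None:                 # CSL returned FAIL
        return None               # abstain
    vals = []
    for Fi in F:
        outs = {f(x, tuple(vals)) for f in Fi}
        if len(outs) != 1:        # some version space disagrees on x
            return None           # abstain
        vals.append(outs.pop())
    return vals[-1]               # F_s(x) = y_s(x) = c*(x)
```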

Taking a Bigger Bite

- CSL analysis suggests investigating attribute noise
- Sloan ('95) seems to have gone down this path
- Results:

      Noise model   Monomials           k-DNF
      EMALpoly      ≥ ε/n ln(n/ε)       ≥ ε/n^k ln(n^k/ε)
      EMAL          < ε/(1+ε)           < ε/(1+ε)
      EURApoly      ≥ ½−ω, ω>0          open
      EURA          < 1/2               < 1/2
      EPRA          < 2ε                < 2ε

- Implications: Unlikely

Open Questions

- What is the largest subconcept class learnable in a hierarchical setting? How?
- Can the right teacher/learner tradeoff (model constraints) enable learning of larger subconcepts?
  - Message passing between nodes
    - Same depth
    - Different depth
  - Feedback

Practical Considerations

- Promising results have been reported in practice
- Are the theoretical models too focused on "worst-case" analysis?
- Accuracy under attribute noise is measured with respect to a noise-free test set; in practice, all data is noisy, so accuracy should be measured with respect to noisy input data
  - New noise models?

Courtesy of xkcd: "The problem with perspective is that it's bi-directional."
