Hierarchical Learning
Oren B. Yeshua (oby1)
CS 6253 – Spring 2007
Agenda
- Hierarchical Learning – what/why?
- A general model
- Constraining the model
- Two concerns
  - Learning structure
  - Learning from induced concepts
    - Small subconcepts – CSL algorithm
    - Larger subconcepts – hard
      - Learner/Teacher tradeoff
- Open Questions
- Practical Considerations
Motivation
- General intelligence / cognitive computing
- Use learning as we know it as a building block
  - To learn intractable concept classes
  - To learn continuously from the environment
- The ML / cognitive-computing community often makes assumptions along these lines
- Examples
  - Utgoff / Stracuzzi:
    "[Labels for every subconcept with every example] may be more information than is strictly necessary…a matter of communication efficiency…not a major concern of ours…provide [all truth values] avoiding all the problems related to discourse."
Motivation (continued)
- Valiant – Neuroidal Model:
  "If the system consists of a chain of circuit units trained in sequence in the above manner, then the errors in one circuit need not propagate to the next. Each circuit will aim to be accurate in the PAC sense as a function of the external inputs–the fact that intermediate levels of gates only approximate the functions that the trainer intended is not necessarily harmful as long as each layer relates to the approximations at the previous layer in a robustly learnable manner. At each internal level, these internal feature sets may nevertheless permit accurate PAC learning at that next stage. That this is indeed possible for natural data sets remains to be proved. Some analysis of this issue of hierarchical learning has been attempted [cites Rivest and Sloan 1994]"
- For more: http://halcyon.googlepages.com/CLT
A Model for Hierarchical Learning
- CLT hardness results indicate that more information is necessary to learn hard concept classes
  - Provide more labels
- Un-learnable target concept c* ∈ C is broken into polynomially many learnable subconcepts y1, …, ys
- Example oracles EXD, EX1, EX2, EX3, …, EXs are provided to the learner (sketched below)
  - EXD: draws x ∈ X at random from distribution D
  - EXi: can be "chained" to EXD to compute yi(x)
- Learner must output h ∈ H that predicts c*
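A minimal Python sketch of this oracle interface, assuming boolean attribute vectors; the helper names (make_EXD, make_EXi) and the two-level demo concepts are illustrative assumptions, not part of the model:

    import random

    def make_EXD(n, seed=0):
        """EXD: draws x in X = {0,1}^n at random (uniform stands in for D)."""
        rng = random.Random(seed)
        def EXD():
            return tuple(rng.randint(0, 1) for _ in range(n))
        return EXD

    def make_EXi(y_i):
        """EXi: chained to a draw from EXD, computes the subconcept label yi(x)."""
        def EXi(x):
            return y_i(x)
        return EXi

    # Two-level example: subconcept y1, target c* that builds on y1.
    y1 = lambda x: x[0] and x[1]
    c_star = lambda x: y1(x) or x[2]

    EXD = make_EXD(n=4)
    EX1, EX2 = make_EXi(y1), make_EXi(c_star)
    x = EXD()
    print(x, EX1(x), EX2(x))   # a labeled example obtained by chaining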
Constraining the Model
- Learn in a sequence of L lessons, 1 ≤ L ≤ s (toy sketch below)
  - Lesson 1: EX1, …, EXd1 provided
  - Lesson 2: EXd1+1, …, EXd2 provided
  - …
  - Lesson L: EXd(L-1)+1, …, EXdL provided
  - Learner decides when to advance to the next lesson
  - EX oracles from the previous lesson are recalled
    - Learner can request examples from a past lesson (penalty?)
  - L may or may not be provided to the learner
- Can further constrain the setting by taking L = s
  - Learn one concept per lesson
- Constrain the size/type of the subconcept classes
- PAC – h must be an (ε,δ)-approximation of c*
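A toy sketch of the lesson protocol; Teacher, current_oracles, and advance are illustrative names, not from the model:

    class Teacher:
        def __init__(self, oracles, boundaries):
            self.oracles = oracles          # [EX1, ..., EXs]
            self.boundaries = boundaries    # [d1, d2, ..., dL]
            self.lesson = 0                 # 0-based index of current lesson

        def current_oracles(self):
            # Oracles from earlier lessons are recalled: only the current
            # lesson's slice is exposed.
            lo = self.boundaries[self.lesson - 1] if self.lesson > 0 else 0
            return self.oracles[lo:self.boundaries[self.lesson]]

        def advance(self):                  # learner decides when to call this
            self.lesson += 1

    # With L = s = 3 and one concept per lesson:
    teacher = Teacher(oracles=["EX1", "EX2", "EX3"], boundaries=[1, 2, 3])
    print(teacher.current_oracles())        # ['EX1']
    teacher.advance()
    print(teacher.current_oracles())        # ['EX2']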
Two Concerns
- Learning structure
  - How do we organize the concept hierarchy? What should we learn from what?
  - Nature/environment/inputs/learnability impose structure
  - Simple strategy: learn whatever you can from the current input and existing knowledge
    - Hypothesis testing
  - STL – Stream To Layers algorithm (Utgoff & Stracuzzi)
    - Tested with 1-hidden-layer NN subconcept learners
    - Toy concept classes: card stackability, Two-Clumps
- Learning from induced concepts
  - Theoretically challenging
  - Required for learning structure (but not vice versa)
  - Crucial for cognitive systems
  - No simple strategy
  - Small subconcepts: CSL (Rivest & Sloan)
  - Larger subconcepts: big open question
The Problem / Challenge
- Lessons above the first have attributes that may not match the true values
  - Some attributes are approximations to the true attributes, computed by previously learned subconcepts
  - Attributes appear "noisy", with noise rate equal to the error rate of the subconcept hypothesis
  - Noise accumulates as the number of learned subconcepts increases (see the bound below)
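To make the accumulation concrete, here is a standard union-bound gloss (my addition, not from the slides): if each previously learned hypothesis f_j errs on x ~ D with probability at most ε', then

$$\Pr_{x\sim D}\bigl[\exists\, j<i :\ f_j(x)\neq y_j(x)\bigr] \;\le\; \sum_{j=1}^{i-1}\Pr_{x\sim D}\bigl[f_j(x)\neq y_j(x)\bigr] \;\le\; (i-1)\,\varepsilon'$$

so by lesson s, up to an (s−1)ε' fraction of draws carry at least one corrupted attribute. This is consistent with CSL (next slides) driving ε' down to ε/sK.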
A Small Step
- When subconcept size is small (polynomial), hierarchical learning is possible/easy/robust to noise (Rivest & Sloan '94)
  - Strong result / constrained setting
- Constraints:
  - Subconcept classes Hi with |Hi| ≤ K = poly(n)
  - Limited to one subconcept per lesson
  - EX oracle recalled after each lesson
- Key insight
  - Maintain a version space Fi of hypotheses for each yi
    - Enables accurate examples at each lesson via "filtering"
    - Enables reliable and probably useful learning (stronger than PAC)
The CSL Algorithm

    CSL(ε, δ, s):
        ε' ← ε / sK
        δ' ← δ / 2s
        m ← ln(K/δ') / ε'
        for i ← 1 to s:                       # advance to next lesson
            get EXi
            for j ← 1 to 2m:                  # create filtered sample
                (x, y) ← Push-Button(EXi)     # draw example
                if F1, …, Fi-1 agree on x:
                    add (x, y) to sample
                if size(sample) = m:
                    break
            if size(sample) < m:
                return FAIL                   # not enough examples
            else:
                Fi ← all hypotheses consistent with sample
        return F1, …, Fs
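The same algorithm as a runnable Python toy, under simplifying assumptions: finite hypothesis classes given as lists of functions, and push_button(i) standing in for Push-Button(EXi) with exact labels. All names in the demo are illustrative, not canonical.

    import math, random

    def agrees(Fj, x):
        """True iff every hypothesis in version space Fj outputs the same label."""
        return len({f(x) for f in Fj}) == 1

    def csl(eps, delta, s, K, push_button, H):
        eps_p = eps / (s * K)                    # ε' ← ε/sK
        delta_p = delta / (2 * s)                # δ' ← δ/2s
        m = math.ceil(math.log(K / delta_p) / eps_p)
        F = []                                   # version spaces F1, ..., Fs
        for i in range(s):                       # lesson i+1: oracle EX_{i+1}
            sample = []
            for _ in range(2 * m):               # build the filtered sample
                x, y = push_button(i)            # draw a labeled example
                # keep x only when all earlier version spaces agree on it
                if all(agrees(Fj, x) for Fj in F):
                    sample.append((x, y))
                    if len(sample) == m:
                        break
            if len(sample) < m:
                return None                      # FAIL: not enough examples
            # Fi = all hypotheses consistent with the filtered sample
            F.append([h for h in H[i] if all(h(x) == y for x, y in sample)])
        return F

    # Tiny demo: y1 = x0 AND x1; target c* = y1 OR x2. For simplicity the
    # lesson-2 hypotheses read x directly; in the real model they would
    # read the chained input (x, y1(x)).
    rng = random.Random(0)
    targets = [lambda x: x[0] & x[1], lambda x: (x[0] & x[1]) | x[2]]
    H = [[lambda x: x[0] & x[1], lambda x: x[0] | x[1],
          lambda x: x[0], lambda x: x[1]],
         [lambda x: (x[0] & x[1]) | x[2], lambda x: x[2],
          lambda x: x[0], lambda x: 1 - x[2]]]

    def push_button(i):
        x = tuple(rng.randint(0, 1) for _ in range(3))
        return x, targets[i](x)

    F = csl(eps=0.5, delta=0.5, s=2, K=4, push_button=push_button, H=H)
    print(F is not None and all(len(Fi) >= 1 for Fi in F))   # True: CSL succeeded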
CSL Intuition
- Since the target concept must be in F1, the common output must be correct for a given example whenever all f ∈ F1 agree on x:

    F1    f(x)
    f1    0
    f2    0
    f5    0
    f9    0

  F1 agrees on x, so y1(x) = F1(x) = 0
CSL Intuition (continued)
- If F1, …, Fi-1 agree on x, then
  - F1(x), …, Fi-1(x) = y1(x), …, yi-1(x)
  - Can learn Fi from x, y1(x), …, yi-1(x)
- h(x, F1, …, Fs) is reliable (see the code sketch below)
  - If CSL returned FAIL, abstain
  - If F1, …, Fs agree on x, output Fs(x) = ys(x) = c*(x)
  - Else abstain
- h is probably useful: with probability 1 − δ,
  - CSL doesn't FAIL
  - Prx∈D[F1, …, Fs agree on x] > 1 − ε
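The reliable predictor in code, continuing the toy sketch above (reuses `agrees`; ABSTAIN is an illustrative sentinel):

    ABSTAIN = None   # stand-in value for "abstain"

    def predict(F, x):
        """Reliable h(x, F1, ..., Fs): answers only when it cannot be wrong."""
        if F is None:                        # CSL returned FAIL
            return ABSTAIN
        if all(agrees(Fj, x) for Fj in F):   # every version space agrees on x
            return F[-1][0](x)               # = Fs(x) = ys(x) = c*(x)
        return ABSTAIN                       # otherwise abstain

    # e.g., with F from the CSL demo above: predict(F, (1, 1, 0)) -> 1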
Taking a Bigger Bite
- CSL analysis suggests investigating attribute noise
- Sloan ('95) seems to have gone down this path: monomials and k-DNF
- Results (bounds on tolerable attribute-noise rates; MAL, URA, and PRA are likely malicious, uniform random, and product random attribute noise):

                 Monomials            k-DNF
    EMAL^poly    ≥ ε/n ln(n/ε)        ≥ ε/n^k ln(n^k/ε)
    EMAL         < ε/(1+ε)            < ε/(1+ε)
    EURA^poly    ≥ 1/2 − ω, ω > 0     open
    EURA         < 1/2                < 1/2
    EPRA         < 2ε                 < 2ε

- Implications
  - Unlikely
Open Questions
- What is the largest subconcept class learnable in a hierarchical setting?
  - How?
- Can the right teacher/learner tradeoff (model constraints) enable learning of larger subconcepts?
  - Message passing between nodes
    - Same depth
    - Different depth
  - Feedback
Practical Considerations
- Promising results have been reported in practice
- Are the theoretical models too focused on "worst-case" analysis?
- Accuracy under attribute noise is measured with respect to a noise-free test set; in practice, all data is noisy – should it be measured with respect to the noisy input data?
  - New noise models?
Courtesy of xkcd: "The problem with perspective is that it's bi-directional."