Tree models with Scikit-Learn: Great learners with little assumptions
Material: https://github.com/glouppe/talk-pydata2015

Gilles Louppe (@glouppe), CERN

PyData, April 3, 2015

Outline

1. Motivation
2. Growing decision trees
3. Random forests
4. Boosting
5. Reading tree leaves
6. Summary

Motivation


Running example

From physicochemical properties (alcohol, acidity, sulphates, ...), learn a model to predict wine taste preferences.

Growing decision trees

Supervised learning

• Data comes as a finite learning set L = (X, y) where
  - Input samples are given as an array of shape (n_samples, n_features).
    E.g., feature values for wine physicochemical properties:

    # fixed acidity, volatile acidity, ...
    X = [[ 7.4  0.   ...  0.56  9.4  0. ]
         [ 7.8  0.   ...  0.68  9.8  0. ]
         ...
         [ 7.8  0.04 ...  0.65  9.8  0. ]]

  - Output values are given as an array of shape (n_samples,).
    E.g., wine taste preferences (from 0 to 10):

    y = [5 5 5 ... 6 7 6]

• The goal is to build an estimator ϕL : X → Y minimizing

    Err(ϕL) = E_{X,Y}{ L(Y, ϕL.predict(X)) }.
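As a concrete starting point, here is a hypothetical loading sketch for the running example; the CSV name, separator and "quality" column follow the UCI wine quality data and are assumptions, not material from the slides:

# Hypothetical loading of the wine running example into X, y
# (assumes a local copy of the UCI wine quality CSV, semicolon-separated,
# with a "quality" column; adjust the path and columns to your data).
import pandas as pd
from sklearn.cross_validation import train_test_split  # pre-0.18 module path, as used in the talk

data = pd.read_csv("winequality-red.csv", sep=";")
feature_names = [c for c in data.columns if c != "quality"]
X = data[feature_names].values   # shape (n_samples, n_features)
y = data["quality"].values       # shape (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)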

Decision trees (Breiman et al., 1984)

[Figure: a decision tree over the input space (X1, X2). The root t1 tests X1 ≤ 0.7, an internal split node t2 tests X2 ≤ 0.5, and the leaf nodes t3, t4, t5 correspond to the resulting rectangular regions; a sample x is routed to a leaf, which returns p(Y = c | X = x).]

function BuildDecisionTree(L)
    Create node t
    if the stopping criterion is met for t then
        Assign a model to ŷ_t
    else
        Find the split on L that maximizes impurity decrease:
            s* = arg max_s  i(t) − p_L i(t_L^s) − p_R i(t_R^s)
        Partition L into L_{t_L} ∪ L_{t_R} according to s*
        t_L = BuildDecisionTree(L_{t_L})
        t_R = BuildDecisionTree(L_{t_R})
    end if
    return t
end function
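To make the procedure concrete, here is a minimal, self-contained Python sketch of the same recursion for regression, using the variance of y as the impurity i(t) and the mean as the leaf model. The names (Node, best_split, build_tree, predict_one) and the min_samples_split stopping rule are illustrative choices, not scikit-learn internals:

import numpy as np

class Node:
    def __init__(self, prediction, feature=None, threshold=None, left=None, right=None):
        self.prediction = prediction    # leaf model: here, the mean of y at the node
        self.feature = feature          # index of the split variable (None for a leaf)
        self.threshold = threshold      # split threshold
        self.left, self.right = left, right

def impurity(y):
    # i(t) for regression: variance of the outputs reaching the node
    return y.var()

def best_split(X, y):
    # Find s* = arg max_s  i(t) - p_L i(t_L^s) - p_R i(t_R^s)
    best, best_decrease = None, 0.0
    parent = impurity(y)
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j])[:-1]:    # exclude the max so both sides are non-empty
            mask = X[:, j] <= threshold
            p_left = mask.mean()
            decrease = parent - p_left * impurity(y[mask]) - (1.0 - p_left) * impurity(y[~mask])
            if decrease > best_decrease:
                best, best_decrease = (j, threshold), decrease
    return best

def build_tree(X, y, min_samples_split=5):
    split = best_split(X, y) if len(y) >= min_samples_split else None
    if split is None:                                # stopping criterion met: make a leaf
        return Node(prediction=y.mean())
    j, threshold = split
    mask = X[:, j] <= threshold                      # partition L into L_tL and L_tR
    return Node(prediction=y.mean(), feature=j, threshold=threshold,
                left=build_tree(X[mask], y[mask], min_samples_split),
                right=build_tree(X[~mask], y[~mask], min_samples_split))

def predict_one(tree, x):
    while tree.feature is not None:                  # follow the splits down to a leaf
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.prediction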

Composability of decision trees

Decision trees can be used to solve several machine learning tasks by swapping the impurity and leaf model functions:

• 0-1 loss (classification):
    ŷ_t = arg max_{c ∈ Y} p(c|t),   i(t) = entropy(t) or i(t) = gini(t)

• Mean squared error (regression):
    ŷ_t = mean(y|t),   i(t) = (1/N_t) Σ_{x,y ∈ L_t} (y − ŷ_t)²

• Least absolute deviation (regression):
    ŷ_t = median(y|t),   i(t) = (1/N_t) Σ_{x,y ∈ L_t} |y − ŷ_t|

• Density estimation:
    ŷ_t = N(μ_t, Σ_t),   i(t) = differential entropy(t)
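In scikit-learn this composability surfaces through the criterion parameter; a small illustrative sketch (criterion names as of the scikit-learn version used in the talk; recent releases rename "mse" to "squared_error"):

# Sketch: the same tree machinery handles classification and regression
# by switching the criterion, i.e., the impurity i(t), and the implied leaf model.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf_gini    = DecisionTreeClassifier(criterion="gini")     # 0-1 loss, i(t) = gini(t)
clf_entropy = DecisionTreeClassifier(criterion="entropy")  # 0-1 loss, i(t) = entropy(t)
reg_mse     = DecisionTreeRegressor(criterion="mse")       # mean squared error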

sklearn.tree

# Fit a decision tree
from sklearn.tree import DecisionTreeRegressor
estimator = DecisionTreeRegressor(criterion="mse",     # Set i(t) function
                                  max_leaf_nodes=5)    # Tune model complexity with max_leaf_nodes,
                                                       # max_depth or min_samples_split
estimator.fit(X_train, y_train)

# Predict target values
y_pred = estimator.predict(X_test)

# MSE on test data
from sklearn.metrics import mean_squared_error
score = mean_squared_error(y_test, y_pred)
>>> 0.572049826453

Visualize and interpret

# Display tree
from sklearn.tree import export_graphviz
export_graphviz(estimator, out_file="tree.dot",
                feature_names=feature_names)
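To render the exported file as an image, one option (not shown in the slides) is to call the Graphviz dot binary, assuming it is installed and on the PATH:

# Render tree.dot to a PNG with the Graphviz command-line tool
import subprocess
subprocess.check_call(["dot", "-Tpng", "tree.dot", "-o", "tree.png"])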

Strengths and weaknesses of decision trees

• Non-parametric model, proved to be consistent.
• Support heterogeneous data (continuous, ordered or categorical variables).
• Flexibility in loss functions (but the choice is limited).
• Fast to train, fast to predict. In the average case, the complexity of training is Θ(p N log² N).
• Easily interpretable.
• Low bias, but usually high variance.
    Solution: combine the predictions of several randomized trees into a single model.

Random forests

Random Forests (Breiman, 2001; Geurts et al., 2006)

[Figure: an ensemble of M randomized trees ϕ1, ..., ϕM; each tree m routes the input x to a leaf and outputs p_{ϕm}(Y = c | X = x), and these predictions are aggregated into the ensemble output p_ψ(Y = c | X = x).]

Randomization:
• Bootstrap samples + random selection of K ≤ p split variables → Random Forests
• Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
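As a minimal usage sketch (not from the slides), both forest variants share the same estimator API; X_train, y_train, X_test, y_test are assumed to exist from an earlier split:

from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

for Forest in (RandomForestRegressor, ExtraTreesRegressor):
    # K <= p split variables are drawn at random at each node via max_features
    forest = Forest(n_estimators=100, max_features="sqrt", n_jobs=-1)
    forest.fit(X_train, y_train)
    print(Forest.__name__, mean_squared_error(y_test, forest.predict(X_test)))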

Bias and variance


Bias-variance decomposition

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ1,...,θM}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θm} is

    E_L{Err(ψ_{L,θ1,...,θM}(x))} = noise(x) + bias²(x) + var(x),

where

    noise(x) = Err(ϕ_B(x)),
    bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
    var(x)   = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),

and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.

Diagnosing the error of random forests (Louppe, 2014)

• Bias: identical to the bias of a single randomized tree.
• Variance: var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x)   (illustrated numerically below)
    As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
    The stronger the randomization, ρ(x) → 0, var(x) → 0.
    The weaker the randomization, ρ(x) → 1, var(x) → σ²_{L,θ}(x).

Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model. The crux of the problem is to find the right trade-off.
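A tiny numerical illustration of the variance formula, with made-up values for ρ(x) and σ²_{L,θ}(x) rather than quantities estimated from data:

# As M grows, the ensemble variance shrinks towards rho * sigma^2.
rho, sigma2 = 0.3, 1.0
for M in (1, 10, 100, 1000):
    var = rho * sigma2 + (1.0 - rho) / M * sigma2
    print("M = %4d  var(x) = %.3f" % (M, var))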

Tuning randomization in sklearn.ensemble

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.cross_validation import ShuffleSplit
from sklearn.learning_curve import validation_curve

# Validation of max_features, controlling randomness in forests
param_name = "max_features"
param_range = range(1, X.shape[1] + 1)

for Forest, color, label in [(RandomForestRegressor, "g", "RF"),
                             (ExtraTreesRegressor, "r", "ETs")]:
    _, test_scores = validation_curve(
        Forest(n_estimators=100, n_jobs=-1), X, y,
        cv=ShuffleSplit(n=len(X), n_iter=10, test_size=0.25),
        param_name=param_name, param_range=param_range,
        scoring="mean_squared_error")
    test_scores_mean = np.mean(-test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label=label, color=color)

plt.xlabel(param_name)
plt.xlim(1, max(param_range))
plt.ylabel("MSE")
plt.legend(loc="best")
plt.show()

[Figure: validation curves of test MSE as a function of max_features for Random Forests and Extra-Trees.]

Best trade-off: Extra-Trees, with max_features=6.

Strengths and weaknesses of forests

• One of the best off-the-shelf learning algorithms, requiring almost no tuning.
• Fine control of bias and variance through averaging and randomization, resulting in better performance.
• Moderately fast to train and to predict:
    Θ(M K Ñ log² Ñ) for RFs (where Ñ = 0.632 N),
    Θ(M K N log N) for ETs.
• Embarrassingly parallel (use n_jobs).
• Less interpretable than decision trees.

Boosting

Gradient Boosted Regression Trees (Friedman, 2001)

• GBRT fits an additive model of the form

      ϕ(x) = Σ_{m=1}^{M} γ_m h_m(x)

• The ensemble is built in a forward stagewise manner, where each regression tree h_m is an approximate successive gradient step (see the sketch below).

[Figure: the ground truth function of x together with the successive regression trees (tree 1 + tree 2 + tree 3) whose sum approximates it.]
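For intuition, here is a minimal sketch of forward stagewise fitting for the squared loss, where each tree is fit to the current residuals (the negative gradient). The helper names are illustrative; scikit-learn's GradientBoostingRegressor implements this idea with many refinements:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    intercept = y.mean()                       # initial constant model
    prediction = np.full(len(y), intercept)
    trees = []
    for m in range(n_estimators):
        residuals = y - prediction             # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return intercept, learning_rate, trees

def predict_gbrt(model, X):
    intercept, learning_rate, trees = model
    return intercept + learning_rate * sum(tree.predict(X) for tree in trees)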

Careful tuning required

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.cross_validation import ShuffleSplit
from sklearn.grid_search import GridSearchCV

# Careful tuning is required to obtain good results
param_grid = {"learning_rate": [0.1, 0.01, 0.001],
              "subsample": [1.0, 0.9, 0.8],
              "max_depth": [3, 5, 7],
              "min_samples_leaf": [1, 3, 5]}

est = GradientBoostingRegressor(n_estimators=1000)
grid = GridSearchCV(est, param_grid,
                    cv=ShuffleSplit(n=len(X), n_iter=10, test_size=0.25),
                    scoring="mean_squared_error",
                    n_jobs=-1).fit(X, y)
gbrt = grid.best_estimator_

See our PyData 2014 tutorial for further guidance:
https://github.com/pprett/pydata-gbrt-tutorial

Strengths and weaknesses of GBRT

• Often more accurate than random forests.
• Flexible framework that can adapt to arbitrary loss functions.
• Fine control of under/overfitting through regularization (e.g., learning rate, subsampling, tree structure, penalization term in the loss function, etc.).
• Careful tuning required.
• Slow to train, fast to predict.
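One convenient way to diagnose under/overfitting of a fitted GBRT is to track test error stage by stage with staged_predict; a sketch, assuming X_test and y_test exist:

import numpy as np
from sklearn.metrics import mean_squared_error

# Test MSE after each boosting stage of the tuned model
test_mse = [mean_squared_error(y_test, y_pred)
            for y_pred in gbrt.staged_predict(X_test)]
best_n = int(np.argmin(test_mse)) + 1
print("best number of trees:", best_n)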

Reading tree leaves

Variable importances

import pandas as pd

importances = pd.DataFrame()

# Variable importances with Random Forest, default parameters
est = RandomForestRegressor(n_estimators=10000, n_jobs=-1).fit(X, y)
importances["RF"] = pd.Series(est.feature_importances_,
                              index=feature_names)

# Variable importances with Totally Randomized Trees
est = ExtraTreesRegressor(max_features=1, max_depth=3,
                          n_estimators=10000, n_jobs=-1).fit(X, y)
importances["TRTs"] = pd.Series(est.feature_importances_,
                                index=feature_names)

# Variable importances with GBRT
importances["GBRT"] = pd.Series(gbrt.feature_importances_,
                                index=feature_names)

importances.plot(kind="barh")

[Figure: horizontal bar chart comparing the variable importances assigned by RF, TRTs and GBRT to each feature.]

Importances are measured only through the eyes of the model. They may not tell the entire story, nor the same story! (Louppe et al., 2013)

Partial dependence plots

Relation between the response Y and a subset of the features, marginalized over all other features.

from sklearn.ensemble.partial_dependence import plot_partial_dependence
plot_partial_dependence(gbrt, X, features=[1, 10],
                        feature_names=feature_names)

Embedding

from sklearn.ensemble import RandomTreesEmbedding
from sklearn.decomposition import TruncatedSVD

# Project wines through a forest of totally randomized trees
# and use the leaves the samples end up in as a high-dimensional representation
hasher = RandomTreesEmbedding(n_estimators=1000)
X_transformed = hasher.fit_transform(X)

# Plot wines on a plane using the 2 principal components
svd = TruncatedSVD(n_components=2)
coords = svd.fit_transform(X_transformed)

n_values = 10 + 1  # Wine preferences are from 0 to 10
cm = plt.get_cmap("hsv")
colors = (cm(1. * i / n_values) for i in range(n_values))
for k, c in zip(range(n_values), colors):
    plt.plot(coords[y == k, 0], coords[y == k, 1], '.', label=k, color=c)
plt.legend()
plt.show()

[Figure: 2D projection of the wines after the random-trees embedding; the points separate into 2 clusters.]

Can you guess what these 2 clusters correspond to?

Summary

• Tree-based methods offer a flexible and efficient non-parametric framework for classification and regression.
• Applicable to a wide variety of problems, with fine control over the model that is learned.
• Assume a good feature representation, i.e., tree-based methods are often not that good on very raw input data, like pixels, speech signals, etc.
• Insights on the problem under study (variable importances, dependence plots, embedding, ...).
• Efficient implementation in Scikit-Learn.

Join us on https://github.com/scikit-learn/scikit-learn

References

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.
Louppe, G. (2014). Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502.
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439.
