Experiments with Random Projections for Machine Learning Dmitriy Fradkin Joint Work with David Madigan


Purpose

To evaluate the effectiveness of Random Projections (RPs), compared to PCA, with different machine learning algorithms.


Supervised Learning Problem

Inductive supervised learning infers a functional relation y = f(x) from a set of training examples T = {(x1, y1), . . . , (xn, yn)}. In what follows the inputs are vectors xi = (xi1, . . . , xip) in R^p.



The Need for Dimensionality Reduction

Data with large dimensionality (p) presents problems for many machine learning algorithms:
• their computational complexity can be superlinear in p
• they may need complexity control to avoid overfitting
Traditional methods such as PCA/SVD are computationally expensive:
• PCA is O(p²n) + O(p³) [Golub and van Loan, 1983]
• SVD is somewhat more efficient: for sparse matrices with r non-zero entries per column there are O(prn) algorithms [Papadimitriou et al., 1998]


Johnson-Lindenstrauss Theorem

A theorem due to Johnson and Lindenstrauss (the JL Theorem) states that for a set of n points in p-dimensional Euclidean space there exists a linear transformation of the data into a q-dimensional space, with q ≥ O(ε^(-2) log n), that preserves distances up to a factor of 1 ± ε [Johnson and Lindenstrauss, 1984].
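As a quick sanity check of the theorem (our own illustration, assuming numpy; the scaled ±1 matrix used here is one of the constructions introduced on the next slide):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, q = 100, 1000, 500
    X = rng.standard_normal((n, p))

    # Scaled +/-1 random projection matrix.
    P = rng.choice([-1.0, 1.0], size=(p, q)) / np.sqrt(q)
    E = X @ P

    def pairwise_dists(A):
        diff = A[:, None, :] - A[None, :, :]   # all pairwise row differences
        return np.sqrt((diff ** 2).sum(axis=-1))

    mask = ~np.eye(n, dtype=bool)              # ignore zero self-distances
    ratios = pairwise_dists(E)[mask] / pairwise_dists(X)[mask]
    print(ratios.min(), ratios.max())          # typically within roughly 1 +/- 0.1

The ratio of projected to original distances stays close to 1 for every pair of points, which is exactly the guarantee the theorem provides.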


“Database-Friendly Random Projections”

Theorem 1 [Achlioptas, 2001]: Given n points in R^p, represented as an n × p matrix X, let ε, β > 0 and let

    q ≥ (4 + 2β) / (ε²/2 − ε³/3) · log n.

Let E = (1/√q) X P for a p × q projection matrix P. Then the mapping from X to E preserves distances up to a factor of 1 ± ε for all rows in X with probability (1 − n^(−β)). The projection matrix P can be constructed in one of the following ways:
• r_ij = ±1 with probability 0.5 each
• r_ij = √3 · (±1 with probability 1/6 each, or 0 with probability 2/3)


Time Complexity of Random Projections

The above projections are easy to implement and to compute. Constructing a p × q random matrix is O(pq). Performing the projection for n points is O(npq).
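As a rough back-of-the-envelope comparison (our own arithmetic, ignoring constants): for a dataset the size of Leukemia (n = 72, p = 3571) projected to q = 500, RP costs about pq ≈ 1.8 × 10^6 operations to build P plus npq ≈ 1.3 × 10^8 for the projection, while the O(p²n) and O(p³) terms for PCA are on the order of 9 × 10^8 and 4.6 × 10^10 respectively.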


Theoretical Effectiveness

Figure 1: Plot of the lower bound q on the dimensionality of random projections as a function of the number of points (x axis: 0 to 5000 points; y axis: 0 to 10000). The upper curve corresponds to ε = 0.1, the middle one to ε = 0.2, the lowest one to ε = 0.5. β = 1 for all of these, allowing a deviation by a factor greater than ε with probability 1/n.
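The curves follow directly from the bound in Theorem 1. A minimal sketch that reproduces their magnitudes (our own illustration, assuming the natural logarithm and β = 1 as in the figure):

    import math

    def jl_lower_bound(n, eps, beta=1.0):
        # Lower bound on q from Theorem 1 [Achlioptas, 2001]:
        # q >= (4 + 2*beta) / (eps**2/2 - eps**3/3) * log(n)
        return math.ceil((4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * math.log(n))

    for eps in (0.1, 0.2, 0.5):
        print(eps, jl_lower_bound(5000, eps))
    # eps = 0.1 gives q on the order of 10^4 at n = 5000, matching the scale
    # of the upper curve; eps = 0.5 needs only a few hundred dimensions.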

Previous Experiments

[Bingham and Mannila, 2001] experimentally show that RPs preserve similarity (inner products) well even when the dimensionality of the projection is moderate. They also compare the performance of RP to PCA, SVD and DCT. Their data had p = 5000, n = 2262 for text data, and p = 2500, n = 1000 for image data. Projections were done to q ∈ [1, 800].


Other Work with Random Projections
• Theoretical Approximate Nearest Neighbor algorithm with polynomial preprocessing and query time polynomial in p and log n [Indyk and Motwani, 1998]. Also, the first tight bounds on the quality of randomized dimensionality reduction.
• Learning mixtures of Gaussians in high dimensions [Dasgupta, 1999], [Dasgupta, 2000]. A combination of RP with the EM algorithm gives good classification results on a hand-written digit dataset.
• Preservation of volumes and affine distances [Magen, 2002].
• Deterministic algorithm for constructing JL mappings [Engebretsen, Indyk and O'Donnell, 2002], used to derandomize several randomized algorithms.
• Approximate kernel computations [Achlioptas, McSherry and Schölkopf, 2001], similarity computations for histogram models [Thaper et al., 2002].

Our Implementation of Random Projections

We chose to implement the first of the methods suggested by Achlioptas:
• r_ij = ±1 with probability 0.5 each
Since we are not concerned with preserving distances per se, but only with preserving separation between points, we do not scale our projection: E = XP instead of E = (1/√q) XP.
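A minimal sketch of this construction (our illustration, not the original code), assuming numpy:

    import numpy as np

    def random_projection(X, q, rng=None):
        """Project the rows of X (n x p) into q dimensions with a +/-1 random
        matrix. The usual 1/sqrt(q) scaling is skipped: separation between
        points, not the distances themselves, is what matters here."""
        rng = rng or np.random.default_rng()
        P = rng.choice([-1.0, 1.0], size=(X.shape[1], q))  # r_ij = +/-1 w.p. 0.5 each
        return X @ P  # E = XP

    # e.g. project 62 points (Colon-sized) from 2000 dimensions down to 50
    E = random_projection(np.random.default_rng(1).standard_normal((62, 2000)), 50)
    print(E.shape)  # (62, 50)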


Description of Data

Ionosphere, Spambase and Internet Ads were taken from the UCI repository. Colon and Leukemia were first used in [Alon et al., 1999] and [Golub et al., 1999] respectively.

Table 1: Datasets used in the experiments

Name        # Instances    # Attributes
Ion         351            34
Spam        4601           57
Ads         3279           1554
Colon       62             2000
Leukemia    72             3571

Choice of Projection Dimensions
• Colon and Leukemia datasets are of high dimensionality but have few points. Thus we would expect RP to high dimensions to lead to good results, while PCA results should stop changing after some point. For these datasets we perform projections into spaces of dimensionality 5, 10, 25, 50, 100, 200 and 500.
• Ionosphere and Spam are relatively low-dimensional but have many more points than the Colon and Leukemia datasets. Such a combination in theory leaves little room for RP to improve, while PCA should be able to do well. We project to dimensions 5, 10, 15, 20, 25 and 30.
• The Ads dataset is both large and high-dimensional, and seems to fall somewhere between the others. Projections are done to 5, 10, 25, 50, 100, 200 and 500.


Experimental Setup

We compare PCA and RP using a number of standard machine learning tools:
• decision trees (C4.5 - [Quinlan, 1993])
• linear SVM (SVMLight - [Joachims, 1999])
• nearest neighbor (NN)
Test set sizes were kept constant over different splits: Ionosphere - 51, Spambase - 1601, Colon - 12, Leukemia - 12, Ads - 1079.


Experimental Procedure
Require: Dataset D, set of projection dimensions {d1, . . . , dk}, number of test/training splits s to be done (we perform 30 splits for Ads and 100 splits for the other datasets)
1: for i = 1, . . . , s do
2:   split D into a training set and a test set
3:   normalize the data (estimating mean and variance from the training set)
4:   for d' = d1, . . . , dk do
5:     do a PCA on the training set and project both training and test data into R^d'
6:     create a random projection matrix as described above and project both training and test data into R^d'
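A condensed sketch of this loop (our illustration, assuming scikit-learn; synthetic Ionosphere-sized data and 1-NN stand in for the real datasets and the full set of classifiers):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.standard_normal((351, 34))      # placeholder, sized like Ionosphere
    y = rng.integers(0, 2, size=351)

    for i in range(100):                    # 100 test/training splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=51, random_state=i)
        scaler = StandardScaler().fit(X_tr)              # stats from training set only
        X_tr_n, X_te_n = scaler.transform(X_tr), scaler.transform(X_te)
        for d in (5, 10, 15, 20, 25, 30):
            pca = PCA(n_components=d).fit(X_tr_n)        # PCA projection into R^d
            acc_pca = KNeighborsClassifier(1).fit(pca.transform(X_tr_n), y_tr) \
                                             .score(pca.transform(X_te_n), y_te)
            P = rng.choice([-1.0, 1.0], size=(34, d))    # random projection into R^d
            acc_rp = KNeighborsClassifier(1).fit(X_tr_n @ P, y_tr) \
                                            .score(X_te_n @ P, y_te)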
Results on Ion Dataset

[Figure: classification accuracy (y axis, 60-100) vs. projection dimension (x axis, 5-30) for C4.5, 1NN, 5NN and SVM; each panel compares the Original data, PCA and RP.]

Results on Spam Dataset

[Figure: classification accuracy (y axis, 60-100) vs. projection dimension (x axis, 5-30) for C4.5, 1NN, 5NN and SVM; each panel compares the Original data, PCA and RP.]

Results on Ads Dataset

[Figure: classification accuracy (y axis, 60-100) vs. projection dimension (x axis, 50-500) for C4.5, 1NN, 5NN and SVM; each panel compares the Original data, PCA and RP.]

Results on Colon Dataset

[Figure: classification accuracy (y axis, 60-100) vs. projection dimension (x axis, 50-500) for C4.5, 1NN, 5NN and SVM; each panel compares the Original data, PCA and RP.]

Results on Leukemia Dataset

[Figure: classification accuracy (y axis, 60-100) vs. projection dimension (x axis, 50-500) for C4.5, 1NN, 5NN and SVM; each panel compares the Original data, PCA and RP.]

Discussion of C4.5 performance

• C4.5 does well with low-dimensional PCA projections (on the Ionosphere, Colon and Leukemia datasets), but its performance deteriorates after that and doesn't improve.
• Performance with RP is poor: after some initial improvement the accuracy curve seems to level out.
Decision trees rely on the informativeness of individual attributes and construct axis-parallel boundaries for their decisions. They don't deal well with transformations of the attributes, and are sensitive to noise. Random projections and decision trees are perhaps not a good combination.


Discussion of NN performance

• Nearest Neighbor methods appear to be the least affected by reduction in dimensionality through PCA or RP.
• PCA projection into a low-dimensional space actually improves NN's accuracy on the Ionosphere and Ads datasets.
• NN results with RP approach those in the original space (or with PCA) quite rapidly.
This behavior of NN methods can be explained by their exclusive reliance on distance computations, which both projections approximately preserve.


Discussion of SVM performance

• SVM does worse in projection spaces (both with PCA and RP) than in the original space.
• Its performance improves noticeably as the dimensionality of the projections increases.
• Performance with PCA is much better initially, but RP catches up as the dimensionality grows.


Discussion of data complexity

We kept track of the number of support vectors used in each projection:
• PCA on the Ads, Colon and Leukemia datasets led to fewer support vectors, while on the Spam and Ionosphere data the number of support vectors was somewhat higher for PCA than in the original space.
• RPs resulted in about the same number of support vectors on the Colon and Leukemia datasets, but much higher numbers on Ads, Spam and Ionosphere.
• For both PCA and RP, as the dimensionality of the projections approached the original dimensionality, the number of support vectors approached that used in the original space.
• The number of support vectors when using PCA was always less than when using RP in lower dimensions.

Conclusions

• RP performance was (predictably) below the level of PCA.
• RP performance improved noticeably with increasing dimensionality.
• RPs seem well suited for use with Nearest Neighbor methods.
• Decision trees did not combine with RP in a satisfactory way.


Directions for Further Study

• Explore performance on significantly larger datasets • Ensembles of classifiers trained on different projections – different projections to the same dimension – projections to different dimensions


We would like to thank Andrei Anghelescu for providing the NN code.

