Astronomical Data Analysis Software and Systems XIX
ASP Conference Series, Vol. XXX, 2009
Y. Mizumoto, K.-I. Morita, and M. Ohishi, eds.

P74

A Map/Reduce Parallelized Framework for Rapidly Classifying Astrophysical Transients

Dan L. Starr(1,2), Joshua S. Bloom(1), John M. Brewer(1,3), N. R. Butler(1), C. Klein(1)

University of California, Berkeley, California 94720 USA

Abstract. The Berkeley Transients Classification Pipeline (TCP) is a source identification, classification, and broadcast pipeline which federates data streams from multiple surveys. The TCP identifies variable science by making probabilistic statements about the scientific classification of newly discovered sources observed by the Palomar Transient Factory's (PTF) all-sky survey. The primary purpose of PTF is to consistently map the available sky with the intent to discover a variety of galactic and extragalactic transient sources and events. The TCP identifies these newly discovered transient sources and alerts follow-up telescopes such as PAIRITEL (Bloom et al. 2005) and end users to them. Here we discuss software used within the TCP to generate science classifiers when little or no data has been acquired by the survey of interest. This case proves more challenging than generating classifiers for a well-populated survey. We present some of the difficulties encountered and a parallelized Hadoop/MapReduce based technique we use to resolve them.

1. TCP and the Palomar Transient Factory

Initially developed using the ∼750 million row SDSS-II stripe-82 dataset, the TCP has subsequently been interfaced with the PTF's image subtraction pipeline (Rau et al. 2009). This subtraction pipeline, which is hosted at Lawrence Berkeley National Laboratory, subtracts historical reference mosaics from recent images taken by the 7.8 degree FOV instrument on the Palomar 48-inch telescope. Soon after the telescope's commissioning (spring of 2009), the subtraction pipeline came online and the TCP began ingesting the object data stream in real time.

2. Feature Extractors

The goal of the TCP is to make probabilistic statements about transients, making use of their light curves and of where the event occurs on the sky ("context"). To make use of pre-existing machine learning frameworks, we need to marshal the heterogeneous data into a common set of m real-valued "features". Feature extractors are algorithms which summarize individual quanta of information from light curves and context (Starr et al. 2008). Some example feature extractors are: the location of the source in the Galactic plane, the primary period as well as harmonics of a periodic source (see footnote 1), and the statistical modes of the flux values of the source. The success of the TCP's classifiers depends on how well its set of feature algorithms characterizes sources generated from PTF data. If some features do not apply to many PTF sources, or if they are ineffective at distinguishing between different types of science, then the resulting classifiers will be weak.

Author affiliations: (1) Astronomy Department, University of California, Berkeley, CA, USA; (2) Las Cumbres Global Telescope Network (LCOGT), Santa Barbara, CA, USA; (3) Department of Physics & Astronomy, San Francisco State University, San Francisco, CA, USA.

3. Noisification

In order to classify sources from the beginning of a survey, the TCP requires classifiers which can be trained without an existing dataset. This differs from the more common case in machine learning where a classifier is trained using a subset of the data which will eventually be classified. With data from a completed or at least well-sampled survey, one can easily derive supervised datasets by cross-correlating with sources found in other classified surveys. In the case of the TCP, to generate these science classifiers without any PTF data, we included a step which "noisifies" well-sampled, well-classified sources that are taken from the literature and stored in our http://DotAstro.org light-curve warehouse (Brewer et al. 2008). The noisification code resamples both the times and magnitudes of a well-sampled source using precomputed cadences and models for observing depths, sky brightnesses, etc.

Originally the noisification software referenced a list of observing cadences which were generated prior to the telescope's first light. A couple of months after commissioning, however, it became evident that several survey cadences were being used, and our assumption of a single survey cadence resulted in poorly performing classifiers. We attempted to generate classifiers for different PTF cadences, but even these rarely matched a source's multi-cadenced sampling.

To better understand why these classifiers did not apply to different cadences, we have begun analyzing the effect which noisification has on the spread of feature values. Figures 1 and 2 show the spread of a feature's values when well-sampled DotAstro.org sources are noisified. Figure 1 shows the skew of the flux values, a period-invariant feature. In the primary frequency plots in Figure 2, there are fewer DotAstro.org sources for the RR Lyrae and Cepheid science classes than in the skew case. This is due to the failure of the feature extractors to find periods for many of these sources. Also notice that the primary frequency range for RR Lyrae is much smaller than the range expected for RR Lyrae in general. A classifier trained with only these noisified RR Lyrae sources would have too narrow a primary frequency constraint for it to be applicable to many expected sources.

Footnote 1. One of the most important time-series characterizations is whether a source is periodic and, if so, what its primary period is. After evaluating combinations of Lomb Scargle, Stetson string length, and Dworetsky string length in combination with super-smoothing, we have found it difficult to reliably find periods for sources with fewer than 10–15 epochs observed using the PTF survey cadences.
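The noisification step described in this section can be illustrated with a minimal sketch. This is not the TCP code: the function names (`interpolate`, `noisify`), the linear interpolation, the single limiting magnitude, and the noise model (magnitude error growing as 10^(0.4 Δm) toward the limit) are all assumptions chosen for illustration, standing in for the precomputed cadences and the observing depth and sky brightness models mentioned above.

```python
import bisect
import math
import random


def interpolate(times, mags, t):
    """Linearly interpolate a dense, time-sorted light curve at time t."""
    i = bisect.bisect_left(times, t)
    if i == 0:
        return mags[0]
    if i == len(times):
        return mags[-1]
    t0, t1 = times[i - 1], times[i]
    frac = (t - t0) / (t1 - t0)
    return mags[i - 1] + frac * (mags[i] - mags[i - 1])


def noisify(times, mags, cadence, limiting_mag=21.0, sigma_at_limit=0.2, rng=None):
    """Resample a well-sampled light curve onto a survey cadence and add noise.

    Assumed (hypothetical) noise model: Gaussian magnitude errors whose width
    is sigma_at_limit at the limiting magnitude and shrinks for brighter
    sources. Epochs observed fainter than the limit count as non-detections.
    """
    rng = rng or random.Random(0)
    epochs = []
    for t in cadence:
        if t < times[0] or t > times[-1]:
            continue  # the cadence epoch falls outside the template coverage
        true_mag = interpolate(times, mags, t)
        sigma = sigma_at_limit * 10 ** (0.4 * (true_mag - limiting_mag))
        observed = true_mag + rng.gauss(0.0, sigma)
        if observed <= limiting_mag:
            epochs.append((t, observed))
    return epochs


# Toy template: a 0.6 day sinusoid sampled every 0.01 day for 30 days.
dense_t = [0.01 * i for i in range(3001)]
dense_m = [18.0 + 0.5 * math.sin(2 * math.pi * t / 0.6) for t in dense_t]

# Toy cadence: one visit every 3 days, 10 epochs in total.
cadence = [0.5 + 3.0 * k for k in range(10)]
sparse = noisify(dense_t, dense_m, cadence)
print(len(sparse), "noisified epochs")
```

With only ten epochs spread over a month, the 0.6 day period of the toy template is badly undersampled, which is the regime in which the period-based features discussed above become unreliable.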

[Figure 1: three histogram panels (W Ursae Majoris, RR Lyrae, Cepheid); x-axis: Skew; y-axis: number of sources (logarithmic); series: DotAstro.org and Noisified.]

Figure 1. Histogram of skewness spread for several science classes before and after noisification. For skewness, noisification does not change the "observed" distribution significantly from the input (well-sampled) data. The number of sources is plotted on a logarithmic scale.
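As a rough illustration of the quantity histogrammed in Figure 1, the sketch below computes the sample skewness of a set of flux or magnitude values and bins a list of skew values with logarithmic counts. The function names and the binning are assumptions for illustration and are not taken from the TCP feature extractors. Because skewness depends only on the set of values and not on their time ordering, it needs no period estimate, which is one reading of why this feature is described as period invariant.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant light curve: define the skew as zero
    return m3 / m2 ** 1.5


def log_histogram(values, lo, hi, nbins):
    """Bin values in [lo, hi) and return log10 of each count (None if empty)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return [math.log10(c) if c > 0 else None for c in counts]


# Skewness ignores time ordering: shuffling the epochs leaves it unchanged.
mags = [18.0, 18.1, 18.0, 17.9, 18.0, 19.2, 18.1, 18.0]
shuffled = mags[:]
random.Random(1).shuffle(shuffled)
print(skewness(mags), skewness(shuffled))
```

A per-class histogram like the panels of Figure 1 would then come from calling `log_histogram` on the list of skew values of all sources in that class, once for the original and once for the noisified light curves.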

We recently explored a more computationally expensive, but in principle better, noisification technique which matches the time sampling of a particular PTF source. The resulting noisified training set generates a classifier customized to the cadence of a single PTF source. But, as noted, the periods of noisified sources with fewer than 10–15 epochs are difficult to determine robustly. Even with the difficulties of finding periods for sparsely sampled data, we have proceeded with parallelizing this technique of noisifying and generating a customized classifier for a specific PTF source. To make use of available Yahoo/Hadoop cluster resources, we decided to port our existing IPython parallelized noisification and classifier generating code to a Hadoop based architecture.

4. Parallelization: IPython and Hadoop

The real-time pipeline and noisification software of the TCP were originally developed in an embarrassingly parallel way using IPython's parallelization tools. This version spawns tasks across several 8-core machines via the ssh protocol. The IPython-parallel noisification software is also used on a 96-core Beowulf cluster. To make use of CPU time granted to our project on both Yahoo's M45 Hadoop cluster and Amazon EC2 resources, we have wrapped components of the noisification and classification software with Hadoop compatible code.

[Figure 2: three histogram panels (W Ursae Majoris, RR Lyrae, Cepheid); x-axis: Lomb Scargle 1st frequency (cycle/day); y-axis: number of sources (logarithmic); series: DotAstro.org and Noisified.]

Figure 2. Primary frequency spread for several science classes before and after noisification. The number of sources is plotted on a logarithmic scale.

Our Weka classifiers (Witten et al. 1999) are Java based, allowing integration with Hadoop, while our noisification software is primarily written in Python and
wrapped using the Hadoop Streaming package. We successfully reengineered the noisification and classification pipeline using the Cascading Hadoop package, which allows the dataflow to be divided into modular map-reduce components. In addition, we plan to explore the Hadoop based Mahout machine learning project and to incorporate Hadoop Hive's SQL-like functionality for parallelized data exploration.

Acknowledgments. We thank Las Cumbres Observatory Global Telescope Network (LCOGTN) for partial material support of D.L.S. and Bloom's group. This work was partially supported by a Cyber-Enabled Discovery and Innovation (CDI) grant from the National Science Foundation (NSF; award #0941742): "CDI-Type II: Real-time Classification of Massive Time-series Data Streams."

References

Bloom, J. S., Starr, D. L. 2005, in ASP Conf. Ser. 351, ADASS XV, ed. C. Gabriel, C. Arviset, D. Ponz, & E. Solano (San Francisco: ASP), [P.156]
Brewer, J., Bloom, J. S., Kennedy, R., Starr, D. L. 2008, in ASP Conf. Ser. 394, ADASS XVII, ed. Robert W. Argyle, Peter S. Bunclark, & James R. Lewis (San Francisco: ASP), [D.04]
Rau, A., Kulkarni, S., Law, N. M., Bloom, J. S. 2009, PASP
Starr, D. L., Bloom, J. S. 2008, in ASP Conf. Ser. 394, ADASS XVII, ed. Robert W. Argyle, Peter S. Bunclark, & James R. Lewis (San Francisco: ASP), [E.45]
Witten, I. H., Frank, E., Trigg, et al. 1999, The Waikato Environment for Knowledge Analysis
