Test Selection Safety Evaluation Framework
Claire Leong

Confidential + Proprietary


Goal: build a generic framework for evaluating test scheduling algorithms at scale from the historical record.


Project Overview
● Implementation:
  1. Determine safety information for historical changelists
  2. Evaluate the safety of test selection algorithms
  3. Implement optimistic, pessimistic and random test selection algorithms


Project Overview
● Used two datasets:

                          Small Dataset            Large Dataset
  CL data                 2 days (6-8 Dec 2017)    1 month (October 2017)
  Changelists             11k                      900k
  Total targets           1k                       4m
  Times targets affected  430k                     16b


Determining safety
● Safety = would skipping this test target miss a transition?
● Transition = a change in target results, either from failing->passing or passing->failing


Safe Targets
Skipping this target would not miss a transition.

  Changelist (time →)   CL1    CL2
  Target Result         P      P
  Safety                -      Safe
  Transition            -      P->P

  (* = affected)

Safe Targets
Skipping this target would not miss a transition.

  Changelist (time →)   CL1    CL2
  Target Result         F      F
  Safety                -      Safe
  Transition            -      F->F

  (* = affected)

Safe Targets
Skipping this target would not miss a transition.

  Changelist (time →)   CL1    CL2    CL3
  Target Result         P      *      P
  Safety                -      Safe   Safe
  Transition            -      P->P   P->P

  (* = affected)

Safe Targets
Skipping this target would not miss a transition.

  Changelist (time →)   CL1    CL2    CL3
  Target Result         F      *      F
  Safety                -      Safe   Safe
  Transition            -      F->F   F->F

  (* = affected)

Unsafe Targets
Skipping this target would definitely miss a transition.

  Changelist (time →)   CL1    CL2
  Target Result         P      F
  Safety                -      Unsafe
  Transition            -      P->F

  (* = affected)

Unsafe Targets
Skipping this target would definitely miss a transition.

  Changelist (time →)   CL1    CL2
  Target Result         F      P
  Safety                -      Unsafe
  Transition            -      F->P

  (* = affected)

Maybe Unsafe Targets
Skipping this target might miss a transition.

  Changelist (time →)   CL1    CL2            CL3
  Target Result         P      *              F
  Safety                -      Maybe unsafe   Maybe unsafe
  Transition            -      P->F           P->F

  (* = affected)

Maybe Unsafe Targets
Skipping this target might miss a transition.

  Changelist (time →)   CL1    CL2            CL3
  Target Result         F      *              P
  Safety                -      Maybe unsafe   Maybe unsafe
  Transition            -      F->P           F->P

  (* = affected)

Can we skip targets safely?
● This information is used to determine whether skipping a target is safe
● All non-definitive pass or fail results are treated as affected
● We calculate the safety of skipping tests at rates from 0-100% for an algorithm


Input data
● Input taken from the Spanner backup's Result and Affected tables
● Used 3 methods to eliminate flakes from the data:
  ○ Take only pass and fail results
  ○ Remove target results identified as flaky by Kellogs
  ○ Remove targets with over X transitions in the time period


Removing high transition count targets
● Remove targets with > 30 transitions (~3k targets)
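The transition-count filter can be sketched as follows; the class name, result encoding, and threshold parameter are illustrative assumptions (the deck's actual threshold was 30).

```java
// Sketch of the flake filter: drop targets whose result history has more
// than maxTransitions pass/fail transitions in the time period.
public class TransitionFilter {
    // Counts transitions (P->F or F->P) in a chronological result sequence,
    // where each character is 'P' (pass) or 'F' (fail).
    static int countTransitions(String results) {
        int transitions = 0;
        for (int i = 1; i < results.length(); i++) {
            if (results.charAt(i) != results.charAt(i - 1)) {
                transitions++;
            }
        }
        return transitions;
    }

    // A target exceeding the threshold is treated as flaky and removed.
    static boolean isLikelyFlaky(String results, int maxTransitions) {
        return countTransitions(results) > maxTransitions;
    }
}
```

With maxTransitions = 30 over the month-long window, roughly 3k targets would be removed per the slide above.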

Targets per CL Distribution
● Median: 38 tests!
● 90th percentile: 2,604
● 95th percentile: 4,702
● 99th percentile: 55,730


Implementation: Safety Data Builder
This package creates safety data given the historical changelist data as input.


Pipeline
  1. Read Target Results to PTable / Read Affected Targets to PTable
  2. Generate Input Data & Filter Flakes
  3. Join Tables
  4. Build Target Table Stage
  5. Build Target Safeties Stage
  6. Build Safety Records Stage
  7. Write out Safety Records


Pipeline
Output of the Build Target Table stage: a PTable keyed by target name, mapping each target to its (changelist, result) pairs, e.g.

  "//target_name" → <(CL10, PASS), (CL2, FAIL), (CL42, NONE)>


Pipeline
Output of the Build Target Safeties stage: a PTable keyed by changelist, e.g.

  CL42 → ("target_name", SAFE, PP)


Pipeline
Output of the Build Safety Records stage: a PCollection of safety records, e.g.

  CL42, safe_targets:<("target_name", SAFE, PP)>, unsafe_targets:<>, maybe_unsafe_targets:<>


Safety Data Results

                                  Small Data Set      Large Data Set
  Total CLs                       10,170              891,621
  CLs with only safe targets      96.4% (9,801)       90.2% (804,160)
  CLs with maybe unsafe           3.4% (346)          8.3% (73,897)
  CLs with unsafe                 0.2% (25)           1.5% (13,564)
  Total target affecteds          428,938             15,931,019,923
  Safe target affecteds           99.9% (428,547)     99.98% (15,927,853,638)
  Maybe unsafe target affecteds   0.09% (365)         0.019% (3,054,667)
  Unsafe target affecteds         0.01% (26)          0.0001% (111,618)


Culprit finding works!

[Diagram: example result sequences (P, P, P, F, *, P). Affected results followed by a failure are maybe unsafe on their own; with culprit finding they are resolved and become safe or unsafe. We don't do fix finding for the failing-to-passing direction.]


Implementation: Algorithm Evaluator
This package evaluates the safety of using an algorithm to select tests to skip for a changelist.


Evaluator Implementation
● For every changelist in the safety data, the evaluator calls an algorithm with skip rates from 0 to 100%
● Using the targets returned by the algorithm, it determines whether that selection was safe:
  ○ Safe = no unsafe or maybe unsafe tests were skipped
  ○ Maybe unsafe = maybe unsafe tests were skipped but no unsafe tests
  ○ Unsafe = unsafe tests were skipped
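The classification above can be sketched as follows; the class and enum names are illustrative, not the project's actual code, and the per-target labels are assumed to come from the safety data.

```java
import java.util.Set;

// Sketch: classify one test selection from the safety labels of the
// targets it skipped. Unsafe dominates maybe unsafe, which dominates safe.
public class SelectionSafety {
    enum Safety { SAFE, MAYBE_UNSAFE, UNSAFE }

    // Unsafe if any skipped target is unsafe; maybe unsafe if any skipped
    // target is maybe unsafe (and none unsafe); otherwise safe.
    static Safety classify(Set<Safety> skippedLabels) {
        if (skippedLabels.contains(Safety.UNSAFE)) {
            return Safety.UNSAFE;
        }
        if (skippedLabels.contains(Safety.MAYBE_UNSAFE)) {
            return Safety.MAYBE_UNSAFE;
        }
        return Safety.SAFE;
    }
}
```

Skipping nothing yields an empty label set and therefore a safe selection, matching the skip-rate-0 rows in the algorithm walkthroughs.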


Algorithms
● Algorithms are implementations of the interface TestSelectionAlgorithm, which contains the method:

  ImmutableSet<String> skipTargets(long cl, Iterable<String> targets, int numToSkip)
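A minimal sketch of the interface and the random baseline. The deck's signature uses Guava's ImmutableSet; plain JDK sets are used here so the example is self-contained, and the RandomSelection class name is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Sketch of the algorithm interface (Guava's ImmutableSet in the deck;
// a JDK Set here to keep the example dependency-free).
interface TestSelectionAlgorithm {
    Set<String> skipTargets(long cl, Iterable<String> targets, int numToSkip);
}

// Random baseline: skip a uniformly random subset of the affected targets.
class RandomSelection implements TestSelectionAlgorithm {
    private final Random random;

    RandomSelection(Random random) {
        this.random = random;
    }

    @Override
    public Set<String> skipTargets(long cl, Iterable<String> targets, int numToSkip) {
        List<String> shuffled = new ArrayList<>();
        targets.forEach(shuffled::add);
        Collections.shuffle(shuffled, random);
        return new LinkedHashSet<>(shuffled.subList(0, Math.min(numToSkip, shuffled.size())));
    }
}
```

The evaluator would call this once per skip rate, with numToSkip = rate × number of affected targets.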


Algorithms - Random

Changelist's affected targets: 1 safe target, 1 maybe unsafe target, 1 unsafe target. The random algorithm skips a uniformly random subset, so the safety depends on which targets it happens to pick; these are example outcomes.

  Num to skip   Safety of the selection
  0             safe
  1             unsafe
  2             maybe unsafe
  3             unsafe

Algorithms - Optimistic

Changelist's affected targets: 1 safe target, 1 maybe unsafe target, 1 unsafe target. The optimistic algorithm skips the safest targets first.

  Num to skip   Safety of the selection
  0             safe
  1             safe
  2             maybe unsafe
  3             unsafe

Algorithms - Pessimistic

Changelist's affected targets: 1 safe target, 1 maybe unsafe target, 1 unsafe target. The pessimistic algorithm skips the least safe targets first.

  Num to skip   Safety of the selection
  0             safe
  1             unsafe
  2             unsafe
  3             unsafe
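The optimistic and pessimistic bounds can be sketched by sorting the same targets by safety before skipping; the class, enum, and method names here are illustrative assumptions, not the project's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: optimistic skips safe targets first (SAFE < MAYBE_UNSAFE < UNSAFE),
// pessimistic skips unsafe targets first. The selection's safety is the
// worst label among the skipped targets.
public class BoundAlgorithms {
    enum Safety { SAFE, MAYBE_UNSAFE, UNSAFE }

    static Safety selectionSafety(Map<String, Safety> labels, int numToSkip, boolean optimistic) {
        List<Map.Entry<String, Safety>> order = new ArrayList<>(labels.entrySet());
        Comparator<Map.Entry<String, Safety>> bySafety = Comparator.comparing(Map.Entry::getValue);
        order.sort(optimistic ? bySafety : bySafety.reversed());
        Safety worst = Safety.SAFE;
        for (int i = 0; i < Math.min(numToSkip, order.size()); i++) {
            Safety s = order.get(i).getValue();
            if (s.compareTo(worst) > 0) {
                worst = s;
            }
        }
        return worst;
    }
}
```

With one target of each label, this reproduces the per-skip-count rows in the optimistic and pessimistic tables above.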

Pipeline performance
● Safety data builder ran in 35 mins
● Algorithm evaluator:
  ○ Optimistic ran in 2h 40m
  ○ Pessimistic ran in 3h 5m
  ○ Random ran in 4h 40m

Small dataset results
[Chart: floor of changelists with only safe affected targets]

Small dataset results
[Chart: ceiling of changelists with maybe unsafe but no unsafe affected targets]

Small dataset results
[Chart: ceiling of changelists with unsafe affected targets]

Large dataset results
[Chart: floor of changelists with only safe affected targets]

Why is random a curve?
● Previously we had predicted a straight line for random
● The small dataset does show a straight line

Probability Distribution
[Charts: probability distribution; example with N = 1000, n = 995]
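One hedged way to read the curve: if a changelist has N affected targets of which k are not safe, the chance that a uniformly random skip of s targets misses all k of them is hypergeometric, C(N-k, s) / C(N, s). This is my interpretation of the N/n slide, not the deck's stated model.

```java
// Sketch: probability that skipping s of N targets uniformly at random
// avoids all k non-safe targets: C(N-k, s) / C(N, s), computed as an
// incremental product to avoid large factorials.
public class RandomSafety {
    static double probAllSkippedSafe(int n, int k, int s) {
        double p = 1.0;
        for (int i = 0; i < s; i++) {
            p *= (double) (n - k - i) / (n - i);
            if (p == 0.0) {
                break; // more skips than safe targets: a miss is certain
            }
        }
        return p;
    }
}
```

For k = 1 this is linear in s (the small-dataset straight line); aggregating over many CLs with varying N and k bends the large-dataset curve.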

Large dataset results
[Chart: ceiling of changelists with maybe unsafe but no unsafe affected targets]

Large dataset results
[Chart: ceiling of changelists with unsafe affected targets]

Conclusions
● The project was completed!
● We now have an offline method to evaluate test scheduling algorithms and a baseline for future comparison

Continuing the project
● Better flake exclusion
  ○ Filter using the ratio of transitions to results
  ○ Find the point where Kellogs doesn't identify the target as flaky
● Rerunning Elbaum experiments
  ○ An algorithm which prioritizes targets based on the number of transitions in some previous window of time
● Evaluating the Efficacy machine learning model

Questions?


Creating safeties for all targets

for (result in sorted target results) {
  if (result == affected) {
    add result to pending results
    continue;
  }
  if (previous result == result) {
    mark this result and all pending results as safe
  } else if (no pending results) {
    mark this result as unsafe
  } else {
    mark this result and all pending results as maybe unsafe
  }
  previous result = result;
}
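A runnable sketch of the safety-building pass; the result encoding ('P', 'F', '*') and all names are assumptions for illustration. Affected results are pended until the next definitive result decides their label.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the safety builder: walk a target's chronological results
// ('P' = pass, 'F' = fail, '*' = affected) and label each one. Pending
// affected results take the label of the definitive result that resolves
// them; the first definitive result has no predecessor and gets "-".
public class SafetyBuilder {
    static List<String> labelResults(String results) {
        List<String> labels = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        Character previous = null;
        for (int i = 0; i < results.length(); i++) {
            char r = results.charAt(i);
            labels.add("-"); // default; overwritten when resolved
            if (r == '*') {
                pending.add(i);
                continue;
            }
            if (previous != null) {
                if (previous == r) {
                    labels.set(i, "safe");
                    for (int p : pending) labels.set(p, "safe");
                } else if (pending.isEmpty()) {
                    labels.set(i, "unsafe");
                } else {
                    labels.set(i, "maybe unsafe");
                    for (int p : pending) labels.set(p, "maybe unsafe");
                }
            }
            pending.clear();
            previous = r;
        }
        return labels;
    }
}
```

On the slide examples this reproduces the tables: P, *, P labels the last two results safe, while P, *, F labels them both maybe unsafe.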

Evaluating algorithms

for (changelist) {
  retrieve affected targets for changelist
  for (skip rate = 0..100%) {
    skipped targets = algorithm.skipTargets(affected targets, skip rate * num affected targets);
    if (skipped targets contains unsafe targets) {
      mark this test selection as unsafe
    } else if (skipped targets contains maybe unsafe targets) {
      mark this test selection as maybe unsafe
    } else {
      mark this test selection as safe
    }
  }
}


safety_record.proto

// Represents the safety information for all affected targets at a CL.
message SafetyRecord {
  optional int64 changelist = 1;
  repeated TargetSafety safe_targets = 2;
  repeated TargetSafety unsafe_targets = 3;
  repeated TargetSafety maybe_unsafe_targets = 4;
}

// Represents the safety of skipping a test target.
message TargetSafety {
  optional string target_name = 1;
  optional Safety safety = 2;
  optional Transition transition = 3;
}


safety_result.proto

// Represents the safety information for using a test selection algorithm on a
// CL's test targets with a given target skip rate.
message SafetyResult {
  optional int64 changelist = 1;
  optional string algorithm_name = 2;
  optional int32 skip_rate = 3;
  optional tko.testselectionevaluation.TargetSafety.Safety safety = 4;
  repeated string unsafe_skipped = 5;
  repeated string maybe_unsafe_skipped = 6;
  optional int32 total_skipped = 7;
}


Optimistic Algorithm Implementation

// Constructor takes a SafetyRecord as input
private final SafetyRecord record;

skipTargets(changelist, targetList, numToSkip) {
  remainingToSkip = numToSkip;
  skip targets from record.safe_targets with limit remainingToSkip
  remainingToSkip -= safe skipped targets
  skip targets from record.maybe_unsafe_targets with limit remainingToSkip
  remainingToSkip -= maybe unsafe skipped targets
  skip targets from record.unsafe_targets with limit remainingToSkip
  remainingToSkip -= unsafe skipped targets
  return skipped targets
}


Test Selection Safety Evaluation Framework - Research at Google
