Practical Significance Testing for Experiments in Natural Language Processing
Benjamin Roth, CIS LMU München
August 12, 2016
Benjamin Roth (CIS LMU München) · Practical Significance Testing for Experiments in Natural Language Processing · August 12, 2016 · 1 / 29
Comparing Different Systems

Assume:
- The data is correctly split into training, dev, and test data.
- Systems A and B produce outputs oA and oB (on the test data).
- The evaluation metric e gives e(oB) > e(oA).

Can we be confident that system B is better than A?
(i.e. will it give better results than A on new inputs?)
Statistical Significance Tests

Question:
- Are the results different because the new technique (system) is actually better?
- Or are the differences just due to chance?

We would like to get: P(observed differences due to chance | test set results)
- If low, we can assume the systems are actually different.
- If high, then either
  - the differences are just random, or
  - the data is not sufficient to tell.

Proxy question: P(differences at least as big as observed | techniques equally good)
- This is the p-value.

"Is that difference significant?" ⇔ p-value ≤ α
Typical values for the significance level α: 0.05, 0.01, 0.001, ...
Statistical Significance Tests: Framework

Systems A, B; outputs oA, oB; evaluation metric with e(oB) > e(oA).

1. Often, the test is "Are the systems different?" (rather than "Is B better than A?")
2. Null hypothesis H0: there is no difference between A and B.
3. Test statistic: quantify the difference between system outputs, t(o1, o2) = |e(o1) − e(o2)|
4. Find a distribution over t(X, Y) assuming the null hypothesis is true ("under the null hypothesis").
5. Check where our observed difference t(oA, oB) lies in that distribution.
   - Does it look standard? (Are more extreme values likely?) ⇒ keep the null hypothesis (no significant difference)
   - Does it look extreme? (Are more extreme values unlikely?) ⇒ reject the null hypothesis (significant difference)
Distribution over Differences in Outcomes under H0

- Find a distribution over t(X, Y) (differences between system outcomes) under the null hypothesis.
- Check where our observed difference t(oA, oB) lies in that distribution.

How to find the distribution under the null hypothesis? Examples:
- Paired t-test: differences are distributed by a Gaussian with mean 0 (vs. mean ≠ 0).
- Sign test: if there is a difference, then it is an improvement with probability only p = 50% (vs. p ≠ 50%).
- Randomized test: randomly recombine the outputs of systems A and B, and get an empirical distribution over t(X, Y).
Rejecting the Null Hypothesis

- Once we have the distribution under H0, check where the observed difference lies.
- Check how likely differences are that are "more extreme" than t (use the area under the probability density function).
- Reject H0 if more extreme differences are unlikely (⇔ the observed difference is extreme).
- The significance level α defines the rejection region.
Testing Different Metrics I

Accuracy
- Is the change in the number of correct samples significant (unlikely to be due to chance)?
- Suggested test: sign test (exact binomial test).

Mean Average Precision (MAP)
- Given a set of lists ranked by two systems, is the average ranking improvement significant?
- E.g. rank documents for a given set of queries.
- Suggested test: paired t-test.
Testing Different Metrics II

F-Score
- Is the increase in performance as measured by F-score significant?
- This one is tricky, as the overall score is non-linear in the single test cases.
- Suggested test: randomized tests.

Testing assumptions about data
- Does the data deviate from some a priori assumption about its distribution?
- E.g. do spam emails contain more named entities than non-spam emails?
- Suggested test: chi-square test.
- Can also be used for feature selection.
Testing of Accuracy

10 test cases (e.g. sentiment correctly predicted).
Systems are correct (1) or not correct (0) on the test cases:

System A: 0 0 0 0 0 0 0 0 1 1
System B: 1 1 0 1 1 1 1 1 1 0

System A: accuracy 20%
System B: accuracy 80%

Sign test (= binomial test):
- Trials (number of cases where A and B differ): 8
- Successes (number of cases where B is better than A): 7

In Python:
>>> scipy.stats.binom_test(7, 8)
0.0703125
⇒ The p-value is larger than α = 0.05; the results are not significant.
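The same exact two-sided p-value can be computed by hand, which makes explicit what the sign test counts. A minimal sketch in plain Python (the `sign_test` helper is illustrative, not code from the slides):

```python
from math import comb

def sign_test(a_correct, b_correct):
    """Two-sided exact binomial (sign) test on paired 0/1 outcomes."""
    # Only cases where the systems disagree count as trials.
    trials = sum(a != b for a, b in zip(a_correct, b_correct))
    successes = sum(b > a for a, b in zip(a_correct, b_correct))
    # Under H0 each disagreement favours B with probability 0.5, so the
    # two-sided p-value sums both tails of Binomial(trials, 0.5).
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

oA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
oB = [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
p = sign_test(oA, oB)  # trials = 8, successes = 7 -> 0.0703125
```

This reproduces the scipy result above without any dependency beyond the standard library.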
Mean Average Precision

- Ranking measure, commonly used in information retrieval evaluation.
- Documents are relevant (R) or not relevant (N) with respect to certain queries.
- Average precision (per query): average of the precision values at each relevant document.

[Figure: three ranked lists over a document collection]
query1: R N N N R R       → AP = (1/1 + 2/5 + 3/6) / 3 ≈ 0.63
query2: N N N R           → AP = 1/4 = 0.25
query3: R N N N R N N R   → AP = (1/1 + 2/5 + 3/8) / 3 ≈ 0.59

Mean Average Precision: (0.63 + 0.25 + 0.59) / 3 ≈ 0.49
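The per-query computation can be sketched in a few lines of Python (the `average_precision` helper is illustrative, and the rankings are reconstructed from the figure's precision values, so treat them as an assumption):

```python
def average_precision(relevance):
    """relevance: ranked list of booleans (True = relevant document)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# Rankings matching the three example queries (R = relevant, N = not)
q1 = [c == "R" for c in "RNNNRR"]       # AP = (1 + 2/5 + 3/6) / 3
q2 = [c == "R" for c in "NNNR"]         # AP = 1/4
q3 = [c == "R" for c in "RNNNRNNR"]     # AP = (1 + 2/5 + 3/8) / 3
aps = [average_precision(q) for q in (q1, q2, q3)]
map_score = sum(aps) / len(aps)
```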
Testing of Mean Average Precision

10 test cases (e.g. queries issued to a search engine).
Per-query system performances (average precision):

System A: 0.1 0.2 0.3 0.1 0.2 0.3 0.1 0.2 0.7 0.8
System B: 0.7 0.8 0.1 0.7 0.8 0.9 0.7 0.8 0.9 0.1

System A: MAP 30%
System B: MAP 65%

Paired t-test in Python:
>>> oA = [.1, .2, .3, .1, .2, .3, .1, .2, .7, .8]
>>> oB = [.7, .8, .1, .7, .8, .9, .7, .8, .9, .1]
>>> scipy.stats.ttest_rel(oA, oB)
Ttest_relResult(statistic=array(-2.4313), pvalue=0.0378)
⇒ The p-value is smaller than α = 0.05; the results are significant!
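The t statistic that scipy reports can be reproduced by hand, which shows what the paired t-test actually measures. A minimal sketch (the helper name is an assumption; turning the statistic into a p-value still needs the t distribution's CDF, e.g. via scipy):

```python
from math import sqrt

def paired_t_statistic(x, y):
    """t statistic of the paired t-test on matched score pairs."""
    d = [b - a for a, b in zip(x, y)]       # per-case differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / sqrt(var / n)             # mean / standard error

oA = [.1, .2, .3, .1, .2, .3, .1, .2, .7, .8]
oB = [.7, .8, .1, .7, .8, .9, .7, .8, .9, .1]
t = paired_t_statistic(oA, oB)  # matches |statistic| = 2.4313 above
```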
Paired T-Test vs. Sign Test

Paired t-test:
- Compares scores for matched pairs.
- Assumption: system differences are approximately Gaussian.
- Null hypothesis: the mean of that Gaussian is zero.

Sign (binomial) test:
- Test cases are viewed purely binarily: did the new method improve or deteriorate on them?
- Null hypothesis: improvements are random, with probability 50%.
- Can be used:
  - in situations where the paired t-test can be used (but the sign test is less sensitive);
  - in many more situations, where the paired t-test cannot be used.

For both tests, make sure to remove duplicate items from your test data.
F1-Score

- Balances precision vs. recall.
- Non-linear with respect to the performance on the single test cases.

Prec = TP / (TP + FP)
Rec  = TP / (TP + FN)
F1   = 2 · Prec · Rec / (Prec + Rec)
Testing of F-Score

10 test cases (e.g. predicted named entities).
Per-case system predictions and gold standard:

System A: NE NE O  O O NE O  O  O  O
System B: NE NE NE O O NE NE NE NE NE
Gold:     NE NE O  O O NE NE NE O  O

System A: Prec = 100%, Rec = 60%, F1 = 75%
System B: Prec = 62.5%, Rec = 80%, F1 = 70.2%

Option 1 (hack, not sensitive): make n folds, then use a paired t-test.
Option 2 (better): randomized testing [Yeh, 2000]
- Repeat R times: randomly swap predictions between oA and oB to get new outputs oX and oY.
- Let r be the number of times that t(oX, oY) ≥ t(oA, oB), where t is the difference in F-score.
- As R → ∞, (r + 1) / (R + 1) approaches the p-value.
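The three steps above fit in a short script. A sketch of the approximate randomization test (the helper names, the positive-class convention, and R are all illustrative choices, not taken from the slides):

```python
import random

def f1(pred, gold, positive="NE"):
    """F1 for a single positive class (here: named entity vs. other)."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def randomization_test(oA, oB, gold, R=5000, seed=42):
    """Approximate randomization test on the difference in F1 [Yeh, 2000]."""
    rng = random.Random(seed)
    observed = abs(f1(oA, gold) - f1(oB, gold))
    r = 0
    for _ in range(R):
        xA, xB = [], []
        for a, b in zip(oA, oB):
            if rng.random() < 0.5:   # swap the systems' predictions
                a, b = b, a          # on this test case
            xA.append(a)
            xB.append(b)
        if abs(f1(xA, gold) - f1(xB, gold)) >= observed:
            r += 1
    return (r + 1) / (R + 1)

oA   = ["NE", "NE", "O", "O", "O", "NE", "O", "O", "O", "O"]
oB   = ["NE", "NE", "NE", "O", "O", "NE", "NE", "NE", "NE", "NE"]
gold = ["NE", "NE", "O", "O", "O", "NE", "NE", "NE", "O", "O"]
p = randomization_test(oA, oB, gold)
```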
See also Sebastian Padó's script: www.nlpado.de/~sebastian/software/sigf.shtml
Testing of Assumptions about Data

So far: is system A better than system B?
How about observations we make about the data? Are they systematic or random (i.e. can we make assumptions about unseen data)?
- Formulate H0 in a way that the assumption does not hold.
- If H0 can be rejected, you can make the assumption.
Testing of Assumptions about Data

Classic example: coin flips
- We observe that in 50 coin flips, there are more Heads than Tails (28 vs. 22).
- Can we assume the coin is biased?
- Choose H0: "The coin is not biased."

Chi-square test: "Are the observations as expected?"
Testing for a biased coin in Python:
>>> scipy.stats.chisquare([28, 22], [25, 25])
Power_divergenceResult(statistic=0.72, pvalue=0.3961)
⇒ We cannot reject H0; the coin seems fair.
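For one degree of freedom (as in the two-outcome coin example) the chi-square p-value can even be computed without scipy, because the survival function reduces to 1 − erf(√(x/2)). A minimal sketch (the helper name is illustrative, and it only covers the 1-df case):

```python
import math

def chisquare_1df(observed, expected):
    """Pearson chi-square test for two categories (1 degree of freedom)."""
    # Pearson statistic: sum of (O - E)^2 / E over the categories.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df, chi-square is the square of a standard normal, so
    # P(X >= stat) = 1 - erf(sqrt(stat / 2)).
    pvalue = 1 - math.erf(math.sqrt(stat / 2))
    return stat, pvalue

stat, p = chisquare_1df([28, 22], [25, 25])  # matches scipy: 0.72, 0.3961
```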
Pitfalls of Significance Testing

- The more experiments you run, the more type I errors your test will make (i.e. suggest significance when there is none).
- Bonferroni correction: divide the required significance level by the number of experiments done.
- More important: avoid selection bias, report honestly.
  - Report all tested results, also those that don't fit neatly into the story.
  - Do not let the test decide whether to report a result or not.
  - When to test: at the end.
- Take-home messages:
  - You can always find significance: don't go on fishing expeditions.
  - Significance is only a heuristic: a rough proxy for interestingness.
Questions?
Happy testing!