Practical Significance Testing for Experiments in Natural Language Processing
Benjamin Roth, CIS LMU München
August 12, 2016
Benjamin Roth (CIS LMU München) · Practical Significance Testing for Experiments in Natural Language Processing · August 12, 2016 · 1 / 29
Comparing Different Systems

Assume:
- The data is correctly split into training, dev, and test data.
- Systems A and B produce outputs oA and oB (on the test data).
- The evaluation metric e gives e(oB) > e(oA).

Can we be confident that system B is better than A?
(i.e. will it give better results than A on new inputs?)
Statistical Significance Tests

Question:
- Are the results different because the new technique (system) is actually better?
- Or are the differences just due to chance?

We would like to get: P(observed differences due to chance | test set results)
- If low, we can assume the systems are actually different.
- If high, then either
  - the differences are just random, or
  - the data is not sufficient to tell.

Proxy question: P(differences at least as big as observed | techniques equally good)
- This is the p-value.

"Is that difference significant?" ⇔ p-value ≤ α
Typical values for the significance level α: 0.05, 0.01, 0.001, ...
Statistical Significance Tests: Framework

Systems A, B; outputs oA, oB; evaluation metric with e(oB) > e(oA).

1. Often, the test is "Are the systems different?" (rather than "Is B better than A?")
2. Null hypothesis H0: there is no difference between A and B.
3. Test statistic: quantify the difference between system outputs, t(o1, o2) = |e(o1) − e(o2)|
4. Find a distribution over t(X, Y) assuming the null hypothesis is true ("under the null hypothesis").
5. Check where our observed difference t(oA, oB) lies in that distribution.
   - Does it look standard? (Are more extreme values likely?) ⇒ keep the null hypothesis (no significant difference)
   - Does it look extreme? (Are more extreme values unlikely?) ⇒ reject the null hypothesis (significant difference)
Distribution over Differences in Outcomes under H0

- Find a distribution over t(X, Y) (differences between system outcomes) under the null hypothesis.
- Check where our observed difference t(oA, oB) lies in that distribution.

How to find the distribution under the null hypothesis? Examples:
- Paired t-test: differences are distributed by a Gaussian with mean 0 (vs. mean ≠ 0).
- Sign test: if there is a difference, then it is an improvement with probability only p = 50% (vs. p ≠ 50%).
- Randomized test: randomly recombine the outputs of systems A and B, and get an empirical distribution over t(X, Y).
Rejecting the Null Hypothesis

- Once we have the distribution under H0, check where the observed difference lies.
- Check how likely differences are that are "more extreme" than t (use the area under the probability density function).
- Reject H0 if more extreme differences are unlikely (⇔ the observed difference is extreme).
- The significance level α defines the rejection region.
Testing Different Metrics I

Accuracy
- Is the change in the number of correct samples significant (unlikely to be due to chance)?
- Suggested test: sign test (exact binomial test).

Mean Average Precision (MAP)
- Given a set of lists ranked by two systems, is the average ranking improvement significant?
- E.g. rank documents for a given set of queries.
- Suggested test: paired t-test.
Testing Different Metrics II

F-Score
- Is the increase in performance as measured by F-score significant?
- This one is tricky, as the overall score is non-linear in the single test cases.
- Suggested test: randomized tests.

Testing assumptions about data
- Does the data deviate from some a priori assumption about its distribution?
- E.g. do spam emails contain more named entities than non-spam emails?
- Suggested test: chi-square test.
- Can also be used for feature selection.
Testing of Accuracy

10 test cases (e.g. sentiment correctly predicted).
Systems are correct (1) or not correct (0) on the test cases:

System A: 0 0 0 0 0 0 0 0 1 1
System B: 1 1 0 1 1 1 1 1 1 0

System A: accuracy 20%
System B: accuracy 80%

Sign test (= binomial test):
- Trials (number of cases where A and B differ): 8
- Successes (number of cases where B is better than A): 7

In Python:
>>> scipy.stats.binom_test(7, 8)
0.0703125
⇒ The p-value is larger than α = 0.05; the results are not significant.
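The same exact two-sided p-value can be computed by hand, which makes explicit what the sign test counts. A minimal sketch in plain Python (the `sign_test` helper is illustrative, not code from the slides):

```python
from math import comb

def sign_test(a_correct, b_correct):
    """Two-sided exact binomial (sign) test on paired 0/1 outcomes."""
    # Only cases where the systems disagree count as trials.
    trials = sum(a != b for a, b in zip(a_correct, b_correct))
    successes = sum(b > a for a, b in zip(a_correct, b_correct))
    # Under H0 each disagreement favours B with probability 0.5, so the
    # two-sided p-value sums both tails of Binomial(trials, 0.5).
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

oA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
oB = [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
p = sign_test(oA, oB)  # trials = 8, successes = 7 -> 0.0703125
```

This reproduces the scipy result above without any dependency beyond the standard library.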
Mean Average Precision

- Ranking measure, commonly used in information retrieval evaluation.
- Documents are relevant (R) or not relevant (N) with respect to certain queries.
- Average precision (per query): average of the precision values at each relevant document.

[Figure: three ranked lists over a document collection]
query1: R N N N R R       → AP = (1/1 + 2/5 + 3/6) / 3 ≈ 0.63
query2: N N N R           → AP = 1/4 = 0.25
query3: R N N N R N N R   → AP = (1/1 + 2/5 + 3/8) / 3 ≈ 0.59

Mean Average Precision: (0.63 + 0.25 + 0.59) / 3 ≈ 0.49
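The per-query computation can be sketched in a few lines of Python (the `average_precision` helper is illustrative, and the rankings are reconstructed from the figure's precision values, so treat them as an assumption):

```python
def average_precision(relevance):
    """relevance: ranked list of booleans (True = relevant document)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# Rankings matching the three example queries (R = relevant, N = not)
q1 = [c == "R" for c in "RNNNRR"]       # AP = (1 + 2/5 + 3/6) / 3
q2 = [c == "R" for c in "NNNR"]         # AP = 1/4
q3 = [c == "R" for c in "RNNNRNNR"]     # AP = (1 + 2/5 + 3/8) / 3
aps = [average_precision(q) for q in (q1, q2, q3)]
map_score = sum(aps) / len(aps)
```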
Testing of Mean Average Precision

10 test cases (e.g. queries issued to a search engine).
Per-query system performances (average precision):

System A: 0.1 0.2 0.3 0.1 0.2 0.3 0.1 0.2 0.7 0.8
System B: 0.7 0.8 0.1 0.7 0.8 0.9 0.7 0.8 0.9 0.1

System A: MAP 30%
System B: MAP 65%

Paired t-test in Python:
>>> oA = [.1, .2, .3, .1, .2, .3, .1, .2, .7, .8]
>>> oB = [.7, .8, .1, .7, .8, .9, .7, .8, .9, .1]
>>> scipy.stats.ttest_rel(oA, oB)
Ttest_relResult(statistic=array(-2.4313), pvalue=0.0378)
⇒ The p-value is smaller than α = 0.05; the results are significant!
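The t statistic that scipy reports can be reproduced by hand, which shows what the paired t-test actually measures. A minimal sketch (the helper name is an assumption; turning the statistic into a p-value still needs the t distribution's CDF, e.g. via scipy):

```python
from math import sqrt

def paired_t_statistic(x, y):
    """t statistic of the paired t-test on matched score pairs."""
    d = [b - a for a, b in zip(x, y)]       # per-case differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / sqrt(var / n)             # mean / standard error

oA = [.1, .2, .3, .1, .2, .3, .1, .2, .7, .8]
oB = [.7, .8, .1, .7, .8, .9, .7, .8, .9, .1]
t = paired_t_statistic(oA, oB)  # matches |statistic| = 2.4313 above
```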
Paired T-Test vs. Sign Test

Paired t-test:
- Compares scores for matched pairs.
- Assumption: system differences are approximately Gaussian.
- Null hypothesis: the mean of that Gaussian is zero.

Sign (binomial) test:
- Test cases are viewed purely binarily: did the new method improve or deteriorate on them?
- Null hypothesis: improvements are random, with probability 50%.
- Can be used:
  - in situations where the paired t-test can be used (but the sign test is less sensitive);
  - in many more situations, where the paired t-test cannot be used.

For both tests, make sure to remove duplicate items from your test data.
F1-Score

- Balances precision vs. recall.
- Non-linear with respect to the performance on the single test cases.

Prec = TP / (TP + FP)
Rec  = TP / (TP + FN)
F1   = 2 · Prec · Rec / (Prec + Rec)
Testing of F-Score

10 test cases (e.g. predicted named entities).
Per-case system predictions and gold standard:

System A: NE NE O  O O NE O  O  O  O
System B: NE NE NE O O NE NE NE NE NE
Gold:     NE NE O  O O NE NE NE O  O

System A: Prec = 100%, Rec = 60%, F1 = 75%
System B: Prec = 62.5%, Rec = 80%, F1 = 70.2%

Option 1 (hack, not sensitive): make n folds, then use a paired t-test.
Option 2 (better): randomized testing [Yeh, 2000]
- Repeat R times: randomly swap predictions between oA and oB to get new outputs oX and oY.
- Let r be the number of times that t(oX, oY) ≥ t(oA, oB), where t is the difference in F-score.
- As R → ∞, (r + 1) / (R + 1) approaches the p-value.
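The three steps above fit in a short script. A sketch of the approximate randomization test (the helper names, the positive-class convention, and R are all illustrative choices, not taken from the slides):

```python
import random

def f1(pred, gold, positive="NE"):
    """F1 for a single positive class (here: named entity vs. other)."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def randomization_test(oA, oB, gold, R=5000, seed=42):
    """Approximate randomization test on the difference in F1 [Yeh, 2000]."""
    rng = random.Random(seed)
    observed = abs(f1(oA, gold) - f1(oB, gold))
    r = 0
    for _ in range(R):
        xA, xB = [], []
        for a, b in zip(oA, oB):
            if rng.random() < 0.5:   # swap the systems' predictions
                a, b = b, a          # on this test case
            xA.append(a)
            xB.append(b)
        if abs(f1(xA, gold) - f1(xB, gold)) >= observed:
            r += 1
    return (r + 1) / (R + 1)

oA   = ["NE", "NE", "O", "O", "O", "NE", "O", "O", "O", "O"]
oB   = ["NE", "NE", "NE", "O", "O", "NE", "NE", "NE", "NE", "NE"]
gold = ["NE", "NE", "O", "O", "O", "NE", "NE", "NE", "O", "O"]
p = randomization_test(oA, oB, gold)
```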
See also Sebastian Padó's script: www.nlpado.de/~sebastian/software/sigf.shtml
Testing of Assumptions about Data

So far: is system A better than system B?
How about observations we make about the data? Are they systematic or random (i.e. can we make assumptions about unseen data)?
- Formulate H0 in a way that the assumption does not hold.
- If H0 can be rejected, you can make the assumption.
Testing of Assumptions about Data

Classic example: coin flips
- We observe that in 50 coin flips, there are more Heads than Tails (28 vs. 22).
- Can we assume the coin is biased?
- Choose H0: "The coin is not biased."

Chi-square test: "Are the observations as expected?"
Testing for a biased coin in Python:
>>> scipy.stats.chisquare([28, 22], [25, 25])
Power_divergenceResult(statistic=0.72, pvalue=0.3961)
⇒ We cannot reject H0; the coin seems fair.
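For one degree of freedom (as in the two-outcome coin example) the chi-square p-value can even be computed without scipy, because the survival function reduces to 1 − erf(√(x/2)). A minimal sketch (the helper name is illustrative, and it only covers the 1-df case):

```python
import math

def chisquare_1df(observed, expected):
    """Pearson chi-square test for two categories (1 degree of freedom)."""
    # Pearson statistic: sum of (O - E)^2 / E over the categories.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df, chi-square is the square of a standard normal, so
    # P(X >= stat) = 1 - erf(sqrt(stat / 2)).
    pvalue = 1 - math.erf(math.sqrt(stat / 2))
    return stat, pvalue

stat, p = chisquare_1df([28, 22], [25, 25])  # matches scipy: 0.72, 0.3961
```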
Pitfalls of Significance Testing

- The more experiments you run, the more type I errors your test will make (i.e. suggest significance when there is none).
- Bonferroni correction: divide the required significance level by the number of experiments done.
- More important: avoid selection bias, report honestly.
  - Report all tested results, also those that don't fit neatly into the story.
  - Do not let the test decide whether to report a result or not.
  - When to test: at the end.
- Take-home messages:
  - You can always find significance: don't go on fishing expeditions.
  - Significance is only a heuristic: a rough proxy for interestingness.
Questions?
Happy testing!