1. Running chi-square on paired / matched binary data
Scenario: A student measures 40 patients' opinions (agree / disagree) before and after an intervention. They analyse it with a 2×2 chi-square test of independence.
Why it looks right: The data are categorical, arranged in a 2×2 table, and chi-square is "the" test for 2×2 tables.
What's actually wrong: The "before" and "after" classifications come from the same subjects measured twice, so the observations in the table are not independent of one another — but the chi-square test of independence requires independent observations.
Correct approach: Use McNemar's test for 2×2 matched-pair binary data. McNemar tests whether the two marginal proportions differ, which is the right question for before/after on the same subjects.
McNemar (1947); Agresti (2018), Statistical Methods for the Social Sciences, Ch. 10.
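A minimal sketch of the correct analysis in Python (scipy assumed; the paired counts below are hypothetical, invented for illustration). An exact McNemar test is just a binomial test on the discordant pairs:

```python
from scipy.stats import binomtest

# Hypothetical paired counts for 40 patients (rows: before, cols: after):
#                  after-agree   after-disagree
# before-agree          12              5
# before-disagree       14              9
b, c = 5, 14  # discordant pairs: agree->disagree and disagree->agree

# Under H0 each discordant pair flips either way with probability 0.5,
# so exact McNemar = two-sided binomial test on b out of (b + c)
p_mcnemar = binomtest(b, b + c, 0.5).pvalue
print(f"exact McNemar p = {p_mcnemar:.3f}")
```

Note that only the discordant cells enter the test: subjects who gave the same answer twice carry no information about a marginal shift.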
2. Independent-samples t-test on paired data
Scenario: A before/after design with 20 subjects: 20 pre-treatment BP values and 20 post-treatment BP values. Student runs an independent (two-sample) t-test.
Why it looks right: There are two groups of 20 numbers each — that sure looks like two samples.
What's actually wrong: The two "samples" are not independent — each post value shares a subject with exactly one pre value. The paired structure contains information (within-subject correlation) that an independent-samples test ignores. Power is usually lost, and occasionally Type I error is inflated.
Correct approach: Use a paired t-test on the within-subject differences. In this app, open the t Calculator, switch to "Paired" mode, and paste the two columns.
Rosner (2015), Fundamentals of Biostatistics, Ch. 8.
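A quick sketch of the contrast in Python (scipy assumed; the BP values are simulated, not real data). The within-subject correlation makes the paired test far more powerful here:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(42)
baseline = rng.normal(140, 15, 20)           # hypothetical pre-treatment BP
post = baseline - 5 + rng.normal(0, 4, 20)   # same subjects, ~5 mmHg drop

p_wrong = ttest_ind(baseline, post).pvalue   # ignores the pairing
p_right = ttest_rel(baseline, post).pvalue   # uses within-subject differences
print(f"independent: p = {p_wrong:.3f}   paired: p = {p_right:.5f}")
```

The between-subject spread (SD 15) swamps the treatment effect in the independent test; the paired test works on the differences, whose SD is only about 4.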
3. Pooled-variance t-test when variances clearly differ
Scenario: Student compares treatment and control means; treatment SD is 4, control SD is 12. Runs a classic (pooled) two-sample t-test.
Why it looks right: "Student's t-test" is what the textbook calls the two-sample comparison.
What's actually wrong: The pooled variant assumes equal population variances. When variances differ substantially (and especially with unequal sample sizes) the test's Type I error rate can be far from the nominal α. This is the Behrens-Fisher problem.
Correct approach: Use Welch's t-test (the default in R's t.test). It uses the Welch-Satterthwaite df and is accurate under unequal variances. In this app, open t Calculator and choose "Two-sample (Welch's)" mode.
Welch (1947); Ruxton (2006), "The unequal variance t-test is an underused alternative to Student's t-test..." Behav. Ecol. 17:688-690.
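A small null simulation in Python (scipy assumed; the group sizes and SDs are illustrative) showing the Type I inflation when the smaller group has the larger variance:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sim, alpha = 2000, 0.05
reject_pooled = reject_welch = 0
for _ in range(n_sim):
    small_noisy = rng.normal(0, 12, 10)  # small group, SD 12
    big_quiet = rng.normal(0, 4, 40)     # large group, SD 4 -- same true mean
    reject_pooled += ttest_ind(small_noisy, big_quiet, equal_var=True).pvalue < alpha
    reject_welch += ttest_ind(small_noisy, big_quiet, equal_var=False).pvalue < alpha

print(f"Type I error, pooled t: {reject_pooled / n_sim:.3f}")  # far above 0.05
print(f"Type I error, Welch t:  {reject_welch / n_sim:.3f}")   # near 0.05
```

When the larger group instead has the larger variance, the pooled test errs in the conservative direction; either way Welch's version stays close to the nominal α.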
4. Chi-square on a sparse 2×2 table
Scenario: Pilot trial with [[8, 2], [1, 5]]. Student reports the chi-square p-value of 0.013.
Why it looks right: The test ran, produced a number, and that number is below 0.05.
What's actually wrong: Minimum expected cell count is below 5 (it's 2.63 here). The chi-square approximation to the discrete hypergeometric distribution fails in this regime — uncorrected χ² gives p = .013, Fisher's exact gives p = .035, and Yates-corrected χ² gives p = .051. The three disagree across the 0.05 line.
Correct approach: For any 2×2 with an expected cell < 5 (Cochran, 1954), use Fisher's exact test. In this app, the Compare page auto-flags the Cochran violation and shows Fisher's alongside the other tests.
Cochran (1954), Biometrics 10:417-451; Campbell (2007), Statistics in Medicine.
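The three-way disagreement can be reproduced directly in Python (scipy assumed), using the table from the scenario:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[8, 2], [1, 5]])   # the pilot-trial counts from the text

chi2, p_plain, dof, expected = chi2_contingency(table, correction=False)
_, p_yates, _, _ = chi2_contingency(table, correction=True)
_, p_fisher = fisher_exact(table)

print(f"min expected count = {expected.min():.2f}")  # 2.63, below Cochran's 5
print(f"uncorrected chi2 p = {p_plain:.3f}")         # 0.013
print(f"Yates-corrected p  = {p_yates:.3f}")         # 0.051
print(f"Fisher exact p     = {p_fisher:.3f}")        # 0.035
```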
5. Chi-square to compare group means
Scenario: Comparing average exam scores across three teaching methods. Student runs a chi-square test.
Why it looks right: Three groups, a comparison, a table — sounds like chi-square.
What's actually wrong: Chi-square is for categorical counts (how many subjects fall into each cell). Means are continuous measurements, not counts. Putting the means into a "table" and running chi-square doesn't make sense — the test has no way to know the values are on a continuous scale.
Correct approach: For 2 groups, use a t-test. For 3+ groups, use ANOVA (followed by post-hoc comparisons with a correction for multiple comparisons).
Moore, McCabe & Craig (2021), Introduction to the Practice of Statistics, Ch. 7 and 12.
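A sketch of the 3+-group case in Python (scipy assumed; the score distributions are invented for illustration):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
# Hypothetical exam scores under three teaching methods
method_a = rng.normal(65, 8, 25)
method_b = rng.normal(75, 8, 25)
method_c = rng.normal(80, 8, 25)

# One-way ANOVA compares the three group means on the continuous scale
f_stat, p = f_oneway(method_a, method_b, method_c)
print(f"ANOVA: F = {f_stat:.1f}, p = {p:.4g}")
```

A significant F only says "not all means are equal"; which pairs differ requires post-hoc comparisons (e.g. Tukey's HSD) with a multiplicity correction.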
6. Z-test with small n and unknown population SD
Scenario: Student has n = 20 observations and computes Z = (x̄ − μ₀) / (s/√n), reporting Z with a standard-normal p-value.
Why it looks right: The formulas for Z and t look nearly identical; substituting s for σ feels harmless.
What's actually wrong: When s is a sample estimate (not the known population σ), the resulting statistic doesn't follow N(0,1) — it follows Student's t with df = n − 1. For small n, the t distribution has heavier tails, so a standard-normal p-value is too small. You're understating your uncertainty.
Correct approach: Use the t-test. For n ≥ ~30 the two distributions are close enough that the distinction rarely matters; for small n, use t.
Student [Gosset] (1908), Biometrika 6:1-25.
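The gap is easy to see in Python (scipy assumed; the statistic value 2.00 is hypothetical, chosen to sit near the decision boundary):

```python
from scipy.stats import norm, t

n = 20
stat = 2.00  # a hypothetical computed value of (x̄ − μ₀)/(s/√n)

p_z = 2 * norm.sf(stat)          # wrong: treats s as the known σ
p_t = 2 * t.sf(stat, df=n - 1)   # right: Student's t with n − 1 df
print(f"normal p = {p_z:.4f}")   # 0.0455 -- crosses the .05 line
print(f"t(19) p  = {p_t:.4f}")   # larger -- does not
```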
7. Running 20 tests, reporting only the significant one
Scenario: Researcher runs 20 outcome comparisons on the same dataset. One has p = .03. Manuscript reports that one, doesn't mention the other 19.
Why it looks right: p < .05; result is "significant"; why not publish?
What's actually wrong: Under a true null, the probability of at least one p < .05 across 20 independent tests is 1 − 0.95²⁰ ≈ 0.64. You are 64% likely to "find something" when there's nothing there. This is the multiple-comparisons problem; without correction, reported p-values are not what they claim to be.
Correct approach: Pre-register your primary outcome. For the remaining comparisons, apply Bonferroni (very conservative) or Benjamini-Hochberg (controls FDR). Always disclose how many comparisons were made.
Benjamini & Hochberg (1995), JRSS-B 57:289-300. Also the "garden of forking paths" — Gelman & Loken (2013).
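A small simulation in Python (scipy assumed; sample sizes are arbitrary) confirming the family-wise error rate:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_sim, n_tests = 2000, 20
any_hit = 0
for _ in range(n_sim):
    # 20 outcome comparisons, every null true
    pvals = [ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
             for _ in range(n_tests)]
    any_hit += min(pvals) < 0.05

print(f"P(at least one p < .05 | all nulls true) = {any_hit / n_sim:.2f}")
print(f"theoretical 1 - 0.95**20 = {1 - 0.95**20:.2f}")
```

The simulation treats the 20 tests as independent; correlated outcomes on one dataset change the exact rate but not the basic inflation.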
8. Interpreting p = .03 as "97% probability the effect is real"
Scenario: Student finds p = .03, concludes "There's a 97% chance our hypothesis is correct."
Why it looks right: Small p, big confidence; feels like a natural reading.
What's actually wrong: The p-value is P(data this extreme | H0 is true). It is NOT P(H0 is true given the data). Those are different conditional probabilities (Bayes' theorem). A p-value of .03 does not tell you how likely H0 is; it tells you how unusual the data would be if H0 were true.
Correct approach: Phrase it as "Under H0, we would see data this extreme about 3% of the time." To get a probability statement about H0, you need Bayesian analysis.
Wasserstein & Lazar (2016), ASA Statement on p-values, The American Statistician 70:129-133.
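A toy Bayes' theorem calculation makes the gap concrete (the prior and power values below are assumptions picked for illustration, not claims about any field):

```python
# Even a "significant" result leaves P(H0 | data) far from 5%.
prior_h1 = 0.10   # assumed: 10% of tested hypotheses are true
power = 0.80      # assumed: P(significant | H1)
alpha = 0.05      # P(significant | H0)

p_sig = power * prior_h1 + alpha * (1 - prior_h1)
post_h1 = power * prior_h1 / p_sig   # Bayes: P(H1 | significant)
print(f"P(H1 | significant) = {post_h1:.2f}")   # 0.64, nowhere near 0.95
```

Flipping the conditional requires a prior; the p-value alone cannot supply one.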
9. Reporting statistical significance without an effect size
Scenario: RCT with n = 3,000 per arm. BP reduction of 0.6 mmHg is "highly significant" (p < .0001). Manuscript celebrates.
Why it looks right: p < .0001; large trial; published.
What's actually wrong: With a huge N, statistical significance is easy; practical significance is not. A 0.6 mmHg BP drop is clinically meaningless. "Significant" has a technical meaning here (statistical); readers often infer the everyday meaning (important), which is misleading.
Correct approach: Always report an effect size alongside the p-value: Cohen's d for means, Cramer's V or OR for 2×2. Discuss clinical importance, not just statistical significance.
Cohen (1988); Cumming (2014), "The New Statistics."
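A sketch of the effect-size calculation in Python (numpy assumed; the trial arms are simulated to mimic the scenario's 0.6 mmHg difference against an SD of ~12):

```python
import numpy as np

def cohens_d(x, y):
    """Standardised mean difference using the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = (((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                  / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(5)
treated = rng.normal(129.4, 12, 3000)   # hypothetical BP values
control = rng.normal(130.0, 12, 3000)
print(f"d = {cohens_d(treated, control):.3f}")  # tiny effect despite huge n
```

A |d| of about 0.05 is negligible by any conventional benchmark, whatever the p-value says.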
10. Picking one-tailed after seeing the data
Scenario: Two-tailed p = .07 — "not significant." Student notices the direction matches their hypothesis and switches to a one-tailed test, getting p = .035.
Why it looks right: One-tailed is valid when the hypothesis was directional.
What's actually wrong: Choosing the one-tailed direction AFTER seeing the data is a form of p-hacking. The actual probability of "seeing something this extreme in the direction of greatest observed effect" under H0 is the two-tailed p-value. The declared one-tailed p is too small by a factor of 2.
Correct approach: Declare the test direction before looking at the data (pre-registration helps). One-tailed is appropriate only when you had a specific directional hypothesis a priori and are willing to treat effects in the opposite direction as "null."
Ruxton & Neuhäuser (2010), "When should we use one-tailed hypothesis testing?" Methods Ecol. Evol. 1:114-117.
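A null simulation in Python (scipy assumed; n = 15 per group as in a small study) showing that the post-hoc one-tailed procedure doubles the Type I error:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)
n_sim, rejections = 4000, 0
for _ in range(n_sim):
    a, b = rng.normal(size=(2, 15))   # both groups from the same null
    t_stat, p_two = ttest_ind(a, b)
    # "one-tailed" direction chosen AFTER seeing which way the effect went:
    p_post_hoc = p_two / 2
    rejections += p_post_hoc < 0.05

print(f"Type I error of post-hoc one-tailed test = {rejections / n_sim:.3f}")
# about 0.10 -- double the nominal 0.05
```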
11. Running t-test on strongly skewed data without checking
Scenario: n = 8 observations of medical costs (per-patient expenditure), deeply right-skewed. Student runs a one-sample t-test.
Why it looks right: Cost is a number; t-test is for numbers; n ≥ 2 technically suffices.
What's actually wrong: The t-test assumes the sampling distribution of the mean is approximately normal. For n = 8 and strong skew, the Central Limit Theorem is not rescuing you; the test's Type I error rate is likely off.
Correct approach: Check with the Assumption Coach first. If the verdict is yellow or red: consider log-transformation, non-parametric Wilcoxon signed-rank test, or bootstrap CI (Simulate page). For cost data specifically, log-transformation is often the first step.
Sawilowsky & Blair (1992), Psychol. Bull. 111:352-360.
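A sketch of the two alternatives in Python (scipy/numpy assumed; the eight cost values and the comparison centre of 1000 are hypothetical):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(8)
# Hypothetical right-skewed per-patient costs (n = 8)
costs = np.array([420., 510., 640., 700., 880., 1200., 2900., 8100.])

# Non-parametric alternative to a one-sample t-test against a centre of 1000
p_wilcoxon = wilcoxon(costs - 1000).pvalue
print(f"Wilcoxon signed-rank p = {p_wilcoxon:.3f}")

# Bootstrap percentile CI for the mean: no normality assumption
boot_means = rng.choice(costs, size=(10_000, 8)).mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lo:.0f}, {hi:.0f}]")
```

The very wide bootstrap interval reflects how much one extreme cost dominates a sample of eight, which is exactly the information a naive t-based CI hides.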
12. Correlation implies causation
Scenario: Observational study finds r = .45 between coffee consumption and longevity. Headline: "Coffee makes you live longer."
Why it looks right: Statistical association is real (p is tiny). Cause seems to follow.
What's actually wrong: Correlation is consistent with: A causes B, B causes A, a third variable causes both (confounding), selection bias, or chance. Without randomisation or a strong quasi-experimental design, any of these alternatives may be at work. Heavy coffee drinkers may also be richer, better educated, non-smokers — any of which could drive the association.
Correct approach: Adjust for known confounders (if observational). For causal claims, aim for a randomised design, a natural experiment, or Mendelian randomisation. State clearly in the discussion that correlational data cannot establish causation.
Hill (1965), "The Environment and Disease: Association or Causation?" Proc. R. Soc. Med. 58:295-300.
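The confounding mechanism can be simulated in a few lines of Python (numpy assumed; the variable names and coefficients are illustrative, not estimates from any study):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
# Hypothetical confounder (e.g. socioeconomic status) driving both variables
ses = rng.normal(size=n)
coffee = 0.7 * ses + rng.normal(size=n)      # coffee has NO causal effect
longevity = 0.7 * ses + rng.normal(size=n)   # on longevity here -- SES causes both

r = np.corrcoef(coffee, longevity)[0, 1]
print(f"r = {r:.2f} despite a causal effect of exactly zero")
```

A sizeable correlation appears purely because both variables share a common cause; the data alone cannot distinguish this from a direct effect.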
13. "95% CI means 95% probability the parameter is inside"
Scenario: Student computes a 95% CI of [2.3, 5.1] for a mean difference and states "There is a 95% probability the true mean difference lies between 2.3 and 5.1."
Why it looks right: "95%" and "confidence" naturally read as "95% probability."
What's actually wrong: In frequentist statistics, the true parameter is a fixed (unknown) number. The CI is random — it would be different in a different sample. The "95%" refers to the procedure: under repeated sampling, about 95% of the CIs so constructed would contain the true parameter. For any specific computed interval, the true value is either inside or outside (probability is 0 or 1, not 0.95).
Correct approach: Phrase it as "We are 95% confident (in the sense of the procedure's long-run coverage) that the interval [2.3, 5.1] contains the true mean difference." Or, for a probability statement about the parameter, use Bayesian credible intervals (different philosophy).
Hoekstra, Morey, Rouder & Wagenmakers (2014), Psychon. Bull. Rev. 21:1157-1164.
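A coverage simulation in Python (scipy/numpy assumed; the population values are arbitrary) showing what the "95%" really refers to:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
true_mu, n, n_sim = 10.0, 25, 4000
covered = 0
for _ in range(n_sim):
    x = rng.normal(true_mu, 3, n)
    half = t.ppf(0.975, n - 1) * x.std(ddof=1) / np.sqrt(n)
    # each individual interval either contains 10.0 or it doesn't...
    covered += (x.mean() - half) <= true_mu <= (x.mean() + half)

# ...but the PROCEDURE captures the true mean in ~95% of repeated samples
print(f"coverage over {n_sim} repeated samples = {covered / n_sim:.3f}")
```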
14. "p > .05 means there's no effect"
Scenario: Pilot trial (n = 15 per arm) finds p = .12 for the treatment vs. control difference. Conclusion: "Treatment has no effect."
Why it looks right: p > .05; "not significant"; move on.
What's actually wrong: "Absence of evidence is not evidence of absence" (Altman & Bland, 1995). A non-significant result with a small sample is uninformative about whether the effect is zero or meaningful. Your 95% CI might span both "no effect" and "substantial effect."
Correct approach: Report the CI for the effect, not just the p-value. A CI of [−2, 10] for a mean difference tells you the data are consistent with everything from a small negative effect to a large positive one — a very different story from a CI of [−0.2, 0.3].
Altman & Bland (1995), BMJ 311:485.
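A sketch of the small-pilot CI in Python (scipy/numpy assumed; the arm means and SDs are hypothetical, chosen so a real effect goes undetected):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
# Hypothetical pilot arms (n = 15 each) with a real but under-powered effect
treat = rng.normal(5, 10, 15)
ctrl = rng.normal(0, 10, 15)

diff = treat.mean() - ctrl.mean()
se = np.sqrt(treat.var(ddof=1) / 15 + ctrl.var(ddof=1) / 15)
df = 28  # pooled df for illustration; Welch df would differ slightly
half = t.ppf(0.975, df) * se
print(f"mean difference = {diff:.1f}, 95% CI [{diff - half:.1f}, {diff + half:.1f}]")
# an interval this wide is consistent with "no effect" AND a large effect
```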
15. Misinterpreting regression to the mean
Scenario: Highest-BP patients are given a new drug. At follow-up their BP has dropped. Conclusion: "The drug worked."
Why it looks right: BP went down after the drug, in patients selected for high BP. Clean pre-post comparison.
What's actually wrong: Patients selected for being extreme on a noisy measurement will, on repeat measurement, tend to be less extreme even with no intervention. This is regression to the mean, and it produces a pre-post drop whether or not the drug has any effect. Without a control group, you cannot separate RTM from the real effect.
Correct approach: Use a randomised control group. Compare the change in treated patients to the change in untreated controls (both selected with the same criterion). The difference-in-differences isolates the treatment effect.
Bland & Altman (1994), BMJ 308:1499. Galton (1886), "Regression toward mediocrity in hereditary stature."
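Regression to the mean can be demonstrated with no treatment in the simulation at all (numpy assumed; the BP parameters and enrolment cut-off are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 10_000
true_bp = rng.normal(130, 10, n)
measured1 = true_bp + rng.normal(0, 8, n)  # noisy first measurement
measured2 = true_bp + rng.normal(0, 8, n)  # repeat measurement, NO drug given

selected = measured1 > 150                 # enrol the "highest-BP" patients
drop = measured1[selected].mean() - measured2[selected].mean()
print(f"apparent BP drop with no treatment at all: {drop:.1f} mmHg")
```

The selected patients' first readings were high partly by measurement noise, which does not repeat; a concurrent control group selected the same way is what separates this artefact from a drug effect.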