Week 1: Comparing exam scores for two classes - Two-sample t test with unequal variances

Goal: 

Compare the average score on midterm 1 for classes A and B. Are they significantly different? What is the effect size?

Parameters:

- One variable of interest (exam score)
- Two-sample problem (class A and class B)
- Want to compare means

Assumptions:

- Two samples are independent

Checks:

- Underlying distributions are normal. Which test to use? We tried three (reproduced in the R sketch below):
+ Anderson-Darling test (p=0.15 for A, p=0.91 for B) using Mathematica and R
+ Jarque-Bera test (p=0.39 for A, p=0.67 for B) using R
+ Shapiro-Wilk test (p=0.33 for A, p=0.61 for B) using Mathematica
If we had p<0.05, we'd want to reject the hypothesis that the distribution was normal. All of our p values are well above that cutoff. (Strictly, failing to reject isn't the same as accepting normality; are these p values reasonable grounds for us to proceed as if the distributions are normal?)
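
A minimal sketch of the R side (assuming classA and classB are numeric vectors holding the scores; the raw data isn't included in the post):

    library(nortest)            # for ad.test (Anderson-Darling)
    library(tseries)            # for jarque.bera.test (Jarque-Bera)

    shapiro.test(classA)        # Shapiro-Wilk (base R)
    ad.test(classA)             # Anderson-Darling
    jarque.bera.test(classA)    # Jarque-Bera
    # ...and the same three calls for classB. Each reports a p-value;
    # p < 0.05 would count as evidence against normality.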

- Variances of the 2 samples are significantly different:
+ R's variance comparison gave a ratio of the variances = 1.5 and p = 0.035
+ Mathematica's variance comparison gave p = 0.014
Both have p < 0.05, so we can reject the hypothesis that the variances are equal, which is why we use the unequal-variance t test below (see the sketch after this list).
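
A sketch of the R side; the post doesn't name the function used, but var.test is R's standard F test and reports the variance ratio directly, so it's a plausible guess:

    var.test(classA, classB)    # F test of H0: var(A)/var(B) = 1
    # Mathematica's equivalent (per the comments below):
    #   VarianceEquivalenceTest[list1, list2]
    # The two p values (0.035 vs 0.014) may simply reflect the two
    # programs defaulting to different variance-equality tests.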

Test:

- Two-sample t test with unequal variances, i.e. Welch's test (see page 300 of the Rosner text)
+ In R, t = 41.47, which falls outside of the returned range 59-65, with p = 10^-16
+ In Mathematica, t = -6.27 with p = 2.4x10^-9
So by both runs of this test we can reject the hypothesis that the means do not differ significantly. But why do the two outputs differ so much? (A likely explanation, in hindsight: R's t.test prints t, df, and p-value on one line, and the Welch degrees of freedom are non-integer, so the 41.47 above is probably the df rather than t, and the 59-65 is probably the pair of sample means printed at the end of the output. See the sketch below.)
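
A minimal sketch in R (Welch's test is actually t.test's default; var.equal = FALSE is spelled out here for emphasis):

    t.test(classA, classB, var.equal = FALSE)
    # The printout lists t, df, and p-value on one line, then a 95%
    # confidence interval for the difference in means, then the two
    # sample means. The Welch-Satterthwaite df is non-integer and is
    # easy to mistake for the t statistic. Also note that R floors
    # very small p values at "< 2.2e-16".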

Effect size:

- Using R, Cohen's d (with pooled variance) gives an effect size of -0.86, and R helpfully tells us to consider this effect large!
We later found a flow chart (see Google Drive) that suggested we should've used Hedges' g instead. Next time! (Both appear in the sketch below.)
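
A sketch of how the numbers above may have been produced; the "large" label matches the effsize package's cohen.d, though that's an assumption on my part:

    library(effsize)
    cohen.d(classA, classB, pooled = TRUE)              # Cohen's d
    cohen.d(classA, classB, hedges.correction = TRUE)   # Hedges' g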

[Figures: histograms of Class A's and Class B's midterm 1 scores; in the output above, sample 1 is Class A and sample 2 is Class B.]
Comments

  1. If you don't mind, it would be useful for people if you could also put up the code you used in the various programming languages to show how you obtained the numbers above. This would make the data analysis reproducible.

    1. I did the Mathematica analysis, and the commands are very simple things like VarianceEquivalenceTest[list1,list2]. But I can share the notebook I was using if that would be helpful!

  2. Here comes a long comment where I try to explain my understanding of the logic behind the tests...

    Tests for normality:

    According to Wikipedia, the Shapiro-Wilk test has the best statistical power (i.e. the likelihood that it correctly rejects the null hypothesis, which here is that the data are normally distributed). It is closely followed by the Anderson-Darling test.

    Also according to Wikipedia: "Some published works recommend the Jarque–Bera test, but the test has weakness. In particular, the test has low power for distributions with short tails, especially for bimodal distributions. Some authors have declined to include its results in their studies because of its poor overall performance."

    Luckily, Mathematica can do all of these tests pretty painlessly, so all that's left to do is understand what is best for our data.

    The Anderson-Darling test measures something like the sum of the squared distance between the distribution in question and a given reference distribution (e.g. normal), with a weighting function that emphasizes the tails of the distribution (which I think makes it useful for data sets with small/short tails, since they don't get neglected).

    This test spits out a statistic, A^2, which can be compared to critical values for a given distribution (like the normal distribution) to tell you whether or not the data fit that distribution. If A^2 exceeds the critical value found in the tables, the hypothesis of normality is rejected. You can also get a p-value for this.
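
    For reference, the usual textbook form of the statistic, for an ordered sample Y_(1) <= ... <= Y_(n) and hypothesized CDF F, is

        A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1) \left[ \ln F(Y_{(i)}) + \ln\big(1 - F(Y_{(n+1-i)})\big) \right]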

    The Shapiro-Wilk test uses a different statistic (W), which I don't fully understand. It looks like the square of a weighted sum of the ordered values in your set, divided by the sum of squared distances of the values from the mean; the way the values are weighted is complicated. The important takeaway, I think, is that:

    "The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population; in other words, the data are not normal. On the contrary, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected (e.g., for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis that the data are from a normally distributed population). However, since the test is biased by sample size, the test may be statistically significant from a normal distribution in any large samples. Thus a Q–Q plot is required for verification in addition to the test." (Wikipedia)

    The Jarque-Bera test measures whether a data set has the skewness and kurtosis of a normal distribution. What are these? Skewness measures a distribution's asymmetry, and kurtosis measures how heavy its tails are; a normal distribution has skewness 0 and kurtosis 3.
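
    For reference, the statistic is

        JB = \frac{n}{6} \left( S^2 + \frac{(K-3)^2}{4} \right)

    where n is the sample size, S the sample skewness, and K the sample kurtosis, so both terms vanish exactly when the sample matches the normal values S = 0 and K = 3.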
    This test may incorrectly reject the null hypothesis for small sample sizes...
