(Week 6->7) Analysis of 1) Classes H&I Pre and 2) Class B Pre-to-Post
Introduction
In this investigation, we conducted two separate analyses.
- In the first, we compared the pre-test responses to Question 1 for Classes H and I using Fisher's exact test for a 2x2 contingency table. We found that the pre-test responses to Question 1 provide no evidence that Classes H and I are different.
- In the second, we compared the pre-test responses to Question 1 to the post-test responses to Question 2 for Class B using contingency table methods (the Chi-Square test and an approximation to Fisher's exact test via Stirling's approximation for n!), considered the limiting factors in our dataset for rank-correlation methods, the Kappa statistic, and McNemar's test, and assessed other measures available for comparing binary vectors. We find it hard to draw any firm conclusion, but propose that there was a significant difference between the pre- and post-test responses for Class B.
Classes H&I Pre-test Comparison
Because we have such small numbers, we are able to detail every possible arrangement of the students into the two classes given the class sizes (there are only 9 permutations, the details of which are included in the Appendix). The distribution of students we observe had a 33.55% likelihood of occurring and was the most likely permutation. Allowing slight deviations from the observed permutation (by swapping two students with differing performances between the two classes) to form a bin of three permutations centered on the observed one yields an 81.08% likelihood of occurrence.
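The enumeration described above can be sketched in Python. The margins below (class sizes and total correct) are made-up placeholders, chosen only so that, as in our data, exactly 9 tables exist; the real counts live in the appendix spreadsheet.

```python
from math import comb

# Hypothetical margins for illustration only -- the real class sizes and
# correct/incorrect totals are in the appendix spreadsheet.
N, X, Y = 16, 8, 8  # N students total, X in Class H, Y correct overall

def table_probability(a):
    """Hypergeometric probability of the table with 'a' correct students in Class H."""
    return comb(Y, a) * comb(N - Y, X - a) / comb(N, X)

# Step through every table the margins allow (the "permutations" in the text):
tables = [(a, Y - a, X - a, N - X - Y + a, table_probability(a))
          for a in range(max(0, X + Y - N), min(X, Y) + 1)]
```

With these assumed margins there are 9 admissible tables, and their probabilities sum to 1, which is a handy sanity check on the enumeration.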
Given the students' performance on this pre-test question, our analysis suggests that there is no evidence for claiming any difference between Class H and Class I.
Class B Pre-to-Post-Test Comparison
For this analysis, we were unable to find a useful statistical tool. The following is an overview of the things we tried and considered. Please treat it as a "narrative of things tried" rather than a "how-to manual."
The students in Class B were surveyed at the beginning of the class and at the end of the class. The students' responses to one question from each of these tests are analyzed in this comparison. To begin, we used the Rosner flowchart and determined we should use contingency tables.
[Flowchart figure: our initial decision tree through the Rosner flowchart, leading to the contingency table methods.]
Note 1) Is this data actually ordinal? We tentatively say it is not.
Note 2) Do we want association or reproducibility? There are arguments for both.
Note 3) Wait...how many samples do we have? And how many variables?
We generated the following contingency table from the data.
We also generated the expected values for this contingency table.
Contingency Table Methods and Fisher's Exact Test
Alert! Beware the lack of information provided by the flowchart! In one branch, we see that some contingency table methods require sample independence; in the branch we took, that requirement was never raised. Our samples are dependent: we are analyzing pre-/post-test data. This dependence invalidates some of the basic contingency table methods and Fisher's Exact Test. Regardless, the following is the analysis we chugged away on throughout the beginning of the week.
We can perform a Chi-Square (and a Yates-corrected Chi-Square) test and compute a value of 7.16 (6.11). The 95% critical value for a chi-square distribution with one degree of freedom is 3.84 [this can be generated in R by calling "qchisq(0.95, df=1)"]. So, with 95% confidence, we can say the observed data deviates from the expected values.
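The calculation can be sketched in Python (our Class B table is not reproduced here, so the function is shown without it; the critical-value call mirrors the R call in the text):

```python
from scipy.stats import chi2

def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square statistic for a 2x2 table, with optional Yates correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# The 95% critical value, equivalent to R's qchisq(0.95, df=1):
critical = chi2.ppf(0.95, df=1)  # approximately 3.84
```

A computed statistic above `critical` is significant at the 95% level, which is the comparison made above.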
In an attempt to find meaning in the observed data's deviation from expectations, we generated the 35 permutations for this contingency table (please see the Excel document on the Google Drive if you're curious) and used the Stirling approximation for n! to compute the probability of each table using Fisher's exact test. The observed table had a 0.45% chance of occurring. Slightly deviating from the exact observed table to make a bin containing three tables centered on the observed one would have a 1.84% chance of occurring.
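A minimal sketch of the Stirling-based computation, assuming the standard hypergeometric form of the table probability (our actual counts are in the spreadsheet, so none appear here):

```python
import math

def ln_factorial_stirling(n):
    """Stirling's approximation: ln n! ~ n ln n - n + 0.5 ln(2 pi n)."""
    if n == 0:
        return 0.0
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def fisher_table_prob(a, b, c, d, ln_fact=ln_factorial_stirling):
    """Probability of one 2x2 table under Fisher's exact test, computed in log space."""
    n = a + b + c + d
    ln_p = (ln_fact(a + b) + ln_fact(c + d) + ln_fact(a + c) + ln_fact(b + d)
            - ln_fact(n) - ln_fact(a) - ln_fact(b) - ln_fact(c) - ln_fact(d))
    return math.exp(ln_p)
```

Working in log space keeps the intermediate values small; the Stirling errors in the numerator and denominator factorials largely cancel, so the approximation is close for modest cell counts.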
However, Fisher's exact test is to be used in a scenario that maps onto the "drawing marbles out of a sack" analogy. Because we are using matched data, the test is invalidated. We jumped to the Fisher's exact test branch of the flowchart and backtracked to the "Are samples independent?" junction, which led to McNemar's Test.
McNemar's Test -> Data Gathering Deficiencies
McNemar's Test requires data to be sorted as matched pairs. On the surface, our data appears to be organized as matched pairs: we have each student perform the pre-test and then perform the post-test. Each student is matched with their past self.
However, McNemar's Test is designed to compare the outcomes of two different treatments. In the example in the Rosner text, a matched pair consists of two individuals who are "identical" at the outset and then subjected to different treatments. These individuals' responses to the two treatments are then compared to assess the effectiveness of each treatment. The example in Rosner converts the contingency table as we are using it (Group vs. Outcome, with 2N data points: each individual's response) into a condensed form (Outcome vs. Outcome, with N data points: each matched pair's response).
In a PER sense, we would want to match sets of two students (maybe by pre-test evaluation) and then subject them to two different forms of education. The test, unfortunately, does not really map onto each student's before/after comparison.
Reproducibility and the Kappa Statistic
We return to the flowchart in hopes of inspiration. Before, we had delved into Fisher's exact test because it was what we had done for Classes H & I, and that led to a meandering detour up that flowchart path. However, our original flowchart path had some questionable decisions towards the end. Namely, do we want "Association" or "Reproducibility" for this analysis? At first, it seemed we wanted to associate the "in-between" stuff (i.e., the teaching) with differences between the pre-/post-tests. After thinking long and hard about McNemar's test and the limitations of a pre-/post-test format, it seemed reasonable that we were essentially aiming to reproduce the original results of the pre-test (obviously, we would hope for students to do 'better,' which means we have some mental hoops to jump through for our conclusions).
For reproducibility, we essentially only care about the frequency of the two cases where the response stayed the same (either pre=0 and post=0 or pre=1 and post=1). These correspond to "a" and "d" values from an arbitrary set of data:
The concordance rate for a 2x2 contingency table is:

Concordance = (a + d) / N

The maximum concordance is 1 and the minimum is 0. The measure for reproducibility, the Kappa statistic, compares the concordance for the observed data against the concordance for the expected values and then normalizes that value:

Kappa = (Concordance_observed - Concordance_expected) / (1 - Concordance_expected)
Our data has a concordance rate of 40.83%. The expected concordance rate is 51.81%. The Kappa statistic is -0.23. Unfortunately, the textbook explicitly says "...negative values for Kappa usually have no biological significance." So, we have no great way of continuing forwards. Maybe, the best interpretation is that we should not have been looking for reproducibility! Alternatively, maybe this means the test was really not reproduced, which would be good - we don't want the post-test results to be the same as the pre-test results if we want to measure any changes to the students.
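The Kappa calculation can be sketched as follows (the cell values a, b, c, d are placeholders for the observed table, which is not reproduced here; the expected concordance is built from the marginal totals exactly as in the expected-value table):

```python
def kappa_2x2(a, b, c, d):
    """Cohen's Kappa for a 2x2 pre/post table; a and d are the concordant cells."""
    n = a + b + c + d
    concordance_obs = (a + d) / n
    # Expected concordance from the marginal totals:
    concordance_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (concordance_obs - concordance_exp) / (1 - concordance_exp)
```

Perfect agreement gives Kappa = 1, and responses that always flip give Kappa = -1, which brackets the -0.23 we computed.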
Rank-Correlation Methods
Returning once more to the flowchart, we can back up a little further to where we decided that we were not dealing with ordinal data. We do not know what the question was about. It could easily have been a categorical question without a definitively correct response. However, we could also reason that the question was a physics question where a 1 indicated a correct response and a 0 indicated an incorrect response. This vaguely fits the definition of ordinal data. So what are the rank-correlation methods? And do they apply?
Short answer: No, they do not apply.
Long answer: For rank-correlation methods, you simply order the data from largest to smallest across both variables. This works really well for the pre-/post-test scenario. In fact, the example in Rosner is about scoring newborn health at 1 minute and then at 5 minutes: it is precisely a pre-/post-test framework. The problem, as was pointed out in our discussion last week, is that we need some fidelity in the responses. The example in Rosner uses a scale that goes from 0 to 10, which allows the data to have a reasonable spread, even with repeats. Binary data does not have enough resolution for rank-correlation methods to apply well. There are ways to try to make them work, but it seems more useful to explore beyond Rosner for ways to process our binary vectors.
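The tie problem is easy to demonstrate. Below, the pre/post vectors are made-up binary data (not Class B's); with only two distinct values, every observation ties with roughly half the sample, and the Spearman rank coefficient collapses to the ordinary phi (Pearson) coefficient, so ranking adds nothing:

```python
from scipy.stats import spearmanr

# Hypothetical 0/1 pre- and post-test vectors (one entry per student):
pre  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
post = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0]

# With only two distinct values there is no genuine rank structure for the
# method to exploit; rho here equals the phi coefficient of the 2x2 table.
rho, p_value = spearmanr(pre, post)
```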
Generally Processing Binary Vectors
There is an interesting paper that overviews and analyzes methods for evaluating binary vectors. The authors begin by defining the properties of binary vectors and then list 9 measures for computing "similarity" (and the associated "dissimilarity") between two binary vectors. The following is a brief synopsis of the introduction and conclusion of the paper and what they mean for our analysis of Class B.

Firstly, a binary vector is just a list of 0's and 1's: Z = (0, 1, 0, 0, 0, 1, 0, ...). The authors use a set of 4 numbers to compare two binary vectors, X and Y. These 4 numbers are exactly the same numbers we have been using for our contingency tables:
Using this, we can easily compute the various "similarity" and "dissimilarity" measures for binary vectors. Note that all of the "dissimilarity" measures are normalized from 0 to 1. The authors recommend using the Rogers-Tanimoto, Correlation, and Sokal-Michener measures. The dissimilarity measure for our dataset is 0.74 for all three of these methods. The paper does not provide further tools for giving these "dissimilarity" measures statistical meaning, but 0.74 is on the dissimilar side of the [0,1] range...which would agree with the other results we have seen in the other methods.
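Two of the measures can be sketched from the (a, b, c, d) counts. The forms below follow common conventions (the Rogers-Tanimoto form matches SciPy's); the paper's exact normalizations may differ slightly, so treat these as illustrative:

```python
import math

def contingency_counts(x, y):
    """(a, b, c, d) cell counts for two equal-length binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 1))
    b = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 0))
    c = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (0, 1))
    d = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (0, 0))
    return a, b, c, d

def rogers_tanimoto_dissimilarity(x, y):
    """Double-weights the mismatches b and c relative to the matches a and d."""
    a, b, c, d = contingency_counts(x, y)
    return 2 * (b + c) / (a + d + 2 * (b + c))

def correlation_dissimilarity(x, y):
    """1 minus the phi coefficient, shifted and scaled to lie in [0, 1]."""
    a, b, c, d = contingency_counts(x, y)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (1 - phi) / 2
```

Identical vectors score 0 and complementary vectors score 1 under both measures, matching the normalized [0,1] range described above.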
Appendix
Expected Values
We can utilize the totals (total correct/incorrect and the total enrolled in either class) to calculate expected values for this dataset. Calculating these expected values is simple. Given an arbitrary dataset, each cell's expected value is the product of its row total and column total, divided by the grand total:

Expected(row i, column j) = (Row i Total) x (Column j Total) / N
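As a sketch, the expected-value calculation for any table (not just 2x2):

```python
def expected_values(table):
    """Expected counts: (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]
```

For example, a table with rows [10, 20] and [30, 40] has expected values [[12, 18], [28, 42]], and the expected table always reproduces the same margins as the observed one.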
Permutations for Distributing Students into Classes H & I
There are only 4 numbers that define a 2x2 contingency table. The first set of 4 numbers is derived from the raw data and used to populate our observed dataset.
We begin to process these numbers into our larger context by calculating row and column totals and using those to, say, rewind the process to find expected values. We can push the envelope further by choosing a different set of 4 numbers to define the matrix.
Consider, for example, drawing X marbles from a sack containing N red and blue marbles, of which Y are red, and assessing the probability of getting "a" red marbles.
Note 1) "a" and N are the same "a" and N as we had in the original "Arbitrary Dataset."
Note 2) X = a + c (given c from the original "Arbitrary Dataset").
Note 3) Y = a + b (given b from the original "Arbitrary Dataset").
Now, we have a different set of 4 numbers, which we can use to populate the rest of the grid.
Note 1) Given our definitions for X and Y, we can see that b = Y - a and c = X - a.
Note 2) d = N - X - Y + a is slightly more confounding algebra than the other calculations, but still works!
Note 3) None of the calculations we need will actually make use of the Class I and # Incorrect Totals, but those are still useful to know. Depending on the original distribution of numbers, it may be useful to swap the columns or the rows so that "a" is a variable that can be set to zero.
Our reframing will hopefully make it easier to see how we can systematically step through all of the possible 2x2 matrices that exist for our given X, Y, and N. To generate the various matrices, we start with a=0 and increment "a" by 1 until we zero one of the other cells.
The probability for each permutation is given by (using the original a, b, c, and d variables):

P = [(a+b)! (c+d)! (a+c)! (b+d)!] / [N! a! b! c! d!]
Because these probabilities use factorials, we need relatively small numbers to actually carry out this calculation. As the expected values for a table rise above 5, this process becomes intractable.
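One way around the factorial blow-up (beyond Stirling's approximation) is to work with exact log-factorials via the log-gamma function, so no large intermediate factorial is ever formed. A sketch:

```python
from math import lgamma, exp

def ln_fact(n):
    """Exact ln n! via the log-gamma function -- no huge intermediate factorials."""
    return lgamma(n + 1)

def table_probability(a, b, c, d):
    """Fisher's exact probability of a 2x2 table, stable even for large counts."""
    n = a + b + c + d
    return exp(ln_fact(a + b) + ln_fact(c + d) + ln_fact(a + c) + ln_fact(b + d)
               - ln_fact(n) - ln_fact(a) - ln_fact(b) - ln_fact(c) - ln_fact(d))
```

For the table (1, 1, 1, 1) this gives 2/3, and for (5, 5, 5, 5) it matches the closed-form hypergeometric value, so the log-space arithmetic loses no accuracy.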
I'm trying to do some research to figure out whether McNemar's test is a good choice for analyzing Class B pre/post-test scores.
- Can McNemar's test be used to compare some measurement before and after treatment, as opposed to comparing two different treatments on matched data sets?
Brief intro to McNemar's test: It can be used on 2x2 contingency tables, and the null hypothesis is that the probability of "positives" (or correct) is the same under treatment 1 and 2 (or, perhaps, before and after treatment?). This boils down to a null hypothesis that the probabilities of getting the off-diagonal numbers in your contingency table are the same. Then the alternative hypothesis is that they are not the same. In our case, the alternative hypothesis is that the probability of getting this number of students who went from incorrect to correct is not the same as the probability of getting this number of students who went from correct to incorrect.
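Since the test only involves the off-diagonal cells, it is tiny to sketch (b and c here are placeholders for the discordant counts, not our actual data):

```python
from scipy.stats import chi2

def mcnemar(b, c, correction=True):
    """McNemar's chi-square on the discordant cells: b = correct -> incorrect,
    c = incorrect -> correct. Returns (statistic, p-value)."""
    if correction:
        stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected form
    else:
        stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)
```

If b = c the statistic is 0 and the p-value is 1, exactly the null hypothesis of equal off-diagonal probabilities described above.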
On this question, VassarStats says that McNemar's test is appropriate for tests of individual subjects assessed with respect to two dichotomous variables.
"Suppose that 100 subjects are each assessed with respect to two dichotomous categorical variables, A and B. If the temporal sequence of the two measures is relevant, Variable A can be defined as the "before" measure and Variable B as the "after" measure."
My favorite source of information, Wikipedia, uses the example of testing whether a drug has an effect on some disease. If we say that "drug"=instruction and "presence of disease"=wrong answer, that's basically what we are trying to do...so it seems legit.
For some PER examples:
This paper (https://journals.aps.org/prper/abstract/10.1103/PhysRevSTPER.4.010108) uses McNemar's test to compare the percentage of students who use a certain kind of reasoning before instruction to the percentage who use that kind of reasoning after instruction. The authors mention that their statistics have to take into account the fact that the samples are not independent, so McNemar's is appropriate.
This article by Eric Kuo and Carl Wieman (https://journals.aps.org/prper/abstract/10.1103/PhysRevSTPER.11.020133) also uses McNemar's test to compare students' pre- and post-instruction performance.
A tangent question that I came across half an answer to: What is ordinal data?
According to Wikipedia, ordinal data is anything with a rank value, including "right/true" vs. "wrong/false" or healthy/sick. But McNemar's test is for nominal data...so healthy/sick seems to call for different statistics than presence/absence of disease.
Here's an article about the difference between chi-square and McNemar's tests: http://www.theanalysisfactor.com/difference-between-chi-square-test-and-mcnemar-test/