Last year, I attended a lecture by my former assessment and measurement professor, Dr. Van Haneghan, at the University of South Alabama. He addressed the paradox of using Cohen’s Kappa (k) for inter-rater reliability and acknowledged that it was identified in the literature two decades ago but has been mainly overlooked. The problem is that Cohen’s k skews data when there is a high agreement between raters or an imbalance in the margins of the data tables (Cicchetti & Feinstein, 1990; Gwet, 2008). This is contradictory to the statistical technique’s purpose, as researchers want to obtain an accurate degree of agreement. So if you’ve ever used Cohen’s Kappa for inter-rater reliability in your research studies, I recommend recalculating it with Gwet’s first-order agreement coefficient (AC1).
I decided to rerun the stats for my research study involving two raters analyzing the content of 23 online syllabi with the Online Community of Inquiry Syllabus Rubric for my presentation at AERA. AgreeStat was used to obtain Cohen’s k and Gwet’s AC1 to determine inter-rater reliability per category. Tables 1A-B show how the k statistic was affected by high agreement in the category of instructional design (ID) for cognitive presence (CP), while Gwet’s AC1 was not. Overall, Gwet’s AC1 values ranged from .102 to .675 (Mean SD = .135 ± .128). Interrater-reliability for scoring this category was good according to Altman’s (1991) benchmark, Gwet’s AC1 = .675, p < .001, and 95% CI [.04, .617].
Distribution of Scores by Rater and Category (Instructional Design for Cognitive Presence)
Inter-rater Coefficients and Associated Parameters for ID for CP
|Cohen’s Kappa||0.36406||0.172287||0.007 to 0.721||4.617E-02|
|Gwet’s AC1||0.67549||0.128882||0.408 to 0.943||2.944E-05|
|Scott’s Pi||0.33494||0.195401||-0.07 to 0.74||1.006E-01|
|Krippendorff’s Alpha||0.34940||0.195401||-0.056 to 0.755||8.754E-02|
|Brenann-Prediger||0.60870||0.140428||0.317 to 0.9||2.664E-04|
|Percent Agreement||0.73913||0.093618||0.545 to 0.933||7.344E-08|
Note. Unweighted Agreement Coefficients (Coeff.). Standard Error (StdErr) is the standard deviation. CI= confidence interval.
Gwet’s AgreeStat, Version 2015.6.1 (Advanced Analytics, Gaithersburg, MD, USA) currently costs $40. It’s fairly easy to use. See Kilem Gwet’s blog to learn more.
Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.
Cicchetti, D.V., & Feinstein, A.R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551-558. doi:10.1016/0895-4356(90)90159-m
Gwet, K. (2008). Computing inter-rater reliability in the presence of high agreement. British Journal of Mathematical & Statistical Methodology, 61(1), 29-48. doi:10.1348/000711006×126600