Use Gwet’s AC1 instead of Cohen’s Kappa for Inter-rater Reliability

Last year, I attended a lecture by my former assessment and measurement professor, Dr. Van Haneghan, at the University of South Alabama. He addressed the paradox of using Cohen’s Kappa (k) for inter-rater reliability and noted that it was identified in the literature two decades ago but has been largely overlooked. The problem is that Cohen’s k can come out misleadingly low when there is high agreement between raters or an imbalance in the marginal totals of the data table (Cicchetti & Feinstein, 1990; Gwet, 2008). This defeats the purpose of the statistic, as researchers want an accurate measure of the degree of agreement. So if you’ve ever used Cohen’s Kappa for inter-rater reliability in your research studies, I recommend recalculating it with Gwet’s first-order agreement coefficient (AC1).
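For readers who want to see where the two statistics part ways, here is a sketch of the standard two-rater definitions. The notation is my own shorthand and is an assumption, not something taken from the lecture: p_a is the observed proportion of agreement, p_{k+} and p_{+k} are the two raters’ marginal proportions for category k, and q is the number of categories.

```latex
\kappa = \frac{p_a - p_e^{(\kappa)}}{1 - p_e^{(\kappa)}},
\qquad
p_e^{(\kappa)} = \sum_{k=1}^{q} p_{k+}\, p_{+k}

\mathrm{AC1} = \frac{p_a - p_e^{(\gamma)}}{1 - p_e^{(\gamma)}},
\qquad
p_e^{(\gamma)} = \frac{1}{q-1} \sum_{k=1}^{q} \pi_k \,(1 - \pi_k),
\qquad
\pi_k = \frac{p_{k+} + p_{+k}}{2}
```

Both coefficients have the same form and differ only in the chance-agreement term. When nearly all ratings fall into one category, Cohen’s chance term approaches 1, so a handful of disagreements can pull k down sharply. Gwet’s chance term approaches 0 in the same situation, so AC1 stays close to the observed agreement.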

For my presentation at AERA, I decided to rerun the stats for my research study, in which two raters analyzed the content of 23 online syllabi with the Online Community of Inquiry Syllabus Rubric. I used AgreeStat to obtain Cohen’s k and Gwet’s AC1 to determine inter-rater reliability per category. Tables 1A-B show how the k statistic was affected by high agreement in the category of instructional design (ID) for cognitive presence (CP), while Gwet’s AC1 was not. Overall, Gwet’s AC1 values ranged from .102 to .675 (M ± SD = .135 ± .128). Inter-rater reliability for scoring the ID for CP category was good according to Altman’s (1991) benchmark, Gwet’s AC1 = .675, p < .001, 95% CI [.408, .943] (see Table 1B).

Table 1A

Distribution of Scores by Rater and Category (Instructional Design for Cognitive Presence)

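| Rater 1 CP | 3 | 4 | 5 | Missing | Total |
|---|---|---|---|---|---|
| 3 | 0 | 0 | 0 | 0 | 0 [0%] |
| 4 | 1 | 2 | 1 | 0 | 4 [17.4%] |
| 5 | 4 | 0 | 15 | 0 | 19 [82.6%] |
| Missing | 0 | 0 | 0 | 0 | 0 [0%] |
| Total | 5 [21.7%] | 2 [8.7%] | 16 [69.6%] | 0 [0%] | 23 [100%] |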

Table 1B


Inter-rater Coefficients and Associated Parameters for ID for CP

| Method | Coeff. | StdErr | 95% C.I. | p-Value |
|---|---|---|---|---|
| Cohen’s Kappa | 0.36406 | 0.172287 | 0.007 to 0.721 | 4.617E-02 |
| Gwet’s AC1 | 0.67549 | 0.128882 | 0.408 to 0.943 | 2.944E-05 |
| Scott’s Pi | 0.33494 | 0.195401 | -0.07 to 0.74 | 1.006E-01 |
| Krippendorff’s Alpha | 0.34940 | 0.195401 | -0.056 to 0.755 | 8.754E-02 |
| Brennan-Prediger | 0.60870 | 0.140428 | 0.317 to 0.9 | 2.664E-04 |
| Percent Agreement | 0.73913 | 0.093618 | 0.545 to 0.933 | 7.344E-08 |

Note. Coeff. = unweighted agreement coefficient; StdErr = standard error (the estimated standard deviation of the coefficient); C.I. = confidence interval.
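If you don’t have AgreeStat handy, the point estimates can be checked by hand. Below is a minimal Python sketch, not AgreeStat’s implementation, that computes percent agreement, Cohen’s k, and Gwet’s AC1 from the counts in Table 1A using the standard two-rater formulas. The function name `agreement_coefficients` and the constant `TABLE_1A` are hypothetical names chosen for illustration. I am also assuming that the rows are Rater 1’s scores and the columns are the second rater’s scores, and I drop the empty “Missing” row and column. Standard errors, confidence intervals, and p-values are not computed here.

```python
from typing import Dict, List

# Counts from Table 1A for score categories 3, 4, and 5.
# Assumption: rows = Rater 1, columns = the second rater; "Missing" is dropped.
TABLE_1A: List[List[int]] = [
    [0, 0, 0],
    [1, 2, 1],
    [4, 0, 15],
]


def agreement_coefficients(table: List[List[int]]) -> Dict[str, float]:
    """Return percent agreement, Cohen's kappa, and Gwet's AC1 for a square table."""
    q = len(table)                              # number of categories
    n = sum(sum(row) for row in table)          # number of rated subjects
    row_p = [sum(row) / n for row in table]     # row (first rater) marginals
    col_p = [sum(table[i][k] for i in range(q)) / n for k in range(q)]
    p_a = sum(table[k][k] for k in range(q)) / n  # observed agreement

    # Cohen: chance agreement is the sum of products of the two raters' marginals.
    pe_kappa = sum(r * c for r, c in zip(row_p, col_p))

    # Gwet: chance agreement uses the average marginal per category.
    pi = [(r + c) / 2 for r, c in zip(row_p, col_p)]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)

    return {
        "percent_agreement": p_a,
        "cohen_kappa": (p_a - pe_kappa) / (1 - pe_kappa),
        "gwet_ac1": (p_a - pe_ac1) / (1 - pe_ac1),
    }


if __name__ == "__main__":
    for name, value in agreement_coefficients(TABLE_1A).items():
        print(f"{name}: {value:.5f}")
```

Run on the Table 1A counts, this sketch agrees with the Percent Agreement, Cohen’s Kappa, and Gwet’s AC1 coefficients in Table 1B to about four decimal places. It also makes the paradox visible: with 19 of Rater 1’s 23 scores in category 5, Cohen’s chance-agreement term is about .59, while Gwet’s is about .20.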

Gwet’s AgreeStat, Version 2015.6.1 (Advanced Analytics, Gaithersburg, MD, USA) currently costs $40. It’s fairly easy to use. See Kilem Gwet’s blog to learn more.


Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.

Cicchetti, D.V., & Feinstein, A.R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551-558. doi:10.1016/0895-4356(90)90159-m

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29-48. doi:10.1348/000711006X126600

Author: teacherrogers

Content developer, instructional designer, trainer, and researcher
