About the Corpus

REALEC (Russian Error-Annotated Learner of English Corpus) is a corpus of essays written by university students whose L1 is Russian, in response to two types of examination questions. The corpus is the result of a collaborative effort undertaken since 2012 by professors and students of the School of Linguistics, Higher School of Economics, Moscow. It has served as the basis for many Bachelor's and Master's theses.

The first version (essays from 2014-2019) contains 6,054 texts totaling 1,550,653 tokens. The essays in the first version were manually annotated for errors (47 types) and have POS tags assigned by TreeTagger. The second version (with the addition of essays from 2020) has 18,710 texts containing 4,833,749 tokens. The annotation of the 2020 examination essays in this folder was produced by a neural network that categorizes errors into 6 types. A model for replacing these 6 error types with the 47 types of the REALEC taxonomy is in progress.
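
As a rough illustration of what these figures imply, the short Python sketch below derives the size of the 2020 addition and the mean essay length in each version. It assumes that the second version is exactly the first version plus the 2020 essays, as the description suggests; the function and variable names are illustrative and are not part of the corpus tooling.

```python
# Corpus sizes as stated in the description above.
V1_TEXTS, V1_TOKENS = 6_054, 1_550_653      # first version, 2014-2019
V2_TEXTS, V2_TOKENS = 18_710, 4_833_749     # second version, with 2020 added


def corpus_summary():
    """Derive the size of the 2020 addition and mean tokens per text.

    Assumption: version 2 = version 1 + the 2020 essays, so the 2020
    addition is the plain difference between the two versions.
    """
    added_texts = V2_TEXTS - V1_TEXTS
    added_tokens = V2_TOKENS - V1_TOKENS
    return {
        "added_texts": added_texts,
        "added_tokens": added_tokens,
        "v1_mean_tokens": V1_TOKENS / V1_TEXTS,
        "v2_mean_tokens": V2_TOKENS / V2_TEXTS,
        "added_mean_tokens": added_tokens / added_texts,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        # Counts print as integers, means with one decimal place.
        if isinstance(value, int):
            print(f"{name}: {value:,}")
        else:
            print(f"{name}: {value:.1f}")
```

Under that assumption, the 2020 essays account for 12,656 texts and 3,283,096 tokens, and the mean essay length stays close to 260 tokens in both versions.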

The corpus is open access and can be explored here.