Mikhail Kamrotov (2); Ekaterina Talalakina (1); Denis Stukal (2); (1) Tampere University, Finland; (2) Personal capacity, Russian Federation
In this presentation, we address language for specific purposes (LSP) through the creation and validation of a technical, corpus-based word list in Russian. We introduce several methodological innovations in compiling a list of Russian vocabulary for students with a professional interest in economics. To create the Russian Economics Word List (REWL), we applied three criteria (frequency ratio, degree of dispersion, and minimum occurrence threshold) to an economics corpus of over 10 million tokens, which we collected specifically for this study from academic and mass media texts. We propose a systematic approach to selecting optimal thresholds for these criteria based on combinatorially symmetric cross-validation. The main advantage of our approach is that it is largely data-driven and relies on as little discretion as possible. We chose out-of-sample coverage as the target performance statistic for the word list and optimized it with respect to the selected criteria. To check the list’s out-of-sample reliability, we estimated the REWL’s coverage of a new corpus that was not part of the training data. The resulting coverage matches that of the corpus used to derive the REWL, which supports our argument for the robustness of the proposed algorithm.
The methodology can potentially be used to develop corpus-based lists of specialized vocabulary in other languages.
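The three selection criteria and the coverage statistic can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything below is an assumption made for illustration: the frequency ratio is taken as the relative frequency in the economics corpus divided by a smoothed relative frequency in a general reference corpus, dispersion is taken as the share of economics documents containing the word, and the threshold values, function names, and toy English tokens are hypothetical. The sketch applies fixed thresholds and checks coverage on a single held-out text; it does not implement combinatorially symmetric cross-validation.

```python
from collections import Counter


def select_words(domain_docs, reference_docs,
                 min_ratio=2.0, min_dispersion=0.5, min_count=2):
    """Return words passing all three (assumed) criteria.

    domain_docs and reference_docs are lists of token lists.
    """
    domain_counts = Counter(tok for doc in domain_docs for tok in doc)
    ref_counts = Counter(tok for doc in reference_docs for tok in doc)
    n_domain = sum(domain_counts.values())
    n_ref = sum(ref_counts.values())
    doc_sets = [set(doc) for doc in domain_docs]

    selected = set()
    for word, count in domain_counts.items():
        # Criterion 1: minimum occurrence threshold.
        if count < min_count:
            continue
        # Criterion 2: frequency ratio (assumed definition). The reference
        # frequency is smoothed by adding one to avoid division by zero.
        ratio = (count / n_domain) / ((ref_counts[word] + 1) / (n_ref + 1))
        # Criterion 3: dispersion (assumed definition), the share of
        # domain documents in which the word occurs.
        dispersion = sum(word in s for s in doc_sets) / len(doc_sets)
        if ratio >= min_ratio and dispersion >= min_dispersion:
            selected.add(word)
    return selected


def coverage(word_list, docs):
    """Share of tokens in docs that belong to word_list."""
    tokens = [tok for doc in docs for tok in doc]
    return sum(tok in word_list for tok in tokens) / len(tokens)


# Toy data, hypothetical and in English for readability.
domain_docs = [
    ["inflation", "rises", "the", "bank"],
    ["the", "bank", "cuts", "inflation"],
    ["inflation", "the", "market", "bank"],
]
reference_docs = [
    ["the", "cat", "sat", "the", "mat"],
    ["the", "dog", "ran", "the", "park"],
]
held_out = [["the", "bank", "raises", "inflation", "forecast"]]

rewl = select_words(domain_docs, reference_docs)
held_out_coverage = coverage(rewl, held_out)
```

In this toy setting the function word "the" is frequent and evenly dispersed but fails the frequency ratio because it is equally common in the reference corpus, while words occurring only once fail the minimum occurrence threshold. A threshold search of the kind described in the abstract would repeat this selection over a grid of threshold values and compare the resulting out-of-sample coverage.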