Description of Kazakh Verb Dictionary
This document describes a service under development, located at https://kazakhverb.khairulin.com/gc_landing_en.html.
Motivation
Easily available dictionary data will lower the barriers to building new services and applications for the Kazakh language.
Goal
The project goal is to create a dataset covering all words of the Kazakh language with Russian and English translations. The words can have additional annotations, e.g. part-of-speech tags. The dataset should be publicly accessible, free to use, and have a mechanism for adding and correcting data.
Implementation
The Kazakh Verb Dictionary service allows users to register and participate in adding and correcting data.
Data model
Word
Each word in the service's database is annotated with its language and part of speech; Kazakh verbs may additionally carry a mark indicating special conjugation. Authors are recorded for each word. Comments can be added to a word, for example to distinguish homonyms. Isolated words, that is, words not connected to any translation, are not exported and are not visible to users, but they can be suggested when adding translations.
Translation
A translation is a connection between two words from different languages. The connection is bidirectional, i.e., the connection {kk:алма, en:apple} means that "алма" is translated as "apple," and "apple" is translated as "алма." The translator who added the translation is also recorded. A source can be specified for the translation. The translation is exported and shown to users as the pair of words that it connects.
User
A user is an entity that stores the data for registration and login. Words and translations keep a reference to the user record of their author.
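The entities described so far can be sketched as follows. This is a minimal illustration, not the service's actual schema: every class, field, and function name here (`Word`, `Translation`, `User`, `special_conjugation`, `translate`, `isolated_words`) is an assumption made for the example, and "isolated" is read as "takes part in no translation".

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class User:
    # Hypothetical minimal user record; login data is omitted here.
    user_id: int
    name: str


@dataclass(frozen=True)
class Word:
    text: str
    lang: str                          # e.g. "kk", "ru", "en"
    pos: str                           # part of speech, e.g. "noun", "verb"
    author_id: int                     # reference to the User who added the word
    special_conjugation: bool = False  # only meaningful for Kazakh verbs
    comment: Optional[str] = None      # e.g. a note that tells homonyms apart


@dataclass(frozen=True)
class Translation:
    first: Word
    second: Word
    translator_id: int                 # reference to the User who added the link
    source: Optional[str] = None

    def __post_init__(self) -> None:
        # A translation connects words from different languages.
        if self.first.lang == self.second.lang:
            raise ValueError("a translation must connect two different languages")


def translate(word: Word, translations: List[Translation]) -> List[Word]:
    """Return the words linked to `word`, following links in both directions."""
    result = []
    for t in translations:
        if t.first == word:
            result.append(t.second)
        elif t.second == word:
            result.append(t.first)
    return result


def isolated_words(words: List[Word], translations: List[Translation]) -> List[Word]:
    """Return the words that take part in no translation."""
    linked: Set[Word] = set()
    for t in translations:
        linked.update((t.first, t.second))
    return [w for w in words if w not in linked]


# Usage: one link answers lookups from either side.
alma = Word("алма", "kk", "noun", author_id=1)
apple = Word("apple", "en", "noun", author_id=1)
links = [Translation(alma, apple, translator_id=2)]
print(translate(alma, links)[0].text)   # apple
print(translate(apple, links)[0].text)  # алма
```

Storing one undirected link rather than two directed ones keeps the two directions from drifting apart when a translation is corrected.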
Review
To maintain the quality of the annotations, each added translation must undergo review, meaning verification by other users. A translation is considered verified once at least a set number of other users have confirmed its correctness. At the moment, confirmations from at least 2 people are required. A translation may also be rejected. The process for handling rejected translations still needs to be worked out in more detail.
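The verification rule can be sketched as a small check. The function and parameter names are hypothetical; the threshold of 2 comes from the text above. Rejection handling is left out on purpose, since that process is not yet defined.

```python
from typing import Set

REQUIRED_CONFIRMATIONS = 2  # current threshold stated in the text


def is_verified(author_id: int, approver_ids: Set[int],
                required: int = REQUIRED_CONFIRMATIONS) -> bool:
    """Return True if enough users other than the author approved the translation."""
    # The author's own approval does not count: review means other users.
    other_users = approver_ids - {author_id}
    return len(other_users) >= required


print(is_verified(author_id=1, approver_ids={2, 3}))  # True
print(is_verified(author_id=1, approver_ids={1, 2}))  # False
```

Keeping approvals in a set means a repeated confirmation from the same user is counted only once.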
Review queue
Added translations go into a general review queue that is displayed on the service's website. Users can select translations and either approve or reject them. To limit the growth of the queue, once it reaches its maximum size, adding new translations is prohibited until the queue shrinks.
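The size limit on the queue can be sketched as below. This is an illustration under assumptions: the class and method names are hypothetical, the maximum size is a made-up parameter (the text gives no value), and an item is assumed to leave the queue once its review reaches an outcome.

```python
from typing import Dict


class ReviewQueueFull(Exception):
    """Raised when a translation is added while the queue is at its maximum size."""


class ReviewQueue:
    def __init__(self, max_size: int) -> None:
        if max_size < 1:
            raise ValueError("max_size must be positive")
        self.max_size = max_size
        self._pending: Dict[int, str] = {}  # translation id -> display label

    def can_add(self) -> bool:
        return len(self._pending) < self.max_size

    def add(self, translation_id: int, label: str) -> None:
        # New translations are refused while the queue is full.
        if not self.can_add():
            raise ReviewQueueFull("review queue is full; review pending items first")
        self._pending[translation_id] = label

    def resolve(self, translation_id: int) -> None:
        """Remove a translation once its review has reached an outcome."""
        self._pending.pop(translation_id, None)

    def __len__(self) -> int:
        return len(self._pending)


queue = ReviewQueue(max_size=2)
queue.add(1, "алма / apple")
queue.add(2, "су / water")
print(queue.can_add())  # False: the queue is full
queue.resolve(1)
print(queue.can_add())  # True: reviewing an item frees a slot
```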
Gamification
Each user's contribution is calculated from their participation in adding and reviewing translations. The website features a leaderboard showing the users with the most contributions, both for all time and for a week.
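A leaderboard of this kind could be computed as in the sketch below. The text does not say how contribution is scored, so the point weights in `POINTS`, the event shape, and the function name are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

# Hypothetical weights; the actual scoring rule is not specified.
POINTS = {"add": 2, "review": 1}

Event = Tuple[str, str, datetime]  # (user name, action kind, time of the action)


def leaderboard(events: Iterable[Event], since: Optional[datetime] = None,
                top: int = 10) -> List[Tuple[str, int]]:
    """Rank users by points; with `since`, count only events from that time on."""
    scores: Counter = Counter()
    for user, kind, when in events:
        if since is None or when >= since:
            scores[user] += POINTS[kind]
    # Sort by score (descending), then by name, so ties have a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]


now = datetime(2024, 1, 15)
events = [
    ("user_a", "add", now - timedelta(days=1)),
    ("user_a", "review", now - timedelta(days=10)),
    ("user_b", "review", now - timedelta(days=2)),
    ("user_b", "review", now - timedelta(days=30)),
]
all_time = leaderboard(events)
weekly = leaderboard(events, since=now - timedelta(days=7))
```

The same function serves both views; only the `since` cutoff differs.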
Export
Approximately once a month, a snapshot of the database with verified translations is published in JSON Lines format. The export data license is CC-BY-4.0. When using the data, a reference to the Kazakh Verb Dictionary project is recommended but not required.
In the event of project closure, the final export will be published in the project's repository on GitHub.
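A consumer of the export could read it as sketched below. JSON Lines stores one JSON object per line; the record keys used here (`kk`, `en`, `ru`) are an assumption for illustration, and the real export may use a different schema. The helper names `read_jsonl` and `build_index` are likewise hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def build_index(records: Iterable[dict], src: str, dst: str) -> Dict[str, List[str]]:
    """Map each word in language `src` to its translations in language `dst`."""
    index: Dict[str, List[str]] = {}
    for record in records:
        # Skip records that do not connect the two requested languages.
        if src in record and dst in record:
            index.setdefault(record[src], []).append(record[dst])
    return index


# Hypothetical sample records; a real snapshot would be read from a file,
# e.g. with open(path, encoding="utf-8") passed to read_jsonl.
sample = [
    '{"kk": "алма", "en": "apple"}',
    '{"kk": "алма", "ru": "яблоко"}',
]
kk_to_en = build_index(read_jsonl(sample), "kk", "en")
en_to_kk = build_index(read_jsonl(sample), "en", "kk")
```

Because translations are bidirectional, the same records yield an index in either direction by swapping `src` and `dst`.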
Funding
The author independently covers the infrastructure expenses of the project within the established monthly limit.