Benchmarking LLM Knowledge of Kazakh Vocabulary

LLMs seem useful for anyone who is learning the Kazakh language or developing tools for it. But which LLM should one choose? Benchmarks can help decide.

Here we present a small benchmark, created mostly to test the waters. Perhaps it will evolve over time to cover newer models and use better methods.

Models

We chose three major providers: OpenAI, Anthropic and Google. Each of them offers a variety of models. We picked two from each provider: a cheap versatile model and a more powerful versatile model.

From OpenAI, we measured GPT-4o-mini and GPT-4o. There are also more recent models from the o1 family, but they seem too specialized for our purpose.

From Anthropic, we have taken Claude 3 Haiku and Claude 3.5 Sonnet.

From Google, we have benchmarked Gemini 1.5 Flash and Gemini 1.5 Pro.

Method

Measurements were conducted on 2024.10.05.

There are many possible ways to measure knowledge of the Kazakh language. We approached it from a potential use case. What if we are building a Kazakh-Russian dictionary from scratch and want to use an LLM? Then we might be interested in the ratio of correct word translations produced by a model.

This benchmark is small. We measured translations of only 25 random Kazakh words: 15 verbs and 10 nouns. Running all of them against all six models cost under $0.20 in total, so scaling up 10x (250 words, under $2) should still be affordable.

However, we had to collect an extensive set of acceptable translations for each word to give the LLMs more leeway. This part is quite time-consuming and limits the scale of the experiment.

A single prompt written in Russian was used for all models. The prompt language was intentionally chosen to match the translation target language. The prompt is straightforward and was tweaked very little, if at all. One note on Gemini 1.5 Flash: it struggled to maintain the requested output format and lost points because of it.

A Python script ran the prompt for each of the 25 input words against each of the six chosen models. It helped that each provider ships a Python package for easier integration.
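The loop described above can be sketched roughly as follows. This is a minimal sketch, not the script used for the benchmark: the function name run_benchmark, the idea of passing each model as a plain callable, and the prompt template are all assumptions made for illustration. In a real run, each callable would wrap the corresponding provider's Python package; here a stub stands in so the example works without API keys.

```python
from typing import Callable, Dict, List

# Hypothetical type: a function that sends one prompt to one model
# and returns the raw text of the reply.
AskFn = Callable[[str], str]


def run_benchmark(
    words: List[str],
    models: Dict[str, AskFn],
    prompt_template: str,
) -> Dict[str, Dict[str, str]]:
    """Send the same prompt for every word to every model.

    Returns {model_name: {word: raw_reply}}.
    """
    results: Dict[str, Dict[str, str]] = {}
    for model_name, ask in models.items():
        results[model_name] = {}
        for word in words:
            prompt = prompt_template.format(word=word)
            try:
                results[model_name][word] = ask(prompt)
            except Exception:
                # Assumption: a failed request is recorded as an empty
                # reply so that one error does not abort the whole run.
                results[model_name][word] = ""
    return results


# Placeholder prompt, NOT the prompt used in the benchmark.
PROMPT = "Переведи казахское слово «{word}» на русский язык."

# Stub model for demonstration; a real callable would call a provider SDK.
stub_models = {"stub-model": lambda prompt: "дом"}

replies = run_benchmark(["үй"], stub_models, PROMPT)
print(replies)
```

Keeping the provider-specific code behind a simple callable means the loop itself stays identical for all models, which matches the single-prompt setup described above.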

Scoring: for each word, we extracted the translations produced by the LLM and sorted them into known (present in our collected set) and unknown. We then averaged the ratio known / (known + unknown) over the 25 words.
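The scoring rule can be illustrated with a short sketch. The function names, the case and whitespace normalization, and the choice to score a word as zero when no translations could be extracted are assumptions, not details taken from the original script. The sample words and translations are made up for the example and are not benchmark data.

```python
from typing import Dict, List


def word_score(model_translations: List[str], known_translations: List[str]) -> float:
    """Ratio known / (known + unknown) for a single word."""
    # Assumption: compare case-insensitively and ignore surrounding spaces.
    known = {t.strip().lower() for t in known_translations}
    produced = {t.strip().lower() for t in model_translations if t.strip()}
    if not produced:
        # Assumption: no extractable translations counts as a zero score.
        return 0.0
    hits = sum(1 for t in produced if t in known)
    return hits / len(produced)


def benchmark_score(
    outputs: Dict[str, List[str]],
    reference: Dict[str, List[str]],
) -> float:
    """Average of the per-word ratios over all words in the reference set."""
    scores = [word_score(outputs.get(word, []), reference[word]) for word in reference]
    return sum(scores) / len(scores)


# Illustrative data only.
reference = {
    "алу": ["брать", "взять", "получать"],
    "үй": ["дом", "жилище"],
}
outputs = {
    "алу": ["брать", "давать"],  # one known, one unknown -> 0.5
    "үй": ["дом"],               # one known -> 1.0
}

print(benchmark_score(outputs, reference))  # 0.75
```

Note that this ratio rewards precision only: a model that returns a single correct translation scores as high as one that returns several correct ones.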

Conclusion

The results are shown in the chart above.

There is a noticeable difference between the cheap and the more expensive option from each provider. For this task, it makes sense to prefer the more expensive and powerful models over the cheapest ones.

Gemini performed noticeably worse than its competitors. This can be explained in part by the fact that Gemini doesn't officially support the Kazakh language, which is far from Google's focus. Other providers are less specific about the languages supported by their models.

To our surprise, GPT-4o translated Kazakh words better than Claude 3.5 Sonnet. However, Sonnet still performed quite well.

Based on these benchmark results, we recommend using GPT-4o and Claude 3.5 Sonnet for tasks related to the Kazakh language.