Kazakhstan to launch its first large language model

Kazakhstan plans to launch its first large language model (LLM), KazLLM, on December 16th. The date coincides with the 33rd anniversary of the country’s independence.

The Institute of Smart Systems and Artificial Intelligence (ISSAI) announced the launch at a briefing at Nazarbayev University on July 18th. Data collection for the project started in March, and the model is being trained on a cloud computing platform equipped with NVIDIA H100 nodes.

Students and experts join forces in AI development

Students from Nazarbayev University and Astana IT University, Bolashak scholarship graduates and local participants are collaborating on the KazLLM project. The initiative has two aims: to build KazLLM and to develop a workforce capable of producing intelligent AI tools and applications.

ISSAI founder and head Professor Atakan Varol wants the project to narrow the technological gap with other countries. He said that once the model is complete, Kazakhstan would be only 18 months behind leading nations in terms of technology. Integrating voice features is expected to shorten that gap to 12 months, while further advances in vision-language models may put Kazakhstan at the forefront of AI development.

Data for the project comes from sources including Wikipedia articles, news outlets, government websites and open datasets such as Common Crawl. ISSAI has spent more than five years building natural language processing datasets designed specifically for the Kazakh language. This collection is important for training KazLLM effectively and accurately.

Kazakhstan hopes to tackle national and information security with AI innovation

The KazLLM project has national and information security implications. By building a language model locally, Kazakhstan hopes to minimize its dependence on foreign technology, which it fears may lead to data breaches and the presentation of distorted information.

Madina Abdrakhmanova, Deputy Director for External Relations and Lead Data Scientist, highlighted the breadth of the model’s training corpus. “It will consist of a minimum of 100 billion tokens in Kazakh, Russian, English and Turkish languages with each language being represented by 25 billion tokens,” Abdrakhmanova said.

Currently, the project has more than 30 billion tokens, including 26 billion produced with the Tilmash translator, which converts English data to Kazakh. This translated data is intended to help the model generate coherent and accurate text in the Kazakh language.
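
The corpus figures quoted here imply some simple arithmetic, which the short Python sketch below works through. It uses only the numbers reported in the article; the function name `corpus_progress` and the dictionary keys are illustrative, and "more than 30 billion" and "a minimum of 100 billion" are treated as exact values purely for the sake of the calculation.

```python
# Token counts as reported in the article (assumed exact for illustration).
TARGET_TOTAL = 100_000_000_000   # "a minimum of 100 billion tokens"
LANGUAGES = ["Kazakh", "Russian", "English", "Turkish"]
COLLECTED = 30_000_000_000       # "more than 30 billion tokens"
TRANSLATED = 26_000_000_000      # produced with the Tilmash translator


def corpus_progress(target_total, languages, collected, translated):
    """Summarize how far the collected corpus is from the stated target."""
    return {
        # An even split of the target across the listed languages.
        "per_language_target": target_total // len(languages),
        # Fraction of the overall target collected so far.
        "collected_share": collected / target_total,
        # Fraction of the collected tokens that came from translation.
        "translated_share_of_collected": translated / collected,
        # Tokens still needed to reach the stated minimum.
        "remaining": target_total - collected,
    }


progress = corpus_progress(TARGET_TOTAL, LANGUAGES, COLLECTED, TRANSLATED)
for key, value in progress.items():
    print(f"{key}: {value}")
```

On these figures, the even split reproduces the 25 billion tokens per language that Abdrakhmanova cited, roughly 30% of the minimum target has been collected, and about 87% of the tokens collected so far were produced by translation.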

ISSAI intends to create a user-friendly interface for KazLLM, similar to those of OpenAI’s models, to make it more accessible. Once complete, the interface will support model interaction, reinforcement learning from human feedback and tuning for different situations to maximize performance. KazLLM will be offered as a general subscription package and as an API for experienced users.