Linguistic diversity in the age of AI
Why is it difficult for AI language tools to learn Latvian? ChatGPT explains: “There is not enough data and text in Latvian. The Latvian language has a complex grammar and language structure – nine conjugations, different genders, and figures – with multiple dialects and regional differences. It is one of the less common languages in the world making it not a priority for AI models.”
The vast majority of large language models (LLMs), such as ChatGPT, are trained predominantly on English data sets, simply because English is the lingua franca of the Internet. Other major languages, mostly European ones such as German, French, and Spanish, are also well represented. Most other languages are not, leading to <a href="https://www.economist.com/science-and-technology/2024/01/24/why-ai-needs-to-learn-new-languages" style="text-decoration: underline !important;">less representation of the worldviews and cultures that these languages carry with them</a>.
Now, Tilde, a Latvian language technology company, has just won the European Commission's Large AI Grand Challenge, which will allow them to develop a foundational LLM for European languages. It will focus particularly on underrepresented European languages, especially Eastern European and Baltic languages. These languages are “poorly covered in the current models”, the company stated following their win. They also claim that their model will improve AI applications for more than 155 million Europeans.
A founding principle of the EU is multilingualism – it's even written into the EU's Charter of Fundamental Rights. As LLMs become more integrated into people's everyday lives, researchers and AI companies are increasingly focusing on how to improve linguistic diversity and thereby ensure digital language equality. As Pēteris Jurčenko, Chairman of the Latvian Open Technology Association, <a href="https://www.lsm.lv/raksts/dzive--stils/tehnologijas-un-zinatne/20.02.2024-maksligais-intelekts-praktiski-ka-to-veiksmigak-lietot-latviski.a543080/" style="text-decoration: underline !important;">explains</a>: “The main idea why these tools need to be made Latvian is to keep the Latvian language alive. Not so that we can use artificial intelligence, but so that the Latvian language does not disappear.”