
The competition to develop AI that is as multilingual as Europe.
The European Union has 24 official languages, along with numerous unofficial ones spoken throughout the continent. Including European nations outside the union adds at least a dozen more languages to the total. When you take into account dialects, endangered languages, and those introduced by migrants to Europe, the count reaches into the hundreds.
In the technology sector, there is a consensus that the US holds a dominant position, which extends to the languages used online. This dominance arises from American institutions, standards bodies, and companies that shaped the development of computers, operating systems, and software during their early phases. While this landscape is evolving, English continues to dominate, with approximately 50% of websites in English, even though it is the first language of only about 6% of the global population. Spanish, German, and Japanese follow, each representing around 5-6% of the web.
As we explore the emerging realm of AI-driven applications and services, many rely on large language models (LLMs) fueled by data. Since much of the data for these models is controversially sourced from the web, LLMs primarily function in English. This issue poses a challenge as we navigate a transformative shift in technology driven by the swift advancement of AI tools.
Europe is already home to several prominent AI companies and initiatives, such as Mistral and Hugging Face, and Google DeepMind was originally founded in Europe. The continent also hosts research projects focused on developing language models aimed at improving AI's understanding of lesser-spoken languages.
This article examines these initiatives, questioning their efficacy and whether users primarily default to English versions of these tools. As Europe strives for independence in AI and machine learning, does it possess the necessary companies and expertise to achieve its objectives?
Terminology and Technology Overview
To better understand the ensuing discussion, you don’t need to be versed in the intricacies of model creation, training, or operation. However, grasping a few fundamental concepts related to models and their language support is beneficial.
Unless documentation specifies that a model is multilingual or cross-lingual, requesting input or responses in unsupported languages may result in back-and-forth translation or replies in a language the model does understand. Both methods can yield unreliable and inconsistent outcomes, particularly for low-resource languages.
High-resource languages, like English, have ample training data, while low-resource languages, such as Gaelic or Galician, generally have significantly less, leading to poorer performance.
Another complex topic regarding models is “open.” While the term “open source” has a well-defined meaning in software, the precise definition of “open” in the context of models is still being debated. In short, the label “open” applied to a model may not always signify the same thing across different contexts.
Two additional key terms include:
- Training: This process teaches a model to make predictions or decisions based on input data.
- Parameters: These are variables learned during training that dictate how a model maps inputs to outputs, essentially shaping how it interprets and responds to queries. Generally, a higher number of parameters indicates a more complex model.
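To make both terms concrete, here is a deliberately tiny sketch in plain Python (nothing like a real LLM): a model with a single parameter, adjusted during training by gradient descent until it maps inputs to the right outputs.

```python
# "Training" adjusts "parameters" so the model maps inputs to outputs.
# Here the model is y = w * x, and training should discover w = 2
# from example (input, output) pairs.

def train(examples, steps=200, lr=0.05):
    w = 0.0  # the model's single parameter, before training
    for _ in range(steps):
        for x, y in examples:
            pred = w * x
            grad = 2 * (pred - y) * x  # gradient of squared error w.r.t. w
            w -= lr * grad             # gradient-descent update
    return w

examples = [(1, 2), (2, 4), (3, 6)]    # data implying y = 2x
w = train(examples)
print(round(w, 3))  # close to 2.0
```

A real LLM works on the same principle, except with billions of parameters and text rather than numbers as input and output.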
With that brief introduction, how are European AI companies and projects enhancing these processes to bolster language support?
Hugging Face
Typically, when someone wishes to share code, they provide a link to their GitHub repository, whereas for sharing models, they often use Hugging Face. Founded in 2016 by French entrepreneurs in New York City, Hugging Face plays a crucial role in fostering communities and advocating for open models. In 2024, it launched an AI accelerator for European startups and partnered with Meta to develop translation tools based on Meta’s “No Language Left Behind” model. It also played a pivotal role in the development of the BLOOM model, a groundbreaking multilingual framework that established new benchmarks for international collaboration and openness.
Hugging Face serves as a valuable resource for assessing language support within models. Currently, Hugging Face features 1,743,136 models and 298,927 datasets. A look at its leaderboard for monolingual models reveals the following rankings for those tagged as supporting European languages:
| Language | Language Code | Datasets | Models |
| --- | --- | --- | --- |
| English | en | 27,702 | 205,459 |
| English | eng | 1,370 | 1,070 |
| French | fra | 1,933 | 850 |
| Spanish (Español) | es | 1,745 | 10,028 |
| German (Deutsch) | de | 1,442 | 9,714 |
Although the tags generally reflect the languages supported, duplication exists (note “en” and “eng” above) because the community can freely add values.
English overwhelmingly dominates the models listed. The same trend is evident in the datasets on Hugging Face, which largely lack non-English data.
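Counts like those above can be explored programmatically: the Hub exposes a public REST API whose `filter` query parameter matches tags, including language codes. The sketch below uses only Python's standard library; the exact endpoint shape and response fields are assumptions based on the public API documentation, so treat it as illustrative rather than definitive.

```python
import json
import urllib.parse
import urllib.request

HUB_API = "https://huggingface.co/api/models"

def build_query(lang_code: str, limit: int = 5) -> str:
    """Build a Hub API URL that filters models by a language tag."""
    params = urllib.parse.urlencode({"filter": lang_code, "limit": limit})
    return f"{HUB_API}?{params}"

def models_for_language(lang_code: str, limit: int = 5) -> list:
    """Return IDs of models carrying the given language tag."""
    with urllib.request.urlopen(build_query(lang_code, limit)) as resp:
        return [entry["id"] for entry in json.load(resp)]

if __name__ == "__main__":
    try:
        # "gl" is the tag for Galician, a low-resource language.
        print(models_for_language("gl"))
    except OSError:
        # No network: at least show the query that would have been sent.
        print(build_query("gl"))
```

Swapping `"gl"` for `"en"` or `"de"` shows at a glance how unevenly models are distributed across language tags.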
What does this imply?
Lucie-Aimée Kaffee, the EU Policy Lead at Hugging Face, clarified that the tags indicate whether a model has been trained to comprehend and process a specific language, or if a dataset contains materials in that language. She noted

