In 2019, English is undoubtedly the main language used in AI. However, artificial intelligence is applied in many scenarios and countries, and across different languages. Creating and training algorithms with data in other languages such as Spanish would, for example, open the door to a global market of 580 million Spanish speakers. French would add around 350 million and Japanese 140 million. Spanish represents just over 27% of the world market for natural language processing (NLP) technologies, while French and Japanese NLP technologies account for around 5% each. So, are Japanese, Spanish and French still foreign languages for artificial intelligence?

Siri, Cortana, Alexa and the Google Assistant speak Spanish, French and Japanese, among other languages. But English is their mother tongue. “Machines have a hard time understanding the accents of different parts of Spain and the varieties of Spanish in America, while they work best in English because that’s the language of most scientific essays, research, and publications. It is the same case with French and regional accents from Canada and Africa. Japanese, although it is pretty homogeneous, lacks data,” says Mercedes Garcia, Chief Scientist at PangeaMT and an expert in artificial intelligence (AI) and adaptive language technologies.

In her opinion, answering questions that involve subjectivity and require prior knowledge of the context is one of the main pitfalls for AI that aims to machine-translate and interpret human input. The same challenge arises when recognizing and imitating human voices. “An intelligent answer is not learnt with grammar classes – it also requires knowing what words and expressions are appropriate in certain contexts and registers,” she says.

But if algorithms are given a large amount of information and case examples, in the form of huge sets of questions exchanged between humans together with their likely answers, they will have enough information to at least reproduce similar situations, even if they cannot emotionally emulate a context. “AI quality improves as contextual information is supplemented with more training data, but for this we need an enormous amount of in-domain data, especially if there are different registers, dialects, linguistic varieties or professional jargon,” notes Manuel Herranz, CEO and founder of PangeaMT. “We have created adaptive systems which quickly learn to imitate a user’s style and preferences when translating, for example.”

Manuel recalls that, second to English, AI’s main language is Chinese, due to “its ability to penetrate users’ daily data through app usage, the government’s commitment to the development of this technology and the impact on millions of people. However, a lot of the ‘free data mining techniques’ that some American and Chinese companies use are simply illegal in the EU and Japan.”

Are French, Spanish and Japanese good languages for artificial intelligence?

But what about Spanish, the world’s second mother tongue by number of speakers? And French, spoken widely in the EU, in many African countries and, of course, in Canada? And Japanese, the language of a country known for its innovations and love of robotics? “The datasets in those languages that can be used to train AI are still small when compared with English,” says Manuel Herranz.

It is therefore not surprising that, according to expert figures, Spanish still represents around 27% of the world’s natural language processing (NLP) technology market, which according to the consulting firm Credence Research will grow at an annual rate close to 12% between 2018 and 2026, when it will reach $28.6bn.
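To put the Credence Research projection in perspective, the arithmetic can be run backwards. The short Python sketch below assumes the growth rate is exactly 12% per year and compounds annually over the eight years from 2018 to 2026; neither assumption is stated in the forecast itself, and the function name is purely illustrative.

```python
def implied_base(final_value: float, annual_rate: float, years: int) -> float:
    """Back out the starting value implied by constant compound annual growth.

    Assumption (not from the forecast): growth compounds once per year
    at a fixed rate for the whole period.
    """
    return final_value / (1.0 + annual_rate) ** years


# Figures quoted in the article: $28.6bn in 2026, roughly 12% per year from 2018.
base_2018 = implied_base(final_value=28.6, annual_rate=0.12, years=2026 - 2018)
print(f"Implied 2018 market size: ${base_2018:.1f}bn")
```

Under those assumptions the forecast implies that the market more than doubles over the period, since 1.12 raised to the eighth power is a little under 2.5.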

Manuel is convinced that cross-fertilization between language processing and the AI industry can become one of the “catalysts” of Europe’s and Japan’s competitiveness in artificial intelligence. Companies from all sectors hold a lot of legacy information in Spanish, French and Japanese with which they can train machines for specific solutions, from fintech to medtech, insurance, legaltech and more. “The challenge is to find, clean, refine and use the data correctly. Secondly, this data improves our own algorithms. Thirdly, we market it in a potential market with 580 million Spanish speakers, 300 million French speakers and 140 million Japanese speakers. Whatever we develop in Spanish is highly replicable in French and Japanese – we are very happy to have partnerships in place with Japanese companies and EU projects where French is prominent.”
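What "clean and refine" can mean in practice for translation data is easy to illustrate. The Python sketch below is a generic example, not PangeaMT's pipeline: the function name `clean_pairs`, the length-ratio threshold and the sample sentences are all assumptions made for illustration. It normalizes Unicode and whitespace, drops empty segments, discards pairs whose lengths differ implausibly, and removes duplicates.

```python
import unicodedata


def clean_pairs(pairs, max_ratio=3.0):
    """Return a filtered list of (source, target) sentence pairs.

    max_ratio is a hypothetical character-length threshold; it would need
    tuning per language pair (Japanese text, for instance, is far more
    compact than Spanish or French).
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Normalize Unicode composition and collapse runs of whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        # Drop pairs where either side is empty after normalization.
        if not src or not tgt:
            continue
        # Drop pairs whose lengths differ too much to be a likely translation.
        if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
            continue
        # Drop case-insensitive duplicates, keeping the first occurrence.
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned


# Made-up Spanish-French sample, for illustration only.
sample = [
    ("Hola  mundo", "Bonjour le monde"),
    ("Hola mundo", "Bonjour le monde"),  # duplicate after normalization
    ("", "Vide"),                        # empty source side
    ("Sí", "Oui, bien sûr, et merci beaucoup pour votre aide"),  # length mismatch
]
result = clean_pairs(sample)
print(result)
```

Real pipelines typically add further steps, such as language identification and alignment scoring, which are left out here to keep the sketch short.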

Multilingual data gathering for machine learning

There are “great efforts” currently being made to highlight the importance of Spanish, and of language technologies in general, in the future of AI. PangeaMT’s CEO mentions the Plan for the Promotion of Language Technologies, an initiative of the Secretariat of State for Digital Advancement, and the EU’s new NTEU project, which is gathering 15 million quality sentences for machine learning in all of the EU’s official languages except English in order to create neural machine translation engines for public administrations. “The Plan is one of the greatest efforts of Spain to connect the university world of research in language technologies with the corporate world, which is adopting language solutions at dizzying speed in internal and external processes to become more efficient.”

Personnel in charge of multilingual gathering efforts in the NTEU project: Laurent Bie, Manuel Herranz, Maite Melero, Artūrs Vasiļevskis, Mercedes García, Sénead O’Gorman, Oraianthi Toumazatou, Valters Sics, Amando Estela, Ricardo Superbo

According to Manuel, all sectors of the economy can benefit from language processing technologies, which are building a new kind of relationship between companies, institutions and their communities and users, and between public administrations and citizens, in an increasingly multilingual world. In his opinion, the benefits of applying AI to language processing in languages beyond English are already palpable in health, banking, automotive, insurance, education and tourism: providing millions of translated sentences and thus Big Data, processing voice data in milliseconds for law enforcement, or making technology more accessible to groups such as people with disabilities, the elderly and children.

“But we have to be prepared for what’s to come. In the coming years we will see how virtual assistants and personal assistants that use voice as an interface will change the way we have understood, until today, building brands, creating relationships in a conversational environment, generating experiences and content, or selling to and serving customers,” concludes Manuel.

Spanish figures
7.6% of the world’s population is Spanish-speaking: 580 million people.
Almost 483 million people have Spanish as their mother tongue. It is the official language of 21 countries.
Spanish is the second mother tongue in the world by number of speakers, after Mandarin Chinese, and the third language in a global count of speakers after English and Chinese.
In 2060, the United States will be the second-largest Spanish-speaking country in the world, after Mexico. Nearly one in three Americans will be Hispanic.
About 22 million students study Spanish as a foreign language.
Source: Instituto Cervantes

French figures
3.8% of the world’s population is French-speaking: 280 million people.
Some 80 million people have French as their mother tongue, and 280 million speak it daily. It is the official language of 29 countries.
French is the sixth mother tongue in the world by number of speakers.
In 2050, the number of native speakers using French daily will reach 650-700 million.
About 120 million students study French as a foreign language.
Source: Wikipedia, Babbel, Worldpopulationreview

Japanese figures
128 million people speak Japanese as their mother tongue. It is the national language of Japan and has official minority status in Palau (Angaur).
Japanese is the 13th most spoken language in the world.
Source: Wikipedia