20.500.12592/r41dg6

Lost in Translation: Large Language Models in Non-English Content Analysis

11 May 2023

Gabriel Nicholas and Aliya Bhatia, May 2023.

The Center for Democracy & Technology (CDT) is the leading nonpartisan, nonprofit organization fighting to advance civil rights and civil liberties in the digital age. [...]

Large language models in general and multilingual language models in particular have the potential to create new economic opportunities and improve the web for all. [...]

The abundance of English language data stems from its position as the official or de facto language of international business, politics, and media, itself a legacy of British colonialism and American neocolonialism and the subsequent erasure of regional and indigenous languages. [...]

A language model is a mathematical function trained to solve a text prediction task like the following, “Given a sequence of words, predict what word will likely come next.” For example, a language model might be given the phrase “I was a bad student, I used to skip ____,” and generate as an output that there is a high percent chance the missing word is “class,” a low percent it is “rope,” and a n. [...]

As a result of these forces, English also dominates the field of natural language processing, and there is vastly more raw text data available in English than in any other language by orders of magnitude (Joshi et al., 2020). English has the most digitized books and patents, the largest Wikipedia, and the biggest internet presence.

[Figure: values by language: English 311, German 27, Arabic 18, French 16.]
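The text prediction task quoted in the excerpt ("Given a sequence of words, predict what word will likely come next") can be illustrated with a very small sketch. This is not how the report's authors or any production system build language models: it is a toy count-based bigram model, assuming a made-up four-sentence corpus (`TOY_CORPUS`) and hypothetical helper names (`train_bigram_counts`, `next_word_probabilities`). It only shows the idea of assigning a higher probability to "class" than to "rope" after the word "skip".

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
TOY_CORPUS = [
    "i used to skip class",
    "we would skip class often",
    "they skip class sometimes",
    "children skip rope outside",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts


def next_word_probabilities(counts, prev_word):
    """Turn follower counts for prev_word into relative frequencies."""
    followers = counts.get(prev_word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word never appeared with a follower in the corpus
    return {word: n / total for word, n in followers.items()}


counts = train_bigram_counts(TOY_CORPUS)
probs = next_word_probabilities(counts, "skip")

# Print candidates from most to least likely.
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

A real large language model conditions on the whole preceding sequence rather than one word and learns from vastly more text, which is why the amount of data available in each language matters so much for the quality of its predictions.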

Authors

CDT Research

Pages
50
Published in
United States of America

Tables