Those training data sets are used to construct their neural net, NLLB-200. They start with the ubiquitous Transformer language model from Google that underlies most language translation today. They use a 54-billion-parameter Transformer, which is not huge (some models are approaching a trillion parameters), but they make a key modification.

In between the individual layers of the network known as "attention heads," the authors interleave conditional execution branches known as a sparsely gated mixture of experts. Basically, the experts can choose to turn off or on some of those 54 billion parameters when making predictions, so that the neural network can change its nature with each task.

"Sparsely Gated Mixture of Experts (MoE) models are a type of conditional compute models that activate a subset of model parameters per input, as opposed to dense models that activate all model parameters per input," they explain. The value of the MoE, they explain, is that they "unlock significant representational capacity while maintaining the same inference and training efficiencies in terms of FLOPs as compared to the core dense architecture."

The NLLB-200 network, right, inserts "mixture of experts" elements in between the standard attention blocks of the Transformer model, left.

(The authors even found a sweet spot for this approach: "Inserting MoE layers at an interval of every 4 Transformer blocks exhibits the best performance, in particular improving performance in very-low resource settings.")

Along with the training set, the authors develop a new benchmark data set, FLORES-200, "a high quality, many-to-many benchmark dataset that doubles the language coverage of a previous effort known as Flores-101." The data set is "created with professional human translators who translate the FLORES source dataset into the target languages and a separate group of independent translation reviewers who perform quality assessments of the human translations and provide translation feedback to the translators."

Then, they test how NLLB-200 does on FLORES-200. The results, as mentioned in the summary piece above, are an improvement of 44% in comparison to prior translation programs, as measured by common automated scores such as BLEU and chrF.

In addition to the automated scores, the authors had humans read translations and score them, and that's where some cracks appear. They make extensive comparisons between different versions of those scores.

Using a protocol first suggested in 2012 by Eneko Agirre and colleagues called "Semantic Textual Similarity," the Meta team employ a variant called "XSTS," which they introduced in a separate paper in May. XSTS asks humans to rate translations on a scale of 1 to 5, with 1 being the worst (the two sentences have nothing to do with one another) and 5 being the best (they're pretty much saying the same thing, according to a person).

"In short, XSTS is a human evaluation protocol that focuses on meaning preservation far more than fluency," they write. "For low-resource languages, translations are usually of weaker quality, and so we focus far more on usable (meaning-preserving) translations, even if they are not fully fluent."

Overall, NLLB-200 achieves an average XSTS score of 4.15 on out-of-English directions and 3.75 on into-English directions. The overall score is not bad when comparing how a baseline Transformer does for translations into and out of English and some other language, but they actually see worse results on one pair, from English into Greek.
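To make the two kinds of scoring concrete, here is a minimal sketch of how corpus-level BLEU and chrF are typically computed (using the sacrebleu Python package) and how per-direction XSTS averages are formed from 1-to-5 human ratings. The sentences, ratings, and direction labels below are placeholders for illustration, not data from the NLLB-200 paper.

```python
# Sketch of automated scoring (BLEU, chrF via sacrebleu) and XSTS-style
# averaging of human ratings. All example data here is made up.
from statistics import mean

from sacrebleu.metrics import BLEU, CHRF

# System outputs and reference translations for one language direction
# (toy examples; a FLORES-style test set supplies ~1,000 pairs per language).
hypotheses = [
    "The cat sits on the mat.",
    "He did not go to school today.",
]
references = [
    "The cat is sitting on the mat.",
    "He didn't go to school today.",
]

# Corpus-level automated scores: BLEU compares word n-grams against the
# references, chrF compares character n-grams.
bleu_score = BLEU().corpus_score(hypotheses, [references])
chrf_score = CHRF().corpus_score(hypotheses, [references])
print(f"BLEU: {bleu_score.score:.1f}  chrF: {chrf_score.score:.1f}")

# XSTS-style human evaluation: raters assign a 1-5 meaning-preservation score
# per sentence pair; the reported figure is the average per direction.
xsts_ratings = {
    "eng_Latn-ell_Grek": [4, 3, 5, 4],  # hypothetical ratings, English -> Greek
    "ell_Grek-eng_Latn": [4, 4, 5, 5],  # hypothetical ratings, Greek -> English
}
for direction, ratings in xsts_ratings.items():
    print(f"{direction}: mean XSTS = {mean(ratings):.2f}")
```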
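Stepping back to the architecture described earlier, the sparsely gated mixture-of-experts idea can also be sketched in code. The following is a generic top-2-routing feed-forward layer in PyTorch, not Meta's implementation; the class name, expert count, and dimensions are illustrative only.

```python
# Minimal sketch of a sparsely gated mixture-of-experts feed-forward layer:
# a small gating network picks, per token, which 2 of the experts to run,
# so only a fraction of the layer's parameters is active for a given input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts and
        # combine their outputs with softmax-normalized gate weights.
        gate_logits = self.gate(x)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


# In an NLLB-style encoder, a layer like this would replace the dense
# feed-forward sublayer in every fourth Transformer block, with the
# remaining blocks left dense.
layer = SparseMoEFeedForward(d_model=16, d_ff=64)
tokens = torch.randn(5, 16)        # 5 token vectors
print(layer(tokens).shape)         # torch.Size([5, 16])
```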
A surprise buried in the report is that despite a measurable improvement across the board on a larger group of languages, as indicated by automatic scoring systems, when it comes to human evaluation of the quality of translations, the researchers' neural net, known affectionately as "No Language Left Behind Two Hundred," or NLLB-200, fails to show much improvement in a number of language cases, including not only low-resource languages such as Oromo but also languages with prevalent translation material such as Greek and Icelandic.

Also: Meta's latest AI model will make content available in hundreds of languages

The authors found that when they made their neural net bigger, which should mean more powerful, they actually saw diminishing returns when translating sentences from English to another language, and some negative effects when translating between non-English sentences. The lesson is that despite an ability to bring up average scores, the intricacies of creating translations that are meaningful, at least as far as a human views the translation, cannot simply be automated.