Machine Translation and Large Language Models: Shaping the Future of Enterprise Translation?

Machine translation – and the need to automate translations – has been around for quite a while[i]. After the Second World War, even the best experts believed machine translation would be easy to achieve. All the computer needed was a dictionary and some rules on how words are put together in each language, and you would have machine translation[ii]. This did not happen; what followed over the next 60 years was a series of failed promises, at least in light of what has since transpired.

Neural Machine Translation

In 2015, Neural Machine Translation (NMT) was introduced[iii], and in the few years since, it has been adopted by most, if not all, machine translation providers. With good reason: even the early NMT systems produced translations that were far more fluent and grammatically well-formed than anything before them.

NMT didn’t happen all of a sudden. It grew out of decades of arduous work that finally produced results. The mathematical construct called the neural network has been around since the 1940s[iv], proposed, among others, by Alan Turing. A neural network, however, needs a lot of computing power, and there was simply not enough of it until the early 2010s. Once researchers realized that neural networks could be run efficiently on graphics processors[v], and once sufficiently powerful graphics processors became available, there was finally a computing environment where neural networks could thrive.

NMT caused quite an uproar in the localization industry because it looked like it could take over much more human translation work than previous machine translation systems. Because the business world is always under time pressure, translation providers and buyers very quickly began to work in ways that used even more machine translation than before. In fact, machine translation post-editing seems to have become a profession in its own right[vi].

Large Language Models

NMT has not been the end of the road. In late 2022 and early 2023, the so-called large language models (LLMs)[vii] became available to the general public. A large language model is a neural network that is similar to, but much larger and more complex than, an NMT engine.

What does this mean? Most of today’s NMT systems and large language models are built on a neural network architecture called the transformer. Both turn an input (a prompt) into a response. For neural machine translation, the prompt is simply the source text, and the engine will do nothing but translate it into the target language it was trained for.

A large language model takes this to the next level. It will read the prompt as a detailed question or as instructions on what to answer and how, and respond accordingly. For example, if you tell an LLM to “translate this”, it will translate whatever follows. If you tell it to “write a sonnet about” a certain topic, it will give you poetry about that topic. Some models are multi-modal[viii], which means they work not only on text but also on images or speech. For those, you can say “Draw me a lamb”, and they will indeed draw you a lamb.

LLMs were brought to a wide audience by a company called OpenAI, which was very clever about it: it did not just publish the large language model, it built the model into a service called ChatGPT.

Large language models have a complex relationship with translation because translation is not their main activity. They can do a lot of tasks, such as rephrasing a sentence, editing or correcting translations, or changing the wording in a piece of text so that it means something else. They can be asked to do various things with translation at different points of the translation workflow, and all that needs to be changed is the prompt they receive. This sounds exciting, but because translation is not their main purpose, an NMT engine can sometimes produce better results than an LLM.
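To make the idea of “only the prompt changes” more tangible, here is a minimal Python sketch. The template wording, the task names and the `build_prompt` helper are all made up for illustration; they are not taken from any particular product, and a real workflow would send the resulting prompt to whichever LLM service it uses.

```python
# Hypothetical prompt templates: the wording and the task names are illustrative only.
PROMPT_TEMPLATES = {
    "translate": (
        "Translate the following text from {source_lang} to {target_lang}:\n{text}"
    ),
    "post_edit": (
        "Correct the following {target_lang} translation of a {source_lang} text. "
        "Return only the corrected translation:\n{text}"
    ),
    "rephrase": (
        "Rephrase the following {target_lang} text in a more formal tone:\n{text}"
    ),
}


def build_prompt(task, text, source_lang="English", target_lang="French"):
    """Fill in the template for one workflow step.

    The model stays the same for every task; only the prompt text differs.
    """
    if task not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown task: {task}")
    return PROMPT_TEMPLATES[task].format(
        text=text, source_lang=source_lang, target_lang=target_lang
    )


if __name__ == "__main__":
    # Print the prompt for each workflow step, using the same input text.
    for task in PROMPT_TEMPLATES:
        print(build_prompt(task, "The invoice is due on Friday."))
        print("---")
```

The same input text ends up in three different prompts, which is all it takes to move an LLM from translating to post-editing to rephrasing.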

But how are NMT and LLMs so good? When it comes to natural language processing, the most striking difference between a statistical model and a neural network is how much context they take into account. A statistical model looks at 5-10 tokens at best (tokens, in this instance, typically stand for words or parts of words). In today’s neural models, the context window[ix] can range from a few hundred tokens to 4,096 tokens or more. This means that a neural model (both NMT and LLM) looks at much more text at a time than anything statistical, and it can process meaningful units of text in one go. This is probably the most important factor that makes them fluent.
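The effect of the window size can be illustrated with a few lines of Python. This is only a sketch under simplifying assumptions: the text is split on spaces, whereas real systems use subword tokenizers, and the `visible_context` helper is a made-up name that simply returns the tokens a model with a given window could see before a certain position.

```python
def visible_context(tokens, position, window):
    """Return the tokens before `position` that fit into a window of `window` tokens."""
    start = max(0, position - window)
    return tokens[start:position]


text = (
    "The bank that the company had used for its loans for many years "
    "finally raised its interest rates"
)
tokens = text.split()  # crude stand-in for a real subword tokenizer

# What can a model "see" when it is about to produce the last word?
position = tokens.index("rates")
narrow = visible_context(tokens, position, window=5)     # statistical-style window
wide = visible_context(tokens, position, window=4096)    # neural-style window

print(narrow)  # ['years', 'finally', 'raised', 'its', 'interest']
print("bank" in narrow, "bank" in wide)  # False True
```

With the narrow window, the word “bank” has already dropped out of sight by the time the model reaches “rates”; with the wide window, the whole sentence is still available.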

But no matter how good the output, no matter how insightful a chat with ChatGPT looks, it is still a machine imitating a human process on the surface. It will not re-create human cognitive processes, no matter how immensely complex it is.

A neural network, whether an NMT engine or a large language model, is practically a black box[x], which means that its inner workings are very difficult, if not impossible, to inspect and control. When you converse with an LLM and you are not happy with the output, changing your prompt might not influence it in the way you want. At the same time, LLMs (and other generative neural networks, including NMT systems) are known to hallucinate[xi]: they create answers that were not in the training data and are not necessarily true. Generative models have a sampling setting called the “temperature”, typically a number between 0 and 1, although some services accept higher values. The higher the temperature, the more likely the model is to choose less probable continuations while producing the output. Hallucinations can be a side effect of this process, although they also occur at low temperatures.
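One common way to apply a temperature is to divide the model’s raw scores by it before turning them into probabilities (the so-called softmax function). The Python sketch below shows this with four made-up candidate words and made-up scores; it illustrates the general technique and is not a description of how any specific service implements it.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up candidate words and scores, for illustration only.
words = ["cat", "dog", "lamb", "sonnet"]
scores = [4.0, 2.0, 1.0, 0.5]

for temperature in (0.2, 1.0):
    probabilities = softmax_with_temperature(scores, temperature)
    print(temperature, dict(zip(words, (round(p, 3) for p in probabilities))))
```

At a temperature of 0.2, nearly all the probability goes to the top-scoring word. At 1.0, the less likely words keep a noticeable share, so they will be picked more often when the model samples its output.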

Because it is a black box and because it hallucinates, a large language model is unpredictable, and in the world of business, this is not good. The engineering term for this is “non-deterministic”, which describes a system that does not necessarily give the same answer to the same question all the time.
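The difference between deterministic and non-deterministic behavior can be sketched in Python as follows. The candidate translations and their probabilities are invented for the example, and the two helper functions are hypothetical; a real model works with far more candidates, one token at a time.

```python
import random

# Made-up candidate translations of "Hello" and made-up probabilities.
CANDIDATES = ["Bonjour", "Salut", "Coucou"]
PROBABILITIES = [0.7, 0.2, 0.1]


def greedy_answer():
    """Always pick the most probable candidate: same input, same output."""
    best = max(range(len(CANDIDATES)), key=lambda i: PROBABILITIES[i])
    return CANDIDATES[best]


def sampled_answer(rng):
    """Draw a candidate at random according to its probability: the output can vary."""
    return rng.choices(CANDIDATES, weights=PROBABILITIES, k=1)[0]


rng = random.Random()  # not seeded, so every run may differ

# Ask the "same question" 20 times and collect the distinct answers.
print({greedy_answer() for _ in range(20)})
print({sampled_answer(rng) for _ in range(20)})
```

The first set always contains a single answer. The second usually contains several, which is the behavior an enterprise workflow has to plan for when it needs repeatable results.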

In addition, both neural machine translation and large language models need a lot of data and computing power. They require so much computing power that there is talk about the energy consumption of data centers running AI systems. An LLM needs so much data that developers could actually run out of it[xii]. Because of this, there is still the question of whether large language models, as they exist today, can be used at scale – which is very much what enterprises will want.

Language Coverage

Finally, let’s spend a word or two on language coverage[xiii]. People and companies want to use large language models for translation because they are so versatile and flexible. However, most of the large language models that are available today have not been optimized for translation. The amount of data in each language is very uneven. In most cases, English has the largest coverage, and there are other languages that can supply a lot of data, too. But there are others that don’t have that much, and when you hit a language that does not have a lot of data in your LLM, the translations will suffer. Don’t expect the same quality for English-French as for Hungarian-Mongolian, for example.

Translation Automation: What Does the Future Hold?

What does all this say about the future of enterprise translations? I think we can safely say that machine translation, either through NMT or through large language models, will be used more and more, and that localization workflows will increasingly be built around processes that start with machine translation. Large language models will help with the editing of translations, and will serve as aids in the interactive editing process as well. This, however, will need to be supervised and controlled by human linguists, who still need to learn how to translate and how to edit translations, whether the translation comes from a human or from a machine.

“There will always be a portion of content that needs the human translator. The proportion of this content might be shrinking – but when automation is so successful, more content will be translated on the whole. I am aware of all the developments in localization technology—and I still insist that in the foreseeable future, translators will eventually get more work, not less,” says Balázs Kis, Co-CEO at memoQ.

[i] John Hutchins (March 2006). "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954" (PDF). Hutchins Web. S2CID 132677. Archived from the original (PDF) on 2007-10-21. See also: Georgetown Experiment (imtqy.com)

[ii] Haifeng Wang, Hua Wu, Zhongjun He, Liang Huang, Kenneth Ward Church (2022). Progress in Machine Translation, Engineering, Volume 18, 2022, Pages 143-153, ISSN 2095-8099, https://doi.org/10.1016/j.eng.2021.03.023. (https://www.sciencedirect.com/science/article/pii/S2095809921002745)

[iii] NMT was officially born with the publication of these two scientific papers:

Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations; 2015 May 7–9; San Diego, USA; 2015.

Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems; 2014 Dec 8–13; Montreal, QC, Canada; 2014.

[iv] Turing, Alan (1948). Intelligent Machinery. In: The Essential Turing (Copeland, B. J., ed.), 2004, Oxford University Press. See also: https://en.wikipedia.org/wiki/Artificial_neural_network, AMT-C-11 | The Turing Digital Archive (cam.ac.uk)

[v] Kyoung-Su Oh, Keechul Jung (2004). GPU implementation of neural networks. In: Pattern Recognition, Volume 37, Issue 6, 2004, Pages 1311-1314, ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2004.01.013. (https://www.sciencedirect.com/science/article/pii/S0031320304000524)

[vi] Sharon O'Brien, Laura Winther Balling, Michael Carl, Michel Simard and Lucia Specia (eds., 2014). Post-editing of Machine Translation: Processes and Applications. Cambridge Scholars Publishing. 978-1-4438-5476-4-sample.pdf (cambridgescholars.com)

[vii] Large language model - Wikipedia

[viii] Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549. [2306.13549] A Survey on Multimodal Large Language Models (arxiv.org)

[ix] What is a context window? (techtarget.com)

[x] Why We Will Never Open Deep Learning’s Black Box | by Paul J. Blazek | Towards Data Science

[xi] Why do Large Language Models Hallucinate? | Medium

[xii] We could run out of data to train AI language programs | MIT Technology Review

[xiii] Lost in Translation: Large Language Models in Non-English Content Analysis - Center for Democracy and Technology (cdt.org)