
Is Segmentation So Bad for Translation?

Balázs Kis - 01/07/2025


Segmentation in localization receives a lot of bad press nowadays. It is often listed as the number one barrier to good translation quality. To me, it seems that segmentation takes more of the blame for quality issues in automated translation than the automation mechanism itself does.

It is also the number one reason why some people think that translation management systems are outdated. These systems often present a segmented view of source and translated content, and they also use these segments when retrieving suggestions. 

Segmentation is thought to be a barrier to efficient automation because the context window of large language models – and even neural machine translation engines – is much larger than the size of a segment. It is indeed true that these models produce more accurate and more pertinent translations when they are allowed to see a larger context. 

The picture, as always, is more colorful than that. We need segmentation, but we also need context, and we need to be able to look at the content in its entirety. It would be a grave mistake to remove segment-based mechanisms from localization.

Contrary to popular belief, a segmented document interface and segmented lookup do not lock you into a segments-only world. A segmented view is something humans need in order to focus. There is a reason why language has structure, especially in writing: words, sentences, paragraphs, sections, chapters, and the list goes on. There is a reason we sometimes write numbered or bulleted lists. There is a reason we draw up tables. These are all structures that help the human thought process.

There is a concept I like to call “minimum automation.” This is, in practice, the principle of using the minimum resources sufficient to automate an activity. For example, if you want to reuse previous translations – to speed up the process and for better consistency – a database lookup will finish much quicker, use far less energy, and do a better (more predictable) job than a large language model. But it is also most efficient if it looks up one segment at a time: a meaningful unit that a human can also work with.
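To make the contrast concrete, here is a minimal sketch, in Python, of what segment-level reuse can look like: previous translations keyed by their source segment and retrieved with a plain lookup. The class and the exact-match-only behavior are simplifications for illustration, not memoQ’s implementation or anyone else’s.

```python
# A minimal sketch of segment-level translation reuse, assuming a plain
# exact-match store. Illustration only, not how any particular TMS works.

def normalize(segment: str) -> str:
    """Collapse whitespace and lowercase so trivially different segments still match."""
    return " ".join(segment.split()).lower()

class TranslationMemory:
    """Stores previous translations keyed by their normalized source segment."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def store(self, source: str, target: str) -> None:
        self._entries[normalize(source)] = target

    def lookup(self, source: str) -> str | None:
        """Return the stored translation for this segment, or None when there is no match."""
        return self._entries.get(normalize(source))

tm = TranslationMemory()
tm.store("Click the Save button.", "Kattintson a Mentés gombra.")
print(tm.lookup("Click  the  Save button."))  # reused instantly, no model call needed
```

Real translation memories add fuzzy matching, metadata, and context checks, but the point stands: the unit of reuse is a segment, and the lookup is cheap.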

The human brain also wants to save energy. This is why we can focus, and why we instinctively focus on the smallest meaningful unit that still allows us to do our work. We can also extend the scope automatically: if we need more context, we will look at more of it: other segments, or perhaps a different view of the same content.

If we eliminate access to segmented views and segmented lookups, we do one thing: we appease the AI model. Not only is this approach unethical – because it prioritizes the AI over the human user – but it also reduces the productivity of every localization automation process where human involvement is required.

What’s more, the AI model, trained on human language, will recreate or replay the structure, because it sees that structure in both written and spoken content. The structure becomes part of the language that the model internalizes.

A translation management system does not have to be confined to segmented views and segmented lookups. Tools might have poorly implemented features: for example, sending content to an AI model strictly segment by segment. However, these are not hard-coded and are relatively easy to fix.
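To illustrate the kind of fix this implies, here is a rough sketch of sending each segment to a model together with a window of neighboring segments as context. The translate_with_model callback is a hypothetical placeholder, not a real tool’s or provider’s API.

```python
# Hedged sketch: translate segment by segment, but show the model the
# surrounding segments instead of the isolated sentence.
# "translate_with_model" is a hypothetical stand-in for a real model call.

from typing import Callable

def translate_with_context(
    segments: list[str],
    translate_with_model: Callable[[str, str], str],
    window: int = 2,
) -> list[str]:
    """Translate each segment while exposing `window` neighboring segments on either side."""
    translations = []
    for i, segment in enumerate(segments):
        context = " ".join(segments[max(0, i - window): i + window + 1])
        translations.append(translate_with_model(segment, context))
    return translations

def dummy(segment: str, context: str) -> str:
    """Stand-in "model" so the sketch runs as-is."""
    return f"<translation of: {segment}>"

print(translate_with_context(["First sentence.", "Second sentence.", "Third sentence."], dummy))
```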

A well-built TMS will adapt to the need at hand and work with segmented views or segmented lookups when necessary. It will look at segmented context when that is best for the purpose. Finally, the TMS will work with unsegmented content when that is required. In other words, a TMS must be able to look at the same content through various levels of segmentation, or without it. So, the assumption that localization – or language operations – must somehow be “liberated” from the confines of segmentation is patently untrue. 
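As a rough illustration of that idea (and emphatically not memoQ’s actual data model), the same content can be exposed at several levels at once, so that a person or a process can pick the narrowest view that fits the task and widen it when more context is needed.

```python
# Illustrative sketch only: one document, several views. Segments for focused
# work and cheap lookups, paragraphs for more context, full text when the
# whole flow matters.

class Document:
    def __init__(self, paragraphs: list[list[str]]) -> None:
        # Each paragraph is stored as a list of sentence-level segments.
        self._paragraphs = paragraphs

    def segments(self) -> list[str]:
        """Narrowest view: individual sentence-level segments."""
        return [seg for para in self._paragraphs for seg in para]

    def paragraphs(self) -> list[str]:
        """Middle view: segments joined back into paragraphs."""
        return [" ".join(para) for para in self._paragraphs]

    def full_text(self) -> str:
        """Widest view: the unsegmented document."""
        return "\n\n".join(self.paragraphs())

doc = Document([["First sentence.", "Second sentence."], ["A new paragraph."]])
print(doc.segments())   # work segment by segment
print(doc.full_text())  # or look at the content in its entirety
```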

But where does memoQ stand on all this? Well, memoQ was one of the first translation tools to

  • introduce a context-sensitive translation memory,
  • offer double context on TM and corpus matches (where double context means identifying the context in the text flow as well as an explicit ID, such as those found in tables and user interfaces),
  • implement a comprehensive corpus memory that preserves entire monolingual or bilingual documents and serves up translation memory-like matches,
  • construct versatile and flexible views where you can “glue” together or filter content, creating context that is not inherently present in the source content,
  • offer unsegmented preview next to the segmented content view.

All of these options predate publicly available generative AI. They prove that a TMS does not need to be “forced” by new developments to allow users and automated processes to flexibly narrow or widen the context as needed.

Balázs Kis

Co-Founder & Chief Evangelist at memoQ
