ASLIB London: Machine Translation Quality Assessment

The Association for Information Management: Translating and the Computer Conference – 17 & 18 November 2011


An Empirical Model for MT Quality Assessment and Implications for Business Models

Sergio Pelino, Senior Program Manager, Localization Operations, Google Localization

(This reflects neither the opinions nor the research of the Google Translate product team; it is individual research conducted by Sergio Pelino.)

The alternative Machine Translation measurement model is based on Translation Memory benchmarking and “edit distance data”, attempting to reconcile the traditional Translation Memory business model with Machine Translation. The focus is on how to build an empirical model for quality measurement, and on its implications for a business model.

“Use of MT among translators grew from about 30% in 2010 to over 50% in 2011.” –


Goals:

  1. Understand and measure the impact of MT on translation effort
  2. Build a business model for MT post-editing

The standard TM model is based on the match rate between stored TM source segments and the new source text. In an MT-TM word count model, the number of matches increases, and the cost therefore decreases.
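The word-count pricing logic can be sketched as follows. This is a minimal illustration, assuming a hypothetical fuzzy-match rate card; the band thresholds and discount rates are invented for the example, not taken from the talk.

```python
# Hypothetical fuzzy-match rate card: fraction of the full per-word
# rate paid for words in each TM match band (illustrative values).
RATE_CARD = {
    "100%": 0.10,
    "95-99%": 0.30,
    "85-94%": 0.50,
    "75-84%": 0.70,
    "no match": 1.00,
}

def band(match_pct: float) -> str:
    """Map a TM match percentage to a pricing band."""
    if match_pct >= 100:
        return "100%"
    if match_pct >= 95:
        return "95-99%"
    if match_pct >= 85:
        return "85-94%"
    if match_pct >= 75:
        return "75-84%"
    return "no match"

def cost(words_by_match: dict, full_rate: float) -> float:
    """Total job cost: word counts per match level, times banded rates."""
    return sum(n * full_rate * RATE_CARD[band(pct)]
               for pct, n in words_by_match.items())

# If MT suggestions behave like fuzzy matches, segments previously
# billed as "no match" move into discounted bands and total cost falls.
without_mt = cost({100: 200, 90: 300, 0: 500}, full_rate=0.20)
with_mt = cost({100: 200, 90: 300, 80: 500}, full_rate=0.20)  # MT treated as ~80% match
```

The point of the sketch is only the mechanism: treating MT output as an additional source of matches shifts words into cheaper bands under the existing TM pricing grid.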

The MT measurement dilemma: what units of measurement exist for MT that could practically be used in an enterprise business model for leveraging and pricing translation tasks?

The empirical approach: use the same guiding principle as TM systems – “edit distance” (i.e., the amount of editing) = effort – but apply it to the machine-translated suggestion and the final post-edited translation, rather than as a comparison between source sentences.

Hypothesis: If we could measure that MT effort is equivalent to the TM effort for fuzzy matches, then we could apply the same business model.

  • Track the changes
  • Use “Edit Distance Data” as a proxy for the estimated effort
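The edit-distance proxy described above can be sketched as a normalized Levenshtein distance between the raw MT suggestion and the post-edited result. The function names and the choice to normalize by the longer string are assumptions for illustration, not details from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def post_edit_effort(mt_suggestion: str, post_edited: str) -> float:
    """Normalized edit distance in [0, 1]; 0.0 means no edits were needed."""
    if not mt_suggestion and not post_edited:
        return 0.0
    return levenshtein(mt_suggestion, post_edited) / max(len(mt_suggestion),
                                                         len(post_edited))

# Effort for a segment: compare what MT proposed with what shipped.
effort = post_edit_effort("the cat sat on mat", "the cat sat on the mat")
```

A score near 0 means the translator accepted the suggestion almost verbatim; a score near 1 means a full rewrite, i.e. no leverage from the suggestion.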

Is MT edit effort comparable to TM edit effort for suggestions (fuzzy matches) above 75%? According to a Google study (using Google Translate), the answer is yes.

Data and scope:

  • Timeline: December 2010 – present
  • 43 languages

Analysis: Is MT edit effort comparable to TM edit effort for fuzzy matches? Depends on the language.
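The per-language analysis can be sketched as a simple aggregation of normalized edit effort, compared against a fuzzy-match-equivalent threshold. The record layout, the sample values, and the 0.25 cutoff (standing in for “comparable to a >75% match”) are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def effort_by_language(records):
    """Average normalized edit effort per language.

    records: iterable of (language, normalized_edit_effort) pairs,
    one per post-edited segment.
    """
    buckets = defaultdict(list)
    for lang, effort in records:
        buckets[lang].append(effort)
    return {lang: mean(vals) for lang, vals in buckets.items()}

# Illustrative sample: two segments each for German and Japanese.
records = [("de", 0.18), ("de", 0.22), ("ja", 0.41), ("ja", 0.37)]
avg = effort_by_language(records)

# Languages whose average MT post-editing effort falls at or below the
# hypothetical fuzzy-match-equivalent threshold could reuse TM pricing.
THRESHOLD = 0.25
comparable = {lang for lang, e in avg.items() if e <= THRESHOLD}
```

On data like this, one language clears the threshold and one does not, which is exactly the “depends on the language” finding: the TM business model transfers to MT post-editing for some languages but not others.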

Takeaway: it is possible to build an evaluation system that is fully engineered, with no human-biased evaluation.