ASLIB London: How to Bring Innovation into the Corpora

The Association for Information Management: Translating and the Computer Conference – 17 & 18 November 2011


New-Old Resources – How to Bring Innovation into the Corpora?
István Lengyel is the COO of Kilgray Translation Technologies

Corpus Linguistics:

  • Traditionally, corpora are made up of texts
  • There are different corpus types for different goals:
    • Monolingual corpora
    • Parallel corpora
    • Comparable corpora
  • Since 2003, Kilgarriff and Grefenstette have allowed for the web itself to be treated as a corpus
  • Typical uses: concordancing and scholarly research
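
Concordancing, mentioned above, can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not taken from the talk: the function name `concordance`, the `width` parameter, and the sample text are assumptions made for illustration only.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, standing in for a monolingual corpus.
sample = (
    "A translation memory stores segments. "
    "Each memory is a small parallel corpus."
)
results = concordance(sample, "memory", width=20)
for line in results:
    print(line)
```

A real concordancer would work on an indexed corpus rather than a single string, but the output shape (left context, keyword, right context) is the same.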

Translation Technology:

  • Your hard drive is a corpus
  • If you work in TEnTs (translation environment tools), you already use parallel corpora, in the form of bilingual files
  • Customers still don’t realize that translation memory is a technological implementation of corpus linguistics
  • Because the primary file formats are monolingual, a lot of useful corpus material is lost

Uses and Challenges:

  • Categorizing and using corpora helps translators make the right translation decisions
  • There are many types of corpus exploitation techniques:
    • Translation memories
    • Web-based and desktop search engines
    • Plagiarism detectors
    • Knowledge bases
  • Corpus linguistics requires different indexing

Translation Use:

  • Statistical machine translation (SMT) increases the importance of corpora
  • Corpus collection and maintenance become an issue
  • Translation memory maintenance has always been an issue
  • Alignment has always been an even bigger issue

Translation Memories:

  • Their quick return on investment makes us forget how much content they fail to leverage
  • Without TM sharing and maintenance, inconsistency is probable
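
To make the leverage point concrete, here is a minimal sketch of a segment-level translation memory lookup. It is an illustration under stated assumptions, not how any particular tool works: the function `tm_lookup`, the 0.75 threshold, and the sample segments are invented, and the similarity score comes from Python's standard `difflib` rather than a real TM engine.

```python
from difflib import SequenceMatcher

def tm_lookup(segment, memory, threshold=0.75):
    """Return (score, source, target) for the best match at or above
    `threshold`, or None if no stored segment is similar enough."""
    best = None
    for source, target in memory:
        # Character-based similarity in [0, 1]; an assumption, not a TM standard.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

# Invented English-German segment pairs for illustration.
memory = [
    ("Click the Save button.", "Klicken Sie auf die Schaltfläche Speichern."),
    ("Close the window.", "Schließen Sie das Fenster."),
]

hit = tm_lookup("Click the Save button", memory)
miss = tm_lookup("Restart the computer to finish the installation.", memory)
print(hit)
print(miss)
```

The second lookup returns nothing because no whole segment is similar enough. Anything useful that sits below the threshold, or inside a longer segment, stays invisible to this kind of lookup, which is where corpus-style search comes in.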

The Cases for Corpora:

  • Target term collocation search (monolingual)
  • Leveraging earlier materials, being able to check context (bilingual, aligned)
  • Leveraging earlier materials without preparation (bilingual, unaligned)
  • Leveraging the same materials across the whole team of translators, and allowing the team to improve these materials
  • Question: How do we establish team boundaries?
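
The first case above, target term collocation search in a monolingual corpus, can be sketched as a simple window-based count. The function `collocates`, the `window` parameter, and the three sample sentences are assumptions for illustration; a production tool would add proper tokenization and statistical association measures.

```python
import re
from collections import Counter

def collocates(texts, term, window=2):
    """Count the words that occur within `window` tokens of `term`."""
    counts = Counter()
    term = term.lower()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == term:
                lo = max(0, i - window)
                # Neighbours on both sides, excluding the term itself.
                neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

# Invented target-language sentences standing in for a monolingual corpus.
corpus = [
    "Submit the tax return before the deadline.",
    "The tax return must be signed.",
    "File your tax return online.",
]
top = collocates(corpus, "return", window=1)
print(top.most_common(3))
```

With these sentences, "tax" is the most frequent neighbour of "return", which is the kind of evidence a translator looks for when checking that a target term is used in its usual company.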

The Future is Corpus Linguistics (TAUS):

  • Corpus-based initiatives are still far from mainstream translation
  • Corpus collection is a problem: it is hard to establish which text is the translation of which
  • We need collaborative learning alignment
  • We need to know what is good and what is not (cf. Google spell-check suggestions)
  • Authoring for translation / web design for translation