ASLIB London: How to Bring Innovation into the Corpora
The Association for Information Management: Translating and the Computer Conference – 17 & 18 November 2011
New-Old Resources – How to Bring Innovation into the Corpora?
István Lengyel is the COO of Kilgray Translation Technologies
Corpus Linguistics:
- Traditionally, corpora are made up of texts
- There are different corpus types for different goals:
  - Monolingual corpora
  - Parallel corpora
  - Comparable corpora
- Since Kilgarriff and Grefenstette (2003), the web itself can be treated as a corpus
- Concordancing, scholarly use
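To make "concordancing" concrete, here is a minimal keyword-in-context (KWIC) sketch. It was not part of the talk; the function name `kwic`, the sample text, and the three-token window are assumptions chosen for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, for illustration only.
sample = ("The corpus is large. A parallel corpus pairs each "
          "source text with its translation.")

for left, kw, right in kwic(sample, "corpus"):
    print(f"{left:>20} [{kw}] {right}")
```

A real concordancer would add proper tokenization, sorting by context, and an index so that it does not rescan the whole corpus for every query.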
Translation Technology:
- Your hard drive is a corpus
- If you work in translation environment tools (TEnTs), you already use parallel corpora, in the form of bilingual files
- Customers still don’t realize that translation memory is a technological implementation of corpus linguistics
- Because the primary file formats are monolingual, a lot of useful corpus material is lost
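The claim that a translation memory is a parallel corpus with a lookup layer on top can be sketched as follows. This is an illustration, not how any particular tool works: the segment pairs are invented, the `lookup` function and the 0.75 threshold are assumptions, and the standard library's `difflib.SequenceMatcher` stands in for whatever matching algorithm a real tool uses.

```python
from difflib import SequenceMatcher

# A toy translation memory: (source segment, target segment) pairs.
# The English-French pairs are invented for illustration.
tm = [
    ("Click the Save button.", "Cliquez sur le bouton Enregistrer."),
    ("Close the door.", "Fermez la porte."),
]

def lookup(segment, memory, threshold=0.75):
    """Return (score, source, target) for the best fuzzy match.

    Returns None when no stored source segment reaches `threshold`.
    """
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

print(lookup("Click the Cancel button.", tm))  # fuzzy match on the first pair
print(lookup("Restart the computer.", tm))     # nothing similar enough
```

Seen this way, the memory is just aligned text. The "technology" is the retrieval step on top of it.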
Uses and Challenges:
- Categorizing and using corpora helps translators make the right translation decisions
- There are many corpus exploitation techniques:
  - Translation memories
  - Web-based and desktop search engines
  - Plagiarism detectors
  - Knowledge bases
- Corpus linguistics requires different indexing
Translation Use:
- SMT increases the importance of corpora
- Corpus collection and maintenance become an issue
- Translation memory maintenance has always been an issue
- Alignment has always been an even bigger issue
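To show why alignment is hard even in the simplest case, here is a sketch of length-based sentence alignment. It is loosely inspired by the Gale and Church approach but replaces their probabilistic cost with a plain character-length difference. The `align` function, the `skip_penalty` value, and the sample sentences are assumptions for illustration, not something presented in the talk.

```python
def align(src, tgt, skip_penalty=10):
    """Align two sentence lists by character length (dynamic programming).

    Allowed groupings: 1-1, 1-0, 0-1, 2-1 and 1-2. The cost of a grouping
    is the absolute length difference, plus a flat penalty for anything
    other than 1-1.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                penalty = 0 if (di, dj) == (1, 1) else skip_penalty
                c = cost[i][j] + abs(s_len - t_len) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the aligned groups.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

# Invented example: the translator split the second sentence in two.
src = ["Hello.", "This is a long sentence that was split.", "Bye."]
tgt = ["Bonjour.", "Ceci est une longue phrase.",
       "Elle a ete coupee.", "Au revoir."]

for s, t in align(src, tgt):
    print(s, "<->", t)
```

Length alone is a weak signal. Production aligners combine it with dictionaries, anchors such as numbers and tags, and human correction, which is where the maintenance cost comes from.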
Translation Memories:
- The quick return on investment makes us forget how much existing content they fail to leverage
- Without TM sharing and maintenance, inconsistency is probable
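The inconsistency risk can be made visible with a simple check over merged memories. This is a sketch under assumptions: the `find_inconsistencies` function, the case-insensitive source comparison, and the sample pairs are all hypothetical.

```python
from collections import defaultdict

def find_inconsistencies(memory):
    """Return source segments that have more than one distinct target.

    Sources are compared case-insensitively after trimming whitespace.
    """
    targets = defaultdict(set)
    for source, target in memory:
        targets[source.strip().lower()].add(target.strip())
    return {src: sorted(tgts) for src, tgts in targets.items() if len(tgts) > 1}

# Invented pairs, as if two translators' unshared memories were merged.
merged = [
    ("Save changes?", "Enregistrer les modifications ?"),
    ("Save changes?", "Sauvegarder les changements ?"),
    ("Cancel", "Annuler"),
    ("cancel", "Annuler"),
]

print(find_inconsistencies(merged))
```

Only exact duplicates of the source are caught here. Near-duplicate sources translated differently would need a fuzzy comparison as well.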
The Cases for Corpora:
- Target term collocation search (monolingual)
- Leveraging earlier materials, being able to check context (bilingual, aligned)
- Leveraging earlier materials without preparation (bilingual, unaligned)
- Leveraging the same materials across the whole team of translators, so that the team can also improve these materials
- Question: How do we establish team boundaries?
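The first case above, target term collocation search in a monolingual corpus, can be sketched as a windowed co-occurrence count. The `collocates` function, the two-token window, and the sample sentences are assumptions for illustration.

```python
import re
from collections import Counter

def collocates(texts, term, window=2):
    """Count the words found within `window` tokens of `term`."""
    counts = Counter()
    term = term.lower()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == term:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

# Invented target-language sentences, for illustration only.
corpus = [
    "Submit the application form before the deadline.",
    "The application form must be signed.",
    "Fill in the form and submit the form online.",
]

for word, freq in collocates(corpus, "form").most_common(3):
    print(word, freq)
```

Function words dominate raw counts like these, so a practical tool would filter them out or rank candidates with an association measure instead of plain frequency.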
The Future is Corpus Linguistics (TAUS):
- Corpus-based initiatives are still far from mainstream translation
- Corpus collection is a problem: it is often unclear which text is the translation of which
- We need collaborative learning alignment
- We need to know what is good and what is not (cf. Google spell-check suggestions)
- Authoring for translation / web design for translation