Parallel corpus alignment at the document, sentence and vocabulary levels

Research output: Contribution to journal, review article, peer-reviewed

1 Scopus citation

Abstract

This paper presents a language-independent algorithm for the alignment of parallel corpora at the document, sentence and vocabulary levels using the to-be-aligned corpus itself as the only source of information. The input is a set of documents written in two unknown languages A and B, where every document in language A has its corresponding translation into language B. The problem thus consists of: 1) dividing the set of documents into the two languages; 2) aligning at the document level to determine which document in language A is the original (or translation) of each document in language B; 3) aligning at the sentence level to determine which sentence in the original corresponds to each sentence in the translation; and 4) aligning at the vocabulary level to determine which word in one language is equivalent to each word in the translation. The algorithm is iterative, using the resulting bilingual vocabulary to re-align the corpus. Evaluation figures in English, Spanish and French show competitive results at all levels of the alignment.
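
As an illustration of step 4 only, the sketch below pairs words across already aligned sentences by how often they co-occur. The abstract does not say which association measure the paper uses, so the Dice coefficient here is an assumption, and the function name dice_lexicon, the min_score threshold and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_lexicon(sentence_pairs, min_score=0.5):
    """Map each word of language A to its best-scoring word of language B.

    sentence_pairs: list of (sentence_a, sentence_b) strings that are
    assumed to be translations of each other (the output of step 3).
    The Dice coefficient is an illustrative choice, not the paper's method.
    """
    freq_a, freq_b, joint = Counter(), Counter(), Counter()
    for sent_a, sent_b in sentence_pairs:
        # Count each word once per sentence, so frequencies are sentence counts.
        words_a = set(sent_a.lower().split())
        words_b = set(sent_b.lower().split())
        freq_a.update(words_a)
        freq_b.update(words_b)
        joint.update(product(words_a, words_b))

    best = {}
    for (wa, wb), both in joint.items():
        # Dice = 2 * co-occurrences / (occurrences of wa + occurrences of wb)
        score = 2.0 * both / (freq_a[wa] + freq_b[wb])
        if score >= min_score and score > best.get(wa, ("", 0.0))[1]:
            best[wa] = (wb, score)
    return best


# Hypothetical toy corpus: English sentences with Spanish translations.
pairs = [
    ("the house is red", "la casa es roja"),
    ("the house is big", "la casa es grande"),
    ("the car is red", "el coche es rojo"),
    ("a big car", "un coche grande"),
    ("the table is big", "la mesa es grande"),
]
lexicon = dice_lexicon(pairs)
for word in ("house", "car", "big"):
    print(word, "->", lexicon[word][0])
```

On this toy corpus the content words "house", "car" and "big" are paired with "casa", "coche" and "grande", while a frequent function word such as "the" ends up paired with "es". A realistic system would need far more data and some one-to-one constraint, and in an iterative scheme like the one the abstract describes, a lexicon of this kind would be fed back to re-align the sentences.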

Original language: English
Pages (from-to): 129-136
Number of pages: 8
Journal: Procesamiento de Lenguaje Natural
Volume: 47
State: Published - 2011
Externally published: Yes

Keywords

  • Information extraction
  • Machine translation
  • Parallel corpus alignment
