Abstract
This paper presents a language-independent algorithm for aligning parallel corpora at the document, sentence and vocabulary levels, using the corpus to be aligned as the only source of information. The input is a set of documents written in two unknown languages, A and B, where every document in language A has a corresponding translation in language B. The problem thus consists of: 1) dividing the set of documents into the two languages; 2) aligning at the document level to determine which document in language A is the original (or translation) of each document in language B; 3) aligning at the sentence level to determine which sentence in the original corresponds to each sentence in the translation; and 4) aligning at the vocabulary level to determine which word in one language is equivalent to each word in the other. The algorithm is iterative: it uses the resulting bilingual vocabulary to re-align the corpus. Evaluation figures for English, Spanish and French show competitive results at all levels of the alignment.
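The abstract does not say how the vocabulary-level alignment is scored. As an illustration only, the sketch below assumes a Dice co-occurrence score computed over sentence pairs that are already aligned; this is a common baseline and not necessarily the authors' method. The function name `dice_vocabulary` and the threshold `min_score` are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_vocabulary(sentence_pairs, min_score=0.5):
    """Pair each language-A word with its best-scoring language-B word.

    Assumption: sentence_pairs is an iterable of (sentence_a, sentence_b)
    strings that are already aligned and whitespace-tokenizable.
    """
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for sent_a, sent_b in sentence_pairs:
        # Count each word once per sentence, so counts are sentence frequencies.
        words_a = set(sent_a.lower().split())
        words_b = set(sent_b.lower().split())
        count_a.update(words_a)
        count_b.update(words_b)
        count_ab.update(product(words_a, words_b))

    best = {}
    for (word_a, word_b), joint in count_ab.items():
        # Dice coefficient: 2 * |A and B| / (|A| + |B|), in the range (0, 1].
        score = 2 * joint / (count_a[word_a] + count_b[word_b])
        if score >= min_score and score > best.get(word_a, ("", 0.0))[1]:
            best[word_a] = (word_b, score)
    return best
```

Given the pairs `("the house", "la casa")` and `("a house", "una casa")`, `house` is paired with `casa`, because the two words always occur together. In an iterative scheme like the one the abstract describes, a vocabulary of this kind could then be fed back to rescore the sentence and document alignments.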
Original language | English |
---|---|
Pages (from-to) | 129-136 |
Number of pages | 8 |
Journal | Procesamiento de Lenguaje Natural |
Volume | 47 |
State | Published - 2011 |
Externally published | Yes |
Keywords
- Information extraction
- Machine translation
- Parallel corpus alignment