Parallel corpus alignment at the document, sentence and vocabulary levels

Research output: Contribution to journal, Review article, peer-reviewed

2 Citations (Scopus)

Abstract

This paper presents a language-independent algorithm for aligning parallel corpora at the document, sentence and vocabulary levels, using the corpus to be aligned as the only source of information. The input is a set of documents written in two unknown languages, A and B, where every document in language A has a corresponding translation in language B. The problem thus consists of: 1) dividing the set of documents into the two languages; 2) aligning at the document level, to determine which document in language A is the original (or translation) of each document in language B; 3) aligning at the sentence level, to determine which sentence in the original corresponds to each sentence in the translation; and 4) aligning at the vocabulary level, to determine which word in one language is equivalent to each word in the other. The algorithm is iterative: the resulting bilingual vocabulary is used to re-align the corpus. Evaluation figures for English, Spanish and French show competitive results at all levels of alignment.
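The abstract does not say how the vocabulary-level step is computed, so the following is only a minimal sketch of one common baseline, not the paper's method: scoring word pairs by the Dice coefficient over sentences that are already aligned. The function name `dice_lexicon`, the `min_score` threshold and the toy sentence pairs are all hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import product


def dice_lexicon(sentence_pairs, min_score=0.5):
    """Map each source word to its best target word by Dice coefficient.

    Assumption: sentence_pairs is a list of (source, target) strings that
    are already aligned at the sentence level. This is an illustrative
    baseline, not the algorithm described in the paper.
    """
    src_freq = Counter()   # number of sentence pairs containing each source word
    tgt_freq = Counter()   # number of sentence pairs containing each target word
    joint = Counter()      # number of sentence pairs containing both words

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    best = {}
    for (s, t), both in joint.items():
        # Dice coefficient: 2 * |A and B| / (|A| + |B|)
        score = 2 * both / (src_freq[s] + tgt_freq[t])
        if score >= min_score and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best


# Toy English-Spanish pairs (hypothetical data).
pairs = [
    ("the house", "la casa"),
    ("the dog", "el perro"),
    ("a house", "una casa"),
    ("a dog", "un perro"),
]
lexicon = dice_lexicon(pairs)
print(lexicon["house"], lexicon["dog"])
```

In an iterative scheme like the one the abstract describes, a lexicon of this kind could in principle be fed back to rescore candidate sentence pairs, but how the paper actually performs that step is not stated here.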

Original language: English
Pages (from-to): 129-136
Number of pages: 8
Journal: Procesamiento de Lenguaje Natural
Volume: 47
Status: Published - 2011
Externally published
