Automatic induction of a multilingual taxonomy of discourse markers

Research output: Contribution to journal › Conference article › peer-reviewed

2 Citations (Scopus)


This paper describes a method for identifying and classifying discourse markers (e.g., however, therefore, by the way) by applying statistical analysis to large parallel corpora. The objective is to build a lexical resource consisting of a multilingual taxonomy, so far covering English, Spanish, German and French. The method first separates discourse markers from the rest of the lexical units in the corpus using a measure of entropy, and then classifies them into groups by function using a clustering procedure especially designed for massive data processing. From that point onwards, the system is used to recursively identify and classify more units. Experimental evaluation shows that, in terms of precision, the automated method performs as well as a team of human annotators (undergraduate students of linguistics), and that it outperforms them in terms of recall.
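
The abstract does not say which distribution the entropy measure is computed over, so the following is only an illustrative sketch under one assumption: that a discourse marker tends to be spread evenly across documents regardless of topic, whereas a content word is concentrated in a few. The function name `document_entropy` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math


def document_entropy(term, documents):
    """Shannon entropy (in bits) of a term's occurrences across documents.

    Assumption (not from the paper): each document is a list of tokens,
    and the term is a single token.
    """
    counts = [doc.count(term) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0  # term never occurs, so there is no distribution to measure
    entropy = 0.0
    for count in counts:
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Toy corpus: four short "documents" on unrelated topics.
docs = [
    "however the engine failed".split(),
    "the court ruled however that".split(),
    "however the engine was repaired".split(),
    "prices rose however demand fell".split(),
]

for word in ("however", "engine", "court"):
    print(word, document_entropy(word, docs))
```

In this toy corpus "however" occurs once in every document and so gets the highest score, "engine" occurs in two documents, and "court" in only one. A threshold on such a score could act as a first filter, although multiword markers such as "by the way" would need n-gram counting, which this sketch leaves out.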

Original language: English
Pages (from-to): 440-454
Number of pages: 15
Publication: Proceedings of Electronic Lexicography in the 21st Century Conference
Status: Published - 2021
Event: 7th Biennial Conference on Electronic Lexicography, eLex 2021 - Virtual, Online
Duration: 5 Jul 2021 to 7 Jul 2021

