This paper describes a method for identifying and classifying discourse markers (e.g., however, therefore, by the way) by applying statistical analysis to large parallel corpora. The objective is to build a lexical resource consisting of a multilingual taxonomy, which so far covers English, Spanish, German and French. The method first separates discourse markers from the rest of the lexical units in the corpus using a measure of entropy, and then classifies them into groups by function using a clustering procedure specifically designed for massive data processing. From that point onwards, the system is used to recursively identify and classify more units. Experimental evaluation shows that the automated method performs as well as a team of human annotators (undergraduate students of linguistics) in terms of precision, and outperforms them in terms of recall.
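The abstract does not say which entropy measure is used or over what distribution it is computed. As a purely illustrative sketch, the Python snippet below assumes Shannon entropy over the spread of a word's occurrences across documents, on the intuition that discourse markers are not tied to any topic and so tend to be distributed more evenly than content words. The function name `distribution_entropy` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math

def distribution_entropy(word, documents):
    """Shannon entropy (in bits) of how occurrences of `word` spread over documents.

    Assumption for illustration only: each document is a list of tokens, and
    a more even spread across documents yields a higher entropy.
    """
    counts = [doc.count(word) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0  # word never occurs; treat entropy as zero
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical toy corpus: two documents about biology, two about finance.
docs = [
    "however the cell divides".split(),
    "however the market fell".split(),
    "the cell membrane however is thin".split(),
    "the market however recovered".split(),
]

for w in ("however", "cell", "market"):
    print(w, distribution_entropy(w, docs))
```

In this toy corpus "however" appears once in every document, while "cell" and "market" each appear in only two, so the marker scores higher than the topic-bound words. The sketch handles single tokens only; multiword markers such as "by the way" would need n-gram counting, and the clustering step is not covered here.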
|Proceedings of Electronic Lexicography in the 21st Century Conference
|Published - 2021
|7th Biennial Conference on Electronic Lexicography, eLex 2021 - Virtual, Online
Duration: 5 Jul 2021 → 7 Jul 2021