Automatic induction of a multilingual taxonomy of discourse markers

Research output: Contribution to journal › Conference article › peer-review

Abstract

This paper describes a method for identifying and classifying discourse markers (e.g., however, therefore, by the way) by applying statistical analysis to large parallel corpora. The objective is to build a lexical resource consisting of a multilingual taxonomy, so far covering English, Spanish, German and French. The method first separates discourse markers from the rest of the lexical units in the corpus using a measure of entropy, and then classifies them into groups by function using a clustering procedure specially designed for massive data processing. From that point onwards, the system is used to recursively identify and classify more units. Experimental evaluation shows that, in terms of precision, the automated method performs as well as a team of human annotators (undergraduate students of linguistics), and that it outperforms them in terms of recall.
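The abstract does not say which distribution the entropy measure is computed over, so the following is only a minimal sketch of one plausible reading, not the paper's procedure. It assumes that entropy is taken over a word's frequencies across documents, on the intuition that topic-independent units such as discourse markers spread evenly across texts while content words concentrate in a few. The function names `distribution_entropy` and `rank_by_entropy` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def distribution_entropy(counts):
    """Shannon entropy (in bits) of a list of frequency counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def rank_by_entropy(docs):
    """Rank words by the entropy of their frequency distribution across documents.

    Assumption (not from the paper): a word spread evenly over many
    documents scores high; a word confined to one document scores 0.
    """
    per_word = {}
    for tokens in docs:
        for word, n in Counter(tokens).items():
            per_word.setdefault(word, []).append(n)
    scores = {w: distribution_entropy(c) for w, c in per_word.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus: "however" occurs once in three unrelated texts,
# while "enzyme" is concentrated in texts on a single topic.
docs = [
    "however the enzyme binds the substrate".split(),
    "however the court rejected the appeal".split(),
    "however the engine stalled".split(),
    "the enzyme enzyme enzyme degrades".split(),
]
ranking = rank_by_entropy(docs)
```

In this toy corpus "however" scores higher than "enzyme", but so does the function word "the", so a score of this kind could only serve as a first filter. The functional classification by clustering that the abstract mentions is not sketched here.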

Original language: English
Pages (from-to): 440-454
Number of pages: 15
Journal: Proceedings of Electronic Lexicography in the 21st Century Conference
Volume: 2021-July
State: Published - 2021
Externally published: Yes
Event: 7th Biennial Conference on Electronic Lexicography, eLex 2021 - Virtual, Online
Duration: 5 Jul 2021 to 7 Jul 2021

Keywords

  • automatic creation of dictionary content
  • connectives
  • discourse markers
  • natural language processing
  • taxonomy induction
