TOWARDS A MULTILINGUAL DICTIONARY OF DISCOURSE MARKERS: Automatic extraction of units from parallel corpus

Research output: Chapter in Book/Report/Conference proceeding | Conference contribution | peer-review

Abstract

This paper presents a project to build a multilingual dictionary of discourse markers (DMs). In its first stage, which consists of compiling the list of headwords, we used a parallel corpus to automatically extract units from texts written in Spanish, Catalan, English, French and German. We also applied a method for building a taxonomy that automatically organises the markers into clusters. The result is an extensive, corpus-driven list of headwords. We present a prototype of the dictionary's microstructure in the form of a standard XML database and describe the procedure for automatically filling in most of its fields (e.g., the type of DM and the equivalents in other languages) before human intervention.

Original language: English
Title of host publication: Proceedings of the 20th EURALEX International Congress, 2022
Editors: Annette Klosa-Kückelhaus, Stefan Engelberg, Christine Möhrs, Petra Storjohann
Publisher: European Association for Lexicography
Pages: 262-272
Number of pages: 11
ISBN (Print): 9783937241876
State: Published - 2022
Externally published: Yes
Event: 20th EURALEX International Congress, 2022 - Mannheim, Germany
Duration: 12 Jul 2022 to 16 Jul 2022

Publication series

Name: EURALEX Proceedings
ISSN (Electronic): 2521-7100

Conference

Conference: 20th EURALEX International Congress, 2022
Country/Territory: Germany
City: Mannheim
Period: 12/07/22 to 16/07/22

Keywords

  • computational lexicography
  • corpus-driven lexicography
  • discourse markers
  • multilingual lexicography
