Spell-checking in Spanish: The case of diacritic accents

Jordi Atserias, Maria Fuentes, Rogelio Nazar, Irene Renau Araque

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

2 Scopus citations

Abstract

This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker's dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo 'continuous' and continuó 'he/she/it continued', or when different diacritics make other word distinctions, as in continúo 'I continue'. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
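The abstract does not spell out how the bigram model is applied, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a tiny hand-made corpus (TOY_CORPUS), a hypothetical class name (BigramRestorer), and a simple scoring rule invented for illustration: each candidate spelling is scored by how often it was seen next to the neighbouring words in correctly typed text, with overall word frequency as a tie-breaker.

```python
import re
from collections import Counter, defaultdict

# Only the acute accent on vowels is handled here (an assumption made
# to keep the sketch short; the letters ñ and ü are left untouched).
_ACCENTS = str.maketrans("áéíóú", "aeiou")


def strip_accents(word):
    """Return the word with acute accents removed from its vowels."""
    return word.translate(_ACCENTS)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class BigramRestorer:
    """Hypothetical diacritic restorer backed by word bigram counts."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        # Maps an unaccented form to every spelling seen in the corpus.
        self.variants = defaultdict(set)
        for sentence in sentences:
            tokens = tokenize(sentence)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
            for tok in tokens:
                self.variants[strip_accents(tok)].add(tok)

    def restore(self, text):
        tokens = tokenize(text)
        out = []
        for i, tok in enumerate(tokens):
            # Unknown words are kept exactly as typed.
            candidates = self.variants.get(strip_accents(tok), {tok})
            prev = out[-1] if out else None  # already-restored left word
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None  # as typed

            def score(cand):
                context = self.bigrams[(prev, cand)] + self.bigrams[(cand, nxt)]
                # Ties fall back to word frequency, then to the typed form.
                return (context, self.unigrams[cand], cand == tok)

            # sorted() keeps the choice deterministic when scores tie.
            out.append(max(sorted(candidates), key=score))
        return " ".join(out)


# Toy stand-in for the "large normative corpus" the abstract calls for.
TOY_CORPUS = [
    "el ruido continuo molesta",
    "un ruido continuo",
    "ella continuó hablando",
    "él continuó hablando ayer",
    "yo continúo trabajando",
    "yo continúo aquí",
]

model = BigramRestorer(TOY_CORPUS)
print(model.restore("ella continuo hablando"))     # ella continuó hablando
print(model.restore("yo continuo trabajando"))     # yo continúo trabajando
print(model.restore("el ruido continuo molesta"))  # el ruido continuo molesta
```

In this sketch all three spellings of continuo are valid dictionary words, so a word-list check alone would accept any of them; only the neighbouring words separate them. A realistic system would need a much larger corpus and some smoothing for unseen bigrams, neither of which is shown here.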

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012
Editors: Mehmet Ugur Dogan, Joseph Mariani, Asuncion Moreno, Sara Goggi, Khalid Choukri, Nicoletta Calzolari, Jan Odijk, Thierry Declerck, Bente Maegaard, Stelios Piperidis, Helene Mazo, Olivier Hamon
Publisher: European Language Resources Association (ELRA)
Pages: 737-742
Number of pages: 6
ISBN (Electronic): 9782951740877
State: Published - 1 Jan 2012
Event: 8th International Conference on Language Resources and Evaluation, LREC 2012 - Istanbul, Turkey
Duration: 21 May 2012 to 27 May 2012

Publication series

Name: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

Conference

Conference: 8th International Conference on Language Resources and Evaluation, LREC 2012
Country: Turkey
City: Istanbul
Period: 21/05/12 to 27/05/12

Keywords

  • Computer-assisted writing in Spanish
  • Diacritic restoration
  • N-gram language models
  • Spell-checking
