Detecting Hate Speech in Cross-Lingual and Multi-lingual Settings Using Language Agnostic Representations

Sebastián E. Rodríguez, Héctor Allende-Cid, Héctor Allende

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

The automatic detection of hate speech is a fast-growing field in the natural language processing community. In recent years there have been efforts to detect hate speech in multiple languages, using models trained on several languages at the same time. Furthermore, there is special interest in the ability of language-agnostic features to represent text for hate speech detection, because models can be trained on multiple languages and then both the model and the representation can be tested on an unseen language. In this work we focused on detecting hate speech in mono-lingual, multi-lingual and cross-lingual settings. For this we used a pre-trained language model called Language Agnostic BERT Sentence Embeddings (LabSE), both for feature extraction and as an end-to-end classification model. We tested different models, such as Support Vector Machines and tree-based models, and different representations, in particular bag of words, bag of characters, and sentence embeddings extracted from Multi-lingual BERT. The dataset used was the SemEval 2019 Task 5 dataset, which covers hate speech against immigrants and women in English and Spanish. The results show that using LabSE for feature extraction improves performance on both languages in the mono-lingual setting, and also in the cross-lingual setting. Moreover, LabSE as an end-to-end classification model performs better than the results reported by the authors of the SemEval 2019 Task 5 dataset for the Spanish language.

Original language: English
Title of host publication: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - 25th Iberoamerican Congress, CIARP 2021, Revised Selected Papers
Editors: João Manuel Tavares, João Paulo Papa, Manuel González Hidalgo
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 77-87
Number of pages: 11
ISBN (Print): 9783030934194
State: Published - 2021
Event: 25th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2021 - Virtual, Online
Duration: 10 May 2021 - 13 May 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12702 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2021
City: Virtual, Online
Period: 10/05/21 - 13/05/21

Keywords

  • Hate speech detection
  • Multi-lingual language models
  • Natural Language Processing
