Automatic text classification of disciplinary texts

Research output: Chapter in Book/Report/Conference proceeding > Chapter > peer-review

Abstract

This study classifies the academic texts in the PUCV-2006 Corpus of Spanish using two automatic classification methods and compares their performance. Both methods rely on the lexical-semantic content words shared across the academic texts in the corpus. The methods compared are Multinomial Naive Bayes and Support Vector Machine. Each identifies a small group of shared words that, weighted statistically, help assign a new text to one of the four disciplinary areas represented in the corpus. The results show that Support Vector Machine classifies academic texts efficiently. With this method, we were able to identify the disciplinary domain of an academic text automatically from a reduced number of shared content lexemes, with high performance even in highly refined disciplines such as Psychology and Social Work.
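To make the first of the two methods concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over content words, written in plain Python. It is an illustration under stated assumptions, not the chapter's implementation: the toy documents, the two labels, the smoothing value `alpha`, and the function names `train_multinomial_nb` and `classify` are all invented here, and the real study works with four disciplinary areas of the PUCV-2006 corpus.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs.

    alpha is an additive (Laplace) smoothing constant; 1.0 is an assumed default.
    """
    vocab = sorted({word for doc in docs for word in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {"vocab": set(vocab), "log_prior": {}, "log_lik": {}}
    for label, n_docs in docs_per_class.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Denominator: all word tokens in the class plus smoothing mass.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
    return model


def classify(model, doc):
    """Return the label with the highest posterior log score for a tokenized doc."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_lik = model["log_lik"][label]
        # Words never seen in training are ignored.
        scores[label] = log_prior + sum(
            log_lik[word] for word in doc if word in model["vocab"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: invented content words, not drawn from PUCV-2006.
train_docs = [
    ["cognition", "memory", "behavior", "therapy"],
    ["memory", "perception", "behavior", "emotion"],
    ["community", "welfare", "family", "intervention"],
    ["family", "policy", "welfare", "community"],
]
train_labels = ["psychology", "psychology", "social_work", "social_work"]

model = train_multinomial_nb(train_docs, train_labels)
print(classify(model, ["memory", "behavior", "family"]))
```

In this sketch the per-class log likelihoods play the role of the statistical weights mentioned above: a word such as "family" appears in both test contexts, but the words that are frequent in one class and rare in the other decide the label.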

Original language: English
Title of host publication: Academic and Professional Discourse Genres in Spanish
Editors: Giovanni Parodi
Publisher: John Benjamins Publishing Company
Pages: 121-141
Number of pages: 21
ISBN (Electronic): 9789027288257
State: Published - 2010

Publication series

Name: Studies in Corpus Linguistics
Volume: 40
ISSN (Print): 1388-0373
