Automatic text classification of disciplinary texts

Research output: Chapter in Book/Report/Conference proceeding, Chapter, peer-reviewed


This research classifies the academic texts included in the PUCV-2006 Corpus of Spanish using two automatic classification methods and compares their performance. Both methods rely on shared lexical-semantic content words present in the corpus of academic texts. The methods compared are Multinomial Naive Bayes and Support Vector Machine. Each identifies a small group of shared words that, according to their statistical weights, help assign a new text to one of the four disciplinary areas represented in the corpus. The results show that Support Vector Machine classifies academic texts efficiently. Using this method, we were able to automatically identify the disciplinary domain of an academic text from a reduced number of shared content lexemes, with high performance even in highly refined disciplines such as Psychology and Social Work.
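As a rough illustration of the first of the two methods named in the abstract, the following is a minimal Multinomial Naive Bayes sketch restricted to a fixed list of shared content words. It is not the chapter's implementation: the toy documents, the English word list, the function names `train_multinomial_nb` and `classify`, and the Laplace smoothing value `alpha=1.0` are all assumptions made for illustration, and only two of the four disciplinary areas are shown. The Support Vector Machine side of the comparison is not sketched here.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, vocabulary, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods over a fixed vocabulary.

    docs: list of token lists; labels: one class label per document.
    vocabulary: the shared content words used as features (hypothetical here).
    alpha: Laplace smoothing constant (an assumption, not from the chapter).
    """
    vocab = set(vocabulary)
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Count only tokens that belong to the shared vocabulary.
        word_counts[label].update(t for t in tokens if t in vocab)

    model = {}
    for label, n_docs in docs_per_class.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {
                w: math.log((word_counts[label][w] + alpha) / total)
                for w in vocab
            },
        }
    return model


def classify(model, tokens):
    """Return the class with the highest log posterior; unknown words are ignored."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            if t in params["log_like"]:
                score += params["log_like"][t]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented for this sketch (not taken from the PUCV-2006 corpus).
docs = [
    ["patient", "therapy", "behavior", "study"],
    ["cognitive", "therapy", "patient", "test"],
    ["community", "family", "intervention", "study"],
    ["family", "welfare", "community", "policy"],
]
labels = ["Psychology", "Psychology", "Social Work", "Social Work"]
shared_vocab = ["therapy", "patient", "community", "family", "study"]

model = train_multinomial_nb(docs, labels, shared_vocab)
print(classify(model, ["therapy", "patient", "study"]))
print(classify(model, ["family", "community", "unknown"]))
```

The point of the sketch is only that a small shared vocabulary can carry the decision: each word contributes a per-class weight, and words outside the list are skipped.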

Original language: English
Title of host publication: Academic and Professional Discourse Genres in Spanish
Editors: Giovanni Parodi
Publisher: John Benjamins Publishing Company
Number of pages: 21
ISBN (electronic): 9789027288257
Publication status: Published - 2010

Publication series

Name: Studies in Corpus Linguistics
ISSN (print): 1388-0373

