Clasificación de textos académicos en función de su contenido léxico-semántico

Translated title of the contribution: Academic text classification based on lexical-semantic content

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

The aim of this research is to classify the academic texts in the PUCV-2006 Corpus, which belongs to the Fondecyt 1060440 research project, by applying and comparing two automatic classification methods. Both methods rely on shared lexical-semantic content words found in a corpus of academic texts used in four degree programs at the Pontificia Universidad Católica de Valparaíso, Chile. The research corpus currently comprises 652 texts totaling 96,288,874 words. For our purposes, we use a sample of 216 texts (30,886,081 words), distributed as follows: 26 used in Construction Engineering, 31 in Chemistry, 64 in Social Work, and 95 in Psychology. The classification methods compared are Multinomial Naïve Bayes and Support Vector Machine. Both identify a small group of shared words that, according to their statistical weights, make it possible to assign a new text to one of the four disciplinary areas. The results allow us to establish that Support Vector Machine classifies academic texts efficiently, with high precision and recall values. With this method we can automatically identify the disciplinary domain of a new academic text submitted as a query, with high accuracy (93.9%). We plan to use this method as part of a more detailed multidimensional analysis of the PUCV-2006 Corpus.
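As an illustration of the first of the two methods named in the abstract, the following is a minimal pure-Python sketch of a Multinomial Naïve Bayes classifier with Laplace smoothing. It is not the authors' implementation: the function names (`train_multinomial_nb`, `classify`), the smoothing value, and the tiny toy vocabulary are assumptions made for this example only, and the Support Vector Machine is not shown.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents.

    docs   -- list of token lists (content words only)
    labels -- list of class names, one per document
    alpha  -- Laplace smoothing constant (assumed value, not from the paper)
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(n_docs / len(docs)),
            # log P(word | class), smoothed so unseen words get nonzero mass
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model


def classify(model, doc):
    """Return the class with the highest posterior log-score for a token list.

    Words outside the training vocabulary are ignored.
    """
    scores = {}
    for label, params in model.items():
        likelihood = params["log_likelihood"]
        scores[label] = params["log_prior"] + sum(
            likelihood[word] for word in doc if word in likelihood
        )
    return max(scores, key=scores.get)


# Invented toy data: two tiny "texts" per discipline, NOT the PUCV-2006 Corpus.
train_docs = [
    ["concrete", "beam", "load", "structure"],
    ["beam", "load", "steel", "concrete"],
    ["molecule", "reaction", "acid", "solution"],
    ["reaction", "acid", "compound", "molecule"],
    ["community", "family", "intervention", "welfare"],
    ["family", "welfare", "community", "policy"],
    ["behavior", "cognition", "therapy", "patient"],
    ["cognition", "therapy", "behavior", "memory"],
]
train_labels = [
    "Construction Engineering", "Construction Engineering",
    "Chemistry", "Chemistry",
    "Social Work", "Social Work",
    "Psychology", "Psychology",
]

model = train_multinomial_nb(train_docs, train_labels)
predicted = classify(model, ["acid", "reaction", "solution"])
print(predicted)
```

In this sketch, the per-class word log-likelihoods play the role of the statistical weights mentioned in the abstract: a new text is assigned to the discipline whose weighted content words best match it.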

Original language: Spanish
Pages (from-to): 239-271
Number of pages: 33
Journal: Revista Signos
Volume: 40
Issue number: 63
State: Published - 2007
