The aim of this research is to classify, using and comparing two automatic classification methods, the academic texts included in the PUCV-2006 Corpus belonging to the Fondecyt 1060440 research project. The methods are based on shared lexical-semantic content words present in a corpus of academic texts used in four professional carriers at the Pontificia Universidad Católica de Valparaíso, Chile. The research corpus, nowadays, is constituted by 652 texts with 96.288.874 words. For our purposes, we use a sample of 216 texts (30.886.081 words) divided, as following: 26 used in Construction Engineering, 31 used in Chemistry, 64 used Social Work, and 95 used in Psychology. The classification methods compared in this research are Multinomial Naïve Bayes and Support Vector Machine, both permits to identify a small group of shared words that permit, according statistical weights, to classify a new text into the four disciplinary areas. The results allows us to establish that Support Vector Machine classify in a efficient way academic texts, with high precision and recall values. With this method we are able to identify automatically the disciplinary domain, with a high percentage of accuracy (93,9%), of a new academic text in a query. We project to use this method as part of a more detailed multidimensional analysis of the PUCV-2006 Corpus.
|Translated title of the contribution||Academic text classification based on lexical-semantic content|
|Number of pages||33|
|State||Published - 1 Dec 2007|