TY - GEN
T1 - A taxonomy of Spanish nouns, a statistical algorithm to generate it and its implementation in open source code
AU - Nazar, Rogelio
AU - Renau, Irene
N1 - Funding Information:
This research is supported by two grants from the Chilean Government: Conicyt-Fondecyt 11140686, “Inducción automática de taxonomías de sustantivos generales y espe-cializados a partir de corpus textuales desde el enfoque de la lingüística cuantitativa” (lead researcher: Rogelio Nazar) and Conicyt-Fondecyt 11140704, “Detección automática del significado de los verbos del castellano por medio de patrones sintáctico-semánticos extraídos con estadística de corpus” (lead researcher: Irene Renau). We would like to thank the anonymous reviewers of the paper for their useful comments and the students who helped us with the evaluation.
PY - 2016
Y1 - 2016
N2 - In this paper we describe our work in progress in the automatic development of a taxonomy of Spanish nouns, we offer the Perl implementation we have so far, and we discuss the different problems that still need to be addressed. We designed a statistically-based taxonomy induction algorithm consisting of a combination of different strategies not involving explicit linguistic knowledge. Being all quantitative, the strategies we present are however of different nature. Some of them are based on the computation of distributional similarity coefficients which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision making process is then applied to combine the results of the previous steps, and finally connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and also using Spanish Wordnet as a gold-standard. We estimate an average of 89.07% precision and 25.49% recall considering only the results which the algorithm presents with high degree of certainty, or 77.86% precision and 33.72% recall considering all results.
AB - In this paper we describe our work in progress in the automatic development of a taxonomy of Spanish nouns, we offer the Perl implementation we have so far, and we discuss the different problems that still need to be addressed. We designed a statistically-based taxonomy induction algorithm consisting of a combination of different strategies not involving explicit linguistic knowledge. Being all quantitative, the strategies we present are however of different nature. Some of them are based on the computation of distributional similarity coefficients which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision making process is then applied to combine the results of the previous steps, and finally connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and also using Spanish Wordnet as a gold-standard. We estimate an average of 89.07% precision and 25.49% recall considering only the results which the algorithm presents with high degree of certainty, or 77.86% precision and 33.72% recall considering all results.
KW - Corpus statistics
KW - Distributional semantics
KW - Spanish
KW - Taxonomy induction
UR - http://www.scopus.com/inward/record.url?scp=85037176513&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85037176513
T3 - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
SP - 1485
EP - 1492
BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Mazo, Helene
A2 - Moreno, Asuncion
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Grobelnik, Marko
A2 - Odijk, Jan
A2 - Piperidis, Stelios
A2 - Maegaard, Bente
A2 - Mariani, Joseph
PB - European Language Resources Association (ELRA)
T2 - 10th International Conference on Language Resources and Evaluation, LREC 2016
Y2 - 23 May 2016 through 28 May 2016
ER -