TY - JOUR
T1 - Automatic taxonomy extraction for specialized domains using distributional semantics
AU - Nazar, Rogelio
AU - Vivaldi, Jorge
AU - Wanner, Leo
N1 - Funding Information:
This paper was made possible by funding granted to the first author by the project RICOTERM3 (Ministry of Science and Innovation, Spain. Ref: HUM2007-65966-C02–01/FILO; lead researcher: Dr. Mercè Lorente). We would like to thank the two anonymous reviewers for their detailed remarks and suggestions for improving the paper. Thanks are also extended to Aaron Feder for his proofreading.
PY - 2012
Y1 - 2012
AB - This article explores a statistical, language-independent methodology for the construction of taxonomies of specialized domains from noisy corpora. In contrast to proposals that exploit linguistic information by searching for lexicosyntactic patterns that tend to express the hypernymy relation, our methodology relies entirely upon the distributional semantics of terms as captured by their lexical co-occurrence in large-scale corpora. In a first stage, we analyze the syntagmatic relations of terms that serve as seeds of the taxonomy to be constructed, and thus obtain the first batch of hypernym candidate terms for our seed terms. In a second stage, we analyze the paradigmatic relations of the terms by inspecting which terms show a prominent frequency of co-occurrence with the terms that, as we found in the previous stage, are syntagmatically related to our seed terms, which allows us to refine the first batch of hypernym candidate terms and obtain new ones. In a third and final stage, we build a taxonomy from the obtained hypernym candidate lists, exploiting the asymmetric statistical association between terms that is characteristic of the hypernymy relation.
KW - Distributional semantics
KW - Quantitative linguistics
KW - Taxonomy extraction
KW - Terminology extraction
UR - http://www.scopus.com/inward/record.url?scp=84866065858&partnerID=8YFLogxK
U2 - 10.1075/term.18.2.03naz
DO - 10.1075/term.18.2.03naz
M3 - Article
AN - SCOPUS:84866065858
SN - 0929-9971
VL - 18
SP - 188
EP - 225
JO - Terminology
JF - Terminology
IS - 2
ER -