TY - GEN
T1 - A suite to compile and analyze an LSP corpus
AU - Nazar, Rogelio
AU - Vivaldi, Jorge
AU - Cabré, Teresa
N1 - Funding Information:
The authors would like to thank the anonymous reviewers for their valuable comments. This paper was made possible by the ADQUA scholarship granted to the first author by the Government of Catalonia, Spain, under resolution UNI/772/2003.
PY - 2008
Y1 - 2008
AB - This paper presents a series of tools for extracting specialized corpora from the web and subsequently analyzing them, mainly with statistical techniques. The tools form an integrated system of original and standard components, with a modular design that facilitates its integration into different systems. The first part of the paper describes the original techniques, which categorize documents as relevant or irrelevant to the corpus under construction, where a relevant document is a specialized document of the selected technical domain. Evaluation figures are provided for this original part, but not for the second part, the analysis of the corpus, which relies on algorithms that are well known in the field of Natural Language Processing, such as KWIC search, measures of vocabulary richness, and the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution or similarity.
UR - http://www.scopus.com/inward/record.url?scp=84984688452&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84984688452
T3 - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
SP - 1164
EP - 1169
BT - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
PB - European Language Resources Association (ELRA)
T2 - 6th International Conference on Language Resources and Evaluation, LREC 2008
Y2 - 28 May 2008 through 30 May 2008
ER -