A suite to compile and analyze an LSP corpus

Rogelio Nazar, Jorge Vivaldi, Teresa Cabré

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

4 Scopus citations

Abstract

This paper presents a series of tools for the extraction of specialized corpora from the web and its subsequent analysis mainly with statistical techniques. It is an integrated system of original as well as standard tools and has a modular conception that facilitates its re-integration on different systems. The first part of the paper describes the original techniques, which are devoted to the categorization of documents as relevant or irrelevant to the corpus under construction, considering relevant a specialized document of the selected technical domain. Evaluation figures are provided for the original part, but not for the second part involving the analysis of the corpus, which is composed of algorithms that are well known in the field of Natural Language Processing, such as Kwic search, measures of vocabulary richness, the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution or similarity.

Original languageEnglish
Title of host publicationProceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
PublisherEuropean Language Resources Association (ELRA)
Pages1164-1169
Number of pages6
ISBN (Electronic)2951740840, 9782951740846
StatePublished - 2008
Externally publishedYes
Event6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 28 May 200830 May 2008

Publication series

NameProceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

Conference

Conference6th International Conference on Language Resources and Evaluation, LREC 2008
Country/TerritoryMorocco
CityMarrakech
Period28/05/0830/05/08

Fingerprint

Dive into the research topics of 'A suite to compile and analyze an LSP corpus'. Together they form a unique fingerprint.

Cite this