Two-step flow in bilingual lexicon extraction from unrelated corpora

Rogelio Nazar, Leo Wanner, Jorge Vivaldi

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Scopus citations

Abstract

This paper presents a language-independent methodology for automatically extracting bilingual lexicon entries from the web without the need for resources such as parallel or comparable corpora, POS tagging, or an initial bilingual lexicon. It is suitable for specialized domains where bilingual lexicon entries are scarce. The input to the process is a source-language corpus, which serves as an example of real usage of the units to be translated. The process is a two-step flow: first we extract single-word units from the source-language corpus, and then the multi-word units in which those single-word units are instantiated. For each multi-word unit, we check whether it appears in target-language texts on the web. The target-language unit that appears most frequently across the sets of multi-word units is usually the correct translation of the initial single-word source-language entry.
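The final selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tokenize` and `rank_candidates`, the toy data, and the choice to count each candidate once per multi-word unit are all assumptions made for illustration. The sketch assumes the target-language web texts have already been retrieved for each multi-word unit.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters (Unicode-aware).
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_candidates(target_texts_by_unit):
    """Rank target-language words by the number of multi-word units
    whose retrieved texts contain them (assumed scoring, see lead-in)."""
    counts = Counter()
    for unit, texts in target_texts_by_unit.items():
        seen = set()
        for text in texts:
            seen.update(tokenize(text))
        # Each word is counted at most once per multi-word unit.
        counts.update(seen)
    return counts.most_common()


# Toy data (invented): multi-word units containing the source word "gene",
# each mapped to target-language snippets assumed to come from the web.
toy = {
    "gene expression": ["la expresión del gen en células"],
    "gene therapy": ["terapia con gen modificado"],
    "gene mutation": ["una mutación del gen"],
}

ranking = rank_candidates(toy)
print(ranking[0])  # ('gen', 3)
```

In this toy case the word shared by all three sets of texts ranks first, which mirrors the abstract's observation that the most frequent target-language unit across the multi-word units is usually the correct translation.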

Original language: English
Title of host publication: Proceedings of the 12th European Association for Machine Translation Conference, EAMT 2008
Pages: 140-149
Number of pages: 10
State: Published - 2008
Externally published: Yes
Event: 12th Conference of the European Association for Machine Translation, EAMT 2008 - Hamburg, Germany
Duration: 22 Sep 2008 to 23 Sep 2008

Publication series

Name: Proceedings of the 12th European Association for Machine Translation Conference, EAMT 2008

Conference

Conference: 12th Conference of the European Association for Machine Translation, EAMT 2008
Country/Territory: Germany
City: Hamburg
Period: 22/09/08 to 23/09/08

Keywords

  • Bilingual lexicon extraction
  • Corpus linguistics
  • Knowledge-poor methods
  • Machine translation
  • Specialized terminology
  • Statistical methods
