Brains, not brawn: The use of "smart" comparable corpora in bilingual terminology mining

Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, Kyo Kageura

Research output: Contribution to journal › Article

9 Citations (Scopus)

Abstract

Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.

Original language: English
Article number: 1
Journal: ACM Transactions on Speech and Language Processing
Volume: 7
Issue number: 1
DOI: 10.1145/1839478.1839479
Publication status: Published - Aug 2010

Keywords

  • Comparable corpora
  • Lexical alignment
  • Terminology mining

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computational Mathematics

Cite this

Brains, not brawn: The use of "smart" comparable corpora in bilingual terminology mining. / Morin, Emmanuel; Daille, Béatrice; Takeuchi, Koichi; Kageura, Kyo.

In: ACM Transactions on Speech and Language Processing, Vol. 7, No. 1, 1, 08.2010.

@article{7d747924942946939ce0ca9613552340,
title = "Brains, not brawn: The use of {"}smart{"} comparable corpora in bilingual terminology mining",
abstract = "Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-{\`a}-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.",
keywords = "Comparable corpora, Lexical alignment, Terminology mining",
author = "Emmanuel Morin and B{\'e}atrice Daille and Koichi Takeuchi and Kyo Kageura",
year = "2010",
month = "8",
doi = "10.1145/1839478.1839479",
language = "English",
volume = "7",
journal = "ACM Transactions on Speech and Language Processing",
issn = "1550-4875",
publisher = "Association for Computing Machinery (ACM)",
number = "1",
}

TY - JOUR

T1 - Brains, not brawn

T2 - The use of "smart" comparable corpora in bilingual terminology mining

AU - Morin, Emmanuel

AU - Daille, Béatrice

AU - Takeuchi, Koichi

AU - Kageura, Kyo

PY - 2010/8

Y1 - 2010/8

N2 - Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.

AB - Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.

KW - Comparable corpora

KW - Lexical alignment

KW - Terminology mining

UR - http://www.scopus.com/inward/record.url?scp=77958030314&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77958030314&partnerID=8YFLogxK

U2 - 10.1145/1839478.1839479

DO - 10.1145/1839478.1839479

M3 - Article

AN - SCOPUS:77958030314

VL - 7

JO - ACM Transactions on Speech and Language Processing

JF - ACM Transactions on Speech and Language Processing

SN - 1550-4875

IS - 1

M1 - 1

ER -