Bio-medical entity extraction using support vector machines

Koichi Takeuchi, Nigel Collier

Research output: Contribution to journal › Article

44 Citations (Scopus)

Abstract

Objective: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. Methods and materials: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstracts taken from the {human, blood cell, transcription factor} domain of MEDLINE. Results: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. Conclusion: Our experiments indicate a relationship between feature window size and the amount of training data, and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.
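
The abstract mentions surface-word and orthographic features taken over a feature window. The following is a minimal Python sketch of what such window-based token features might look like. It is not the authors' implementation: the function names, the orthographic classes, the padding symbol and the example tokens are all assumptions made for illustration, and head noun features are left out.

```python
import re


def orthographic_class(token: str) -> str:
    """Map a token to a coarse orthographic class (hypothetical class set)."""
    if re.fullmatch(r"\d+", token):
        return "DIGITS"
    if re.fullmatch(r"[A-Z]+", token):
        return "ALL_CAPS"
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        return "ALNUM"
    if "-" in token:
        return "HYPHEN"
    if token[:1].isupper():
        return "INIT_CAP"
    if token.islower():
        return "LOWER"
    return "OTHER"


def window_features(tokens, i, window=2):
    """Collect surface-word and orthographic features for token i and its
    neighbours within +/- `window` positions (assumed feature layout)."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word, orth = tokens[j].lower(), orthographic_class(tokens[j])
        else:
            # Positions outside the sentence get a padding value.
            word, orth = "<PAD>", "PAD"
        feats[f"w[{offset}]"] = word
        feats[f"o[{offset}]"] = orth
    return feats


# Illustrative tokens only; not taken from the annotated corpus.
tokens = ["NF-kappa", "B", "activates", "IL-2", "transcription"]
features = window_features(tokens, 3, window=1)
```

Feature dictionaries of this kind would typically be one-hot encoded before being passed to an SVM classifier. Widening `window` adds more distinct features per token, which is one way to read the abstract's remark about window size and the amount of training data.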

Original language: English
Pages (from-to): 125-137
Number of pages: 13
Journal: Artificial Intelligence in Medicine
Volume: 33
Issue number: 2
DOI: 10.1016/j.artmed.2004.07.019
Publication status: Published - Feb 2005

Keywords

  • Machine learning
  • MEDLINE
  • Multi-classifier
  • Named entity
  • Natural language processing
  • Support vector machines
  • Text mining

ASJC Scopus subject areas

  • Artificial Intelligence
  • Medicine(all)

Cite this

Bio-medical entity extraction using support vector machines. / Takeuchi, Koichi; Collier, Nigel.

In: Artificial Intelligence in Medicine, Vol. 33, No. 2, 02.2005, p. 125-137.

@article{d9081101ad844da690db63e970b50cab,
title = "Bio-medical entity extraction using support vector machines",
abstract = "Objective: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. Methods and materials: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstracts taken from the {human, blood cell, transcription factor} domain of MEDLINE. Results: Approximately 3400 terms are annotated and the model performs at about 74{\%} F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. Conclusion: Our experiments indicate a relationship between feature window size and the amount of training data, and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.",
keywords = "Machine learning, MEDLINE, Multi-classifier, Named entity, Natural language processing, Support vector machines, Text mining",
author = "Koichi Takeuchi and Nigel Collier",
year = "2005",
month = "2",
doi = "10.1016/j.artmed.2004.07.019",
language = "English",
volume = "33",
pages = "125--137",
journal = "Artificial Intelligence in Medicine",
issn = "0933-3657",
publisher = "Elsevier",
number = "2",

}

TY  - JOUR
T1  - Bio-medical entity extraction using support vector machines
AU  - Takeuchi, Koichi
AU  - Collier, Nigel
PY  - 2005/2
Y1  - 2005/2
AB  - Objective: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. Methods and materials: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstracts taken from the {human, blood cell, transcription factor} domain of MEDLINE. Results: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. Conclusion: Our experiments indicate a relationship between feature window size and the amount of training data, and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.
KW  - Machine learning
KW  - MEDLINE
KW  - Multi-classifier
KW  - Named entity
KW  - Natural language processing
KW  - Support vector machines
KW  - Text mining
UR  - http://www.scopus.com/inward/record.url?scp=16244362685&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=16244362685&partnerID=8YFLogxK
U2  - 10.1016/j.artmed.2004.07.019
DO  - 10.1016/j.artmed.2004.07.019
M3  - Article
VL  - 33
SP  - 125
EP  - 137
JO  - Artificial Intelligence in Medicine
JF  - Artificial Intelligence in Medicine
SN  - 0933-3657
IS  - 2
ER  - 