Cost evaluation of CRF-based bibliography extraction from reference strings

Naomichi Kawakami, Manabu Ohta, Atsuhiro Takasu, Jun Adachi

Research output: Chapter in Book/Report/Conference proceeding > Chapter

Abstract

The effective use of digital libraries demands maintenance of bibliographic databases. In particular, the reference fields of academic papers are rich in useful bibliographic information such as authors' names and paper titles. We therefore propose a method of automatically extracting bibliographic information from reference strings using a conditional random field (CRF). However, at least a few hundred reference strings are necessary to train the CRF to a high extraction accuracy. In this paper, we propose the use of active sampling and pseudo-training data to reduce the amount of training data required, and we evaluate the associated training costs experimentally.
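
The abstract does not state how strings are chosen during active sampling. As a minimal sketch only, assuming a common least-confidence criterion and a hypothetical mapping from reference-string ids to the confidence a trained CRF assigns to its best labelling, the selection step could look like this (the function name, ids, and scores are illustrative and not taken from the paper):

```python
import heapq


def select_for_annotation(confidences, k):
    """Return the ids of the k reference strings with the lowest confidence.

    `confidences` is a hypothetical dict mapping a reference-string id to the
    model's confidence in its own labelling (assumed to lie in [0, 1]).
    The least-confidence criterion is an assumption, not the paper's method.
    """
    lowest = heapq.nsmallest(k, confidences.items(), key=lambda item: item[1])
    return [ref_id for ref_id, _ in lowest]


# Illustrative scores only; these are not results from the paper.
example_confidences = {"r1": 0.95, "r2": 0.40, "r3": 0.72, "r4": 0.55}
to_label = select_for_annotation(example_confidences, 2)
```

Under this assumption, only the selected strings would be labelled by hand and added to the training set before retraining, which is how such a scheme would reduce the annotation cost.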

Original language: English
Title of host publication: The Emergence of Digital Libraries - Research and Practices - 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014, Proceedings
Publisher: Springer Verlag
Pages: 268-278
Number of pages: 11
Volume: 8839
ISBN (Print): 9783319128221
Publication status: Published - 2014
Event: 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014 - Chiang Mai, Thailand
Duration: Nov 5 2014 to Nov 7 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8839
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014
Country: Thailand
City: Chiang Mai
Period: 11/5/14 to 11/7/14

Keywords

  • Active sampling
  • CRF
  • Information extraction
  • Pseudo-training data
  • Reference string

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Kawakami, N., Ohta, M., Takasu, A., & Adachi, J. (2014). Cost evaluation of CRF-based bibliography extraction from reference strings. In The Emergence of Digital Libraries - Research and Practices - 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014, Proceedings (Vol. 8839, pp. 268-278). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8839). Springer Verlag.

@inbook{2d12edc77382466791b3617b8db164da,
title = "Cost evaluation of CRF-based bibliography extraction from reference strings",
keywords = "Active sampling, CRF, Information extraction, Pseudo-training data, Reference string",
author = "Naomichi Kawakami and Manabu Ohta and Atsuhiro Takasu and Jun Adachi",
year = "2014",
language = "English",
isbn = "9783319128221",
volume = "8839",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "268--278",
booktitle = "The Emergence of Digital Libraries - Research and Practices - 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014, Proceedings",
}

TY - CHAP
T1 - Cost evaluation of CRF-based bibliography extraction from reference strings
AU - Kawakami, Naomichi
AU - Ohta, Manabu
AU - Takasu, Atsuhiro
AU - Adachi, Jun
PY - 2014
Y1 - 2014
KW - Active sampling
KW - CRF
KW - Information extraction
KW - Pseudo-training data
KW - Reference string
UR - http://www.scopus.com/inward/record.url?scp=84909643322&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84909643322&partnerID=8YFLogxK
M3 - Chapter
SN - 9783319128221
VL - 8839
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 268
EP - 278
BT - The Emergence of Digital Libraries - Research and Practices - 16th International Conference on Asia-Pacific Digital Libraries, ICADL 2014, Proceedings
PB - Springer Verlag
ER -