Error detection of CRF-based bibliography extraction from reference strings

Manabu Ohta, Daiki Arauchi, Atsuhiro Takasu, Jun Adachi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We previously proposed a method for parsing the reference strings usually listed at the end of research papers in order to extract important bibliographic elements, such as the title, from them. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although we achieved reasonable parsing accuracy for a Japanese academic journal, errors are inevitable. Therefore, this paper proposes confidence measures for CRF-based bibliography parsing that can be used to detect such parsing errors. This paper also reports an empirical evaluation of the proposed parsing on the basis not only of its accuracy but also of how easily its errors can be detected. The experiments showed that the proposed measures indicated parsing errors reasonably well and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.
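The idea of scoring a parse by its confidence and routing low-scoring references to manual post-editing can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the confidence measures, so the two scores here (minimum and mean of the per-token maximum marginal probability) are stand-ins, and the function names, the 0.8 threshold, and the example marginals are all hypothetical rather than taken from the paper.

```python
from typing import Dict, List

# One label -> probability map per token, as a CRF's marginal inference
# might produce. The numbers used below are invented for illustration.
Marginals = List[Dict[str, float]]


def predict_labels(marginals: Marginals) -> List[str]:
    """Pick the most probable label for each token."""
    return [max(dist, key=dist.get) for dist in marginals]


def min_marginal_confidence(marginals: Marginals) -> float:
    """Confidence of the weakest token decision (assumed measure)."""
    return min(max(dist.values()) for dist in marginals)


def mean_marginal_confidence(marginals: Marginals) -> float:
    """Average confidence over all token decisions (assumed measure)."""
    return sum(max(dist.values()) for dist in marginals) / len(marginals)


def flag_for_review(parsed: Dict[str, Marginals], threshold: float = 0.8) -> List[str]:
    """Return ids of reference strings whose weakest token falls below threshold."""
    return [
        ref_id
        for ref_id, marginals in parsed.items()
        if min_marginal_confidence(marginals) < threshold
    ]


# Hypothetical marginals for two three-token reference strings.
EXAMPLE: Dict[str, Marginals] = {
    "ref-1": [
        {"author": 0.97, "title": 0.03},
        {"title": 0.95, "author": 0.05},
        {"year": 0.99, "title": 0.01},
    ],
    "ref-2": [
        {"author": 0.96, "title": 0.04},
        {"title": 0.55, "journal": 0.45},  # ambiguous token
        {"year": 0.98, "title": 0.02},
    ],
}

if __name__ == "__main__":
    for ref_id, marginals in EXAMPLE.items():
        print(ref_id, predict_labels(marginals), min_marginal_confidence(marginals))
    print("needs manual check:", flag_for_review(EXAMPLE))
```

Raising the threshold sends more references to manual checking and lowering it sends fewer, which is the trade-off between extraction quality and post-editing cost that the abstract describes.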

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 229-238
Number of pages: 10
Volume: 7634 LNCS
DOIs: https://doi.org/10.1007/978-3-642-34752-8_29
Publication status: Published - 2012
Event: 14th International Conference on Asia-Pacific Digital Libraries, ICADL 2012 - Taipei, Taiwan, Province of China
Duration: Nov 12 2012 → Nov 15 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7634 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 14th International Conference on Asia-Pacific Digital Libraries, ICADL 2012
Country: Taiwan, Province of China
City: Taipei
Period: 11/12/12 → 11/15/12

Keywords

  • bibliography extraction
  • conditional random field (CRF)
  • confidence measure
  • digital library
  • error detection
  • reference string

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Ohta, M., Arauchi, D., Takasu, A., & Adachi, J. (2012). Error detection of CRF-based bibliography extraction from reference strings. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7634 LNCS, pp. 229-238). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7634 LNCS). https://doi.org/10.1007/978-3-642-34752-8_29

@inproceedings{8017c67c5c41402b81336d65d1386b5b,
title = "Error detection of CRF-based bibliography extraction from reference strings",
keywords = "bibliography extraction, conditional random field (CRF), confidence measure, digital library, error detection, reference string",
author = "Manabu Ohta and Daiki Arauchi and Atsuhiro Takasu and Jun Adachi",
year = "2012",
doi = "10.1007/978-3-642-34752-8_29",
language = "English",
isbn = "9783642347511",
volume = "7634 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "229--238",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - Error detection of CRF-based bibliography extraction from reference strings
AU - Ohta, Manabu
AU - Arauchi, Daiki
AU - Takasu, Atsuhiro
AU - Adachi, Jun
PY - 2012
Y1 - 2012

AB - We proposed a parsing method for reference strings usually listed at the end of research papers to extract important bibliographies such as a title from them. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although we achieved reasonable parsing accuracies for a Japanese academic journal, errors are inevitable. Therefore, this paper proposes ways to increase confidence for CRF-based bibliography parsing to detect such parsing errors. This paper also reports an empirical evaluation of the proposed parsing on the basis not only of its accuracies but also of how easy it is to detect errors. The experiments showed that the proposed measures reasonably indicated parsing errors and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.

KW - bibliography extraction
KW - conditional random field (CRF)
KW - confidence measure
KW - digital library
KW - error detection
KW - reference string
UR - http://www.scopus.com/inward/record.url?scp=84869046469&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869046469&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-34752-8_29
DO - 10.1007/978-3-642-34752-8_29
M3 - Conference contribution
SN - 9783642347511
VL - 7634 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 229
EP - 238
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -