Empirical evaluation of CRF-based bibliography extraction from research papers

Manabu Ohta, Ryohei Inoue, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We proposed an automatic bibliography extraction method for research papers scanned with OCR markup. The method uses conditional random fields (CRF) to label serially OCRed text lines in the article title page as appropriate bibliographic element names. Although we achieved good extraction accuracies for some Japanese academic journals, extraction errors are inevitable. Therefore, this paper proposes three confidence measures for bibliography labeling to detect such extraction errors. This paper also reports an empirical evaluation of CRF-based page analysis for research papers on the basis not only of labeling accuracy but also of labeling error detection. We applied the three confidence measures to labeling three academic journals published in Japan. The experiments showed that the proposed confidence measures reasonably indicated the labeling accuracies and could be used for error detection.
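
The abstract does not say how the three confidence measures are defined. As a hedged illustration of the general idea only, the sketch below computes one common kind of confidence for a linear-chain model: the per-line marginal probability of the chosen label, obtained with the forward-backward algorithm. The label names, scores, threshold, and function names are all hypothetical and are not taken from the paper.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def marginals(emit, trans):
    """Per-line label marginals of a linear-chain model.

    emit[t][y]  : log-score of label y for text line t (assumed given)
    trans[a][b] : log-score of moving from label a to label b (assumed given)
    """
    n, k = len(emit), len(emit[0])
    # Forward pass: total log-score of all label prefixes ending in y at line t.
    fwd = [list(emit[0])]
    for t in range(1, n):
        fwd.append([
            emit[t][y] + logsumexp([fwd[t - 1][a] + trans[a][y] for a in range(k)])
            for y in range(k)
        ])
    # Backward pass: total log-score of all label suffixes following y at line t.
    bwd = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [
            logsumexp([trans[y][b] + emit[t + 1][b] + bwd[t + 1][b] for b in range(k)])
            for y in range(k)
        ]
    log_z = logsumexp(fwd[-1])  # log of the partition function
    return [
        [math.exp(fwd[t][y] + bwd[t][y] - log_z) for y in range(k)]
        for t in range(n)
    ]

def flag_low_confidence(emit, trans, labels, threshold=0.8):
    """Label each line by its highest marginal and flag lines below threshold."""
    report = []
    for t, probs in enumerate(marginals(emit, trans)):
        best = max(range(len(probs)), key=probs.__getitem__)
        report.append((t, labels[best], probs[best], probs[best] < threshold))
    return report

# Hypothetical example: two OCRed lines, three candidate labels.
LABELS = ["title", "author", "abstract"]
EMIT = [[3.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
TRANS = [[0.0] * 3 for _ in range(3)]  # uniform transitions for simplicity

report = flag_low_confidence(EMIT, TRANS, LABELS)
for line_no, label, prob, suspicious in report:
    print(line_no, label, round(prob, 3), "CHECK" if suspicious else "ok")
```

Here each line is labeled by its highest marginal (posterior decoding) rather than by the single best label sequence, and a line whose marginal falls below the threshold is marked as a candidate labeling error for manual review.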

Original language: English
Title of host publication: Proceedings of the IADIS International Conference Information Systems 2012, IS 2012
Publisher: IADIS
Pages: 18-26
Number of pages: 9
ISBN (Print): 9789728939687
Publication status: Published - 2012
Event: IADIS International Conference on Information Systems 2012, IS 2012 - Berlin, Germany
Duration: Mar 10 2012 to Mar 12 2012

Other

Other: IADIS International Conference on Information Systems 2012, IS 2012
Country: Germany
City: Berlin
Period: 3/10/12 to 3/12/12

Keywords

  • Bibliography Extraction
  • Conditional Random Fields (CRF)
  • Digital Library
  • Error Detection
  • OCR

ASJC Scopus subject areas

  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Ohta, M., Inoue, R., & Takasu, A. (2012). Empirical evaluation of CRF-based bibliography extraction from research papers. In Proceedings of the IADIS International Conference Information Systems 2012, IS 2012 (pp. 18-26). IADIS.

@inproceedings{80dac91e41f64e70b1aca05cd00d1e45,
title = "Empirical evaluation of CRF-based bibliography extraction from research papers",
keywords = "Bibliography Extraction, Conditional Random Fields (CRF), Digital Library, Error Detection, OCR",
author = "Manabu Ohta and Ryohei Inoue and Atsuhiro Takasu",
year = "2012",
language = "English",
isbn = "9789728939687",
pages = "18--26",
booktitle = "Proceedings of the IADIS International Conference Information Systems 2012, IS 2012",
publisher = "IADIS",

}

TY - GEN
T1 - Empirical evaluation of CRF-based bibliography extraction from research papers
AU - Ohta, Manabu
AU - Inoue, Ryohei
AU - Takasu, Atsuhiro
PY - 2012
Y1 - 2012
KW - Bibliography Extraction
KW - Conditional Random Fields (CRF)
KW - Digital Library
KW - Error Detection
KW - OCR
UR - http://www.scopus.com/inward/record.url?scp=84869037046&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869037046&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9789728939687
SP - 18
EP - 26
BT - Proceedings of the IADIS International Conference Information Systems 2012, IS 2012
PB - IADIS
ER -