Bibliographic element extraction from scanned documents using conditional random fields

Manabu Ohta, Takayuki Yakushi, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)

Abstract

Bibliographic databases are indispensable to digital libraries for academic articles. However, extracting bibliographic elements from printed documents requires a lot of human intervention; it is not cost-effective, even when using various document image-processing techniques such as optical character recognition (OCR). In this paper, we propose an automatic bibliographic element extraction method for academic articles scanned with OCR markup. The proposed method first labels text blocks as predetermined bibliographic elements and then further labels the characters in each labeled text block if necessary. The second labeling enables us to extract each author's name from the authors' text block. The method uses conditional random fields (CRF) for labeling both text blocks and the characters in them. We applied the method to Japanese academic articles. The experiments showed that the proposed text block labeling correctly extracted all the predefined bibliographic elements from more than 97% of the articles; the proposed character labeling also correctly extracted all the author name strings from more than 99% of the authors' text blocks in Japanese.
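The second-stage character labeling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the two labels (`N` for a character inside an author name, `S` for a separator), the hand-set emission and transition scores, and the function names `emission`, `viterbi`, and `extract_names` are all assumptions made for illustration, and a real CRF would learn its weights from labeled training data. The sketch only shows how linear-chain decoding (Viterbi) over characters can split an authors' text block into individual names, using a romanized author line rather than Japanese text.

```python
from typing import Dict, List

# Hypothetical label set: N = character inside an author name, S = separator.
LABELS = ["N", "S"]

# Hand-set transition scores (assumed, not learned): staying in the same
# label is rewarded, switching labels is penalized.
TRANSITION = {
    ("N", "N"): 1.0, ("S", "S"): 1.0,
    ("N", "S"): -1.0, ("S", "N"): -1.0,
}


def emission(ch: str) -> Dict[str, float]:
    """Hand-set per-character scores; a trained CRF would learn these."""
    if ch in ",;":
        return {"N": -5.0, "S": 5.0}   # punctuation is almost surely a separator
    if ch.isspace():
        return {"N": 0.0, "S": 0.5}    # spaces are ambiguous; context decides
    return {"N": 5.0, "S": -5.0}       # other characters belong to a name


def viterbi(text: str) -> List[str]:
    """Return the highest-scoring label sequence for the characters of text."""
    if not text:
        return []
    scores = [emission(text[0])]
    backpointers = []
    for ch in text[1:]:
        emit, prev = emission(ch), scores[-1]
        cur, ptr = {}, {}
        for y in LABELS:
            # Best previous label for ending in label y at this character.
            best = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, y)])
            cur[y] = prev[best] + TRANSITION[(best, y)] + emit[y]
            ptr[y] = best
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda y: scores[-1][y])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def extract_names(text: str) -> List[str]:
    """Collect maximal runs of characters labeled N as author names."""
    names, current = [], []
    for ch, label in zip(text, viterbi(text)):
        if label == "N":
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names


print(extract_names("Manabu Ohta, Takayuki Yakushi, Atsuhiro Takasu"))
# -> ['Manabu Ohta', 'Takayuki Yakushi', 'Atsuhiro Takasu']
```

In this sketch the space inside a name is kept because both of its neighbors are name characters, while the space after a comma is absorbed into the separator. This kind of context-dependent decision is what sequence labeling provides over classifying each character in isolation.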

Original language: English
Title of host publication: 3rd International Conference on Digital Information Management, ICDIM 2008
Pages: 99-104
Number of pages: 6
Publication status: Published - Dec 1 2008
Event: 3rd International Conference on Digital Information Management, ICDIM 2008 - London, United Kingdom
Duration: Nov 13 2008 to Nov 16 2008

Publication series

Name: 3rd International Conference on Digital Information Management, ICDIM 2008


ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
