Bibliographic element extraction from scanned documents using conditional random fields

Manabu Ohta, Takayuki Yakushi, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)

Abstract

Bibliographic databases are indispensable to digital libraries for academic articles. However, extracting bibliographic elements from printed documents requires a lot of human intervention; it is not cost-effective, even when using various document image-processing techniques such as optical character recognition (OCR). In this paper, we propose an automatic bibliographic element extraction method for academic articles scanned with OCR markup. The proposed method first labels text blocks as predetermined bibliographic elements and then further labels the characters in each labeled text block if necessary. The second labeling enables us to extract each author's name from the authors' text block. The method uses conditional random fields (CRF) for labeling both text blocks and the characters in them. We applied the method to Japanese academic articles. The experiments showed that the proposed text block labeling correctly extracted all the predefined bibliographic elements from more than 97% of the articles; the proposed character labeling also correctly extracted all the author name strings from more than 99% of the authors' text blocks in Japanese.
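The two-stage labeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature functions, the weights, the label sets, and the sample blocks are all hypothetical and hand-set for illustration, whereas the paper trains CRFs on Japanese articles. The sketch only shows the decoding side of a linear-chain model (Viterbi search), applied first to text blocks and then to the characters of the block labeled as authors.

```python
# Toy two-stage labeling with a hand-weighted linear-chain model.
# All weights, features, and labels below are illustrative assumptions,
# not values learned by (or taken from) the paper's CRFs.

def viterbi(obs, labels, emit, trans):
    """Return the highest-scoring label sequence for obs.

    emit(x, y) scores label y for observation x;
    trans(p, y) scores moving from label p to label y.
    """
    if not obs:
        return []
    # best[i][y] = (score of the best path ending in y at i, previous label)
    best = [{y: (emit(obs[0], y), None) for y in labels}]
    for x in obs[1:]:
        prev = best[-1]
        row = {}
        for y in labels:
            p = max(labels, key=lambda q: prev[q][0] + trans(q, y))
            row[y] = (prev[p][0] + trans(p, y) + emit(x, y), p)
        best.append(row)
    y = max(labels, key=lambda q: best[-1][q][0])
    path = [y]
    for row in reversed(best[1:]):  # follow back-pointers to the start
        y = row[y][1]
        path.append(y)
    return path[::-1]

# Stage 1: label text blocks (hypothetical features: font size, text cues).
BLOCK_LABELS = ["title", "authors", "abstract", "other"]
BLOCK_TRANS = {("title", "authors"): 2.0, ("authors", "abstract"): 1.0}

def emit_block(block, y):
    text = block["text"]
    if y == "title":
        return 2.0 if block["font"] >= 14 else -1.0
    if y == "authors":
        return 1.0 if "," in text else -1.0
    if y == "abstract":
        return 3.0 if text.lower().startswith("abstract") else -1.0
    return 0.0  # "other" is the neutral fallback

def trans_block(p, y):
    return BLOCK_TRANS.get((p, y), 0.0)

# Stage 2: label characters in the authors block as name (N) or delimiter (D).
CHAR_LABELS = ["N", "D"]
CHAR_TRANS = {("N", "N"): 1.0, ("D", "D"): 1.0, ("N", "D"): -1.0, ("D", "N"): 0.0}

def emit_char(c, y):
    if c in ",;":
        return 3.0 if y == "D" else -3.0
    if c == " ":  # ambiguous: the transition scores decide
        return 0.5 if y == "D" else 0.0
    return 2.0 if y == "N" else -2.0

def trans_char(p, y):
    return CHAR_TRANS[(p, y)]

def split_authors(text):
    tags = viterbi(text, CHAR_LABELS, emit_char, trans_char)
    names, current = [], ""
    for c, t in zip(text, tags):
        if t == "N":
            current += c
        elif current:  # a delimiter closes the current name
            names.append(current)
            current = ""
    if current:
        names.append(current)
    return [n.strip() for n in names]

def extract(blocks):
    labels = viterbi(blocks, BLOCK_LABELS, emit_block, trans_block)
    record = {}
    for block, y in zip(blocks, labels):
        if y == "authors":
            record["authors"] = split_authors(block["text"])
        elif y != "other":
            record[y] = block["text"]
    return record

SAMPLE_BLOCKS = [
    {"text": "Vol. 1 No. 2", "font": 8},
    {"text": "A Study of Toy Examples", "font": 18},
    {"text": "A. Sato, B. Suzuki", "font": 11},
    {"text": "Abstract: This paper ...", "font": 9},
]

print(extract(SAMPLE_BLOCKS))
```

In the character stage, a space is scored almost neutrally, so the transition scores decide whether it belongs to a name (between two name characters) or to a delimiter run (after a comma). A trained CRF would learn such weights from labeled data instead of having them set by hand.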

Original language: English
Title of host publication: 3rd International Conference on Digital Information Management, ICDIM 2008
Pages: 99-104
Number of pages: 6
DOIs: https://doi.org/10.1109/ICDIM.2008.4746745
Publication status: Published - 2008
Event: 3rd International Conference on Digital Information Management, ICDIM 2008 - London, United Kingdom
Duration: Nov 13 2008 to Nov 16 2008

Other

Other: 3rd International Conference on Digital Information Management, ICDIM 2008
Country: United Kingdom
City: London
Period: 11/13/08 to 11/16/08

Fingerprint

Labeling
Optical character recognition
Labels
Digital libraries
Image processing
Conditional random fields
Costs
Experiments

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management

Cite this

Ohta, M., Yakushi, T., & Takasu, A. (2008). Bibliographic element extraction from scanned documents using conditional random fields. In 3rd International Conference on Digital Information Management, ICDIM 2008 (pp. 99-104). [4746745] https://doi.org/10.1109/ICDIM.2008.4746745

@inproceedings{fba51c71e38c407ebe1d3ee2c7a930ea,
title = "Bibliographic element extraction from scanned documents using conditional random fields",
author = "Manabu Ohta and Takayuki Yakushi and Atsuhiro Takasu",
year = "2008",
doi = "10.1109/ICDIM.2008.4746745",
language = "English",
isbn = "9781424429172",
pages = "99--104",
booktitle = "3rd International Conference on Digital Information Management, ICDIM 2008",

}

TY - GEN

T1 - Bibliographic element extraction from scanned documents using conditional random fields

AU - Ohta, Manabu

AU - Yakushi, Takayuki

AU - Takasu, Atsuhiro

PY - 2008

Y1 - 2008

UR - http://www.scopus.com/inward/record.url?scp=62949156808&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=62949156808&partnerID=8YFLogxK

U2 - 10.1109/ICDIM.2008.4746745

DO - 10.1109/ICDIM.2008.4746745

M3 - Conference contribution

SN - 9781424429172

SP - 99

EP - 104

BT - 3rd International Conference on Digital Information Management, ICDIM 2008

ER -