CRF-based authors' name tagging for scanned documents

Manabu Ohta, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

6 Citations (Scopus)

Abstract

Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. Therefore, those creating metadata for digital libraries would appreciate an automatic method to extract such bibliographic data from printed documents. In this paper, we describe an automatic author name tagger for academic articles scanned with optical character recognition (OCR) mark-up. The method uses conditional random fields (CRF) for labeling the unsegmented character strings in authors' blocks as those of either an author or a delimiter. We applied the tagger to Japanese academic articles. The results of the experiments showed that it correctly labeled more than 99% of the author name strings, which compares favorably with the under 96% correct rate of our previous tagger based on a hidden Markov model (HMM).
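
The labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not give the paper's features or trained weights, so the delimiter set, the scores, and the function names below are all hypothetical. The sketch runs Viterbi decoding over a linear-chain model that tags each character of an unsegmented author block as author (A) or delimiter (D), then groups the A runs into names.

```python
from typing import Dict, List

LABELS = ["A", "D"]  # A = author-name character, D = delimiter character

# Hypothetical delimiter characters; the paper's actual features are not
# given in this record.
DELIMS = {",", ";", "、", "・"}

# Hypothetical hand-set transition scores. A real CRF would learn these
# (and the emission weights) from labeled training data.
TRANSITION: Dict[tuple, float] = {
    ("A", "A"): 0.5,
    ("A", "D"): 0.0,
    ("D", "A"): 0.0,
    ("D", "D"): 0.2,
}


def emission_score(ch: str, label: str) -> float:
    """Score of assigning `label` to character `ch` (illustrative values)."""
    is_delim = ch in DELIMS
    if label == "D":
        return 2.0 if is_delim else -2.0
    return -2.0 if is_delim else 1.0


def viterbi(chars: str) -> List[str]:
    """Return the highest-scoring label sequence for `chars`."""
    if not chars:
        return []
    # score[i][lab] = best score of any label path ending in `lab` at i
    score = [{lab: emission_score(chars[0], lab) for lab in LABELS}]
    back: List[Dict[str, str]] = []
    for ch in chars[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for lab in LABELS:
            best_prev = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, lab)])
            cur[lab] = (prev[best_prev] + TRANSITION[(best_prev, lab)]
                        + emission_score(ch, lab))
            ptr[lab] = best_prev
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: score[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def extract_names(chars: str, labels: List[str]) -> List[str]:
    """Group consecutive A-labeled characters into author name strings."""
    names, current = [], []
    for ch, lab in zip(chars, labels):
        if lab == "A":
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names


block = "Ohta M,Takasu A"  # sample author block taken from the citation text
print(extract_names(block, viterbi(block)))  # ['Ohta M', 'Takasu A']
```

With these toy scores the decoder behaves much like a delimiter lookup. The benefit of a trained CRF comes from learned weights over richer context features, which this sketch does not attempt to reproduce.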

Original language: English
Title of host publication: Proceedings of the ACM International Conference on Digital Libraries
Pages: 272-275
Number of pages: 4
DOIs: https://doi.org/10.1145/1378889.1378935
Publication status: Published - 2008
Event: 8th ACM/IEEE-CS Joint Conference on Digital Libraries 2008, JCDL'08 - Pittsburgh, PA, United States
Duration: Jun 16 2008 to Jun 20 2008

Other

Other: 8th ACM/IEEE-CS Joint Conference on Digital Libraries 2008, JCDL'08
Country: United States
City: Pittsburgh, PA
Period: 6/16/08 to 6/20/08

Fingerprint

Digital libraries
Optical character recognition
Hidden Markov models
Metadata
Labeling
Experiments

Keywords

  • Conditional Random Fields (CRF)
  • Digital libraries
  • Information extraction

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Information Systems
  • Library and Information Sciences

Cite this

Ohta, M., & Takasu, A. (2008). CRF-based authors' name tagging for scanned documents. In Proceedings of the ACM International Conference on Digital Libraries (pp. 272-275) https://doi.org/10.1145/1378889.1378935

@inproceedings{4f41370205894d64a324b5dea62931b5,
title = "CRF-based authors' name tagging for scanned documents",
abstract = "Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. Therefore, those creating metadata for digital libraries would appreciate an automatic method to extract such bibliographic data from printed documents. In this paper, we describe an automatic author name tagger for academic articles scanned with optical character recognition (OCR) mark-up. The method uses conditional random fields (CRF) for labeling the unsegmented character strings in authors' blocks as those of either an author or a delimiter. We applied the tagger to Japanese academic articles. The results of the experiments showed that it correctly labeled more than 99{\%} of the author name strings, which compares favorably with the under 96{\%} correct rate of our previous tagger based on a hidden Markov model (HMM).",
keywords = "Conditional Random Fields (CRF), Digital libraries, Information extraction",
author = "Manabu Ohta and Atsuhiro Takasu",
year = "2008",
doi = "10.1145/1378889.1378935",
language = "English",
isbn = "9781595939982",
pages = "272--275",
booktitle = "Proceedings of the ACM International Conference on Digital Libraries",

}

TY - GEN

T1 - CRF-based authors' name tagging for scanned documents

AU - Ohta, Manabu

AU - Takasu, Atsuhiro

PY - 2008

Y1 - 2008

AB - Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. Therefore, those creating metadata for digital libraries would appreciate an automatic method to extract such bibliographic data from printed documents. In this paper, we describe an automatic author name tagger for academic articles scanned with optical character recognition (OCR) mark-up. The method uses conditional random fields (CRF) for labeling the unsegmented character strings in authors' blocks as those of either an author or a delimiter. We applied the tagger to Japanese academic articles. The results of the experiments showed that it correctly labeled more than 99% of the author name strings, which compares favorably with the under 96% correct rate of our previous tagger based on a hidden Markov model (HMM).

KW - Conditional Random Fields (CRF)

KW - Digital libraries

KW - Information extraction

UR - http://www.scopus.com/inward/record.url?scp=57649219461&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57649219461&partnerID=8YFLogxK

U2 - 10.1145/1378889.1378935

DO - 10.1145/1378889.1378935

M3 - Conference contribution

AN - SCOPUS:57649219461

SN - 9781595939982

SP - 272

EP - 275

BT - Proceedings of the ACM International Conference on Digital Libraries

ER -