Empirical evaluation of active sampling for CRF-based analysis of pages

Manabu Ohta, Ryohei Inoue, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Citations (Scopus)

Abstract

We propose an automatic method of extracting bibliographies from academic articles that have been scanned and annotated with OCR markup. The method uses conditional random fields (CRFs) to label the serially ordered OCR-ed text lines on an article's title page with the names of the appropriate bibliographic elements. Although we achieved excellent extraction accuracy for some Japanese academic journals, we needed a substantial amount of training data, which had to be obtained through costly manual extraction of bibliographies from printed documents. This paper therefore reports an empirical evaluation of active sampling applied to CRF-based bibliography extraction, with the aim of reducing the amount of training data. We applied active sampling techniques to three academic journals published in Japan. The experiments revealed that, for two of these journals, the sampling strategy using the proposed selection criteria could reduce the amount of training data to less than half, or even a third, of the original amount. This paper also reports the effect of adding pseudo-training data to the training set.
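The active sampling idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the proposed selection criteria, so the criterion below (the lowest top-label probability over a page's lines) is an assumption used as a stand-in, not the authors' method. The names `page_confidence`, `select_for_labeling`, and the toy `pool` are hypothetical, and the per-line label probabilities are assumed to come from an already trained CRF.

```python
from typing import Dict, List

# One dict per text line: label name -> marginal probability (assumed CRF output).
LineMarginals = List[Dict[str, float]]


def page_confidence(line_marginals: LineMarginals) -> float:
    """Confidence of a page: the lowest top-label probability over its lines.

    This is an illustrative criterion, not the one proposed in the paper.
    """
    return min(max(dist.values()) for dist in line_marginals)


def select_for_labeling(pool: Dict[str, LineMarginals], k: int) -> List[str]:
    """Return the ids of the k unlabeled pages the model is least confident about."""
    return sorted(pool, key=lambda page_id: page_confidence(pool[page_id]))[:k]


# Hypothetical unlabeled pool: three title pages with two text lines each.
pool = {
    "page_a": [{"title": 0.95, "author": 0.05}, {"title": 0.10, "author": 0.90}],
    "page_b": [{"title": 0.55, "author": 0.45}, {"title": 0.40, "author": 0.60}],
    "page_c": [{"title": 0.80, "author": 0.20}, {"title": 0.30, "author": 0.70}],
}

# Pages chosen for manual labeling in the next training round.
print(select_for_labeling(pool, 2))
```

In a full loop, the selected pages would be labeled by hand, added to the training set, and the CRF retrained before the next selection round. That repetition is how such a strategy may reach a given accuracy with fewer labeled pages than random sampling.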

Original language: English
Title of host publication: 2010 IEEE International Conference on Information Reuse and Integration, IRI 2010
Pages: 13-18
Number of pages: 6
DOIs: https://doi.org/10.1109/IRI.2010.5558973
Publication status: Published - 2010
Event: 11th IEEE International Conference on Information Reuse and Integration, IRI 2010 - Las Vegas, NV, United States
Duration: Aug 4 2010 to Aug 6 2010

Other

Other: 11th IEEE International Conference on Information Reuse and Integration, IRI 2010
Country: United States
City: Las Vegas, NV
Period: 8/4/10 to 8/6/10

Fingerprint

Bibliographies
Optical character recognition
Sampling
Labeling
Conditional random fields
Empirical evaluation
Experiments
Academic journals

Keywords

  • Active sampling
  • Bibliography extraction
  • CRF
  • Digital library
  • OCR

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management

Cite this

Ohta, M., Inoue, R., & Takasu, A. (2010). Empirical evaluation of active sampling for CRF-based analysis of pages. In 2010 IEEE International Conference on Information Reuse and Integration, IRI 2010 (pp. 13-18). [5558973] https://doi.org/10.1109/IRI.2010.5558973

@inproceedings{4010cb15085c4f79ade433d003feb978,
title = "Empirical evaluation of active sampling for CRF-based analysis of pages",
abstract = "We propose an automatic method of extracting bibliographies for academic articles scanned with OCR markup. The method uses conditional random fields (CRF) for labeling serially OCR-ed text lines on an article's title page as appropriate names for bibliographic elements. Although we achieved excellent extraction accuracies for some Japanese academic journals, we needed a substantial amount of training data that had to be obtained through costly manual extraction of bibliographies from printed documents. Therefore, this paper reports an empirical evaluation of active sampling applied to the CRF-based extraction of bibliographies to reduce the amount of training data. We applied active sampling techniques to three academic journals published in Japan. The experiments revealed that the sampling strategy using the proposed criteria for selecting samples could reduce the amount of training data to less than half or even a third of those for two academic journals. This paper also reports the effect of pseudo-training data that were added to training.",
keywords = "Active sampling, Bibliography extraction, CRF, Digital library, OCR",
author = "Manabu Ohta and Ryohei Inoue and Atsuhiro Takasu",
year = "2010",
doi = "10.1109/IRI.2010.5558973",
language = "English",
isbn = "9781424480975",
pages = "13--18",
booktitle = "2010 IEEE International Conference on Information Reuse and Integration, IRI 2010",

}

TY - GEN

T1 - Empirical evaluation of active sampling for CRF-based analysis of pages

AU - Ohta, Manabu

AU - Inoue, Ryohei

AU - Takasu, Atsuhiro

PY - 2010

Y1 - 2010

N2 - We propose an automatic method of extracting bibliographies for academic articles scanned with OCR markup. The method uses conditional random fields (CRF) for labeling serially OCR-ed text lines on an article's title page as appropriate names for bibliographic elements. Although we achieved excellent extraction accuracies for some Japanese academic journals, we needed a substantial amount of training data that had to be obtained through costly manual extraction of bibliographies from printed documents. Therefore, this paper reports an empirical evaluation of active sampling applied to the CRF-based extraction of bibliographies to reduce the amount of training data. We applied active sampling techniques to three academic journals published in Japan. The experiments revealed that the sampling strategy using the proposed criteria for selecting samples could reduce the amount of training data to less than half or even a third of those for two academic journals. This paper also reports the effect of pseudo-training data that were added to training.

KW - Active sampling

KW - Bibliography extraction

KW - CRF

KW - Digital library

KW - OCR

UR - http://www.scopus.com/inward/record.url?scp=77958016174&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77958016174&partnerID=8YFLogxK

U2 - 10.1109/IRI.2010.5558973

DO - 10.1109/IRI.2010.5558973

M3 - Conference contribution

AN - SCOPUS:77958016174

SN - 9781424480975

SP - 13

EP - 18

BT - 2010 IEEE International Conference on Information Reuse and Integration, IRI 2010

ER -