Utilization of multiple sequence analyzers for bibliographic information extraction

Atsuhiro Takasu, Manabu Ohta

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper discusses the problems of analyzing title page layouts and extracting bibliographic information from academic papers. Information extraction is an important function for digital libraries to offer, providing versatile and effective access paths to library content. Sequence analyzers, such as those based on a conditional random field, are often used to extract information from object pages. Recently, digital libraries have grown and can now handle a large number and wide variety of papers. Because of the variety of page layouts, it is necessary to prepare multiple analyzers, one for each type of layout, to achieve high extraction accuracy. This makes rule management important. For example, at what stage should we invest in a new analyzer, and how can we acquire it efficiently, when receiving papers with a new layout? This paper focuses on the detection of layout changes and how we learn to use a new sequence analyzer efficiently. We evaluate the confidence metrics for sequence analyzers to judge whether they would be suited to title page analysis by testing three academic journals. The results show that they are effective for measuring suitability. We also examine the sampling of training data when learning how to use a new analyzer.

Original language: English
Title of host publication: Pattern Recognition Applications and Methods - 3rd International Conference, ICPRAM 2014, Revised Selected Papers
Editors: Maria de Marsico, Ana Fred, Antoine Tabbone
Publisher: Springer Verlag
Pages: 222-236
Number of pages: 15
ISBN (Print): 9783319255293
DOIs: https://doi.org/10.1007/978-3-319-25530-9_15
Publication status: Published - Jan 1 2015
Event: 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014 - Angers, Loire Valley, France
Duration: Mar 6 2014 to Mar 8 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9443
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014
Country: France
City: Angers, Loire Valley
Period: 3/6/14 to 3/8/14

Keywords

  • Conditional random field
  • Digital libraries
  • Information extraction
  • Page layout analysis

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)


  • Cite this

    Takasu, A., & Ohta, M. (2015). Utilization of multiple sequence analyzers for bibliographic information extraction. In M. de Marsico, A. Fred, & A. Tabbone (Eds.), Pattern Recognition Applications and Methods - 3rd International Conference, ICPRAM 2014, Revised Selected Papers (pp. 222-236). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9443). Springer Verlag. https://doi.org/10.1007/978-3-319-25530-9_15