TY - GEN
T1 - Utilization of multiple sequence analyzers for bibliographic information extraction
AU - Takasu, Atsuhiro
AU - Ohta, Manabu
N1 - Publisher Copyright: © Springer International Publishing Switzerland 2015.
PY - 2015
Y1 - 2015
AB - This paper discusses the problems of analyzing title page layouts and extracting bibliographic information from academic papers. Information extraction is an important function for digital libraries to offer, providing versatile and effective access paths to library content. Sequence analyzers, such as those based on a conditional random field, are often used to extract information from object pages. Recently, digital libraries have grown and can now handle a large number and wide variety of papers. Because of the variety of page layouts, it is necessary to prepare multiple analyzers, one for each type of layout, to achieve high extraction accuracy. This makes rule management important. For example, at what stage should we invest in a new analyzer, and how can we acquire it efficiently, when receiving papers with a new layout? This paper focuses on the detection of layout changes and how we learn to use a new sequence analyzer efficiently. We evaluate the confidence metrics for sequence analyzers to judge whether they would be suited to title page analysis by testing three academic journals. The results show that they are effective for measuring suitability. We also examine the sampling of training data when learning how to use a new analyzer.
KW - Conditional random field
KW - Digital libraries
KW - Information extraction
KW - Page layout analysis
UR - http://www.scopus.com/inward/record.url?scp=84951807343&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84951807343&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-25530-9_15
DO - 10.1007/978-3-319-25530-9_15
M3 - Conference contribution
AN - SCOPUS:84951807343
SN - 9783319255293
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 222
EP - 236
BT - Pattern Recognition Applications and Methods - 3rd International Conference, ICPRAM 2014, Revised Selected Papers
A2 - de Marsico, Maria
A2 - Fred, Ana
A2 - Tabbone, Antoine
PB - Springer Verlag
T2 - 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014
Y2 - 6 March 2014 through 8 March 2014
ER -