Reduction of expanded search terms for fuzzy English-text retrieval

Manabu Ohta, Atsuhiro Takasu, Jun Adachi

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

1 Citation (Scopus)

Abstract

Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.
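
The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce either of the paper's two heuristics; it only shows how capping the number of substituted characters limits the number of generated search terms. The confusion matrix values, the function name expand_term, and the max_errors and min_prob parameters are all hypothetical, and the probability of a variant is simplified to the product of its substitution probabilities.

```python
from itertools import combinations, product

# Hypothetical confusion matrix: true character -> {misrecognized form: probability}.
# The entries are invented for illustration, not taken from the paper.
CONFUSION = {
    "l": {"1": 0.05, "I": 0.03},
    "o": {"0": 0.04},
    "i": {"l": 0.02},
}


def expand_term(term, confusion, max_errors=1, min_prob=0.0):
    """Return {variant: probability} with at most max_errors substituted characters.

    Simplification: a variant's probability is the product of the substitution
    probabilities only; the unmodified term is given probability 1.0.
    """
    variants = {term: 1.0}
    # Positions whose character has at least one known misrecognition.
    positions = [i for i, ch in enumerate(term) if ch in confusion]
    for k in range(1, max_errors + 1):
        for combo in combinations(positions, k):
            # One list of (substitute, probability) pairs per chosen position.
            choices = [confusion[term[i]].items() for i in combo]
            for picks in product(*choices):
                chars = list(term)
                prob = 1.0
                for i, (sub, p) in zip(combo, picks):
                    chars[i] = sub
                    prob *= p
                if prob >= min_prob:
                    variant = "".join(chars)
                    variants[variant] = max(prob, variants.get(variant, 0.0))
    return variants


if __name__ == "__main__":
    for limit in (1, 2, 3):
        print(limit, len(expand_term("oil", CONFUSION, max_errors=limit)))
```

Without a cap, the number of variants grows as the product of the per-character alternatives, which is how long query terms can expand into very large term sets; lowering max_errors or raising min_prob trims that set at the cost of missing heavily corrupted occurrences.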

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 619-633
Number of pages: 15
Volume: 1513
ISBN (Print): 9783540651017
Publication status: Published - 1998
Externally published: Yes
Event: 2nd European Conference on Digital Libraries, ECDL 1998 - Heraklion, Crete, Greece
Duration: Sep 21 1998 to Sep 23 1998

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1513
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Ohta, M., Takasu, A., & Adachi, J. (1998). Reduction of expanded search terms for fuzzy English-text retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1513, pp. 619-633). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1513). Springer Verlag.

@inproceedings{c534c412009d4ac18f8b129b2109e9c7,
title = "Reduction of expanded search terms for fuzzy English-text retrieval",
abstract = "Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.",
author = "Manabu Ohta and Atsuhiro Takasu and Jun Adachi",
year = "1998",
language = "English",
isbn = "9783540651017",
volume = "1513",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "619--633",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Reduction of expanded search terms for fuzzy English-text retrieval

AU - Ohta, Manabu

AU - Takasu, Atsuhiro

AU - Adachi, Jun

PY - 1998

Y1 - 1998

N2 - Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.

AB - Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.

UR - http://www.scopus.com/inward/record.url?scp=84945237083&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84945237083&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84945237083

SN - 9783540651017

VL - 1513

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 619

EP - 633

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

PB - Springer Verlag

ER -