Reduction of expanded search terms for fuzzy English-text retrieval

Manabu Ohta, Atsuhiro Takasu, Jun Adachi

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, a few million search terms are occasionally generated in English-text fuzzy retrieval, which slows retrieval intolerably. Therefore, this paper presents two remedies to reduce the number of generated search terms while maintaining retrieval effectiveness. One remedy is to restrict the number of errors included in each expanded search term, while the other is to introduce a validity value different from our conventional one. Experimental results indicate that the former remedy reduced the number of terms to about 50 and the latter to not more than 20.
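The expansion step and the first remedy (capping the number of errors per expanded term) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the confusion table, its probabilities, the function name `expand`, and the scoring rule (a product of substitution probabilities standing in for a validity value) are all assumptions made for the example.

```python
from itertools import combinations, product

# Hypothetical confusion table: for each correct character, the strings an
# OCR engine might output instead, with made-up probabilities.
CONFUSIONS = {
    "l": {"1": 0.05, "I": 0.03},
    "o": {"0": 0.04},
    "m": {"rn": 0.02},
}


def expand(term, max_errors):
    """Return {variant: score} for every variant of `term` that contains at
    most `max_errors` substituted characters.

    The score is the product of the substitution probabilities, an assumed
    stand-in for a validity value; the unmodified term scores 1.0.
    """
    variants = {term: 1.0}
    # Only positions whose character has an entry in the table can be changed.
    positions = [i for i, ch in enumerate(term) if ch in CONFUSIONS]
    for k in range(1, max_errors + 1):
        for chosen in combinations(positions, k):
            options = [CONFUSIONS[term[i]].items() for i in chosen]
            for combo in product(*options):
                chars = list(term)
                score = 1.0
                for i, (substitute, prob) in zip(chosen, combo):
                    chars[i] = substitute
                    score *= prob
                variant = "".join(chars)
                # Keep the best score if two routes yield the same string.
                variants[variant] = max(score, variants.get(variant, 0.0))
    return variants


if __name__ == "__main__":
    for limit in (1, 3):
        print(limit, sorted(expand("mol", limit)))
```

With this toy table, the term "mol" expands to 12 search terms when up to three errors are allowed and to 5 when only one error is allowed, which shows how the cap limits growth. A score threshold on the returned values would be one way to prune further, though the paper's actual validity value is not specified in the abstract.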

Original language: English
Pages (from-to): 140-151
Number of pages: 12
Journal: International Journal on Digital Libraries
Volume: 3
Issue number: 2
DOI: 10.1007/s007999900014
Publication status: Published - 2000
Externally published: Yes

Keywords

  • Confusion term
  • Fuzzy retrieval
  • OCR
  • Query term expansion
  • Retrieval speed

ASJC Scopus subject areas

  • Library and Information Sciences

Cite this

Reduction of expanded search terms for fuzzy English-text retrieval. / Ohta, Manabu; Takasu, Atsuhiro; Adachi, Jun.

In: International Journal on Digital Libraries, Vol. 3, No. 2, 2000, p. 140-151.

@article{470ed2a8175b45878796317b08cc98ba,
title = "Reduction of expanded search terms for fuzzy English-text retrieval",
abstract = "Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, a few million search terms are occasionally generated in English-text fuzzy retrieval, giving an intolerable effect on retrieval speed. Therefore, this paper presents two remedies to reduce the number of generated search terms while maintaining retrieval effectiveness. One remedy is to restrict the number of errors included in each expanded search term, while the other is to introduce another validity value different to our conventional one. Experimental results indicate that the former remedy reduced the number of terms to about 50 and the latter to not more than 20.",
keywords = "Confusion term, Fuzzy retrieval, OCR, Query term expansion, Retrieval speed",
author = "Manabu Ohta and Atsuhiro Takasu and Jun Adachi",
year = "2000",
doi = "10.1007/s007999900014",
language = "English",
volume = "3",
pages = "140--151",
journal = "International Journal on Digital Libraries",
issn = "1432-5012",
publisher = "Springer Verlag",
number = "2",

}

TY - JOUR

T1 - Reduction of expanded search terms for fuzzy English-text retrieval

AU - Ohta, Manabu

AU - Takasu, Atsuhiro

AU - Adachi, Jun

PY - 2000

Y1 - 2000

N2 - Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, a few million search terms are occasionally generated in English-text fuzzy retrieval, giving an intolerable effect on retrieval speed. Therefore, this paper presents two remedies to reduce the number of generated search terms while maintaining retrieval effectiveness. One remedy is to restrict the number of errors included in each expanded search term, while the other is to introduce another validity value different to our conventional one. Experimental results indicate that the former remedy reduced the number of terms to about 50 and the latter to not more than 20.

KW - Confusion term

KW - Fuzzy retrieval

KW - OCR

KW - Query term expansion

KW - Retrieval speed

UR - http://www.scopus.com/inward/record.url?scp=1642411939&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1642411939&partnerID=8YFLogxK

U2 - 10.1007/s007999900014

DO - 10.1007/s007999900014

M3 - Article

AN - SCOPUS:1642411939

VL - 3

SP - 140

EP - 151

JO - International Journal on Digital Libraries

JF - International Journal on Digital Libraries

SN - 1432-5012

IS - 2

ER -