Probabilistic Automaton-Based Fuzzy English-Text Retrieval

Manabu Ohta, Atsuhiro Takasu, Jun Adachi

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Misrecognition by optical character readers (OCR) is a serious problem when searching OCR-scanned documents in databases such as digital libraries. To reduce costs, this paper proposes fuzzy retrieval methods for English text that operate on the recognized text as is, without manual correction of its errors. The proposed methods generate multiple search terms for each input query term using probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. In test-set retrieval experiments with 20 expanded search terms, one of the proposed methods improves the recall rate from 95.96% to 98.15%, at the cost of a decrease in precision from 100.00% to 96.01%.
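
The idea of expanding a query term into several likely OCR variants can be illustrated with a minimal sketch. This is not the authors' automaton construction: it swaps in a simple beam search over per-character substitutions, and the confusion table `CONFUSION`, the toy `bigram_prob` rule, and the `expand` function with its `k` and `beam` parameters are all hypothetical names and values chosen for illustration only.

```python
import heapq
import math

# Hypothetical error-occurrence model: P(recognized string | true character).
# The numbers are invented for illustration, not taken from the paper.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "o": {"o": 0.92, "0": 0.08},
    "m": {"m": 0.90, "rn": 0.10},
}


def bigram_prob(prev, cur):
    """Hypothetical character-connection weight for `cur` following `prev`.

    Toy rule: a digit directly after a letter is penalized.
    """
    return 0.2 if cur[0].isdigit() and prev.isalpha() else 1.0


def expand(term, k=20, beam=50):
    """Return up to k (variant, score) pairs for `term`, best first."""
    hyps = [(0.0, "")]  # each hypothesis is (log score, string so far)
    for ch in term:
        # Characters without an entry are assumed to be recognized correctly.
        options = CONFUSION.get(ch, {ch: 1.0})
        nxt = []
        for log_score, prefix in hyps:
            prev = prefix[-1] if prefix else ""
            for out, p_err in options.items():
                p_conn = bigram_prob(prev, out) if prev else 1.0
                nxt.append((log_score + math.log(p_err * p_conn), prefix + out))
        # Keep only the best `beam` partial strings to bound the work.
        hyps = heapq.nlargest(beam, nxt)
    return [(s, math.exp(lp)) for lp, s in heapq.nlargest(k, hyps)]


if __name__ == "__main__":
    for variant, score in expand("model", k=5):
        print(f"{variant}\t{score:.4f}")
```

With these toy numbers the unmodified term "model" ranks first and "rnodel" second. Each variant would then be issued as an additional search term against the uncorrected OCR text. The scores are unnormalized weights rather than true probabilities.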

Original language: English
Pages (from-to): 1835-1844
Number of pages: 10
Journal: IEICE Transactions on Information and Systems
Volume: E86-D
Issue number: 9
Publication status: Published - Sep 2003
Externally published: Yes

Keywords

  • Bigram
  • Fuzzy retrieval
  • OCR
  • Probabilistic automaton
  • Query term expansion

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence
