Retrieval methods for English-text with misrecognized OCR characters

Manabu Ohta, A. Takasu, J. Adachi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

25 Citations (Scopus)

Abstract

This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual post-editing is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is misrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character-connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.
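The pipeline outlined in the abstract (expand a query term through a confusion matrix, search for every variant, then judge the hits) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the probability values, the names `CONFUSION`, `BIGRAM`, `expand_query`, `connection_prob`, `search`, and `recall_precision`, the pruning threshold, and the restriction to one-to-one character substitutions. The abstract does not say how the three methods combine error-occurrence and character-connection probabilities, so the sketch reports the two values side by side and does not attempt to reproduce any of the paper's methods.

```python
from itertools import product

# Hypothetical confusion matrix: true character -> {OCR output: probability}.
# The numbers are invented for illustration, not taken from the paper.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.10},
    "o": {"o": 0.95, "0": 0.05},
}

# Hypothetical 2-gram table: (previous char, next char) -> connection probability.
BIGRAM = {("c", "o"): 0.05, ("o", "o"): 0.02, ("o", "l"): 0.04}
DEFAULT_BIGRAM = 0.001  # assumed value for character pairs missing from the table


def expand_query(term, min_prob=0.01):
    """Map each plausible OCR rendering of `term` to its error-occurrence probability."""
    # Characters absent from the confusion matrix are assumed to be read correctly.
    options = [CONFUSION.get(ch, {ch: 1.0}).items() for ch in term]
    variants = {}
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob >= min_prob:  # prune very unlikely renderings
            variants["".join(ch for ch, _ in combo)] = prob
    return variants


def connection_prob(word):
    """Product of 2-gram probabilities over adjacent character pairs of `word`."""
    prob = 1.0
    for prev, nxt in zip(word, word[1:]):
        prob *= BIGRAM.get((prev, nxt), DEFAULT_BIGRAM)
    return prob


def search(tokens, term, min_prob=0.01):
    """Return (position, token, error_prob, connection_prob) for each matching token."""
    variants = expand_query(term, min_prob)
    hits = [
        (pos, tok, variants[tok], connection_prob(tok))
        for pos, tok in enumerate(tokens)
        if tok in variants
    ]
    # Rank by error-occurrence probability, most likely rendering first.
    return sorted(hits, key=lambda hit: -hit[2])


def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    ocr_tokens = "the coo1 air was c0ol and cool".split()
    for hit in search(ocr_tokens, "cool"):
        print(hit)
```

With these toy tables, an exact match for "cool" finds only the last token, whereas the expanded search also retrieves the misrecognized tokens "coo1" and "c0ol". That difference is what a recall and precision comparison against exact matching would measure.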

Original language: English
Title of host publication: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Editors: Anon
Publisher: IEEE
Pages: 950-956
Number of pages: 7
Volume: 2
Publication status: Published - 1997
Externally published: Yes
Event: Proceedings of the 1997 4th International Conference on Document Analysis and Recognition, ICDAR. Part 2 (of 2) - Ulm, Ger
Duration: Aug 18 1997 to Aug 20 1997

Other

Other: Proceedings of the 1997 4th International Conference on Document Analysis and Recognition, ICDAR. Part 2 (of 2)
City: Ulm, Ger
Period: 8/18/97 to 8/20/97

Fingerprint

Optical character recognition

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Ohta, M., Takasu, A., & Adachi, J. (1997). Retrieval methods for English-text with misrecognized OCR characters. In Anon (Ed.), Proceedings of the International Conference on Document Analysis and Recognition, ICDAR (Vol. 2, pp. 950-956). IEEE.

@inproceedings{a1232d91dba04092ae475f2a111da26f,
title = "Retrieval methods for English-text with misrecognized OCR characters",
abstract = "This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual post-editing is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is misrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character-connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.",
author = "Manabu Ohta and A. Takasu and J. Adachi",
year = "1997",
language = "English",
volume = "2",
pages = "950--956",
editor = "Anon",
booktitle = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",
publisher = "IEEE",

}

TY - GEN

T1 - Retrieval methods for English-text with misrecognized OCR characters

AU - Ohta, Manabu

AU - Takasu, A.

AU - Adachi, J.

PY - 1997

Y1 - 1997

N2 - This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual post-editing is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is misrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character-connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.

UR - http://www.scopus.com/inward/record.url?scp=0030676856&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0030676856&partnerID=8YFLogxK

M3 - Conference contribution

VL - 2

SP - 950

EP - 956

BT - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

A2 - Anon

PB - IEEE

ER -