TY - JOUR
T1 - A Proposal of Printed Table Digitization Algorithm with Image Processing
AU - Shi, Chenrui
AU - Funabiki, Nobuo
AU - Huo, Yuanzhi
AU - Mentari, Mustika
AU - Suga, Kohei
AU - Toshida, Takashi
N1 - Publisher Copyright: © 2022 by the authors.
PY - 2022/12
Y1 - 2022/12
AB - Digital transformation (DX) is a key concept for changing and improving operations in governments, companies, and schools. Therefore, data should be digitized so that computers can process it. Unfortunately, a lot of data and information is still printed and handled on paper, even when it originally comes from digital sources. Data on paper can be digitized using optical character recognition (OCR) software. However, if the paper contains a table, digitization becomes difficult because the characters are separated into rows and columns. It is necessary to answer the research question of “how to convert a printed table on paper into an Excel table while keeping the relationships between the cells?” In this paper, we propose a printed table digitization algorithm that uses image processing techniques and OCR software. First, the target paper is scanned into an image file. Second, each table is divided into a collection of cells, and the topology information of the cells is obtained. Third, the characters in each cell are digitized by the OCR software. Finally, the digitized data are arranged in an Excel file using the topology information. We implement the algorithm in Python, using OpenCV as the image processing library and Tesseract as the OCR software. For evaluation, we applied the proposal to 19 scanned and 17 screenshotted table images. The results show that for every image, the Excel file is generated with the correct structure, although some characters are misrecognized by the OCR software. Addressing these misrecognitions is left for future work.
KW - digitization
KW - OCR
KW - OpenCV
KW - printed table
KW - Python
KW - Tesseract
UR - http://www.scopus.com/inward/record.url?scp=85144594706&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144594706&partnerID=8YFLogxK
U2 - 10.3390/a15120471
DO - 10.3390/a15120471
M3 - Article
AN - SCOPUS:85144594706
SN - 1999-4893
VL - 15
JO - Algorithms
JF - Algorithms
IS - 12
M1 - 471
ER -