Algorithm for auto annotation of scanned documents based on subregion tiling and shallow networks

There are millions of scanned documents worldwide in around 4 thousand languages. Searching for information in a scanned document requires a text layer to be available and indexed. Preparing a text layer requires recognizing character and subregion patterns and associating them with a human interpretation. Developing an optical character recognition (OCR) system for every language is very difficult, if not impossible. There is a strong need for systems that build on top of existing OCR technologies by learning from them and unifying the many disparate systems. In this regard, we propose an algorithm that leverages the fact that we are dealing with scanned documents of handwritten text regions from diverse domains and language settings. We observe that the text regions have consistent bounding box sizes, and that large-font or tiny-font scenarios can be handled in preprocessing or postprocessing phases. The image subregions in scanned text documents are smaller than the subregions formed by common objects in general-purpose images. We propose and validate the hypothesis that a much simpler convolutional neural network (CNN), with very few layers and fewer filters, can be used for detecting individual subregion classes. For detection of several hundred classes, multiple such simpler models can be pooled to operate simultaneously on a document. The advantage of using pools of subregion-specific models is the ability to handle the incremental addition of hundreds of newer classes over time without disturbing the previous models in the continual learning scenario. Such an approach has a distinct advantage over a single monolithic model, in which subregion classes share, and interfere through, a bulky common neural network. We report here an efficient algorithm for building subregion-specific lightweight CNN models.
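The pooling idea described above can be illustrated with a minimal sketch. Everything named here (`DetectorPool`, `add_class`, `annotate`, `make_template_detector`) is hypothetical and not taken from the paper, and the exact-match sliding-window detector is only a stand-in for a trained shallow CNN. The point of the sketch is the structure: one independent detector per subregion class, so that registering a new class leaves the existing detectors unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (row, col, height, width)
Detector = Callable[[Image], List[Box]]


@dataclass
class DetectorPool:
    """Hypothetical registry holding one independent detector per subregion class."""
    detectors: Dict[str, Detector] = field(default_factory=dict)

    def add_class(self, label: str, detector: Detector) -> None:
        # Adding a class only inserts a new entry; existing detectors are untouched.
        if label in self.detectors:
            raise ValueError(f"class {label!r} already registered")
        self.detectors[label] = detector

    def annotate(self, image: Image) -> List[Tuple[str, Box]]:
        # Run every class-specific detector on the same image and merge the results.
        results: List[Tuple[str, Box]] = []
        for label, detect in self.detectors.items():
            results.extend((label, box) for box in detect(image))
        return results


def make_template_detector(template: Image) -> Detector:
    """Stand-in for a trained shallow CNN: exact template match by sliding window."""
    th, tw = len(template), len(template[0])

    def detect(image: Image) -> List[Box]:
        boxes: List[Box] = []
        for r in range(len(image) - th + 1):
            for c in range(len(image[0]) - tw + 1):
                window = [row[c:c + tw] for row in image[r:r + th]]
                if window == template:
                    boxes.append((r, c, th, tw))
        return boxes

    return detect
```

In this arrangement a new class costs one additional `add_class` call and one additional pass over the document at annotation time; no existing detector is retrained or modified.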
The training data for the proposed CNN requires engineering synthetic data points that include both the pattern of interest and non-patterns. We propose and validate the hypothesis that an image canvas containing an optimal amount of pattern and non-pattern can be formulated, using a mean squared error loss function, to influence filter learning from the data. The CNN trained in this way can identify the character object in the presence of several other objects on a generalized test image of a scanned document. In this setting, learning a filter in a CNN depends not only on the abundance of patterns of interest but also on the presence of a non-pattern context. Our experiments led to the following key observations: (i) a pattern cannot be over-expressed in isolation, (ii) a pattern cannot be under-expressed either, (iii) a non-pattern can be salt-and-pepper type noise, and (iv) it is sufficient to provide a non-pattern context to a modest representation of a pattern to obtain strong individual subregion class models. We have carried out studies and report mean average precision scores on various data sets, including (1) MNIST digits (95.77), (2) EMNIST capital alphabet (81.26), (3) EMNIST small alphabet (73.32), (4) Kannada digits (95.77), (5) Kannada letters (90.34), (6) Devanagari letters (100), (7) Telugu words (93.20), and (8) Devanagari words (93.20), and also on medical prescriptions, where we observed mean average precision above 90%. The algorithm serves as a kernel in the automatic annotation of digital documents in diverse scenarios, such as annotation of ancient manuscripts and handwritten health records.
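As a rough illustration of the synthetic training canvas described above, the sketch below pastes a chosen number of pattern copies onto a blank canvas, fills part of the remaining area with salt-and-pepper noise as the non-pattern context, and returns a target mask against which a prediction can be scored with mean squared error. The function names (`make_canvas`, `mse`), the parameters (`copies`, `noise_frac`), and the mask-style target are assumptions made for illustration; the abstract does not specify how the canvas or the loss target is actually constructed.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]


def make_canvas(pattern: Grid, size: int, copies: int,
                noise_frac: float, seed: int = 0) -> Tuple[Grid, Grid]:
    """Build a size x size canvas with `copies` pattern tiles plus noise.

    Returns (canvas, target); target is 1.0 on pattern tiles and 0.0 elsewhere.
    Assumed layout: tiles sit on a regular, non-overlapping grid of slots.
    """
    rng = random.Random(seed)
    ph, pw = len(pattern), len(pattern[0])
    canvas = [[0.0] * size for _ in range(size)]
    target = [[0.0] * size for _ in range(size)]

    # Candidate top-left corners for non-overlapping tiles.
    slots = [(r, c)
             for r in range(0, size - ph + 1, ph)
             for c in range(0, size - pw + 1, pw)]
    for r, c in rng.sample(slots, copies):
        for i in range(ph):
            for j in range(pw):
                canvas[r + i][c + j] = pattern[i][j]
                target[r + i][c + j] = 1.0

    # Salt-and-pepper noise on non-pattern pixels only (the non-pattern context).
    for r in range(size):
        for c in range(size):
            if target[r][c] == 0.0 and rng.random() < noise_frac:
                canvas[r][c] = rng.choice((0.0, 1.0))
    return canvas, target


def mse(pred: Grid, target: Grid) -> float:
    """Mean squared error between two grids of equal shape."""
    n = sum(len(row) for row in target)
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / n
```

Varying `copies` and `noise_frac` is one way to explore the balance the abstract refers to: too few copies under-express the pattern, while a canvas with no noise presents the pattern in isolation.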

Institute of Electrical and Electronics Engineers (IEEE)

Komuravelli Prashanth, Kalidas Yeturu

2021

The value of the malignant subregion-based texture analysis in predicting the Ki-67 status in breast cancer

ObjectiveTo evaluate the value of the malignant subregion-based texture analysis in predicting Ki-67 status in breast cancer.Materials and methodsThe dynamic contrast-enhanced magn...

Tiling Periodicity

We contribute to combinatorics and algorithmics of words by introducing new types of periodicities in words. A tiling period of a word w is partial word u such that w can be decomp...

To tile or not to tile?

Soils and landscapes vary within centimeters to decameters, which is not captured by state-of-the-art land-surface models that operate on kilometer scale. This leads to potential m...

Shallow Gas In The Oseberg, Brage And Troll Fields North Sea, 60°30' N

Abstract An integrated approach using geological, seismic, geotechnical and well log data have been used to investigate the presence of shallow gas in the Oseberg...

Practice of Ultra-Deepwater Shallow Well Construction in Nature Gas Hydrate and Shallow Gas Formation

Abstract Due to the large water depth and geological structure, a large amount of nature gas hydrate (NGH) and shallow gas are buried in the shallow layer in the dee...

Robot assisted tiling of glass mosaics with image processing

PurposeThis paper describes a robotic system developed for tiling mosaics based on image processing according to customer expectations.Design/methodology/approachMany varieties of ...

Related Results