The concept and types of OCR (optical character recognition)

Index

1. Introduction
   1.1 Background
   1.2 Motivation
2. Basic Study
   2.1 OCR
   2.2 Types of OCR
3. Bangla OCR
   3.1 Existing Research
   3.2 Existing Projects
   3.3 Limitations
4. Proposed Methodology and Implementation
   4.1 Deep CNN
   4.2 Why Deep CNN
   4.3 Experiment Data
   4.4 Training and Recognition

1. Introduction

1.1 Background

With the advent of computers and Internet technology, the possibilities of collecting data and using it for various purposes have exploded. The possibilities are especially enticing when dealing with textual data. Converting the vast amount of data accumulated over the course of human history into digital format is vital for storage, data mining, sentiment analysis, etc., all of which contribute to the progress of our society. The tool used for this purpose is called OCR.

1.2 Motivation

Like many other languages, Bengali can benefit from OCR technology, especially since it is the seventh most spoken language in the world, with a speaking population of around 300 million. The Bengali-speaking demographic is found primarily in Bangladesh; the Indian states of West Bengal, Assam, and Tripura; the Andaman and Nicobar Islands; and the ever-growing diaspora in the United Kingdom (UK), the United States (USA), Canada, the Middle East, Australia, Malaysia, etc. So advancement in the digital use of the Bengali language is in the interest of many countries.

2. Basic Study

2.1 OCR

OCR is the short form of optical character recognition. It is a technology for converting images of printed or handwritten text into a machine-readable, i.e. digital, format. Although OCRs today mostly focus on digitizing text, earlier OCRs were analog. The world's first OCR is believed to have been invented by American inventor Charles R. Carey, who used an image transmission system based on a mosaic of photocells.
Later inventions focused on scanning documents to produce multiple copies or to convert them into telegraph code, and then the digital format gradually became more popular. In 1966, IBM's Rochester laboratory developed the IBM 1287, the first scanner capable of reading handwritten numbers. The first commercial OCR was introduced in 1977 by Caere Corporation. OCR began to be made available online as a service (WebOCR) in 2000 on a variety of platforms via cloud computing.

2.2 Types of OCR

According to its method, OCR can be divided into two types:

Online OCR (not to be confused with "online" in the Internet-technology sense) involves the automatic conversion of text as it is written on a special digitizer or PDA, where a sensor detects the movements of the pen tip and the pen-up/pen-down motion. This type of data is known as digital ink and can be considered a digital representation of handwriting. The resulting signal is converted into letter codes that can be used within computers and word-processing applications.

Offline OCR scans an image as a whole and does not handle stroke order. It is a kind of image processing, as it tries to recognize character patterns in image files.

Online OCR can only process text written in real time, while offline OCR can process images of both handwritten and printed text, and no special device is needed.

3. Bangla OCR

3.1 Existing Research

Most of the successful research in Bangla OCR so far has been conducted on printed text, although researchers are gradually focusing more on recognizing handwritten text. Sanchez and Pal * proposed a classical line-based approach for continuous Bengali script recognition based on hidden Markov models and n-gram models. They used both a word-based LM (language model) and a character-based LM in their experiment and found better results with the word-based LM.
Garain, Mioulet, Chaudhuri, Chatelain, and Paquet * developed a recurrent neural network model to recognize unconstrained Bangla writing at the character level. They used a BLSTM-CTC-based recognizer on a dataset consisting of 2,338 unconstrained Bangla handwritten lines, approximately 21,000 words in total. Instead of horizontal segmentation, they chose vertical segmentation, classifying words into "semi-ortho syllables." Their experiment yielded an accuracy of 75.40% without any post-processing. Hasnat, Chowdhury and Khan * developed a Tesseract-based OCR for Bengali script, which they used on printed documents. They achieved a maximum accuracy of 93% on clean printed documents and a minimum accuracy of 70% on screen-printed images. Evidently this approach is very sensitive to variations in letter shapes and is not very favorable for Bengali script character recognition. Chowdhury and Rahman * proposed an optimal neural network setup for Bengali handwritten numeral recognition, consisting of two convolutional layers with tanh activation, a hidden layer with tanh activation, and an output layer with softmax activation. To recognize the 10 Bengali numeric characters, they used a dataset of 70,000 samples and achieved an error rate of 1.22% to 1.33%. Purkayastha, Datta and Islam * also used a convolutional neural network for Bengali handwritten character recognition. They were the first to work on compound Bengali handwritten characters, and their recognition experiment also included numeric characters and alphabets. They achieved 98.66% accuracy on numerals and 89.93% accuracy on almost all Bengali characters (80 classes).

3.2 Existing Projects

Some projects have been developed for Bangla OCR; notably, none of them work on handwritten text.
BanglaOCR * is an open-source OCR developed by Hasnat, Chowdhury and Khan * that uses the Google Tesseract engine for character recognition and works on printed documents, as discussed in Section 3.1. Puthi OCR, aka GIGA Text Reader, is a cross-platform Bangla OCR application developed by Giga TECH. This application works on printed documents written in Bengali, English, and Hindi. The Android app version is free to download, but the desktop version and enterprise version require payment. Chitrolekha * is another Bengali OCR that uses Google Tesseract and the OpenCV image library. The application is free and may have been available in the Google Play Store in the past, but at the moment (as of 15.07.2018) it is no longer available. i2OCR * is a multilingual OCR that supports more than 60 languages, including Bengali.

3.3 Limitations

Many of the existing Bangla OCRs have major limitations:

Segmentation: Two types of segmentation are used to separate individual characters/shapes: horizontal and vertical. Handwriting-recognition OCRs using horizontal segmentation do not perform very well on Bengali cursive texts.

Cursive forms: Many OCRs have been successful in recognizing individually written Bengali numerals or characters, but when handling texts with Bengali cursive forms, they do not produce favorable results.

Variation in forms: The way people write characters varies widely from person to person, especially since Bangla has many forms due to kar and compound letters. No OCR has yet been developed that can recognize all these forms in writing.

4. Proposed Methodology and Implementation

4.1 Deep CNN

Deep CNN stands for Deep Convolutional Neural Network. First, let us try to understand what a convolutional neural network (CNN) is. Neural networks are tools used in machine learning inspired by the architecture of the human brain.
The most basic version of the artificial neuron is called a perceptron, which makes a binary decision by comparing the weighted sum of its inputs against a threshold value. A neural network consists of interconnected perceptrons, whose connections differ depending on the configuration. The simplest topology is the feed-forward network, composed of three layers: an input layer, a hidden layer, and an output layer. Deep neural networks have more than one hidden layer, so a deep CNN is a convolutional neural network with more than one hidden layer.

Now we come to the convolutional neural network itself. While neural networks in general are inspired by the human brain, CNNs go further by drawing similarities to the visual cortex of animals as well *. Because CNNs are influenced by research on receptive field theory * and the neocognitron model *, they are better suited to learning multilevel hierarchies of visual features from images than other computer vision techniques, and they have made significant achievements in artificial intelligence and computer vision in recent years. The main difference between a convolutional neural network and other neural networks is that a neuron in a hidden layer is connected only to a subset of the neurons (perceptrons) in the previous layer. As a result of this sparse connectivity, CNNs are able to learn features implicitly, meaning they do not need predefined features during training.

A CNN consists of a series of convolutional and pooling (subsampling) layers, optionally followed by fully connected layers:

Convolutional layer: This is the basic unit of a CNN, where most of the computation takes place. The input to a convolutional layer is an m×m×r image, where m is the height and width of the image and r is the number of channels.
The convolutional layer has k filters (or kernels) of size n×n×q, where n is smaller than the image dimension and q can be equal to or less than the number of channels r and may vary for each kernel. The size of the filters gives rise to the locally connected structure; each filter is convolved with the image to produce k feature maps of size (m−n+1)×(m−n+1).

Pooling layer: Each feature map is then typically subsampled with average or max pooling over contiguous p×p regions, where p is 2 for small images (e.g. MNIST) and usually no greater than 5 for larger inputs. Convolutional and pooling layers alternate to reduce the spatial dimensions of the activation maps, leading to lower overall computational complexity. Some common pooling operations are max pooling, average pooling, stochastic pooling *, spectral pooling *, spatial pyramid pooling *, and multiscale orderless pooling *.

Fully connected layer: In this layer, neurons are fully connected to all neurons in the previous layer, as in a regular neural network. Some high-level reasoning is done here. Since the outputs of this layer are no longer spatially arranged, it cannot be followed by another convolutional layer. In some architectures the fully connected layer is.
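The perceptron decision rule described above can be sketched in a few lines. This is a minimal illustration, not part of the essay's proposed system; the weights and threshold below are assumed values chosen so the unit behaves like a logical AND gate.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Assumed example: with both weights 1.0 and a threshold of 1.5,
# the perceptron only fires when both inputs are active (logical AND).
weights = [1.0, 1.0]
threshold = 1.5
print(perceptron([1, 1], weights, threshold))  # 1
print(perceptron([0, 1], weights, threshold))  # 0
```

Stacking such units into layers, where each unit feeds the next layer, gives the feed-forward network described above; adding more hidden layers makes it deep.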
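The shape arithmetic above can be verified with a small sketch, assuming a single-channel image (r = 1), one kernel (k = 1), and convolution without padding: an m×m input convolved with an n×n kernel yields an (m−n+1)×(m−n+1) feature map, and non-overlapping p×p max pooling then shrinks each dimension by a factor of p. The image and kernel values here are arbitrary illustrations.

```python
def convolve2d(image, kernel):
    """Convolve a single-channel m x m image with an n x n kernel (no padding)."""
    m, n = len(image), len(kernel)
    out = m - n + 1  # feature map side length: m - n + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(n) for b in range(n))
             for j in range(out)] for i in range(out)]

def max_pool(fmap, p):
    """Non-overlapping p x p max pooling over a square feature map."""
    size = len(fmap) // p
    return [[max(fmap[i * p + a][j * p + b]
                 for a in range(p) for b in range(p))
             for j in range(size)] for i in range(size)]

m, n, p = 6, 3, 2
image = [[(i + j) % 5 for j in range(m)] for i in range(m)]
kernel = [[1, 0, -1] for _ in range(3)]  # arbitrary 3 x 3 filter

fmap = convolve2d(image, kernel)   # (6 - 3 + 1) = 4, so a 4 x 4 feature map
pooled = max_pool(fmap, p)         # 4 / 2 = 2, so a 2 x 2 map after pooling
print(len(fmap), len(pooled))      # 4 2
```

In a real CNN this conv-then-pool pattern repeats several times before the fully connected layers, which is exactly the alternation of layers the section describes.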