OCR in Indian languages


Optical character recognition (OCR) is the process of converting an image of text into machine-readable text. OCR for English and other European languages achieves high conversion accuracy, but OCR for Indian languages has not reached comparable accuracy. This is mostly due to the complexity of Indian scripts and the lack of standard representation and encoding, operating system support, and keyboard support.
The Centre for Development of Advanced Computing (C-DAC), the premier R&D organisation of the Ministry of Electronics and Information Technology of India, and the Technology Development for Indian Languages (TDIL) programme have carried out many OCR projects. These include OCR for the Malayalam, Odia, Punjabi, Telugu and Devanagari scripts.

Properties of Indian Scripts

India has 22 officially recognized languages. Among these, Hindi, Bengali and Punjabi are among the most widely spoken in India and rank fourth, seventh and tenth among the most spoken languages in the world. A single script can be used to write two or more languages. For example, Devanagari is used to write Hindi, Marathi, Rajasthani, Bhojpuri and many more, while the Bengali script is used to write Sanskrit, Manipuri and other languages.
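Because several languages share one script, an OCR pipeline can usually determine the script of a text more easily than its language. The following is a minimal sketch of that idea, assuming the text is already available as Unicode; the helper name `dominant_script` is hypothetical, and taking the first word of each Unicode character name is a rough heuristic rather than a standard script-detection method.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script prefix (e.g. 'DEVANAGARI') in text.

    Heuristic: the first word of a character's Unicode name is treated
    as its script. Returns None if no letters or marks are found.
    """
    counts = Counter()
    for ch in text:
        # Count letters and combining marks; skip spaces, digits, punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

hindi = "नमस्ते"      # a Hindi word
marathi = "नमस्कार"   # a Marathi word
bengali = "বাংলা"     # a Bengali word

for sample in (hindi, marathi, bengali):
    print(sample, dominant_script(sample))
```

The Hindi and Marathi samples both report the same script, which illustrates why script identification alone does not determine the language.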
Apart from basic characters such as consonants and vowels, most Indian languages combine two or more basic characters to form compound characters, whose shapes are more complex than those of their constituent basic characters. Some Indian scripts have a horizontal line over the characters, while others do not. These are some of the main challenges in creating a single OCR system for all Indian languages.
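In Unicode text, a compound character is typically stored as a sequence of basic code points even though it is drawn as one shape. The sketch below illustrates this with one Devanagari conjunct, using only Python's standard `unicodedata` module; the helper name `code_points` is hypothetical and chosen for this example.

```python
import unicodedata

def code_points(text):
    """Return (hex code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# The conjunct क्ष is stored as: consonant + virama + consonant.
conjunct = "\u0915\u094D\u0937"

print(conjunct, "has", len(conjunct), "code points:")
for cp, name in code_points(conjunct):
    print(" ", cp, name)
```

An OCR system therefore has to map one complex printed shape to several output code points, which is part of what makes compound characters harder to recognize than isolated basic characters.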
Indian languages have no concept of upper-case and lower-case characters. As in English, text is written from left to right, except in Urdu, which is written from right to left.
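The horizontal line found over the characters in some scripts can be located with a simple row-wise ink count (a horizontal projection profile). The following is a minimal sketch under stated assumptions: the word image is already binarized into a list of rows with 1 for ink and 0 for background, and the function names and the `min_fill` threshold of 0.8 are hypothetical choices for illustration, not the method of any OCR system named in this article.

```python
def row_ink_counts(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]

def find_headline(image, min_fill=0.8):
    """Return the index of the row that looks like a headline, or None.

    A row qualifies if it has the most ink of all rows and its ink
    covers at least min_fill of the image width.
    """
    if not image or not image[0]:
        return None
    width = len(image[0])
    counts = row_ink_counts(image)
    best = max(range(len(counts)), key=counts.__getitem__)
    return best if counts[best] / width >= min_fill else None

# Toy word image with a continuous line across the top of the characters.
with_headline = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
]

# Toy word image made of rounded shapes with no connecting line.
without_headline = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
]

print(find_headline(with_headline))
print(find_headline(without_headline))
```

A check like this could be used to route a word image to a script-specific recognizer, since removing the line before segmenting characters only makes sense for scripts that have one.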

Examples

  1. SanskritOCR - OCR software for Sanskrit, Hindi and other languages of India written in the Devanagari script.
  2. E-aksharayan - Optical character recognition engine for Indian languages
  3. Chitrankan - Developed by ISI, Kolkata, with the technology transferred to C-DAC. It processes printed Hindi text either directly from a scanner or from an image.