What are some of the issues with Optical Character Recognition of Slavic and East European languages? (FAQ12)
One of the areas where digitization of Slavic texts may cause problems is Optical Character Recognition (OCR). The second method for obtaining digital text files is to take the digital page images and run them through OCR. Particularly when a large amount of text is involved, OCR can be the best option. There are several commercial OCR programs available that can recognize Slavic languages, both those written in modified Latin alphabets and those written in Cyrillic alphabets.
The end result of OCR is a digital file of the vernacular language text. If the text is to be presented directly to the end user, a very high level of accuracy needs to be attained in the OCR process. Achieving this level of accuracy often increases the cost of this phase of the digitization project substantially. A less time-consuming and sometimes more cost-effective alternative is "dirty" OCR. In this method, the amount of post-OCR proofreading is minimized. The output text from the OCR program is searchable, but when the results of a search are presented, the page image is often displayed rather than the matching passage from the text file, which masks errors in the OCR. However, if an OCR error falls within a word that should match the search string, the search will miss that occurrence, which affects the accuracy of the search results.
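The "dirty" OCR approach described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from page image file names to uncorrected OCR text; the file names, sample sentences, and the `search_pages` function are invented for illustration and do not come from any particular OCR product. The search runs against the OCR text, but only the page image names are returned for display.

```python
# Hypothetical data: uncorrected ("dirty") OCR text keyed by page image file.
PAGES = {
    "page_001.tif": "Москва является столицей России",
    "page_002.tif": "Моснва расположена на реке",  # OCR misread "к" as "н"
    "page_003.tif": "Санкт-Петербург основан в 1703 году",
}


def search_pages(query, pages):
    """Return the page image names whose OCR text contains the query.

    Matching is a case-insensitive substring test on the OCR text;
    the caller displays the page images, not the text itself.
    """
    needle = query.casefold()
    return [image for image, text in pages.items() if needle in text.casefold()]


# The reader is shown page images, so OCR errors elsewhere on a page stay hidden.
print(search_pages("Москва", PAGES))
```

In this sketch the search for "Москва" finds only `page_001.tif`. The second page also mentions the city, but the OCR error sits inside the searched word, so that page is silently missed.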
In the case of all OCR (not just Slavic), there are several issues that affect the overall accuracy of the resulting text file:
1. Original Typeface. Relative weight of horizontal and vertical lines in glyphs.
2. Line spacing, broken [hyphenated] words, line justification.
3. Quality of original printing: smudgy, clean or dirty type, insufficient amount of ink used in the original printing process.
4. Opacity of paper (bleed through from other side), wicking of ink into paper fibers
5. Contrast (whiteness of paper versus blackness of ink) which is tied to:
6. Issues resulting from ageing: yellowness of paper, ink fading.
7. Quality of page image: bound versus flat pages, paper versus microfilm source document, skew, bit-depth [also called color depth].
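One common way to put a number on the accuracy that these factors affect is to compare a sample of OCR output with a proofread transcription of the same passage and count the character-level differences. The sketch below is one possible approach, not a method prescribed by this FAQ: it computes the Levenshtein edit distance and divides it by the length of the reference text. The function names and the sample strings are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance divided by the length of the proofread reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Hypothetical sample: one misread letter in a six-letter word.
rate = character_error_rate("Моснва", "Москва")
print(f"{rate:.1%}")
```

Measuring a small proofread sample in this way can help decide whether a batch of page images is good enough for "dirty" OCR or needs rescanning or correction.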