The tesseract package provides R bindings to Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so the detection algorithms can be tuned to obtain the best possible results.

Keep in mind that OCR (and pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect, and accuracy drops rapidly as the quality of the input image decreases. But if you can get your input images to reasonable quality, Tesseract can often extract most of the text from the image.

Extract Text from Images

OCR is the process of finding and recognizing text inside images, for example in a screenshot or a scanned page. The image below has some example text:

[Image: test]

library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
## This is a lot of 12 point text to test the
## ocr code and see if it works on all types
## of file format.
## 
## The quick brown dog jumped over the
## lazy fox. The quick brown dog jumped
## over the lazy fox. The quick brown dog
## jumped over the lazy fox. The quick
## brown dog jumped over the lazy fox.

Not bad! The ocr_data() function returns every word in the image along with its bounding box and confidence score.

results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng)
results
## # A tibble: 60 × 3
##    word  confidence bbox          
##    <chr>      <dbl> <chr>         
##  1 This        96.8 36,92,96,116  
##  2 is          96.9 109,92,129,116
##  3 a           95.7 141,98,156,116
##  4 lot         95.7 169,92,201,116
##  5 of          96.5 212,92,240,116
##  6 12          96.5 251,92,282,116
##  7 point       96.4 296,92,364,122
##  8 text        96.2 374,93,427,116
##  9 to          96.9 437,93,463,116
## 10 test        97.0 474,93,526,116
## # ℹ 50 more rows
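
The bbox column is a comma-separated string, so it usually needs to be split before it can be used for filtering or plotting. The sketch below is one possible way to do that in base R. It rebuilds a few rows of the output above by hand so that it runs without the engine, it assumes the four numbers are ordered x1,y1,x2,y2 (left, top, right, bottom), and the min_conf threshold is a hypothetical value chosen only for illustration.

```r
# A few rows copied from the ocr_data() output above, so this sketch runs
# without the tesseract engine or network access.
results <- data.frame(
  word = c("This", "is", "a", "lot"),
  confidence = c(96.8, 96.9, 95.7, 95.7),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,156,116", "169,92,201,116"),
  stringsAsFactors = FALSE
)

# Split each "x1,y1,x2,y2" string into four integer columns.
# Assumption: the order is left, top, right, bottom.
coords <- do.call(rbind, strsplit(results$bbox, ",", fixed = TRUE))
storage.mode(coords) <- "integer"
colnames(coords) <- c("x1", "y1", "x2", "y2")
boxes <- cbind(results[c("word", "confidence")], coords)

# Derived width of each word box in pixels.
boxes$width <- boxes$x2 - boxes$x1

# Keep only words at or above a hypothetical confidence threshold.
min_conf <- 96
confident <- boxes[boxes$confidence >= min_conf, ]
print(confident)
```

With real data, the hand-built data frame would simply be replaced by the results object returned by ocr_data().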

Language Data

The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms are biased towards words and sentences that frequently appear together in a given language, just like the human brain is. The most accurate results are therefore obtained when using training data for the correct language.

Use tesseract_info() to list the languages that you currently have installed.

tesseract_info()
## $datapath
## [1] "/Users/jeroen/Library/Application Support/tesseract5/tessdata/"
## 
## $available
## [1] "eng" "nld" "osd"
## 
## $version
## [1] "5.3.3"
## 
## $configs
##  [1] "alto"             "ambigs.train"     "api_config"       "bigram"          
##  [5] "box.train"        "box.train.stderr" "digits"           "get.images"      
##  [9] "hocr"             "inter"            "kannada"          "linebox"         
## [13] "logfile"          "lstm.train"       "lstmbox"          "lstmdebug"       
## [17] "makebox"          "pdf"              "quiet"            "rebox"           
## [21] "strokewidth"      "tsv"              "txt"              "unlv"            
## [25] "wordstrbox"
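
Because the list returned by tesseract_info() has an available element, it can be used to check for a language before creating an engine for it. The following is a minimal sketch of that check, not part of the package itself: has_language() is a hypothetical helper introduced here for illustration, and "nld" is just an example language code.

```r
# Hypothetical helper: TRUE if `lang` appears in the `available` element
# of a list shaped like the tesseract_info() output shown above.
has_language <- function(info, lang) {
  lang %in% info$available
}

# Only touch the engine when the tesseract package is installed.
if (requireNamespace("tesseract", quietly = TRUE)) {
  info <- tesseract::tesseract_info()
  lang <- "nld"  # Dutch, used here purely as an example
  if (has_language(info, lang)) {
    engine <- tesseract::tesseract(lang)
  } else {
    message("Training data for '", lang, "' is not installed; found: ",
            paste(info$available, collapse = ", "))
  }
}
```

Checking first avoids an error from tesseract() when the requested training data is missing.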

By default the R package only includes English training data. Windows and Mac users can install additional training data using tesseract_download(). Let’s OCR a screenshot from Wikipedia in Dutch (Nederlands).