The tesseract package provides R bindings to Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so the detection algorithms can be tuned to obtain the best possible results.
Keep in mind that OCR (and pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect, and accuracy drops rapidly as the quality of the input image decreases. But if you can get your input images to reasonable quality, Tesseract can often extract most of the text from the image.
OCR is the process of finding and recognizing text inside images, for example from a screenshot or a scanned paper document. The image below has some example text:
library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
## This is a lot of 12 point text to test the
## ocr code and see if it works on all types
## of file format.
##
## The quick brown dog jumped over the
## lazy fox. The quick brown dog jumped
## over the lazy fox. The quick brown dog
## jumped over the lazy fox. The quick
## brown dog jumped over the lazy fox.
Not bad! The ocr_data() function returns all words in the image along with a bounding box and confidence rate.
results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng)
results
## # A tibble: 60 × 3
## word confidence bbox
## <chr> <dbl> <chr>
## 1 This 96.8 36,92,96,116
## 2 is 96.9 109,92,129,116
## 3 a 95.7 141,98,156,116
## 4 lot 95.7 169,92,201,116
## 5 of 96.5 212,92,240,116
## 6 12 96.5 251,92,282,116
## 7 point 96.4 296,92,364,122
## 8 text 96.2 374,93,427,116
## 9 to 96.9 437,93,463,116
## 10 test 97.0 474,93,526,116
## # ℹ 50 more rows
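Because the bbox column is a plain string, it usually needs splitting before the coordinates can be used. The sketch below is one possible way to do that in base R. It rebuilds the first four rows printed above as a stand-in data frame so it runs without the engine. The helper name split_bbox, the column names x1, y1, x2, y2 (assumed to be the left, top, right and bottom pixel coordinates), and the confidence threshold of 96 are all assumptions made for illustration.

```r
# Stand-in for the ocr_data() result, copied from the first rows printed above.
results <- data.frame(
  word = c("This", "is", "a", "lot"),
  confidence = c(96.8, 96.9, 95.7, 95.7),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,156,116", "169,92,201,116"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each "x1,y1,x2,y2" string into four integer columns.
# Assumes the values are left, top, right, bottom pixel coordinates.
split_bbox <- function(bbox) {
  parts <- strsplit(bbox, ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.integer))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  as.data.frame(coords)
}

# Attach the numeric coordinates and derive the size of each box.
boxes <- cbind(results[c("word", "confidence")], split_bbox(results$bbox))
boxes$width <- boxes$x2 - boxes$x1
boxes$height <- boxes$y2 - boxes$y1

# Keep only words at or above an arbitrary confidence threshold.
confident <- boxes[boxes$confidence >= 96, ]
print(confident)
```

The same split_bbox() call should work on the real tibble returned by ocr_data(), provided the bbox strings keep the four-value comma-separated form shown in the output.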
The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms are biased towards words and sentences that frequently appear together in a given language, just like the human brain. Therefore the most accurate results are obtained when using training data for the correct language.
Use tesseract_info() to list the languages that you currently have installed.
tesseract_info()
## $datapath
## [1] "/Users/jeroen/Library/Application Support/tesseract5/tessdata/"
##
## $available
## [1] "eng" "nld" "osd"
##
## $version
## [1] "5.3.3"
##
## $configs
## [1] "alto" "ambigs.train" "api_config" "bigram"
## [5] "box.train" "box.train.stderr" "digits" "get.images"
## [9] "hocr" "inter" "kannada" "linebox"
## [13] "logfile" "lstm.train" "lstmbox" "lstmdebug"
## [17] "makebox" "pdf" "quiet" "rebox"
## [21] "strokewidth" "tsv" "txt" "unlv"
## [25] "wordstrbox"
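Since tesseract_info() returns a list, the available element can also be checked programmatically, for instance before creating an engine for a given language. The following is a minimal sketch. The helper has_language is hypothetical and not part of the package, and the info list below is a stand-in copied from the output above so that the example runs without querying the engine.

```r
# Hypothetical helper (not part of the tesseract package): TRUE if `lang`
# is listed in the `available` element of a tesseract_info()-style list.
# The default argument queries the installed engine when none is supplied.
has_language <- function(lang, info = tesseract::tesseract_info()) {
  lang %in% info$available
}

# Stand-in for the output shown above.
info <- list(available = c("eng", "nld", "osd"))

has_language("nld", info)  # TRUE: "nld" is in the list
has_language("fra", info)  # FALSE: "fra" is not in the list
```

In a real session, calling has_language("nld") without the second argument would use the languages reported by the installed engine.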
By default the R package only includes English training data. Windows and Mac users can install additional training data using tesseract_download(). Let’s OCR a screenshot from Wikipedia in Dutch (Nederlands)