Lynne Teaches Tech: How does OCRbot work?

after downloading the image, OCRbot uses tesseract-ocr to extract the text. this is a free (as in libre and gratis) program that uses neural networks to do the job. it’s been trained on a massive dataset of text and has “learned” what letters usually look like. it knows that a lowercase h is a line with a smaller line connected to it on the right, for example.
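
here's a rough sketch of what handing an image to tesseract from python could look like. OCRbot's real code isn't shown in this post, so the function names and the "eng" default below are made up for illustration. the only part taken from tesseract itself is the command line: an image path, an output name (stdout prints the text instead of writing a file), and an optional -l language.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    # hypothetical helper: builds `tesseract <image> stdout -l <lang>`
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    # hypothetical helper: runs tesseract and returns whatever text it found.
    # needs the tesseract binary installed and on your PATH.
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()
```

calling extract_text("screenshot.png") would then give back the recognised text as a string, assuming tesseract could read the image.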

tesseract first tries to split the image into chunks of what it believes to be text. it then splits these chunks into words by looking for spaces, and then tries to identify the letters making up those words. if any one of these steps fails, the entire process fails.
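
to make the "look for spaces" step a bit more concrete, here's a toy version. this is not tesseract's actual code, just a sketch under some big assumptions: the image is a list of strings where "#" is ink and "." is blank, letters never touch each other, and any gap of 3 or more blank columns (the word_gap value is invented) counts as a space between words.

```python
def ink_columns(rows):
    # True for every column that has at least one "#" in it
    width = max(len(row) for row in rows)
    return [
        any(col < len(row) and row[col] == "#" for row in rows)
        for col in range(width)
    ]


def glyph_spans(rows):
    # group neighbouring inked columns into (start, end) spans, one per letter
    columns = ink_columns(rows)
    spans, start = [], None
    for col, has_ink in enumerate(columns):
        if has_ink and start is None:
            start = col
        elif not has_ink and start is not None:
            spans.append((start, col - 1))
            start = None
    if start is not None:
        spans.append((start, len(columns) - 1))
    return spans


def group_into_words(spans, word_gap=3):
    # a blank gap of word_gap columns or more starts a new word
    words = []
    for span in spans:
        if words and span[0] - words[-1][-1][1] - 1 < word_gap:
            words[-1].append(span)
        else:
            words.append([span])
    return words
```

feeding it two rows of "#.#...#" gives three letter spans, grouped into a two-letter word followed by a one-letter word. working out which letter each span actually is would be the neural network's job.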

neural networks work in a similar way to how a brain does, but on a much smaller scale, since computers really aren’t up to the task of simulating an entire brain right now. powerful computers are used to train the neural network for hours and hours, making it more accurate at reading text. this has some drawbacks: if it were only ever trained on helvetica text, it’d only be able to read helvetica, for example. that’s why it’s important to train it on lots of different fonts. tesseract provides pre-trained data for you, but you can train your own if you’d like.

finally, after extracting the text, OCRbot does some rudimentary fixes (like replacing | with I, as tesseract thinks | is a lot more frequent than it really is) and posts the reply.
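
the cleanup step could be as simple as the sketch below. the | to I swap is the one fix mentioned above; the table and function names are made up, and OCRbot's real list of fixes isn't shown here.

```python
# hypothetical table of misreads; only the "|" entry comes from this post
COMMON_MISREADS = {
    "|": "I",
}


def fix_common_misreads(text):
    # swap each commonly misread character for what it most likely was
    for wrong, right in COMMON_MISREADS.items():
        text = text.replace(wrong, right)
    return text
```

so fix_common_misreads("| think | am") comes back as "I think I am". the catch with a blanket swap like this is that any | that really was in the image gets turned into an I too.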
