Thursday, July 09, 2009

reCAPTCHA's Business Model

If you use the web, chances are you've been asked to solve a "captcha." A captcha is a way of telling humans and machines apart by asking users to transcribe distorted text that a computer cannot read. Whether they are preventing spam on blogs or verifying website sign-ups, captchas keep malicious programs from sending spam and consuming resources. Captchas have been around for almost a decade and are fairly commonplace, but a project called reCAPTCHA is pushing the envelope in how the resulting data is used, and it has strong potential to become a lucrative company.

reCAPTCHA is a project from Carnegie Mellon that offers a standard captcha service for free to any website. What is innovative about reCAPTCHA is that it asks users to transcribe two words before they can proceed. The first word has a known value and serves as the test, while the second is a word whose text reCAPTCHA does not yet know and wants to learn. If enough users agree on the second word to a point of statistical significance, chances are that the garbled word has been transcribed correctly.
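To make the two-word idea concrete, here is a minimal Python sketch. It is only an illustration, not reCAPTCHA's actual code: the function names, the word IDs, and the thresholds (at least 3 votes, 75% agreement) are all assumptions, and a simple agreement ratio stands in for a real statistical test.

```python
from collections import Counter, defaultdict

# Hypothetical thresholds for illustration; not reCAPTCHA's real values.
MIN_VOTES = 3
MIN_AGREEMENT = 0.75

# Maps an unknown word's ID to a tally of the transcriptions users gave it.
votes = defaultdict(Counter)


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def submit(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the user passes the test; record a vote only if so."""
    if normalize(control_guess) != normalize(control_answer):
        return False  # failed the known word, so the other guess is ignored
    votes[unknown_id][normalize(unknown_guess)] += 1
    return True


def resolved_text(unknown_id):
    """Return the agreed transcription, or None if there is no consensus yet."""
    counts = votes[unknown_id]
    total = sum(counts.values())
    if total < MIN_VOTES:
        return None
    word, top = counts.most_common(1)[0]
    return word if top / total >= MIN_AGREEMENT else None
```

The key design point is that a guess at the unknown word only counts when the same user got the known word right, which is what lets the service trust answers it cannot check directly.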

Here are two examples from reCAPTCHA's website.

Original scanned image:

The computer's translation of the image into text, with unreadable parts highlighted:

reCAPTCHA is currently working with the Internet Archive and the New York Times in an effort to convert books and old newspapers to text so that they can be preserved, searched, and kept accessible for generations to come. In addition to the altruistic applications of reCAPTCHA's technology and data, it could have a very lucrative business model. Several companies are digitizing books, including Google and Amazon, and reCAPTCHA could license its technology to help these companies transcribe books more quickly and accurately. Another potential business would be to license the technology to law firms that have to sift through thousands of pages of written documents to gather evidence and build their cases. Using this technology would save these firms time and reduce their labor costs. reCAPTCHA is a great example of a free service that is generating huge amounts of data and using it in a valuable way.

Update 9/16/2009:
Google has acquired reCAPTCHA and will be applying its technology to digitize more content. Google is essentially buying time so that it doesn't have to wait to build out its own OCR library. From the Google Blog:
This technology powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.


david said...

NOVA just had a segment about the inventor:

Will Hambly said...

Thanks David. That was interesting. Good stuff.

amuthanjrv said...

This article is showing your knowledge in reCAPTCHA's business model.

David Sameth said...

Nice information, thanks for sharing 2captcha