
reCAPTCHA: Crowd Sourcing


CAPTCHA: Telling Humans and Computers Apart Automatically

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. For example, humans can read distorted text such as the one shown below, but current computer programs can’t:

CAPTCHA example

The term CAPTCHA (for Completely Automated Public Turing Test To Tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University.


Luis von Ahn, associate professor of computer science at Carnegie Mellon University and one of the original creators of the CAPTCHA challenge, had a brainstorm a couple of years back: why not harness all the time and energy people put into typing CAPTCHA codes, and put it to good use?

Now it is. Many of the CAPTCHA codes presented to verify human users are actually words taken from scanned print books that optical character recognition (OCR) could not read, farmed out to people for conversion to digital text.

As von Ahn put it at a recent TED presentation, there’s a lot of potential energy and brainpower that can be harnessed out there:

“It turns out that approximately 200 million CAPTCHAs are typed every day by people around the world. When I first heard this, I was quite proud of myself. I thought, look at the impact that my research has had. But then I started feeling bad. See here’s the thing, each time you type a CAPTCHA, essentially you waste 10 seconds of your time. And if you multiply that by 200 million, you get that humanity as a whole is wasting about 500,000 hours every day typing these annoying CAPTCHAs. So then I started feeling bad.”
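The arithmetic behind that estimate is easy to check. The short sketch below uses only the two figures quoted in the talk (200 million CAPTCHAs a day, 10 seconds each); the variable names are just for illustration. The exact product is a little above the round "about 500,000 hours" von Ahn cites.

```python
# Figures quoted in the talk
captchas_per_day = 200_000_000
seconds_per_captcha = 10

# Total time spent per day, converted from seconds to hours
total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600

print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```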

Von Ahn and his team launched the “reCAPTCHA” project, which works with libraries and publishers to deliver scanned word images to websites that use CAPTCHAs, essentially using the wisdom of the crowd to convert the words into text. While OCR technology automatically converts many words into digital text, it fails to recognize about 30% of the words in printed works more than 50 years old. “So the next time you type a CAPTCHA, these words that you’re typing are actually words that are coming from books that are being digitized that the computer could not recognize,” he says.

Currently, reCAPTCHA is helping to digitize 100 million words a day, or the equivalent of about two and a half million books a year, von Ahn says.
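The post does not say how the typed answers are combined, so the following is only a minimal sketch of one plausible "wisdom of the crowd" rule: accept a transcription once enough people agree on it. The function name `consensus` and the thresholds `min_votes` and `min_share` are hypothetical, chosen for illustration, and are not taken from reCAPTCHA itself.

```python
from collections import Counter
from typing import Optional


def consensus(answers: list[str], min_votes: int = 3,
              min_share: float = 0.6) -> Optional[str]:
    """Return the agreed transcription, or None if the crowd has not converged.

    min_votes and min_share are illustrative assumptions, not reCAPTCHA's values.
    """
    # Ignore case and surrounding whitespace; drop empty answers
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None  # not enough answers yet
    word, count = Counter(normalized).most_common(1)[0]
    # Accept only if the top answer holds a large enough share of the votes
    return word if count / len(normalized) >= min_share else None


print(consensus(["morning", "Morning", "moming", "morning"]))  # prints "morning"
print(consensus(["upon", "tipon"]))                            # prints "None"
```

In this sketch a word that OCR could not read stays unresolved until at least three people have typed it and a clear majority of them agree; anything short of that is sent out again.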

“Every time you buy tickets on Ticketmaster, you help to digitize a book. Facebook: Every time you add a friend or poke somebody, you help to digitize a book. Twitter and about 350,000 other sites are all using reCAPTCHA.”

