IntelliJ Java IDE
The moose likes General Computing and the fly likes How does google know when you've made a typo? (serious question) Big Moose Saloon
  Search | Java FAQ | Recent Topics
Register / Login
JavaRanch » Java Forums » Engineering » General Computing
Reply Bookmark "How does google know when you Watch "How does google know when you New topic
Author

How does google know when you've made a typo? (serious question)

James Hodgkiss
Ranch Hand

Joined: Jan 22, 2004
Posts: 394
If I search for VEGUN FOOD in google, it knows I meant VEGAN FOOD.

How does it know? What the process / technology called?

I need to research whatever it is, as we have a web service where a consumer types in a product name to get info on it, but they so often make a typo and it comes back with nothing found.

I have read a bit about fuzzy logic, but that doesn't appear to be the answer.

Any help or advice much appreciated.

Thanks,
James
Bear Bibeault
Author and opinionated walrus
Marshal

Joined: Jan 10, 2002
Posts: 50691

Well it's easy for them to tell when there's a typo via a simple spell check. As for how they try to figure out what you meant, who knows. They could just be using a Soundex, but I'd also expect that they're doing something a little more sophisticated.


[Smart Questions] [JSP FAQ] [Books by Bear] [Bear's FrontMan] [About Bear]
Pavani Vantimitta
Greenhorn

Joined: Mar 04, 2010
Posts: 1
Soundex and Spell Checker is right. But the spell checker works with the index that google creates with all its crawled web pages. Depending on the closest terms it finds to your query terms, and some heuristics it figures out what would be the correct spelling.

If you have noticed, sometimes they return with multiple spell checks for the same word too.
James Hodgkiss
Ranch Hand

Joined: Jan 22, 2004
Posts: 394
Hiya Bear,

Do you know of a decent spellcheck utility in Java?

The thing that got me with google today is that I did a search for "Caroline Garland" and it gave me search results for "Carolyn Garland" - which is actually what I needed. Presumably, it searches for variations of your search terms (using whatever means) and then recommends the one with the most hits.

If it is only a spellcheck they use to create the variations, but it must be a pretty advanced one, cos even when I type in HEZN it know I meant HEINZ. Maybe they've just got a really advanced spellcheck who's dictionary is updated when a user corrects a typo, which would be easy enough to detect... (How I'd love access to that dictionary though!)

Hmmm, anyway if you know of any decent Java ones that would be great to be going on with.

Thanks,
James
Chris Baron
Ranch Hand

Joined: Mar 21, 2003
Posts: 1028
A colleague told me that he build such a word similarity check with the "Levenshtein Algorithm".
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 32768
The "Garland" example would indeed be covered by Soundex, although that has some drawbacks that were corrected by Metaphone (and Double Metaphone, if you need to support non-English languages). The Apache Commons Codec library has implementations of all these algorithms. The general procedure would be to calculate Metaphone("Garland") beforehand, and then compare Metaphone(whatever_term_you're_searching_for) to it.

Another interesting technique is the "edit distance", in particular the Levenshtein and -even better- Damerau/Levenshtein algorithms. It handles the HEZN/HEINZ part of the problem.

If you're looking for a more complete package, check out the Apache Lucene library; it is *the* premier Java text indexing/searching library. If it seems like it could be useful, get the "Lucene in Action" book - it's indispensable for serious users.


Android appsImageJ pluginsJava web charts
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 13842

You might be surprised at the answer. I don't know what the answer is, but I am pretty sure it's more complex than the ideas posted so far. And I don't think that spelling correction is involved, because you actually want to find pages on which your search term was misspelled in a lot of cases.

Here's an example of what Google can do: A few months ago I was searching for something on a Czech website, and one of my search terms was the word "modry" (which means "blue). Not only did I get links to pages with that word on them, I also got links to pages containing the word "modrem" (which is one of the Czech grammatical forms of "modry").

So Google's search algorithm can do stemming in many languages. And Google has thousands of employees who are working to make their search algorithms more flexible. So I'd say it isn't something as simple as Soundex, or at least it isn't only that.
pawan chopra
Ranch Hand

Joined: Jan 23, 2008
Posts: 326

This is a good link google search


Pawan Chopra
SCJP - DuMmIeS mInD
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 32768
This is a good link google search

Not really. PageRank is a different aspect of how Google works, but this is about text processing, not web site relevance.
pawan chopra
Ranch Hand

Joined: Jan 23, 2008
Posts: 326

Ulf Dittmer wrote:
This is a good link google search

Not really. PageRank is a different aspect of how Google works, but this is about text processing, not web site relevance.


Yes, that is for page rank. I had gone through this Article
which explains about the searched text processing. I am not sure that even this time I am posting the right link or not.
Joachim Rohde
Ranch Hand

Joined: Nov 27, 2006
Posts: 419

Someone asked the same question here where also a video is linked in one of the answers. Haven't watched it yet, though.
Bert Bates
author
Sheriff

Joined: Oct 14, 2002
Posts: 8439
I've heard folks from Google describe at least *part* of what's going on, and it's cool...

Very roughly, Google doesn't use a spell checker, or have a dictionary. Instead, it basically 'votes' on what word you probably meant by doing searches on how frequently close spellings can be found on the web. In other words, if everyone on the web uses the word "orange", so "orange" gets a zillion hits, it assumes that when you typed "oragne" you meant "orange".

Further, I recently saw a talk in which a very crude version of this algorithm was written in about 20 lines of python code


Eliminate fossil fuel subsidies. (If you're not on the edge, you're taking up too much room.)
 
IntelliJ Java IDE
 
subject: How does google know when you've made a typo? (serious question)
 
Threads others viewed
Calling javascript function in js file from JSP
RequestDispatcher Doubt
UN World Food Day Message: more than 1 billion people are hungry - Why?
I just browsed JavaRanch from an iPhone...
JavaScript-Printing
IntelliJ Java IDE