How does google know when you've made a typo? (serious question)
James Hodgkiss
Ranch Hand
Joined: Jan 22, 2004
Posts: 394
posted
0
If I search for VEGUN FOOD in google, it knows I meant VEGAN FOOD.
How does it know? What the process / technology called?
I need to research whatever it is, as we have a web service where a consumer types in a product name to get info on it, but they so often make a typo and it comes back with nothing found.
I have read a bit about fuzzy logic, but that doesn't appear to be the answer.
Any help or advice much appreciated.
Thanks,
James
Bear Bibeault
Author and opinionated walrus
Marshal
Well it's easy for them to tell when there's a typo via a simple spell check. As for how they try to figure out what you meant, who knows. They could just be using a Soundex, but I'd also expect that they're doing something a little more sophisticated.
Soundex and Spell Checker is right. But the spell checker works with the index that google creates with all its crawled web pages. Depending on the closest terms it finds to your query terms, and some heuristics it figures out what would be the correct spelling.
If you have noticed, sometimes they return with multiple spell checks for the same word too.
James Hodgkiss
Ranch Hand
Joined: Jan 22, 2004
Posts: 394
posted
0
Hiya Bear,
Do you know of a decent spellcheck utility in Java?
The thing that got me with google today is that I did a search for "Caroline Garland" and it gave me search results for "Carolyn Garland" - which is actually what I needed. Presumably, it searches for variations of your search terms (using whatever means) and then recommends the one with the most hits.
If it is only a spellcheck they use to create the variations, but it must be a pretty advanced one, cos even when I type in HEZN it know I meant HEINZ. Maybe they've just got a really advanced spellcheck who's dictionary is updated when a user corrects a typo, which would be easy enough to detect... (How I'd love access to that dictionary though!)
Hmmm, anyway if you know of any decent Java ones that would be great to be going on with.
Thanks,
James
Chris Baron
Ranch Hand
Joined: Mar 21, 2003
Posts: 1028
posted
0
A colleague told me that he build such a word similarity check with the "Levenshtein Algorithm".
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 32768
posted
0
The "Garland" example would indeed be covered by Soundex, although that has some drawbacks that were corrected by Metaphone (and Double Metaphone, if you need to support non-English languages). The Apache Commons Codec library has implementations of all these algorithms. The general procedure would be to calculate Metaphone("Garland") beforehand, and then compare Metaphone(whatever_term_you're_searching_for) to it.
Another interesting technique is the "edit distance", in particular the Levenshtein and -even better- Damerau/Levenshtein algorithms. It handles the HEZN/HEINZ part of the problem.
If you're looking for a more complete package, check out the Apache Lucene library; it is *the* premier Java text indexing/searching library. If it seems like it could be useful, get the "Lucene in Action" book - it's indispensable for serious users.
You might be surprised at the answer. I don't know what the answer is, but I am pretty sure it's more complex than the ideas posted so far. And I don't think that spelling correction is involved, because you actually want to find pages on which your search term was misspelled in a lot of cases.
Here's an example of what Google can do: A few months ago I was searching for something on a Czech website, and one of my search terms was the word "modry" (which means "blue). Not only did I get links to pages with that word on them, I also got links to pages containing the word "modrem" (which is one of the Czech grammatical forms of "modry").
So Google's search algorithm can do stemming in many languages. And Google has thousands of employees who are working to make their search algorithms more flexible. So I'd say it isn't something as simple as Soundex, or at least it isn't only that.
Not really. PageRank is a different aspect of how Google works, but this is about text processing, not web site relevance.
Yes, that is for page rank. I had gone through this Article which explains about the searched text processing. I am not sure that even this time I am posting the right link or not.
Someone asked the same question here where also a video is linked in one of the answers. Haven't watched it yet, though.
Bert Bates
author
Sheriff
Joined: Oct 14, 2002
Posts: 8439
posted
0
I've heard folks from Google describe at least *part* of what's going on, and it's cool...
Very roughly, Google doesn't use a spell checker, or have a dictionary. Instead, it basically 'votes' on what word you probably meant by doing searches on how frequently close spellings can be found on the web. In other words, if everyone on the web uses the word "orange", so "orange" gets a zillion hits, it assumes that when you typed "oragne" you meant "orange".
Further, I recently saw a talk in which a very crude version of this algorithm was written in about 20 lines of python code
Eliminate fossil fuel subsidies. (If you're not on the edge, you're taking up too much room.)
subject: How does google know when you've made a typo? (serious question)