• Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

How to write a spell checker?

 
A Bhattacharya
Ranch Hand
Posts: 125
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have got asked this question in 2 different interviews so far. I need the answer before going for the next interview. The only answer that comes to my mind is to compare the typed word against every word in the dictionary, to find the edit distance and suggest the words that have the least edit distance. Obviously this procedure is very inefficient and may take hours. I need a better answer.
 
Henry Wong
author
Marshal
Pie
Posts: 20881
75
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Question: if you are checking to see if a word is a real word, do you check every word in the dictionary to see if it matches? Or do you try to look up the word, alphabetically to see if the word exists?

Henry
 
A Bhattacharya
Ranch Hand
Posts: 125
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How to identify a real word depends on what kind of datastructure you use for the dictionary.
 
Henry Wong
author
Marshal
Pie
Posts: 20881
75
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by A Bhattacharya:
How to identify a real word depends on what kind of datastructure you use for the dictionary.


True, but remember that it is a job interview. If you are not willing to make assumptions about the datastructure -- like the dictionary is ordered and searchable, then what does that say about you?

Are you the type that really believe that it is okay to have a dictionary which is just a random ordering of words? Are you the type that won't speak up, if the dictionary is just a random ordering of words?

Henry
 
Devesh H Rao
Ranch Hand
Posts: 687
Hibernate jQuery Spring
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Henry Wong:


okay to have a dictionary which is just a random ordering of words?

Henry


How else do you suppose, the human mind works

But yes, I do agree with henry, a spell check(software) would be against a alphabetically sorted word list.
[ May 24, 2008: Message edited by: Devesh H Rao ]
 
Rambo Prasad
Ranch Hand
Posts: 628
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I think you can build a binary tree...I think something like a Trie should help...

Then do a binary search....
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34071
331
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by A Bhattacharya:
How to identify a real word depends on what kind of datastructure you use for the dictionary.

It does. And this is something you can ask the interviewer. When I give interviews, I leave out some information from my coding questions. Virtually all of the good candidates ask clarifying questions or ask whether they can assume something is a certain way. Sometimes they ask things I intentionally left out and sometimes they ask things I didn't even think about. And recently the twice I have forgotten to specify what version of Java we are using, the candidate asked. (since Java 1.4 and 5/6 are so different.)

In the real world, assignments don't always contain all the required information. It isn't a trick, it's just what happens in requirements documents. Being able to identify this information and ask for clarification is a valuable skill.

If someone asked me about a spell checker, I would state that I was assuming the data structure was laid out a certain way. (alphabetical makes sense, 26 alphabetical files - one for each letter - seems even better) Then I would go on to answer the actual question.
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.
 
A Bhattacharya
Ranch Hand
Posts: 125
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Rambo, REmember we need to give spelling suggestions.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 34071
331
Eclipse IDE Java VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Ilja Preuss:
I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.

I like that answer! Although I'd probably prompt the candidate to speculate about an algorithm anyway to see how they think.

Originally posted by A Bhattacharya:
REmember we need to give spelling suggestions.

Which is easier to do in a sorted list. I imagine it is faster to look for all the entries with one or two letters different/missing/added than look at every single word in the dictionary.
 
Deepak Bala
Bartender
Posts: 6663
5
Firefox Browser Linux MyEclipse IDE
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.


I would have said that as well. I was researching a SWING spell checker a while back and it was open source. Cant remember the name

I would go for a sorted list and search through it. Pretty much the same algorithm that you suggested. If you can assign ranks to the possible completed word the list would be more and more meaningful as time goes by. Adaptive auto complete.
 
Srikanth Basa
Ranch Hand
Posts: 241
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hashing can be a part of one of the better solutions.
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Srikanth Basavaraju:
Hashing can be a part of one of the better solutions.


I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.
 
Ulf Dittmer
Rancher
Pie
Posts: 42967
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Ilja Preuss:
I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.


It is a fascinating area. Two algorithms that might help are Double Metaphone and the Levenshtein/Damerau Edit Distance. The former creates a normalized presentation of a word, enabling you to compare two different words, while the latter calculates the "distance" between two words, i.e. how many edit operations they are apart. Both can be used to determine whether a particular word -that is known to be written correctly- might be a substitute for an unknown -possibly misspelt- word.
 
A Bhattacharya
Ranch Hand
Posts: 125
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Like I mentioned in my first post, I told this 'edit distance' thing in my answer. Problem is, you have to find the edit distance of the given word from EVERY word in the dictionary and choose the one at least distance. It will not work.
 
Ulf Dittmer
Rancher
Pie
Posts: 42967
73
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Yeah, you're probably better off to narrow down the range of possible matches using Metaphone (or Double Metaphone for certain non-English languages). Those values can be precomputed. Once you have a small selection of possible matches you might usefully compute the edit distance for just those few words.
[ May 29, 2008: Message edited by: Ulf Dittmer ]
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Ilja Preuss:

I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.


Thought a little bit more about it, and I guess that those numbers would need to be *really, really big* - probably *too* big to be a reasonable solution. It just would consume too much memory... (This is just my current gut feel, though.)
 
Pat Farrell
Rancher
Posts: 4678
7
Linux Mac OS X VI Editor
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Gorin wrote SPELL in 1971.
ispell and aspell are open source grandchildren from it.

Finding words in the dictionary is trivial. Get dictionary words, populate HashMap.

Its finding and correcting misspellings that is challenging.
 
kaviyazhagan kajendiran
Greenhorn
Posts: 4
Eclipse IDE Java Windows XP
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
we are using morkov algorthm for text handinling

may be non linear data structure like List data structure

 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic