File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Jobs Discussion and the fly likes How to write a spell checker? Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Android Security Essentials Live Lessons this week in the Android forum!
JavaRanch » Java Forums » Careers » Jobs Discussion
Bookmark "How to write a spell checker?" Watch "How to write a spell checker?" New topic
Author

How to write a spell checker?

A Bhattacharya
Ranch Hand

Joined: Oct 22, 2007
Posts: 125
I have got asked this question in 2 different interviews so far. I need the answer before going for the next interview. The only answer that comes to my mind is to compare the typed word against every word in the dictionary, to find the edit distance and suggest the words that have the least edit distance. Obviously this procedure is very inefficient and may take hours. I need a better answer.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18545
    
  40

Question: if you are checking to see if a word is a real word, do you check every word in the dictionary to see if it matches? Or do you try to look up the word, alphabetically to see if the word exists?

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
A Bhattacharya
Ranch Hand

Joined: Oct 22, 2007
Posts: 125
How to identify a real word depends on what kind of datastructure you use for the dictionary.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18545
    
  40

Originally posted by A Bhattacharya:
How to identify a real word depends on what kind of datastructure you use for the dictionary.


True, but remember that it is a job interview. If you are not willing to make assumptions about the datastructure -- like the dictionary is ordered and searchable, then what does that say about you?

Are you the type that really believe that it is okay to have a dictionary which is just a random ordering of words? Are you the type that won't speak up, if the dictionary is just a random ordering of words?

Henry
Devesh H Rao
Ranch Hand

Joined: Feb 09, 2002
Posts: 687

Originally posted by Henry Wong:


okay to have a dictionary which is just a random ordering of words?

Henry


How else do you suppose, the human mind works

But yes, I do agree with henry, a spell check(software) would be against a alphabetically sorted word list.
[ May 24, 2008: Message edited by: Devesh H Rao ]
Rambo Prasad
Ranch Hand

Joined: Feb 23, 2006
Posts: 628
I think you can build a binary tree...I think something like a Trie should help...

Then do a binary search....


Helping hands are much better than the praying lips
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 30136
    
150

Originally posted by A Bhattacharya:
How to identify a real word depends on what kind of datastructure you use for the dictionary.

It does. And this is something you can ask the interviewer. When I give interviews, I leave out some information from my coding questions. Virtually all of the good candidates ask clarifying questions or ask whether they can assume something is a certain way. Sometimes they ask things I intentionally left out and sometimes they ask things I didn't even think about. And recently the twice I have forgotten to specify what version of Java we are using, the candidate asked. (since Java 1.4 and 5/6 are so different.)

In the real world, assignments don't always contain all the required information. It isn't a trick, it's just what happens in requirements documents. Being able to identify this information and ask for clarification is a valuable skill.

If someone asked me about a spell checker, I would state that I was assuming the data structure was laid out a certain way. (alphabetical makes sense, 26 alphabetical files - one for each letter - seems even better) Then I would go on to answer the actual question.


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
A Bhattacharya
Ranch Hand

Joined: Oct 22, 2007
Posts: 125
Rambo, REmember we need to give spelling suggestions.
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 30136
    
150

Originally posted by Ilja Preuss:
I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.

I like that answer! Although I'd probably prompt the candidate to speculate about an algorithm anyway to see how they think.

Originally posted by A Bhattacharya:
REmember we need to give spelling suggestions.

Which is easier to do in a sorted list. I imagine it is faster to look for all the entries with one or two letters different/missing/added than look at every single word in the dictionary.
Deepak Bala
Bartender

Joined: Feb 24, 2006
Posts: 6661
    
    5

I'd answer that I would spend half an hour on researching existing open source spell checkers. Failing that, I would google for algorithms for spell checkers. Failing that, I'd ask the folks at JavaRanch.


I would have said that as well. I was researching a SWING spell checker a while back and it was open source. Cant remember the name

I would go for a sorted list and search through it. Pretty much the same algorithm that you suggested. If you can assign ranks to the possible completed word the list would be more and more meaningful as time goes by. Adaptive auto complete.


SCJP 6 articles - SCJP 5/6 mock exams - More SCJP Mocks
Srikanth Basa
Ranch Hand

Joined: Jun 06, 2005
Posts: 241
Hashing can be a part of one of the better solutions.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Srikanth Basavaraju:
Hashing can be a part of one of the better solutions.


I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41151
    
  45
Originally posted by Ilja Preuss:
I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.


It is a fascinating area. Two algorithms that might help are Double Metaphone and the Levenshtein/Damerau Edit Distance. The former creates a normalized presentation of a word, enabling you to compare two different words, while the latter calculates the "distance" between two words, i.e. how many edit operations they are apart. Both can be used to determine whether a particular word -that is known to be written correctly- might be a substitute for an unknown -possibly misspelt- word.


Ping & DNS - my free Android networking tools app
A Bhattacharya
Ranch Hand

Joined: Oct 22, 2007
Posts: 125
Like I mentioned in my first post, I told this 'edit distance' thing in my answer. Problem is, you have to find the edit distance of the given word from EVERY word in the dictionary and choose the one at least distance. It will not work.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41151
    
  45
Yeah, you're probably better off to narrow down the range of possible matches using Metaphone (or Double Metaphone for certain non-English languages). Those values can be precomputed. Once you have a small selection of possible matches you might usefully compute the edit distance for just those few words.
[ May 29, 2008: Message edited by: Ulf Dittmer ]
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Ilja Preuss:

I was thinking about something similar - assigning a single number to each word, so that the closer they are in spelling, the closer the numbers would be. Should be interesting.


Thought a little bit more about it, and I guess that those numbers would need to be *really, really big* - probably *too* big to be a reasonable solution. It just would consume too much memory... (This is just my current gut feel, though.)
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
    
    5

Gorin wrote SPELL in 1971.
ispell and aspell are open source grandchildren from it.

Finding words in the dictionary is trivial. Get dictionary words, populate HashMap.

Its finding and correcting misspellings that is challenging.
kaviyazhagan kajendiran
Greenhorn

Joined: Apr 24, 2011
Posts: 4

we are using morkov algorthm for text handinling

may be non linear data structure like List data structure

 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: How to write a spell checker?
 
Similar Threads
how to use distanceBetween built in methog
arrays not equal
Algorithm to find Nodes having largest distance in Binary tree.
Code Help
any good algorithms for a spellchecker?