Coding a simple Bayes classifier in Java. Need help with the algo

karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10
Edit: deleted original post, removed links to GitHub, copied code here instead. Corrected spelling.

My naive Bayes classifier started out as a spam filter and has now been recruited to classify whether a text is by Dickens or Twain.
First of all, is this the right forum for this question?

Second, it doesn't work very well. Can anyone help me correct the algo? I sorta copied some of it from the shiffman.net tutorial, which sorta uses the approach from Paul Graham's "A Plan for Spam".
PS: the code is not in OOP style, it's more or less procedural. Is this a problem?

Dave Trower
Ranch Hand

Joined: Feb 12, 2003
Posts: 86
I read an article on how spam filters work, but my experience is limited to that one article. You need to create a dictionary that, for each word used in any of the books, returns the probability that a text containing it is by Twain rather than Dickens. I think you do this by counting the frequency of each word in each author's books.
Then, when you are given a sample book, you look up the probability for each of its words in the dictionary and apply the Bayesian algorithm to combine them.
Let me know if this helps.
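
The counting step described above could be sketched roughly like this. The class name, the method name and the tokenizing rule (lower-case, split on anything that is not a letter or apostrophe) are assumptions made for illustration; they are not taken from the original code, which is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts word frequencies for one author's text.
class WordCounts {
    // Returns a map from lower-cased word to the number of times it occurs.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenizer: anything that is not a-z or an apostrophe separates words.
        for (String word : text.toLowerCase().split("[^a-z']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> sample = count("The river, the raft and the river bank.");
        System.out.println(sample.get("river")); // 2
        System.out.println(sample.get("the"));   // 3
    }
}
```

One such map per author (one for Twain, one for Dickens) would give the raw counts from which the per-word probabilities can be derived.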
karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10
Looks like I read the same article too. Except I don't know what I am doing wrong in my code, 'cos I don't know what output to expect.
Dave Trower
Ranch Hand

Joined: Feb 12, 2003
Posts: 86
I would suggest you Google the words "bayesian for spam". I do not have the original article, but this is a good one:
webpage

Here is a quote from the article:
This word probability is calculated as follows: If the word “mortgage” occurs in 400 of 3,000 spam mails and
in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000]
divided by [5/300 + 400/3000]).

So now, in the dictionary, the probability for the word "mortgage" is 0.8889.

So if the word "mortgage" is used in an e-mail, that word on its own suggests an 88.89% chance that the e-mail is spam. However, the Bayesian filter looks at all the words in an e-mail, so the total probability of an e-mail being spam changes based on the other words.
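
The arithmetic in the quote can be checked with a few lines of Java. This is only a sketch of the quoted formula; the class, method and parameter names are made up for illustration, and it assumes the word occurs at least once in one of the two sets (otherwise the division is 0/0).

```java
import java.util.Locale;

// Hypothetical helper reproducing the formula from the quoted article.
class WordProbability {
    // (spamHits/spamTotal) / (hamHits/hamTotal + spamHits/spamTotal)
    static double spamProbability(int spamHits, int spamTotal, int hamHits, int hamTotal) {
        double rSpam = (double) spamHits / spamTotal; // fraction of spam mails containing the word
        double rHam = (double) hamHits / hamTotal;    // fraction of legitimate mails containing the word
        return rSpam / (rSpam + rHam);
    }

    public static void main(String[] args) {
        // "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails
        double p = spamProbability(400, 3000, 5, 300);
        System.out.println(String.format(Locale.ROOT, "%.4f", p)); // 0.8889
    }
}
```

For the Twain/Dickens case the same function applies with "spam" and "legitimate" replaced by the two authors.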

In your case, you build a dictionary based on how often a word appears in each of the two authors' works.
I think the output of your program should be something like:
There is a 99.3% chance the book I just looked at is by Twain.
karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10
Okay, that is pretty much what I am doing, except for the final part, where I combine the probabilities for each of the words.
I am making two dictionaries, one for spam and one for ham, then I calculate the spam probability by doing rSpam / (rSpam + rHam).
So I _am_ on the right track. Let me see if I am doing something wrong in the combining of probabilities...
I also return a default of 0.1 if the probability is 0. That might be a downer.
I'll update in a day.
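
For the combining step, one common choice is the formula from Paul Graham's "A Plan for Spam": the product of the word probabilities divided by that product plus the product of their complements. Below is a minimal sketch of that formula, not the code from this thread. The names are hypothetical, and two details are assumptions added here: the sum is done in log space so that long texts do not underflow to zero, and each probability is clamped to the range 0.01 to 0.99 so that no single word decides the result.

```java
// Hypothetical helper: combines per-word probabilities into one score.
class CombineProbabilities {
    // Computes prod(p) / (prod(p) + prod(1 - p)) using sums of logarithms.
    static double combine(double[] wordProbs) {
        double logFor = 0.0;     // log of prod(p)
        double logAgainst = 0.0; // log of prod(1 - p)
        for (double p : wordProbs) {
            // Assumed clamp: keeps log() finite and limits any single word's influence.
            double clamped = Math.min(0.99, Math.max(0.01, p));
            logFor += Math.log(clamped);
            logAgainst += Math.log(1.0 - clamped);
        }
        // prod(p) / (prod(p) + prod(1-p)) == 1 / (1 + exp(logAgainst - logFor))
        return 1.0 / (1.0 + Math.exp(logAgainst - logFor));
    }

    public static void main(String[] args) {
        System.out.println(combine(new double[] {0.9, 0.9})); // about 0.988
        System.out.println(combine(new double[] {0.9, 0.1})); // about 0.5
    }
}
```

Under this formula a value of 0.5 is neutral, so a default of 0.1 for unseen words counts as fairly strong evidence for the other class; with many unseen words that alone could dominate the result.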
karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10
I'm going to try with different data. Maybe Twain and Dickens aren't too different... Marking this resolved.
Thanks for all the help. Really appreciate it.
 