Coding a simple Bayes classifier in Java: need help with the algorithm

 
karthik raghunathan
Greenhorn
Posts: 13
Edit: deleted original post, removed links to GitHub, copied code here instead, corrected spelling.

My naive Bayes classifier started out as a spam filter and has now been recruited to classify whether a text is by Dickens or Twain.
First of all, is this the right forum for this question?

Second, it doesn't work very well. Can anyone help me correct the algorithm? I sort of copied some of it from the shiffman.net tutorial, which more or less uses the approach from Paul Graham's "A Plan for Spam".
PS: the code is not in OOP style; it's more or less procedural. Is this a problem?

 
Dave Trower
Ranch Hand
Posts: 86
I read an article on how spam filters work, but my experience is limited to that one article. You need to create a dictionary that, for each word used in either author's books, returns the probability that a text containing it is by Twain or by Dickens. I think you do this by counting the frequency of each word in each author's texts.
Then, when you are given a sample book, you look up the probability for each of its words in the dictionary and apply the Bayesian algorithm.
Let me know if this helps.
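
The counting step described above could be sketched roughly as follows. This is only an illustration, not the code from the original post: the class name `WordCounts`, the method `count`, and the tokenizing rule (lowercase, split on anything that is not a letter or an apostrophe) are all assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: builds a word -> occurrence count dictionary for one author's text.
class WordCounts {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenizer: lowercase, then split on runs of anything other than a-z or '.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // add 1, or start at 1 if unseen
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("The river was wide, and the raft was slow.");
        System.out.println(counts.get("the"));  // 2
        System.out.println(counts.get("raft")); // 1
    }
}
```

You would build one such map per author and then turn the counts into per-word probabilities.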
 
karthik raghunathan
Greenhorn
Posts: 13
Looks like I read the same article too. But I don't know what I am doing wrong in my code, because I don't know what output to expect.
 
Dave Trower
Ranch Hand
Posts: 86
I would suggest you Google the words "bayesian for spam". I do not have the original article, but this is a good one:
webpage

Here is a quote from the article:
This word probability is calculated as follows: If the word “mortgage” occurs in 400 of 3,000 spam mails and
in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000]
divided by [5/300 + 400/3000]).

So now, in the dictionary, the probability for the word "mortgage" is 0.8889.
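
As a quick check of that arithmetic, here is a minimal Java sketch of the per-word calculation from the quote. The class and method names are made up for illustration, and it assumes at least one of the two counts is non-zero (otherwise the division yields NaN).

```java
import java.util.Locale;

// Hypothetical helper reproducing the quoted formula:
// (spamWithWord/spamTotal) / (hamWithWord/hamTotal + spamWithWord/spamTotal)
class WordProbability {
    static double spamProbability(int spamWithWord, int spamTotal, int hamWithWord, int hamTotal) {
        double rSpam = (double) spamWithWord / spamTotal; // fraction of spam mails containing the word
        double rHam = (double) hamWithWord / hamTotal;    // fraction of legitimate mails containing it
        return rSpam / (rSpam + rHam);                    // NaN if the word was never seen at all
    }

    public static void main(String[] args) {
        double p = spamProbability(400, 3000, 5, 300);
        System.out.println(String.format(Locale.ROOT, "%.4f", p)); // 0.8889
    }
}
```

For the Twain/Dickens case, the same function would take the counts for one author in place of "spam" and the other in place of "ham".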

So if the word "mortgage" is used in an e-mail, there is an 88.89% chance the e-mail is spam. However, the Bayesian filter looks at all the words in an e-mail, so the total probability of an e-mail being spam changes based on the other words.
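
One common way to combine the per-word values, and the formula usually attributed to Paul Graham's "A Plan for Spam", is prod(p) / (prod(p) + prod(1 - p)). The sketch below is an assumption about how that could look in Java, not the article's or the original poster's code. It works in log space, because multiplying thousands of small numbers (a whole book's worth of words) would underflow to 0. It assumes every probability is strictly between 0 and 1.

```java
// Hypothetical helper: combines per-word probabilities into one overall probability.
class CombineProbabilities {
    static double combine(double[] wordProbs) {
        double logSpam = 0.0; // log of prod(p)
        double logHam = 0.0;  // log of prod(1 - p)
        for (double p : wordProbs) {
            logSpam += Math.log(p);
            logHam += Math.log(1.0 - p);
        }
        // prod(p) / (prod(p) + prod(1 - p)) rewritten as 1 / (1 + exp(logHam - logSpam))
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }

    public static void main(String[] args) {
        // 0.72 / (0.72 + 0.02), roughly 0.973: two spammy words reinforce each other
        System.out.println(combine(new double[] {0.9, 0.8}));
        // roughly 0.5: one spammy word and one equally hammy word cancel out
        System.out.println(combine(new double[] {0.9, 0.1}));
    }
}
```

A result near 0.5 for every input would suggest the individual word probabilities carry little signal, which is one thing to check when the classifier "doesn't work very well".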

In your case, you build a dictionary based on how often each word appears in each of the two authors' works.
I think the output of your program should be something like:
There is a 99.3% chance the book I just looked at is by Twain.
 
karthik raghunathan
Greenhorn
Posts: 13
Okay, that is pretty much what I am doing, except for the final part, where I combine the probabilities for each of the words.
I am making two dictionaries, one for spam and one for ham, then I calculate the spam probability of a word as rSpam / (rSpam + rHam).
So I _am_ on the right track. Let me see if I am doing something wrong in combining the probabilities...
I also return a default of 0.1 if the probability is 0. That might be a problem.
I'll update in a day.
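
On the default of 0.1: a probability of exactly 1 (a word seen only in spam) breaks the combining step just as badly as a probability of 0, so a one-sided default may not be enough. One possible alternative is to clamp every word probability into a fixed range and to treat words seen in neither dictionary as neutral. The sketch below is only an illustration; the class name and the bounds 0.01, 0.99 and 0.5 are assumed values, not something taken from this thread.

```java
// Hypothetical helper: keeps rSpam / (rSpam + rHam) away from exactly 0 or 1.
class SafeProbability {
    static final double LOW = 0.01;    // assumed floor for words seen only in ham
    static final double HIGH = 0.99;   // assumed ceiling for words seen only in spam
    static final double NEUTRAL = 0.5; // assumed value for words seen in neither dictionary

    static double safe(double rSpam, double rHam) {
        if (rSpam + rHam == 0.0) {
            return NEUTRAL; // avoids 0/0 = NaN for unknown words
        }
        double p = rSpam / (rSpam + rHam);
        return Math.max(LOW, Math.min(HIGH, p)); // clamp into [LOW, HIGH]
    }

    public static void main(String[] args) {
        System.out.println(safe(0.0, 0.2)); // 0.01
        System.out.println(safe(0.3, 0.0)); // 0.99
        System.out.println(safe(0.0, 0.0)); // 0.5
    }
}
```

Whether symmetric bounds like these work better than a flat 0.1 depends on the data, so it is worth trying both.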
 
karthik raghunathan
Greenhorn
Posts: 13
I'm going to try with different data. Maybe Twain and Dickens aren't too different... Marking this resolved.
Thanks for all the help. I really appreciate it.
 