coding simple bayes classifier in java . need help with the algo
|
karthik raghunathan
Greenhorn
Joined: Nov 12, 2011
Posts: 7
|
|
Edit: deleted original post, removed links to GitHub, copied code here instead. Corrected spelling.
My naive Bayes classifier started out as a spam filter and has now been recruited to classify whether a text is by Dickens or Twain.
First of all, would this be the right forum to ask this question?
Second, it doesn't work very well. Can anyone help me correct the algo? I sorta copied some of it from the shiffman.net tutorial, which sorta uses the approach from Paul Graham's "A Plan for Spam".
PS: the code is not in OOP style, it's more or less procedural. Is this a problem?
|
 |
Dave Trower
Ranch Hand
Joined: Feb 12, 2003
Posts: 78
|
|
I read an article on how spam filters work, but my experience is limited to that one article. You need to create a dictionary that, for each word used in any of the books, returns the probability that a book containing it is by Twain rather than Dickens. I think you do this by counting the frequency of each word in each author's books.
Then, when you are given a sample book, you look up the probability for each of its words in the dictionary and apply the Bayesian algorithm to combine them.
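The counting step described above could be sketched roughly like this. The class name `WordCounts`, the crude tokenizer, and the sample sentence are all made up for illustration; they are not taken from the original poster's code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordCounts {
    // Counts how often each lower-cased word occurs in a text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Crude tokenizer (an assumption): split on anything that is
        // not a lower-case ASCII letter or an apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample text, not a quotation from either author.
        Map<String, Integer> sample = count("The river was wide, and the raft was slow.");
        System.out.println(sample.get("the"));   // prints 2
        System.out.println(sample.get("raft"));  // prints 1
    }
}
```

You would build one such map per author, and those two maps are the raw material for the per-word probabilities.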
Let me know if this helps.
|
 |
karthik raghunathan
Greenhorn
Joined: Nov 12, 2011
Posts: 7
|
|
|
Looks like I read the same article too. Except I don't know what I am doing wrong in my code, 'cos I don't know what output to expect.
|
 |
Dave Trower
Ranch Hand
Joined: Feb 12, 2003
Posts: 78
|
|
I would suggest you google the words "bayesian for spam". I do not have the original article but this is a good one:
webpage
Here is a quote from the article:
This word probability is calculated as follows: If the word “mortgage” occurs in 400 of 3,000 spam mails and
in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000]
divided by [5/300 + 400/3000]).
So now in the dictionary, the probability for the word "mortgage" is 0.8889.
So if the word "mortgage" is used in an e-mail, then based on that word alone there is an 88.89% chance the e-mail is spam. However, the Bayesian filter looks at all the words in an e-mail, so the total probability of an e-mail being spam changes based on the other words.
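The arithmetic in the quote can be checked with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and it assumes both totals are non-zero and that the word occurs in at least one of the two collections (otherwise the division yields NaN).

```java
import java.util.Locale;

final class WordProbability {
    // Per-word probability as in the quoted article:
    // rateA / (rateA + rateB), where each rate is count / total.
    static double probability(int countInA, int totalA, int countInB, int totalB) {
        double rateA = (double) countInA / totalA;
        double rateB = (double) countInB / totalB;
        return rateA / (rateA + rateB);
    }

    public static void main(String[] args) {
        // "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails.
        double p = probability(400, 3000, 5, 300);
        System.out.println(String.format(Locale.ROOT, "%.4f", p)); // prints 0.8889
    }
}
```

For the Twain/Dickens case, the same method would take the word's count and the total count for each author instead of spam and legitimate mail.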
In your case, you build a dictionary based on how often a word appears in each of the two authors' works.
I think the output of your program should be something like:
There is a 99.3% chance the book I just looked at is Twain.
|
 |
karthik raghunathan
Greenhorn
Joined: Nov 12, 2011
Posts: 7
|
|
Okay, that is pretty much what I am doing, except for the final part, where I combine the probabilities for each of the words.
I am making two dictionaries, one for spam and one for ham, then I calculate the spam probability of a word as rSpam / (rSpam + rHam).
So I _am_ on the right track. Let me see if I am doing something wrong in the combining of probabilities.
I also return a default of 0.1 if the probability is 0. That might be a downer.
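For reference, the combining step in Paul Graham's "A Plan for Spam" multiplies the per-word values: prod(p) / (prod(p) + prod(1 - p)). Below is a minimal sketch of that formula, computed in log space so that long texts do not underflow to zero. The class name and the clamp bounds of 0.01 and 0.99 are assumptions for illustration, not the code discussed in this thread.

```java
final class CombineProbabilities {
    // Assumed clamp bounds so that no single word forces a hard 0 or 1.
    static final double MIN_P = 0.01;
    static final double MAX_P = 0.99;

    // Equivalent to prod(p) / (prod(p) + prod(1 - p)),
    // evaluated as a sum of log-odds to avoid underflow.
    static double combine(double[] wordProbabilities) {
        double logOdds = 0.0;
        for (double p : wordProbabilities) {
            double clamped = Math.max(MIN_P, Math.min(MAX_P, p));
            logOdds += Math.log(clamped) - Math.log(1.0 - clamped);
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    public static void main(String[] args) {
        // Two words that each lean 0.9 towards class A: 0.81 / (0.81 + 0.01) = 81/82.
        System.out.println(combine(new double[] {0.9, 0.9})); // about 0.988
        // Opposite evidence of equal strength cancels out.
        System.out.println(combine(new double[] {0.9, 0.1})); // about 0.5
    }
}
```

Note that any value below 0.5 counts as evidence for the other class, so a default of 0.1 for missing words pulls the result fairly strongly in one direction; that may be worth checking when the output looks off.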
I'll update in a day
|
 |
karthik raghunathan
Greenhorn
Joined: Nov 12, 2011
Posts: 7
|
|
I'm going to try with different data. Maybe Twain and Dickens aren't too different... Marking this resolved.
Thanks for all the help. I really appreciate it.
|