Edit: deleted original post, removed links to GitHub, copied code here instead. Corrected spelling.

My naive Bayes classifier started out as a spam filter and has now been recruited to decide whether a text is by Dickens or Twain.
First of all, is this the right forum for this question?

Second, it doesn't work very well. Can anyone help me correct the algorithm? I loosely copied parts of it from the shiffman.net tutorial, which more or less follows the approach in Paul Graham's "A Plan for Spam".
PS: the code is not in OOP style; it's more or less procedural. Is this a problem?

I read an article on how spam filters work, but my experience is limited to that one article. You need to create a dictionary that, for each word used in either author's books, returns the probability that a text containing it is by Twain rather than Dickens. I think you do this by counting the frequency of each word.
Then, when you are given a sample book, you look up the probability of each of its words in the dictionary and apply the Bayesian algorithm.
Let me know if this helps.
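
To make the dictionary idea concrete, here is a minimal Python sketch. The function names (tokenize, build_dictionary) and the regex-based tokenizer are my own assumptions, not anything from the article or from the original code, and it assumes both training texts are non-empty.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_dictionary(twain_text, dickens_text):
    """Map each word to a rough probability that it points to Twain.

    Raw counts are converted to relative frequencies first, so a longer
    training text does not win simply by being longer.
    """
    twain = Counter(tokenize(twain_text))
    dickens = Counter(tokenize(dickens_text))
    twain_total = sum(twain.values())
    dickens_total = sum(dickens.values())

    dictionary = {}
    for word in set(twain) | set(dickens):
        f_twain = twain[word] / twain_total        # Counter returns 0 if absent
        f_dickens = dickens[word] / dickens_total
        dictionary[word] = f_twain / (f_twain + f_dickens)
    return dictionary


# Toy usage with made-up snippets instead of whole books.
d = build_dictionary("the river the raft", "the fog the court")
```

A word that is equally frequent in both texts ends up at 0.5, and a word seen in only one text ends up at 0 or 1, which is why some kind of default or clamping is usually needed before the probabilities are combined.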

karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10

Looks like I read the same article too, except I don't know what I am doing wrong in my code, because I don't know what output to expect.

Dave Trower
Ranch Hand

Joined: Feb 12, 2003
Posts: 86

I would suggest you google the words "bayesian for spam". I do not have the original article, but this is a good one:
webpage

Here is a quote from the article:
This word probability is calculated as follows: If the word “mortgage” occurs in 400 of 3,000 spam mails and in 5 out of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).

So now, in the dictionary, the probability for the word "mortgage" is 0.8889.
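
The arithmetic in that quote can be checked with a few lines of Python. The function name and parameter names are just illustrative choices of mine.

```python
def word_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one word from how often it appears in each class."""
    spam_rate = spam_hits / spam_total   # 400 / 3000 in the quoted example
    ham_rate = ham_hits / ham_total      # 5 / 300 in the quoted example
    return spam_rate / (spam_rate + ham_rate)


p = word_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```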

So if the word "mortgage" is used in an e-mail, there is an 88.89% chance the e-mail is spam, judging by that word alone. However, the Bayesian filter looks at all the words in an e-mail, so the total probability of the e-mail being spam changes based on the other words.
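
One common way to combine the per-word values is the product formula used in Paul Graham's "A Plan for Spam", which assumes the words are independent and that spam and ham are equally likely beforehand. A minimal sketch (the function name combine is my own):

```python
import math


def combine(probabilities):
    """Combine per-word spam probabilities into one overall probability.

    Computes prod(p) / (prod(p) + prod(1 - p)). This breaks down if the
    list contains both an exact 0 and an exact 1 (division by zero).
    """
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# A neutral word (0.5) leaves the result unchanged; a second
# spammy word pushes it higher.
print(combine([0.8889, 0.5]))
print(combine([0.8889, 0.9]))
```

For a whole book, multiplying thousands of values below 1 can underflow to zero, so for long texts it is probably safer to add up logarithms instead of multiplying directly.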

In your case, you build a dictionary based on how often each word appears in each of the two authors' works.
I think the output of your program should be something like:
There is a 99.3% chance the book I just looked at is Twain.

karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10

Okay, that is pretty much what I am doing, except for the final part, where I combine the probabilities for each of the words.
I am making two dictionaries, one for spam and one for ham, then I calculate the spam probability as rSpam / (rSpam + rHam).
So I _am_ on the right track. Let me see if I am doing something wrong in combining the probabilities...
I also return a default of 0.1 if the probability is 0. That might be hurting the results.
I'll update in a day.
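
For anyone following along, this is roughly what that approach looks like in Python. It is a sketch, not my actual code: the function names, the eps clamp and the log-odds combination are assumptions added for illustration. The log-odds version gives the same answer as prod(p) / (prod(p) + prod(1 - p)) but does not underflow on book-length input.

```python
import math


def word_probability(word, spam_counts, ham_counts,
                     spam_total, ham_total, default=0.1):
    """rSpam / (rSpam + rHam), falling back to `default` when that is 0
    or when the word was never seen in either dictionary."""
    r_spam = spam_counts.get(word, 0) / spam_total
    r_ham = ham_counts.get(word, 0) / ham_total
    if r_spam + r_ham == 0:
        return default
    p = r_spam / (r_spam + r_ham)
    return p if p > 0 else default


def combine_log(probabilities, eps=1e-6):
    """Combine per-word probabilities by summing log-odds.

    Each p is clamped to [eps, 1 - eps] (an assumption) so log() is defined.
    """
    log_odds = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1.0 - eps)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Note that a default of 0.1 counts every unseen word as fairly strong evidence for ham, so a sample with many unknown words gets pulled toward one class. A value closer to 0.5 may be worth trying.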

karthik raghunathan
Greenhorn

Joined: Nov 12, 2011
Posts: 10

I'm going to try with different data. Maybe Twain and Dickens aren't very different... Marking this resolved.
Thanks for all the help. I really appreciate it.