GeeCON Prague 2014*

doubt in automatic email classification using naive bayes algorithm

gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
I am planning to automatically classify emails as personal, business, etc. and store each one in the appropriate folder using the naive Bayes algorithm. Here the features are the keywords in the document and the classes are the folders. But I am stuck after that step. Please help me with how to apply the naive Bayes algorithm to my automatic mail classification application.
Oleg Tikhonov
Ranch Hand

Joined: Aug 02, 2008
Posts: 55
Hi,
Here the features are the keywords in the document and the classes are the folders. But I am stuck after that step.

Could you give a little more information about where you are stuck?

Oleg.

gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
I am confused about how to apply this algorithm to our application of automatic mail classification. Can you please tell me how to calculate the probability of a message belonging to a folder?
Oleg Tikhonov
Ranch Hand

Joined: Aug 02, 2008
Posts: 55
Event | Description
------+----------------------------------------
A     | the mail belongs to folder F_1
B     | the mail belongs to folder F_2
C     | the mail has been classified before
P     | the mail will be classified to F_2

Let’s assume that a mail belongs to folder F_1, also belongs to folder F_2, and has
been classified before. We want to predict the probability that the mail will be classified to F_2:
Pr{P=T|A=T,B=T,C=T}=Pr{A=T,B=T,C=T|P=T}Pr{P=T}/Pr{A=T,B=T,C=T}
Pr{P=F|A=T,B=T,C=T}=Pr{A=T,B=T,C=T|P=F}Pr{P=F}/Pr{A=T,B=T,C=T}

One of the easiest ways to estimate an event’s probability is to take its relative frequency.
For example, suppose 20 events were observed in total: event A happened 5 times, event B 12 times, and event C 3 times.
Pr{A}=5/20; Pr{B}=12/20; Pr{C}=3/20.

Pr{A or B} = Pr{A} + Pr{B} - Pr{A and B}
Pr{A and B} = Pr{A}Pr{B|A} = Pr{B}Pr{A|B} (the product rule, from which Bayes' rule follows)
The output attribute P can be either T (true) or F (false).
Something like that.
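
As a rough illustration of the frequency-count idea, here is a minimal Java sketch. The counts 5, 12, 3 and 20 come from the post above; the joint count for "A and B" is not given anywhere in this thread, so the value 3 used below is purely an assumption for the demo, and the class and method names are made up for this example.

```java
// Sketch only: estimates probabilities from frequency counts.
class FrequencyEstimates {

    /** Relative frequency: how often an event occurred out of all observations. */
    static double relativeFrequency(int eventCount, int totalCount) {
        if (totalCount <= 0) {
            throw new IllegalArgumentException("totalCount must be positive");
        }
        return (double) eventCount / totalCount;
    }

    /** Conditional probability Pr{B|A} = Pr{A and B} / Pr{A}. */
    static double conditional(double prAandB, double prA) {
        if (prA == 0.0) {
            throw new IllegalArgumentException("cannot condition on a zero-probability event");
        }
        return prAandB / prA;
    }

    public static void main(String[] args) {
        int total = 20;
        double prA = relativeFrequency(5, total);
        double prB = relativeFrequency(12, total);
        double prC = relativeFrequency(3, total);

        // ASSUMPTION: the thread gives no joint count for "A and B"; 3 is a made-up value.
        double prAandB = relativeFrequency(3, total);

        double prBgivenA = conditional(prAandB, prA);
        double prAgivenB = conditional(prAandB, prB);

        System.out.println("Pr{A} = " + prA + ", Pr{B} = " + prB + ", Pr{C} = " + prC);
        // Both products should reproduce Pr{A and B} (the product rule).
        System.out.println("Pr{A}Pr{B|A} = " + (prA * prBgivenA));
        System.out.println("Pr{B}Pr{A|B} = " + (prB * prAgivenB));
    }
}
```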
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

It's still not clear to me where you're stuck: implementing the algorithm itself? Determining how to use the results?
gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
I must find the keywords in the incoming mail and determine which folder is suitable for the mail. I am stuck at applying the naive Bayes algorithm to this problem.

For example, suppose I find these keywords in the mail:

microsoft offers windows

and suppose there are two folders, personal and technology.

How could I then apply the naive Bayes algorithm to classify the mail with the keywords "microsoft offers windows" into the appropriate folder?

Sorry for not explaining my problem in detail before.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Have you already generated your score(s)? If you have, shouldn't it just be a matter of picking your cut-off point?
gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
According to my example:

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

Have I arrived at the correct step?

If so, what would be the probability of each folder and the probability of each word given a folder?
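
For reference, one common way to fill in these quantities is to take p(folder) as the fraction of already-filed mails in that folder, and p(word | folder) as the word's count divided by the total keyword count in that folder. The Java sketch below does that and multiplies the factors as in the two expressions above. All counts in `main` are invented for illustration (they are not from this thread), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: unsmoothed naive Bayes score, prior * product of word likelihoods.
class NaiveBayesScore {

    /** p(folder): fraction of training mails filed in this folder. */
    static double prior(int mailsInFolder, int totalMails) {
        return (double) mailsInFolder / totalMails;
    }

    /** p(word | folder): count of the word divided by all keyword occurrences in the folder. */
    static double likelihood(Map<String, Integer> wordCounts, String word) {
        int total = 0;
        for (int count : wordCounts.values()) {
            total += count;
        }
        if (total == 0) {
            return 0.0;
        }
        return (double) wordCounts.getOrDefault(word, 0) / total;
    }

    /** p(folder) * p(w1 | folder) * p(w2 | folder) * ... */
    static double score(double prior, Map<String, Integer> wordCounts, List<String> keywords) {
        double result = prior;
        for (String word : keywords) {
            result *= likelihood(wordCounts, word);
        }
        return result;
    }

    public static void main(String[] args) {
        // ASSUMPTION: made-up keyword counts, 10 keyword occurrences per folder.
        Map<String, Integer> technology = new HashMap<>();
        technology.put("microsoft", 3);
        technology.put("offers", 1);
        technology.put("windows", 4);
        technology.put("iphone", 2);

        Map<String, Integer> personal = new HashMap<>();
        personal.put("microsoft", 1);
        personal.put("offers", 2);
        personal.put("windows", 1);
        personal.put("home", 6);

        List<String> mail = Arrays.asList("microsoft", "offers", "windows");

        // ASSUMPTION: 10 of 20 training mails in each folder, so both priors are 0.5.
        double techScore = score(prior(10, 20), technology, mail);
        double personalScore = score(prior(10, 20), personal, mail);

        System.out.println("technology score = " + techScore);
        System.out.println("personal score   = " + personalScore);
        System.out.println("chosen folder: " + (techScore >= personalScore ? "technology" : "personal"));
    }
}
```

The scores are only compared with each other, so the shared denominator from Bayes' rule can be left out. Note that any keyword with a zero count in a folder makes that folder's whole product zero.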
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

So you've arrived at a spam probability, right? What's left to do?
gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
I have a doubt here.

For example, suppose I have already stored the mails with the keywords microsoft, windows, iphone, itunes, ipod in the technology folder and the mails with the keywords market, home, school, college in the personal folder.

Then

p(personal) = 0.5

p(technology) = 0.5

p(technology) * p(microsoft | technology) * p(offers | technology) * p(windows | technology)

0.5 * 1/5 * 0 * 1/5 = 0

p(personal) * p(microsoft | personal) * p(offers | personal) * p(windows | personal)

0.5 * 0 * 0 * 0 = 0

The mail with the keywords "microsoft offers windows" should be classified into the technology folder, but the probability turns out to be zero for both folders,
so I don't know whether I am on the right path.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

I'd say no, if it's not giving you the result you want--you might need to tweak your algorithm a bit.
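
One common tweak for the zero-probability problem described above is add-one (Laplace) smoothing: add 1 to every word count and add the vocabulary size to the denominator, so an unseen word such as "offers" gets a small non-zero probability instead of wiping out the whole product. The Java sketch below is one possible way to do it, not the only one; it also sums logarithms instead of multiplying, which avoids underflow on long mails. The class and method names are hypothetical, and treating each folder's keyword list as a single training mail is an assumption made to reproduce the 0.5 priors from the post.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: naive Bayes with add-one (Laplace) smoothing and log scores.
class SmoothedNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> mailCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalMails = 0;

    /** Records one training mail (as a keyword list) filed under the given folder. */
    void train(String folder, List<String> keywords) {
        mailCounts.merge(folder, 1, Integer::sum);
        totalMails++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(folder, f -> new HashMap<>());
        for (String word : keywords) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(folder, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** log p(folder) + sum over keywords of log p(word | folder), with add-one smoothing. */
    double logScore(String folder, List<String> keywords) {
        if (!mailCounts.containsKey(folder)) {
            throw new IllegalArgumentException("unknown folder: " + folder);
        }
        double score = Math.log((double) mailCounts.get(folder) / totalMails);
        Map<String, Integer> counts = wordCounts.get(folder);
        int total = totalWords.getOrDefault(folder, 0);
        for (String word : keywords) {
            int count = counts.getOrDefault(word, 0);
            // Smoothed estimate: never zero, even for words not seen in this folder.
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the folder with the highest log score, or null if nothing was trained. */
    String classify(List<String> keywords) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String folder : mailCounts.keySet()) {
            double score = logScore(folder, keywords);
            if (score > bestScore) {
                bestScore = score;
                best = folder;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SmoothedNaiveBayes classifier = new SmoothedNaiveBayes();
        // Keywords from the post; one training mail per folder is an assumption.
        classifier.train("technology", Arrays.asList("microsoft", "windows", "iphone", "itunes", "ipod"));
        classifier.train("personal", Arrays.asList("market", "home", "school", "college"));

        List<String> mail = Arrays.asList("microsoft", "offers", "windows");
        System.out.println("technology log score = " + classifier.logScore("technology", mail));
        System.out.println("personal log score   = " + classifier.logScore("personal", mail));
        System.out.println("chosen folder: " + classifier.classify(mail));
    }
}
```

With smoothing, "offers" no longer forces both scores to zero, so the two matching keywords "microsoft" and "windows" are what separate the folders.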
gayathri murugesan
Ranch Hand

Joined: Dec 21, 2009
Posts: 32
Thanks a lot for clearing my doubts.
 