How to automatically classify data in a large database

Herve Chubaka Bagalwa
Greenhorn

Joined: May 25, 2011
Posts: 5
Hello,

I am supposed to take data from a Wikipedia dump, a Freebase dump, or DBpedia.
I am then supposed to write code that outputs what every datum in that database is, e.g. the name of a person or a business, an address, and so on. It does not matter what language I write the code in, but I'm only familiar with C, C++, Java and Python. Java is my preferred language.

Those databases have all types of data: title, person name, address, social security, phone...

I have three questions:

1) Since I have used machine learning a lot, I have decided to use a machine learning approach.
I have started looking into WEKA, a Java machine learning toolbox. However, it is only available under a GPL license. Is there another toolbox that I can use in a commercial product?

2) The problem I am facing with a machine learning approach is that I don't know what features to use. All I can think of right now is: the length of the datum, the number of letters it has, and the number of digits it has.
This is very little given all the types of data those databases have. Regular expressions do not seem to be a solution for this type of project.

3) Is there another approach I can use? I mean, is machine learning the only approach?
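
The three features listed in question 2 can be sketched in Java as follows. This is only a minimal illustration: the class name `DatumFeatures` is hypothetical, and it assumes "letters" and "digits" mean whatever `Character.isLetter` and `Character.isDigit` accept.

```java
// Hypothetical helper: computes the three surface features named in question 2.
final class DatumFeatures {
    final int length;   // total number of characters in the datum
    final int letters;  // characters accepted by Character.isLetter
    final int digits;   // characters accepted by Character.isDigit

    DatumFeatures(int length, int letters, int digits) {
        this.length = length;
        this.letters = letters;
        this.digits = digits;
    }

    // Scans the datum once and counts letters and digits.
    static DatumFeatures of(String datum) {
        int letters = 0;
        int digits = 0;
        for (int i = 0; i < datum.length(); i++) {
            char c = datum.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
            } else if (Character.isDigit(c)) {
                digits++;
            }
        }
        return new DatumFeatures(datum.length(), letters, digits);
    }

    public static void main(String[] args) {
        DatumFeatures f = of("33, abcd avenue, New York");
        System.out.println(f.length + " " + f.letters + " " + f.digits);
    }
}
```

A feature vector like this could be fed to whichever learning toolbox is chosen; richer features (punctuation counts, token counts) would be added the same way.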

Thank you for your help.

Regards,

Herve
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
    
If I understand your problem, you do not have to start from zero. The Moby Words project has sub lists of such things as:


I used these lists in a phonetic matching project originally created to help a legal service catch multiple spellings of the same name.

Automatic classification of text is a HUGE area right now so if you can master even a small part of the subject you will have a valuable skill.

Bill

Herve Chubaka Bagalwa
Greenhorn

Joined: May 25, 2011
Posts: 5
Thank you for your reply. I was also thinking of using the structure of the data. The data I will be using is extracted from XML pages, so it is organized in a certain way.
How do you think I can take advantage of that?

Thank you for your help,

Regards,

Herve
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
    

I'm not sure what you mean by existing structure, but surely any regularity in the structure of the data could guide your classification process.

You might look at this open source project.

Bill
Herve Chubaka Bagalwa
Greenhorn

Joined: May 25, 2011
Posts: 5
Thanks for the link. I'll try and see if I can get something from it.

By the way, by structured data, I mean data that looks like the sample below:

**************************************************************************
name id beer_style first_brewed alcohol_content original_gravity final_gravity ibu_scale country brewery_brand color_srm from_region containers
Bramling /m/0cttpqn 4.0 Buntingford Brewery
Dark Star Hophead Extra /m/0dl8hjd 5.8 Dark Star
Brewers Gold /m/0cttps0 Bitter 4.0 Crouch Vale Brewery
Wem Brewing Company Cascade Bitter /m/0dlfhn2 Wem Brewing Company
Friedrich Dull Krautheimer Urtyp Dunkel /m/04dqd7m 5.4 Friedrich Dull Germany /m/04dr2qr
Nethergate Umbel Ale Coriander Beer /m/04dqf7b 3.8 Nethergate brewery United Kingdom /m/04dr00w
Skinner's Cornish Gold /m/04dqmzy 5.1 Skinner's Brewery United Kingdom /m/04dr1kh
Brouwerij Martens Damburger Export /m/04dqrz0 5.1 Brouwerij Martens Belgium /m/04dr6cb
Concord Brewers Rapscallion Premier /m/04dqt4r 6.75 Concord Brewers United States of America /m/04dr2fm
Federation High Level Strong Brown Ale /m/04dqhp1 4.5 Federation United Kingdom /m/04dr2n_
Chiltern Brewery Glad Tidings Spiced Milk Stout /m/04dqv4g 4.6 Chiltern Brewery United Kingdom /m/04dr57q
Huisbrouwerij Klein Duimpje Hillegoms Tarwe Bier /m/04dqhd9 5.0 Huisbrouwerij Klein Duimpje Netherlands /m/04dqztr
Wickwar Infernal Brew /m/04dqfy6 4.8 Wickwar United Kingdom /m/04dq_9r
Schöfferhofer Hefeweizen /m/04dqtxc 5.0 Schöfferhofer Germany /m/04dr0zh
Woodforde's Nelson's Revenge /m/04dqqg3 4.5 Woodforde’s Brewery United Kingdom /m/04dr6c6
Ridgeway Santa's Butt Winter Porter /m/04dqlpv 6.0 Ridgeway United Kingdom /m/04dqzdh
De Proefbrouwerij Kapel van Viven blond /m/04dqjhh 6.8 De Proefbrouwerij Belgium /m/04dr28r
Ventnor Wight Spirit /m/04dqkb9 5.0 Ventnor United Kingdom /m/04dr6h2
Wye Valley Brewery O'er The Sticks /m/04dqkxg 4.5 Wye Valley Brewery United Kingdom /m/04dr6w2
Cannery Blackberry Porter /m/04dqbfw 8.0 Cannery Canada /m/04dr4s5
Maclay Thistle MacKinnon's Curse (Asda) /m/04dqfzr 4.1 Maclay Thistle United Kingdom /m/04dr58z
Alcazar (Sherwood Forest Brewery Co) Maiden's Magic /m/04dq9h7 5.0 Alcazar (Sherwood Forest Brewing Co) United Kingdom /m/04dq_hm
Molson Stock Ale /m/04dqqkm 5.0 Molson Canada
Hirter Privat Pils /m/04dql0d 5.2 Hirter Austria /m/04dr2tr
Lodzkie (subsidiary of Kaltenberg) Glob Premium /m/04dqd05 5.5 Lodzkie (subsidiary of Kaltenberg) Poland /m/04dqxcr
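
One way to take advantage of a layout like this, sketched under the assumption that the fields are tab-separated and that empty columns are kept as empty fields (the separators do not survive in the listing above): pair each value with the column name from the header row. The class name `HeaderLabeler` and the shortened header used in `main` are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: labels each field of a row with its column name.
final class HeaderLabeler {

    // Splits both lines on tabs (keeping empty fields) and pairs
    // every non-empty value with the header name at the same position.
    static Map<String, String> label(String headerLine, String rowLine) {
        String[] names = headerLine.split("\t", -1);
        String[] values = rowLine.split("\t", -1);
        Map<String, String> labeled = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            if (!values[i].isEmpty()) {
                labeled.put(names[i], values[i]);
            }
        }
        return labeled;
    }

    public static void main(String[] args) {
        // Shortened header for illustration; the real file has more columns.
        String header = "name\tid\tbeer_style\talcohol_content";
        String row = "Brewers Gold\t/m/0cttps0\tBitter\t4.0";
        System.out.println(label(header, row));
    }
}
```

When a header row exists, it already acts as a label for every value in its column, so a learned classifier is only needed for columns whose names are missing or uninformative.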
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18992
    

Okay, so let's take that as an example. It contains the word "Molson". What is your classification program supposed to do with that word?
Herve Chubaka Bagalwa
Greenhorn

Joined: May 25, 2011
Posts: 5
This is only part of the project. I think I should explain the whole project.

We developed a system that crawls the internet for similar types of websites. Before crawling, the user must indicate what data he/she is interested in, e.g. name, phone number, and so on. The user also gives an example of each, i.e. if the user wants addresses as output, he/she can give an example like: 33, abcd avenue, New York.
You can look at Needlebase. We are building something very similar.

Our system takes HTML pages and converts them to XML so that they can be queried. The output of our system is a set of files that look like the one in my previous post. That is what I meant by structured data.

So what I have is the list of examples the user gave and the structured data obtained from the HTML pages (which were converted to XML pages).

So if I have the file above, my system is supposed to know what the "user-defined entities" are in the file, i.e. the system has to know that "Molson" is the name of a beer, or that "33, abcd avenue, New York" is an address.

I was thinking of using a machine learning approach (I'm open to other suggestions). In this approach, I would like to use data from a Wikipedia dump or a Freebase dump as training and testing sets. After that I can use the trained algorithm in my system.
As I said before, this is all I can think of right now.
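
To make the idea of matching values against user-supplied examples concrete, here is a rough sketch. It is not the system described above: it swaps in a plain 1-nearest-neighbour comparison over three surface features (length, letter count, digit count), and the class name `NearestExample` and the example values are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical 1-nearest-neighbour sketch over three surface features.
final class NearestExample {

    // Returns {length, letter count, digit count} for a value.
    static int[] features(String s) {
        int letters = 0;
        int digits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
            } else if (Character.isDigit(c)) {
                digits++;
            }
        }
        return new int[] {s.length(), letters, digits};
    }

    // Picks the label whose example has the smallest squared Euclidean
    // distance to the value in feature space; returns null if no examples.
    static String classify(String value, Map<String, String> exampleByLabel) {
        int[] v = features(value);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, String> e : exampleByLabel.entrySet()) {
            int[] x = features(e.getValue());
            long distance = 0;
            for (int i = 0; i < v.length; i++) {
                long diff = v[i] - x[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> examples = new LinkedHashMap<>();
        examples.put("address", "33, abcd avenue, New York"); // example from the post
        examples.put("alcohol_content", "5.0");               // assumed example
        System.out.println(classify("12, main street, Boston", examples));
    }
}
```

With a single example per label this is very crude; it cannot tell a beer name from a brewery name, which is where training data from a dump or extra features would have to come in.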

The next step will be to put everything in a database, so that everything looks pretty to the user.

Thanks for asking questions and being interested in my application.

Waiting for your reply,

Regards,

Herve
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12835
    
Wow, what an ambitious project. Of course, this sort of "data mining" is a really hot topic now, so lots of people are playing in the field. Cross reference with the "Semantic Web" and you can see the size of the effort.

There are many, many approaches to "machine learning" and lots of Java projects. Personally, I have always admired the "genetic algorithm" approach.

Seems to me a key question is: exactly what sort of user input are you expecting to work with?

That makes a big difference in your approach. Will the user select from categories (which might match one of your existing "trained" algorithms) or do you expect to start with a blank slate each time?

If trained or partially trained algorithms can be saved/restored/refined you get a continuously improving system.
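
As a minimal sketch of that save/restore/refine cycle, assuming the learned state can be held in a serializable object, standard Java serialization is enough. The class `SavedModel` and its "one example per category" state are hypothetical stand-ins for whatever a real learner would store.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical model whose state survives between runs.
final class SavedModel implements Serializable {
    private static final long serialVersionUID = 1L;

    // Stand-in for learned state: one stored example per category.
    final HashMap<String, String> exampleByCategory = new HashMap<>();

    // "Refining" here just means adding or replacing a category example.
    void refine(String category, String example) {
        exampleByCategory.put(category, example);
    }

    // Writes the whole model to a file.
    void save(Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(this);
        }
    }

    // Reads a previously saved model back from a file.
    static SavedModel restore(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (SavedModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("model", ".bin");
        SavedModel model = new SavedModel();
        model.refine("address", "33, abcd avenue, New York");
        model.save(file);

        SavedModel restored = SavedModel.restore(file);
        restored.refine("phone", "555-0100"); // keep refining after a restore
        restored.save(file);
        System.out.println(restored.exampleByCategory.keySet());
        Files.deleteIfExists(file);
    }
}
```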

Getting loads of references to open-source projects such as Mahout is easy, and you can get overwhelmed with options!

Bill

 