JavaRanch » Java Forums » Products » Other Open Source Projects

Mahout data access future

Robert-Zsolt Kabai

Joined: Aug 16, 2011
Posts: 3


I'm wondering how data access and interoperability will evolve in the near future. As the authors, you may have some information, a vision, or an opinion about that.
While the current data access methods are fine, the only way for Mahout algorithms to use data from other Hadoop projects that implement some form of table storage (Hive, Cassandra, HBase) is a series of extractions and transformations, which is quite painful because multiple HDFS writes are needed. That's because I can't simply tell Mahout to use one of these tables. First we extract the data from Hive/Cassandra/HBase and write it to a CSV file on HDFS, then we convert that CSV data to the vector format that Mahout algorithms can consume. That is a lot of I/O work, which costs a lot of time and resources.
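To make the conversion step concrete, here is a minimal sketch of the CSV-to-vector stage in plain Java. It deliberately avoids Mahout classes so it stands alone; in a real pipeline the double[] would be wrapped in a Mahout Vector inside a VectorWritable and written to a SequenceFile on HDFS.

```java
import java.util.Arrays;

// Sketch only: parse one CSV record of numeric fields into a dense vector.
// This is the "convert CSV to vector data" step described above, minus the
// Mahout and HDFS plumbing.
public class CsvToVector {

    static double[] toVector(String csvLine) {
        String[] fields = csvLine.split(",");
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            vector[i] = Double.parseDouble(fields[i].trim());
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = toVector("1.0, 2.5, 3.75");
        System.out.println(Arrays.toString(v)); // [1.0, 2.5, 3.75]
    }
}
```

Even this toy version shows why the pipeline hurts: every record is serialized to text, written to HDFS, read back, and re-parsed before Mahout ever sees a vector.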

Do you see these operations and the data flow between these tools evolving to be more effective? After all, we have data storage tools and data analytics tools (like Mahout), and the need for efficient data flow between them is obvious. I've seen a new incubator project, HCatalog, that tries to standardize table data to help interoperability. Do you think this may be the short- or long-term answer to the question?

Thank you for your answers.

Ted Dunning

Joined: Aug 16, 2011
Posts: 11
Remember that Mahout is an open source project and, as such, doesn't really have a roadmap. What does exist is a set of desires that contributors have. As the contributors feel a need for something, it happens.

This means that you guys can influence the future of Mahout quite heavily.

To your point, however, it is true that the clustering code is rather inflexible about input, as is the Naive Bayes classifier family. The recommender framework is much more flexible (Sean recently added a Cassandra interface with very little work, for instance). The SGD classifier family is built around in-memory APIs, which makes it pretty easy to interface with.
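A toy example of why an in-memory API is easy to feed from any store: below is a tiny logistic-regression-by-SGD trainer in plain Java. This is not Mahout's actual OnlineLogisticRegression class (whose API differs); it is only an illustration of the train-one-example-at-a-time shape, where the caller can pull records from Hive, Cassandra, HBase, or anywhere else without an intermediate HDFS dump.

```java
// Illustration only: a minimal online logistic-regression trainer.
// The key point is the train(target, features) call, which accepts one
// in-memory example at a time from any data source.
public class SgdSketch {
    private final double[] w;   // one weight per feature
    private final double rate;  // learning rate

    SgdSketch(int numFeatures, double rate) {
        this.w = new double[numFeatures];
        this.rate = rate;
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Probability that x belongs to class 1.
    double classify(double[] x) {
        double z = 0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One stochastic gradient step on a single (target, features) example.
    void train(int target, double[] x) {
        double p = classify(x);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * (target - p) * x[i];
        }
    }

    public static void main(String[] args) {
        SgdSketch model = new SgdSketch(2, 0.5);
        // Learn a trivial rule: class 1 iff the first feature is positive.
        for (int epoch = 0; epoch < 100; epoch++) {
            model.train(1, new double[]{1.0, 1.0});
            model.train(0, new double[]{-1.0, 1.0});
        }
        System.out.println(model.classify(new double[]{1.0, 1.0}) > 0.5);
    }
}
```

Because nothing here touches the filesystem, wiring such a classifier to a new data source is just a loop over that source's own iterator.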

The primary limitation right now on how the clustering and Naive Bayes systems accept data is that there is very little consensus on how that should work. Your input would be very helpful here.

Try emailing dev@mahout.apache.org and start a discussion around what you need.