Sean Owen

author
since Nov 08, 2004

Recent posts by Sean Owen

Real-world use cases include anything you can imagine involving clustering, classification, or collaborative filtering. You might cluster people in your customer database to discover demographic groups that behave alike. You might use classification to detect spam. You might use collaborative filtering to recommend products to users.

A lot of the project provides Hadoop jobs. They are intended to be stand-alone processes in their own right. You can certainly integrate the Hadoop jobs into your system, and reuse any of the code too. In that sense it's somewhere between a product and a library.
I think you need a good working knowledge of Java to use Mahout effectively, partly because you will probably want or need to look at the source code frequently. I would not describe it as "for beginners", but at the same time I don't think you need particularly advanced knowledge of Java to use it.
I'm not aware of any implementations in Mahout based on libsvm directly, no.
Most of it is based on Hadoop / MapReduce, yes. Not all of it is though, in particular a lot of the recommender code, which also has a significant non-distributed presence.
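
To illustrate that non-distributed recommender side, here is a minimal sketch of a user-based recommender built with the Taste classes roughly as they stood around Mahout 0.5. The class names are from memory, so check them against the javadocs for your version, and ratings.csv is just a placeholder for a file of userID,itemID,preference lines.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Each line of ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 1, computed entirely in-process -- no Hadoop involved
    List<RecommendedItem> recommendations = recommender.recommend(1, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item);
    }
  }
}

Everything in that sketch runs inside a single JVM; the distributed recommender jobs are separate Hadoop drivers.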

I don't think you can run Hadoop on GAE? Or at least I have not heard that you can, nor tried. I have personally run it on EC2. The book has a few pages on running Hadoop jobs on EC2; it's generally quite straightforward if you understand what's going on when you run it locally.
It depends on what you are doing.

For smaller-scale problems, Ted rightly always recommends R for playing around.
Weka is a quite established and mature library for machine learning.

For collaborative filtering I know of Cofi, CoFE, Duine, Vogoo, and more as potentially useful libraries -- depends on what language, environment and use case you have.
Agree, the book is not a course on machine learning techniques. It explains some of the basics along the way, but is much more about a particular tool (Mahout) and about applying it, than the theory. In that sense these are complementary things.
(Those are fairly different types of algorithms.) I don't know that you'll see many more different types of algorithms implemented soon, though that's bound to come over time. I sense that what you see now is what you'll see in a year, except that it will be more refined.

But that's somewhat different from the question of being ready to use. If you need algorithm X and it's not implemented, no, it's not ready for you to use. But if you need Y and you find Y in the project -- I would encourage you to try it. I think any of it is ready enough to try in production, and some bits have been quite battle-tested. It really depends on what you're using.
I really don't know, but I can take wild guesses.

Things are changing fast. Hadoop is an excellent tool for its purpose and is actually getting somewhat mature. It is not an ideal tool for machine learning algorithms. I would not be surprised if some of the other distributed computing frameworks that are emerging, which are designed for a bit more general purpose application, become more popular within a few years for stuff like this. But that's still a few years off at least.

And if that changes, I would not be surprised if Mahout (or another project) changes to reimplement on another framework.

For now I think Mahout has figured out its identity: clustering, classification, collaborative filtering on top of Hadoop. It implements a lot of stuff, and in my opinion has a fair bit of work to do to polish and document what's there. I do not anticipate big changes in what it does, but I do anticipate refinement.

There are no plans for a second edition of the book at this point, as it would be years away at least. The final version of the book is written for Mahout 0.5, which is recent as of a few months ago, and that should remain a useful guide for versions of Mahout for the next 1-2 years.
Gives better results than what? And "better" in the sense of "faster", or "more accurate"?

The clustering algorithms in Mahout are fairly standard algorithms, not some special approach. So I think they perform as well as any other implementation of these standard algorithms in terms of quality.

In terms of performance -- they are implemented on Hadoop. This means it is much easier to scale up to very large data sets, but means you incur a lot of Hadoop overhead. For small data sets, you could probably find a faster implementation that is all on one machine, maybe something written in R. For very large data sets, where you can't apply non-distributed tools, I imagine it's about as good as anything else freely available out there. Honestly I'm not aware of another distributed clustering package to compare to.
These are both general terms, and I don't think they are exclusive. One man's data mining is another man's machine learning. I personally use the terms interchangeably, but I could be wrong.
I don't think you need a machine learning background to understand and use the project -- but you might need a guide, like the book! It is, at the moment, a collection of interesting code, and a workshop for some valuable ideas, but is not well documented or explained. The book tries to help bridge that gap. (And the book only assumes you have some Java experience and standard math background, not deep machine learning experience.)

Some machine learning experience would probably help understand the ideas quicker, but what you really need to use the code is a willingness to read the source and experiment!
Not at all, it's a good question. I personally have always been involved with this purely out of personal interest. I was an engineer at Google by trade, but did not work directly on machine learning. So for me it is more of a hobby than a profession; I simply find it interesting.
Hi Michael,

I think this summarizes all the known (and public) uses of Mahout: https://cwiki.apache.org/MAHOUT/powered-by-mahout.html

As you suspect, many that I know of are not public, and some are small experimental projects.

Sean
This is Sean Owen -- yes, beginning in version 1.3 of the filter, you can include or exclude certain content types from compression.

For example, to only compress HTML and XML, you can configure the filter with this parameter in web.xml:

<init-param>
  <param-name>includeContentTypes</param-name>
  <param-value>text/html,text/xml</param-value>
</init-param>

Other content types will not be compressed. Alternatively, to compress everything except PDFs, you can use this parameter:

<init-param>
  <param-name>excludeContentTypes</param-name>
  <param-value>application/pdf</param-value>
</init-param>

You can't specify both at once though.
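
For context, a complete declaration in web.xml would look something like the following. The filter-name is arbitrary, and you should double-check the filter class name against the javadocs for the version you are using:

<filter>
  <filter-name>CompressingFilter</filter-name>
  <filter-class>com.planetj.servlet.filter.compression.CompressingFilter</filter-class>
  <init-param>
    <param-name>includeContentTypes</param-name>
    <param-value>text/html,text/xml</param-value>
  </init-param>
</filter>

<filter-mapping>
  <filter-name>CompressingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>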

You can find javadocs here, and the latest version for download here. Please post messages there if you have any other questions or problems. Thanks!