I have previously worked on developing data-mining applications, and one of the biggest problems we ran into was extracting trends and analyzing information in real time. The data we were working with changed rapidly (every couple of hours), so caching it for any longer than that was not really an option. With information available in real time from everywhere via Google, people expect quick results.
Does your book discuss techniques for real-time analysis? Also, do you use existing data-mining frameworks? (I believe I saw a WEKA jar in the source code, but I'm not sure their package is still free, as it is now part of the Pentaho project.) Another issue we've had is that some of these frameworks, such as GATE and WEKA, are very heavyweight and can involve a lot of memory overhead even when you use only a small subset of their features.
Great questions. Let me try to answer each of them:
Real-time analysis: One of the first things I do in the book -- Section 2.1 -- is to present the architecture for applying collective intelligence in real-world applications. The key to applying these techniques is to precompute as much as possible asynchronously, so that minimal computation is carried out while the user is waiting. It also helps to have an event-driven SOA architecture.
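To make the "precompute asynchronously" idea concrete, here is a minimal sketch (my own illustration, not code from the book): a background task periodically refreshes derived results into an in-memory cache, so the user-facing read path is just a cheap lookup. All class and method names here are hypothetical, and the recompute body is a placeholder for the actual expensive analysis.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PrecomputedTrends {
    // Precomputed results, safe for concurrent reads while being refreshed.
    private final Map<String, Double> trendScores = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(long period, TimeUnit unit) {
        // Recompute in the background at a rate matched to the data's churn
        // (e.g. every couple of hours for the scenario described above).
        scheduler.scheduleAtFixedRate(this::recompute, 0, period, unit);
    }

    private void recompute() {
        // The expensive mining/analysis step runs here, off the request path.
        // Placeholder value; a real implementation would derive actual scores.
        trendScores.put("exampleTopic", 1.0);
    }

    // Called while the user waits: a map lookup, no computation.
    public Double getScore(String topic) {
        return trendScores.get(topic);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```

The design choice here is that freshness is bounded by the refresh period rather than computed on demand, which trades a small staleness window for predictable, low request-time latency.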
One of the case studies I cover (Section 12.4.2) is how these techniques are being applied by Google News for personalization. They have a similar problem of high item churn and a large number of users. To quote a section from the book:
Google News is a good example of building a scalable recommendation system for large number of users (several million unique visitors in a month) and large number of items (several million new stories in a two month period) with constant item churn -- this is different from Amazon where the rate of item churn is much smaller.
Typically, the book presents the concepts (showing how the math works) by taking a simple example and working through the math, then a version of the algorithm is implemented in Java, and then I show how to use open-source APIs like WEKA, Lucene, Nutch, and JDM to solve the same problem. If you follow the principle of precomputing the information asynchronously, you should be able to solve the problem of some of the APIs being very heavyweight.
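As an illustration of the "work through the math in plain Java first" step described above, here is a small, self-contained sketch (mine, not taken from the book) of cosine similarity between two sparse term-frequency vectors -- a common building block in text analysis before reaching for a heavyweight framework:

```java
import java.util.Map;

public class CosineSimilarity {
    // Cosine similarity between two sparse term-weight vectors:
    // sim(a, b) = (a . b) / (|a| * |b|), in the range [0, 1] for
    // non-negative weights.
    public static double similarity(Map<String, Double> a,
                                    Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        double normA = 0.0, normB = 0.0;
        for (double v : a.values()) normA += v * v;
        for (double v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Once you understand a primitive like this in plain Java, swapping in a library implementation (or deciding the library's overhead isn't worth it) becomes a much more informed decision.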
Thanks for the reply. I'm looking forward to reading the book.
Pentaho is all open-source if I'm not mistaken. A lot of it was created by combining other open-source projects.
You are correct, Pentaho projects are open source. The WEKA project is licensed under the GPL, which makes it difficult to integrate into commercial applications (unless you want to release your source), but they do offer commercial licensing (which I believe is rather expensive).