Collective Intelligence - Real Time Analysis

Jeff Storey
Ranch Hand

Joined: Apr 07, 2007
Posts: 230
Hi Satnam,

I previously worked on data mining applications, and one of the biggest problems we ran into was extracting trends and analyzing information in real time. The data we were working with changed rapidly (every couple of hours), so caching it for any longer than that was not really an option. With information available in real time from everywhere through Google, people want quick results.

Does your book discuss techniques for real-time analysis? Also, do you use existing data mining frameworks? (I believe I saw a Weka jar in the source code, but I'm not sure their package is free anymore now that it is part of the Pentaho project.) Another issue we've had is that some of these frameworks, such as GATE and Weka, are very heavyweight, and using even a small subset of their features can involve a lot of memory overhead.

Thanks, looking forward to hearing back.

Jeff Storey
Satnam Alag

Joined: May 07, 2008
Posts: 26

Great questions. Let me try to answer each of them.

Real-time analysis:
One of the first things I do in the book -- Section 2.1 -- is to present the architecture for applying collective intelligence in real-world applications. The key to applying these techniques is to precompute as much as possible asynchronously, so that minimal computation is carried out while the user is waiting. It also helps to have an event-driven, service-oriented architecture (SOA).
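
To make the precompute-asynchronously idea concrete, here is a minimal sketch in plain Java. It is not taken from the book: the class `TrendCache`, its methods, and the tag-counting "analysis" are all hypothetical stand-ins for whatever expensive computation an application actually does. The point is only the split between a background refresh and a cheap lookup on the request path.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical illustration; none of these names come from the book.
class TrendCache {
    // Latest precomputed result, swapped atomically so readers never see a half-built map.
    private final AtomicReference<Map<String, Integer>> snapshot =
            new AtomicReference<>(Map.of());
    // Assumed data source: returns the tags seen since the last refresh window.
    private final Supplier<List<String>> source;

    TrendCache(Supplier<List<String>> source) {
        this.source = source;
    }

    // Expensive step, run off the request path. Here it is just a tag count.
    void refresh() {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : source.get()) {
            counts.merge(tag, 1, Integer::sum);
        }
        snapshot.set(Map.copyOf(counts));
    }

    // Cheap step, run while the user waits: a single map lookup.
    int countFor(String tag) {
        return snapshot.get().getOrDefault(tag, 0);
    }

    // Runs refresh() periodically on a background thread.
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(this::refresh, 0, period, unit);
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        TrendCache cache = new TrendCache(() -> List.of("java", "weka", "java"));
        ScheduledExecutorService pool = cache.start(1, TimeUnit.HOURS);
        Thread.sleep(200); // give the first background refresh time to finish
        System.out.println(cache.countFor("java"));
        pool.shutdownNow();
    }
}
```

With data that changes every couple of hours, the refresh period would be set to match, and a heavyweight library would only need to be loaded by the background job rather than by every request.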

One of the case studies I cover (Section 12.4.2) is how these techniques are being applied by Google News for personalization. They have a similar problem of high item churn and a large number of users. To quote a section from the book:

Google News is a good example of building a scalable recommendation system for large number of users (several million unique visitors in a month) and large number of items (several million new stories in a two month period) with constant item churn -- this is different from Amazon where the rate of item churn is much smaller.

Typically, I present each concept by working through the math on a simple example, then implement a version of the algorithm in Java, and then show how to use open-source APIs like WEKA, Lucene, Nutch, and JDM to solve the same problem. If you follow the principle of precomputing the information asynchronously, you should be able to work around the problem of some of these APIs being very heavyweight.

Jeff Storey
Ranch Hand

Joined: Apr 07, 2007
Posts: 230

Thanks for the reply. I'm looking forward to reading the book.

Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15628

Pentaho is all open-source if I'm not mistaken. A lot of it was created by combining other open-source projects.

Customer surveys are for companies who didn't pay proper attention to begin with.
Jeff Storey
Ranch Hand

Joined: Apr 07, 2007
Posts: 230

You are correct, the Pentaho projects are open source. The Weka project is licensed under the GPL, which makes it difficult to integrate into commercial applications (unless you want to release your source), but they do offer commercial licensing (which I believe is rather expensive).
