
Collective Intelligence - Real Time Analysis

 
Jeff Storey
Ranch Hand
Posts: 230
Hi Satnam,

I have previously worked on data mining applications, and one of the biggest problems we ran into was extracting trends and analyzing information in real time. The data we were working with changed rapidly (every couple of hours), so caching it for any longer than that was not really an option. With information available in real time from everywhere using Google, people want quick results.

Does your book discuss techniques for real-time analysis? Also, do you use existing data mining frameworks? (I believe I saw a weka jar in the source code, but I'm not sure their package is free anymore, as it is now part of the Pentaho project.) Another issue we've had is that some of these frameworks, such as GATE and weka, are very heavyweight, and using even a small subset of their features can involve a lot of memory overhead.

Thanks, looking forward to hearing back.

Jeff Storey
 
Satnam Alag
Author
Greenhorn
Posts: 26
Jeff,

Great questions. Let me try to answer each of them.

Real-time analysis:
One of the first things I do in the book -- Section 2.1 -- is to present the architecture for applying collective intelligence in real-world applications. The key to applying these techniques is to precompute as much as possible asynchronously, so that minimal computation is carried out while the user is waiting. It also helps to have an event-driven SOA.
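As a rough illustration of the precompute-asynchronously idea (this is not code from the book; the class name PrecomputedCache and its methods are hypothetical, made up for this sketch), a background task can rebuild the results periodically while the request path does nothing but a map lookup:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch: recompute results in the background, serve lookups from memory. */
class PrecomputedCache<K, V> implements AutoCloseable {
    // Latest published results; starts empty until the first refresh completes.
    private final AtomicReference<Map<K, V>> snapshot = new AtomicReference<>(Map.of());
    // Assumption: the expensive analysis is wrapped in a Supplier by the caller.
    private final Supplier<Map<K, V>> expensiveComputation;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "precompute");
                t.setDaemon(true); // do not keep the JVM alive for the refresher
                return t;
            });

    PrecomputedCache(Supplier<Map<K, V>> expensiveComputation) {
        this.expensiveComputation = expensiveComputation;
    }

    /** Runs the expensive computation once and atomically publishes an immutable copy. */
    void refresh() {
        snapshot.set(Map.copyOf(expensiveComputation.get()));
    }

    /** Schedules refresh() to run repeatedly, off the request path. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep serving the last good snapshot; an uncaught exception
                // would cancel all future scheduled runs.
            }
        }, 0, period, unit);
    }

    /** Request path: a map lookup only, no computation while the user waits. */
    V lookup(K key, V fallback) {
        return snapshot.get().getOrDefault(key, fallback);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With data that changes every couple of hours, something like cache.start(30, TimeUnit.MINUTES) would keep the snapshot reasonably fresh; the period here is only an example value. A heavyweight library would then be loaded only by the background task, not by the code that answers user requests.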

One of the case studies I cover (Section 12.4.2) is how these techniques are being applied by Google News for personalization. They have a similar problem of high item churn and a large number of users. To quote a section from the book:


Google News is a good example of building a scalable recommendation system for large number of users (several million unique visitors in a month) and large number of items (several million new stories in a two month period) with constant item churn -- this is different from Amazon where the rate of item churn is much smaller.


Typically, the book presents each concept by working through the math on a simple example, then implements a version of the algorithm in Java, and then shows how to use open-source APIs like WEKA, Lucene, Nutch, and JDM to solve the same problem. If you follow the principle of precomputing the information asynchronously, you should be able to work around some of the APIs being very heavyweight.

Thanks,
Satnam
 
Jeff Storey
Ranch Hand
Posts: 230
Satnam,

Thanks for the reply. I'm looking forward to reading the book.

Jeff
 
Tim Holloway
Saloon Keeper
Posts: 18167
Pentaho is all open-source if I'm not mistaken. A lot of it was created by combining other open-source projects.
 
Jeff Storey
Ranch Hand
Posts: 230
Tim,

You are correct: Pentaho projects are open source. The weka project is licensed under the GPL, which makes it difficult to integrate into commercial applications (unless you want to release your source), but they do offer commercial licensing (which I believe is rather expensive).

Jeff
 