


Java-based Collective Intelligence

Paul Michael
Ranch Hand

Joined: Jul 02, 2001
Posts: 697
Collective Intelligence in Action is a hands-on guidebook for implementing collective intelligence concepts using Java. It is the first Java-based book to emphasize the underlying algorithms and technical implementation of vital data gathering and mining techniques like analyzing trends, discovering relationships, and making predictions. It provides a pragmatic approach to personalization by combining content-based analysis with collaborative approaches.


I've been thinking of buying the O'Reilly book on this topic, but I'm not very excited about it because its code samples are NOT written in Java.

To Satnam: What do you think are the advantages and disadvantages of using Java in this kind of field?

Thanks and looking forward to reading your book.


SCJP 1.2 (89%), SCWCD 1.3 (94%), IBM 486 (90%), SCJA Beta (96%), SCEA (91% / 77%), SCEA 5 P1 (77%), SCBCD 5 (85%)
Satnam Alag
Author
Greenhorn

Joined: May 07, 2008
Posts: 26
A lot of work has been done by the open-source community in Java in the areas of text processing and search (Lucene), data mining (WEKA), web crawling (Nutch), and data mining standards (JDM). This book leverages these frameworks, presents examples, and develops code that you can use directly in your Java application.

I personally have been using Java for a long time and have been successful in building highly scalable applications that use these techniques.
Paul Michael
Ranch Hand

Joined: Jul 02, 2001
Posts: 697
Thanks Satnam!

I'm currently very interested in the "Intelligent search" topic of the book.

I noticed that not all of the accompanying sources have been grouped per chapter. Would you be able to point me to the location for the "Intelligent search" samples?

Thanks again.
Paul Michael
Ranch Hand

Joined: Jul 02, 2001
Posts: 697
Oh and before you go, aside from the samples being written in Java (which is a BIG plus for us), how would you compare your book to the other Collective Intelligence books out there?

Thanks again and hope you had a nice stay here at the Ranch.
Jason Carreira
Author
Greenhorn

Joined: Oct 25, 2005
Posts: 7
Satnam, in your sections on Nutch, do you tie into Nutch to do pre-calculations using the indexed values (for instance, the term vector) as the pages are finished? I've been going through their docs looking for event callback points, but it's not very obvious.


Co-Author of WebWork in Action from Manning.
Satnam Alag
Author
Greenhorn

Joined: May 07, 2008
Posts: 26
Here are the classes, chapter by chapter. This listing will also be included in the final source code.

Chapter 1

Chapter 2

Chapter 3
com.alag.ci.tagcloud.TagCloud
com.alag.ci.tagcloud.TagCloudElement
com.alag.ci.tagcloud.FontSizeComputationStrategy
com.alag.ci.tagcloud.impl.TagCloudImpl
com.alag.ci.tagcloud.impl.TagCloudElementImpl
com.alag.ci.tagcloud.impl.FontSizeComputationStrategyImpl
com.alag.ci.tagcloud.VisualizeTagCloudDecorator
com.alag.ci.tagcloud.impl.HTMLTagCloudDecorator
com.alag.ci.tagcloud.test.TagCloudTest


Chapter 4
com.alag.ci.MetaDataVector
com.alag.ci.textanalysis.MetaDataExtractor
com.alag.ci.textanalysis.impl.SimpleMetaDataExtractor
com.alag.ci.textanalysis.impl.SimpleStopWordMetaDataExtractor
com.alag.ci.textanalysis.impl.SimpleStopWordStemmerMetaDataExtractor
com.alag.ci.textanalysis.impl.SimpleBiTermStopWordStemmerMetaDataExtractor


Chapter 5
com.alag.ci.blog.search.BlogSearcher
com.alag.ci.blog.search.BlogQueryParameter
com.alag.ci.blog.search.BlogQueryResult
com.alag.ci.blog.search.BlogSearchResponseHandler
com.alag.ci.blog.search.BlogSearcherException
com.alag.ci.blog.search.impl.BlogQueryParameterImpl
com.alag.ci.blog.search.impl.BlogSearcherImpl
com.alag.ci.blog.search.impl.BlogSearchResponseHandlerImpl
com.alag.ci.blog.search.impl.technorati.TechnoratiSearchBlogQueryParameterImpl
com.alag.ci.blog.search.impl.technorati.TechnoratiBlogSearcherImpl
com.alag.ci.blog.search.impl.technorati.TechnoratiResponseHandler
com.alag.ci.blog.search.impl.rss.RSSFeedBlogQueryParameterImpl
com.alag.ci.blog.search.impl.rss.RSSFeedBlogSearcherImpl
com.alag.ci.blog.search.impl.rss.RSSFeedResponseHandler

Chapter 6
com.alag.ci.webcrawler.NaiveCrawler
com.alag.ci.webcrawler.CrawlerUrl

Chapter 7
com.alag.ci.weka.tutorial.WEKATutorial
com.alag.ci.jdm.connect.JDMConnectionExample

Chapter 8
com.alag.ci.textanalysis.lucene.impl.PorterStemStopWordAnalyzer
com.alag.ci.textanalysis.PhrasesCache
com.alag.ci.textanalysis.SynonymsCache
com.alag.ci.textanalysis.lucene.impl.SynonymPhraseStopWordFilter
com.alag.ci.textanalysis.lucene.impl.SynonymPhraseStopWordAnalyzer
com.alag.ci.textanalysis.lucene.impl.CacheImpl
com.alag.ci.textanalysis.lucene.impl.SynonymsCacheImpl
com.alag.ci.textanalysis.lucene.impl.PhrasesCacheImpl
com.alag.ci.textanalysis.Tag
com.alag.ci.textanalysis.lucene.impl.TagImpl
com.alag.ci.textanalysis.TagCache
com.alag.ci.textanalysis.lucene.impl.TagCacheImpl
com.alag.ci.textanalysis.TagMagnitude
com.alag.ci.textanalysis.termvector.impl.TagMagnitudeVectorImpl
com.alag.ci.textanalysis.InverseDocFreqEstimator
com.alag.ci.textanalysis.lucene.impl.EqualInverseDocFreqEstimator
com.alag.ci.textanalysis.TextAnalyzer
com.alag.ci.textanalysis.lucene.impl.LuceneTextAnalyzer


Chapter 9
com.alag.ci.cluster.Clusterer
com.alag.ci.cluster.TextCluster
com.alag.ci.cluster.TextDataItem
com.alag.ci.blog.cluster.impl.BlogAnalysisDataItem
com.alag.ci.blog.cluster.impl.BlogDataSetCreatorImpl
com.alag.ci.textanalysis.lucene.impl.InverseDocFreqEstimatorImpl
com.alag.ci.blog.cluster.impl.ClusterImpl
com.alag.ci.blog.cluster.impl.TextKMeansClustererImpl
com.alag.ci.cluster.hiercluster.HierCluster
com.alag.ci.blog.cluster.impl.HierClusterImpl
com.alag.ci.blog.cluster.impl.HierDistance
com.alag.ci.blog.cluster.impl.HierarchialClusteringImpl
com.alag.ci.blog.cluster.weka.impl.WEKABlogDataSetClusterer
com.alag.ci.jdm.clustering.JDMClusteringExample


Chapter 10
com.alag.ci.blog.dataset.impl.WEKAPredictiveBlogDataSetCreatorImpl
com.alag.ci.blog.classify.weka.impl.WEKABlogClassifier
com.alag.ci.blog.predict.weka.impl.WEKABlogPredictor
com.alag.ci.jdm.classification.JDMClassificationExample

Chapter 11
com.alag.ci.search.lucene.BlogSearchExample
com.alag.ci.search.lucene.RetrievedBlogHitCollector


Chapter 12
com.alag.ci.recoengine.RelevanceTextDataItem
com.alag.ci.recoengine.ContentBasedBlogRecoEngine
com.alag.ci.cf.KNNWEKAExample
com.alag.ci.cf.SVDExample
Paul Michael
Ranch Hand

Joined: Jul 02, 2001
Posts: 697
Originally posted by Satnam Alag:
Here are the classes, chapter by chapter. This listing will also be included in the final source code.


Wow, thanks for the detailed listing.
Satnam Alag
Author
Greenhorn

Joined: May 07, 2008
Posts: 26
Regarding your question on how it compares with other books on collective intelligence, here is what I wrote on the Amazon page for the book:


Difference from other books

The book is really meant for developers (a basic understanding of Java helps) who are looking to add intelligence to their applications, especially user-centric Web 2.0 applications. A lot of work has been done by the open-source community in Java in the areas of text processing and search (Lucene), data mining (WEKA), web crawling (Nutch), and data mining standards (JDM). This book leverages these frameworks, presents examples, and develops code that you can use directly in your Java application.

This is a practical book and I present a holistic view of the things required to apply these techniques in the real world. Consequently, the book discusses the architectures for implementing intelligence: you will find lots of diagrams, especially UML diagrams, and lots of screenshots from well-known sites, in addition to code listings and even database schema designs.

There is a plethora of examples. Typically, concepts and the underlying math for algorithms are explained via examples with detailed step-by-step analysis. Accompanying the examples is Java code that demonstrates the concepts by implementing them and/or using open-source frameworks.

There are a number of exciting topics that you will find interesting and that are typically not covered by other books: harvesting information from the blogosphere, analyzing content (especially user-generated content), intelligent web crawling, intelligent search, and building recommendation systems. In the last chapter, I also cover three real-world examples of personalization by Amazon, Google News, and Netflix; the BellKor solution from the Netflix competition is also covered. At the end of this you should be familiar with text analysis using Lucene, web crawling using Nutch, building content-based and collaborative-based recommendation engines, and data mining using WEKA and JDM.
Satnam Alag
Author
Greenhorn

Joined: May 07, 2008
Posts: 26
Regarding your question on Nutch: Chapter 6 deals with web crawling and covers Nutch. Here are the details on that chapter:


In this chapter, we will continue with our theme of gathering information from outside one's application. You will be introduced to the field of intelligent web crawling to retrieve relevant information. Search engines crawl the web periodically to index available content on the internet. You may be interested in crawling the web to harvest information from external sites, which can then be used in your application. Search engines such as Google and Yahoo! constantly crawl the web to gather data for their search results.

This chapter is organized in three sections.
* First, we will look at the field of web crawling: how it can be used in your application, what the crawling process is, how the crawling process can be made intelligent, how to access pages that are not retrievable using the traditional method of following hyperlinks found on a page, and the available public domain crawlers that you can use.
* Second, to understand the basics of intelligent (focused) crawling, we will implement a simple web crawler that highlights the key concepts related to web crawling.
* Third, we will use Apache Nutch, an open-source, Java-based, scalable crawler. We will also discuss how Hadoop and MapReduce are used to make Nutch distributed and scalable.



This chapter also covers focused crawling, which makes the crawler more intelligent in its pursuit of relevant content.
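To make the idea of focused crawling concrete, here is a minimal sketch. It is not the book's NaiveCrawler: the class name FocusedCrawlSketch, the keyword-fraction relevance score, the threshold, and the pluggable fetcher function are all assumptions made for illustration. The crawler keeps a priority queue of candidate URLs, fetches the most promising one first, and only follows links from pages that score as relevant.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of best-first (focused) crawling; not code from the book.
final class FocusedCrawlSketch {

    /** A frontier entry: a URL plus the relevance score of the page that linked to it. */
    private static final class Candidate {
        final String url;
        final double priority;
        Candidate(String url, double priority) { this.url = url; this.priority = priority; }
    }

    // Simplistic link extraction (assumption): double-quoted href attributes only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Fraction of the keywords that occur in the page text, ignoring case. */
    static double relevance(String page, List<String> keywords) {
        if (keywords.isEmpty()) return 0.0;
        String lower = page.toLowerCase();
        int hits = 0;
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase())) hits++;
        }
        return (double) hits / keywords.size();
    }

    /**
     * Fetches at most maxPages pages, highest-priority candidate first.
     * Links are followed only from pages scoring at least the threshold.
     * Returns the URLs of the relevant pages in the order they were visited.
     */
    static List<String> crawl(String seed, Function<String, String> fetcher,
                              List<String> keywords, double threshold, int maxPages) {
        PriorityQueue<Candidate> frontier =
                new PriorityQueue<>((a, b) -> Double.compare(b.priority, a.priority));
        Set<String> seen = new HashSet<>();
        List<String> relevant = new ArrayList<>();
        frontier.add(new Candidate(seed, 1.0));
        seen.add(seed);
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            Candidate next = frontier.poll();
            String page = fetcher.apply(next.url);
            if (page == null) continue;           // page could not be retrieved
            fetched++;
            double score = relevance(page, keywords);
            if (score < threshold) continue;      // off-topic: do not expand its links
            relevant.add(next.url);
            Matcher matcher = HREF.matcher(page);
            while (matcher.find()) {
                String link = matcher.group(1);
                if (seen.add(link)) frontier.add(new Candidate(link, score));
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" stands in for real HTTP fetching.
        Map<String, String> web = Map.of(
                "seed", "java crawler overview <a href=\"a\">x</a> <a href=\"b\">y</a>",
                "a", "lucene search with java <a href=\"c\">z</a>",
                "b", "cooking recipes <a href=\"d\">w</a>",
                "c", "java nutch crawler",
                "d", "java crawler page reachable only through an off-topic page");
        System.out.println(crawl("seed", web::get, List.of("java", "crawler"), 0.5, 10));
    }
}
```

Passing the fetcher in as a function keeps the sketch testable without network access. A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.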

thanks
Satnam
Paul Michael
Ranch Hand

Joined: Jul 02, 2001
Posts: 697
Sounds good! Sorry if I missed that one from the Amazon description. I should have read it more carefully instead of quickly skimming through it.

Sounds good overall. I hope to buy an e-book copy of your book within the next week using the $12 off coupon I got from Manning's contests (in case I don't win this JavaRanch promo).

But I still have high hopes! Let's just wait and see.

Thanks for taking the time to answer our questions.
Jason Carreira
Author
Greenhorn

Joined: Oct 25, 2005
Posts: 7
After you crawl with Nutch, do you do anything with the created indexes? Do you look at the term vector from Lucene and build up correlations between documents?
Satnam Alag
Author
Greenhorn

Joined: May 07, 2008
Posts: 26
Yes, Chapter 12 covers getting related items based on the similarities of the term vector. This chapter also shows how to find similar items using collaborative techniques.
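As a rough illustration of comparing items by term vector, here is a minimal sketch of cosine similarity. It is not the book's code and it does not use Lucene: the class name TermVectorSimilarity, the regex-based tokenizer, and the use of raw term counts as weights are assumptions made for this example. A real system would more likely use stemming, stop words, and TF-IDF weights.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of term-vector similarity; not code from the book.
final class TermVectorSimilarity {

    /** Builds a raw term-count vector by lowercasing and splitting on non-word characters. */
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Cosine of the angle between two sparse vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> first = termVector("Java search with Lucene");
        Map<String, Double> second = termVector("Lucene search in Java applications");
        System.out.println(cosine(first, second));
    }
}
```

Documents that share many terms score close to 1, and documents with no terms in common score 0, so ranking candidates by this value gives a simple "related items" list.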

thanks
Satnam
 