Twitter Sentiment Analysis in Hadoop

Andrew Martin

Joined: Aug 04, 2013
Posts: 3
Hi all,

I'm doing a project where I download a large number of tweets using Twitter4J and Flume and pipe them into Hadoop. I've just completed this step and I have a folder structure like this in Hadoop: /user/flume/[month]/[day]/[hour]
In each hour folder there are maybe 50-75 files, each containing multiple tweets. I am assuming these files are JSON.

I've been looking up some guides online on how to perform sentiment analysis. Right now I'm focussing on a basic type, where I compare each tweet against lists of positive and negative words and assign it an aggregated score. However, all the guides I've found assume that all the tweets are in one large text file, not hundreds of separate JSON files.
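
A minimal sketch of that basic word-list scoring, in plain Java, might look like the following. The class name, the method name and the tiny word lists are all made up for illustration; a real analysis would load much larger lexicons, and the input is assumed to be the tweet text already extracted from the JSON.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SentimentScorer {
    // Hypothetical, deliberately tiny word lists (assumption: real lexicons are far larger).
    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "love", "happy"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "awful", "hate", "sad"));

    /** Returns (number of positive words) minus (number of negative words) for one tweet. */
    public static int score(String tweetText) {
        int score = 0;
        // Lower-case the text and split on anything that is not a letter or an apostrophe.
        for (String word : tweetText.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (POSITIVE.contains(word)) {
                score++;
            } else if (NEGATIVE.contains(word)) {
                score--;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("I love this, it is great"));
        System.out.println(score("What an awful, sad day"));
    }
}
```

A positive result means more positive than negative words were found, a negative result the opposite, and zero means the tweet is neutral or contains no listed words.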

Should I convert my JSON files into one single txt file (and if so, how?) or should I perform the analysis on them as they are (and if so, how!! - I know I have to write a mapper and a reducer function, but how do I get it to pick up the multiple JSON files?)

Thanks for any help that can be offered.
chris webster

Joined: Mar 01, 2009
Posts: 2239

Depending on the volume of data, I'd consider merging your data files into a single JSON file, as it's probably easier to code and test this step separately as part of your data preparation, and it saves you having to code your main analysis process to swap between multiple files. On the other hand, Hadoop is supposed to be able to deal with lots of files independently, which is what you already have, so maybe your current file structure is fine. You'll be merging the outputs from the reduce phase anyway, so maybe it doesn't matter so much how your inputs are organised.
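
If you do go the merge route, a rough sketch of that preparation step for a local copy of the files could look like this. It uses only the standard Java library; the class and method names are invented for the example, and it assumes the output file sits outside the input folder and that the files are UTF-8 text. For files still sitting in HDFS, the `hadoop fs -getmerge` command does a similar job without any code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TweetFileMerger {
    /**
     * Copies every line of every regular file under inputDir (including
     * sub-folders such as [month]/[day]/[hour]) into outputFile.
     * Assumption: outputFile is NOT located inside inputDir.
     * Returns the number of files merged.
     */
    public static int merge(Path inputDir, Path outputFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(inputDir)) {
            // Sort so the merge order is predictable from the folder names.
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            for (Path file : files) {
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine(); // also guarantees a line break between files
                    }
                }
            }
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TweetFileMerger <inputDir> <outputFile>
        int count = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + count + " files");
    }
}
```

Doing this once up front keeps the file handling out of the analysis code, at the cost of an extra copy of the data.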

If you're interested in an alternative way of holding your data prior to the Hadoop work, you could use MongoDB to hold your Tweets in a single collection, which allows you to explore them using MongoDB's query tools and extract (sub-sets of) them to individual files if you want.

Install the free MongoDB database on your machine. MongoDB uses a variety of JSON as its standard data format, which means you can load JSON files straight into the database then query them very easily. A MongoDB "collection" is analogous to a table in a regular database.

Use the "mongoimport" tool to load your JSON file(s) into a MongoDB database collection e.g. (all on one line):

Log into the MongoDB shell from your command-line using the "mongo" command.

Inside MongoDB, switch to your preferred database:

Set an alias for your data collection if you want e.g. I want to refer to the "mytweets" collection as "data":

Now you can run queries on your Tweet data using MongoDB's query syntax (based on JavaScript). For example, to find Tweets where the user is geo-enabled:

Display the first 5 records where the user is geo-enabled:

Display only the "user" information for the first of those records in a "prettier" format:

Of course, you don't need to do any of this MongoDB stuff at all, but if you're working with big lumps of JSON data, it can be a handy option to have in your toolkit.

No more Blub for me, thank you, Vicar.
Jayesh A Lalwani
Saloon Keeper

Joined: Jan 17, 2008
Posts: 2682

If all you want to do is process each JSON file for sentiment analysis, I wouldn't use a database. It just adds another point of failure. To make your application scalable, you will need to stay away from any sort of centralized database. Instead, you would need to install the database on each node, and that brings its own headaches.

Yes, if your sentiment analysis requires you to do a lot of searching on your tweets, or if you want to run some sort of aggregation queries that you can simply leverage from the database, then having a database on each node makes a lot of sense.

The rule of thumb with any distributed app is to keep your workers very light. The more stuff you put on the workers, the more stuff that will need to be downloaded, installed and configured, and the more stuff that can go wrong. KISS applies a thousandfold for distributed apps. Go for the simplest solution that fits your problem.

The simplest solution will be to use Jackson to convert the JSON to Java objects, and then do your sentiment analysis on the Java objects.
Andrew Martin

Joined: Aug 04, 2013
Posts: 3
Thanks for both of your responses.

I chose not to use the DB in the end, as after initial processing I have no need for the files. I have performed two types of analysis, in Java and Python. Before I did either, I converted and parsed the JSON and stored the tweets as plain text. I found this was the easiest way to work with them. Each txt file of JSON tweets was 64MB in size.