I'm doing a project where I download a large number of tweets using Twitter4J and Flume and pipe them into Hadoop. I've just completed this step and I have a folder structure like this in Hadoop: /user/flume/[month]/[day]/[hour]
In each hour folder there are maybe 50-75 files, each containing multiple tweets. I am assuming these files are JSON.
I've been looking up some guides online on how to perform sentiment analysis. Right now I'm focussing on a basic type, where I compare each tweet against lists of positive and negative words and assign it an aggregated score. However, all the guides I've found assume that all the tweets are in one large text file, not spread across hundreds of separate JSON files.
Should I convert my JSON files into one single txt file (and if so, how?) or should I perform the analysis on them as they are (and if so, how? I know I have to write a mapper and a reduce function, but how do I get the job to pick up the multiple JSON files)?
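For reference, the basic word-list scoring described in this question can be sketched in a few lines of Python. This is only an illustration: the two word lists are tiny made-up examples, and `score_tweet` is a hypothetical name, not something from Twitter4J, Flume or Hadoop.

```python
import re

# Tiny illustrative word lists (assumptions); a real run would load a full lexicon.
POSITIVE_WORDS = {"good", "great", "love", "happy", "excellent"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad", "awful"}


def score_tweet(text):
    """Return (number of positive words) minus (number of negative words)."""
    # Lower-case the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE_WORDS)
    negatives = sum(1 for w in words if w in NEGATIVE_WORDS)
    return positives - negatives
```

A score above zero suggests a positive tweet, below zero a negative one, and zero means neutral or mixed.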
Depending on the volume of data, I'd consider merging your data files into a single JSON file, as it's probably easier to code and test this step separately as part of your data preparation, and it saves you having to code your main analysis process to swap between multiple files. On the other hand, Hadoop is supposed to be able to deal with lots of files independently, which is what you already have, so maybe your current file structure is fine. You'll be merging the outputs from the reduce phase anyway, so maybe it doesn't matter so much how your inputs are organised.
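If you do go the merging route, here is a minimal sketch of that data-preparation step in Python. It assumes the files have already been copied out of HDFS to a local folder with the same month/day/hour layout, and that each file holds one JSON tweet per line (I can't confirm that from your description). `merge_files` is a hypothetical helper name.

```python
import os


def merge_files(input_root, output_path):
    """Concatenate every file under input_root into output_path, one record per line.

    Returns the number of non-empty lines written.
    Note: output_path must live outside input_root, or the walk would read it too.
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(input_root):
            dirnames.sort()  # visit month/day/hour folders in a stable order
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name), encoding="utf-8") as src:
                    for line in src:
                        line = line.strip()
                        if line:  # drop blank lines between records
                            out.write(line + "\n")
                            written += 1
    return written
```

For a plain concatenation straight from HDFS to a local file, the `hadoop fs -getmerge` command is also worth a look.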
If you're interested in an alternative way of holding your data prior to the Hadoop work, you could use MongoDB to hold your Tweets in a single collection, which allows you to explore them using MongoDB's query tools and extract (sub-sets of) them to individual files if you want.
Install the free MongoDB database on your machine. MongoDB uses a variety of JSON as its standard data format, which means you can load JSON files straight into the database then query them very easily. A MongoDB "collection" is analogous to a table in a regular database.
Use the "mongoimport" tool to load your JSON file(s) into a MongoDB database collection e.g. (all on one line):
Log into the MongoDB shell from your command-line using the "mongo" command.
Inside MongoDB, switch to your preferred database:
Set an alias for your data collection if you want e.g. I want to refer to the "mytweets" collection as "data":
Display the first 5 records where the user is geo-enabled:
Display only the "user" information for the first of those records in a "prettier" format:
Of course, you don't need to do any of this MongoDB stuff at all, but if you're working with big lumps of JSON data, it can be a handy option to have in your toolkit.
If all you want to do is process each JSON file for sentiment analysis, I wouldn't use a database. It just adds another point of failure. To make your application scalable, you will need to stay away from any sort of centralized database. That means installing the database on each node, and you will run into headaches there.
Yes, if your sentiment analysis requires you to do a lot of searching on your tweets, or if you want to run some sort of aggregation queries that you can simply leverage from the database, then having a database on each node makes a lot of sense.
The rule of thumb with any distributed app is to keep your workers very light. The more stuff you put on the workers, the more stuff that will need to be downloaded, installed and configured, and the more stuff that can go wrong. KISS applies a thousandfold for distributed apps. Go for the simplest solution that fits your problem.
The simplest solution would be to use Jackson to convert the JSON to Java objects, and then do your sentiment analysis on the Java objects.
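To show the shape of that "parse into objects first" idea, here is a rough analogue in Python using the standard json module rather than Jackson. The `Tweet` class and `parse_tweet` function are hypothetical names, and the field names (`id_str`, `user.screen_name`, `text`) are assumptions about what the tweet JSON contains.

```python
import json
from dataclasses import dataclass


@dataclass
class Tweet:
    tweet_id: str
    user: str
    text: str


def parse_tweet(raw_json):
    """Turn one JSON document into a Tweet, or return None if it is not usable."""
    try:
        data = json.loads(raw_json)
    except ValueError:
        return None  # malformed JSON
    if not isinstance(data, dict) or "text" not in data:
        return None  # not a tweet record (assumed to always carry "text")
    user = data.get("user")
    if not isinstance(user, dict):
        user = {}
    return Tweet(
        tweet_id=str(data.get("id_str", "")),
        user=str(user.get("screen_name", "")),
        text=data["text"],
    )
```

The analysis code then only ever sees `Tweet` objects, so it does not care how many files the raw JSON came from.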
Thanks for both of your responses.
I chose not to use the DB in the end, as after initial processing I have no need for the files. I have performed two types of analysis, in Java and Python. Before I did either, I converted and parsed the JSON and stored the tweets as plain text. I found this was the easiest way to work with them. Each txt file of JSON tweets was 64MB in size.
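For anyone wanting to do the same conversion, a minimal sketch of the JSON-to-plain-text step might look like the following. This is not the exact code used here; it assumes one JSON tweet per line with the tweet body in a `text` field, and `json_lines_to_text` is a hypothetical name.

```python
import json


def json_lines_to_text(json_path, text_path):
    """Write the "text" field of each JSON line to text_path, one tweet per line.

    Returns the number of tweets written.
    """
    count = 0
    with open(json_path, encoding="utf-8") as src, \
            open(text_path, "w", encoding="utf-8") as out:
        for line in src:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip blank or malformed lines
            if not isinstance(tweet, dict) or not isinstance(tweet.get("text"), str):
                continue  # skip records without a usable text field
            # Collapse internal newlines and tabs so each tweet stays on one line.
            out.write(" ".join(tweet["text"].split()) + "\n")
            count += 1
    return count
```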