
Collecting tweets: is it better to create a Reader class or to do it in the Map function?

 
Greenhorn
Posts: 12
So, I'm working on Twitter sentiment analysis. I have created a Reader.java that reads tweets, cleans them, and stores them in HDFS. The Map function then takes that input and does the sentiment work.
Currently I'm collecting a small number of tweets, but I thought: if I collect a huge number, wouldn't it be best to put the Reader's work inside the Map function, for the sake of handling big data?
I tried doing that, but the Map function gave me a problem: it still needs an input.

I'm confused. Can anyone explain which solution is better: keep the Reader, or move its code into Map and work around the input-file requirement?

Thanks,
 
Bartender
Posts: 1210
Receiving tweets directly in mappers is not a good idea.
In such a design, each mapper is a separate process, usually running on a different machine, and each one opens its own streaming connection to Twitter and receives the same set of tweets.

Note: the Twitter streaming API doesn't really give you all tweets that match your criteria, just about 1% of them.

Since every mapper receives the same set of tweets, the mapper design becomes complicated:
1) Mappers have to prevent duplicate writes
If the number of mappers is m, tweet 1 should be written only by mapper 1, tweet 2 only by mapper 2, ... tweet m only by mapper m,
tweet m+1 only by mapper 1, and so on. You'll have to write extra logic that uses the mapper or task ID to do this kind of coordinated prevention (there's a rough sketch of this after the list).

2) Duplication prevention also wastes processing power
Since every mapper is responsible for writing just 1 in every m tweets, it wastefully receives and processes the other m-1 tweets just to discard them.

3) Risk of data loss
If a mapper goes down, the tweets that were its responsibility will never be written, because the Twitter streaming API is a realtime API.
Once you miss a tweet, it will never be redelivered via the streaming API; it's left to you to retrieve it by tweet ID via the REST API.

4) Risk of twitter throttling or banning your API access
Since all mappers will likely use the same API token, possibly via the same gateway IP address, there's a risk that Twitter sees it as exceeding
rate limits and throttles or bans your token or IP address. I'm not sure if Twitter actually does this for the streaming API, but they do it for the REST APIs.
The risk is always there.

5) Overall network load of your cluster is higher
Since all the mappers open TCP channels to Twitter's API endpoint and receive tweets pushed over those channels,
the overall network load of your cluster is unnecessarily higher than with a single reader or a small number of readers. And since most of the tweets
are discarded by every mapper, most of this network activity is in fact wasted.
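
Just to illustrate how awkward the coordination in point 1 gets, here is a rough sketch (not a recommendation) of a mapper that pulls tweets itself and keeps only "its" slice of the stream. The streaming client is stubbed out, NUM_MAPPERS is an assumed fixed value, and the scheme only works if every mapper happens to see the same tweets in the same order, which a real stream doesn't guarantee:

import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StreamingTweetMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final int NUM_MAPPERS = 4; // assumption: must match the real number of map tasks

    // Stub standing in for a streaming client; a real one would keep an open
    // connection to the Twitter streaming API and yield tweet JSON strings.
    private Iterator<String> openTweetStream() {
        return Collections.emptyIterator();
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        // Which slice of the stream "belongs" to this mapper, derived from its Hadoop task ID.
        int myTaskId = context.getTaskAttemptID().getTaskID().getId();

        Iterator<String> tweets = openTweetStream();
        long tweetIndex = 0;
        while (tweets.hasNext()) {
            String tweetJson = tweets.next();
            // Every mapper receives every tweet but may only write 1 in every
            // NUM_MAPPERS of them; the other m-1 tweets are received, parsed and discarded.
            if (tweetIndex % NUM_MAPPERS == myTaskId) {
                context.write(new Text(tweetJson), NullWritable.get());
            }
            tweetIndex++;
        }
        cleanup(context);
    }
}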

The typical design for Twitter analysis is to have one or two readers (two for redundancy) that receive tweets and put them in a durable message queue like Kafka or RabbitMQ. One or more MQ consumers then pop items off that queue and write them to HDFS using efficient, binary, splittable, compressible formats like Avro or Parquet. The mappers then just read those files from HDFS.
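
As a rough sketch of the reader side of that design, assuming Kafka as the queue (the kafka-clients library), a stubbed-out tweet source in place of a real streaming client, and a hypothetical "raw-tweets" topic name. A separate consumer process would then read from the topic and write Avro or Parquet files to HDFS in batches:

import java.util.Collections;
import java.util.Iterator;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetReader {

    // Stub: a real reader would hold an open connection to the Twitter
    // streaming API and yield raw tweet JSON strings as they arrive.
    private static Iterator<String> openTweetStream() {
        return Collections.emptyIterator();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local Kafka broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // durability: wait for the queue to confirm each write

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            Iterator<String> tweets = openTweetStream();
            while (tweets.hasNext()) {
                String tweetJson = tweets.next();
                // Clean/filter here if needed, then hand off to the durable queue.
                producer.send(new ProducerRecord<>("raw-tweets", tweetJson));
            }
        }
    }
}

With this split, the mappers never talk to Twitter at all; they just receive ordinary HDFS input splits, which is exactly the input your Map function was complaining about needing.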
 
Arwa Saad
Greenhorn
Posts: 12
Thank you so much, Karthik Shiraly! Now it makes sense!
I appreciate it
 