Sean Allen

Author
Recent posts by Sean Allen

Scala is my preferred language for doing serious work on the JVM. That said, I'm not a giant fan of the language. It has an awful lot of warts and wtfs, but I find it to be the most productive JVM language for me.
We wrote the book to try to take people from having zero knowledge of Storm to being able to run a production cluster. It won't tell you everything you need to know, but we think it is an excellent guide to get you there.
There's an awful lot there that I wish someone had been around to teach me rather than having to painfully figure out myself. We even threw in some distributed systems advice for good measure.

All the code in the book is in Java. We wanted the examples to reach the widest audience (we use Scala ourselves outside of the context of the book).

To get started, just start. Pick something that looks interesting (Storm, Hadoop, Spark, Samza) and start playing around with it. Get comfortable, then find someplace you can put it to practical use running in a production environment.
You can either allot more memory per worker or work on improving the throughput of your bolts.
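If it helps, here is a minimal sketch of the first option, assuming a 0.9.x-era Java topology (the package name, class name, and heap size are illustrative assumptions, not recommendations):

```java
import backtype.storm.Config;

public class WorkerMemoryExample {
    public static void main(String[] args) {
        Config conf = new Config();
        // Raise the heap of each worker JVM via its child opts
        // (-Xmx2g is an illustrative value, not a recommendation).
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");
        // conf would then go to StormSubmitter.submitTopology(...)
        // along with the rest of the topology setup.
    }
}
```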

Upgrade to the latest Storm version.

All the incubating versions suffered from this problem.
We never ran 0.9.2-incubating in production as it didn't pass our tests. I don't recall what problems we hit; I do know we didn't end up using it.


1. Average time isn't useful in your scenario. Your average can be below your timeout, but many tuples can still be taking longer than that timeout. You are dealing with processing latency, and that is going to vary widely.
2. You need to work out your issues with #1 above, then figure out how to make your system work with at-least-once processing. You can't have exactly-once processing in a distributed system. You can get "usually once", but that is the best you can do. Chapter 4 discusses this towards the end. Ideally, you want a system where duplicating a message costs nothing other than wasted processing (see the sketch after this list).
3. That isn't something I'm capable of helping you hammer out via this forum. I'd suggest creating the simplest topology possible and figuring out what in your code/configuration is causing the issue.
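Here's a rough sketch of what #2 can look like in plain Java (the handler, message shape, and in-memory store are hypothetical; the point is that replaying a message leaves the same end state):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical handler: writes are keyed by a stable message id, so a
// duplicate delivered under at-least-once semantics overwrites itself
// instead of double-counting. A replay only costs wasted work.
public class IdempotentHandler {
    private final Map<String, Long> store = new ConcurrentHashMap<>();

    public void handle(String messageId, long value) {
        store.put(messageId, value); // set-style write: safe to replay
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        handler.handle("msg-1", 42L);
        handler.handle("msg-1", 42L); // duplicate: same end state
    }
}
```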



Max spout pending is the maximum number of tuples that can be in your topology unacknowledged from that spout at a given time.

Think of it this way.

Max Spout Pending caps the difference between the number of tuples emitted from nextTuple on your spout and the number of tuples that have been fully acked.

So if your max spout pending is 10, nextTuple can be called 10 times without any of those tuples being acked.

Your spout could be holding more messages from your broker while waiting for nextTuple to be called.

What you pull over from your broker at any time is a separate performance concern from the number of tuples that are "live" in your topology at any given period of time.

Max Spout Pending is a (simple) means of providing backpressure.
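In code it's a single setting. A minimal sketch, assuming a 0.9.x-era Java topology (10 just matches the example above):

```java
import backtype.storm.Config;

public class MaxSpoutPendingExample {
    public static void main(String[] args) {
        Config conf = new Config();
        // At most 10 tuples emitted from nextTuple may be un-acked at
        // once; Storm stops calling nextTuple until some are acked.
        conf.setMaxSpoutPending(10);
    }
}
```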
A lot has changed with the introduction of Hadoop 2 and YARN. See: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Re Storm and tuple/message tracking: it's true you can track all messages to ensure they are processed; that is, however, an optional feature. You can choose not to use guaranteed message processing and, in turn, get better throughput.
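As a hedged sketch of what opting out looks like (0.9.x-era package name assumed), you can set the number of ackers to zero, which makes Storm skip tuple tracking:

```java
import backtype.storm.Config;

public class NoTrackingExample {
    public static void main(String[] args) {
        Config conf = new Config();
        // With zero acker executors, tuples are treated as acked the
        // moment they are emitted: no tracking overhead, no replays.
        conf.setNumAckers(0);
    }
}
```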
That would be implementation specific.

* You could output changes in trending topics like

"started trending" : "one direction"

"no longer trending" : "katy perry"

to a message queue, and other parts of your system could act on that (for example, put the output on a Kafka topic, which could serve as the beginning of other streams). There's a sketch of this option after the list.

* You could be updating an external database such as Redis or Riak that provides a set datatype.

* Storm provides a DRPC client that you can use to query the in-memory state of a topology (this is covered in Chapter 9 in relation to Trident, but you could, if you really wanted to, use it without Trident).
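Here's a hedged sketch of the first option as a Storm bolt (the field names, the upstream tuple shape, and the 0.9.x-era package names are all assumptions for illustration):

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashSet;
import java.util.Set;

// Tracks the current trending set and emits only the changes, which
// downstream bolts (or a queue-writing bolt) can act on.
public class TrendChangeBolt extends BaseBasicBolt {
    private final Set<String> trending = new HashSet<String>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String topic = input.getStringByField("topic");
        boolean isTrending = input.getBooleanByField("trending");

        if (isTrending && trending.add(topic)) {
            collector.emit(new Values("started trending", topic));
        } else if (!isTrending && trending.remove(topic)) {
            collector.emit(new Values("no longer trending", topic));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("change", "topic"));
    }
}
```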

Hadoop is oriented towards working with batches of data.

Spark is oriented towards working with either batches of data, like Hadoop, or with "micro-batching", which is basically smaller batches of data that start to approximate what a streaming solution is like.

Storm is oriented towards working on a never-ending stream of data where you are constantly calculating and there is no start or end. Whenever data arrives, it is processed. Storm, via Trident, can also do micro-batching.

Think batch processing when you are crunching a large amount of data and don't need an answer right now. If, for example, you process your website's log files every day to look for trends and extract value from them, then a batch framework like Hadoop is perfect. However, if you are analyzing those logs in order to detect intrusion attempts against your system, then you want to know as soon as possible. For this, you would want a system like Storm, where each event within your system is shipped to Storm as a stream as soon as it happens so you can analyze it immediately.
Hadoop is batch oriented, so discussing it in terms of messages isn't really applicable. Hadoop works on a pool of data, a batch at a time.

Storm is a stream processing system that operates on a stream of data, one message at a time.

With Hadoop, you won't be able to extract any answers from your data until the batch is processed. Imagine, for example, the case of Twitter's trending topics:

With Hadoop, you could take all of the tweets from the last 30 minutes and process them looking for trending topics. That 30 minutes of tweets is your batch.

With Storm, you can act on those tweets as a stream of data and start detecting trending topics immediately, as you are working with the stream in real time rather than as a batch.

So Hadoop is not real time. It doesn't operate on messages, so guaranteeing messages are processed doesn't apply.
We included "reliably" in the summary because Hadoop v1 had numerous reliability issues. Hadoop v2 addresses most of those; however, at this time, most people still equate Hadoop reliability with v1.
We don't cover any specific service such as AWS, etc.; we do, however, go through how to deploy Storm.

Deploying Storm is relatively easy. We strongly advise that you use a tool like Puppet to allow for recreatable deploys.
From there, you can easily deploy new nodes to the cluster and whatnot.

My #1 piece of advice if you are setting up a cluster is to pay close attention to ZooKeeper. It's a vital part of a Storm cluster.
Make sure that you are running your ZooKeeper machines with fast disks, as it is I/O intensive, and a lack of stability in ZooKeeper will impact your entire cluster.
Storm is a streaming data tool, so any use case that can work on streams of data is a good use case.

In your matching case, if you are matching something like a customer on a website to products, then that can be turned into a streaming case, as it's a per-data-point problem.
You can also use Storm for doing aggregation, etc., on data.

We use Storm to run a number of matching algorithms during our day jobs here at TheLadders.
As long as you can turn that analysis into a streaming data problem, then yes, it can be.