For an internal web analytics platform where the traffic is around 15 million hits per month. That works out to only around 6 requests per second on average, say 25 during peak times. We are curious, though, about the best way to make a web analytics platform very fast and scalable.
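A quick back-of-envelope check of that throughput number (assuming a 30-day month):

```python
hits_per_month = 15_000_000
seconds_per_month = 30 * 24 * 3600   # ~2.59 million seconds

avg_rps = hits_per_month / seconds_per_month
print(round(avg_rps, 1))  # 5.8 requests per second on average
```

So the steady-state load is modest; it's the peak rate (and the per-request DB round trip) that drives the design.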
So basically, similar to Google Analytics, the platform has a snippet of JS that then fires an SQL query. Now the question is: should we update this query on the fly, or should we just do an insert and let another process *process* the data and update it for the end user (so they can see up-to-date analytics)?
Should a relational DB be used for this insert, or would something else be faster? Then parse that *log file* or whatnot into the DB? Maybe that would be quicker than hitting the database on every request: do a batch import into the database every 30 seconds or every minute. This follows the theory that opening one connection and running 1,000 queries is faster than opening a connection, running one query, and closing it for every single request.
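The batching idea can be sketched roughly like this (a minimal, assumed setup using SQLite in memory and a made-up `hits` table; real code would flush on a timer thread rather than inside the handler):

```python
import sqlite3
import time

# Buffer incoming hit records in memory and flush them in one batch
# insert, instead of doing one INSERT per request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, ts REAL)")

buffer = []
FLUSH_SIZE = 1000        # flush after this many buffered hits...
FLUSH_INTERVAL = 30.0    # ...or after this many seconds, whichever comes first
last_flush = time.monotonic()

def record_hit(url):
    global last_flush
    buffer.append((url, time.time()))
    if len(buffer) >= FLUSH_SIZE or time.monotonic() - last_flush >= FLUSH_INTERVAL:
        # One connection, one transaction, many rows.
        conn.executemany("INSERT INTO hits (url, ts) VALUES (?, ?)", buffer)
        conn.commit()
        buffer.clear()
        last_flush = time.monotonic()

# Simulate 2,500 incoming requests.
for i in range(2500):
    record_hit(f"/page/{i % 10}")

# Two batches of 1,000 have been flushed; 500 are still buffered.
print(conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0])  # 2000
```

The win comes from amortizing the transaction/commit overhead across many rows; the trade-off is that a crash loses whatever is still in the buffer, which is usually acceptable for analytics.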
Maybe there is a completely different approach for this that we are just not aware of. Any input would be great.
The first thing that comes to mind is to decouple gathering the stats from saving them. Push the incoming stats onto some kind of queue, and have a lower-priority job process that queue, saving the data in whichever way you want. That way high traffic, or a slowdown in the DB (or file system), doesn't affect the speed of stats gathering.
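A minimal sketch of that decoupling, using an in-process queue and a background worker (names like `handle_request` and the `saved` list are made up as stand-ins; in production the queue would likely be something external like a message broker):

```python
import queue
import threading
import time

stats_queue = queue.Queue()
saved = []  # stand-in for the real datastore (DB insert, log append, ...)

def worker():
    """Lower-priority consumer: drains the queue and persists each stat."""
    while True:
        item = stats_queue.get()
        if item is None:          # sentinel for a clean shutdown
            break
        saved.append(item)        # real code would INSERT here
        stats_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(url):
    # This is all the hot path does: an O(1) in-memory enqueue,
    # never blocked by the database or filesystem.
    stats_queue.put({"url": url, "ts": time.time()})

for i in range(100):
    handle_request(f"/page/{i}")

stats_queue.join()        # wait until the worker has drained everything
print(len(saved))  # 100
```

The key property is that a slow consumer only makes the queue grow; request handling stays fast, and you can scale the consumer side independently.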
Any other thoughts on this? Ways to pull it off so it could scale up?
The max we will do is probably around 30 million hits per month, but still, I would like to make it as good as I possibly can. Input from anyone with analytics experience would be very much appreciated.
Joshua Silva wrote:I would like to make it as good as I possibly could.
Just an observation: that isn't really a very good spec. One can always improve things if one is willing to spend more time/money/resources. The law of diminishing returns certainly applies here.
So, come up with a specific, quantifiable spec, with actual numbers and statistics, not vague "make it better" rhetoric. That's the only way you'll know whether you've hit your target. You can certainly go back and revise the spec if you need to, but you need to have an attainable goal.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors