Can you tell me if this is an appropriate situation in which to apply Hadoop:
We are a collection of toll road operators. We each operate one or more toll roads, and we each have our own set of customers. Every customer has "arrangements to pay" information stored with the toll road operator that owns the customer account. But we want any toll road customer to be able to use any toll road in a seamless way, so that all charges, incurred on any toll road, end up on their home toll road operator account - a true interoperability scenario. To make this possible, every day, we currently exchange large flat files (several GBs) containing "arrangements to pay" data.
This "arrangements to pay" data is held within specific database tables within our own tolling systems. Some of these systems are custom built, some based on SAP, some on Oracle applications, some use SQL Server and some use Oracle database.
Is it practical to think that we could create a Hadoop cluster of this "arrangements to pay" data so that any toll road operator, at any time, could query the cluster and determine the status of a particular customer's "arrangements to pay" to find out whether, for example, their account is active and has a sufficient balance of funds?
I would be very interested to hear your views on how this might be possible.
That's an interesting problem. So, if I understand you correctly, the key gain here is to have a single source of the arrangements to pay data that any of the providers can query, which removes the need to keep this data in multiple different RDBMSs. Along the way it would ease the data sharing problem, because files would no longer have to be pushed to every partner. Is that right?
If so then this would be a great fit for Hive. You could possibly take the existing data files, push them into Hive and then use a SQL-like syntax to run reports against the data.
I see two possible wrinkles that would need more detailed thought:
1. If the query load is lots of small queries (e.g. a query per customer) then Hive, having higher latency than a transactional RDBMS, will give poor performance. But if the workload is more report-type queries like "select <payment records> from <table> where date = <date> and customer_id in <my customers>" then it would work well.
2. If you also wanted to hold the customer data and account info in Hadoop then that's more an HBase type use case, where ease of updates and low-latency query response times are more important. So you could potentially hold customer data in HBase and payment data in Hive. You'd still have the benefit of a single shared system.
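To make the first wrinkle a bit more concrete, here is a rough Python sketch of the report-type access pattern: pull the records for a whole batch of customers on one date in a single query, then do the per-customer "active and sufficient balance" check locally instead of issuing one query per customer. Everything in it is an assumption made for illustration: the table name arrangements_to_pay, its columns, the operator names and the sample rows are all made up, and sqlite3 is used only as a local stand-in so the example runs anywhere. It is not Hive and says nothing about Hive's actual performance.

```python
import sqlite3

# Hypothetical schema and sample rows; sqlite3 is only a local stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE arrangements_to_pay ("
    "customer_id TEXT, home_operator TEXT, status TEXT, "
    "balance REAL, as_of_date TEXT)"
)
rows = [
    ("C001", "OperatorA", "active", 25.00, "2024-01-15"),
    ("C002", "OperatorA", "suspended", 0.00, "2024-01-15"),
    ("C003", "OperatorB", "active", 3.50, "2024-01-15"),
    ("C001", "OperatorA", "active", 40.00, "2024-01-14"),
]
conn.executemany("INSERT INTO arrangements_to_pay VALUES (?, ?, ?, ?, ?)", rows)


def daily_report(conn, as_of_date, customer_ids):
    """Report-type query: one pass for a whole batch of customers on one date."""
    placeholders = ", ".join("?" for _ in customer_ids)
    sql = (
        "SELECT customer_id, status, balance FROM arrangements_to_pay "
        f"WHERE as_of_date = ? AND customer_id IN ({placeholders}) "
        "ORDER BY customer_id"
    )
    return conn.execute(sql, [as_of_date, *customer_ids]).fetchall()


def can_pay(record, toll):
    """Per-customer check: the account is active and the balance covers the toll."""
    _customer_id, status, balance = record
    return status == "active" and balance >= toll


# One batch query, then cheap in-memory lookups keyed by customer id.
report = daily_report(conn, "2024-01-15", ["C001", "C002", "C003"])
by_customer = {record[0]: record for record in report}
```

With this shape, a call such as can_pay(by_customer["C001"], 5.0) is answered from the batch result rather than by a fresh query, which is the kind of workload that suits a high-latency engine. If you really need a fresh lookup per customer at the toll gate, that points back towards the low-latency store described in the second point.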
In other words, my knee-jerk response is that it could be a good fit, and certainly worth some exploration if you are looking to do some rationalization or process streamlining.