I have a requirement where ETL jobs run in the backend and pull up-to-date data from various sources. The data is both structured and unstructured, and the volume is massive, in the petabyte range. From the fetched data we need to produce a consolidated data view. My questions are:
1) How do we create a consolidated data view, and what form should it take (XML, JSON, etc.)? Do we need a unified data model, given that the data is both structured and unstructured? If so, how do we go about it?
2) Since the data is so massive, how do we construct the consolidated data view? Should we build it in small chunks, or store the data in a cache or DB and then retrieve it from there to form the consolidated view?
3) Do we need a Hadoop cluster to process this data in parallel, given that it is massive and comes in both structured and unstructured formats?
4) If we have to store the data, do we need a NoSQL database to support the unstructured part?
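To make question 2 concrete, this is roughly what I mean by building the view in small chunks. It is only a minimal sketch in Python: the envelope fields (`source`, `kind`, `fields`, `text`), the function names, and JSON Lines as the output format are all my own assumptions, not a settled design, and at petabyte scale the same idea would presumably have to run on a distributed framework rather than in a single process.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def to_envelope(source: str, record: Any) -> Dict[str, Any]:
    """Wrap one record in a common envelope (a hypothetical unified model)."""
    if isinstance(record, dict):
        # Structured input: keep the fields as they are.
        return {"source": source, "kind": "structured", "fields": record, "text": None}
    # Anything else is treated as unstructured text.
    return {"source": source, "kind": "unstructured", "fields": {}, "text": str(record)}


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def consolidate(sources: Dict[str, Iterable[Any]], chunk_size: int) -> Iterator[str]:
    """Yield the consolidated view one JSON Lines chunk at a time."""
    for name, records in sources.items():
        for chunk in chunked(records, chunk_size):
            yield "\n".join(
                json.dumps(to_envelope(name, r), sort_keys=True) for r in chunk
            )


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = {"crm": [{"id": 1}, {"id": 2}, {"id": 3}], "logs": ["disk full"]}
    for part in consolidate(sample, chunk_size=2):
        print(part)
        print("--- end of chunk ---")
```

Each chunk could then be written to a file, a cache, or a DB, which is exactly the part I am unsure about.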
I have no experience with ETL and I am new to Hadoop, but on hearing 'petabytes' I think you are certainly talking about Big Data. You might want to explore and learn the area first rather than seek a direct solution.
You are asking for some very specific recommendations for a system we know next to nothing about. The answer to pretty much all of your questions is "it depends": it depends on many, many details of your system and requirements that you have not told us about, and that you could not reasonably cover in a forum post.
Given the type of system you are working on, and the type of questions you are asking, it appears to me that you are way out of your depth with this one. When you mention 'we' I'm also assuming that none of your colleagues know what to do either.
I don't think a general programming forum, even one as awesome as CodeRanch, is going to get you the assistance you need. I recommend putting your hand in your pocket and getting an expert in to help you get up and running.