We have about 10 TB (terabytes) of data stored across multiple disks. The metadata (data describing the data: filename, location, author, description, etc.) runs to a few GB (gigabytes), say 5 GB. For a web-based application, should the metadata be stored in XML files or in a database such as Oracle or MySQL?
Since the data will keep growing, scalability is required. Which approach will give better performance?
You want to extract data randomly from that 5 GB of metadata? Then don't make it a single XML document; that's completely unscalable. If the data is amenable to being modelled as SQL tables, then I would do that. If it's less structured than that, then I don't know.
Hi Paul, it will be like this: a user wants to find data matching particular criteria, e.g. all files generated between a specified start date and end date, then extract the required data and analyse it to produce statistics, generate plots, etc.
Will the database approach give good performance? Since the XML file will be large we can't use DOM, but is a SAX parser scalable, and does it give good performance?
Performance is probably not the only criterion that should decide whether to use a database. Databases provide a whole lot of things: transactions, management capabilities, the power of SQL or a similar query language, stored procedures, and the list goes on. If you prefer XML, there are many open source XML databases available today. My personal opinion is that storing the data in XML and trying to read it yourself will, for any big application, lead you to a point where you think "maybe I should have used an XML database". From what you have described, you are building some sort of reporting software that will definitely run many different types of queries over the data, so it would be a huge effort to write code to read the XML without an XML database. The problem becomes hugely complex when the data runs to terabytes, or even gigabytes. I would suggest (as Paul did) going with a relational database if possible; otherwise, at the very least, look at a good XML database. Do not venture into handling the XML yourself!
[ January 29, 2008: Message edited by: Nitesh Kant ]
Originally posted by ashish bhardwaj: It will be like this: a user wants to find data matching particular criteria, e.g. all files generated between a specified start date and end date, then extract the required data and analyse it to produce statistics, generate plots, etc.
Will the database approach give good performance?
If your data model can be described as a reasonably small collection of tables, then a database would be a good approach. Databases have indexes and views, which mean you don't have to read the entire database to find one piece of information.
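As an illustration of that point (the table and column names here are hypothetical, invented for the sketch, not taken from the thread), the date-range query described earlier becomes an indexed lookup in SQL:

```sql
-- Hypothetical schema for the file metadata described above.
CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    name      VARCHAR(255),
    location  VARCHAR(1024),
    author    VARCHAR(255),
    created   DATE
);

-- An index lets the database answer date-range queries
-- without scanning the whole table.
CREATE INDEX idx_files_created ON files (created);

-- "All files generated between a start date and an end date":
SELECT name, location
FROM files
WHERE created BETWEEN '2008-01-01' AND '2008-01-31';
```

With the index in place, the database touches only the rows in the requested range, no matter how large the table grows.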
Since the XML file will be large we can't use DOM, but is a SAX parser scalable, and does it give good performance?
The SAX method won't use up all your memory, but you still have to read the entire XML document even to extract a single data element. And if you have a query that can't easily be answered by one sequential scan of the data, then you have some hard work to do (work that almost certainly would have been one line of SQL).
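To make that concrete, here is a minimal SAX sketch (element and attribute names are hypothetical, invented for the example) that answers the date-range question. Note that even though only a few entries match, SAX still visits every element in the document on every query:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DateRangeScan {

    // Counts <file> entries whose "created" attribute falls in [from, to].
    // The equivalent SQL would be one line:
    //   SELECT COUNT(*) FROM files WHERE created BETWEEN ? AND ?;
    static int countInRange(String xml, LocalDate from, LocalDate to) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes attrs) {
                    // SAX hands us every element in the document, matching or not.
                    if ("file".equals(qName)) {
                        LocalDate d = LocalDate.parse(attrs.getValue("created"));
                        if (!d.isBefore(from) && !d.isAfter(to)) {
                            count[0]++;
                        }
                    }
                }
            });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<metadata>"
            + "<file name='a.dat' created='2008-01-10'/>"
            + "<file name='b.dat' created='2008-01-20'/>"
            + "<file name='c.dat' created='2008-02-05'/>"
            + "</metadata>";
        int n = countInRange(xml,
                LocalDate.parse("2008-01-01"),
                LocalDate.parse("2008-01-31"));
        System.out.println(n);
    }
}
```

Memory use stays flat, which is SAX's selling point, but the running time is always proportional to the whole document, and any query the scan order doesn't suit (joins, aggregations across elements) means buffering state yourself in the handler.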