This question is not related to java or jdbc in specific, but i thought there are some great IT professionals in this forum that can help me out.
I am currently working on a project for collection of weather data. The problem is with database design. I have a central server where weather data would be coming in from number of stations (clients) every hour. And i need to store this in my database. So I need a table for storing data values of sent by every client, everyday.. It goes something like this
-> the first column would of the table would be the name of parameter,
-> there would be 24 other columns for the values of these parameters at the end of each hour on that day.
This is just a table for data coming in from 1 station in a single day. I would have many clients that would feed data and the system would be running everyday. How can I handle such data ?? Any ideas ? When it comes to implementation the only option I have thought of is to create a new table through my server side application for every client as a day passes by, But that is just too inefficient i feel and I cant imagine how i would write queries for that programmatically .. I dont have any other idea .. The only thing that comes to my mind right now is a multi-dimensional database. I have studied how retrieval from multidimensional data works in theory, like the rollup, drill down, slicing etc . But i dont know what tools to use and how to implement .. I need something that is fast and efficient when retrieving the data and the data coming in would be huge ..
Please give me your suggestions on this, I would be very thankful .. I am a newbie in database design, so please be easy on me if I am being an amateur in the language I have used above
Sorry to seem unhelpful, Dhaval, but this is way too big a question to resolve through a few forum queries. It sounds like you need to find an experienced data architect to work with you on this project. I'm not sure if dimensional modelling might be overkill for this project - your local DB expert will have to advise you on that.
Here are a few thoughts, but you really should get a local DB expert to help you with this stuff.
Is the set of parameters (readings?) from each weather station reasonably fixed i.e. do you get the same 20 (?) parameters (with different values) each hour from each station?
If so, then you should probably store each set of readings from a single station as a single record in your table, with each parameter in a different column, because this will make it much easier to correlate them e.g. to see the temperature when it's cloudy at station #1. You can always add extra columns for new parameters later if necessary.
The other advantage here is that it's really easy to compare the same parameter across different stations/dates etc e.g. to do things like getting the average rainfall for a set of stations/dates, etc.
A dimensional model is very flexible, but has some costs and disadvantages. For example, if you had a dimensional model with a separate record for each Station/Date/Hour/Parameter, you might have to do a lot of fairly expensive "pivot" operations to put the logical record back together to do things like "find me the station readings where rainfall > 3cm and cloud cover <80%". These queries are much easier if you store each set of readings in a single record instead, as described above.
Also, it seems like your data is basically all "facts" and only a few dimensions i.e. most of the data is observed values (facts), with just the station/date/time as potential dimensions, and these "dimensions" are the primary key for a given set of observations anyway. So your main data table is already a "fact" table in any case.
So let's assume you keep the readings as single records for each station/date/time combination, with each parameter in a different column e.g. Rainfall, Sunshine Hours, Cloud Cover etc. The unique business key on this table would probably be something like: Station ID, Reading Date, Reading Hour (1 to 24, or 0 to 23?). You could consider having a surrogate key (numeric ID) as the primary key instead, but to be honest I'm not sure there would be any advantage in this case, as you'd still need a unique index on the business key anyway.
Station ID is a foreign key pointing to the parent Weather Station (your "client" entity?) record in a separate table. The primary key for the Weather Station should be a numeric ID (generated from a sequence or similar) i.e. a surrogate key, as this will make it easier to manage any changes to your weather station details without impacting the large volume of data in your main table of readings. You'd probably want to index the Station ID in your main data table to make it easier to find all the readings for a given station.
You might have other tables e.g. for the organisation(s) who manage individual stations, staff contact details, or whatever else you need, but these should be easy to design.
The volume of data and the kind of operations you want to do on the readings are also significant for your design. How much data will you be storing?
If you have 24 readings per day * 365 days per year then that's only 8760 readings per year from each weather station, or less than 90,000 per station over 10 years. If you have 10,000 stations, that's still less than a billion records for all your weather station readings over 10 years. This is a significant but not unmanageable amount of data for an enterprise RDBMS. Also, because the "fact" records are basically all numeric values, individual records do not take up much storage space either.
But if your team decides the volume of data is too much for a single table, then you would have to think about e.g. putting the data for each weather station in a separate table (but this makes it a little bit harder to compare/aggregate data across different stations), or data for each year in a separate table (harder to compare data from 31 December 2010 to 1 January 2011). Your approach to this question would depend on your requirements, but my instinct would still be to keep the readings together i.e. any given set of readings for a given station/date/time should still be held as a single record, whichever table it ends up in.
Either way, you would certainly need to work with your DBAs to ensure the database is optimised for your needs e.g. use of indexes, partitioning etc to improve performance and maintenance.
Table partitions are like physical "sub-tables" based on a particular key e.g. date: each set of data for a particular value of the partition key is held in a separate physical section of the table. So when you query the table for a given date, for example, the query engine knows it only has to read the relevant partition instead of searching the whole index/table. It is also possible to back up/restore partitions individually so your DBA can back up last week's data easily, without having to do anything with last year's data. On the other hand, if you do a lot of queries that hit most of your partitions, then partitioning can sometimes slow down your query performance. You'd need to talk to your data architect/DBA about this.
Meanwhile, back to summary data. If the weather station readings are fixed i.e. the readings for a particular station/time are never updated, it means that you should be able to extract and store summary (roll-up) data regularly, without having to worry about updates to the original source values. Even if the source values are occasionally updated (e.g. to correct errors in the readings), you can still design processes to manage these changes e.g. by refreshing related summary data etc.
So you might have separate tables containing e.g. average or total readings per station per month, which would make it easier for people to run queries for this information without having to trawl through all the source records every time. You would need to speak to your users to find out what queries they would want to run regularly, and design suitable tables and processes to generate and store this data regularly.
If you're using Oracle, you can use materialised views, which are "snapshots" i.e. they consist of a SQL SELECT query to define the data you want to see, and a physical store for that data, which you can refresh regularly to bring in the latest data from your source tables. If you use MVs, it's a good idea to coordinate them with your table partitioning strategy, because Oracle can do clever stuff here e.g. it can tell if it only has to refresh a single partition of data in the MV, rather than the entire data set.
You might want to think about archiving strategies as well. How long do you want to keep the readings online, or how often do you want to archive them off to a separate store (if at all)? If you archive data, how do you provide access to it if necessary? Would you bother archiving your summary data, or would you keep it available throughout the life of the system? If you've archived the original data, how do you manage things like summary tables e.g. if you need to re-build them for some reason?
Anyway, I think that's enough from me. Others may have different advice to give you, but I hope this will at least give you some idea of the issues you need to talk about with your local DB expert.
Thank you soo much Chris, Thanks for the awesome reply
I guess I would stick with RDBMS rather than introducing the complexity of multidimensional database ... Never thought of archving the data before .. I don't need the data older than a month, so archiving the data would be a great Idea to make the data less bulky .. Thanks for the great suggestion .. Even storing the summary data would be awesome, because it might be accessed at a later stage .. And I can also think about other options you have mentioned .. But for now RDBMS and archiving should do it
You were spot on in your suggestions sir, Thank you very much