I have a system where certain client actions are logged to a central database for statistical reporting. The problem is that a small number of users can generate tens of thousands of small records a day; within a month we're adding a million records to a single table.
My question is: what's the best design pattern to handle this kind of usage? There are a few solutions I can think of, but some guidance from those who have experienced this kind of pattern would be nice:
1. Leave it as it is. Eventually backing up/restoring the database becomes a time-consuming operation, even though the disk space isn't growing as fast. As mentioned, the system definitely slows down as the table grows.
2. Wipe the table to disk daily. A good solution, although statistical reporting then becomes a process of merging text files. The advantage is that the table stays small in the long run.
3. Normalization/schema changes. I've considered consolidating database records via normalization, although this would create greater database contention and the number of records would still be the same.
4. Disconnect the statistics database from the operations database. A good solution for clarity of data, but if the two databases are on the same machine and the operations are synchronous, it doesn't really solve the problem.
[ April 16, 2008: Message edited by: Scott Selikoff ]
What types of things are you auditing? Queries by user? Logins? Things other than JDBC? Is this for performance or auditing reasons? Some databases have built-in auditing capabilities.
In the past I have aggregated older statistics and was then able to delete all but the most recent details. If you don't want to aggregate, you could simply remove any data older than a month, or archive it.
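To make the aggregate-then-delete idea concrete, here's a minimal in-memory sketch in Java: many raw events collapse into one count per (user, day) key before the raw rows are discarded. The class and method names (DailyAggregator, aggregate) and the key format are invented for illustration; in practice this would be a GROUP BY in the database rather than in application code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: roll raw event rows up into daily counts.
public class DailyAggregator {

    // Each entry in the input list stands for one raw event row,
    // keyed by something like "user:day". The result holds one
    // summary count per key -- the only data kept after the purge.
    static Map<String, Integer> aggregate(List<String> userDayKeys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : userDayKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "alice:2008-04-16", "alice:2008-04-16", "bob:2008-04-16");
        // Three raw rows become two summary rows.
        System.out.println(aggregate(raw));
    }
}
```

A user who generated 100,000 raw rows in a day shrinks to a single summary row per day, which is why the table stays small.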
Yeah, the more I think about it, performing daily/weekly aggregate builds and then wiping the table is probably the best approach. The problem right now is that we're working under a "record everything, we'll sort it out later" mentality, which needs to be fleshed out better since recording everything is too expensive.
Let's embrace that one person can generate 100,000 events in one day.
Let's embrace that we need to preserve the data for at least three months (and then it will be archived, or whatever).
First you have a row that identifies this person on this day, with an ID for that row. Then, in another table, you have two columns: one for the matching ID and one for an integer that encodes both the time of day and the type of event. There are 86,400 seconds in a day, which leaves room for hundreds of event types in a four-byte int.
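The packing scheme above can be sketched in a few lines of Java. This is just one way to do it (event type in the "days" slot: type * 86400 + second-of-day); the names EventCodec, encode, decodeType, and decodeSecond are made up for illustration.

```java
// Hypothetical sketch of packing a (time-of-day, event-type) pair
// into one 4-byte int, per the scheme described above.
public class EventCodec {
    static final int SECONDS_PER_DAY = 86400;

    // A signed 32-bit int tops out at 2^31 - 1, so 2147483647 / 86400
    // leaves room for over 24,000 event types -- hundreds is comfortably safe.
    static int encode(int eventType, int secondOfDay) {
        return eventType * SECONDS_PER_DAY + secondOfDay;
    }

    static int decodeType(int packed) {
        return packed / SECONDS_PER_DAY;
    }

    static int decodeSecond(int packed) {
        return packed % SECONDS_PER_DAY;
    }
}
```

The detail table then needs only the matching ID column plus this one int column per event.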
If you really need to put the squeeze on, you could log this to a file. The file name could represent the unique ID, and the file contents could be a series of 4-byte ints. One day for one user doing 100,000 things would be 400k, maybe 40k compressed.
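A hedged sketch of that file variant, using the standard DataOutputStream to get exactly 4 bytes per event (the class name and helpers here are invented; the file-naming convention is whatever your ID scheme dictates):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch: one file per user per day, a flat run of
// packed 4-byte ints as described above.
public class EventFileWriter {

    static void write(File file, int[] packedEvents) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (int e : packedEvents) {
                out.writeInt(e); // exactly 4 bytes, big-endian
            }
        }
    }

    // 100,000 events * 4 bytes = 400,000 bytes, matching the 400k estimate.
    static long sizeFor(int eventCount) {
        return 4L * eventCount;
    }
}
```

Since the ints are highly repetitive (same user, clustered event types), running the file through gzip is what gets you toward the 40k figure.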