I am working on an application where I have to fetch some information that is global to the organization.
To get this global data, there are services (called integration layer services) that my application needs to call.
The problem:
- This data is widely used in my application, particularly in my core module, which is the module end users will be accessing all the time.
- The data is huge (it may be thousands of records) and it is not as clean as we need, so my application has to process the data returned by the integration layer service.
- In some cases, processing the data means making an extra call to the integration layer for each record to get additional information, which is a huge overhead and can become a performance bottleneck.
Possible solutions:
- Rehosting the data: not preferred at all because of data integrity; no one wants to manage the same data in two places.
- Caching: I am not quite sure how to do this.
  - In a flat file: expensive I/O read/write operations.
  - In a database: same problem as rehosting.
  - In memory, in application scope: I think the system could slow down.
With all of the above approaches, synchronization and data integrity would be the major problems.
In such a scenario, how should I design my application?
Well, I don't know what the best solution would be. Actually, I'd think that you will only know whether a solution works appropriately after you have tried it...
So the only real advice I know to give is to isolate your decision from the rest of the code. Put your caching/distribution/whatever strategy into an isolated layer, and don't expose the implementation details to the rest of the code, so that you can switch to a different strategy later on.
In fact, I'd probably start with a "naive" implementation: do the simplest thing that could possibly work (direct access of the information you need, when you need it, but always through the layer), without thinking much about how it could be optimized. You might be surprised how well it works. And if it doesn't, at least you have something working to measure and profile, so you can base your decisions on real data, and something you can inject different strategies into to experiment with.
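The "always through the layer" idea can be sketched like this. The names here (`GlobalDataService`, `findRecord`, the stub body) are invented for illustration; in the real application the direct implementation would call the integration layer service.

```java
import java.util.Map;

// The rest of the application only ever talks to this interface,
// so the strategy behind it (direct call, cache, local DB) can change freely.
interface GlobalDataService {
    Map<String, String> findRecord(String id);
}

// Naive first cut: go to the source every time, no caching at all.
class DirectGlobalDataService implements GlobalDataService {
    public Map<String, String> findRecord(String id) {
        // In the real system this would be the integration layer call;
        // here it is simulated with a stub lookup.
        return Map.of("id", id, "name", "stubbed-" + id);
    }
}
```

Later, a caching implementation of the same interface can be dropped in without touching any calling code.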
Does that help?
Your suggestion gave me confidence. My current design is straightforward: get the information, process it, and give it to the client. As you suggested, I will implement it and test it to see how well (or badly) it goes.
I was wondering how caching works and how it keeps the cached data synchronized with the original data.
Ilja's advice is good advice. The important part is to "abstract out" the code that actually gathers the data. From your application, make calls to some data access layer (e.g. a Data Access Object, or DAO) that handles the acquisition of your data. To start with, the DAO will probably just make direct calls and might very well be a bottleneck. If it is, you can address that without having to change any of your calling code.
As far as caching goes, I think how it's implemented is really up to you. You could periodically pull down large amounts of data and store it in a local database. I know you mentioned that you didn't want to maintain the data in two places, but as long as this local cache is considered "temporary" storage, it requires very little maintenance.
Perhaps you could store the data in memory by making an object that reflects the structure of the data you're pulling down. When someone requests data, you can check whether you already have it. If not, you go get it; if you do, you simply return it. Depending upon the amount of data you're pulling down, this could turn out to be a real memory hog. Is your application going to run on a dedicated machine? Will it be running in the background with a lot of other apps?
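That "check, fetch if missing, otherwise return" pattern is a read-through cache. A minimal sketch, assuming the expensive integration layer call can be wrapped in a `Function` (the class and method names here are invented):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Read-through cache: returns a cached value if present,
// otherwise invokes the loader (the expensive remote call) exactly once per key.
class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // computeIfAbsent does the "check, fetch if missing" step atomically.
        return cache.computeIfAbsent(key, loader);
    }

    int size() {
        return cache.size();
    }
}
```

Note this sketch never evicts or refreshes entries, which is exactly where the synchronization concern comes in; eviction policy and invalidation would have to be layered on top.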
Like Ilja said, you might not know which solution is best (I'm sure there are plenty that I didn't even touch on) until you actually try it.
Going along with what Ilja said: besides isolating the data access from your code, you might want to add another layer that allows you to switch between data access implementations easily (perhaps via an external file). That way, if you build the implementation one way and it doesn't work well, you can build a new implementation (without changing the original one) and easily swap between the two for performance testing.
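One way to make the implementation swappable via an external file is a small factory that reads a property. The property name `dao.strategy` and the two strategies are invented for illustration; in practice the `Properties` object would be loaded from a `.properties` file on the classpath:

```java
import java.util.Properties;

interface DataAccess {
    String fetch(String id);
}

class DirectAccess implements DataAccess {
    public String fetch(String id) { return "direct:" + id; }
}

class CachedAccess implements DataAccess {
    public String fetch(String id) { return "cached:" + id; }
}

class DataAccessFactory {
    // Pick the implementation from configuration, defaulting to direct access,
    // so switching strategies needs no code change, only a config edit.
    static DataAccess fromConfig(Properties config) {
        String strategy = config.getProperty("dao.strategy", "direct");
        return "cached".equals(strategy) ? new CachedAccess() : new DirectAccess();
    }
}
```

With this in place, A/B performance testing between the two strategies is just a matter of flipping one line of configuration.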
I had one more thought this morning while I was driving to work. If you'd like to get data into a local cache but you're concerned about hogging too much processing time, you might want to think about using a daemon thread to handle data acquisition in the background.
With a little judicious use of the processor, you could potentially create a thread that is constantly acquiring and updating data without causing significant lag in the user interface of your application (or any other applications on the machine).
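A sketch of that idea, using a scheduled executor whose worker thread is marked as a daemon (the refresh interval and the `refresh()` body are placeholders; the real version would pull from the integration layer):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class BackgroundRefresher {
    private final AtomicReference<String> snapshot = new AtomicReference<>("empty");
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "cache-refresher");
            t.setDaemon(true); // daemon thread: won't keep the JVM alive on shutdown
            return t;
        });

    void start(long periodMillis) {
        // Periodically refresh in the background; readers never block on the fetch.
        scheduler.scheduleAtFixedRate(this::refresh, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    void refresh() {
        // Stand-in for the integration layer call that rebuilds the snapshot.
        snapshot.set("refreshed@" + System.currentTimeMillis());
    }

    String current() {
        return snapshot.get();
    }
}
```

Readers always get the latest completed snapshot from the `AtomicReference`, so the UI never waits on the slow fetch.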
I am not quite sure how threads work in a clustered environment. Maybe I will have to do some research on this. Also, the software could potentially be implemented using EJB, and I gather it is not recommended to manage threads from EJBs, so I am not sure how to go about it.
I guess the client application that uses this data would need a thread to monitor changes. If it is a web application, you can do that. But if it is session beans (I would seriously doubt it would be entity beans) that need this cached data and would have to fork threads, then your worry is justified.
Also, asynchronous messaging could be one way to deliver cache notifications to the client app, so the CacheManager can update the cache to keep it in sync with the global data. Of course, we would end up having a "monitoring thread" on the global-data side that periodically "publishes" the changes to "the client channel"...
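A minimal in-process stand-in for that publish/subscribe arrangement (all names here are invented; in a real deployment the channel would be something like a JMS topic, and the publisher would be the monitoring process on the global-data side):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Subscribers are told which key changed so they can drop the stale entry.
interface CacheInvalidationListener {
    void invalidated(String key);
}

// Stand-in for the messaging channel: fans each change out to all subscribers.
class ChangeChannel {
    private final List<CacheInvalidationListener> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(CacheInvalidationListener l) { subscribers.add(l); }

    void publish(String changedKey) {
        for (CacheInvalidationListener l : subscribers) {
            l.invalidated(changedKey);
        }
    }
}

// The client-side cache: evicts an entry when notified, so the next read
// goes back to the global data and picks up the fresh value.
class CacheManager implements CacheInvalidationListener {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    void put(String key, String value) { cache.put(key, value); }
    String get(String key) { return cache.get(key); }

    public void invalidated(String key) { cache.remove(key); }
}
```

The appeal of this scheme is that the cache only ever evicts, never writes back, so the "two places to manage the data" objection mostly goes away: the global data stays the single source of truth.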