Profiling our app has revealed that a significant chunk of time is spent in database calls (all of them are basic select SQLs). We do have caching enabled for smaller tables, but we don't have caching for very large tables (such as the ones containing 1,000,000+ records). At this point, I am looking at some caching mechanism that would allow to replace the cached items, so that the maximum cache size remains under control. Can someone point to some code that implements some of these mechanisms (LRU, LFU, FIFO, etc)?
Originally posted by Jeanne Boyarsky: Not really about caching, but do you have indexes on those large tables?
And even if you have, are you sure they get used? We recently discovered that for some of our SQL statements, Oracle was using the "wrong" index. Giving it a hint to use a different one increased performance by huge amounts!
The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Joined: Oct 08, 2001
[b]Not really about caching, but do you have indexes on those large tables?[b] Yes, all tables are properly indexed and up to date. But since the tables are large, each query takes from 100ms to 300ms, and with the thousands of queries that we perform to do the calculations, the cost of database I/O is prohibitive.
Is there any pattern to the use of these large tables that can give a hint as to the type of caching that would be fruitful. Can you record the actual query statements from a typical run for later analysis to look for patterns? Bill
I think before you dive into implementing a cache, you might want to read up on various caching strategies. For example, LRU (throwing out the least recently used item when the cache gets full) is a good strategy for certain types of accesses, and is so bad that it actually makes performance worse for other types of accesses. In particular, it tends to be bad for data that is accessed sequentially: for example, if you are searching a database table linearly through the rows, an LRU cache smaller than the entire table guarantees that the first row of the table is thrown out by the time you finish your search, so you never get any benefit from the cache, and it still costs overhead. On the other hand, LRU can be better for cases where there's no natural ordering of the data. When evaluating which caching strategies are right for your situation, you'll need to know more about how how your data is accessed. Indexed access to data follows different patterns than access to unindexed data, for example, so you might want to know which are taking up the bulk of the time in the queries - note that a small amount of unindexed searching can still take up much more time than a large amount of indexed searching. You might also find out that something else might do you more good than caching database data - for example, caching query results, giving the database hints about the query plans, adding yet another index, or even switching to a different database that is more appropriate to your needs. To start with, you might tell us what database you are using, and what kind of queries are taking so much time.
Joined: Jul 11, 2001
Originally posted by Eugene Kononov: [b]Not really about caching, but do you have indexes on those large tables?[b] Yes, all tables are properly indexed and up to date. But since the tables are large, each query takes from 100ms to 300ms, and with the thousands of queries that we perform to do the calculations, the cost of database I/O is prohibitive.
What kind of calculations are those? Could you move some to the DB server (for example by using Java Stored Procedures)?
Joined: Oct 08, 2001
William: Is there any pattern to the use of these large tables that can give a hint as to the type of caching that would be fruitful. Can you record the actual query statements from a typical run for later analysis to look for patterns? Yes, we did perform some analysis. Here is some background. The large tables store the rates that are used to caclulate the premiums for life/disability insurance policy. The rate is determined by the prospect gender, occupation type, age, state, and type of policy. All the possible permutations of these factors yield millions of different database records that contain the rate. We do have some types of policy, occupation types, and the states that are referenced more frequently than the others, but that is not invariant. Based on that, I thought that LFU cache eviction policy may be a good fit. Is there some code samples for that out there? Warren: I think before you dive into implementing a cache, you might want to read up on various caching strategies. For example, LRU (throwing out the least recently used item when the cache gets full) is a good strategy for certain types of accesses, and is so bad that it actually makes performance worse for other types of accesses. In particular, it tends to be bad for data that is accessed sequentially: for example, if you are searching a database table linearly through the rows, an LRU cache smaller than the entire table guarantees that the first row of the table is thrown out by the time you finish your search, so you never get any benefit from the cache, and it still costs overhead. On the other hand, LRU can be better for cases where there's no natural ordering of the data. I've got various books on Java Performance, but they all mentioned caching strategies in passing. Got any useful references/URLs? Ilja: What kind of calculations are those? Could you move some to the DB server (for example by using Java Stored Procedures)? The calculations are insurance related, highly complex and distributed, about 1 million lines of code in various components. It would be very impractical to embed them into stored procedures.
Joined: Mar 04, 2004
Okay, that helps. I'm guessing that access to table records is mostly random, so you won't have the problem with scanning type accesses that I mentioned in my previous post. Also, it seems like any nonuniformity of access might be relatively small, so you might not see a huge benefit from caching - but even a small benefit might be worthwhile given the size of your application. It's not clear from what you say whether there is likely to be much temporal clustering - that is, if a record is accessed, whether that means it's likely to be accessed again fairly soon; in your case, temporal clustering might result from the various modules independently accessing the same record for a calculation. I've found some references on line to some, but not all, of the common caching policies. I'll summarize them and give the references where available. FIFO (first in first out): I haven't found any documentation, but I believe this just adds items to the cache as they are accessed, putting them in a queue or buffer and not changing their location in the buffer; when the cache is full, items are ejected in the order they were added. Cache access overhead is constant time regardless of the size of the cache. The advantage of this algorithm is that it's simple and fast; it can be implemented using a simple array and an index. The disadvantage is that it's not very smart; it doesn't make any effort to keep more commonly used items in cache. Summary for FIFO: fast, not adaptive, not scan resistant LRU - (least recently used): Items are added to the cache as they are accessed; when the cache is full, the least recently used item is ejected. This type of cache is typically implemented as a linked list, so that an item in cache, when it is accessed again, can be moved back up to the head of the queue; items are ejected from the tail of the queue. Cache access overhead is again constant time. This algorithm is simple and fast, and it has a significant advantage over FIFO in being able to adapt somewhat to the data access pattern; frequently used items are less likely to be ejected from the cache. The main disadvantage is that it can still get filled up with items that are unlikely to be reaccessed soon; in particular, it can become useless in the face of scanning type accesses. Nonetheless, this is by far the most frequently used caching algorithm. Summary for LRU: fast, adaptive, not scan resistant LRU2 - (least recently used twice): Items are added to the main cache the second time they are accessed; when the cache is full, the item whose second most recent access is ejected. Cache access overhead increases logarithmically with cache size, which can be a disadvantage. In addition, accesses have to be tracked for some items not yet in the cache. There may also be a second, smaller, time limited cache to capture temporally clustered accesses, but the optimal size of this cache relative to the main cache depends strongly on the data access pattern, so there's some tuning effort involved. The advantage is that it's adaptive, like LRU, and in addition won't fill up from scanning accesses, since items aren't retained in the main cache unless they've been accessed more than once. Summary for LRU2: not especially fast, adaptive, scan resistant 2Q - (two queues): Items are added to an LRU cache as they are accessed. If accessed again, they are moved to a second, larger, LRU cache. Items are typically ejected so as to keep the first cache at about 1/3 the size of the second. This algorithm attempts to provide the advantages of LRU2 while keeping cache access overhead constant, rather than having it increase with cache size. Published data seems to indicate that it succeeds. Google has a cached HTML version of a paper on it at http://126.96.36.199/search?q=cache:5Lr6l4z-PnkJ:www.cs.nyu.edu/cs/faculty/shasha/papers/vldb94_3.ps - which includes a link to the postscript version of the paper. Summary for 2Q: fairly fast, adaptive, scan resistant LFU - (least frequently used): Frequency of use data is kept on all items. The most frequently used items are kept in the cache. Because of the bookkeeping requirements, cache access overhead increases logarithmically with cache size. The advantage is that long term usage patterns are captured well, incidentally making the algorithm scan resistant; the disadvantage, besides the larger access overhead, is that the algorithm doesn't adapt quickly to changing usage patterns, and in particular doesn't help with temporally clustered accesses. Summary for LFU: not fast, captures frequency of use, scan resistant ARC - (adaptive replacement cache): I haven't read up on this one carefully, but it's billed by some IBM researchers as having the advantages of LFU while keeping cache access overhead constant and being more adaptive. There's a web page at http://www.almaden.ibm.com/StorageSystems/autonomic_storage/ARC/index.shtml - with links to a couple of papers in pdf format. The first paper ("ARC: A Self-Tuning, Low Overhead Replacement Cache") also includes a more detailed discussion of some of the other caching strategies than I've provided, and some useful tables giving overhead times and hit ratios as functions of cache size and some other parameters. Summary for ARC: fairly fast, adaptive?, scan resistant As you mentioned, LFU would seem at first glance to be appropriate for your application. The potential issues are that it might not adapt well to variations in patterns of use, it doesn't handle temporal clustering well if that is an issue, and it's slower than some of the other algorithms. You might want to consider LRU and 2Q as possible alternatives. In terms of available source code, here's a web page providing an open source Java caching library implementing FIFO, LRU, and LFU and available under the LGPL (Gnu library license): http://jocache.sourceforge.net/lesser.html Finally, it looks to me like that big table may perhaps be read a lot more than it's written to? Traditional databases like Oracle have transaction features that are optimized for comparable numbers of reads and writes. If that table is read only, you might want to consider a database like MySQL that sacrifices some of the built in transaction capabilities in favor of much faster reads. Hope that all helps, Warren [ March 10, 2004: Message edited by: Warren Dew ]