What tools and techniques does Mondrian use to scale up, such as caches?
And where does it fit into a Big Data platform, especially with respect to Hadoop, Hive, etc.?
How easy is it to plug Mondrian into other Big Data tools?
I'll answer in two parts. First, the scaling. Mondrian has two general approaches to scaling (Chapter 7). The first is aggregate tables, which are tables that hold pre-aggregated data. For example, suppose you store sales facts at the hourly level, but you usually analyze them at the daily or weekly level. You can create an aggregate table that Mondrian uses for queries at those levels. This reduces the amount of data the database has to read and return.
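To make the hourly versus daily example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (sales_fact_hourly, agg_sales_daily, fact_count) are made up for illustration and are not taken from the book or from any particular Mondrian schema. It only shows the idea of rolling up a fact table, not how Mondrian itself picks an aggregate table.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact_hourly (
    sale_date TEXT, sale_hour INTEGER, store_id INTEGER, amount REAL
);
-- Hypothetical aggregate table: one row per day and store.
CREATE TABLE agg_sales_daily (
    sale_date TEXT, store_id INTEGER, amount REAL, fact_count INTEGER
);
""")

# Sample data: 2 days x 24 hours x 2 stores, 10.0 per row.
rows = [
    (f"2013-01-{day:02d}", hour, store, 10.0)
    for day in (1, 2)
    for hour in range(24)
    for store in (1, 2)
]
conn.executemany("INSERT INTO sales_fact_hourly VALUES (?, ?, ?, ?)", rows)

# Populate the aggregate by rolling the hourly facts up to the daily level.
conn.execute("""
INSERT INTO agg_sales_daily
SELECT sale_date, store_id, SUM(amount), COUNT(*)
FROM sales_fact_hourly
GROUP BY sale_date, store_id
""")

# The same daily question answered from each table.
daily_sql = "SELECT sale_date, SUM(amount) FROM {} GROUP BY sale_date ORDER BY sale_date"
daily_from_fact = conn.execute(daily_sql.format("sales_fact_hourly")).fetchall()
daily_from_agg = conn.execute(daily_sql.format("agg_sales_daily")).fetchall()

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact_hourly").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM agg_sales_daily").fetchone()[0]

print(daily_from_fact == daily_from_agg)  # same answer from both tables
print(fact_rows, agg_rows)                # far fewer rows to read in the aggregate
```

Both queries give the same daily totals, but the aggregate table answers from a small fraction of the rows, and that gap grows with the size of the fact table.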
The second technique is caching. Mondrian caches the schema, members, and segments (the things that make up an aggregate). This means that once data has been queried, it is kept in memory and later queries can reuse it. Additionally, Mondrian supports external caches, such as Infinispan, that allow very large amounts of data to be stored in memory with persistence and failover.
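As a rough illustration of the segment caching idea, the sketch below memoizes aggregated values by measure and the set of constraints that define the slice. The SegmentCache class, its get method, and slow_loader are hypothetical names invented for this example. This is not Mondrian's actual cache implementation or API, just the general pattern of "query once, then serve from memory".

```python
from typing import Callable, Dict, FrozenSet, Tuple

# A cache key is the measure plus the (column, value) constraints of the slice.
SegmentKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class SegmentCache:
    """Hypothetical in-memory cache of aggregated values (illustration only)."""

    def __init__(self, loader: Callable[[str, Dict[str, str]], float]) -> None:
        self._loader = loader  # called only on a cache miss
        self._segments: Dict[SegmentKey, float] = {}
        self.hits = 0
        self.misses = 0

    def get(self, measure: str, constraints: Dict[str, str]) -> float:
        # frozenset makes the key independent of the order of the constraints.
        key = (measure, frozenset(constraints.items()))
        if key in self._segments:
            self.hits += 1
        else:
            self.misses += 1
            self._segments[key] = self._loader(measure, constraints)
        return self._segments[key]


database_calls = []


def slow_loader(measure: str, constraints: Dict[str, str]) -> float:
    # Stand-in for an expensive SQL query against the warehouse.
    database_calls.append((measure, dict(constraints)))
    return 480.0


cache = SegmentCache(slow_loader)
first = cache.get("amount", {"year": "2013", "store": "1"})
second = cache.get("amount", {"store": "1", "year": "2013"})  # same slice, served from memory

print(first, second, len(database_calls))
```

An external cache follows the same pattern, except that the dictionary is replaced by a store that lives outside the Mondrian process and can be shared, persisted, or replicated.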
I'll tack on the response to Hadoop/Hive. We cover how Mondrian fits in with Big Data systems in Chapter 11. In that chapter we note that Mondrian has experimental Hive support. However, given the latency of even the most basic Hive queries (for example, generating the list of values for the "year" column), the overall performance will always be lackluster for direct access with an engine like Mondrian. The work on Impala, Drill, etc. will improve this (making simple queries fast, and longer queries longer) over time.
Thanks, Bill. I am now interested to know more about how the level-based, on-demand structure works. I ask because I have faced situations in BI reporting where this was the structure that was required but was not available.
And Nicholas, thanks for touching on the latency issue. I am not familiar with Impala, but I am eager to see how Mondrian plugs in with Drill.