• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Concurrency issue for dimensional tables

 
Rancher
Posts: 2759
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Not sure, if this is the correct place for this question

We have an OLAP database with a typical star shaped schema. It has several dimension tables and some fact tables. Dimension tables have values for the dimensions, and fact tables have FKs to the dimension tables. The values in the dimension tables have to be unique, otherwise the reports look messy

Now, we have this grid distributed application that does some massive computation and puts data into the OLAP tables. The input data gets split into batches, and the batches are sent to workers on the grid. The workers basically do some processing and then insert data into the fact tables. One of the challenges is the Dimension tables aren't prepopulated. So, if a worker sees an output that requires a new record in the dimension table, it has to insert into the dimension table itself.. AND the records in the dimension table should be unique.

Each worker basically does this (assuming for simplicity's sake that there is only one column in the fact table.. and only one dimension




When we insert into dimension table we do this query


When we insert into fact table, we do this



Each batch is executed in it's own transaction.

Again, this is a simple example. Actually, there are several dimension tables, and we insert into each dimension table if required, and do one insert into the fact table that joins all the dimension tables

Ok, so the problem is that when 2 differrent threads are concurrently executing batches that produce Output with the same value. Both threads insert a record into the DImension table. The changes are isolated because we haven't committed the transaction yet. Then both threads insert a record into fact table.. first record in fact table has the FK to first record in dimension table.. second record in fact table has a FK to the second record. Then both commit, which results in duplicate records in the dimension table. Now, a third batch starts up. This third batch doesn't insert any records in dimension table,. but inserts 2 records into fact table! Actually there are other constraints on the fact table that prevent us from have dup records, so third batch fails


So, my main problem is how do I do this concurrently without adding (too many) bottlenecks

 
reply
    Bookmark Topic Watch Topic
  • New Topic