Not sure, if this is the correct place for this question
We have an OLAP database with a typical star shaped schema. It has several dimension tables and some fact tables. Dimension tables have values for the dimensions, and fact tables have FKs to the dimension tables. The values in the dimension tables have to be unique, otherwise the reports look messy
Now, we have this grid distributed application that does some massive computation and puts data into the OLAP tables. The input data gets split into batches, and the batches are sent to workers on the grid. The workers basically do some processing and then insert data into the fact tables. One of the challenges is the Dimension tables aren't prepopulated. So, if a worker sees an output that requires a new record in the dimension table, it has to insert into the dimension table itself.. AND the records in the dimension table should be unique.
Each worker basically does this (assuming for simplicity's sake that there is only one column in the fact table.. and only one dimension
When we insert into dimension table we do this query
When we insert into fact table, we do this
Each batch is executed in it's own transaction.
Again, this is a simple example. Actually, there are several dimension tables, and we insert into each dimension table if required, and do one insert into the fact table that joins all the dimension tables
Ok, so the problem is that when 2 differrent threads are concurrently executing batches that produce Output with the same value. Both threads insert a record into the DImension table. The changes are isolated because we haven't committed the transaction yet. Then both threads insert a record into fact table.. first record in fact table has the FK to first record in dimension table.. second record in fact table has a FK to the second record. Then both commit, which results in duplicate records in the dimension table. Now, a third batch starts up. This third batch doesn't insert any records in dimension table,. but inserts 2 records into fact table! Actually there are other constraints on the fact table that prevent us from have dup records, so third batch fails
So, my main problem is how do I do this concurrently without adding (too many) bottlenecks