Thursday, February 25, 2010

Oracle MERGE, type 2 dimension

I decided to look into this topic while learning SQL Server Integration Services and debating with my caffeine soaked brain if the slowly changing dimension task container would be more efficient than a T-SQL MERGE statement. I began wondering why the hell we were planning on shipping about 40 gigabytes per day out of Oracle into tables in SQL Server that are truncated (emptied) and rebuilt then query to determine what 2 kilobytes worth of changes took place that day. Why on earth would we want to stress out that multi-core, refrigerator sized server when an underpowered server and network and storage can do all the work?

Fact tables and dimensions are part of data warehousing. A slowly changing dimension is a dimension that slowly changes over time (duh). There are three types of slowly changing dimensions:

  • Type 1: Update the old values with the new values (do not give a crap about history)
  • Type 2: Create a new row in the dimension with a new primary (surrogate) key, and optionally update existing rows
  • Type 3: Update the old values with the new values, and add additional data to the table (date of the change, keep one level of history)

You would figure type 3 would come before type 2 but who cares..? This dimension uses a generated surrogate key for uniqueness (a sequence) and dates to keep track of when the row was effective with a NULL end date indicating this row is the current version of the foreign primary key value and all others contain historical data and attributes.

The same logic used to maintain slowly changing dimensions can also be applied to tracking historical changes to tables.

The code that follows assumes only additions and changes are taking place; after all nobody ever deletes data do they? The dimension table is simple, no clustering, no partitioning, bit-map indexes, or compression, and is purely for example:

CREATE TABLE crap_dimension
( dim_pk_surrogate NUMBER
, dim_version_beg_date DATE
, dim_version_end_date DATE
, dim_pk_foreign VARCHAR2(10)
, dim_description VARCHAR2(60)
, dim_bulb VARCHAR2(05)
, dim_sum VARCHAR2(08)
)

The MERGE SQL statement is best used for inserts and updates. Here is the single statement used for the entire dimension update:

MERGE /*+ APPEND */
INTO crap_dimension m
USING (
--- Step 1 BEG :: Select new and changed data for the USING section of the MERGE
SELECT
DECODE(v2.dim_upsert_level,1,v2.dim_pk_surrogate,0) AS dim_pk_surrogate
, v1.dim_pk_foreign
, v1.dim_description
, v1.dim_bulb
, v1.dim_sum
FROM (
--- Step 1a BEG :: Collect data from your OLTP or ODS or POS
SELECT
crap_id AS dim_pk_foreign
, crap_desc AS dim_description
, crap_poop AS dim_bulb
, crap_turd AS dim_sum
FROM crap_oltp_or_ods_source
--- Step 1a END
--- Step 1b BEG :: Remove identical data attributes already in the dimension
MINUS
SELECT
, dim_pk_foreign
, dim_description
, dim_bulb
, dim_sum
FROM crap_dimension
WHERE dim_version_end_date IS NULL
--- Step 1b END
) v1
--- Step 1c BEG :: two rows are needed for updates, update existing and insert new data, LEFT JOIN cartesian product of
LEFT JOIN (
SELECT
dim_pk_surrogate
, dim_pk_foreign
, dim_upsert_level
FROM crap_dimension
, (SELECT LEVEL AS dim_upsert_level FROM DUAL CONNECT BY LEVEL BETWEEN 1 AND 2)
WHERE dim_version_end_date IS NULL
) v2 ON v2.dim_pk_foreign = v1.dim_pk_foreign
--- Step 1c END
--- Step 1 END
) u
ON (m.dim_pk_surrogate = u.dim_pk_surrogate)
--- Step 2 BEG :: If update, mark the end date
WHEN MATCHED THEN UPDATE
SET m.dim_version_end_date = SYSDATE
--- Step 2 END
--- Step 3 BEG :: Insert new data and the new versions of changed data
WHEN NOT MATCHED THEN INSERT
VALUES (
dim_pk_surrogate_seq.NEXTVAL
, SYSDATE
, NULL
, u.dim_pk_foreign
, u.dim_description
, u.dim_bulb
, u.dim_sum
)
--- Step 3 END
;

I tried to break down the statement into steps.

Step 1a collects data from your ODS or OLTP and can be a simple query or a gigantic, complicated parallel query with a hundred joins and aggregate steps. The important thing to remember is we only want to do this expensive step once. When 1a completes we will have a fairly large result set.

Step 1b subtracts all identical data attribute rows that already exist in the dimension from the previous result set. This should significantly reduce the working set of rows at the expense of a full table scan through our dimension. When 1b completes the working set will contain new data and changed data from the source tables.

Step 1c creates data for updates. If data in the source has changed then two rows will be needed for the dimension; one row to update the existing data making the row historical and one row to insert for the new data making it the current row. The surrogate and foreign primary keys for all current rows are Cartesian joined to values 1 and 2 (one full table scan then hash join instead of two full table scans) and then left joined to the previous result set. New data from the source will not match data and will remain as one row with a NULL surrogate primary key. Changes from the source will match and join to the two rows in the query creating the two rows needed to do the update and insert. The surrogate primary key will be 0 for inserts (assuming the sequence started at 1 or higher) and valued for updates.

Step 1 finishes with rows containing changes to be applied to the dimension. Step 2 updates historical rows simply setting the end date to the current date. Step 3 inserts new data from the source and the new current version of the row for changed data.

More virus crap

It had been a while since I went on a tour through the blogosphere using the next blog link at the top of the page. After stumbling through a lot of foreign crap in languages I did not understand I come to a blog that starts the hard drive going nuts. Oh F.

When the hard drive goes nuts you usually have a virus or something nasty trying to do something to your computer. AVG caught and probably stopped and cleaned the malicious crap. I was going to post some Oracle MERGE code but now I have to wait until a full system scan completes and bitch about things from my netbook.

So here I am watching Olympic curling (Canada versus Switzerland, ladies) and typing this on my netbook with cats jockeying for space on my lap and on the couch around me. I keep getting the same thought in my head: why the hell am I still blogging? Seriously. Why do I share code and my life with the world? Most of my friends have moved on to do their belly-aching in status updates and micro-blogging. Any therapeutic value from blogging can easily be replaced by a good beer or six (Sam Adams Winter Ale, I have a stockpile of it). 

For the untrusted Internet I need to look into programs like Sandboxie or running web browsers inside virtual machines. When I purchase a new computer I will likely be running Linux which still has equal levels of threat, but it seems like fewer viruses are written to exploit Linux and it seems safer.

As I watch the women's Olympic ice hockey gold medal game I wonder if this will be my last blogger rant. Maybe I will start using Google Buzz and join others with one-line status updates like "my computer might be frakked" or "I really hate Microsoft again".