I decided to look into this topic while learning SQL Server Integration Services and debating with my caffeine-soaked brain whether the slowly changing dimension task container would be more efficient than a T-SQL MERGE statement. I began wondering why the hell we were planning on shipping about 40 gigabytes per day out of Oracle into SQL Server tables that are truncated (emptied) and rebuilt, then queried to determine what 2 kilobytes worth of changes took place that day. Why on earth would we want to stress out that multi-core, refrigerator-sized server when an underpowered server, network, and storage could do all the work?
Fact tables and dimensions are part of data warehousing. A slowly changing dimension is a dimension that slowly changes over time (duh). There are three types of slowly changing dimensions:
- Type 1: Update the old values with the new values (do not give a crap about history)
- Type 2: Insert a new row into the dimension with a new primary (surrogate) key, and update the existing row to mark it historical (keep full history)
- Type 3: Update the old values with the new values, and record extra data in additional columns (the date of the change and the prior value, keeping one level of history)
You would figure type 3 would come before type 2 but who cares..? This dimension uses a generated surrogate key (from a sequence) for uniqueness, plus a pair of dates to track when each row was effective: a NULL end date indicates the row is the current version of the foreign primary key value, and all other rows contain historical data and attributes.
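To make the effective dates concrete, here is a sketch (all values invented for illustration) of what one changed foreign key looks like in such a dimension, together with the query that isolates the current rows:

```sql
-- Hypothetical contents after foreign key 'C-100' changes once:
--   surrogate  beg_date     end_date     foreign  description
--   1          01-JAN-2010  15-MAR-2010  C-100    old description  <- historical
--   2          15-MAR-2010  (NULL)       C-100    new description  <- current
-- The current version of every key is simply:
SELECT *
  FROM crap_dimension
 WHERE dim_version_end_date IS NULL;
```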
The same logic used to maintain slowly changing dimensions can also be applied to tracking historical changes to tables.
The code that follows assumes only additions and changes are taking place; after all, nobody ever deletes data, do they? The dimension table is simple, with no clustering, partitioning, bitmap indexes, or compression, and is purely for example:
CREATE TABLE crap_dimension
( dim_pk_surrogate NUMBER
, dim_version_beg_date DATE
, dim_version_end_date DATE
, dim_pk_foreign VARCHAR2(10)
, dim_description VARCHAR2(60)
, dim_bulb VARCHAR2(05)
, dim_sum VARCHAR2(08)
);
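The surrogate key comes from a sequence the listing does not show; a minimal sketch, assuming the name crap_dim_seq:

```sql
-- crap_dim_seq is an assumed name. It must start at 1 or higher,
-- because the MERGE below uses a surrogate key of 0 to flag rows
-- that need to be inserted rather than updated.
CREATE SEQUENCE crap_dim_seq START WITH 1 INCREMENT BY 1;
```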
The MERGE SQL statement is best used for inserts and updates. Here is the single statement used for the entire dimension update:
MERGE /*+ APPEND */
 INTO crap_dimension m
--- Step 1 BEG :: Select new and changed data for the USING section of the MERGE
USING
( SELECT DECODE(v2.dim_upsert_level,1,v2.dim_pk_surrogate,0) AS dim_pk_surrogate
       , v1.dim_pk_foreign
       , v1.dim_description
       , v1.dim_bulb
       , v1.dim_sum
    FROM
--- Step 1a BEG :: Collect data from your OLTP or ODS or POS (crap_source is an assumed source table name)
  ( SELECT crap_id   AS dim_pk_foreign
         , crap_desc AS dim_description
         , crap_poop AS dim_bulb
         , crap_turd AS dim_sum
      FROM crap_source
--- Step 1a END
--- Step 1b BEG :: Remove identical data attributes already in the dimension
     MINUS
    SELECT dim_pk_foreign, dim_description, dim_bulb, dim_sum
      FROM crap_dimension
     WHERE dim_version_end_date IS NULL
--- Step 1b END
  ) v1
--- Step 1c BEG :: two rows are needed for updates (update existing and insert new data); LEFT JOIN cartesian product of current rows and the values 1 and 2
    LEFT JOIN
  ( SELECT dim_pk_surrogate, dim_pk_foreign, dim_upsert_level
      FROM crap_dimension
         , (SELECT LEVEL AS dim_upsert_level FROM DUAL CONNECT BY LEVEL BETWEEN 1 AND 2)
     WHERE dim_version_end_date IS NULL
  ) v2 ON v2.dim_pk_foreign = v1.dim_pk_foreign
--- Step 1c END
--- Step 1 END
) u
ON (m.dim_pk_surrogate = u.dim_pk_surrogate)
--- Step 2 BEG :: If update, mark the end date
WHEN MATCHED THEN UPDATE
  SET m.dim_version_end_date = SYSDATE
--- Step 2 END
--- Step 3 BEG :: Insert new data and the new versions of changed data
WHEN NOT MATCHED THEN INSERT
( dim_pk_surrogate, dim_version_beg_date, dim_version_end_date
, dim_pk_foreign, dim_description, dim_bulb, dim_sum )
VALUES
( crap_dim_seq.NEXTVAL, SYSDATE, NULL  -- crap_dim_seq is an assumed sequence name
, u.dim_pk_foreign, u.dim_description, u.dim_bulb, u.dim_sum )
--- Step 3 END
I tried to break down the statement into steps.
Step 1a collects data from your ODS or OLTP and can be a simple query or a gigantic, complicated parallel query with a hundred joins and aggregation steps. The important thing to remember is that we only want to do this expensive step once. When 1a completes we will have a fairly large result set.
Step 1b subtracts from the previous result set every row whose attributes already exist unchanged in the dimension. This should significantly reduce the working set of rows, at the expense of a full table scan of our dimension. When 1b completes, the working set will contain only the new and changed data from the source tables.
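The subtraction in 1b is plain MINUS set arithmetic. This throwaway example (literal values only, runnable against DUAL) shows how unchanged rows drop out:

```sql
SELECT 'A1' AS id, 'changed desc' AS descr FROM DUAL  -- changed in the source
UNION ALL
SELECT 'A2', 'same desc' FROM DUAL                    -- unchanged in the source
MINUS
SELECT 'A2', 'same desc' FROM DUAL;                   -- current dimension row
-- only ('A1', 'changed desc') survives
```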
Step 1c creates the data needed for updates. If data in the source has changed, the dimension needs two rows: one to update the existing row, marking it historical, and one to insert the new data as the current row. The surrogate and foreign primary keys of all current rows are Cartesian-joined to the values 1 and 2 (one full table scan and a hash join instead of two full table scans) and then left-joined to the previous result set. New data from the source will not match and remains a single row with a NULL surrogate primary key. Changed data will match both generated rows, producing the two rows needed for the update and the insert. The surrogate primary key ends up 0 for inserts (assuming the sequence started at 1 or higher) and holds the existing key for updates.
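The row-doubling in 1c hinges on a tiny generated rowset; run in isolation against DUAL it is just:

```sql
SELECT LEVEL AS dim_upsert_level
  FROM DUAL
CONNECT BY LEVEL BETWEEN 1 AND 2;
-- returns two rows, 1 and 2; cartesian-joining this to the current
-- dimension rows yields every (surrogate, foreign) key pair twice,
-- once for each dim_upsert_level value
```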
Step 1 finishes with rows containing the changes to be applied to the dimension. Step 2 updates historical rows, simply setting the end date to the current date. Step 3 inserts the new data from the source along with the new current version of each changed row.
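As a follow-up sketch (not part of the original statement, and assuming the schema above), a quick sanity check after a merge run is to confirm every foreign key has exactly one current row:

```sql
-- Flags any foreign key with zero or multiple NULL-end-date rows;
-- a clean type 2 dimension returns no rows here.
SELECT dim_pk_foreign
     , COUNT(*) AS versions
     , COUNT(CASE WHEN dim_version_end_date IS NULL THEN 1 END) AS current_rows
  FROM crap_dimension
 GROUP BY dim_pk_foreign
HAVING COUNT(CASE WHEN dim_version_end_date IS NULL THEN 1 END) <> 1;
```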