Wednesday, May 27, 2009

Building a time dimension with SQL

The time dimension can be found in the majority of star-schema based data warehouses. There are literally dozens of ways to generate the values in the table at the lowest level of the dimension.

The easiest way is to simply plug the contents of the table into a spreadsheet and then import the spreadsheet into a table. Seriously... Why write all the complicated code when you can manually put it all in, show it to users, let them bitch about it, make changes, fix their frak-ups, and then cram the data into a table. The table is not built on an hourly/daily/monthly basis like other dimensions. However, if you wish to use SQL (Oracle) then maybe I can help.

If at any point during this conversation you the reader notice room for improvement, please comment.

The Table

My personal preferences for time dimensions:

  1. define columns only for the basic values and values that are too expensive to calculate within a view, do not define columns that can be easily calculated with date math or TO_CHAR functions within a view
  2. do not bother with partitioning or logging
  3. use table compression

Most time dimensions I see include columns for day of month, end of week, day in year, etc. My reasoning behind storing only the base values and complex values and then doing columns like day in year in a view is two-fold. First is I/O reduction. I would rather do the blocked read of two or three blocks and some CPU rather than do five to ten blocked reads. Second is flexibility. If someone is crawling up your ass for another group of columns so their horribly inefficient DSS actually looks feasible the nerd simply adds the column to the view.  Regardless of what you store in it, the time dimension is small compared to other dimensions and the fact table so why bother with partitioning overhead. The table can be recreated so use NOLOGGING. Compression will reduce the size by about thirty percent leading to fewer I/O requests at the expense of CPU. I always want CPUs working.

Creating Values

MERGE /*+ parallel(time_dimension,4) append */
INTO time_dimension m
USING ( -- Step 7: use the record set as part of the merge
WITH period_sf AS ( --- Step 1 - factored subquery for the various periods from operational databases
SELECT
'Craptastic' AS period_id
, TRUNC(SYSDATE,'YEAR') AS period_beg_date
, TRUNC(ADD_MONTHS(SYSDATE,12),'YEAR')-1 AS period_end_date
FROM dual
WHERE 1 = 1
)
, distinct_years_sf AS ( --- Step 2: factored subquery for all distinct years
SELECT DISTINCT TRUNC(period_beg_date,'YEAR') AS year_beg_date
FROM period_sf
UNION
SELECT DISTINCT TRUNC(period_end_date,'YEAR') AS year_beg_date
FROM period_sf
)
SELECT -- Step 6: select base values for the date, values that will enter the time dimension
TO_CHAR((v1.year_beg_date + v2.day_num),'YYYYMMDD')||'.01' AS time_id
, v1.year_beg_date + v2.day_num AS time_date
, v3.period_id AS time_period_id
, v3.period_beg_date AS time_period_beg_date
, v3.period_end_date AS time_period_end_date
, v1.year_beg_date + v2.day_num - v3.period_beg_date + 1 AS time_period_day
, CASE (v1.year_beg_date + v2.day_num)
WHEN v1.year_beg_date THEN 'New Years Day'
WHEN v1.year_presidents_date THEN 'Presidents Day'
WHEN v1.year_easter_date THEN 'Easter'
WHEN v1.year_memorial_date THEN 'Memorial Day'
WHEN v1.year_independance_date THEN 'Independance Day'
WHEN v1.year_labor_date THEN 'Labor Day'
WHEN v1.year_veterans_date THEN 'Veterans Day'
WHEN v1.year_thanksgiving_date THEN 'Thanksgiving'
WHEN v1.year_christmas_date THEN 'Christmas'
ELSE NULL
END AS time_holiday_text
FROM ( --- Step 4: generate values for each
SELECT
year_id
, year_beg_date
, year_end_date
, CASE
WHEN year_id BETWEEN 1885 AND 1970 THEN year_beg_date + 52
WHEN year_id BETWEEN 1971 AND 9999 THEN NEXT_DAY(year_beg_date + 30,'MON') + 14
ELSE NULL
END AS year_presidents_date
, CASE
WHEN year_id BETWEEN 1900 AND 2199 THEN NEXT_DAY(year_end_date - 244 + easter_value + (CASE WHEN easter_value < 5 THEN 17 ELSE -12 END),'SAT') - 34 + (CASE year_id WHEN 2079 THEN 7 ELSE 0 END)
ELSE NULL
END AS year_easter_date
, CASE
WHEN year_id BETWEEN 1882 AND 1970 THEN year_end_date - 215
WHEN year_id BETWEEN 1971 AND 9999 THEN NEXT_DAY(year_end_date - 214,'MON') - 7
ELSE NULL
END AS year_memorial_date
, CASE
WHEN year_id BETWEEN 1882 AND 9999 THEN year_end_date - 180
ELSE NULL
END AS year_independance_date
, CASE
WHEN year_id BETWEEN 1894 AND 9999 THEN NEXT_DAY(year_end_date - 122,'MON')
ELSE NULL
END AS year_labor_date
, CASE
WHEN year_id BETWEEN 1954 AND 1970 THEN year_end_date - 50
WHEN year_id BETWEEN 1971 AND 1977 THEN NEXT_DAY(year_end_date - 92,'MON') + 21
WHEN year_id BETWEEN 1978 AND 9999 THEN year_end_date - 50
ELSE NULL
END AS year_veterans_date
, CASE
WHEN year_id BETWEEN 1863 AND 1938 THEN NEXT_DAY(year_end_date - 31,'THU') - 7
WHEN year_id = 1939 THEN NEXT_DAY(year_end_date - 61,'THU') + 21
WHEN year_id = 1940 THEN NEXT_DAY(year_end_date - 61,'THU') + 14
WHEN year_id BETWEEN 1941 AND 9999 THEN NEXT_DAY(year_end_date - 61,'THU') + 21
ELSE NULL
END AS year_thanksgiving_date
, CASE
WHEN year_id BETWEEN 1870 AND 9999 THEN year_end_date - 6
ELSE NULL
END AS year_christmas_date
FROM ( --- Step 3: calculate a few values for the years in the second step
SELECT
EXTRACT(YEAR FROM year_beg_date) AS year_id
, year_beg_date
, ADD_MONTHS(year_beg_date,12) - 1 AS year_end_date
, TRUNC(MOD(EXTRACT(YEAR FROM year_beg_date) / 38 * 1440,60) / 2) AS easter_value
FROM distinct_years_sf
)
) v1
JOIN ( --- Step 5: generate numbers 0 through 366 and join to the years
SELECT (LEVEL - 1) AS day_num FROM DUAL CONNECT BY ROWNUM BETWEEN 1 and 367
) v2 ON v2.day_num BETWEEN 0 AND (v1.year_end_date - v1.year_beg_date)
LEFT JOIN period_sf v3 ON (v1.year_beg_date + v2.day_num) BETWEEN period_beg_date AND period_end_date
ORDER BY time_id
) u
ON (m.time_id = u.time_id) --- Step 8: update or insert (i.e. MERGE)
WHEN MATCHED THEN UPDATE
SET
m.time_date = u.time_date
, m.time_period_id = u.time_period_id
, m.time_period_beg_date = u.time_period_beg_date
, m.time_period_end_date = u.time_period_end_date
, m.time_period_day = u.time_period_day
, m.time_holiday_text = u.time_holiday_text
WHEN NOT MATCHED THEN INSERT
VALUES
( u.time_id
, u.time_date
, u.time_period_id
, u.time_period_beg_date
, u.time_period_end_date
, u.time_period_day
, u.time_holiday_text
)

Most time dimensions will have primary units within periods or intervals of time. Sales might have sales periods. Colleges might have semesters. Television might have quarters and sweeps weeks within them. The source for these periods could be one or more tables in an OLTP database or ODS or perhaps already in the data warehouse if using a star-snowflake hybrid schema. Each source should be a factored sub-query containing the period primary key and the dates the period starts and ends. Oracle will create a temporary table for each sub-query. For this example, I chose one period based upon the current year (table dual is available on every Oracle database).

The second step is to generate a record set from all period sub-queries with distinct years (the first date of each year). The sub-query is also factored because it could be used more than once to generate record sets based on individual years.

The next few steps create data specific for each year in the above sub-query. I do this to avoid doing repetitive calculations for all 365 or so days in each year (do you really want the server calculating when Easter occurs for each day or just do it once for each year; yeah I thought so).

Next, each year is joined to a day number, 0 through the end of the year.

Each factored period sub-query that will have values in the time dimension is then LEFT JOINed to the years plus days result set. The left join is required because there might not be overlaps in the various periods. For example: a set of sales periods existed before, during, and then after a corporate acquisition.

The columns from the result set are fed into the USING part of the MERGE statement. The results are either inserted into or used to update the time dimension. The /*+ APPEND */ hint is used to trigger a direct-path insert for table compression and the /*+ PARALLEL */ hint is optional but could speed up the process depending upon your server.

The above example is very, very simplistic. There is an entire data warehousing language and concepts documentation library that I have yet to digest. I am also certain there are better ways of doing the same thing I have done in the SQL above. I could have also used PL/SQL and loop constructs to build the dimension table. To be honest, there are more important things in life.

Feel free to comment on my SQL brain fart. Suggest improvements. Steal the code and call it your own, basking in the glory.

No comments: