Tuesday, April 20, 2010

The Oracle CBO statistical time bomb

Ever have a query or process that runs fine for years and years and years but then one day just slows to a crawl, going from seconds to hours in execution time? If so, you have run into what I call the Oracle CBO statistical time bomb.

Oracle 10g/11g has a daily job that queries for "stale" and missing optimizer statistics on Oracle table/index objects. Those optimizer statistics are used by the optimizer when choosing the best access plan, join types, index and table scans to use for a SQL query. The default for "stale" is if 10 percent of the data have changed (static in 10g, can be changed in 11g).

Let's assume we have this really, really big invoice table that grows at a constant rate over time. This table has an index on the financial quarter. Each quarter roughly 25,000 new rows are inserted, and then updated 3-5 times each (as status changes, etc. whatever), and some idiot needs to clean up a couple hundred bogus rows each month. Here is the table for 2010:

Financial    Total  Insert  Update  Delete  Changed  Percent
Quarter Rows Rows Rows Rows Rows Changed
2010 Q1 1,000,000 25,000 80,000 169 105,169 %10.51 - daily job gathers statistics
2010 Q2 1,024,836 25,000 80,000 169 105,169 %10.26 - daily job gathers statistics
2010 Q3 1,049,672 25,000 80,000 169 105,169 %10.01 - daily job gathers statistics
2010 Q4 1,074,508 25,000 80,000 169 105,169 %09.78 - not stale yet, still have 2010 Q3 statistics

Without current statistics, the Oracle cost-based optimizer thinks one index and access method would be more efficient than a different index and access method (when it really, really, really is not).

I'll need an Oracle Guru to correct me but here is what happens in 2010 Q4:

Your super-optimized, meticulously coded query, queries the invoices by quarter and by another column existing in a massive subquery.

For 2010 Q1,Q2,Q3 the CBO says there are 25,000 rows reading the quarter index and likely less rows using an index on another column (customer number perhaps) if it does the massive subquery first. The CBO got it right and the massive subquery executes first then reads the table via index on the other column.

For 2010 Q4 the CBO says there are no/few rows for Q4 (when there are really 25,000) so read the invoices using the quarter index and then do a nested loop each time on the massive subquery because the cost of the read by quarter index times the cost of the massive subquery is less than the plan it took above (when it really is not). Instead of doing that nested loop on a couple rows (yawn) it does that nested loop massive subquery 25,000 times. Yikes.

What to do? Plan ahead.

Go to Tahiti and research DBMS_STATS in the Oracle Database PL/SQL Packages and Types Reference. Know your tables and diffuse the time bomb before it goes off. Let the daily job handle the stale statistics. Tables that contain an index on a period of time (quarter, year, etc.) or code based on a period of time (i.e. the Spring 2010 line of Mountain Dew with Ginseng and Xanax) should be researched as candidates for statistics gathering outside of Oracle's daily job. Gather table statistics before the big processing periods.

References

No comments: