Many years ago, when I was first learning Oracle (in the deep end), I came across the problem of data densification. I can't remember how I solved it back then, but I've encountered it a few times since. When writing my presentation on the model clause, I explored the difference between two options. Recently, whilst investigating syntax definitions for the ANSI standard outer joins, I encountered another solution called a partitioned outer join - so I thought I'd compare all three for performance and readability.
The scenario is thus - I have an amount in a table for various dates throughout the year, for a particular person.
NAME AMT DT
------ ---------- -----------
Scott 117 11-jan-2009
Scott 250 17-feb-2009
Scott 300 14-apr-2009
Scott 50 06-jun-2009
Wade 1231 17-mar-2009
Wade 2321 22-apr-2009
Wade 3122 30-sep-2009
Wade 59 31-oct-2009
However, in my report I want all months of the year displayed for each person, regardless of whether any amounts exist. Let's add a cumulative figure in there for good measure. The desired output is shown below, followed by the SQL for my test case.
NAME MTH AMT CUM_AMT
------ --------- ---------- ----------
Scott January 117 117
Scott February 250 367
Scott March 0 367
Scott April 300 667
Scott May 0 667
Scott June 50 717
Scott July 0 717
Scott August 0 717
Scott September 0 717
Scott October 0 717
Scott November 0 717
Scott December 0 717
Wade January 0 0
Wade February 0 0
Wade March 1231 1231
Wade April 2321 3552
Wade May 0 3552
Wade June 0 3552
Wade July 0 3552
Wade August 0 3552
Wade September 3122 6674
Wade October 59 6733
Wade November 0 6733
Wade December 0 6733
DROP TABLE customer;
CREATE TABLE customer (
name VARCHAR2(10)
,amt NUMBER
,dt DATE
);
insert into customer values('Scott',117, to_date('11-01-2009','DD-MM-YYYY'));
insert into customer values('Scott',250, to_date('17-02-2009','DD-MM-YYYY'));
insert into customer values('Scott',300, to_date('14-04-2009','DD-MM-YYYY'));
insert into customer values('Scott',50, to_date('06-06-2009','DD-MM-YYYY'));
insert into customer values('Wade',1231, to_date('17-03-2009','DD-MM-YYYY'));
insert into customer values('Wade',2321, to_date('22-04-2009','DD-MM-YYYY'));
insert into customer values('Wade',3122, to_date('30-09-2009','DD-MM-YYYY'));
insert into customer values('Wade',59, to_date('31-10-2009','DD-MM-YYYY'));
This is a pre-10g solution, using an efficient method on the DUAL table to conjure records. I also use analytics to determine the cumulative amount. There are, however, a cartesian join and two full table scans on the customer table.
SELECT date_fill.name
,TO_CHAR(real_dt,'Month') mth
,NVL(amt,0) amt
,NVL(SUM(amt) OVER (PARTITION BY date_fill.name
ORDER BY real_dt ),0) cum_amt
FROM
(SELECT name, TRUNC(dt,'mm') dt, SUM(amt) amt
FROM customer
GROUP BY name, TRUNC(dt,'mm')
) actual_data -- Actual data
,(SELECT name, real_dt
FROM (SELECT DISTINCT name
FROM customer)
,(WITH mths AS (SELECT TRUNC(SYSDATE,'YYYY') real_dt
FROM DUAL CONNECT BY LEVEL <= 12)
SELECT ADD_MONTHS(real_dt,ROWNUM-1) real_dt FROM mths)
) date_fill -- Distinct list with conjured dates
-- Outer join actual data with full date list
WHERE date_fill.real_dt = actual_data.dt(+)
AND date_fill.name = actual_data.name(+)
ORDER BY date_fill.name, date_fill.real_dt;
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 74 | 11 (28)|
| 1 | WINDOW SORT | | 2 | 74 | 11 (28)|
|* 2 | HASH JOIN OUTER | | 2 | 74 | 11 (28)|
| 3 | VIEW | | 2 | 24 | 6 (17)|
| 4 | MERGE JOIN CARTESIAN | | 2 | 24 | 6 (17)|
| 5 | VIEW | | 1 | 6 | 2 (0)|
| 6 | COUNT | | | | |
| 7 | VIEW | | 1 | 6 | 2 (0)|
| 8 | CONNECT BY WITHOUT FILTERING| | | | |
| 9 | FAST DUAL | | 1 | | 2 (0)|
| 10 | BUFFER SORT | | 2 | 12 | 6 (17)|
| 11 | VIEW | | 2 | 12 | 4 (25)|
| 12 | HASH UNIQUE | | 2 | 12 | 4 (25)|
| 13 | TABLE ACCESS FULL | CUSTOMER | 8 | 48 | 3 (0)|
| 14 | VIEW | | 8 | 200 | 4 (25)|
| 15 | HASH GROUP BY | | 8 | 136 | 4 (25)|
| 16 | TABLE ACCESS FULL | CUSTOMER | 8 | 136 | 3 (0)|
-------------------------------------------------------------------------------------
Here is what some would consider a more elegant solution, using the model clause. The explain plan is far neater and the performance also improved - there were no joins and half as many logical reads. The cumulative column is calculated using a nifty rule computation.
SELECT name, TO_CHAR(dt,'DD-MM-YYYY') dt, amt, cum_amt -- Model results
FROM (
SELECT name, TRUNC(dt, 'MM') dt, SUM(amt) amt
FROM customer
GROUP BY name, TRUNC(dt, 'MM')
)
MODEL
PARTITION BY (name)
DIMENSION BY (dt)
MEASURES (amt, cast(NULL AS NUMBER) cum_amt) -- Define calculated col
IGNORE NAV
RULES SEQUENTIAL ORDER(
-- Conjure dates
amt[FOR dt FROM TO_DATE('01-01-2009', 'DD-MM-YYYY')
TO TO_DATE('01-12-2009', 'DD-MM-YYYY')
INCREMENT NUMTOYMINTERVAL(1, 'MONTH')
] = amt[CV(dt)] -- Apply amt for given date, if found
,cum_amt[ANY] = SUM(amt)[dt <= CV(dt)] -- Calculate cumulative
)
ORDER BY name, dt;
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 136 | 5 (40)|
| 1 | SORT ORDER BY | | 8 | 136 | 5 (40)|
| 2 | SQL MODEL ORDERED | | 8 | 136 | 5 (40)|
| 3 | HASH GROUP BY | | 8 | 136 | 5 (40)|
| 4 | TABLE ACCESS FULL | CUSTOMER | 8 | 136 | 3 (0)|
| 5 | WINDOW (IN SQL MODEL) SORT| | | | |
------------------------------------------------------------------------------
The following solution is the one I stumbled upon while playing with ANSI joins, and I wanted to see how it compared with the others. We still have a join, but that's not necessarily an issue. It's certainly tidier looking than the first solution, and we don't need to fry our brains learning how to use the model clause. I have heard some resistance to ANSI syntax, however - not for the sake of going against the grain, but apparently there have been some security issues. I'm not qualified to comment on that, so let's just compare the performance in this particular test case.
SELECT c.name, TO_CHAR(real_dt,'Month') mth, NVL(amt,0) amt
,NVL(SUM(amt) OVER (PARTITION BY c.name
ORDER BY real_dt ),0) cum_amt
FROM customer c
PARTITION BY (name)
RIGHT OUTER JOIN
(WITH mths AS (SELECT TRUNC(SYSDATE,'YYYY') real_dt
FROM DUAL CONNECT BY LEVEL <= 12)
SELECT ADD_MONTHS(real_dt,ROWNUM-1) real_dt FROM mths) mths
ON (real_dt = TRUNC(c.dt,'mm'))
ORDER BY c.name, real_dt;
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 130 | 8 (13)|
| 1 | WINDOW SORT | | 2 | 130 | 8 (13)|
| 2 | VIEW | | 2 | 130 | 8 (13)|
| 3 | NESTED LOOPS PARTITION OUTER | | 2 | 46 | 8 (13)|
| 4 | BUFFER SORT | | | | |
| 5 | VIEW | | 1 | 6 | 2 (0)|
| 6 | COUNT | | | | |
| 7 | VIEW | | 1 | 6 | 2 (0)|
| 8 | CONNECT BY WITHOUT FILTERING| | | | |
| 9 | FAST DUAL | | 1 | | 2 (0)|
|* 10 | FILTER | | | | |
| 11 | SORT PARTITION JOIN | | 1 | 17 | 4 (25)|
| 12 | TABLE ACCESS FULL | CUSTOMER | 1 | 17 | 3 (0)|
-------------------------------------------------------------------------------------
I executed these 3 statements 1000 times on my laptop on 10gR2 (similar results in 11gR1) to see how throughput compares. I consistently found the following performance:
Original                   6 secs    (.006 secs per iteration)
Model clause             4.6 secs    (.0046 secs per iteration)
Partitioned outer join   .35 secs    (.00035 secs per iteration)
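For what it's worth, the measurement itself was nothing fancy. The anonymous block below is a sketch of the kind of harness I mean - the placeholder SELECT stands in for whichever of the three statements is under test, and the iteration count matches the 1000 runs mentioned above. The exact block I ran isn't reproduced here.
SET SERVEROUTPUT ON

DECLARE
  l_start   NUMBER;
  l_elapsed NUMBER;
  l_rows    PLS_INTEGER := 0;
BEGIN
  l_start := DBMS_UTILITY.GET_TIME;            -- hundredths of a second
  FOR i IN 1 .. 1000 LOOP
    -- Substitute each of the three densification queries here in turn
    FOR r IN (SELECT name, amt, dt FROM customer) LOOP
      l_rows := l_rows + 1;                    -- fetch every row so the work is really done
    END LOOP;
  END LOOP;
  l_elapsed := (DBMS_UTILITY.GET_TIME - l_start) / 100;
  DBMS_OUTPUT.PUT_LINE('Total: ' || l_elapsed || ' secs ('
                       || l_elapsed / 1000 || ' secs per iteration)');
END;
/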
Personally I was a little surprised at the dramatic timing difference with the partitioned outer join over the original. I thought perhaps that would have been more comparable to the model clause solution. Something to bear in mind in future development.
If I needed to write this for a future project, I'd probably take a closer look at the relevant data set using Tom Kyte's runstats package. This used to be located at http://asktom.oracle.com/tkyte/runstats.html, but the link is currently broken - possibly related to his recent move - although it is covered in many of his forum entries. This package is better than just measuring throughput; it will let you know how your application may scale by paying attention to latches. A DBA's perspective on the subject is also worth seeking out.
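If memory serves, the package is used by sandwiching the two approaches you want to compare between three calls - something like the SQL*Plus snippet below. This assumes the package is installed under its published name, runstats_pkg, with the rs_start/rs_middle/rs_stop procedures; the threshold of 100 is just an example.
exec runstats_pkg.rs_start;

-- run the first statement here (e.g. the model clause query, ideally in a loop)

exec runstats_pkg.rs_middle;

-- run the second statement here (e.g. the partitioned outer join)

-- report statistics and latches differing by more than 100 between the two runs
exec runstats_pkg.rs_stop(100);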