Monday, 21 September 2009

Data Densification - filling dates example

Many years ago when I was first learning Oracle (in the deep end) I came by the problem of data densification. I can't remember how I solved the problem back then, but I've encountered it a few times since. When writing my presentation for the model clause, I explored the difference between two options. Recently whilst investigating syntax definitions for the ANSI standard outer joins, I encountered another solution called a partitioned outer join - so I thought I'd compare all three for performance and readability.

The scenario is thus - I have an amount in a table for various dates throughout the year, for a particular person.
NAME          AMT DT
------ ---------- -----------
Scott         117 11-jan-2009
Scott         250 17-feb-2009
Scott         300 14-apr-2009
Scott          50 06-jun-2009
Wade         1231 17-mar-2009
Wade         2321 22-apr-2009
Wade         3122 30-sep-2009
Wade           59 31-oct-2009
However in my report I wish all months of the year displayed for the person, regardless of whether amounts exist. Let's add a cumulative figure in there for good measure. Included below this is the SQL for my test case.
NAME   MTH              AMT    CUM_AMT
------ --------- ---------- ----------
Scott  January          117        117
Scott  February         250        367
Scott  March              0        367
Scott  April            300        667
Scott  May                0        667
Scott  June              50        717
Scott  July               0        717
Scott  August             0        717
Scott  September          0        717
Scott  October            0        717
Scott  November           0        717
Scott  December           0        717
Wade   January            0          0
Wade   February           0          0
Wade   March           1231       1231
Wade   April           2321       3552
Wade   May                0       3552
Wade   June               0       3552
Wade   July               0       3552
Wade   August             0       3552
Wade   September       3122       6674
Wade   October           59       6733
Wade   November           0       6733
Wade   December           0       6733

DROP TABLE customer;
CREATE TABLE customer (
   name  VARCHAR2(10)
  ,amt   NUMBER
  ,dt    DATE
);

insert into customer values('Scott',117,  to_date('11-01-2009','DD-MM-YYYY'));
insert into customer values('Scott',250,  to_date('17-02-2009','DD-MM-YYYY'));
insert into customer values('Scott',300,  to_date('14-04-2009','DD-MM-YYYY'));
insert into customer values('Scott',50,   to_date('06-06-2009','DD-MM-YYYY'));
insert into customer values('Wade',1231,  to_date('17-03-2009','DD-MM-YYYY'));
insert into customer values('Wade',2321,  to_date('22-04-2009','DD-MM-YYYY'));
insert into customer values('Wade',3122,  to_date('30-09-2009','DD-MM-YYYY'));
insert into customer values('Wade',59,    to_date('31-10-2009','DD-MM-YYYY'));
This is a pre-10g solution, using an efficient method on the DUAL table to conjure records. I also use analytics to determine the cumulative amount. There is however a cartesion join and two full table scans on the customer table.
SELECT date_fill.name
      ,TO_CHAR(real_dt,'Month') mth
      ,NVL(amt,0) amt
      ,NVL(SUM(amt) OVER (PARTITION BY date_fill.name
                          ORDER BY real_dt ),0) cum_amt
FROM
  (SELECT name, TRUNC(dt,'mm') dt, SUM(amt) amt
   FROM   customer
   GROUP BY name, TRUNC(dt,'mm')
   ) actual_data -- Actual data
  ,(SELECT name, real_dt
    FROM  (SELECT DISTINCT name
           FROM   customer)
         ,(WITH mths AS (SELECT TRUNC(SYSDATE,'YYYY') real_dt
                         FROM DUAL CONNECT BY LEVEL <= 12)
           SELECT ADD_MONTHS(real_dt,ROWNUM-1) real_dt FROM mths)
   ) date_fill -- Distinct list with conjured dates
-- Outer join actual data with full date list
WHERE date_fill.real_dt = actual_data.dt(+)
AND   date_fill.name    = actual_data.name(+)
ORDER BY date_fill.name, date_fill.real_dt;

-------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     2 |    74 |    11  (28)|
|   1 |  WINDOW SORT                        |          |     2 |    74 |    11  (28)|
|*  2 |   HASH JOIN OUTER                   |          |     2 |    74 |    11  (28)|
|   3 |    VIEW                             |          |     2 |    24 |     6  (17)|
|   4 |     MERGE JOIN CARTESIAN            |          |     2 |    24 |     6  (17)|
|   5 |      VIEW                           |          |     1 |     6 |     2   (0)|
|   6 |       COUNT                         |          |       |       |            |
|   7 |        VIEW                         |          |     1 |     6 |     2   (0)|
|   8 |         CONNECT BY WITHOUT FILTERING|          |       |       |            |
|   9 |          FAST DUAL                  |          |     1 |       |     2   (0)|
|  10 |      BUFFER SORT                    |          |     2 |    12 |     6  (17)|
|  11 |       VIEW                          |          |     2 |    12 |     4  (25)|
|  12 |        HASH UNIQUE                  |          |     2 |    12 |     4  (25)|
|  13 |         TABLE ACCESS FULL           | CUSTOMER |     8 |    48 |     3   (0)|
|  14 |    VIEW                             |          |     8 |   200 |     4  (25)|
|  15 |     HASH GROUP BY                   |          |     8 |   136 |     4  (25)|
|  16 |      TABLE ACCESS FULL              | CUSTOMER |     8 |   136 |     3   (0)|
-------------------------------------------------------------------------------------
Here is what some would consider a more elegant solution using the model clause. The explain plan is far neater, and the performance was also enhanced - and there was no joins & half as many logical reads. The cumulative column was calculated using a nifty rule computation.
SELECT name, TO_CHAR(dt,'DD-MM-YYYY') dt, amt, cum_amt -- Model results
FROM (
   SELECT name, TRUNC(dt, 'MM') dt, SUM(amt) amt
   FROM   customer
   GROUP BY name, TRUNC(dt, 'MM')
)
MODEL
PARTITION BY (name)
DIMENSION BY (dt)
MEASURES (amt, cast(NULL AS NUMBER) cum_amt) -- Define calculated col
IGNORE NAV
RULES SEQUENTIAL ORDER(
   -- Conjure dates
   amt[FOR dt FROM TO_DATE('01-01-2009', 'DD-MM-YYYY')
              TO   TO_DATE('01-12-2009', 'DD-MM-YYYY')
              INCREMENT NUMTOYMINTERVAL(1, 'MONTH')
       ] = amt[CV(dt)] -- Apply amt for given date, if found
  ,cum_amt[ANY] = SUM(amt)[dt <= CV(dt)] -- Calculate cumulative
)
ORDER BY name, dt;
------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     8 |   136 |     5  (40)|
|   1 |  SORT ORDER BY               |          |     8 |   136 |     5  (40)|
|   2 |   SQL MODEL ORDERED          |          |     8 |   136 |     5  (40)|
|   3 |    HASH GROUP BY             |          |     8 |   136 |     5  (40)|
|   4 |     TABLE ACCESS FULL        | CUSTOMER |     8 |   136 |     3   (0)|
|   5 |    WINDOW (IN SQL MODEL) SORT|          |       |       |            |
------------------------------------------------------------------------------
The following solution is the one I stumbled upon while playing with ANSI joins. I wanted to see how it compared with the others. We still have a join, but that's not necessarily an issue. It's certainly tidier looking than the first solution, and we don't need to fry our brain learn how to use model. I have heard some resistance to ANSI syntax however - not for the sake of going against the grain, but apparently there have been some security issues. I'm not qualified to comment on this, let's just compare the performance in this particular test case.
SELECT c.name, TO_CHAR(real_dt,'Month') mth, NVL(amt,0) amt
      ,NVL(SUM(amt) OVER (PARTITION BY c.name
                          ORDER BY real_dt ),0) cum_amt
FROM customer c
PARTITION BY (name)
RIGHT OUTER JOIN
  (WITH mths AS (SELECT TRUNC(SYSDATE,'YYYY') real_dt
                 FROM DUAL CONNECT BY LEVEL <= 12)
   SELECT ADD_MONTHS(real_dt,ROWNUM-1) real_dt FROM mths) mths
ON (real_dt = TRUNC(c.dt,'mm'))
ORDER BY c.name, real_dt;
-------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     2 |   130 |     8  (13)|
|   1 |  WINDOW SORT                        |          |     2 |   130 |     8  (13)|
|   2 |   VIEW                              |          |     2 |   130 |     8  (13)|
|   3 |    NESTED LOOPS PARTITION OUTER     |          |     2 |    46 |     8  (13)|
|   4 |     BUFFER SORT                     |          |       |       |            |
|   5 |      VIEW                           |          |     1 |     6 |     2   (0)|
|   6 |       COUNT                         |          |       |       |            |
|   7 |        VIEW                         |          |     1 |     6 |     2   (0)|
|   8 |         CONNECT BY WITHOUT FILTERING|          |       |       |            |
|   9 |          FAST DUAL                  |          |     1 |       |     2   (0)|
|* 10 |     FILTER                          |          |       |       |            |
|  11 |      SORT PARTITION JOIN            |          |     1 |    17 |     4  (25)|
|  12 |       TABLE ACCESS FULL             | CUSTOMER |     1 |    17 |     3   (0)|
-------------------------------------------------------------------------------------
I executed these 3 statements 1000 times on my laptop on 10gR2 (similar results in 11gR1) to see how throughput compares. I consistently found the following performance:
6 secs -- Original
.006 secs per iteration
4.6 secs -- Model clause
.0046 secs per iteration
.35 secs -- Partitioned outer join
.00035 secs per iteration
Personally I was a little surprised at the dramatic timing difference with the partitioned outer join over the original. I thought perhaps that would have been more comparable to the model clause solution. Something to bear in mind in future development.

If I needed to write this for a future project I'd probably take a closer look at the relevant data set using Tom Kyte's runstats package. This used to be located here - htpp://asktom.oracle.com/tkyte/runstats.html, but the link is currently broken. Possibly related to his recent move. However here is one of many of his forum entries addressing runstats. This package is better than just measuring throughput, it will let you know how your application may scale by paying attention to latches. A DBA's perspective can be found here.

1 comment:

Fahd Mirza said...

Learned something from this post. Thanks for sharing it.

regards