Wednesday, 30 March 2016

Analytics in APEX Charts - Moving Average

Consider a chart with a trend that might be quite jagged across data points (blue line).

What if you would like a smoother version of that line - a moving average, if you will (red line). This stabilises the results, like looking at climate vs weather.
Oracle APEX Line Chart - 2 series
It's fairly easy from a SQL point of view - in this case it's another column in the original chart query
SELECT null lnk
  ,diff   label
  ,count(*) AS qty
  ,round(avg(count(*)) over (order by diff rows 
                             between 3 preceding and 3 following)
        ,2) AS moving_avg
...

The third column uses an analytical function to calculate the average of the 6 surrounding counts at any given point on the x-axis. The "rows between" syntax is often left to the default, which is everything up to the current value (see post comments), based on the order provided. In this case it explicitly specifies the 'window' of rows to average to be only nearby data points.

Note this report looks a lot cleaner with display setting 'Marker' using 'None', as opposed to 'Circle'.

Nice, simple way of providing the user information that can be easier to read & interpret.

2 comments:

Kim Berg Hansen said...

Thanks for promoting analytic functions, Scott, always great to see :-)

I have to add a small nitpick, though. You state:

The "rows between" syntax is often left to the default, which is everything up to the current row, based on the order provided.

That is not quite correct - it would be more correct to say:

The "rows between" syntax is often left to the default, which is everything up to the current VALUE, based on the order provided.

Because the default, when you have an ORDER BY clause and omit the windowing clause, is: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

And in my opinion that is the "wrong" default, because 90% of the time what you really want is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

The difference is what happens when there are rows with the same value (ties) in the ORDER BY column. The RANGE default includes in the window all rows with the same value or less, which might include one or more following rows in case of ties.

If you leave it at default RANGE BETWEEN ..., you risk not only results that may not be what you want (in case of ties), you also risk superfluous sorting operations. I've blogged about that at http://www.kibeha.dk/2013/02/rows-versus-default-range-in-analytic.html.

Sorry about if it sounds like a rant - it is just a pet peeve with me and I grab every opportunity to try and tell developers of analytic functions, that my rule of thumb for analytics with ORDER BY clause is never to rely on default, but explicitly write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (typically >90% of the time) or RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (typically <10% of the time.)

That way you have also self-documented your code. If you leave it to default, 3 months after writing the code you'll be in doubt whether this was one of the few cases where you actually meant RANGE or if it was a mistake and you actually meant ROWS.

Thanks for listening :-)

/Kim

Scott Wesley said...

That's great feedback, Kim. I think Connor's picked me up on these semantics before, and it reminds me of the UNION ALL 'default', just a little ... deeper. Great post.