Someone made a comment on Stackoverflow that I'm painting the
Recursive CTE as evil and should always be avoided on my post
Running Total. Here and Now
Well, like Stackoverflow, I like facts, not opinions, give me numbers any time of day, numbers don't lie. Numbers could lie if you are a boxer though. How? By cherry-picking opponents. Ok, I digress ;-)
First, I would never declare a too encompassing statement that this and that thing is evil. What I actually conveyed on that previous post is, using recursive CTE for running total is slow. How could I even make a fast hierarchical report in this 21st century if I don't use recursive CTE then?
Anyway, I don't have too many blog post declaring things as evil, in fact since I do blogging, I have only
one blog post that mention the word evil, two if you counted
Dracula as evil,
three after I deliberately title this post with headline-grabbing four-letter word
evil. With this in mind, I would give my blog a favor and give it a traffic-grabbing title,
XXX is EVIL. I will then title this article as
Recursive CTE is Evil, unmindful of the fact that what I really conveyed is
Running Total with Recursive CTE is Evil :-)
This blog post has twofold purposes. First, I find using *
recursive* CTE for running total is really dog-slow^H^H^H^H^H^H^H^H evil; second, I'm ashamed that quirky update approach for making running total semi-works only, and yeah it's really quirky. I put a guard statement on the UPDATE statement, and that guard statement gives a warning(divide-by-zero) if UPDATE update rows unpredictably(i.e. it run out-of-order). And unpredictable it is, sometimes the query work, sometimes it don't(divide-by-zero is invoked), I have yet to explore the suggestions by many to make a clustered index on temporary table to make the update be predictable.
And before I forgot, a third purpose of this post, I would like to re-introduce to you our old friend cursor; he's nice, he's dependable, he's our most unsung hero. I would hazard a guess that for every smooth-running system, there's a cursor-using report that is lurking around that is not given much recognition or accolade, because you know,
cursors are evil. While our new-found friend(CTE) is being given all the mad props, even with the mere fact that it is just doing a glorified
row_numbering chore, yes folks, SQL Server 2008 don't have decent windowing capability, that a running total blog post such as this exists because of that limitation. This blog post will be moot when most of us will be using the latest Sql Server 2012 by then, but that's neither here nor there, even we are already at the end of the second quarter of year 2012, most of us are still using Sql Server 2008.
Ok, so let's start with some data to verify our query correctness:
create table Test(
OrderID int primary key,
CustomerCode varchar(50),
Qty int not null
);
declare @i int = 1;
while @i <= 20 begin
insert into Test(OrderID, CustomerCode, Qty) values (
@i * 2
,case @i % 4
when 0 then 'JOHN'
when 1 then 'PAUL'
when 2 then 'GEORGE'
when 3 then 'RINGO'
end
,rand() * 10);
set @i = @i + 1;
end;
Then make a Recursive CTE approach:
with T AS
(
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
Output:
CustomerCode OrderId Qty RunningTotal
-------------------------------------------------- ----------- ----------- ------------
GEORGE 4 7 7
GEORGE 12 8 15
GEORGE 20 4 19
GEORGE 28 4 23
GEORGE 36 9 32
JOHN 8 2 2
JOHN 16 9 11
JOHN 24 0 11
JOHN 32 3 14
JOHN 40 3 17
PAUL 2 5 5
PAUL 10 8 13
PAUL 18 3 16
PAUL 26 7 23
PAUL 34 0 23
RINGO 6 7 7
RINGO 14 9 16
RINGO 22 2 18
RINGO 30 3 21
RINGO 38 3 24
(20 row(s) affected)
Then make a Quirky Update approach:
create function TestRunningTotalGuarded()
returns @ReturnTable table(
CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null
)
as begin
insert into @ReturnTable(CustomerCode, OrderID, Qty, RunningTotal)
select CustomerCode, OrderID, Qty, 0 from Test
order by CustomerCode, OrderID;
declare @RunningTotal int;
declare @RN_check INT = 0;
declare @PrevCustomerCode varchar(50) = NULL;
update @ReturnTable set
@RN_check = @RN_check + 1,
@RunningTotal =
(case when RN = @RN_check then
case when @PrevCustomerCode = CustomerCode then
@RunningTotal + Qty
else
Qty
end
else
1/0
end),
@PrevCustomerCode = CustomerCode,
RunningTotal = @RunningTotal;
return;
end;
Output:
CustomerCode OrderId Qty RunningTotal RN
-------------------------------------------------- ----------- ----------- ------------ -----------
GEORGE 4 7 7 1
GEORGE 12 8 15 2
GEORGE 20 4 19 3
GEORGE 28 4 23 4
GEORGE 36 9 32 5
JOHN 8 2 2 6
JOHN 16 9 11 7
JOHN 24 0 11 8
JOHN 32 3 14 9
JOHN 40 3 17 10
PAUL 2 5 5 11
PAUL 10 8 13 12
PAUL 18 3 16 13
PAUL 26 7 23 14
PAUL 34 0 23 15
RINGO 6 7 7 16
RINGO 14 9 16 17
RINGO 22 2 18 18
RINGO 30 3 21 19
RINGO 38 3 24 20
(20 row(s) affected)
Then make a Running Total via Cursor approach:
create function TestRunningTotalCursor()
returns @ReturnTable table(
CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null
) as
begin
declare @c_CustomerCode varchar(50);
declare @c_OrderID int;
declare @c_qty int;
declare @PrevCustomerCode varchar(50) = null;
declare @RunningTotal int = 0;
declare o_cur cursor for
select CustomerCode, OrderID, Qty from Test
order by CustomerCode, OrderID;
open o_cur;
fetch next from o_cur
into @c_CustomerCode, @c_OrderID, @c_Qty;
while @@FETCH_STATUS = 0 begin
if @c_CustomerCode = @PrevCustomerCode begin
set @RunningTotal = @RunningTotal + @c_qty;
end
else begin
set @RunningTotal = @c_Qty;
end
set @PrevCustomerCode = @c_CustomerCode;
insert into @ReturnTable(CustomerCode, OrderId, Qty, RunningTotal)
values(@c_CustomerCode, @c_OrderID, @c_Qty, @RunningTotal);
fetch next from o_cur
into @c_CustomerCode, @c_OrderID, @c_Qty;
end
close o_cur;
deallocate o_cur;
return;
end;
Output:
CustomerCode OrderId Qty RunningTotal
-------------------------------------------------- ----------- ----------- ------------
GEORGE 4 7 7
GEORGE 12 8 15
GEORGE 20 4 19
GEORGE 28 4 23
GEORGE 36 9 32
JOHN 8 2 2
JOHN 16 9 11
JOHN 24 0 11
JOHN 32 3 14
JOHN 40 3 17
PAUL 2 5 5
PAUL 10 8 13
PAUL 18 3 16
PAUL 26 7 23
PAUL 34 0 23
RINGO 6 7 7
RINGO 14 9 16
RINGO 22 2 18
RINGO 30 3 21
RINGO 38 3 24
(20 row(s) affected)
After verifying all those queries are correct, now let's bump the number of rows to 5,000. Here are the metrics:
* Recursive CTE : 49 seconds
* Quirky Update : 0 second
* Cursor : 0 second
Those 0 seconds are not meaningful, let's bump the rows to 50,000 to see a meaningful metrics:
* Quirky Update : 1 second
* Cursor : 3 seconds
* Recursive CTE : An hour
Caveat, quirky update look like a speed demon, but there's some problem with it, with Sql Server(and other RDBMS might be too. Sybase was the first RDBMS that uses the quirky update approach for running total and encourages it, so YMMV if you are using Sybase.
Microsoft SQL Server was a
Sybase before), the order of updating rows is unpredictable when it is given large number of rows. I now see a divide-by-error intermittently on 50,000 rows for about one time in every five run.
CustomerCode OrderId Qty RunningTotal RN
-------------------------------------------------- ----------- ----------- ------------ -----------
Msg 8134, Level 16, State 1, Line 1
Divide by zero error encountered.
The statement has been terminated.
From the metrics and correctness, we should be now comfortable using cursor for running total. No one could paint FUD on this baby.
If someone could enlighten me if making quirky update not quirky is really a lost cause, I would stop here and would not find a way to make the quirky update reliable. Until further notice, I'm scouring the web if there's a way to make the quirky update reliable.
UPDATE
The
pure CTE approach is the
root of all evil :-) When a row-numbering query(that is being joined to a recursive CTE afterwards) is materialized to an actual table first before being joined to a recursive CTE, the speed goes up
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * into #xxx
from test;
with T AS
(
select * from #xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
drop table #xxx;
Using view is also as slow as the pure CTE, also an hour:
create view xxx
as
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, *
from test;
go
with T AS
(
select * from xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
To speed up that view, materialize the view to an actual table:
create view xxx
as
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, *
from test;
go
select * into #xxx from xxx;
with T AS
(
select * from #xxx
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
CTE is not slow, not understanding how queries work is what makes your query slow.
To recap, prior to
materializing the row numbering to an actual table(i.e. temporary table), here are the metrics on 50,000 rows:
* Quirky Update : 1 second
* Cursor : 3 seconds
* Recursive CTE(Pure) : An hour
Here are the new metrics after row numbering table materialization (50,000 rows):
* Quirky Update : 1 second
* Cursor : 3 seconds
* Recursive CTE(Hybrid) : 2 seconds (inclusive of row numbering table materialization)
UPDATE May 29, 2012 8:00 PM
The quirky update is not so quirky after all. Just by putting a clustered primary key on sequential row number, SQL Server UPDATE is able to do an orderly update of rows.
The only change needed be done is to put a
not null primary key clustered
alter function TestRunningTotalGuarded()
returns @ReturnTable table(
CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null primary key clustered
)
I tried looping it 100 times, didn't encounter any divide-by-zero error
declare @i int = 0;
while @i < 100 begin
select * from dbo.TestRunningTotalGuarded() order by CustomerCode, OrderId
set @i = @i + 1;
end
And it's still the same 1 second query, and it's faster than Hybrid Recursive CTE.
Here's the metric for 100,000 rows:
Quirky Update : 3 seconds
Hybrid Recursive CTE : 5 seconds
Cursor : 6 seconds
Quirky update is not so quirky after all.
UPDATE July 27, 2013 10:47 AM
Table variable is faster by a few milliseconds...
declare @t table
(
RN INT,
OrderID int,
CustomerCode varchar(50),
Qty int,
primary key(CustomerCode, RN)
);
insert into @t(RN, OrderID, CustomerCode, Qty)
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;
with T AS
(
select * from @t
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
...than temporary table:
create table #t
(
RN INT,
OrderID int,
CustomerCode varchar(50),
Qty int,
primary key(CustomerCode, RN)
);
insert into #t(RN, OrderID, CustomerCode, Qty)
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;
with T AS
(
select * from #t
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
drop table #t
It's not possible to create a temporary table inside a function, only a table variable is possible:
create function TestRunningTotalMaterialized()
returns @ReturnTable table(
CustomerCode varchar(50), OrderId int, Qty int, RunningTotal int not null, RN int identity(1,1) not null
)
as
begin
declare @t table
(
RN INT,
OrderID int,
CustomerCode varchar(50),
Qty int,
primary key(CustomerCode, RN)
);
insert into @t(RN, OrderID, CustomerCode, Qty)
select ROW_NUMBER() over(partition by CustomerCode order by OrderID) as rn, * from test;
with T AS
(
select * from @t
)
,R(CustomerCode, Rn, OrderId, Qty, RunningTotal) as
(
select CustomerCode, Rn, OrderID, Qty, Qty
from t
where rn = 1
union all
select t.CustomerCode, t.Rn, t.OrderId, t.Qty, p.RunningTotal + t.Qty
from t t
join r p on p.CustomerCode = t.CustomerCode and t.rn = p.rn + 1
)
insert into @ReturnTable(CustomerCode, OrderId, Qty, RunningTotal)
select R.CustomerCode, R.OrderId, R.Qty, R.RunningTotal from r
order by R.CustomerCode, R.OrderId
option(maxrecursion 0);
return;
end;
In terms of performance, even we use table variable on Hybrid Recursive CTE, things stay the same: Quirky Update > Hybrid Recursive CTE > Cursor