Saturday, December 8, 2012

Query SARGability

A colleague asked how to select from two tables dynamically, something along this line(question was not exactly phrase as the following):


1
2
DECLARE @tableToSelect CHAR(1) = 'y';
select * from @tableToSelect where a < 100;  


If that is even possible, it would be a problem if the similar columns are differently named on the two tables, say PersonID, MencschID (German for PersonID). This can not select all the persons from the American tables:
1
select * from @tableToSelect where MenschID < 100;




I suggested him this:


1
2
3
4
5
6
7
8
9
WITH x AS
(
    SELECT a,b FROM dbo.t
    WHERE @tableToSelect = 't'
    UNION ALL   
    SELECT c,d FROM dbo.y   
    WHERE @tableToSelect = 'y'
)
SELECT * FROM x WHERE a < 100;


A discerning programmer that he is, he quizzed me if that would result to inefficient query. I like that he posed that question and wanted his code to be performant.


If he had asked me that question when I was just a starting developer, I would concur with him. I would naturally assume that it's not efficient to union the tables and filtering it further. I would assume that it would be better if the rows are filtered right from the root source, as illustrated below, and thus enabling both queries of the UNIONed queries to use the available index:


1
2
3
4
5
6
7
8
9
SELECT a,b
FROM t
WHERE @tableToSelect = 't' and a < 100
  
UNION all
  
SELECT c,d
FROM y
WHERE @tableToSelect = 'y' AND c < 100;



Fortunately for us, internet is a wonderful place; an eon ago, I stumble upon an article that says RDBMS doesn't short-circuit conditions: http://weblogs.sqlteam.com/mladenp/archive/2008/02/25/How-SQL-Server-short-circuits-WHERE-condition-evaluation.aspx


So this would result to an error (Conversion failed when converting date and/or time from character string):

1
2
3
4
SELECT   *
FROM     t1
WHERE    ISDATE(val) = 1
         AND CONVERT(DATETIME, val) < GETDATE();



If we try to fool the RDBMS by eagerly filtering the valid dates first, this would still not work, this will still result to conversion error. A proof that WITH or any form of re-arranging the query doesn't eagerly execute the query.


1
2
3
4
5
6
7
8
9
WITH x AS
(
      SELECT   *
      FROM     t1
      WHERE    ISDATE(val) = 1
)
SELECT *
FROM x   
WHERE CONVERT(DATETIME, val) < GETDATE();


That means, the conditions' execution order is not executed based on how you arrange your query. The outer query's conditions could be mixed with the inner query's condition by your RDBMS. Armed with that knowledge in mind, the following two queries are executed similarly:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
SELECT *
FROM t
WHERE @tableToSelect = 't' and a < 100
  
UNION all
  
SELECT *
FROM y
WHERE @tableToSelect = 'y' AND c < 100;
  
 
WITH x AS
(
    SELECT a,b FROM dbo.t
    WHERE @tableToSelect = 't'
    UNION ALL   
    SELECT c,d FROM dbo.y   
    WHERE @tableToSelect = 'y'
)
SELECT * FROM x WHERE a < 100;




That is, the second query(CTE) is expanded to the same query as the first query; thus, the second query's a < 100 expression utilizes all the available indexes on both tables, making the query performant. Given that they are semantically the same, it's better to use the second query as it's easier to maintain, the query's condition is just placed on one location only, there's no code duplication involved. Some shun the table-deriving or CTE approach, as they think enclosing the query on CTE or table-deriving it would deprive the expression a < 100 the SARGability it needed; this is not true, CTE-using queries can still use the available indexes on both tables.



Both queries have exactly the same identical query execution plan:







Here's the supporting DDL:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
SET NOCOUNT ON;
 
CREATE TABLE t(a INT, b INT);
CREATE TABLE y(c INT, d INT);
 
DECLARE @i INT;
 
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO t(a,b) VALUES(@i,0);   
SET @i = @i + 1;
END;
 
INSERT INTO y(c,d) SELECT a,b FROM t;
CREATE INDEX ux_t ON t(a);
CREATE INDEX ux_y ON y(c);
 
SET NOCOUNT OFF;



If you are a keen observer, you'll ask, what will happen to the query with the date detection in it? How to prevent the query from executing the conversion if the field is not a date? We can really do short-circuit in SQL Server explicitly, but that would entails using CASE WHEN, and that would make your query not being SARGable. Here's the short-circuited query:


1
2
3
SELECT *
FROM t1
WHERE CASE WHEN ISDATE(val) = 1 AND CONVERT(DATETIME, val) < GETDATE() THEN 1 END = 1



Another approach, is to use an optimization fence on ISDATE. This way, SQL Server don't need to merge the ISDATE condition with CONVERT, avoiding the conversion failed error.

1
2
3
4
5
6
7
8
9
with x as
(
      SELECT   TOP 4000000000 *
      FROM     t1
      WHERE    ISDATE(val) = 1
)
SELECT *
FROM x   
WHERE CONVERT(DATETIME, val) < GETDATE();
So here's the performance of a non-SARGable query:


As we can see the non-SARGable query is slow compared to the SARGable one. non-SARGable query has a cost of 81% against 19% of SARGable one



That's it, we should strive to make our query use our tables' indexes



Happy Coding! ツ

No comments:

Post a Comment