Saturday, July 27, 2013

Different methods to execute a large query in small batches

Today I'll show you several different methods to run a DML operation like UPDATE in small batches against a large table.  Though I'm going to show only UPDATE statements, the methods apply to DELETE and INSERT as well.  When you want to run an UPDATE query (or another DML operation) over a very large table, one with hundreds of millions or even a billion rows, it is recommended that you do not update it in one go.  DML queries over a very large number of rows can cause performance issues: the transaction log might get full, concurrent users are affected, lots of server resources are eaten up, and the query can run for hours.  Good practice is to break the records into small batches of a few thousand and update them batch by batch.  This way you use minimal server resources and, most importantly, you prevent the transaction log file from getting full.

First let's create a table and insert some sample data.

--------------------------------------------------------------------------------------------------------

-- Create the table
CREATE TABLE #A
(
ID INT,
CREATED_DATE  DATE,
METHOD VARCHAR(10)
)

-- Insert 10000 rows in the table
;WITH MANGAL AS
(
SELECT 1 AS ID, CONVERT(DATE,'20000101') AS CREATED_DATE
UNION ALL
SELECT ID+1, DATEADD(DD, 1, CREATED_DATE)
FROM MANGAL
WHERE ID <= 9999

)
INSERT INTO #A(ID, CREATED_DATE)
SELECT ID, CREATED_DATE
FROM MANGAL

OPTION (MAXRECURSION 10000);

-- Check sample data
SELECT *
FROM #A

--------------------------------------------------------------------------------------------------------

A recursive CTE is not the most efficient way to generate sample data, but I like it, so I'm using it.
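
If the recursion ever gets in the way, a set-based row generator is a common alternative to the CTE insert above.  Here is a minimal sketch under the same assumptions (the #A table from above; sys.all_columns is used only as a convenient row source):

--------------------------------------------------------------------------------------------------------

-- Generate 10000 sequential rows without recursion
;WITH N AS
(
    SELECT TOP (10000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
    FROM sys.all_columns AS A
    CROSS JOIN sys.all_columns AS B
)
INSERT INTO #A (ID, CREATED_DATE)
SELECT ID, DATEADD(DD, ID - 1, CONVERT(DATE, '20000101'))
FROM N;

--------------------------------------------------------------------------------------------------------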

I. Using TOP and GO:

This is the simplest method to run a query in small batches.  I frequently use it in development when I want to update some records quickly without thinking much.  Note the GO 10 in the query: an integer after GO executes the preceding batch the specified number of times.  In this case the UPDATE statement will execute 10 times (I know there are 10000 rows in the table and I'm using TOP 1000, simple math).

------------------------------------------------------------------------------------------------------
UPDATE TOP (1000) #A
SET METHOD = 'I'
WHERE METHOD IS NULL
GO 10

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

All the queries are quite simple and self-explanatory, so I'm not going to explain them in detail.

II. Using ROWCOUNT and GO:

Same query, but without the TOP operator.  Here I'm using SET ROWCOUNT to limit the number of rows in each batch.  ROWCOUNT causes SQL Server to stop processing the query after the specified number of rows has been affected.

------------------------------------------------------------------------------------------------------
SET ROWCOUNT 1000
UPDATE #A
SET METHOD = 'II'
WHERE METHOD <> 'II'
GO 10

SET ROWCOUNT 0

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

IMPORTANT:  Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax.

III. Using TOP and GOTO:

Now instead of GO I'll use GOTO to run the batch multiple times.  The GOTO statement causes execution of the T-SQL batch to jump to the specified label, skipping any statements in between, and processing continues from that label.  Here I'll use GOTO to keep re-running the labeled UPDATE until its @@ROWCOUNT becomes zero.

------------------------------------------------------------------------------------------------------

UPDATE_BATCH:

UPDATE TOP (3000) #A
SET METHOD = 'III'
WHERE METHOD <> 'III'
IF @@ROWCOUNT > 0  GOTO UPDATE_BATCH

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

IV. Using ROWCOUNT and GOTO:

Same GOTO concept with ROWCOUNT.

------------------------------------------------------------------------------------------------------

SET ROWCOUNT 3500
UPDATE_BATCH:

UPDATE  #A
SET METHOD = 'IV'
WHERE METHOD <> 'IV'
IF @@ROWCOUNT > 0  GOTO UPDATE_BATCH

SET ROWCOUNT 0
-- verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

V. Using WHILE and BREAK:

Here I'm taking the help of an infinite WHILE loop (WHILE 1=1) to run the batch multiple times, and using BREAK to exit the loop when @@ROWCOUNT becomes zero.

------------------------------------------------------------------------------------------------------

WHILE (1=1)
BEGIN

    BEGIN TRANSACTION
       
        UPDATE TOP(3333) #A
        SET METHOD = 'V'
        WHERE METHOD <> 'V'
   
        IF @@ROWCOUNT = 0
            BEGIN
            COMMIT TRANSACTION
            BREAK
        END
    COMMIT TRANSACTION
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

The same can be implemented using ROWCOUNT, as I have already shown.

VI. Using a Sequence column and WHILE:

Now an entirely different method, using no GO, GOTO or BREAK.  This method is more systematic, and you have more control over how the query executes.  Note that when you use TOP or ROWCOUNT you actually have no control over which rows get updated; that is completely dependent on the query plan created by the query engine.

------------------------------------------------------------------------------------------------------

DECLARE @C INT   -- current lower bound of the batch
DECLARE @I INT   -- batch size (increment)
DECLARE @L INT   -- upper bound of the batch
DECLARE @MAX INT -- highest ID in the table

SET @C = 1
SET @I = 2200
SET @L = @C + @I
SET @MAX = (SELECT MAX(ID) FROM #A)

SELECT @C AS C, @I AS I, @L AS L, @MAX AS MAX

WHILE @C <= @MAX -- <= so the highest ID is not missed when a batch boundary lands exactly on @MAX
BEGIN 

    UPDATE #A
    SET METHOD = 'VI'
    WHERE ID >= @C
    AND ID < @L

    SET @C = @L -- or SET @C = @C + @I
    SET @L = @L + @I

END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

VII. Using a date range and WHILE:

On similar lines, if we don't have a reliable sequence column, or we simply prefer a date column, we can update the records by month or year.

------------------------------------------------------------------------------------------------------

DECLARE @C DATE
DECLARE @I DATE
DECLARE @MAX DATE

SELECT    @C = MIN(CREATED_DATE),
        @MAX = MAX(CREATED_DATE)        
FROM #A

SET @I =  DATEADD(YEAR, DATEDIFF(YEAR, 0, @C)+1, 0)

--SELECT @C AS C, @I AS I , @MAX AS MAX

WHILE @C <= @MAX

BEGIN
    UPDATE #A
    SET METHOD = 'VII'
    WHERE CREATED_DATE >= @C
    AND CREATED_DATE < @I

SET @C = @I
SET @I = DATEADD(YEAR, 1, @I)

--SELECT @C, @I
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

VIII.  Using WHILE without GO, GOTO or BREAK:

One last method from me without GO, GOTO or BREAK, but relatively simpler compared to the last two methods.  I include it because some people will advise you not to use GO, GOTO and BREAK in SQL.

------------------------------------------------------------------------------------------------------

DECLARE @ROWCOUNT INT = 1
DECLARE @C INT = 2800

WHILE (@ROWCOUNT > 0)
BEGIN
    UPDATE TOP (@C) #A
    SET METHOD = 'VIII'
    WHERE METHOD <> 'VIII'

    SET @ROWCOUNT = @@ROWCOUNT
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

Same can be done using ROWCOUNT.

IX. Final

Well, this is a query I found some 4-5 years back when I was searching for the same thing on Google.  I don't remember where I saw it or who wrote it.

------------------------------------------------------------------------------------------------------
DECLARE @C INT
DECLARE @E INT

SET ROWCOUNT 1000
WHILE (1=1)

BEGIN
   BEGIN TRANSACTION
  
   UPDATE #A
   SET METHOD = 'IX'
   WHERE METHOD <> 'IX'

   SELECT @C = @@ROWCOUNT, @E = @@ERROR -- capture both before the next statement resets them

   IF @E <> 0
    BEGIN
    PRINT 'Problem updating the records'
    ROLLBACK TRANSACTION
    BREAK
    END

   IF @C = 0
   BEGIN
      COMMIT TRANSACTION
      BREAK
   END
   COMMIT TRANSACTION
END
SET ROWCOUNT 0
-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

Some notes:  Both TOP and ROWCOUNT can be parameterized; we can pass a variable instead of a literal number.  But of course, since the effect of ROWCOUNT on DML statements is deprecated in future SQL Server versions, one should avoid relying on it.  I was being lazy and didn't use transactions (COMMIT and ROLLBACK) wherever possible in the above examples, but one should use them to add more control and clarity to the code.
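
Method VIII already shows TOP taking a variable; here is a minimal sketch of the ROWCOUNT equivalent (the batch size of 1000 is just an example value):

------------------------------------------------------------------------------------------------------

DECLARE @BATCHSIZE INT = 1000

-- SET ROWCOUNT accepts a variable as well
SET ROWCOUNT @BATCHSIZE

UPDATE #A
SET METHOD = 'IX'
WHERE METHOD <> 'IX'

-- Always reset ROWCOUNT when done
SET ROWCOUNT 0

------------------------------------------------------------------------------------------------------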

I have not really tested which query is better; I leave that up to you.  The idea was to share different methods to execute a query in small batches.  But I would go for method VI or VII in a production environment.  Do let me know your comments and suggestions, and what you think of all these methods.

Namaste
Mangal

Sunday, November 14, 2010

ORDER BY Enhancements in SQL Server

Today I am going to talk about one new feature that has been introduced in SQL Server Denali, and that is an interesting enhancement to the ORDER BY clause.  With the ORDER BY clause you can now also specify the OFFSET and FETCH options.

From Books Online of Denali:
OFFSET: Specifies the number of rows to skip before it starts to return rows from the query expression. The value can be an integer constant or expression that is greater than or equal to zero.

FETCH: Specifies the number of rows to return after the OFFSET clause has been processed. The value can be an integer constant or expression that is greater than or equal to one.

This is how the new syntax of the ORDER BY clause looks in SQL Server Denali:
------------------------------------------------------------------------------------------------------
ORDER BY order_by_expression
    [ COLLATE collation_name ]
    [ ASC | DESC ]
    [ ,...n ]
[ <offset_fetch> ]
<offset_fetch> ::=
{
    OFFSET { integer_constant | offset_row_count_expression } { ROW | ROWS }
    [
      FETCH { FIRST | NEXT } {integer_constant | fetch_row_count_expression } { ROW | ROWS } ONLY
    ]
}

------------------------------------------------------------------------------------------------------

Take a look at following example:

------------------------------------------------------------------------------------------------------
USE AdventureWorks
GO
SELECT VendorID
    ,Name
FROM Purchasing.Vendor
ORDER BY VendorID OFFSET 5 ROWS
GO

------------------------------------------------------------------------------------------------------

Output:

(screenshot: result set starting at VendorID 6)

In the example I wrote 5 as the OFFSET.  You can see in the output that the first five rows are skipped and we get VendorIDs starting from 6.  SQL Server first orders the data on the column specified in the ORDER BY clause (i.e. VendorID), and the OFFSET 5 ROWS clause then skips the first 5 rows and returns all remaining rows.

Now let's see an example of FETCH:

------------------------------------------------------------------------------------------------------
USE AdventureWorks
GO
SELECT VendorID
    ,Name
FROM Purchasing.Vendor
ORDER BY VendorID OFFSET 5 ROWS
    FETCH NEXT 3 ROWS ONLY
GO

------------------------------------------------------------------------------------------------------

Output:

(screenshot: rows with VendorID 6, 7 and 8)

As you can see, SQL Server has fetched only 3 rows, and only after skipping the first five rows.  That is because I specified 5 as OFFSET and 3 as FETCH NEXT.  Here FETCH NEXT 3 ROWS ONLY limits the rows returned to 3 rows from the sorted result set.

Also:
1. offset_row_count_expression can be a variable, parameter, or constant scalar subquery. When a subquery is used, it cannot reference any columns defined in the outer query scope. That is, it cannot be correlated with the outer query.

2. ROW and ROWS are synonyms and are provided for ANSI compatibility.

3. In query execution plans, the offset row count value is displayed in the Offset attribute of the TOP query operator.
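
Since the row count values can be expressions (see point 1 above), OFFSET/FETCH lends itself naturally to paging.  A minimal sketch, assuming the same AdventureWorks table and made-up page values:

------------------------------------------------------------------------------------------------------
USE AdventureWorks
GO
DECLARE @PageNumber INT = 2,
        @PageSize   INT = 10;

SELECT VendorID
    ,Name
FROM Purchasing.Vendor
ORDER BY VendorID
    OFFSET (@PageNumber - 1) * @PageSize ROWS
    FETCH NEXT @PageSize ROWS ONLY;
GO

------------------------------------------------------------------------------------------------------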

You can find all this information and more on: ORDER BY Clause (Transact-SQL)

Thanks
Mangal

Wednesday, September 1, 2010

Some Things About VIEWS

Everybody has different views about VIEWS; that's what makes them an interesting topic to discuss.  The other thing is, if you make assumptions about VIEWS, they can lead you to problems.  As my target audience is SQL beginners, today I'll talk about a few things about VIEWS so some of the obvious mistakes can be avoided.  Actually there are many more things we could discuss about VIEWS, but I leave them for future posts.

Today I'll show what happens when you create a view using “SELECT * FROM TableName” and then ALTER the underlying table used in the VIEW.  The normal assumption most SQL beginners make is: if you create a VIEW using “SELECT * FROM TableName”, all the changes made to the underlying table will automatically be reflected in the VIEW as well.

Case 1: You add a column to the table. 
Now here is a question for you: if you have a table named Employees and a VIEW created on top of it with the simple query “SELECT * FROM Employees”, and I add one column to the Employees table, will that column appear in the result if I execute “SELECT *” against the view?

I have asked this question many times in interviews; 80% of the time I heard a thumping “Yes”, and most of the candidates had well over 3 years of experience.  The answer to the above question is a BIG NO.  Let me explain this with an actual example; first let's create some sample data.

------------------------------------------------------------------------------------------------------

-- Create the table with 4 columns
CREATE TABLE Employees
(
EmpID INT,
FirstName VARCHAR(6),
LastName VARCHAR(20),
ManagerID INT

)
GO

-- Populate the table with sample data
INSERT INTO Employees SELECT
1, 'Mangal', 'Pardeshi', 0 UNION ALL SELECT
2, 'Rahul', 'Sharma', 1 UNION ALL SELECT
3, 'Ajay', 'Varma',2
GO

-- Create a View with wild card *
CREATE VIEW Emp AS
SELECT *
FROM Employees
GO

-- Verify the data in the table and view
SELECT *
FROM Employees
GO
SELECT *
FROM Emp
GO

------------------------------------------------------------------------------------------------------

Data in the table and view:
(screenshot: identical result sets from Employees and Emp)

As you can see, the data in the Table and the View is identical, and so are the number of columns.  Now let's add a column HireDate to the table and see what happens.

------------------------------------------------------------------------------------------------------
-- Add HireDate Column in Employees Table
ALTER TABLE Employees ADD HireDate DATETIME
GO

-- Update the HireDate column with some date
UPDATE Employees
SET HireDate = '20100827'
GO

-- Again Verify the data in table and View
SELECT *
FROM Employees
GO
SELECT *
FROM Emp
GO
------------------------------------------------------------------------------------------------------

(screenshot: Employees shows the new HireDate column; Emp does not)

As you can see, the HireDate column didn't appear in the VIEW.  So the question arises: why doesn't a VIEW get refreshed when I add a column to the table?
The short answer is that when you create the VIEW, the column information/definition of the VIEW (the metadata of the VIEW) gets stored in system tables at creation time.  And that metadata doesn't get refreshed when you alter the underlying table; you have to explicitly refresh the metadata of the VIEW.
So the next question is: how do you refresh the VIEW once you have modified the underlying table?
There are 2 ways to refresh the VIEW:
1.  Using the system stored procedure sp_refreshview
From Books Online - sp_refreshview: Updates the metadata for the specified non-schema-bound view. Persistent metadata for a view can become outdated because of changes to the underlying objects upon which the view depends.
Syntax: EXECUTE sp_refreshview 'viewname'

2. Or by executing an ALTER VIEW statement.  When you ALTER the VIEW, SQL Server will pick up the latest column definition from the underlying table and update the VIEW metadata.

So in our case we can refresh the VIEW Emp by:

------------------------------------------------------------------------------------------------------
-- 1. using system stored procedure sp_refreshview
EXECUTE sp_refreshview 'Emp'
GO
-- 2. Or by executing the ALTER statement
ALTER VIEW Emp
AS
SELECT *
FROM Employees
GO

-- Verify the data in View
SELECT *
FROM Emp
GO
------------------------------------------------------------------------------------------------------

Note you don't need to execute both queries; either of the above will do the trick.  Now you can see the HireDate column in the Emp view as well:
(screenshot: Emp now includes HireDate)

Case 2: You drop a column from the table.
Similarly, when you drop a column from the table, the VIEW definition doesn't get updated even though you used the wild card “*” in the VIEW definition.  Now let's drop the column HireDate from the table and see what happens.
Note: In the previous step I refreshed the View after adding HireDate, so HireDate is now part of the Emp view as well.

------------------------------------------------------------------------------------------------------
-- First Verify the data before dropping column
SELECT *
FROM Employees
GO
SELECT *
FROM Emp

-- Drop HireDate column from Table
ALTER TABLE Employees DROP COLUMN Hiredate
GO

-- Verify the data after dropping the column.
SELECT *
FROM Employees
GO
SELECT *
FROM Emp
GO

------------------------------------------------------------------------------------------------------

Now in this case the HireDate column gets removed from the table, but the metadata of the VIEW still has its information stored, so you will get the following error on selecting data from the VIEW.
Msg 4502, Level 16, State 1, Line 1
View or function 'Emp' has more column names specified than columns defined.

The solution is, again, to refresh the VIEW definition by executing:

------------------------------------------------------------------------------------------------------
EXECUTE sp_refreshview 'Emp'
GO

------------------------------------------------------------------------------------------------------

Now you won’t get the same error.

Case 3: You drop one or more columns and add an equal or greater number of columns to the table.
This case is even more dangerous, as a user selecting data from the VIEW can get wrong data under wrong columns, which creates confusion.  I won't explain this in detail; just execute the following queries and you will see what I'm saying.

------------------------------------------------------------------------------------------------------
-- Drop column LastName from table
ALTER TABLE Employees DROP COLUMN LastName
GO
-- Add Column DeptName with default value 'IT'
ALTER TABLE Employees ADD DeptName VARCHAR(10) NOT NULL DEFAULT 'IT'
GO

-- Verify the data
SELECT *
FROM Employees
GO
SELECT *
FROM Emp
GO
------------------------------------------------------------------------------------------------------

(screenshot: the view returns the table's data under stale column names)

Now if you look, the LastName column is no longer present in the table, and you can see the DeptName column in the table.  The interesting observation with the VIEW is that though the data exactly matches the table, the column names are not correct.  We again need to refresh the VIEW to correct it.

------------------------------------------------------------------------------------------------------
EXECUTE sp_refreshview 'Emp'
GO

-- Verify the data
SELECT *
FROM Emp
GO
------------------------------------------------------------------------------------------------------

Back to normal now:
(screenshot: Emp with the correct columns again)

What is the solution? 
The obvious prevention is: don't use the wild card “*” while creating VIEWS.  But even listing out the columns is just a prevention, or I'd say a good practice, because even after listing out the columns, if you drop a column from the table that has been used in any VIEW, you will still face problems.

The solution is creating the view using “WITH SCHEMABINDING” option.
From Books Online:  Binds the view to the schema of the underlying table or tables. When SCHEMABINDING is specified, the base table or tables cannot be modified in a way that would affect the view definition. The view definition itself must first be modified or dropped to remove dependencies on the table that is to be modified. When you use SCHEMABINDING, the select_statement must include the two-part names (schema.object) of tables, views, or user-defined functions that are referenced. All referenced objects must be in the same database.

Views or tables that participate in a view created with the SCHEMABINDING clause cannot be dropped unless that view is dropped or changed so that it no longer has schema binding. Otherwise, the Database Engine raises an error. Also, executing ALTER TABLE statements on tables that participate in views that have schema binding fail when these statements affect the view definition.

Let me show you one example with SCHEMABINDING:

------------------------------------------------------------------------------------------------------
-- Create View with SCHEMABINDING
CREATE VIEW NewEmp
WITH SCHEMABINDING
AS
SELECT
EmpID,
    FirstName
FROM dbo.Employees
GO

-- Verify the data
SELECT *
FROM NewEmp
GO

-- Now try to alter the table
ALTER TABLE Employees DROP COLUMN EmpID
GO
------------------------------------------------------------------------------------------------------

Now you will get the error:
Msg 5074, Level 16, State 1, Line 2
The object 'NewEmp' is dependent on column 'EmpID'.
Msg 4922, Level 16, State 9, Line 2
ALTER TABLE DROP COLUMN EmpID failed because one or more objects access this column.

Basically, WITH SCHEMABINDING has prevented the change that would affect the view definition.

Additional Information: 
A] If you want to see all the dependent objects on a particular table, you can use the following script:

------------------------------------------------------------------------------------------------------
SELECT *
FROM sys.objects AS so
INNER JOIN sys.sql_expression_dependencies AS sed
    ON so.object_id = sed.referencing_id
WHERE so.type = 'V' AND sed.referenced_id = OBJECT_ID('dbo.employees')

GO
------------------------------------------------------------------------------------------------------

B] The following example creates a script that refreshes the metadata for all views that have a dependency on table dbo.employees:

------------------------------------------------------------------------------------------------------
SELECT DISTINCT 'EXEC sp_refreshview ''' + name + ''''
FROM sys.objects AS so
INNER JOIN sys.sql_expression_dependencies AS sed
    ON so.object_id = sed.referencing_id
WHERE so.type = 'V' AND sed.referenced_id = OBJECT_ID('dbo.employees')
GO

------------------------------------------------------------------------------------------------------

Source:
CREATE VIEW (Transact-SQL)
sp_refreshview (Transact-SQL)

Namaste!
- Mangal

Tuesday, August 24, 2010

Convert Seconds to HH:MM:SS

Today 2 of my colleagues from the reporting team had this requirement: they had a table where seconds were stored as INT, and in a report they wanted to convert the seconds to HH:MM:SS format.  They already had one solution ready with them.  It was something like:

---------------------------------------------------------------------------------------------------------------

DECLARE @Seconds INT
SET @Seconds = 3800

SELECT CONVERT(VARCHAR(10),@Seconds/3600) 
    +':'
    + RIGHT('00'+CONVERT(VARCHAR(2),(@Seconds%3600)/60),2)
    +':'
    + RIGHT('00'+CONVERT(VARCHAR(2),@Seconds%60),2) AS [HH:MM:SS]
  GO

-----------------------------------------------------------------------------------------------------------------

-- The output --
01:03:20

It was working fine, but my colleagues were looking for something different, something elegant.  And I jumped in to help them; these days I hardly get any chance to write SQL, so I don't let such opportunities go.  I had one solution in mind using CONVERT and style 108:

---------------------------------------------------------------------------------------------------------------

DECLARE @Seconds INT
SET @Seconds = 3800
SELECT CONVERT(VARCHAR(8), DATEADD(SS,@Seconds,0),108) AS [HH:MM:SS]
GO

---------------------------------------------------------------------------------------------------------------

But the problem with the above query is that it fails when the number of seconds is more than 86399 (there are 86400 seconds in a day).  So if the number of seconds is 86400, the above query will show 00 hours instead of 24 hours.  See the following example:

---------------------------------------------------------------------------------------------------------------

DECLARE @Seconds INT
SET @Seconds = 86500
SELECT CONVERT(VARCHAR(8), DATEADD(SS,@Seconds,0),108)  AS [HH:MM:SS]
GO

---------------------------------------------------------------------------------------------------------------

-- The output--
00:01:40

Here the expected output was 24:01:40.  So I modified the above query a little and came up with this:

---------------------------------------------------------------------------------------------------------------

DECLARE @Seconds INT, @Hour VARCHAR(10)
SET @Seconds = 86500
SET @Hour = DATEDIFF(HH, 0, DATEADD(SS,@Seconds,0))
SELECT STUFF(CONVERT(VARCHAR(8), DATEADD(SS,@Seconds,0),108),1,2,@Hour)  AS [HH:MM:SS]
GO

---------------------------------------------------------------------------------------------------------------

-- The output--
24:01:40

Just in case you want to run the above query on a table:

---------------------------------------------------------------------------------------------------------------

SELECT STUFF(CONVERT(VARCHAR(8), DATEADD(SS, Seconds,0),108),1,2, DATEDIFF(HH, 0, DATEADD(SS, Seconds,0)))  AS [HH:MM:SS]
FROM TableName
GO

---------------------------------------------------------------------------------------------------------------
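
To sanity-check the expression around the one-day boundary, here is a minimal sketch using a derived table of made-up values (the VALUES row constructor needs SQL Server 2008 or later, and I convert the hour part explicitly just to be safe about implicit conversion inside STUFF):

---------------------------------------------------------------------------------------------------------------

SELECT Seconds,
       STUFF(CONVERT(VARCHAR(8), DATEADD(SS, Seconds, 0), 108), 1, 2,
             CONVERT(VARCHAR(10), DATEDIFF(HH, 0, DATEADD(SS, Seconds, 0)))) AS [HH:MM:SS]
FROM (VALUES (59), (3800), (86399), (86400), (86500)) AS T(Seconds)
GO

---------------------------------------------------------------------------------------------------------------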

I don't know which approach is better; rather, you tell me which one you like.  Looking at them, I think both queries will give almost identical performance; it's just that the 1st query looks like a mathematical solution while my approach looks like a SQL solution.  If you know any other approach, please feel free to share.

Namaste!
- Mangal

Friday, August 20, 2010

SQL Myth: Primary Key and Clustered Index

After remaining quiet for almost a year I'm back with what I enjoy the most: talking about SQL and sharing whatever little knowledge I have.  Many SQL Developers have this misconception: “Primary key => Clustered Index: only a Clustered Index can exist on a Primary key column.”  On numerous occasions I have had a tough time explaining that this is not the case every time; you can create a Non-Clustered Index on a primary key column.  But when this hot discussion is going on across a coffee table and I'm away from a computer, I am helpless.  So finally I decided to write about it.

You can create a Non-Clustered Index on a primary key column.  Or, to put it in myth-buster words: “A primary key column can exist/survive without a Clustered Index.”  Yes, it is a fact that PRIMARY KEY constraints default to a CLUSTERED index.  But it doesn't mean that you CAN'T create a non-clustered index on a Primary Key column.  And you can also create a Clustered Index on a non-primary-key column.  Let me show you this with some simple examples.

Case 1:  First let's see what happens when you specify only PRIMARY KEY and nothing else.  In this case, YES, by default a Clustered Index is created on the Primary Key column.

USE tempdb
GO

CREATE TABLE MyTable1
(
Id INT PRIMARY KEY,
Dates DATETIME
)
GO

You can see that, as expected, a clustered index got created on the Id column.  (screenshot 1)

Case 2:  But by just adding the NONCLUSTERED keyword after PRIMARY KEY you can tell SQL Server to create a Non-Clustered Index instead of the default Clustered one.

Here you go:

USE tempdb
GO

CREATE TABLE MyTable2
(
Id INT PRIMARY KEY NONCLUSTERED,
Dates DATETIME
)
GO

As you can clearly see in the image, a non-clustered index got created on the Id column, which is also the primary key.  (screenshot 2)

Case 3:  Now here is a small trick: you can force SQL Server to create a non-clustered index on a primary key column even without writing NONCLUSTERED after it.  Yes, there is an exception to the rule “PRIMARY KEY constraints default to a CLUSTERED index” even if you don't specify NONCLUSTERED.  The question is, how?  Simply by creating a Clustered Index on another column while creating the table.

USE tempdb
GO

CREATE TABLE MyTable3
(
Id INT PRIMARY KEY,
Dates DATETIME UNIQUE CLUSTERED
)
GO

See the image: a Non-Clustered index got created on the Id column and a Clustered one on the Dates column.  (screenshot 3)

The obvious question will be: why didn't SQL Server create the Clustered index on the Id column this time?  The answer is very simple if you know the basic rule “You can have only one Clustered Index on a table”.  Since in the CREATE statement you forced SQL Server to create a clustered index on the Dates column, SQL Server had no choice but to create a non-clustered index on the Id column.
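
If you want to verify the indexes yourself rather than rely on the screenshots, you can query the sys.indexes catalog view.  A minimal sketch, assuming the three tables above were created in the same tempdb session:

USE tempdb
GO

-- type_desc shows CLUSTERED vs NONCLUSTERED; type > 0 filters out the heap entries
SELECT OBJECT_NAME(object_id) AS TableName,
       name AS IndexName,
       type_desc,
       is_primary_key
FROM sys.indexes
WHERE object_id IN (OBJECT_ID('MyTable1'), OBJECT_ID('MyTable2'), OBJECT_ID('MyTable3'))
AND type > 0
GO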

What did we learn today?
Honestly speaking, I didn't tell you anything new; experts and people with good knowledge of SQL Server already knew this.  But there is an interesting lesson here: yes, there are some DEFAULTs set by SQL Server, but that doesn't stop you from telling SQL Server “Boss, enough of your DEFAULTs, now let me take control”.  In the early days of learning SQL we all get into the habit of relying on the DEFAULTs set by SQL Server.  And we get so used to them that we start considering them RULES that can't be broken.

Actually, I think there is no harm in taking a little extra effort, writing some extra keywords, and telling SQL Server: this is what I want, this is what is expected.

You also learned how to create a Clustered Index on a column of your choice.  This can be very useful when you don't want a Clustered Index on the primary key, especially in cases where you are using a GUID as the primary key (I hate them) and you want the Clustered Index created on some other column.

Namaste!
- Mangal.

Wednesday, August 12, 2009

Microsoft SQL Server 2008 R2 CTP

The first community technology preview (CTP) of Microsoft SQL Server 2008 R2 is available for download for MSDN and TechNet subscribers.

For more details about SQL Server 2008 R2 and for other related links See - http://www.microsoft.com/sqlserver/2008/en/us/R2.aspx

If you have any questions regarding SQL Server 2008 R2 you can visit the forums - http://social.msdn.microsoft.com/Forums/en-US/category/sqlserverprerelease

Tuesday, August 11, 2009

ROLLUP and ORDER BY

In one of my previous posts, I discussed how useful the GROUPING function is when writing a ROLLUP/CUBE query.  One more way the GROUPING function can help you is in ordering the results returned by ROLLUP/CUBE queries.

The reason I'm writing this post is that some time back I saw somebody writing a weird ORDER BY clause to get the desired ordering from a query using the ROLLUP operator.  First, he didn't use the GROUPING function in the SELECT statement, and second, his ORDER BY was something like:

ORDER BY CASE ColumnName
            WHEN 'Total Of ColumnName'
            THEN 'Zzzzzzzz'
            ELSE ColumnName
         END

…just to get the subtotal returned by ROLLUP at the bottom of the result set.  Of course, if he had known about GROUPING, he wouldn't have written such a CASE statement in the ORDER BY.

As you may know that GROUPING function returns 1 when the row is added by either the CUBE or ROLLUP operator, and 0 when the row is not the result of CUBE or ROLLUP.  So you can easily use this property of the GROUPING for ordering the result set.

Let's have a look at the following example; first, create some sample data.

-- create sample table Sales

CREATE TABLE Sales
(
ID INT,
FName VARCHAR(30),
Zone VARCHAR(30),
Sale INT
)
GO

-- Load sample data

INSERT INTO Sales SELECT
1, 'Mangal', 'East', 20 UNION ALL SELECT
2, 'Mangal', 'East', 150 UNION ALL SELECT
3, 'Mangal', 'West', 50 UNION ALL SELECT
4, 'Ram', 'East', 45 UNION ALL SELECT
5, 'Ram', NULL, 80 UNION ALL SELECT
6, 'Ram', NULL, 40 UNION ALL SELECT
7, 'Sachin', 'West', 50 UNION ALL SELECT
8, 'Sachin', 'West', 40
GO

-- Test sample data

SELECT Id, FName, Zone, Sale
FROM Sales
GO

The sample data :

ID FName Zone Sale
1 Mangal East 20
2 Mangal East 150
3 Mangal West 50
4 Ram East 45
5 Ram NULL 80
6 Ram NULL 40
7 Sachin West 50
8 Sachin West 40

And here is the expected output :

FName Zone Total
Mangal East 170
Mangal West 50
Mangal All Zone 220
Ram East 45
Ram Unknown 120
Ram All Zone 165
Sachin West 90
Sachin All Zone 90
All Names All Zone 475

As you can see in the expected output, all the FNames are ordered in ascending order with their total SUM at the bottom, and the same goes for the Zone column.  For ordering the result that way, just use GROUPING(column_name) in the ORDER BY, just before the column name.  See the following query, especially the ORDER BY clause:

SELECT CASE GROUPING(fname)
        WHEN 1 THEN 'All Names'
        ELSE ISNULL(Fname, 'Unknown')
        END AS FName,
        CASE GROUPING(Zone)
        WHEN 1 THEN 'All Zone'
        ELSE ISNULL(Zone, 'Unknown') END as Zone,
        SUM(Sale) AS Total
FROM Sales
GROUP BY Fname, Zone WITH ROLLUP
ORDER BY  GROUPING(fname),FName,GROUPING(Zone),Zone

Simple, isn't it?  Now you don't need to write a CASE statement in the ORDER BY; just use the GROUPING function.  If you will be doing the ordering in the application layer, then you will need to add the GROUPING(fname) and GROUPING(zone) columns to the SELECT list as well, as sketched below.
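
A minimal sketch of that variation, using the same Sales table (the client would sort on the two flag columns first, then on the names):

SELECT  GROUPING(Fname) AS FnameGrouping,
        GROUPING(Zone) AS ZoneGrouping,
        CASE GROUPING(Fname)
        WHEN 1 THEN 'All Names'
        ELSE ISNULL(Fname, 'Unknown')
        END AS FName,
        CASE GROUPING(Zone)
        WHEN 1 THEN 'All Zone'
        ELSE ISNULL(Zone, 'Unknown') END AS Zone,
        SUM(Sale) AS Total
FROM Sales
GROUP BY Fname, Zone WITH ROLLUP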

Mangal

Friday, August 7, 2009

TSQL Challenges

I would just like to thank Jacob Sebastian, a fellow SQL Server MVP and founder of www.tsqlchallenges.com, for offering me an opportunity to be part of the TSQL Challenges team.  I'm really happy to be part of a team consisting of people like Jacob, Alejandro Messa, Peter Larsson (all 3 are SQL Server MVPs), Adam Haines (a moderator of the MSDN SQL Server forums), Rui Carvalho and many other talented people.

Here is a brief description about TSQL Challenges site: TSQL Challenges constantly aim at helping people to enhance their SET based query writing skills. With TSQL Challenges, sometimes you learn stuff that you don’t know, sometimes you will see better ways of doing stuff that you already know and sometimes you will be able to use your expertise to help others to learn TSQL querying skills. Even SQL Server experts love TSQL Challenges because every challenge inspires them to come up with new better ways of solving the given problem.

The Mission: The entire “TSQL Challenge” team will focus on fulfilling our mission; “helping people to enhance their SET based query writing skills”. We will come up with more and more interesting TSQL Challenges that encourages you to look for alternate logics and inspires you to think outside the regular thought process.

I would like to invite my readers to participate in a TSQL Challenge - www.tsqlchallenges.com

Also like to thank Jacob again for a warm welcome and kind word he has put in introduction post - Introducing new “TSQL Challenge” Team Members

So I hope I will come up with some interesting SQL puzzles that will challenge your SQL skills, and that you will have fun solving them.

Mangal

Tuesday, August 4, 2009

SQL Server 2008 Service Pack (SP) 1 on Microsoft Update as a Required Automatic Update

SQL Server 2008 Service Pack 1 will soon be available through Automatic Update starting from September. 

For the latest information you can read the post on the SQL Server Setup blog - SQL Server 2008 Service Pack (SP) 1 on Microsoft Update as a Required Automatic Update

For better understanding of Automatic Update see - Update Your PC Automatically

Wednesday, July 22, 2009

UNION Vs UNION ALL

Many times you may have heard this: “Use UNION ALL over UNION whenever possible.”  The question arises: why?  To answer this question in one statement: UNION ALL performs faster compared to UNION. 
Then the question arises again: why does UNION ALL perform faster?  Also: why “whenever possible”, why not always?

Let me answer the 2nd question first.  Though both UNION and UNION ALL combine the results of two or more queries into a single result set, there is a fundamental difference between the two: UNION returns only a DISTINCT result set, while UNION ALL returns all rows available, including duplicates.

Let's see the following example:

-- create 2 tables A and B.
CREATE TABLE A
(
ID INT,
Names VARCHAR(10)
)
GO
CREATE TABLE B
(
ID INT,
Names VARCHAR(10)
)
GO
-- insert data into table A
INSERT INTO A VALUES(1,'Mangal');
INSERT INTO A VALUES(5,'Sham');
INSERT INTO A VALUES(2,'Ram');

-- insert data into table B
INSERT INTO B VALUES(2,'Ram');
INSERT INTO B VALUES(3,'Shiv');
INSERT INTO B VALUES(4,'John');

-- test sample data
SELECT id, Names
FROM A
GO
SELECT id, Names
FROM B
GO

Here is how the data of tables A and B looks:

(screenshot: sample rows of tables A and B)

Note that the row with id=2 and names=Ram is there in both tables.  That will help us understand the difference between UNION and UNION ALL.  Now let's execute the following 2 queries, the 1st with UNION and the 2nd with UNION ALL.

-- with UNION
SELECT id, Names
FROM A
UNION
SELECT id, Names
FROM B
GO

-- with UNION ALL
SELECT id, Names
FROM A
UNION ALL
SELECT id, Names
FROM B
GO

The result: 

(screenshot: UNION returns 5 rows, UNION ALL returns 6)

Observations:
1. The 1st query with UNION returns 5 rows; the UNION ALL query returns 6 rows.
2. The row for ID=2 (Ram) appears twice in the UNION ALL result set.
3. The result set for UNION is sorted on the ID column.  For UNION ALL, all the rows of table A appear first, followed by the rows of table B (no sort).

As you can see, UNION eliminates any duplicate rows from the final result set while UNION ALL returns all rows available, including duplicates.  That is the cause of UNION being slow: to eliminate duplicates, it has to check every row against the others, and to make this validation possible, here it first sorts the result set on the first available column.  In our example UNION has sorted the result set on the ID column even though I haven't specified any ORDER BY clause.  Notice that the name “Sham” (which is in table A) appears last in the UNION result because it has the highest id, 5, while it appears in the 2nd row of the UNION ALL result.  A look at the query execution plan helps you visualize it better:

(screenshot: execution plan showing a Distinct Sort operator in the UNION query)

As you can see, the cost of the UNION query is 73% compared to 27% for UNION ALL, and the major reason is the “Distinct Sort” that UNION performs to sort and eliminate the duplicate rows, while UNION ALL doesn't bother about sorting or duplicates.  And that is why UNION is slow compared to UNION ALL.

So, again going back to the question: why not use UNION ALL always?  And one more question to be added: when to use which one?

- You should use UNION when you don't want duplicates in your final result set, and you are not sure (or maybe you are sure) that duplicate records exist across the different queries involved in the UNION.

- You should be using UNION ALL when:
1.  You are not bothered about duplicate rows in the result.
2.  You are sure there are no duplicates across the different queries involved in the UNION, e.g. if you are combining results from 2 or more different years (sales orders), with each query returning results for an individual year with some unique id for each row, or combining results for 2 or more different departments.

All this time I've been talking about UNION and UNION ALL as if they are 2 different things altogether.  Are they?  Not exactly.  The reason I say this is that when one of my friends asked me about UNION ALL and I advised him to look in Books Online, he came back complaining that “Books Online doesn't say anything about UNION ALL”.  The reason: he was expecting Books Online to have a separate section dedicated to UNION ALL, as if it were different from UNION.

Actually, ALL is just an optional argument in the UNION syntax.  For more on UNION you can refer to Books Online - http://msdn.microsoft.com/en-us/library/ms180026.aspx.

Mangal

Friday, July 3, 2009

DELETE Vs TRUNCATE

I know this has been done many times, but still, here is something from my side.  What is the difference between DELETE and TRUNCATE?
Well, one reason I wanted to write this post is that on so many blogs and forum threads I keep seeing statements like “TRUNCATE cannot be rolled back”.  This is also one of the most frequently asked questions in interviews.  And since my target audience is people who have just started learning SQL (and NOT SQL experts), I thought I should write this one.
Instead of just focusing on ROLLBACK, I will try to cover all the differences between DELETE and TRUNCATE.
  • Remove Data : First things first, both can remove the data from a table. 
    But a DELETE can be used to remove rows not only from a Table, but also from a VIEW or the result of an OPENROWSET or OPENQUERY, subject to provider capabilities.

  • FROM Clause : With DELETE you can also delete rows from one table/view/rowset_function_limited based on rows in another table by using a second FROM clause, in which you can also write normal JOIN conditions.  Actually, you can turn any SELECT statement that doesn't contain aggregate functions into a DELETE statement by replacing SELECT with DELETE and removing the column names. 
    With TRUNCATE you can't do that.


  • WHERE : A TRUNCATE cannot have a WHERE clause, but a DELETE can.  That means with TRUNCATE you can't delete a specific row or a specific group of rows.
    TRUNCATE TABLE is similar to a DELETE statement with no WHERE clause.

  • Performance : TRUNCATE TABLE is faster and uses fewer system and transaction log resources.
    One of the reasons is the locks used by the two statements.  The DELETE statement is executed using a row lock: each row in the table is locked for deletion.  TRUNCATE TABLE always locks the table and the pages, but not each row.

  • Transaction log : DELETE statement removes rows one at a time and makes individual entries in the transaction log for each row. 
    TRUNCATE TABLE removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log.

  • Pages : After a DELETE statement is executed, the table can still contain empty pages.
    TRUNCATE removes the data by deallocating the data pages used to store the table data.

  • Trigger : TRUNCATE does not activate the DELETE triggers on the table, so you must be very careful while using it.  One should never use TRUNCATE if a DELETE trigger is defined on the table to do some automatic cleanup or logging action when rows are deleted.
  • Identity Column : With TRUNCATE if the table contains an identity column, the counter for that column is reset to the seed value defined for the column.  If no seed was defined, the default value 1 is used.
    DELETE doesn’t reset the identity counter.  So if you want to retain the identity counter, use DELETE instead.

  • Replication : DELETE can be used against a table involved in transactional replication or merge replication, 
    while TRUNCATE cannot be used against tables involved in transactional replication or merge replication.

  • Rollback : A DELETE statement can be rolled back. 
    TRUNCATE can also be rolled back, provided it is enclosed in a TRANSACTION block and the session is not closed; see the sketch after this list. 
    Once the session is closed you won't be able to roll back a TRUNCATE.
  • Restrictions : The DELETE statement may fail if it violates a trigger or tries to remove a row referenced by data in another table with a FOREIGN KEY constraint. If the DELETE removes multiple rows, and any one of the removed rows violates a trigger or constraint, the statement is canceled, an error is returned, and no rows are removed. 
    And if DELETE is used against a View, that View must be an updatable view.

    TRUNCATE cannot be used against a table used in an indexed view. 

    TRUNCATE cannot be used against a table referenced by a FOREIGN KEY constraint, unless the foreign key references the table itself.
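
Since the rollback behavior surprises many people, here is a minimal sketch you can run to see it for yourself (the table name is made up for the demo):

CREATE TABLE TruncateDemo (Id INT)
INSERT INTO TruncateDemo VALUES (1)
INSERT INTO TruncateDemo VALUES (2)

BEGIN TRANSACTION
    TRUNCATE TABLE TruncateDemo
    SELECT COUNT(*) AS RowsAfterTruncate FROM TruncateDemo   -- returns 0
ROLLBACK TRANSACTION

SELECT COUNT(*) AS RowsAfterRollback FROM TruncateDemo      -- returns 2: the TRUNCATE was rolled back

DROP TABLE TruncateDemo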

Source :
DELETE -
http://msdn.microsoft.com/en-us/library/ms189835.aspx 
TRUNCATE -
http://msdn.microsoft.com/en-us/library/ms177570.aspx

I hope you liked this post on the difference between DELETE and TRUNCATE in SQL Server.

Friday, June 26, 2009

PIVOT Multiple Columns

In one of my previous posts I showed you how to UNPIVOT multiple columns.  On similar lines I also wanted to write on “How to PIVOT multiple columns?”, so this post was due for some time.  Actually, I was looking for an efficient way of doing it.  The limitation of the PIVOT operator is that it supports pivoting only on a single column, but you can always have multiple PIVOT operators in the FROM clause.  I tried to create a PIVOT query on multiple columns with multiple PIVOT operators, but at the end of it I found that our old-fashioned CASE expression performs much better than a multiple-PIVOT-operator query.

Even though I'm writing this post on how to write a query with multiple PIVOT operators, my suggestion is to use CASE expressions instead for better performance.  Though personally I like to avoid CASE too; normally I prefer to do this in Reporting Services by creating a Matrix report.  Nowadays almost all reporting tools give you the option of creating a Matrix report, and the good thing about a Matrix report is that, unlike the PIVOT operator, you don't need to hard-code any column values.

If you try to write a PIVOT query with 2 PIVOT operators and use the same column in the FOR clause, you will get an error: Invalid column name NameOfColumn.

Or if you use the same column, but declare it again under a different alias, you still get an error: The column name ValueOfColumn specified in the PIVOT operator conflicts with the existing column name in the PIVOT argument.

So what's the solution?  Declare the same column again, change the values in the column by some constant (you can add a constant, or concatenate some identifier), and assign a new alias to the column.

Let's see the following example; I have used the AdventureWorks database of SQL Server 2005.

USE AdventureWorks
GO
SET ANSI_WARNINGS OFF
SELECT
CustomerId,
        SUM([Q2001]) AS Qty2001,
        SUM([Q2002]) AS Qty2002,
        SUM([V2001]) AS Val2001,
        SUM([V2002]) AS Val2002
FROM (
        SELECT     H.CustomerId,
                SUM(D.OrderQty) AS TotalQty,
                SUM(D.LineTotal) AS TotalVal,
                'Q'+CONVERT(VARCHAR(4),H.OrderDate,120)  AS QYear,
                'V'+CONVERT(VARCHAR(4),H.OrderDate,120)  AS VYear
        FROM Sales.SalesOrderDetail AS D INNER JOIN
        Sales.SalesOrderHeader AS H ON D.SalesOrderId = H.SalesOrderId
        WHERE D.ProductId=771
        AND H.OrderDate >='20010101'
        AND H.OrderDate <'20030101'
        GROUP BY H.CustomerId,
                CONVERT(VARCHAR(4),H.OrderDate,120)
    )Main
PIVOT
    (
        SUM(TotalQty)
        FOR QYear IN ([Q2001],[Q2002])
    ) PQ
PIVOT
    (
        SUM(TotalVal)
        FOR VYear IN ([V2001],[V2002])
    ) PV
GROUP BY CustomerId
ORDER BY CustomerId
GO

The query returns the total quantity and line amount for the years 2001 and 2002 for product id 771 for all customers.  If you look at the query carefully, in the Main sub query the expression CONVERT(VARCHAR(4),H.OrderDate,120) takes the year part out of the OrderDate column.  I have declared the same column twice: the first time I concatenated Q to the year, and the second time V.  Just execute the Main sub query on its own, and it will be easy to understand:

SELECT     H.CustomerId,
                SUM(D.OrderQty) AS TotalQty,
                SUM(D.LineTotal) AS TotalVal,
                'Q'+CONVERT(VARCHAR(4),H.OrderDate,120)  AS QYear,
                'V'+CONVERT(VARCHAR(4),H.OrderDate,120)  AS VYear 
FROM Sales.SalesOrderDetail AS D INNER JOIN 
Sales.SalesOrderHeader AS H ON D.SalesOrderId = H.SalesOrderId 
WHERE D.ProductId=771 
                AND H.OrderDate >='20010101' 
                AND H.OrderDate <'20030101' 
GROUP BY H.CustomerId,
                CONVERT(VARCHAR(4),H.OrderDate,120)

Now we have 2 columns with different values, and we can use them in different PIVOT operators with the same effect; that's what I have done in my 1st query.

Here is a CASE expression version of the same query, which gives much better performance if you scale it to a large amount of data.

USE AdventureWorks
GO
SELECT     H.CustomerId,
               SUM(CASE YEAR(H.OrderDate)
                      WHEN 2001
                      THEN D.OrderQty
                      END) AS Qty2001,
               SUM(CASE YEAR(H.OrderDate)
                      WHEN 2002
                      THEN D.OrderQty
                      END) AS Qty2002,
              SUM(CASE YEAR(H.OrderDate)
                     WHEN 2001
                     THEN D.LineTotal
                     END) AS Val2001,
              SUM(CASE YEAR(H.OrderDate)
                     WHEN 2002
                     THEN D.LineTotal
                     END) AS Val2002
FROM Sales.SalesOrderDetail AS D INNER JOIN
Sales.SalesOrderHeader AS H ON D.SalesOrderId = H.SalesOrderId
WHERE D.ProductId=771
          AND H.OrderDate >='20010101'
          AND H.OrderDate <'20030101'
GROUP BY H.CustomerId
ORDER BY H.CustomerId
GO

You can test the performance of both queries.  If you want to scale the test to larger data you can remove the WHERE conditions I added.  The total execution time for the CASE query is almost half that of the PIVOT query.