Learning Corner #6 – Grouping sets in SQL Server

Grouping sets in SQL Server always comes into handy when you find yourself in the need to perform data aggregation based on groups. The groups are made by defining a set of attributes that the engine will use to split the data.

Because it’s always easier to learn when seeing good examples, we will use the Production.Product table from AdventureWorks2012 database for further exemplification. I am working on a SQL Server 2012 server and if you don’t have AdventureWorks already installed, you can get it from here (also you will find the database available for other versions of SQL Server, depending on your system).

In this article we will cover the clauses that we can use in T-SQL queries to group data sets:

  • GROUP BY
  • GROUPING SETS
  • CUBE
  • ROLLUP

 

GROUP BY

GROUP BY clause is used to define the columns included in the actual data grouping when one or more aggregate functions are used.

Let’s start with an example that uses an aggregate function but does not have an explicit GROUP BY clause.

SQLGroupingSets_image01

The query will return the number of existing products from the Product table and in this case is 504. But what if we want to see the number of products from each subcategory?

SQLGroupingSets_image02

We have included a new column in our query that is used to split the data – ProductSubcategoryID. If we try to run the statement without including a GROUP BY clause the query will fail. Each column that is used to group the data (and implicitly not used in an aggregate function) must be included in the GROUP BY clause. So now we have a separate COUNT of products for each of the subcategories.

In the next example we added another column to the query and obtained even more groups of data. The grouping now takes into account every combination of the values found in ProductSubcategoryID and ProductModelID. We have also included an ORDER BY clause to arrange the data for an easier overview.

SQLGroupingSets_image03

For the next example we will apply aggregate functions on two of the columns and the result will be grouped only based on ProductID. This demonstrates that we can use as many aggregate functions as we like for the same group.

SQLGroupingSets_image04

Because of the order of logical query processing the WHERE clause cannot be used when we are trying to filter data based on the grouping itself. Any statement using the WHERE clause will first filter the input data and only after that will start grouping the result (read more about logical query processing here). In order to apply filters to groups we will need to use the HAVING clause. Let’s take a look at the following example to see how this works.

SQLGroupingSets_image05

We have included a WHERE clause to exclude values with no subcategory but it’s the HAVING clause that has the role to exclude any groups with a sum of products that is less than 5 and higher than 10 from the result set.

Good to mention so far is that:

  • Any column that does not appear in the GROUP BY clause must have an aggregate function applied to it
  • We can also use the DISTINCT option together with the aggregate functions to include only unique elements in the query

 


GROUPING SETS

The first clause that supports multiple grouping sets in T-SQL is called exactly GROUPING SETS.

The GROUPING SETS clause lets us to list multiple combinations based on the columns defined in the clause. To better understand its behavior, let’s have a look at the following example.

SQLGroupingSets_image06

Note that all of the clauses that support multiple grouping sets are used with GROUP BY. The previous query includes three grouping sets. They are enclosed between the main parentheses and then each grouping set is enclosed in its own parentheses and split by commas.

The first grouping splits the data by both CustomerID and the year of the OrderDate. The second one uses only the CustomerID and ignores the year, while the third keeps only of the year when the order was placed. For the second and third grouping sets NULLs are placed for the column values that are not included in the grouping. The result of the query could be also achieved by using an UNION ALL operator and writing three distinct queries with three different grouping sets (also keep in mind that the number of columns should match).


 

CUBE

The second clause used to support multiple grouping sets is CUBE. Its role is to return all possible combinations that can be formed from the list of grouping columns. If it’s not very clear at this moment let’s have a look at the next query.

SQLGroupingSets_image07

Note that we have also used an ORDER BY clause to arrange the result for a better understanding of how CUBE works.

The CUBE clause used the two columns that were passed as input and built all possible grouping sets:

  • CustomerID and YEAR(OrderDate)
  • CustomerID
  • YEAR(OrderDate)
  • none

When none of the columns is used the query group we will get the same result as:

SQLGroupingSets_image12

The other returned combinations are the ones we used in the GROUPING SETS example. If we want to return all the possible grouping sets from a query we can use the CUBE clause instead of writing explicitly all the possibilities in a GROUPING SETS clause. We can save some typing time especially when it’s done with more than two columns.


 

ROLLUP

The ROLLUP is also similar to GROUPING SETS and CUBE, but it is used when there’s a hierarchy that we wish to form from the input elements. While CUBE will return all possible combinations from the input, the ROLLUP clause will eliminate the last column element for each grouping set and structure the result as a hierarchy. Let’s analyze the following query and result.

SQLGroupingSets_image08

The result was ordered for a better overview and you can see that there are 335.974 products in inventory, from which 72.899 are stored in the location with an ID of 1. Going further, there are 2.727 items in shelf A, and 408 of those are in Bin 1, 427 in Bin 2 and so on. The combination using only Self and Bin will not be displayed using the ROLLUP clause because it doesn’t follow the hierarchy used in the clause: LocationID <- Shelf <- Bin (there is a Bin stored on a Shelf at a LocationID).

Good to remember:

  • Although it’s not actually part of the main chapter of this lesson, be aware of the differences between COUNT(*) and COUNT([column_name]). The latter will exclude NULLs from the result.
  • In all the clauses using multiple grouping sets, NULL is used as a placeholder for the columns that are not part of a grouping set. If wehave a column that allows NULL, wecould end up in a situation where we cannot tell if the NULL value is coming from the data or if it is used as a placeholder.

To overcome this issue, we can use the GROUPING and GROUPING_ID functions.

SQLGroupingSets_image09

The GROUPING function returns 0 for an element that is part of a grouping set and 1 when it is not.

SQLGroupingSets_image10

The GROUPING_ID function works in a way similar to how an IP address is read and translated into an integer. You will have a number of bits equal to the number of the columns that form the grouping set. In this case we will have 3 bits. Each of these bits represent 2 raised to the power of the bit position minus 1. The result that is displayed in the GROUPING_ID column represents the sum of the elements that are not part of the grouping sets.

In our example, the first row’s value was obtained using the following calculation: 20 + 21 + 22 = 7, meaning that none of the columns are part of the grouping set.

The second line showing a 3 that means that the first column is now used in the grouping set.

  • You can use multiple GROUPING SETS, CUBE and ROLLUP clauses in the same GROUP BY but you need to separate them by commas.

 

I hope you found this article useful. Now that you know about grouping sets and how they are used you can take the following quiz to test your knowledge.

Leave a Reply

Your email address will not be published. Required fields are marked *

twelve + 4 =