Common Table Expressions (CTEs) are a standard SQL feature supported across many platforms. They allow you to break down complex queries into simple, modular parts that are easy to read and maintain. Snowflake supports CTEs and provides additional optimizations to enhance their performance.
In this article, we will cover how to write Snowflake CTE and optimize it for best performance. Specifically, we will look at the basic syntax for defining Snowflake CTE, using Snowflake CTE to simplify complex queries with hierarchical data, and finally Optimizing Snowflake CTE for best performance.
What are Snowflake CTEs (Snowflake Common Table Expression)?
Common Table Expression (CTE) is a temporary named result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.
Snowflake Common Table Expression (CTE) is defined using the WITH clause followed by the CTE name and a query that defines the CTE. The CTE can then be used like a regular table or view in a SELECT, INSERT, UPDATE, or DELETE statement.
Snowflake CTE is particularly useful when working with recursive queries—queries that reference themselves. Recursive CTEs can be used to perform complex data processing tasks that would otherwise require multiple separate SQL statements. Notably, in Snowflake, CTEs aren't materialized—they aren't stored in the database but re-evaluated upon each reference, offering a flexible approach to deconstructing complex queries.
Here is a quick example of a Snowflake CTE:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_students AS (
SELECT student_id, first_name, last_name
FROM students
)
SELECT *
FROM CTE_students
WHERE class_id = 100;
As you can see, this Snowflake CTE first creates a temporary view called CTE_students that contains the student ID, first name, and last name of all students in the students table. Then, the CTEs are used in the outer SELECT statement to select all of the rows from the temporary view where the student ID is 100.
Keep in mind that Snowflake CTE can be separated by commas—allowing us to define multiple by simply using comma(,) delimiters.
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH
cte1 AS (
SELECT ...
),
cte2 AS (
SELECT ...
)
SELECT *
FROM cte1
JOIN cte2
Also, you can nest Snowflake CTE within other CTEs:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH outer_cte AS (
WITH inner_cte AS (
SELECT 'Chaos Genius' AS greeting
)
SELECT greeting FROM inner_cte
)
SELECT *
FROM outer_cte;
Want to take Chaos Genius for a spin?
It takes less than 5 minutes.
Things to avoid while choosing Snowflake CTE names:
- SQL function names / Reserved words: Snowflake CTE names should not be the same as the names of Snowflake SQL functions, which can cause confusion and errors when the CTE is used in other Snowflake queries.
- Ambiguous names: Snowflake CTE names should be unique and unambiguous so that they can be easily identified and referenced in other Snowflake queries. Avoid using names that could be confused with other objects in the database, such as tables, views, or stored procedures.
- Long names: Snowflake CTE names should be concise and to the point. Avoid using names that are too long, as this can make your Snowflake queries difficult to read and understand.
Strength of Snowflake CTE in SQL Management
Let's delve deeper into each of the strengths of Snowflake CTE in SQL Management:
1) Modularization of Complex Query Logic:
Snowflake CTE allows you to break down complex SQL queries into smaller, more manageable chunks. By doing so, you can isolate specific logic or calculations in one section of your query, making it easier to understand and maintain.
2) Procedural-like SQL Style:
Traditional SQL is set-based, meaning it's designed to operate on entire sets of data at once. But, with CTE, you can write SQL that feels more procedural in nature, meaning you can structure your queries in a step-by-step manner, similar to how you'd write procedures in programming languages, which can make the logic flow of your SQL more intuitive.
3) DRY Principle (Don't Repeat Yourself):
DRY principle is a software development concept that helps the reduction of repetition. With Snowflake CTE, you can define a piece of logic or a data transformation once and then reference it multiple times in your query. This not only reduces redundancy but also ensures consistency, as changes made to the CTE logic will be reflected wherever the CTE is referenced.
4) Improved Readability:
As SQL queries grow in complexity, they can become difficult to read and understand. So, by making use of Snowflake CTE, you can segment your query into named sections, each handling a specific part of the logic. This structure, combined with the benefits of modularization and the DRY principle, greatly enhances the readability of your SQL, making it easier for developers to easily understand it.
What are the benefits of using Snowflake CTE?
Snowflake Common Table Expressions are extremely powerful. When used precisely, they can significantly simplify complex queries and enhance the overall performance of Snowflake queries.
Snowflake Common Table Expressions (CTEs) provide several key benefits. These benefits are:
- Simplify complex queries
- Query reusability
- Simplify JOINs
- Recursive CTEs
Now let's explore each of these benefits in more detail:
1) Simplify complex queries
CTEs can be used to break down complex queries into smaller, more manageable pieces, making it easier to read and understand and easier to debug.
Before:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
SELECT FirstName, LastName
FROM Employees
WHERE DepartmentID = (SELECT DepartmentID FROM Departments WHERE DepartmentName = 'Sales');
See? This query is complex and difficult to read and understand. It is also difficult to debug, as it is not clear where the errors are occurring.
After:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH SalesDepartment AS (
SELECT DepartmentID
FROM Departments
WHERE DepartmentName = 'Sales'
)
SELECT FirstName, LastName
FROM Employees
WHERE DepartmentID = (SELECT DepartmentID FROM SalesDepartment);
What’s different? This query is much easier to read and understand. Snowflake CTE breaks down the complex query into smaller, more manageable chunks, making the query easier to debug and easier to identify where the errors are occurring.
2) Query Reusability
CTEs can be used to reuse the results of Snowflake queries multiple times in a single statement, which can significantly save time and improve performance, especially for queries that are frequently used.
Before:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
SELECT
'Sales' AS department_name,
AVG(salary) AS average_salary
FROM employees
WHERE department_id =
(SELECT department_id
FROM departments
WHERE department_name = 'Sales')
UNION ALL
SELECT
'Marketing' AS department_name,
AVG(salary) AS average_salary
FROM employees
WHERE department_id =
(SELECT department_id
FROM departments
WHERE department_name = 'Marketing');
In the above example, we are executing the subquery twice to get the department_id for 'Sales' and 'Marketing' departments. This is inefficient as we are retrieving the same data twice.
Now, let's see how we can optimize this using Snowflake CTE:
After:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH department_cte AS
(SELECT department_id, department_name
FROM departments
WHERE department_name IN ('Sales', 'Marketing'))
SELECT department_name, AVG(salary)
FROM employees
JOIN department_cte
ON employees.department_id = department_cte.department_id
GROUP BY department_name;
As you can see, this query uses CTEs to get the department_ids for the 'Sales' and 'Marketing' departments only once and then reusing it in the main query. This way, we are not executing the same subquery multiple times.
3) Simplify JOINs
CTEs can simplify queries with self-joins or joins between more than two tables. For example:
Before:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
SELECT o.OrderID, o.OrderDate, c.CustomerName, SUM(oi.Quantity) AS TotalQuantity
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
JOIN OrderItems oi ON o.OrderID = oi.OrderID
GROUP BY o.OrderID, o.OrderDate, c.CustomerName;
After:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH OrderDetails AS (
SELECT OrderID, OrderDate, CustomerID
FROM Orders
),
OrderItemsSummary AS (
SELECT OrderID, SUM(Quantity) AS TotalQuantity
FROM OrderItems
GROUP BY OrderID
)
SELECT od.OrderID, od.OrderDate, c.CustomerName, oi.TotalQuantity
FROM OrderDetails od
JOIN Customers c ON od.CustomerID = c.CustomerID
JOIN OrderItemsSummary oi ON od.OrderID = oi.OrderID;
As you can see, the GROUP BY is now simplified from 3 columns to a single column since the aggregate is being done inside the CTE.
4) Snowflake Recursive CTEs:
You might be thinking, 'Why all the fuss about Snowflake Common Table Expressions (CTEs)?' Well, here's the thing: CTEs really shine when you're working with hierarchical data. That's where their magic truly comes to life.
For example:
WITH RECURSIVE CTEs_staffs(indent, staff_id, supervisor_id, supervisor_title) AS (
-- Anchor clause
SELECT '', staff_id, supervisor_id, position
FROM staff
WHERE position = 'Principal'
UNION ALL
-- Recursive clause
SELECT indent || '---', s.staff_id, s.supervisor_id, s.position
FROM staff s JOIN CTEs_staffs c ON s.supervisor_id = c.staff_id
)
SELECT indent || supervisor_title AS Title, staff_id, supervisor_id
FROM CTEs_staffs;
As you can see in the above example, we created a recursive CTE named 'dept_hierarchy'. This CTE starts by selecting all records from the 'dept' table where 'parent_dept_id' is NULL, which represents the top level of the hierarchy (i.e., departments with no parent department).
We then use the 'UNION ALL' operator to merge this result with the outcome of a recursive query. This recursive query joins the 'dept' table to the 'dept_hierarchy' CTE, aligning records where the 'parent_dept_id' in the 'dept' table matches the 'dept_id' in the CTE. This process effectively picks out all child departments for every parent department in the hierarchy.
The recursion continues, with each cycle going one level deeper into the hierarchy until no further child departments are identified. The final output is a table showcasing the entire department hierarchy, where each entry displays the 'dept_id', 'parent_dept_id', and 'name' of each department
What are the types of Snowflake Common Table Expression?
There are two types of Snowflake CTE (Common Table Expressions): non-recursive CTEs and recursive CTEs.
1) Snowflake Non-recursive CTEs
Non-recursive Snowflake CTEs are the most common type of CTEs. Non-recursive CTEs are used to process non-hierarchical data. They work by executing a single query and returning the results of the query.
The structure of a non-recursive CTEs is as follows:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTE_name AS (
SELECT ...
WHERE ...
)
SELECT ...
FROM CTE_name;
Example of a non-recursive CTEs:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_employees AS (
SELECT employee_id, first_name, last_name
FROM employees
WHERE department_id = 100
)
SELECT *
FROM CTE_employees;
2) Snowflake Recursive CTEs
Recursive CTEs are used to process hierarchical data. They work by repeatedly querying themselves until they reach a stopping point.
The structure of a Snowflake recursive CTEs is as follows:
WITH recursive CTE_name AS (
-- The anchor clause defines the initial query.
SELECT ...
WHERE ...
-- The recursive clause defines the query that is repeated.
UNION ALL
SELECT ...
FROM CTE_name
WHERE ...
)
SELECT ...
FROM CTE_name;
As you can see:
- The anchor clause is the first query in the CTEs. This query is executed once, and its results are used to populate the CTEs.
- The recursive clause is the second query in the CTEs. This query is executed repeatedly, and its results are unioned with the results of the anchor clause. The recursive clause continues to execute until it reaches a row where the parent ID is NULL.
- The SELECT statement at the end of the CTEs selects the columns that you want to return from the CTEs.
Example of a Snowflake recursive CTEs:
/** Snowflake recursive Common Table Expression (Snowflake CTE) Example **/
WITH recursive CTE_employees AS (
SELECT employee_id, first_name, last_name
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT employee_id, first_name, last_name
FROM CTE_employees
WHERE manager_id IN (
SELECT employee_id
FROM CTE_employees
)
)
SELECT *
FROM CTEs_employees;
Now, as you can see, this query first creates a CTEs called CTE_employees that selects all rows from the employees table where the manager_id column is NULL. The CTEs is then used in the UNION ALL operator to union the results of the CTEs with itself. The WHERE clause in the UNION ALL operator ensures that only employees who have managers are included in the CTEs.
“ This process repeats until there are no more employees who have managers. “
Behind the Scenes—Snowflake CTE & Query Plan
To understand how Snowflake treats Common Table Expressions (CTE), we need to consider the performance implications.
What happens if the Snowflake CTE is referenced only once?
If a Snowflake CTE is referenced only once, there are little to no performance implications. To verify this, we can examine the Snowflake query profile.
Let's start by writing a normal query and analyzing the Snowflake query profile. Then, we will introduce Snowflake CTE by modifying the query and referencing the CTE only once. Finally, we will modify the query once more and reference the CTEs multiple times, which will allow us to observe the performance differences between these variations.
First, let's examine the Snowflake query profile for a regular query:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR > 1930;
Next, let's modify the query by implementing Snowflake CTE and analyze the updated Snowflake query profile:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_customers AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR > 1930
)
SELECT *
FROM CTEs_customers;
As you can see, in analyzing Snowflake's handling of CTEs versus standard SQL queries, there's no performance difference between the two. Specifically, when a Snowflake CTE is referenced only once, it’s always a pass-through, and the query profile shows no sign of it whatsoever. This means that using a CTE that’s referenced only once will never impact performance compared to avoiding the CTE.
What happens if the Snowflake CTE is referenced multiple times?
Like this:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_customers_data AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
),
customers_before_1960 AS (
SELECT *
FROM CTEs_customers_data
WHERE C_BIRTH_YEAR < 1960
),
customers_1960_to_1970 AS (
SELECT *
FROM CTEs_customers_data
WHERE C_BIRTH_YEAR >= 1960 AND C_BIRTH_YEAR <= 1970
),
customers_1971_to_1980 AS (
SELECT *
FROM CTEs_customers_data
WHERE C_BIRTH_YEAR >= 1971 AND C_BIRTH_YEAR <= 1980
),
customers_after_1980 AS (
SELECT *
FROM CTEs_customers_data
WHERE C_BIRTH_YEAR > 1980
)
SELECT *
FROM customers_before_1960
UNION ALL
SELECT *
FROM customers_1960_to_1970
UNION ALL
SELECT *
FROM customers_1971_to_1980
UNION ALL
SELECT *
FROM customers_after_1980;
In the above screenshot, you'll notice the presence of four new node types with WithClause and WithReference.
The WithClause and WithReference nodes are node types—introduced in Snowflake's 2023 release. These nodes are used to represent Common Table Expressions (CTEs) in the query plan.
- WithClause node: Represents the output stream and buffer from the CTEs.
- WithReference node: Represents a reference to the CTEs in the rest of the query plan.
When Snowflake parses a query that contains CTEs, it creates a WithClause node for the CTEs and then adds it to the query plan. The WithReference nodes are then created for each place in the query plan where the CTEs are referenced.
The WithClause and WithReference nodes allow Snowflake to optimize the execution of queries that contain CTEs. For example, if CTE contains a filter, Snowflake can push the filter down to the underlying TableScans node.
This can improve the performance of the query by avoiding the need to scan the entire table.
Now, let's actually convert that CTE into a normal query and check what performance difference it has
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR < 1960
UNION ALL
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR >= 1960 AND C_BIRTH_YEAR <= 1970
UNION ALL
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR >= 1971 AND C_BIRTH_YEAR <= 1980
UNION ALL
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR > 1980;
As you can see,
Snowflake CTE:
- Processing: 57.1% of 49 seconds = 27.979 seconds
- Local Disk I/O: 24.1% of 49 seconds = 11.809 seconds
- Remote Disk I/O: 7.5% of 49 seconds = 3.675 seconds
- Synchronization: 0.1% of 49 seconds = 0.049 seconds
- Initialization: 11.2% of 49 seconds = 5.488 seconds
Normal Query:
- Processing: 70.7% of 106 seconds = 74.942 seconds
- Local Disk I/O: 0.1% of 106 seconds = 0.106 seconds
- Remote Disk I/O: 13.4% of 106 seconds = 14.204 seconds
- Synchronization: 0.3% of 106 seconds = 0.318 seconds
- Initialization: 15.5% of 106 seconds = 16.43 seconds
As you can see, after implementing the Snowflake CTE, the query becomes notably more efficient, running approximately ~116.33% faster than the standard SQL query. Breaking down the specifics:
- Processing: The CTE-based query completes its processing in just 27.979 seconds, while the standard SQL query takes a longer 74.942 seconds.
- Local Disk I/O: The CTE-based query has a more involved local disk I/O time of 11.809 seconds, compared to the minimal 0.106 seconds in the standard query. This might be due to the CTE creating temporary results or tables that are read/written locally.
- Remote Disk I/O: The CTE-based query is more efficient in remote data retrieval, taking only 3.675 seconds, whereas the standard SQL query takes 14.204 seconds. This suggests that the CTE might be streamlining data retrieval operations, reducing the need for extensive remote I/O operations.
- Synchronization: Both queries have minimal synchronization times, but the difference is negligible with the CTE-based query at 0.049 seconds and the standard SQL query at 0.318 seconds.
- Initialization: The initialization times are relatively close, with the CTE-based query at 5.488 seconds and the standard SQL query at 16.43 seconds.
In short, the use of Snowflake CTE provides tangible performance benefits, making it a more efficient choice over traditional SQL queries in this kind of scenario.
Now, What happens if the Snowflake CTE is referenced multiple times with a direct table reference?
Now, let's replace the references to the CTEs_customers_data CTEs with a direct reference to the SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER table and observe the differences.
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH customers_before_1960 AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR < 1960
),
customers_1960_to_1970 AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR >= 1960 AND C_BIRTH_YEAR <= 1970
),
customers_1971_to_1980 AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR >= 1971 AND C_BIRTH_YEAR <= 1980
),
customers_after_1980 AS (
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
WHERE C_BIRTH_YEAR > 1980
)
SELECT *
FROM customers_before_1960
UNION ALL
SELECT *
FROM customers_1960_to_1970
UNION ALL
SELECT *
FROM customers_1971_to_1980
UNION ALL
SELECT *
FROM customers_after_1980;
You'll notice that there are four TableScans instead of just one in the execution plan. The TableScan on the left side is responsible for reading data from the remote storage, while the TableScan on the right side utilizes the locally cached result from the left TableScan. Also, the number of Filters has decreased to four because when a filter is applied after a TableScan, the TableScan node itself handles the filtering operation.
Note: Whenever you are using CTEs in Snowflake, be aware of potential performance impacts. While CTEs can optimize queries, excessive nesting or referencing the same CTE multiple times in one query can degrade performance. In such cases, a standard SQL query may be more efficient than a CTE-based approach.
Heads Up! Just a quick note to say that these Snowflake CTE performance findings were completely based on Snowflake sample data. Everyone's data and scenarios can be a bit different, so your performance might vary. And by the way, we ran our tests on an X-Small warehouse in Snowflake. If you're using a bigger warehouse (Small, Medium, Large ….), the performance will be very different.
Use cases and Performance improvement by using Snowflake CTE
Here are the use cases of Snowflake recursive and non-recursive Snowflake CTE:
Use cases of Snowflake Recursive CTEs:
- Snowflake Recursive CTEs are primarily used with hierarchical data, especially when the hierarchical nature of the data is important in the output.
- They can be used to traverse through data that has a tree-like structure, such as organizational charts, file directories, or any data that has parent-child relationships.
- Snowflake Recursive CTEs should be used cautiously because they can easily lead to infinite loops.
Use cases of Non-recursive Snowflake CTE:
- Non-recursive CTEs have the same use cases as subqueries, meaning that they can be used to temporarily store data, to break down complex Snowflake queries into smaller, more manageable pieces, and to improve the readability and maintainability of queries.
- Non-recursive CTEs can be used to make the code more readable and structured. This is because they can be used to group related Snowflake queries together and to give them meaningful names. This can make Snowflake queries easier to understand and maintain.
- Non-recursive CTEs are useful even with hierarchical data when the hierarchical nature of the data is not important in the output.
Performance improvements from using CTEs in Snowflake:
- Snowflake is able to optimize CTE queries by pushing down filters and joins to the CTEs definitions, which reduces the amount of data processed in the main query.
- The results of a Snowflake CTE can be reused multiple times in the main query or in other CTEs, which avoids re-computing the same results multiple times.
- Snowflake CTE can simplify complex Snowflake queries by breaking them into smaller, logical steps, making the Snowflake queries easier to read, understand, and optimize.
- CTEs materialize intermediate results in temporary storage, allowing the results to be used multiple times without re-computation.
- Snowflake CTE can execute in parallel, which improves Snowflake query performance, especially for large datasets.
What are the limitations of using Snowflake CTE?
There are a few limitations to using Snowflake CTE, which include:
- Recursive CTE Limitations: The Snowflake implementation of recursive CTEs does not support certain keywords that some other systems might support. These include:
SEARCH DEPTH FIRST BY ...
CYCLE ... SET ...
- Infinite Loops: When using a recursive CTE, there's a possibility to create a query that goes into an infinite loop, which can consume credits until the query succeeds, times out, or is manually canceled.
- CTEs can heavily impact performance: CTEs can impact the performance of a query, especially if the CTEs are really complex or are referenced/nested multiple times in the query.
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH
cte1 AS (
SELECT ...
),
cte2 AS (
SELECT ...
),
cte3 AS (
SELECT ...
),
cte4 AS (
SELECT ...
),
cte5 AS (
SELECT ...
),
......
......
......
SELECT *
FROM cte1
Hands-on Example of Snowflake Common Table Expression
Still confused about Snowflake Common Table Expressions (CTEs)? Don't worry! Let's delve into a hands-on example that will help you understand and utilize Snowflake CTE more effectively.
Let's start by creating a table at first and inserting some dummy values. To do so, fire the query mentioned below
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
CREATE OR REPLACE TABLE staff (position VARCHAR, staff_ID INTEGER, supervisor_ID INTEGER);
Now, insert some value to that table:
INSERT INTO staff (position, staff_ID, supervisor_ID)
VALUES
('Principal', 1, NULL), -- The Principal has no coordinator.
('Vice Principal', 10, 1),
('Math Teacher', 100, 20),
('Science Teacher', 101, 20),
('STEM Coordinator', 20, 1),
('English Teacher', 200, 30),
('Physical Education Teacher', 201, 30),
('History Teacher', 202, 30),
('Librarian', 203, 30),
('Art Teacher', 204, 30),
('Engineering Teacher', 205, 20),
('Other Coordinator', 30, 1);
As you can see in the query above, you can observe how the values were inserted to establish a hierarchical structure within the staff members of the school. The Principal holds the highest position without a supervisor. The Vice Principal and STEM Coordinator both report to the Principal. The Math and Science Teachers report to the STEM Coordinator, while an Engineering Teacher also reports to the STEM Coordinator. The English Teacher, Physical Education Teacher, History Teacher, Librarian, and Art Teacher all report to the Other Coordinator, who in turn reports to the Principal. This query effectively establishes the hierarchical relationships and reporting structure among the staff members of the school.
Now, let’s map each position with their respective supervisor_id they belong to using:
1) Non-recursive Snowflake CTE:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_staffs (Positions, Supervisor) as (
-- Snowflake subquery begins
SELECT staff.position as position, supervisor.position as supervisor
from staff
LEFT OUTER JOIN staff as supervisor on staff.supervisor_ID = supervisor.staff_ID
)
SELECT * from CTEs_staffs
See? It maps accurately and associates each position with its corresponding supervisor. However, the result lacks hierarchical information, making it challenging to determine the number of hierarchical levels, especially in larger datasets.
Now, let’s convert that same non-recursive query into 👇
2) Recursive Snowflake CTE:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH RECURSIVE CTEs_staffs (
indent,
staff_id,
supervisor_id,
supervisor_title,
) AS (
-- Anchor clause
SELECT
'' AS indent,
staff_id,
supervisor_id,
position
AS supervisor_title
FROM staff
WHERE position = 'Principal'
UNION ALL
-- Recursive clause
SELECT
indent || '---',
staff.staff_id,
staff.supervisor_id,
staff.position
FROM staff JOIN CTEs_staffs ON staff.supervisor_id = CTEs_staffs.staff_id
)
SELECT
indent || supervisor_title AS Title,
staff_id,
supervisor_id
FROM CTEs_staffs;
As you can see, the [WITH RECURSIVE] clause initiates a recursive common table expression (CTE) named 'CTEs_staffs' with columns 'indent', 'staff_id', 'supervisor_id', and 'supervisor_title'. The anchor clause starts the process by selecting the 'Principal' from the 'staff' table. The recursive clause then identifies all staff members reporting to the 'Principal', adding '---' to the 'indent' for each level of the hierarchy. This process keeps on repeating until it reaches staff members who don't supervise anyone.
The UNION ALL clause then accumulates the results from each iteration, which are made available to the main SELECT clause in the query. The output displays the hierarchical data with indentation, controlled by the 'indent' column. The indentation increases by three characters '---' for each level in the hierarchy. The construction of joins and selection of columns in the recursive clause is also crucial. The columns in the recursive clause's SELECT must align with the anchor clause's columns.
The query starts with the Principal, then selects the staff members reporting to the Principal, and so on. Each iteration looks for staff members whose 'supervisor_id' field matches one of the 'staff_id' values from the previous iteration.
Basically, the 'staff_id' in the 'CTEs_staffs' view becomes the 'supervisor_id' for the next level of staff members. The 'staff_id' must progress downward through the hierarchy during each iteration. If they don't, the query could loop infinitely, skip a level, or might even fail in other ways.
Here is the result:
But note that the result is not in order. To fix this, we need to include an ORDER BY clause. But, using just the ORDER BY clause may cause staff to not appear directly under their respective supervisor, despite correct indentation.
To address this issue, we can modify the previous query by introducing an additional column called 'sort_key' in the ORDER BY clause. The 'sort_key' accumulates the chain of command as the recursive clause iterates. It can be seen as a string that represents the entire hierarchy above a staff member, with the most senior position (the Principal) at the beginning of the 'sort_key' string.
Note: Each iteration should increase the length of the 'sort_key' by a consistent number of characters. To achieve this, we can utilize a small query that incorporates a user-defined function (UDF) called 'skey'. (This function ensures the generation of consistent-length segments for the 'sort_key').
Here is how your UDF function should look like:
CREATE OR REPLACE FUNCTION skey(ID VARCHAR)
RETURNS VARCHAR
AS
$$
SUBSTRING('0000' || ID::VARCHAR, -4) || ' '
$$
;
Andd, here is the final version of the query:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH RECURSIVE CTEs_staffs (
indent,
staff_id,
supervisor_id,
supervisor_title,
sort_key
) AS (
-- Anchor clause
SELECT
'' AS indent,
staff_id,
supervisor_id,
position
AS supervisor_title, skey(staff_id)
FROM staff
WHERE position = 'Principal'
UNION ALL
-- Recursive clause
SELECT
indent || '➖➖➖',
staff.staff_id,
staff.supervisor_id,
staff.position,
sort_key || skey(staff.staff_id)
FROM staff JOIN CTEs_staffs ON staff.supervisor_id = CTEs_staffs.staff_id
)
SELECT
indent || supervisor_title AS Title,
staff_id,
supervisor_id,
sort_key
FROM CTEs_staffs Order by sort_key;
Congratulations! With just a few steps, we have successfully created both recursive and non-recursive Snowflake Common Table Expressions (CTEs).
Additional Resources
Read:
- Snowflake Documentation - Working with CTEs (Common Table Expressions)
- Snowflake query optimiser: unoptimised
- CTEs are Passthroughs–Some research!
Watch:
Save up to 30% on your Snowflake spend in a few minutes!
Conclusion
Common Table Expressions (CTEs) are a powerful way to organize and simplify complex Snowflake SQL queries. They break down convoluted queries into easy to read steps that can be reused within a statement. Like the layers of an onion, CTEs let you peel back complexity one level at a time.
In this article, we explored how to define and optimize CTEs in Snowflake. We looked at the basic syntax for crafting CTEs, how to use Snowflake recursive CTEs to query hierarchical data, and best practices for optimizing CTEs performance.
To wrap up, think of Snowflake CTEs as the building blocks of a Lego structure. Each block, or CTEs, is simple and manageable on its own, but when pieced together, they form a complex, intricate structure. Just as you would strategically place each Lego block for stability, you should also carefully construct your CTEs for readability, maintainability, and performance.
FAQs
What are Snowflake CTE?
TLDR; Snowflake CTE are named subqueries defined within the WITH clause of a Snowflake SQL statement that can be used to break up complex Snowflake queries into smaller, more manageable chunks.
How do you define CTEs in Snowflake SQL?
You define a CTEs using the WITH clause. The basic syntax is:
/** Snowflake Common Table Expression (Snowflake CTE) Example **/
WITH CTEs_name AS (
SELECT …CTEs
)
SELECT * FROM CTEs_name;
Does Snowflake support recursive CTEs?
Yes, Snowflake supports recursive CTEs.
Are CTEs materialized in Snowflake?
No, CTEs are not materialized in Snowflake. They are virtual views that exist only for the duration of the query. This means CTEs have little overhead but can impact performance for complex Snowflake queries.
What are the different types of Snowflake CTE?
There are two types of Snowflake CTE: recursive and non-recursive. Recursive CTEs are used to create a table that contains a hierarchy of data. Non-recursive CTEs are used to create a table that contains a single set of data.
When should I use CTEs in Snowflake?
CTEs can be used in a variety of situations, but they are especially useful for the following:
- Snowflake queries that involve multiple tables
- Snowflake queries that involve recursive data
- Snowflake queries that are performance-sensitive
What is the difference between Snowflake CTE and subquery?
The main difference between Snowflake CTE and subqueries is that CTEs are materialized in local storage while subqueries are not.
Is Snowflake CTE faster than subqueries?
No, Snowflake CTE is not faster than subqueries. But, CTEs can be more efficient than subqueries in some cases. For instance, Snowflake can cache the CTEs if you are using multiple different filters on the same table, which can enhance the Snowflake query performance.
How do recursive CTEs handle hierarchical data in Snowflake?
Recursive CTEs can process hierarchical data structures, such as management hierarchies or component-subcomponent relationships. They allow joining a table to itself multiple times to navigate through different levels of hierarchy.
What precautions should be taken with recursive CTEs in Snowflake?
Incorrect construction of a recursive CTE can lead to infinite loops. Make sure that the data hierarchy doesn't contain cycles and that the recursive clause is correctly structured to avoid infinite iterations.