Tuesday, June 30, 2020

Cross Apply and outer apply

SQL Server APPLY operator has two variants; CROSS APPLY and OUTER APPLY

  • The CROSS APPLY operator returns only those rows from the left table expression (in its final output) if it matches with the right table expression. In other words, the right table expression returns rows for the left table expression match only.
  • The OUTER APPLY operator returns all the rows from the left table expression irrespective of its match with the right table expression. For those rows for which there are no corresponding matches in the right table expression, it contains NULL values in columns of the right table expression.
  • So you might conclude, the CROSS APPLY is equivalent to an INNER JOIN (or to be more precise its like a CROSS JOIN with a correlated sub-query) with an implicit join condition of 1=1 whereas the OUTER APPLY is equivalent to a LEFT OUTER JOIN.
You might be wondering if the same can be achieved with a regular JOIN clause, so why and when do you use the APPLY operator? Although the same can be achieved with a normal JOIN, the need of APPLY arises if you have a table-valued expression on the right part and in some cases the use of the APPLY operator boosts performance of your query. Let me explain with some examples.

Create Sample Data for CROSS APPLY and OUTER APPLY examples

Script #1 creates a Department table to hold information about departments. Then it creates an Employee table which holds information about the employees. Please note, each employee belongs to a department, hence the Employee table has referential integrity with the Department table.
--Script #1 - Creating some temporary objects to work on...

USE [tempdb] 
GO
 
IF EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N'[Employee]') AND type IN (N'U')) 
BEGIN 
   DROP TABLE [Employee] 
END 
GO 

IF EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N'[Department]') AND type IN (N'U')) 
BEGIN 
   DROP TABLE [Department] 
END 

CREATE TABLE [Department]( 
   [DepartmentID] [int] NOT NULL PRIMARY KEY, 
   [Name] VARCHAR(250) NOT NULL, 
) ON [PRIMARY] 

INSERT [Department] ([DepartmentID], [Name])  
VALUES (1, N'Engineering') 
INSERT [Department] ([DepartmentID], [Name])  
VALUES (2, N'Administration') 
INSERT [Department] ([DepartmentID], [Name])  
VALUES (3, N'Sales') 
INSERT [Department] ([DepartmentID], [Name])  
VALUES (4, N'Marketing') 
INSERT [Department] ([DepartmentID], [Name])  
VALUES (5, N'Finance') 
GO 

CREATE TABLE [Employee]( 
   [EmployeeID] [int] NOT NULL PRIMARY KEY, 
   [FirstName] VARCHAR(250) NOT NULL, 
   [LastName] VARCHAR(250) NOT NULL, 
   [DepartmentID] [int] NOT NULL REFERENCES [Department](DepartmentID), 
) ON [PRIMARY] 
GO
 
INSERT [Employee] ([EmployeeID], [FirstName], [LastName], [DepartmentID]) 
VALUES (1, N'Orlando', N'Gee', 1 ) 
INSERT [Employee] ([EmployeeID], [FirstName], [LastName], [DepartmentID]) 
VALUES (2, N'Keith', N'Harris', 2 ) 
INSERT [Employee] ([EmployeeID], [FirstName], [LastName], [DepartmentID]) 
VALUES (3, N'Donna', N'Carreras', 3 ) 
INSERT [Employee] ([EmployeeID], [FirstName], [LastName], [DepartmentID]) 
VALUES (4, N'Janet', N'Gates', 3 ) 

SQL Server CROSS APPLY vs INNER JOIN example

The first query in Script #2 selects data from the Department table and uses a CROSS APPLY to evaluate the Employee table for each record of the Department table. The second query simply joins the Department table with the Employee table and all matching records are produced.
--Script #2 - CROSS APPLY and INNER JOIN

SELECT * FROM Department D 
CROSS APPLY 
   ( 
   SELECT * FROM Employee E 
   WHERE E.DepartmentID = D.DepartmentID 
   ) A 
GO
 
SELECT * FROM Department D 
INNER JOIN Employee E ON D.DepartmentID = E.DepartmentID 
GO 
If you look at the results, you can see see they are the same.
cross apply and inner join result set
Also, the execution plans for these queries are similar and they have an equal query cost, as you can see in the image below.
cross apply and inner join query plan
So what is the use of APPLY operator? How does it differ from a JOIN and how does it help in writing more efficient queries? I will discuss this later.

SQL Sever OUTER APPLY vs LEFT OUTER JOIN example

The first query in Script #3 selects data from Department table and uses an OUTER APPLY to evaluate the Employee table for each record of the Department table. For those rows for which there is not a match in the Employee table, those rows contain NULL values as you can see in case of row 5 and 6 below. The second query simply uses a LEFT OUTER JOIN between the Department table and the Employee table. As expected the query returns all rows from Department table, even for those rows for which there is no match in the Employee table.
--Script #3 - OUTER APPLY and LEFT OUTER JOIN

SELECT * FROM Department D 
OUTER APPLY 
   ( 
   SELECT * FROM Employee E 
   WHERE E.DepartmentID = D.DepartmentID 
   ) A 
GO
 
SELECT * FROM Department D 
LEFT OUTER JOIN Employee E ON D.DepartmentID = E.DepartmentID 
GO 
outer apply and left outer join result set
Even though the above two queries return the same information, the execution plan is a bit different. Although cost wise there is not much difference, the query with the OUTER APPLY uses a Compute Scalar operator (with estimated operator cost of 0.0000103 or around 0%) before the Nested Loops operator to evaluate and produce the columns of the Employee table.
outer apply and left outer join query plan

Joining table valued functions and tables using APPLY operators

 In Script #4, I am creating a table-valued function which accepts DepartmentID as its parameter and returns all the employees who belong to this department. The next query selects data from the Department table and uses a CROSS APPLY to join with the function we created. It passes the DepartmentID for each row from the outer table expression (in our case Department table) and evaluates the function for each row similar to a correlated subquery. The next query uses the OUTER APPLY in place of the CROSS APPLY and hence unlike the CROSS APPLY which returned only correlated data, the OUTER APPLY returns non-correlated data as well, placing NULLs into the missing columns.
--Script #4 - APPLY with table-valued function

IF EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N'[fn_GetAllEmployeeOfADepartment]') AND type IN (N'IF')) 
BEGIN 
   DROP FUNCTION dbo.fn_GetAllEmployeeOfADepartment 
END 
GO
 
CREATE FUNCTION dbo.fn_GetAllEmployeeOfADepartment(@DeptID AS INT)  
RETURNS TABLE 
AS 
RETURN 
   ( 
   SELECT * FROM Employee E 
   WHERE E.DepartmentID = @DeptID 
   ) 
GO
 
SELECT * FROM Department D 
CROSS APPLY dbo.fn_GetAllEmployeeOfADepartment(D.DepartmentID) 
GO
 
SELECT * FROM Department D 
OUTER APPLY dbo.fn_GetAllEmployeeOfADepartment(D.DepartmentID) 
GO 
apply with table valued function result set
You might be wondering if we can use a simple join in place of the above queries, the answer is NO. If you replace the CROSS/OUTER APPLY in the above queries with an INNER JOIN/LEFT OUTER JOIN, specifying the ON clause with 1=1 and run the query, you will get the error "The multi-part identifier "D.DepartmentID" could not be bound.". This is because with JOINs the execution context of the outer query is different from the execution context of the function (or a derived table), and you cannot bind a value/variable from the outer query to the function as a parameter. Hence the APPLY operator is required for such queries.
So in summary the APPLY operator is required when you have to use a table-valued function in the query, but it can also be used with inline SELECT statements.

Joining table valued system functions and tables using APPLY operators

Let me show you another query with a Dynamic Management Function (DMF). Script #5 returns all the currently executing user queries except for the queries being executed by the current session. As you can see in the script below, the sys.dm_exec_requests dynamic management view is being CROSS APPLY'ed with the sys.dm_exec_sql_text dynamic management function which accepts a "plan handle" for the query and the "plan handle" is being passed from the left/outer expression to the function to return the data.
--Script #5 - APPLY with Dynamic Management Function (DMF)

USE master 
GO
 
SELECT DB_NAME(r.database_id) AS [Database], st.[text] AS [Query]  
FROM sys.dm_exec_requests r 
CROSS APPLY sys.dm_exec_sql_text(r.plan_handle) st 
WHERE r.session_Id > 50           -- Consider spids for users only, no system spids. 
AND r.session_Id NOT IN (@@SPID)  -- Don't include request from current spid. 
Note, for the above query, the [text] column in the query returns all queries submitted in a batch. If you want to see only the active (currently executing) query you can use the statement_start_offset and statement_end_offset columns to trim the active part of the query. 


SQL Server Hints

Hints are options or strategies specified for enforcement by the SQL Server query processor on SELECT, INSERT, UPDATE, or DELETE statements. The hints override any execution plan the query optimizer might select for a query.
Caution: Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend that <join_hint>, <query_hint>, and <table_hint> be used only as a last resort by experienced developers and database administrators.

Join Hints
Join hints specify that the query optimizer enforce a join strategy between two tables in SQL Server 2019
Applies to:
Delete, Update, Select
Syntax:
<join_hint> ::= { LOOP | HASH | MERGE | REMOTE }

Arguments

LOOP | HASH | MERGE
Specifies that the join in the query should use looping, hashing, or merging. Using LOOP |HASH | MERGE JOIN enforces a particular join between two tables. LOOP cannot be specified together with RIGHT or FULL as a join type. For more information, see Joins.
REMOTE
Specifies that the join operation is performed on the site of the right table. This is useful when the left table is a local table and the right table is a remote table. REMOTE should be used only when the left table has fewer rows than the right table.
If the right table is local, the join is performed locally. If both tables are remote but from different data sources, REMOTE causes the join to be performed on the site of the right table. If both tables are remote tables from the same data source, REMOTE is not required.
REMOTE cannot be used when one of the values being compared in the join predicate is cast to a different collation using the COLLATE clause.
REMOTE can be used only for INNER JOIN operations.

Remarks

Join hints are specified in the FROM clause of a query. Join hints enforce a join strategy between two tables. If a join hint is specified for any two tables, the query optimizer automatically enforces the join order for all joined tables in the query, based on the position of the ON keywords. When a CROSS JOIN is used without the ON clause, parentheses can be used to indicate the join order.

Examples

A. Using HASH

The following example specifies that the JOIN operation in the query is performed by a HASH join. The example uses the AdventureWorks2012 database.
SELECT p.Name, pr.ProductReviewID FROM Production.Product AS p LEFT OUTER HASH JOIN Production.ProductReview AS pr ON p.ProductID = pr.ProductID ORDER BY ProductReviewID DESC;

B. Using LOOP

The following example specifies that the JOIN operation in the query is performed by a LOOP join. The example uses the AdventureWorks2012 database.
DELETE FROM Sales.SalesPersonQuotaHistory FROM Sales.SalesPersonQuotaHistory AS spqh INNER LOOP JOIN Sales.SalesPerson AS sp ON spqh.SalesPersonID = sp.SalesPersonID WHERE sp.SalesYTD > 2500000.00; GO

C. Using MERGE

The following example specifies that the JOIN operation in the query is performed by a MERGE join. The example uses the AdventureWorks2012 database.
SELECT poh.PurchaseOrderID, poh.OrderDate, pod.ProductID, pod.DueDate, poh.VendorID FROM Purchasing.PurchaseOrderHeader AS poh INNER MERGE JOIN Purchasing.PurchaseOrderDetail AS pod ON poh.PurchaseOrderID = pod.PurchaseOrderID; GO
Query Hints
Query hints specify that the indicated hints should be used throughout the query. They affect all operators in the statement. If UNION is involved in the main query, only the last query involving a UNION operation can have the OPTION clause. Query hints are specified as part of the OPTION clause. Error 8622 occurs if one or more query hints cause the Query Optimizer not to generate a valid plan.
Applies to: Delete, Update, Insert, Select, Merge

Arguments

{ HASH | ORDER } GROUP
Specifies that aggregations that the query's GROUP BY or DISTINCT clause describes should use hashing or ordering.
{ MERGE | HASH | CONCAT } UNION
Specifies that all UNION operations are run by merging, hashing, or concatenating UNION sets. If more than one UNION hint is specified, the Query Optimizer selects the least expensive strategy from those hints specified.
{ LOOP | MERGE | HASH } JOIN
Specifies all join operations are performed by LOOP JOIN, MERGE JOIN, or HASH JOIN in the whole query. If you specify more than one join hint, the optimizer selects the least expensive join strategy from the allowed ones.
If you specify a join hint in the same query's FROM clause for a specific table pair, this join hint takes precedence in the joining of the two tables. The query hints, though, must still be honored. The join hint for the pair of tables may only restrict the selection of allowed join methods in the query hint. For more information, see Join Hints (Transact-SQL).
EXPAND VIEWS
Specifies the indexed views are expanded. Also specifies the Query Optimizer won't consider any indexed view as a replacement for any query part. A view is expanded when the view definition replaces the view name in the query text.
This query hint virtually disallows direct use of indexed views and indexes on indexed views in the query plan.
The indexed view remains condensed if there's a direct reference to the view in the query's SELECT part. The view also remains condensed if you specify WITH (NOEXPAND) or WITH (NOEXPAND, INDEX(index_value_ [ ,...n ] ) ). For more information about the query hint NOEXPAND, see Using NOEXPAND.
The hint only affects the views in the statements' SELECT part, including those views in INSERT, UPDATE, MERGE, and DELETE statements.
FAST number_rows
Specifies that the query is optimized for fast retrieval of the first number_rows. This result is a nonnegative integer. After the first number_rows are returned, the query continues execution and produces its full result set.
FORCE ORDER
Specifies that the join order indicated by the query syntax is preserved during query optimization. Using FORCE ORDER doesn't affect possible role reversal behavior of the Query Optimizer.

Table Hints
Table hints override the default behavior of the Query Optimizer for the duration of the data manipulation language (DML) statement by specifying a locking method, one or more indexes, a query-processing operation such as a table scan or index seek, or other options. Table hints are specified in the FROM clause of the DML statement and affect only the table or view referenced in that clause.

KEEPDEFAULTS
Is applicable only in an INSERT statement when the BULK option is used with OPENROWSET.
Specifies insertion of a table column's default value, if any, instead of NULL when the data record lacks a value for the column.
For an example that uses this hint in an INSERT ... SELECT * FROM OPENROWSET(BULK...) statement, see Keep Nulls or Use Default Values During Bulk Import (SQL Server).
FORCESEEK [ (index_value(index_column_name [ ,... n ] )) ]
Specifies that the query optimizer use only an index seek operation as the access path to the data in the table or view.

Using a Table Hint as a Query Hint

Table hints can also be specified as a query hint by using the OPTION (TABLE HINT) clause. We recommend using a table hint as a query hint only in the context of a plan guide. For ad-hoc queries, specify these hints only as table hints. For more information, see Query Hints (Transact-SQL).

Permissions

The KEEPIDENTITY, IGNORE_CONSTRAINTS, and IGNORE_TRIGGERS hints require ALTER permissions on the table.

Examples

A. Using the TABLOCK hint to specify a locking method

The following example specifies that a shared lock is taken on the Production.Product table in the AdventureWorks2012 database and is held until the end of the UPDATE statement.
UPDATE Production.Product WITH (TABLOCK) SET ListPrice = ListPrice * 1.10 WHERE ProductNumber LIKE 'BK-%'; GO

B. Using the FORCESEEK hint to specify an index seek operation

The following example uses the FORCESEEK hint without specifying an index to force the query optimizer to perform an index seek operation on the Sales.SalesOrderDetail table in the AdventureWorks2012 database.
SELECT * FROM Sales.SalesOrderHeader AS h INNER JOIN Sales.SalesOrderDetail AS d WITH (FORCESEEK) ON h.SalesOrderID = d.SalesOrderID WHERE h.TotalDue > 100 AND (d.OrderQty > 5 OR d.LineTotal < 1000.00); GO

SQL Server physical joins

Nested Loops Join

 A Nested Loops join is a logical structure in which one loop (iteration) resides inside another one, that is to say for each iteration of the outer loop all the iterations of the inner loop are executed/processed.
A Nested Loops join works in the same way. One of the joining tables is designated as the outer table and another one as the inner table. For each row of the outer table, all the rows from the inner table are matched one by one if the row matches it is included in the result-set otherwise it is ignored. Then the next row from the outer table is picked up and the same process is repeated and so on.
The SQL Server optimizer might choose a Nested Loops join when one of the joining tables is small (considered as the outer table) and another one is large (considered as the inner table which is indexed on the column that is in the join) and hence it requires minimal I/O and the fewest comparisons.
The optimizer considers three variants for a Nested Loops join:
  • naive nested loops join in which case the search scans the whole table or index
  • index nested loops join when the search can utilize an existing index to perform lookups
  • temporary index nested loops join if the optimizer creates a temporary index as part of the query plan and destroys it after query execution completes.
An index nested loops join performs better than a merge join or hash join if a small set of rows are involved. Whereas, if a large set of rows are involved the Nested Loops join might not be an optimal choice. Nested Loops support almost all join types except right and full outer joins, right semi-join and right anti-semi join.
In Script #1, I am joining the SalesOrderHeader table with SalesOrderDetail table and specifying the criteria to filter the result of the customer with CustomerID = 670. This filtered criteria returns 12 records from the SalesOrderHeader table and hence being the smaller one, this table has been considered as the outer table (top one in the graphical query execution plan) by the optimizer. For each row of these 12 rows of the outer table, rows from the inner table are matched (or the inner table is scanned 12 times each time for each row using the index seek or correlated parameter from the outer table) and 312 matching rows are returned as you can see in the second image. In the second query below, I am using SET STATISTICS PROFILE ON to display profile information of the query execution along with the query result-set.
Script #1 - Nested Loops Join

SELECT H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader HINNER JOIN Sales.SalesOrderDetail D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 670

SET STATISTICS PROFILE ON
SELECT 
H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader HINNER JOIN Sales.SalesOrderDetail D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 670SET STATISTICS PROFILE OFF

Nested Loops Join
If the number of records involved is large, SQL Server might choose to parallelize a nested loop by distributing the outer table rows randomly among the available Nested Loops threads dynamically. It does not apply the same for the inner table rows though.

Merge Join

The first thing that you need to know about a Merge join is that it requires both inputs to be sorted on join keys/merge columns (or both input tables have clustered indexes on the column that joins the tables) and it also requires at least one equijoin (equals to) expression/predicate.

Because the rows are pre-sorted, a Merge join immediately begins the matching process. It reads a row from one input and compares it with the row of another input. If the rows match, that matched row is considered in the result-set (then it reads the next row from the input table, does the same comparison/match and so on) or else the lesser of the two rows is ignored and the process continues this way until all rows have been processed..
A Merge join performs better when joining large input tables (pre-indexed / sorted) as the cost is the summation of rows in both input tables as opposed to the Nested Loops where it is a product of rows of both input tables. Sometimes the optimizer decides to use a Merge join when the input tables are not sorted and therefore it uses an explicit sort physical operator, but it might be slower than using an index (pre-sorted input table).
In Script #2, I am using a similar query as above, but this time I have added a WHERE clause to get all customer greater than 100. In this case, the optimizer decides to use a Merge join as both inputs are large in terms of rows and they are also pre-indexed/sorted. You can also notice that both inputs are scanned only once as opposed to the 12 scans we saw in the Nested Loops join above.
Script #2 - Merge Join

SELECT H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader HINNER JOIN Sales.SalesOrderDetail D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 100
SET STATISTICS PROFILE ON
SELECT 
H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader HINNER JOIN Sales.SalesOrderDetail D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 100SET STATISTICS PROFILE OFF

Merge Join
A Merge join is often a more efficient and faster join operator if the sorted data can be obtained from an existing B-tree index and it performs almost all join operations as long as there is at least one equality join predicate involved. It also supports multiple equality join predicates as long as the input tables are sorted on all joining keys involved and are in the same order.
The presence of a Compute Scalar operator indicates the evaluation of an expression to produce a computed scalar value. In the above query I am selecting LineTotal which is a derived column, hence it has been used in the execution plan.

Hash Join

A Hash join is normally used when input tables are quite large and no adequate indexes exist on them. A Hash join is performed in two phases; the Build phase and the Probe phase and hence the hash join has two inputs i.e. build input and probe input. The smaller of the inputs is considered as the build input (to minimize the memory requirement to store hash table discussed later) and obviously the other one is the probe input.

During the build phase, joining keys of all the rows of the build table are scanned. Hashes are generated and placed in an in-memory hash table. Unlike the Merge join, it's blocking (no rows are returned) until this point.

During the probe phase, joining keys of each row of the probe table are scanned. Again hashes are generated (using the same hash function as above) and compared against the corresponding hash table for a match.
A Hash function requires significant amount of CPU cycles to generate hashes and memory resources to store the hash table. If there is memory pressure, some of the partitions of the hash table are swapped to tempdb and whenever there is a need (either to probe or to update the contents) it is brought back into the cache. To achieve high performance, the query optimizer may parallelize a Hash join to scale better than any other join, for more details click here.
There are basically three different types of hash joins,
  • In-memory Hash Join in which case enough memory is available to store the hash table
  • Grace Hash Join in which case the hash table can not fit in memory and some partitions are spilled to tempdb
  • Recursive Hash Join in which case a hash table is so large the optimizer has to use many levels of merge joins. For more details about these different types click here.
In Script #3, I am creating two new large tables (from the existing AdventureWorks tables) without indexes. You can see the optimizer chose to use a Hash join in this case. Again unlike a Nested Loops join, it does not scan the inner table multiple times.
Script #3 - Hash Join

--Create tables without indexes from existing tables
--of AdventureWorks database
SELECT INTO Sales.SalesOrderHeader1 FROM Sales.SalesOrderHeaderSELECT INTO Sales.SalesOrderDetail1 FROM Sales.SalesOrderDetail
GO
SELECT H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader1 HINNER JOIN Sales.SalesOrderDetail1 D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 670
GO

SET STATISTICS PROFILE ON
SELECT 
H.CustomerIDH.SalesOrderIDD.ProductIDD.LineTotalFROM Sales.SalesOrderHeader1 HINNER JOIN Sales.SalesOrderDetail1 D ON H.SalesOrderID D.SalesOrderIDWHERE H.CustomerID 670SET STATISTICS PROFILE OFF

Hash Join
--Drop the tables created for demonstrationDROP TABLE Sales.SalesOrderHeader1DROP TABLE Sales.SalesOrderDetail1
Note: SQL Server does a pretty good job in deciding which join operator to use in each condition. Understanding these condition helps you to understand what can be done in performance tuning. It's not recommended to use join hints (using OPTION clause) to force SQL Server to use a specific join operator (unless you have no other way out), but rather you can use other means like updating statistics, creating indexes or re-writing your query.