日志

SQL Performance (2), Optimized SELECT Query (B)

已有 2144 次阅读2021-5-15 21:32 |个人分类:Tech|系统分类:原创博文 | 宽屏请点击显示宽屏，再点击恢复窄屏 | 动漫全图如只见部分动漫，请点击显示全图，再点击恢复窄图

This series of articles will discuss SQL server performance. In the title, I use SQL Perfromance because thearticles could include SQL statement (Query) optimization and also include SQL Server performance issues, and probably include stored procedure performance issue.

SQL Performance (0), Summary
SQL Performance (1), Optimized SELECT Query (A)
SQL Performance (2), Optimized SELECT Query (B) --- this article
SQL Performance (3), Optimized UPDATE Query
SQL Performance (4), Optimized INSERT Query
SQL Performance (5), Tune the SQL Server Database

A: Introduction

According to Pareto principle’s 80/20 rule, we put 20% effort in article 1, SQL Performance (1), Optimized SELECT Query (A), to discuss the minimum actions we need to do for SQL performance tuning SELECT statements, and hopefully to get 80% performance improvement; while in the this article, we will discuss the 80% effort to get another 20% performance improvement.

We will discuss with 12 tips below.

B: Tips for Improvement of Tuning Query

Create joins with INNER JOIN (not WHERE --- Cross Join)
Avoid Multiple Joins in a Single Query
Avoid using table variables in joins

Using Temp Table wisely
Using a Subquery
Using a CTE

Avoid Correlated SQL Subqueries
Avoid Multi-Statement Table Valued Functions (TVFs)
Avoid using wildcard characters at the beginning of LIKE pattern
Avoid using COUNT(), instead of Exist()
Avoid Using GROUP BY, ORDER BY and DISTINCT
Avoid Different Datatype on JOIN and WHERE Conditions
Avoid using HAVING to define filter, instead of WHERE
Avoid OR conditions
Avoid sorting with a mixed order

1, Create joins with INNER JOIN (not WHERE --- Cross Join)^[ref]

Some SQL developers, such as me, starting from Oracle, used to make joins with WHERE clauses, such as the following:

SELECT Customers.CustomerID, Customers.Name, Sales.LastSaleDate
FROM Customers, Sales
WHERE Customers.CustomerID = Sales.CustomerID

This type of join creates a Cartesian Join, also called a Cartesian Product or CROSS JOIN.

Ina Cartesian Join, all possible combinations of the variables are created. In this example, if we had 1,000 customers with 1,000 total sales, the query would first generate 1,000,000 results, then filter forthe 1,000 records where CustomerID is correctly joined. This is an inefficient use of database resources, as the database has done 100x more work than required. Cartesian Joins are especially problematic in large-scale databases, because a Cartesian Join of two large tables could create billions or trillions of results.

To prevent creating a Cartesian Join, use INNER JOIN instead:

SELECT Customers.CustomerID, Customers.Name, Sales.LastSaleDate
FROM Customers INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID

The database would only generate the 1,000 desired records where CustomerID is equal.

Note:

Some DBMS systems are able to recognize WHERE joins and automatically run them as INNER JOINs instead. In those DBMS systems, there will be no difference in performance between a WHERE join and INNER JOIN. However, INNER JOIN is recognized by all DBMS systems.

I check SQL Server 2019 that is belong to this category.

2, Avoid Multiple Joins in a Single Query^[ref]

Making joins less complicated.

Tryto avoid writing a query using multiple joins that include outer joins, cross apply, outer apply and other complex subqueries.

Reason: When the joins are complex, the number of possible execution plans exponentially increases. This, in turn, will reduce the choices for the query optimizer to decide the join order and join type. Sometimes, the query optimizer is forced to use nested loop joins, irrespective of the performance consequences for queries with excessively complex cross apply or subqueries. Also, due to time and performance constraint, SQL Engine will try to generate the “good enough” execution plan versus trying to generate the best possible execution plan.

What would address your question is the subject JOIN DECOMPOSITION.

You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application. For example, instead of this single query:

SELECT * FROM tag
JOIN tag_post ON tag_post.tag_id = tag.id
JOIN post ON tag_post.post_id = post.id
WHERE tag.tag = 'mysql';

You might run these queries:

SELECT * FROM tag WHERE tag = 'mysql';
SELECT * FROM tag_post WHERE tag_id=1234;
SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904);

Why do this? It looks wasteful at first glance, because you've increased the number of queries without getting anything in return. However, such restructuring can actually give significant performance advantages:

Caching can be more efficient. Many applications cache "objects" that map directly to tables. In this example, if the object with the tag mysql is already cached, the application will skip the first query. If you find posts with an ID of 123, 567, or 908 in the cache, you can remove them from the IN() list. The query cache might also benefit from this strategy. If only one of the tables changes frequently, decomposing a join can reduce the number of cache invalidations.
Executing the queries individually can sometimes reduce lock contention
The queries themselves can be more efficient. In this example, using an IN() list instead of a join lets MySQL sort row IDs and retrieve rows more optimally than might be possible with a join.
You can reduce redundant row accesses. Doing a join in the application means retrieving each row only once., whereas a join in the query is essentially a denormalization that might repeatedly access the same data. For the same reason, such restructuring might also reduce the total network traffic and memory usage.
To some extent, you can view this technique as manually implementing a hash join instead of the nested loops algorithm MySQL uses to execute a join. A hash join might be more efficient.

As a result, doings joins in the application can be more efficient when you cache and reuse a lot of datafrom earlier queries, you distribute data across multiple servers, you replace joins with IN() lists, or a join refers to the same table multiple times.

Note:

Another example can be seen here.

3, Avoid using table variables in joins^[ref]

Use temporary tables, CTEs (Common Table Expressions) or derived tables in joins instead.

Reason: Even though table variables are very fast and efficient in a lot of situations, the SQL Server engine sees it as a single row. Due to this reason, SQL will produce an execution plan that will perform horribly when used in joins.

a. Using Temp Table wisely^[ref]

Temporary tables usually increase a query’s complexity. If your code can be written in a simple, straightforward manner.

When you have to join a large table and there are conditions on said table, you can increase database performance by transferring your data in a temp table, and then making a join on that. Your temp table will have fewer rows than the original (large) table, so the join will finish faster!

The decision isn’t always straightforward, but this example will giveyou a sense for situations in which you might want to use temp tables:

Imagine a customer table with millions of records. You have to make ajoin on a specific region. You can achieve this by using a SELECT INTO statement and then joining with the temp table:

SELECT * INTO #Temp FROM Customer WHERE RegionID = 5
SELECT r.RegionName, t.Name FROM Region r JOIN #Temp t ON t.RegionID = r.RegionID

(Note: some SQL developers also avoid using SELECT INTOto create temp tables, saying that this command locks the tempdb database, disallowing other users from creating temp tables. Fortunately, this is fixed in 7.0 and later.)

b. Using a Subquery^[ref]

As an alternative to temp tables, you might consider using a subquery as a table:

SELECT r.RegionName, t.Name FROM Region r
JOIN (SELECT Name, RegionID FROM Customer WHERE RegionID = 5) AS t
ON t.RegionID = r.RegionID

All of these SQL snippets will return the same data. But with temp tables, we could, for example, create an index in the temp table to improve performance. There’s some good discussion here on the differences between temporary tables and subqueries.

c. Using a CTE^[ref]

Another option, you might consider using a CTE as a table:

WITH Employee_CTE (EmployeeNumber, Title)
AS
(SELECT NationalIDNumber,
JobTitle
FROM HumanResources.Employee)
SELECT EmployeeNumber,
Title
FROM Employee_CTE

CTE Query Definition

The blue portion is the CTE. Notice it contains a query that can be run on its own in SQL.

you can define more than one CTE within a WITH statement. This can help you simplify some very complicated queries which are ultimately joined together. Each complicated piece can include in their own CTE which is then referred to and joined outside the WITH clause.

Here is an example using of TWO CTE’s, it’s a simple example, but it shows how two CTE’s are defined, and then used in an INNER JOIN

The first common table expression is colored green, the second blue. As you can see from the SELECT statement the CTE’s are joined as if they were tables. Hopefully you can see that as your queries become morecomplicated, CTE’s can become a really useful way to separate operations; therefore, simplify your final query.

4, Avoid Correlated SQL Subqueries^[ref]

A correlated subquery is one which uses values from the parent query. This kind of SQL query tends to run row-by-row,once for each row returned by the outer query, and thus decreases SQL query performance. New SQL developers are often caught structuring theirqueries in this way—because it’s usually the easy route.

Here’s an example of a correlated subquery:

SELECT c.Name,
c.City,
(SELECT CompanyName FROM Company WHERE ID = c.CompanyID) AS CompanyName
FROM Customer c

In particular, the problem is that the inner query (SELECT CompanyName…) is run for each row returned by the outer query (SELECT c.Name…). But why go over the Company again and again for every row processed by the outer query?

A more efficient SQL performance tuning technique would be to refactor the correlated subquery as a join:

SELECT c.Name,
c.City,
co.CompanyName
FROM Customer c
LEFT JOIN Company co
ON c.CompanyID = co.CompanyID

In this case, we go over the Company table just once, at the start, and JOIN it with the Customer table. From then on, we can select the values we need (co.CompanyName) more efficiently.

5, Avoid Multi-Statement Table Valued Functions (TVFs)^[ref]

Multi-statement TVFs are more costly than inline TVFs.

Reason:SQL Server expands inline TVFs into the main query like it expands views, but evaluates multi-statement TVFs in a separate context from themain query and materializes the results of multi-statements into temporary work tables. The separate context and work table makes multi-statement TVFs costly.

6, Avoid using wildcard characters at the beginning of LIKE pattern^[ref]

In order to fine tune your SQL queries, you must avoid using LIKE pattern in the following manner:

SELECT* FROM Customers WHERE address LIKE ‘%bar%’;

Here, the database will not be able to use a suitable index if it exists because of % wildcard. The system starts by performing a full table scan and this takes a toll on its speed. Thus, the better way to write this query is:

SELECT* FROM Customers WHERE address LIKE ‘bar%’;

7, Avoid using COUNT(), instead of Exist()^[ref]

This SQL optimization technique concerns the use of EXISTS(). If you want to check if a record exists, use EXISTS()

IF EXISTS(SELECT FIRSTNAME FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%')
PRINT 'YES'

instead of COUNT()

IF (SELECT COUNT(1) FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%') > 0
PRINT 'YES'

While COUNT() scans the entire table, counting up all entries matching your condition, EXISTS() will exit as soon as it sees the result it needs.

8, Avoid Using GROUP BY, ORDER BY and DISTINCT^[ref]

Avoid using GROUP BY, ORDER BY, and DISTINCT whenever possible.

Reason:When using GROUP BY, ORDER BY, or DISTINCT, the SQL Server engine creates a work table and puts the data on the work table. After that, itorganizes this data in the work table as requested by the query, and then it returns the final result.

9, Avoid Different Datatype on JOIN and WHERE Conditions^[ref][ref]

This is easier said than done depending on your permissions to make changes to the schema.

Reason:When joining or comparing two fields with different datatypes, SQL mustdo an on-the-fly conversion of the field before it can do a comparison,even if the fields are indexed. If mismatched datatypes are unavoidable, try to cast the larger datatype to the smaller datatype whenever possible.

Consider this query, which selects the number of red fruits in a forest.

SELECT
COUNT(*)
FROM
forest
WHERE
fruit_color = 5; /* 5 = red */

Assuming the column fruit_color's type is VARCHAR, or just anything non-numeric, indexing that column won't be very helpful, as therequired implicit cast will prevent the database from using the index for the filtering process.

10, Avoid using HAVING to define filter, instead of WHERE^[ref]

This is easier said than done depending on your permissions to make changes to the schema.

The goal of an efficient query is to pull only the required records from the database. Per the SQL Order of Operations, HAVINGstatements are calculated after WHERE statements. If the intent is to filter a query based on conditions, a WHERE statement is more efficient.

For example, let’s assume 200 sales have been made in the year 2016, and we want to query for the number of sales per customer in 2016.

SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
GROUP BY Customers.CustomerID, Customers.Name
HAVING Sales.LastSaleDate BETWEEN #1/1/2016# AND #12/31/2016#

This query would pull 1,000 sales records from the Sales table, then filter for the 200 records generated in the year 2016, and finally countthe records in the dataset.

In comparison, WHERE clauses limit the number of records pulled:

SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.LastSaleDate BETWEEN #1/1/2016# AND #12/31/2016#
GROUP BY Customers.CustomerID, Customers.Name

This query would pull the 200 records from the year 2016, and then count the records in the dataset. The first step in the HAVING clause has been completely eliminated.

HAVING should only be used when filtering on an aggregated field. In the query above, we could additionally filter for customers with greater than 5 sales using a HAVING statement.

SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.LastSaleDate BETWEEN #1/1/2016# AND #12/31/2016#
GROUP BY Customers.CustomerID, Customers.Name
HAVING Count(Sales.SalesID) > 5

11, Avoid OR conditions^[ref]

Consider this query, which selects the amount of posts on Facebook posted after new year's eve, or posted by a user named Mark.

SELECT
COUNT(*)
FROM
fb_posts
WHERE
username = ‘Mark’
OR post_time > ‘2018-01-01’

Having an index on both the username and post_time columns might sound helpful, but in most cases, the database won't use it, at least not in full. The reason will be the connection between the two conditions - the OR operator, which makes the database fetch the results of each part of the condition separately.

An alternative way to look at this query can be to 'split' the OR condition and 'combine' it using a UNION clause. This alternative will allow you to index each of the conditions separately, so the database will use the indexes to search for the results and then combine the results with the UNION clause.

SELECT …
FROM …
WHERE username = ‘Mark’
UNION
SELECT …
FROM …
WHERE post_time > ‘2018-01-01’

Please note that if you don't mind duplicate records in your result set,you can also use UNION ALL (which will perform better than the default UNION DISTINCT).

12, Avoid sorting with a mixed order^[ref]

Consider this query, which selects all posts from Facebook and sorts them by the username in an ascending order, and then by the post date ina descending order.

SELECT
username, post_type
FROM
fb_posts
ORDER BY username ASC , post_type DESC

MySQL (and so many other relational databases), cannot use indexes when sorting with a mixed order (both ASC and DESC in the same ORDER BY clause). This changed with the release of the reversed indexes functionality and MySQL 8.x.

Summary

These tips are mainly got from online without detailed test, but I believe these will still be helpful hints if we need to use them or verify them.

References

Performance tuning select statements --- Google
7 SQL Query Performance Tuning Tips --- klipfolio.com

Avoid using SELECT*
Use temporary tables wisely
Avoid using COUNT() --- 7
Avoid using wildcard characters at the beginning of LIKE pattern --- 6
Avoid using SELECT DISTINCT
Run at Night for Time Consuming Queries

Looping statements
Running SELECT* on large tables with over 1 million records
Nested subqueries
Wildcard searches
CROSS JOINs
SELECT DISTINCT statements

Supercharge Your SQL Queries for Production Databases --- sisense.com

SELECT fields instead of using SELECT *
Avoid SELECT DISTINCT
Create joins with INNER JOIN (not WHERE --- Cross Join) --- 1
Use WHERE instead of HAVING to define filters --- 10
Use wildcards at the end of a phrase only --- 6
Use LIMIT to sample query results
Run your query during off-peak hours

Selecting from large tables (>1,000,000 records)
Cartesian Joins or CROSS JOINs
Looping statements
SELECT DISTINCT statements
Nested subqueries
Wildcard searches in long text or memo fields
Multiple schema queries

SQL Tuning or SQL Optimization --- beginner-sql-tutorial.com

use the actual columns names in SELECT statement instead of than '*'.
Do not use HAVING clause for any other purposes. --- 10
minimize the number of subquery block in your query.
Use operator EXISTS, IN and table joins appropriately in your query.

a) Usually IN has the slowest performance.
b) IN is efficient when most of the filter criteria is in the sub-query.
c) EXISTS is efficient when most of the filter criteria is in the main query.

Use EXISTS instead of DISTINCT when using joins which involves tables having one-to-many relationship.
Try to use UNION ALL in place of UNION.
Be careful while using conditions in WHERE clause.
To store large binary objects, first place them in the file system and add the file path in the database.
To write queries which provide efficient performance follow the general SQL standard rules.

a) Use single case for all SQL verbs
b) Begin all SQL verbs on a new line
c) Separate all words with a single space
d) Right or left aligning verbs within the initial SQL verb

SQL Database Performance Tuning for Developers --- toptal

Indexes
Execution Plans
Avoid Coding Loops
Avoid Correlated SQL Subqueries --- 4
Select Fields instead of *
Wise Use of Temporary Tables (#Temp) --- 3-a,b
Use Exist instead of Count(*) --- 7

SQL Performance Tuning: 5 Best Tips for Developers --- eversql.com

Create indexes, but do it wisely
Avoid wrapping indexed columns with functions
Avoid OR conditions --- 11
Avoid sorting with a mixed order --- 12
Avoid conditions with different column types --- 9
Avoid LIKE searches with prefix wildcards

Query optimization techniques in SQL Server: tips and tricks --- sqlshack.com
25 tips to Improve SQL Query Performance --- winwire.com
SQL Query Performance --- medium.com

THEORY BEHIND QUERY RUN TIME

Table Size: The larger the table, the longer it takes to read through the data.
Joins: If your joins substantially increase the row count of the result set, your query is likely to be slow.
Aggregations: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.

Reducing Table Size Filter the table to return only what you need.
Simplifying the Joins
13 QUERY PERFORMANCE TIPS

1. Owner/Schema Name
2. The * Operator
3. Nullable Columns
4. Able Variables and Joins --- 3
5. Stored Procedure Names
6. Use SET NOCOUNT ON
7. Avoid Using GROUP BY, ORDER BY and DISTINCT --- 8
8. Check Indexes There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement.
9. Use Same Datatype on JOIN and WHERE Clauses --- 9
10. Avoid Using Calculated Fields in JOIN and WHERE Clauses
11. Avoid Multiple Joins in a Single Query --- 2
12. Avoid Multi-Statement Table Valued Functions (TVFs) --- 5
13. Use Data Compression Whenever Possible