SQL Server Similar Query different execution plan and performance

djmepvbi  于 2023-04-04  发布在  其他
关注(0)|答案(1)|浏览(166)

This is the script I use to generate 10M rows on Clients table from 200 different cities. (It took 11 minutes on a desktop PC)

*Important note at the end

-- Crear la tabla Clientes
CREATE TABLE Clientes (
    id_cliente BIGINT PRIMARY KEY IDENTITY(1, 1),
    id_ciudad INT,
    nombre NVARCHAR(100),
    email NVARCHAR(320),
    ciudad nvarchar(100)
);

-- Crear la tabla Clientes_Ciudad
CREATE TABLE Clientes_Ciudad (
    id_ciudad INT PRIMARY KEY IDENTITY(1, 1),
    ciudad NVARCHAR(100) UNIQUE
);
go 

alter PROCEDURE [dbo].[PopulateData]
AS
BEGIN
    SET NOCOUNT ON;

    -- Crear la tabla de números con 10 millones de registros
    WITH
    L0 AS (SELECT 1 AS c 
            FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                        (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c)),
    L1 AS (SELECT 1 AS c FROM L0 CROSS JOIN L0 AS B),
    L2 AS (SELECT 1 AS c FROM L1 CROSS JOIN L1 AS B),
    L3 AS (SELECT 1 AS c FROM L2 CROSS JOIN L2 AS B),
    L4 AS (SELECT 1 AS c FROM L3 CROSS JOIN L3 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L4)
    SELECT TOP 10000000 n INTO #Numbers FROM Nums;

    -- Inserción de ciudades
    INSERT INTO Clientes_Ciudad (ciudad)
    SELECT N'ciudad_' + RIGHT('00000' + CAST(n AS NVARCHAR(5)), 5)
    FROM #Numbers
    WHERE n <= 10000;

    -- Calcular la cantidad de clientes por ciudad
    DECLARE @current_city_id INT = 1;
    DECLARE @city_increment INT = 1;
    DECLARE @clients_in_city INT = 1000;

    CREATE TABLE #CityClients (city_id INT PRIMARY KEY, client_start INT, client_end INT);

    WHILE @current_city_id <= 10000
    BEGIN
        INSERT INTO #CityClients (city_id, client_start, client_end)
        VALUES (@current_city_id, @clients_in_city * (@current_city_id - 1) + 1, @clients_in_city * @current_city_id);

        SET @current_city_id = @current_city_id + 1;
        SET @clients_in_city = @clients_in_city + 500 * @city_increment;
        SET @city_increment = @city_increment + 1;
    END

    -- Inserción de clientes
    INSERT INTO clientes (id_ciudad, nombre, email,ciudad)
    SELECT
        CC.city_id,
        N'cliente_' + RIGHT('0000000' + CAST(N.n AS NVARCHAR(7)), 7),
        N'email_' + CAST(NEWID() AS NVARCHAR(36)) + N'@example.com',
        N'ciudad_' + RIGHT('00000' + CAST(n AS NVARCHAR(5)), 5)
    FROM #Numbers AS N
    INNER JOIN #CityClients AS CC
        ON N.n BETWEEN CC.client_start AND CC.client_end;

    -- Crear índices
    CREATE INDEX idx_clientes_id_ciudad ON clientes (id_ciudad);
    CREATE INDEX idx_clientes_ciudad_nombre_email ON clientes (id_ciudad) INCLUDE (nombre, email);
    CREATE INDEX idx_clientes_ciudad ON Clientes

        -- Eliminar las tablas temporales
    DROP TABLE #Numbers;
    DROP TABLE #CityClients;
END
GO

-- Ejecuta el stored procedure para poblar las tablas
EXEC [dbo].[PopulateData]
GO
delete Clientes_Ciudad where id_ciudad>10000
go

I did some sanitization but it is almost the same result.

The question is why these two queries perform differently. Why does the second execution plan use Nested Loop (Inner Join) and how to avoid it?

declare @ciudad_id int;
SELECT @ciudad_id= id_ciudad FROM Clientes_Ciudad WHERE ciudad = N'Ciudad_00199';

SELECT id_cliente, nombre, email 
FROM clientes
WHERE id_ciudad = @ciudad_id;

SELECT id_cliente, nombre, email
FROM clientes   
WHERE  id_ciudad =(SELECT id_ciudad FROM Clientes_Ciudad WHERE ciudad = N'Ciudad_00199');

This is the execution plan I got. If I change ciudad = N'Ciudad_00199' for a different one in one of the queries, the result is the same.

NOTE

From the scratch.

I was proving to a friend these two queries perform slightly differently in favor of the former, and it did.

SELECT id_cliente, nombre, email 
FROM clientes
WHERE id_ciudad = 18;
  

SELECT id_cliente, nombre, email
FROM clientes a INNER JOIN Clientes_Ciudad b ON a.id_ciudad = b.id_ciudad
WHERE b.ciudad = N'Ciudad_00018';

But, when I compare this, both perform the same. Surprisingly for me

SELECT id_cliente, nombre, email
FROM clientes   
WHERE  id_ciudad =(SELECT id_ciudad FROM Clientes_Ciudad WHERE ciudad = N'Ciudad_00017');

SELECT id_cliente, nombre, email
FROM clientes a INNER JOIN Clientes_Ciudad b ON a.id_ciudad = b.id_ciudad
WHERE b.ciudad = N'Ciudad_00018';

So I craft the script for getting the variable first and got the results and the performance I wanted and came here to help me understand what was going on and how to do it in a single query and not the script(just to learn if it is possible, no special need)

declare @ciudad_id int;
SELECT @ciudad_id= id_ciudad FROM Clientes_Ciudad WHERE ciudad = N'Ciudad_00015';

SELECT id_cliente, nombre, email 
FROM clientes
WHERE id_ciudad = @ciudad_id;
  

SELECT id_cliente, nombre, email
FROM clientes   
WHERE  id_ciudad =(SELECT id_ciudad FROM Clientes_Ciudad WHERE ciudad = N'Ciudad_00017');

The engine is more intelligent than I am, and I know that after WHERE id_ciudad = you can accept only a scalar value (or null), so why it is using a nested loop, and how can I force the other way?

NOTE2

Link to query plan https://www.brentozar.com/pastetheplan/?id=BJXcbeG-3

f0ofjuux

f0ofjuux1#

There is nothing wrong with the join query. It's a perfectly decent plan, and is as efficient as the two separate queries.

SQL Server has recognized that the inner subquery can be transformed into a join. It is normally very good at such things.

In the first version, you did not have a unique index/constraint on ciudad , so to use a join correctly it had to ensure that a maximum of one row was returned. That was what the Assert was doing.

In the second version, the compiler has correctly recognized that the subquery is guaranteed to return exactly one row (or none) so it has transformed into a join.

The Nested Loops is not a problem, as the outer side is only a single row anyway. The result of that row is fed into the Nested Loops as an outer reference, so it is doing a Seek on the second table, rather than scanning the whole thing (ie this is not a Naive Nested Loops). This query is almost exactly the same in logical terms as doing the two halves separately.

相关问题