SQL Server SQL Query for SCD Type 2

8nuwlpux  于 2023-11-16  发布在  其他
关注(0)|答案(6)|浏览(103)

I am given the following table with the following problem:

Create a Slowly Changing Dimension Type 2 from the dataset. EMPLOYEE table has daily records for each employee. Type 2 - Will have effective data and expire date.
| Employee ID | Date | Name | Manager ID |
| ------------ | ------------ | ------------ | ------------ |
| 123 | 1-Mar | John Smith | 1 |
| 123 | 2-Mar | John Smith | 1 |
| 123 | 3-Mar | John Smith | 2 |
| 123 | 4-Mar | John Smith | 3 |
| 123 | 5-Mar | John Smith | 3 |

I believe my target table is supposed to look like this:

Employee IDNameManager IDEffective DateExpiration Date
123John Smith11-Mar3-Mar
123John Smith23-Mar4-Mar
123John Smith34-MarNull

I attempted the following query:

SELECT employee_id, name, manager_id,
CASE
    WHEN LAG(manager_id) OVER() != manager_id THEN e.date 
    WHEN e.date = FIRST_VALUE(e.date) OVER() THEN e.date
    ELSE NULL
END as "Effective Date",
CASE 
    WHEN LEAD(manager_id) OVER() != manager_id THEN LEAD(e.date) OVER()
    ELSE NULL
END as "Expiration Date"
FROM employee e

My resulting table is as follows:

Employee IDNameManager IDEffective DateExpiration Date
123John Smith11-MarNull
123John Smith1Null3-Mar
123John Smith23-Mar4-Mar
123John Smith34-MarNull
123John Smith3NullNull

Does anyone know of any way that I can alter my query to achieve my target table, based on what I've achieved thus far? I somehow need to only result in the 3 Manager ID's but distinct will not work. Also, I need to find a way to combine the effective date and expiration date for each manager ID. Any help at all would be greatly appreciated.

nszi6y05

nszi6y051#

The following does what you require, and shows how to add DDL+DML as well. Its probably a bit convoluted but I can't see an obvious way to simplify it.

This solution takes into account the possibility that the manager could repeat. And it doesn't assume that ever day will exist, so if a day is missing it will still work.

declare @Test table (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into @Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3);
--(123, '6 Mar 2021', 'John Smith', 2);

select EmployeeId, [Name], ManagerId, MinDate
  -- Use lead to get the last date of the next grouping - since it could in theory be more than one day on
  , lead(MinDate) over (partition by EmployeeId, [Name] order by Grouped) MaxDate
from (
  -- Get the min and max dates for a given grouping
  select EmployeeId, [Name], ManagerId, min([Date]) MinDate, max([Date]) MaxDate, Grouped
  from (
    select *
       -- Sum the change in manager to ensure that if a manager is repeated they form a different group
       , sum(Lagged) over (order by Date asc) Grouped
    from (
      select *
        -- Lag the manager to detect when it changes
        , case when lag(ManagerId,1,-1) over (order by [Date] asc) <> ManagerId then 1 else 0 end Lagged
      from @Test
    ) X
  ) Y
  group by EmployeeId, [Name], ManagerId, Grouped
) Z
order by EmployeeId, [Name], Grouped;

Returns:

EmployeeIdNameManagerIdMinDateMaxDate
123John Smith12021-03-012021-03-03
123John Smith22021-03-032021-03-04
123John Smith32021-03-04NULL
vaqhlq81

vaqhlq812#

Use this, this will be simpler.

Explanation: The nested query will give the rows where there is change in managers, filter out rest of the rows as it is redundant information.

Once the data is filtered, find the next date when the manager got changed, mark that data as end date

SELECT 
   EmployeeID, 
   Name, 
   ManagerID, 
   [Date] StartDate, 
   Lead([Date]) OVER (ORDER BY [Date])  EndDate  
FROM 
    (SELECT *, lag(ManagerID,1,-1) OVER (ORDER BY [Date]) p_mgid FROM #temp ) s  
WHERE ManagerID <>p_mgid;
    ;

rdlzhqv9

rdlzhqv93#

create table test  (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3)

select a.employeeid,a.name,a.managerid,a.effective as effective_date,
lead(effective) over(partition by employeeid order by maximum) as expiration_date 
from (select employeeid,min(date) as effective,max(date)as maximum,name,managerid from test
group by employeeid,name,managerid) a
rggaifut

rggaifut4#

from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('test').getOrCreate()

l1=[[123,'John Smith','1-March-2022',1],[123,'John Smith','2-March-2022',1],[123,'John Smith','3-March-2022',2],[123,'John Smith','4-March-2022',3],[123,'John Smith','5-March-2022',3]]
col=['empid','name','date','mgr']

df=spark.createDataFrame(l1,col)

df.createOrReplaceTempView('tempview')

spark.sql("""select empid,name,mgr,effective_from, lead(effective_from) over (partition by empid order by effective_from ) as effective_to from  
                (select empid,name,mgr ,min(date) as effective_from,max(date) as effective_to from tempview group by mgr,empid,name)""").show()
5cnsuln7

5cnsuln75#

Set up schema

Use the following query to setup an employees table in sql server (or RDBMS of your choice with some syntax mods)

create table employees (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into employees (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3)

Walkthrough of Solution

Step 1: Ascertain the Start Dates

Transform the Employees table to show the start date for each manager an employee has worked under.

SELECT EmployeeID, Name, ManagerID, MIN(Date) AS start_date
    FROM test e
    GROUP BY EmployeeID,Name,ManagerID

This returns:

EmployeeIDNameManagerIDstart_date
123John Smith12021-03-01
123John Smith22021-03-03
123John Smith32021-03-04

Step 2: Adding end_date

Use this table to calculate a column that displays the end date for the given employee - manager pair. Using the concept that the start_date of an employee working with one manager, is the end date of the employees tenure with their previous manager.

Use the LEAD function and create partitions over Employee, making sure to order by start_date to create a column that represents the end_date of an employee - manager pair.

See below how we use the first query to generate a CTE called t1, then build the end_date column in a query carried out against t1.

WITH t1 AS (
    SELECT EmployeeID, Name, ManagerID, MIN(Date) AS start_date
    FROM test e
    GROUP BY EmployeeID,Name,ManagerID
)

SELECT *, 
LEAD(start_date) OVER (PARTITION BY EmployeeID ORDER BY start_date) AS end_date
FROM t1

Final rowset matches desired ouput:

EmployeeIDNameManagerIDstart_dateend_date
123John Smith12021-03-012021-03-03
123John Smith22021-03-032021-03-04
123John Smith32021-03-04NULL

Further Test

To be safe, test the case where there is more than one employee in input dataset.

insert into employees (EmployeeID, [Date], [Name], ManagerID)
values
(133, '1 Mar 2021', 'Sean Smith', 1),
(133, '2 Mar 2021', 'Sean Smith', 2),
(133, '3 Mar 2021', 'Sean Smith', 3),
(143, '4 Mar 2021', 'Mark Smith', 3),
(143, '5 Mar 2021', 'Mark Smith', 4)

The output shows our solution passed the test:

EmployeeIDNameManagerIDstart_dateend_date
123John Smith12021-03-012021-03-03
123John Smith22021-03-032021-03-04
123John Smith32021-03-04NULL
133Sean Smith12021-03-012021-03-02
133Sean Smith22021-03-022021-03-03
133Sean Smith32021-03-03NULL
143Mark Smith32021-03-042021-03-05
143Mark Smith42021-03-05NULL
6rvt4ljy

6rvt4ljy6#

declare @Test table (EmployeeID int, [Date] date, [Name] varchar(32), ManagerID int);

insert into @Test (EmployeeID, [Date], [Name], ManagerID)
values
(123, '1 Mar 2021', 'John Smith', 1),
(123, '2 Mar 2021', 'John Smith', 1),
(123, '3 Mar 2021', 'John Smith', 2),
(123, '4 Mar 2021', 'John Smith', 3),
(123, '5 Mar 2021', 'John Smith', 3);
--(123, '6 Mar 2021', 'John Smith', 2);

;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY NAME,MANAGERID ORDER BY DATE) RW FROM @TEST
)
SELECT EmployeeID,NAME,ManagerID,DATE AS FROMDATE ,(SELECT DATE FROM CTE B WHERE A.RW =B.RW AND B.ManagerID = A.ManagerID +1) AS ENDDATE
FROM CTE A 
WHERE RW=1

相关问题