I created the stored procedure below to pull data from a source table and insert only "new" records into a destination table. I need to run this every minute. The challenge is that the table stores images and has over 2 million rows.
When I run the stored procedure, it is still running after 1 hour and 22 minutes. I've already moved all images over to this table, so there are no new images to pull; I'm assuming it's still reading the entire table looking for new records. The job is not being blocked and shows as runnable.
Is there any way to optimize this stored procedure so that it pulls only new records? My goal is to optimize it so that we can run it on a recurring schedule. The business requirement is to run it every minute, but based on the current runtime that is not possible.
I've included the source and target table schemas so you can see all the columns I have to work with. I'm not sure whether I can use the created_date column in the source table to filter the data so it doesn't read through all of it. There is also an identity column (acc_image_id) in the source table, and I'm not sure how I should modify the stored procedure to pull data more quickly based on MAX(acc_image_id).
FYI, I can add columns to the target table if it makes sense to add a column that helps filter the data so that it only looks for recent images.
Source table:
CREATE TABLE [dbo].[acc_image]
(
[acc_image_id] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[acc_id] [int] NULL,
[image_type_id] [int] NOT NULL,
[data_format] [varchar](10) NOT NULL,
[label] [varchar](50) NOT NULL,
[description] [varchar](255) NULL,
[image_width] [smallint] NULL,
[image_height] [smallint] NULL,
[image_color_depth] [tinyint] NULL,
[image_thumbnail] [image] NOT NULL,
[data] [image] NOT NULL,
[notes] [text] NULL,
[acc_specimen_id] [int] NULL,
[acc_slide_id] [int] NULL,
[include_in_report] [char](1) NOT NULL,
[include_in_internet] [char](1) NOT NULL,
[created_date] [datetime] NOT NULL,
[row_version] [timestamp] NOT NULL,
[report_section_number] [smallint] NULL,
[external_notes] [text] NULL,
[source_filename] [varchar](255) NULL,
[page_count] [int] NOT NULL,
[include_as_attachment] [char](1) NOT NULL,
[sort_order] [smallint] NULL,
[acc_parent_image_id] [int] NULL,
[created_by_id] [int] NULL,
[annotation] [text] NULL,
[specimen_results_enabled] [char](1) NOT NULL,
[dis_imageserver_id] [int] NULL,
[external_slide_image_id] [varchar](80) NULL,
[external_report_image_id] [varchar](80) NULL,
)
Here is the target table schema. It has an insert_date column that I don't know whether I can use to optimize:
CREATE TABLE [dbo].[acc_image]
(
[acc_image_id] [int] NOT NULL,
[acc_id] [int] NULL,
[image_type_id] [int] NOT NULL,
[data_format] [varchar](10) NOT NULL,
[label] [varchar](50) NOT NULL,
[description] [varchar](255) NULL,
[image_width] [int] NULL,
[image_height] [int] NULL,
[image_color_depth] [tinyint] NULL,
[image_thumbnail] [image] NOT NULL,
[data] [image] NOT NULL,
[image_guid] [uniqueidentifier] NULL,
[created_date] [datetime] NOT NULL,
[row_version] [varbinary](12) NOT NULL,
[sort_order] [int] NULL,
[insert_date] [datetime] NOT NULL,
[updated_date] [datetime] NULL,
)
Here's the stored procedure code:
ALTER PROCEDURE [dbo].[get_image]
AS
BEGIN
DECLARE @RecCt AS int = 0
BEGIN TRY
INSERT INTO connect_onprem.dbo.acc_image (acc_image_id, acc_id, image_type_id,
data_format, label, description,
image_width, image_height,
image_color_depth, image_thumbnail,
data, created_date, row_version, sort_order)
SELECT
src.acc_image_id, a.id, src.image_type_id,
src.data_format, src.label, src.description,
src.image_width, src.image_height,
src.image_color_depth, src.image_thumbnail,
src.data, src.created_date, src.row_version, src.sort_order
FROM
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_specimen s ON a.primary_specimen_id = s.id
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_slide ass ON s.id = ass.acc_specimen_id
--source table
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src ON ass.id = src.acc_slide_id
--target table
LEFT JOIN
connect_onprem.dbo.acc_image tgt ON src.acc_image_id = tgt.acc_image_id
WHERE
tgt.acc_image_id IS NULL
AND a.acc_type_id <> 134
AND a.status_final = 'Y'
ORDER BY
acc_image_id
SET @RecCt = @@ROWCOUNT
IF @RecCt > 0
BEGIN
INSERT INTO connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState,
ErrorSeverity, ErrorLine, ErrorProcedure, ErrorMsg, ErrorDateTime)
Values ('RecTrack', @RecCt, 0, 0, 0, 'get_image',
'connect_onprem.dbo.acc_image Inserted Records', GETDATE());
END
END TRY
BEGIN CATCH
INSERT INTO connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState, ErrorSeverity,
ErrorLine, ErrorProcedure, ErrorMsg, ErrorDateTime)
VALUES (SUSER_SNAME(), ERROR_NUMBER(), ERROR_STATE(), ERROR_SEVERITY(), ERROR_LINE(),
ERROR_PROCEDURE(), ERROR_MESSAGE(), GETDATE());
END CATCH
END
Goal: optimize this to run as fast as possible, reading the source table over a linked server across the network and inserting only "new" records into the target table.
Right now the stored procedure runs for over 1 hour and 22 minutes. I need to change this so it can run at least every 1 minute (if possible).
1 Answer
I suspect that the remote part of the query is retrieving and transferring all of the selected data (including the presumably large data and image_thumbnail values) from the remote database before checking locally for the existence of a local copy. I would assume that once the initial load is complete, most of the retrieved data is discarded as duplicate. For example, you might have one million records in your remote source, but only one thousand of them need to be inserted as new rows in your target table.
The solution may be to initially select just the ID values into a temp table (or table variable) and then use that in a second query to select and insert the final data. Retrieving one million ID values to be checked up front should be immensely faster than retrieving one million complete rows, and once the IDs have been filtered, retrieving just those one thousand rows should complete relatively quickly.
Something like:
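A sketch of that two-step approach, reusing the joins, filters, and column list from the posted procedure (the #NewImages temp table name is illustrative):

```sql
-- Step 1: pull only the key columns across the link; no image blobs yet.
IF OBJECT_ID('tempdb..#NewImages') IS NOT NULL
    DROP TABLE #NewImages;

SELECT src.acc_image_id, a.id AS acc_id
INTO #NewImages
FROM [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_specimen s
        ON a.primary_specimen_id = s.id
INNER JOIN [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_slide ass
        ON s.id = ass.acc_specimen_id
INNER JOIN [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src
        ON ass.id = src.acc_slide_id
LEFT JOIN connect_onprem.dbo.acc_image tgt
        ON src.acc_image_id = tgt.acc_image_id
WHERE tgt.acc_image_id IS NULL
  AND a.acc_type_id <> 134
  AND a.status_final = 'Y';

-- Step 2: retrieve and insert only the rows identified above. Because
-- acc_id was saved in step 1, the accession/specimen/slide joins are
-- no longer needed here.
INSERT INTO connect_onprem.dbo.acc_image (acc_image_id, acc_id, image_type_id,
        data_format, label, description, image_width, image_height,
        image_color_depth, image_thumbnail, data, created_date, row_version, sort_order)
SELECT src.acc_image_id, ids.acc_id, src.image_type_id,
       src.data_format, src.label, src.description,
       src.image_width, src.image_height,
       src.image_color_depth, src.image_thumbnail,
       src.data, src.created_date, src.row_version, src.sort_order
FROM #NewImages ids
INNER JOIN [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src
        ON src.acc_image_id = ids.acc_image_id;
--ORDER BY src.acc_image_id
```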
It appears that you can eliminate several joins from the second query by saving the accession ID during the preselect. Since all of the filters have already been applied, and the rest of the data comes from the acc_image table, the other tables need not be referenced.
The query engine might perform a loop join, retrieving one remote acc_image row at a time, or it might send the full list of preselected IDs to the remote server for a batch retrieval. You should run performance tests on both and examine the resulting execution plans. You might even try the following hybrid of the above in the second query:
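One possible form of that hybrid (an assumption about the intent, since the best shape depends on your execution plans): keep the temp-table join, but also hand the remote server the MIN/MAX of the preselected IDs so it can restrict its side of the query to that range.

```sql
-- Assumes #NewImages (acc_image_id, acc_id) was populated by the preselect.
DECLARE @MinId int, @MaxId int;
SELECT @MinId = MIN(acc_image_id), @MaxId = MAX(acc_image_id)
FROM #NewImages;

INSERT INTO connect_onprem.dbo.acc_image (acc_image_id, acc_id, image_type_id,
        data_format, label, description, image_width, image_height,
        image_color_depth, image_thumbnail, data, created_date, row_version, sort_order)
SELECT src.acc_image_id, ids.acc_id, src.image_type_id,
       src.data_format, src.label, src.description,
       src.image_width, src.image_height,
       src.image_color_depth, src.image_thumbnail,
       src.data, src.created_date, src.row_version, src.sort_order
FROM #NewImages ids
INNER JOIN [ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src
        ON src.acc_image_id = ids.acc_image_id
-- The range predicate can be pushed to the remote server, letting it
-- seek on its identity key instead of returning the whole table.
WHERE src.acc_image_id BETWEEN @MinId AND @MaxId;
```

Compare the execution plan for this against the plain temp-table join; whichever sends less data across the linked server wins.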
(Note that I commented out the ORDER BY, as I believe it to be unnecessary.)