Physical row movement during a scan can cause missed rows and duplicate reads (same row appearing twice in the result) in SELECT queries under certain conditions. For MS SQL Server this is well documented in connection with the NOLOCK hint.
Some sources - among them Dmitri Korotkevitch and Paul White - state that only SERIALIZABLE isolation can avoid these phenomena; others indicate that READ COMMITTED or REPEATABLE READ is sufficient. One argument that convinced me is that a read can hardly be called repeatable if it can miss rows or return them multiple times simply because concurrent updates change some field values. Also, what good would UNIQUE or PRIMARY KEY constraints on a table be if duplicate reads could undermine them at any time?
I'd like some clarity on the issue, at least with regard to MS SQL Server and Postgres. I would like it to be somewhat definitive and not based purely on opinion, as pretty much everything that weeks of research have turned up so far (on paper and on the 'Net) has been.
Experiencing/reproducing the phenomena under READ COMMITTED or REPEATABLE READ would certainly clinch it, but Google only turned up several challenges to find such repros. Explicit statements in DBMS documentation or by the DBMS developers would certainly clinch it as well, as would wording from the SQL standards and associated documents.
So: can missed rows and/or duplicate reads occur under READ COMMITTED or REPEATABLE READ isolation?
Note: the phantom read commonly cited in explanations of the SERIALIZABLE isolation level is an entirely different phenomenon and not of interest here. See "If 'phantom read' involves two or more consecutive queries, what is it called when happening in one and the same query?", which seeks to clarify the terms and also has links to further sources/discussions.
Clarification: this question concerns a row getting missed or multiple versions of it being contained in a result set because of concurrent changes to its attributes, not due to concurrent insertions or deletions.
3 Answers

Answer 1:
This answer is SQL Server specific.
For the locking implementation of read committed, locks are generally acquired and released as the data is read. This provides scope for both missed rows and duplicate reads during a single scan. For repeatable read, locks are acquired as data is read and released at transaction end. This provides scope for missed rows (but not duplicate reads) during a single scan.
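To make the lock lifetimes concrete, here is a minimal two-level sketch; the table dbo.T and its columns are invented for illustration and do not come from the answer:

```sql
-- READ COMMITTED (locking): each row's shared (S) lock is released as soon
-- as the row has been read, so a concurrent UPDATE can move a not-yet-read
-- row behind the scan (missed) or an already-read row ahead of it (duplicate).
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT ID, Val FROM dbo.T ORDER BY ID;

-- REPEATABLE READ: S locks are held until the end of the transaction, so
-- already-read rows cannot move ahead of the scan (no duplicates), but rows
-- not yet reached can still move behind it (missed rows remain possible).
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT ID, Val FROM dbo.T ORDER BY ID;
-- ... S locks on every row read above are still held at this point ...
COMMIT;
```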
"...DBMS documentation or by the DBMS developers would certainly clinch it"
See the articles on read committed locking and repeatable read by Craig Freedman on the SQL Server team - though TBH Paul White is one of the most credible sources on SQL Server.
In the repeatable read article I would draw your attention to the following quote:
These locks do not protect rows that have not yet been scanned from updates or deletes and do not prevent the insertion of new rows amid the rows that are already locked.
"Experiencing/reproducing the phenomena under READ COMMITTED or REPEATABLE READ would certainly clinch it."
There are some repros in the above links. A different one is given below for the double clinch.
Setup (make sure to re-run this each time to put the table back into the expected initial state)
Update query
Select query
Results (after one minute wait)
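The scripts themselves did not survive in this copy of the answer, so the following is only a plausible sketch of the kind of repro the headings above describe. All object names, row counts, and timings are assumptions, not the author's originals; arranging the race so that the scan is still in flight when the update commits is hand-waved here:

```sql
-- Setup (re-run each time to restore the initial state): an assumed table
-- with a clustered primary key and 8 rows, IDs 1..8.
DROP TABLE IF EXISTS dbo.T;
CREATE TABLE dbo.T (ID int PRIMARY KEY CLUSTERED, Padding char(4000) NOT NULL DEFAULT 'x');
INSERT INTO dbo.T (ID) VALUES (1),(2),(3),(4),(5),(6),(7),(8);

-- Update query (session 1): after a delay, move the row with ID 1 to a key
-- position the descending scan has already passed.
WAITFOR DELAY '00:01:00';
UPDATE dbo.T SET ID = 100 WHERE ID = 1;

-- Select query (session 2): a backwards (descending) ordered scan at the
-- default READ COMMITTED isolation, started before the update commits.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT ID FROM dbo.T ORDER BY ID DESC;
-- If the timing works out, the result has only 7 rows: the old position of
-- ID 1 is empty when the scan reaches it, and its new position (100) lies
-- behind the scan.
```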
Notice that the result only has 7 rows. The execution plan for the SELECT had not got as far as reading the row with ID of 1 when that row was updated and moved in the index to a point after where the SELECT had already read up to (with a backwards ordered scan, moving after the scan position means the row is missed).

Answer 2:
Regarding SQL Server:
READ COMMITTED isolation is known to have serious issues with missing and duplicate rows, which is why RCSI is often recommended for new work. For example, Paul White outlines the issues in his series. The primary issue at this isolation level is that a row could be read, then deleted and moved by a concurrent update, then read again.
Having said that, this phenomenon is not as common as you might imagine, primarily because it's unusual to be modifying a primary key in the first place. It can also only happen when a scan is in progress on an index and an update happens around it, which is not that common.
Also, page-level locking (which is pretty common) has a coarser granularity, and U locks are used for updates, which are not released if they are combined into a UIX lock.

"a read can hardly be called repeatable if it can miss rows or return them multiple times, simply because concurrent updates change some field values"
You misunderstand what "repeatable" in REPEATABLE READ means here. Repeatable means that once a row has been read, it cannot be changed by another concurrent query. It does not mean that the query as a whole is repeatable. To put it another way: inserts and deletes ahead of the scan are allowed, but anything previously read is blocked. So the difference between these two isolation levels is whether a read query releases its locks once it has read data or not.
You also seem to be labouring under a misunderstanding about how locking works when updating a key column. When you update a column which is not the clustering key, a lock is taken on the clustering key, not on the key of the index being read. But when an update is made to the clustering key itself, it effectively becomes an insert/delete pair. In other words, there is no major difference between inserts, updates and deletes when it comes to locking; they are all susceptible to the same phenomena.
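A minimal sketch of this point; the table and values are invented for illustration:

```sql
-- Assumed table, clustered on ID.
CREATE TABLE dbo.Demo (ID int PRIMARY KEY CLUSTERED, Val varchar(10));
INSERT INTO dbo.Demo (ID, Val) VALUES (1, 'a'), (2, 'b');

-- Updating a non-key column: the row stays in place and the X lock is
-- taken on the clustering key value 2.
UPDATE dbo.Demo SET Val = 'bb' WHERE ID = 2;

-- Updating the clustering key itself: internally this behaves like deleting
-- the row at key 1 and inserting a row at key 9, so a scan that has already
-- passed key 9, or not yet reached key 1, can double-read or miss the row,
-- just as with a genuine insert/delete pair.
UPDATE dbo.Demo SET ID = 9 WHERE ID = 1;
```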
"what good would UNIQUE or PRIMARY KEY constraints on a table be if duplicate reads could undermine them at any time?"

Isolation levels are always a matter of risk appetite. The data at rest is not problematic; it is only the isolation of different queries that is changed.
For example, you might decide you want to do a joined update on OrderHeader rows at READ COMMITTED isolation. You don't care if someone deletes or modifies your row after you did, so why not?

But when you want to do an INSERT to OrderDetail, joining OrderHeader in the process, you don't want parent (FK) OrderHeader rows to disappear after having read them, so you use REPEATABLE READ. The SERIALIZABLE level is overkill here, because no-one is changing the primary key.

Then you might decide to enforce a complex multi-table constraint using a procedure or trigger, and in the interests of consistency you wish to enforce complete isolation of all queries involved. So you would use SERIALIZABLE.

You might even have different locking hints per table reference in a single query. The point is: think about your locking and choose what it is you want to enforce.
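The OrderDetail scenario above can be sketched with a per-table hint; all table and column names, and the placeholder values, are assumptions for illustration:

```sql
DECLARE @OrderID int = 42, @ProductID int = 7, @Qty int = 1;  -- placeholder values

-- Insert a detail row while protecting the parent OrderHeader row from
-- being deleted mid-statement: REPEATABLE READ on the parent reference only.
INSERT INTO dbo.OrderDetail (OrderID, ProductID, Qty)
SELECT oh.OrderID, @ProductID, @Qty
FROM dbo.OrderHeader AS oh WITH (REPEATABLEREAD)
WHERE oh.OrderID = @OrderID;

-- A SERIALIZABLE hint would additionally range-lock around the parent key,
-- which is unnecessary here because nobody is changing the primary key.
```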
"I would like it to be somewhat definitive and not based purely on opinion"
There is a quite lengthy and clear document by Microsoft on locking and isolation levels here.
Again, READ COMMITTED SNAPSHOT isolation (different from SNAPSHOT) is recommended for new work, and it is the default isolation level in Azure SQL Database. It uses optimistic concurrency and so avoids almost all of the issues of the other levels. SERIALIZABLE is very heavy on locking, so if you can avoid it, do so. Be aware that the snapshot levels have issues with write skew, which can be avoided if you code carefully.

Regarding PostgreSQL:
The Postgres documentation has quite a comprehensive writeup on locking and isolation. I'm not very knowledgeable on it.
But I do know that Postgres by default uses row versioning (similar to SNAPSHOT in SQL Server), and this always applies when using READ COMMITTED or REPEATABLE READ, so most of the issues with those levels do not happen. There is only write skew to worry about.
Answer 3:

SQL Server documentation states explicitly that SERIALIZABLE isolation or row versioning is required to avoid missed rows (can't believe I missed this!). The information is in the Transaction locking and row versioning guide:

Missing and double reads caused by row updates
Transactions that are running at the READ UNCOMMITTED level do not issue shared locks to prevent other transactions from modifying data read by the current transaction. Transactions that are running at the READ COMMITTED level do issue shared locks, but the row or page locks are released after the row is read. In either case, when you are scanning an index, if another user changes the index key column of the row during your read, the row might appear again if the key change moved the row to a position ahead of your scan. Similarly, the row might not appear if the key change moved the row to a position in the index that you had already read. To avoid this, use the SERIALIZABLE or HOLDLOCK hint, or row versioning. For more information, see Table Hints (Transact-SQL) .
Missed rows due to page splits are dealt with as well:
When you are using READ UNCOMMITTED, if your query reads rows using an allocation order scan (using IAM pages), you might miss rows if another transaction is causing a page split. This cannot occur when you are using read committed because a table lock is held during a page split and does not happen if the table does not have a clustered index, because updates do not cause page splits.