About those Read-Only Replicas

In a post, I wrote last month I showed how you can take advantage of your replicas in SQL Sever by adding a single attribute to your connection string in .Net. Turns out you need to be aware of a caveat that I found yesterday while tracking down an odd bug.

The application in question manages lists of books. The workflow is as follows:

  1. Select 1 or more ISBNs from the list of books we publish (over 1m)

  2. Assign this collection a name and a catalog (collection of lists)

  3. Save the collection

  4. Go back to the main menu with the names and metadata of all your catalogs

The issue happens between 3 and 4. This system keeps metadata about each list and metadata about the folder (number of total lists, number of total items, last modification). It needs to make a series of calls. It can't just save the list and return you to the main screen without first making a call to the system to populate that screen with the latest information (which includes audit information). This means a call to SaveList and a call to GetCatalogs. Developers noted that 70% of the time the main menu was missing the newly saved list, but if they did a refresh it appeared. We could not reproduce this by hand (though Postman).

We are running the database on an Azure Hyperscale with RCSI (Read Committed Snapshot Isolation), so the first thought was there are independent threads making the calls and the GetCatalogs is returning the previously committed set. However, the development team confirmed that it was a single chain of events through a single thread with an average time between calls of 300ms.

All calls in this application are controlled via stored procedures, so we can guarantee the order of operations and within the SaveList procedure is a commit before the result set is returned guaranteeing the data is in the database before the thread gets the go-ahead to move to the next call. At this point I realized it was the replica.

300ms is not enough time for the data to be moved to the replica, so I simply removed the ApplicationIntent=ReadOnly and the bug disappeared.

A word of caution when designing an application. The flow of the application is important to understand while making architectural decisions. In this specific example, the need to pull back the data vs having the application carry it means the timing of when the data is where (primary vs replica) is very important. If the application did a few other things and the avg time between the save and when it made the call to GetCatalog was say 750ms or 1s, the replica might have been just fine.