本文来自 Windows Azure CAT 团队。原文名称:。翻译的不好,懂英语的读原文吧。见过很多人写如何写数据库容灾之类的文章,但是在应用程序层如何处理由于数据库镜像切换而导致的事务失败却没有找到好的案例。通过这个框架的学习收获了不少知识。
本文将围绕创建高可用的SQL Azure客户端程序,提供一系列最佳实践。本文重点关注瞬态错误处理,话句话说,在那些高密度,多用户环境下,如SQL Azure,作为高可用性程序必须处理的那些间歇性的故障,错误和异常。
背景知识
已经开始工作在SQL Azure环境的开发者也许知道SQL Azure已经引入了一些的特殊技术和探索,用以实现SQL Azure的数据访问层的管理。
其中最重要的一点也就是如何处理客户并发连接。SQL Azure 实现了节流阀行为,当客户发起到SQL Azure的连接或者运行查询时便可以显现出来。SQL Azure 可能从内部限制数据库连接数,如当资源过度使用,长时间事务处理,故障转移,负载均衡处理等情况会导致SQL Azure中止客户会话或者暂时停止建立新的连接。数据库连接也可能由于数据中心和客户端之间的网络连接状况造成数据库连接中断,如网络质量,偶尔然的客户端局域网或广域网基础连接错误以及其他不可预知的错误。
上述讨论的情况在SQL Azure团队2010年5月的博客中讨论过(没有找到)。那篇文章已经讨论为了稳定连接SQL Azure服务,在应用程序中实现重试逻辑的必要性。在一个Azure客户项目中,这个问题上我们面临的许多挑战。这段经历也驱使了我们创建一个基础的,可重用的处理瞬态错误框架,它使用了可扩展的重试策略模式。
SQL Azure 中的瞬态错误
当在客户端程序中处理SQL Azure访问异常时,最重要的一点是区分基本的错误和需要特殊处理错误。不是所有的异常都应该被当做一个瞬态错误。客户端程序需要确保只在必要的条件下进入重试状态。
下面是一些当我们使用SQL Azure服务建立连接或查询时可能出现的瞬态错误示例。
Error Number | Error Message |
40197 | The service has encountered an error processing your request. Please try again. |
40501 | The service is currently busy. Retry the request after 10 seconds. |
10053 | A transport-level error has occurred when receiving results from the server. An established connection was aborted by the software in your host machine. |
10054 | A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 -An existing connection was forcibly closed by the remote host.) |
10060 | A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 -A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.) |
40613 | Database XXXX on server YYYY is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of ZZZZZ. |
40143 | The service has encountered an error processing your request. Please try again. |
233 | The client was unable to establish a connection because of an error during connection initialization process before login. Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. (provider: TCP Provider, error: 0-An existing connection was forcibly closed by the remote host.) |
64 | A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 0-The specified network name is no longer available.) |
20 | The instance of SQL Server you attempted to connect to does not support encryption. |
完整的错误清单,请查阅MSDN文档, 章节。
当使用SQL Azure时为了确定一个异常是否应该当做“瞬态错误”处理,我们应该遵循以下原则:
- 首先检查异常类型。最需要处理的瞬态异常大多为 类型的。过滤出那些不需要作为瞬态错误的异常,可以根据 属性进行判断。不要尝试使用比较Exception的文字,不同.NET版本的Data Provider会有较大区别。
- 验证这个错误代码是否属于常见的瞬态错误。主要的错误代码已经在上面列举出来了。另外,可以查看。
根据这些规则我们可以很容易的将它打包到一个可重用的框架中来处理连接丢失和SQL查询执行失败等瞬态错误。
瞬态错误处理框架
这个框架考虑到了处理所有可能的瞬态错误的需求,在内部实现了一个“Retry Policy”来确保只处理需要的错误。在客户进入重试状态前会使用策略验证这个异常是否属于瞬态错误。
- 提供了一个可扩展的Retry逻辑处理瞬态错误,不仅限于SQL Azure。
- 支持一系列的重试方案(固定周期,渐进周期,随机指数退避)
- 支持SQL 连接和SQL命令使用不同的Retry策略。
- 为 和对象提供了扩展方法来实现Retry操作
- 支持Retry后的回调,通知用户代码是否发生了Retry情况
- 支持快速重试模式,当第一次发生进行Retry时会立即尝试而没有延迟
- 允许在应用程序配置文件中定义Retry策略
- 支持同步和异步请求
下一节我们将介绍实现细节,并帮助开发者理解何时以及如何使用上面所述的功能。可以在 下载代码。
技术实现
下面的类图高亮部分描述了框架的核心组件和依赖关系:
框架中关键组件为RetryPolicy<T> 和ReliableSqlConnection 类以及ITransientErrorDetectionStrategy 接口。
RetryPolicy<T> 类与他的抽象类RetryPolicy 封装了所有基本逻辑,用于迭代执行可能出现瞬态错误的开发者定义的Action。
ReliableSqlConnection类实现了类似 的功能,并且提供了一系列的方法来保证建立到SQL Azure数据库高可用的连接以及提供高可用SQL命令执行。
ITransientErrorDetectionStrategy接口用于判断哪些异常需要作为瞬态异常进行处理。NET异常会在策略类中实用检测。框架中自带SQL Azure瞬态错误检测策略的同时,也提供了 AppFabric Service Bus, AppFabric Message Buffers 和Windows Azure storage的瞬态错误检测策略。
// Defines an interface which must be implemented by custom components responsible for detecting specific transient conditions.public interface ITransientErrorDetectionStrategy{ ////// Determines whether the specified exception represents a transient failure that can be compensated by a retry. /// /// The exception object to be verified. ///True if the specified exception is considered as transient, otherwise false. bool IsTransient(Exception ex);}
另外,在类库中为.NET开发者提供了一系列的扩展方法,用于打开连接,执行SQL语句是使用Retry策略。当你无法将连接改为ReliableSqlConnection类时这些扩展方法将非常有用。例如,一个开发者可能使用了一些现有的数据访问层框架(如:),它返回的是SqlConnection类实例,这种情况下,使用扩展方法可以帮助你实现Retry功能而又不需要修改现有的代码。
使用模式
下面举例说明上述讨论中处理SQL Azure连接瞬态错误时,需要用到的一些通用的使用模式。
配置Retry策略
框架中有两种主要的设置Retry策略方法。
- 在创建RetryPolicy<T>类实例时,指定实现了ITransientErrorDetectionStrategy接口的瞬态错误检测策略类。
- 在应用程序配置文件中设置Retry策略,使用框架中提供的API实例返回需要的Retry策略实例类。
RetryPolicy<T>类可以根据特定需求创建出不同的策略。根据实例化对象时指定的泛型参数就可以创建出我们需要的RetryPolicy对象。
public class RetryPolicy: RetryPolicy where T : ITransientErrorDetectionStrategy, new(){ /// /// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and default /// fixed time interval between retries. /// /// The number of retry attempts. public RetryPolicy(int retryCount) : this(retryCount, DefaultRetryInterval) { /* ... */ } ////// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and time /// interval between retries. /// /// The number of retry attempts. /// The interval between retries. public RetryPolicy(int retryCount, TimeSpan intervalBetweenRetries) { /* ... */ } ////// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and backoff /// parameters for calculating the exponential delay between retries. /// /// The number of retry attempts. /// The minimum backoff time. /// The maximum backoff time. /// The delta value in the exponential delay between retries. public RetryPolicy(int retryCount, TimeSpan minBackoff, TimeSpan maxBackoff, TimeSpan deltaBackoff) { /* ... */ } ////// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and /// parameters defining the progressive delay between retries. /// /// The number of retry attempts. /// The initial interval which will apply for the first retry. /// The incremental time value for calculating progressive delay between retries. public RetryPolicy(int retryCount, TimeSpan initialInterval, TimeSpan increment) { /* ... */ }}
我们也可以在应用程序配置文件中定义Retry策略。
一旦我们配置了Retry策略,我们只需要以下三行简单的代码就能创建出一个实例:
// Retrieve the retry policy settings from the application configuration file.RetryPolicyConfigurationSettings retryPolicySettings = ApplicationConfiguration.Current.GetConfigurationSection(RetryPolicyConfigurationSettings.SectionName);// Retrieve the required retry policy definition by its friendly name.RetryPolicyInfo retryPolicyInfo = retryPolicySettings.Policies.Get("FixedIntervalDefault");// Create an instance of the respective retry policy using the transient error detection strategy for SQL Azure.RetryPolicy sqlAzureRetryPolicy = retryPolicyInfo.CreatePolicy ();
下面的两届中RetryPolicy实例,将展示出它在执行用户代码时识别处理瞬态错误的才能。
高可靠的打开SQL Azure数据库连接
为了高可靠地建立到SQL Azure的数据库连接,可采取以下方法之一:
- 使用ReliableSqlConnection类的Open方法。
- SqlConnection实例中使用OpenWithRetry扩展方法。
下面是一些上述方法的实例:
using (ReliableSqlConnection conn = new ReliableSqlConnection(connString)){ // Attempt to open a connection using the specified retry policy. conn.Open(sqlAzureRetryPolicy); // ... execute SQL queries against this connection ...}using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, sqlAzureRetryPolicy)){ // Attempt to open a connection using the retry policy specified at construction time. conn.Open(); // ... execute SQL queries against this connection ...}using (SqlConnection conn = new SqlConnection(connString)){ // Attempt to open a connection using the specified retry policy. // The extension method is used in this context since we are dealing with a SqlConnection instance. conn.OpenWithRetry(sqlAzureRetryPolicy); // ... execute SQL queries against this connection ...}
注意,这两中方法执行的最终结果是一些样的。跟SqlConnection类相比,ReliableSqlConnection类提供一些额外能力,如恢复当前会话的CONTEXT_INFO值用以跟踪目的和在执行SQL语句中使用ExecuteCommand<T>方法。
可靠的执行SQL查询
在SQL Azure执行查询过程中,处理数据库连接中断问题同样非常重要,如查询限制。发生这种情况时,再次进行查询就非常必要。但是注意并不是所有的查询能被安全的重新执行,可能引起数据的不一致性发生。例如,当我们在没有外部事务保持原子性的情况Update多个表。
为了安全的执行Retry操作,可采取以下方法之一:
- 使用ReliableSqlConnection类中的ExecuteCommand或ExecuteCommand<T>方法。一个错误的SQL命令将依照指定的策略自动执行。如果在尝试再次执行是需要重新连接的时候,他也会自动建立数据库连接。
- 恰当的使用为SqlCommand类实现的扩展方法,如ExecuteNonQueryWithRetry,ExecuteReaderWithRetry等。
下面是一些上述方法的实例:
using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, sqlAzureRetryPolicy)){ conn.Open(); SqlCommand selectCommand = new SqlCommand("select name, object_id from sys.objects where name = 'Application'", conn.Current); // Execute the above query using a retry-aware ExecuteCommand method which will // automatically retry if the query has failed (or connection was dropped) using (IDataReader dataReader = conn.ExecuteCommand(selectCommand)) { if (dataReader.Read()) { string objectName = dataReader.GetString(dataReader.GetOrdinal("name")); tableObjectID = dataReader.GetInt32(dataReader.GetOrdinal("object_id")); } }}using (ReliableSqlConnection conn = new ReliableSqlConnection(connString)){ conn.Open(sqlAzureRetryPolicy); IDbCommand selectCommand = conn.CreateCommand(); selectCommand.CommandText = "UPDATE Application SET [DateUpdated] = getdate()"; // Execute the above query using a retry-aware ExecuteCommand method which will // automatically retry if the query has failed (or connection was dropped) int recordsAffected = conn.ExecuteCommand(selectCommand, sqlAzureRetryPolicy);}using (SqlConnection conn = new SqlConnection(connString)){ conn.Open(); SqlCommand selectCommand = conn.CreateCommand(); selectCommand.CommandText = "select * from sys.objects where name = 'Application'"; int tableObjectID = Int32.MinValue; // Execute the above query using a retry-aware ExecuteReaderWithRetry method which will // automatically retry if the query has failed (or connection was dropped) using (IDataReader dataReader = selectCommand.ExecuteReaderWithRetry(sqlAzureRetryPolicy)) { if (dataReader.Read()) { string objectName = dataReader.GetString(dataReader.GetOrdinal("name")); tableObjectID = dataReader.GetInt32(dataReader.GetOrdinal("object_id")); } } selectCommand = conn.CreateCommand(); selectCommand.CommandText = "select object_id from sys.objects where name = 'Application'"; // Execute the above query using a retry-aware ExecuteScalarWithRetry method which will // automatically retry if the query has failed (or connection was dropped) object objectID = selectCommand.ExecuteScalarWithRetry(sqlAzureRetryPolicy); selectCommand = conn.CreateCommand(); selectCommand.CommandText = "UPDATE Application SET [DateUpdated] = getdate()"; // Execute the above query using a retry-aware ExecuteNonQueryWithRetry method which // will automatically retry if the query has failed (or connection was dropped) int recordsAffected = selectCommand.ExecuteNonQueryWithRetry(sqlAzureRetryPolicy);}
目前为止我们已经介绍了基本的ADO.NET例子。下面章节中将关注更多的高级场景,为SQL Azure客户端程序提供Retry功能,而无论这些应用程序如何访问数据。
高级模式
当前应用程序开发过程中通常我们不直接通过ADO.NET API 来访问数据。在过去的这些年中已经开发了相当多的高级数据库访问层框架: ,,,。这些技术打算大幅度的缩减数据管理的复杂度,简化数据的建模,查询。
当使用上述技术访问SQL Azure关系数据库平台时,瞬态错误处理也就成为一个需求。由于以上技术数据库访问层被高度分离,近似于增加了我们处理瞬态错误的难度,并且和我们以上讨论的方式有所不同。
幸运的是,得益于Retry策略框架的实现方式,我们可以很容易的包装任意用户代码实现重试的功能。遇到瞬态错误时,整个代码段将被重新执行。这种功能由ExecuteAction和ExecuteAction<T>方法实现:
sqlAzureRetryPolicy.ExecuteAction(() =>{ // Invoke a LINQ2SQL query.});return sqlAzureRetryPolicy.ExecuteAction>(() =>{ // Invoke a LINQ query against the Entity Framework model. return result;});
注意,需要重试的代码段本身应该作为一个原子操作单元。代码段尅被多次调用,并且保证数据的一致性。并且,代码中不应该吞没那些用以判断瞬态错误的异常。
一下例子来自并且在适当的改进了Retry逻辑。这样增强了这段客户代码的可靠性,能够抵抗潜在的数据库连接失败和查询限制情况的发生。
// Define the order ID for the order we want.int orderId = 43680;// Create an EntityConnection.EntityConnection conn = new EntityConnection("name=AdventureWorksEntities");// Create a long-running context with the connection.AdventureWorksEntities context = new AdventureWorksEntities(conn);try{ // Explicitly open the connection inside a retry-aware scope. sqlAzureRetryPolicy.ExecuteAction(() => { if (conn.State != ConnectionState.Open) { conn.Open(); } }); // Execute a query to return an order. Use a retry-aware scope for reliability. SalesOrderHeader order = sqlAzureRetryPolicy.ExecuteAction(() => { return context.SalesOrderHeaders.Where("it.SalesOrderID = @orderId", new ObjectParameter("orderId", orderId)).Execute(MergeOption.AppendOnly).First(); }); // Change the status of the order. order.Status = 1; // Delete the first item in the order. context.DeleteObject(order.SalesOrderDetails.First()); // Save changes inside a retry-aware scope. sqlAzureRetryPolicy.ExecuteAction(() => { context.SaveChanges(); }); SalesOrderDetail detail = new SalesOrderDetail { SalesOrderID = 1, SalesOrderDetailID = 0, OrderQty = 2, ProductID = 750, SpecialOfferID = 1, UnitPrice = (decimal)2171.2942, UnitPriceDiscount = 0, LineTotal = 0, rowguid = Guid.NewGuid(), ModifiedDate = DateTime.Now }; order.SalesOrderDetails.Add(detail); // Save changes again inside a retry-aware scope. sqlAzureRetryPolicy.ExecuteAction(() => { context.SaveChanges(); });}finally{ // Explicitly dispose of the context and the connection. context.Dispose(); conn.Dispose();}
总之,瞬态错误处理框架高度的灵活性使它可以应用在不公的场景中。无论是一段简单的SQL语句还是一个大的工作单元。在所有情况下,瞬态错误处理框架将保持一致的工作。