Initial support for auto-scale in Azure Functions Elastic Premium plans (#15)

Includes new app scaling documentation
microsoft · Apr 8, 2021 · f4219c7
1 parent 72563da
commit f4219c7
Showing 33 changed files with 1,087 additions and 73 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,15 +5,17 @@
 ### New
 
 * Added `dt.GetScaleMetric` SQL function for use with the [MSSQL KEDA Scaler](https://keda.sh/docs/scalers/mssql/).
+* Added `dt.GetScaleRecommendation` SQL function and `IScaleProvider` implementation for VNET scaling in Azure Functions.
 * Added versioning support for task activities ([#14](https://github.com/microsoft/durabletask-mssql/pull/14)) - contributed by [@usemam](https://github.com/usemam)
 
 ### Updates
 
-* Switched default task hub mode back to multitenant, since it simplifies certain test setups
+* Switched default task hub mode back to multitenant to simplify testing
+* Updated [Microsoft.Azure.WebJobs.Extensions.DurableTask](https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.DurableTask) dependency to [v2.4.2](https://github.com/Azure/azure-functions-durable-extension/releases/tag/v2.4.2).
 
 ### Breaking changes
 
-* None
+* Changed `SqlDurabilityProviderFactory` and `SqlDurabilityOptions` classes from `public` to `internal`.
 
 ## v0.6.0-alpha
 

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -74,6 +74,8 @@ Multiple DTFx workers can be configured to use the same SQL database connection
 
 The provider works by having a single worker take a lock on a particular orchestration instance (or entity) and then process all events for that orchestration instance. When it is done executing a particular step in the orchestration, the lock is released and other workers will have an opportunity to lock the instance if there are more events that need to be processed. Similarly, activities are distributed across all worker instances in a competing-consumer way. However, activity execution does not require taking a lock on an orchestration instance, allowing multiple workers to process activities concurrently.
 
+For more detailed information on scalability, see the [Scaling](scaling.md) topic.
+
 ### Polling
 
 The Durable SQL provider regularly polls the `dt.NewEvents` and `dt.NewTasks` tables for new events and tasks. Initially, there is a 0 to 50ms delay in between polling attempts. If no events are found, the SQL provider will slowly increase the amount of time in between polling intervals up to a maximum of **3 seconds**. This means that a mostly idle app could see up to 3 seconds delay between the time an execution is scheduled and when it is detected and executed.

diff --git a/docs/scaling.md b/docs/scaling.md
@@ -0,0 +1,112 @@
+# Scaling
+
+The Microsoft SQL Provider for the Durable Task Framework (DTFx) and Durable Functions is designed to run in elastic compute environments where nodes can be added or removed on-demand without introducing downtime. This article describes how scaling works and various options for configuring auto-scale.
+
+## Terminology
+
+Throughout this article, we'll use the term _worker_ to refer to a single replica of the DTFx backend. If you are building an app using DTFx directly, then _worker_ refers to an instance of the `TaskHubWorker` class. If you are building an app on the Azure Functions hosted service, then a _worker_ refers to a single instance of a function app. In the context of Kubernetes, a _worker_ typically corresponds to a deployment replica.
+
+## Load balancing
+
+The Durable SQL provider distributes orchestration and activity executions evenly across all workers that are configured for a particular [task hub](taskhubs.md). Each worker independently polls the database for work and will take on as much work as allowed by its [concurrency configuration settings](#concurrency-configuration) using a [competing consumer](https://docs.microsoft.com/azure/architecture/patterns/competing-consumers) load distribution strategy.
+
+![Scale-out](media/arch-diagram.png)
+
+Each worker replica is identical and capable of running _any_ orchestrator or activity task that it can fetch from the database. Assigning specific orchestrations or activities to specific workers is not supported. There's no hard limit to the number of workers that can be added to a task hub. The maximum number of workers is limited only by the amount of concurrent load that the SQL database can handle. If any worker fails or becomes unavailable, work will be automatically redistributed across the existing set of active workers within a few minutes.
+
+?> If you're familiar with the Azure Storage backend for DTFx and Durable Functions, one key difference with the SQL provider is that orchestration executions can theoretically scale out to any number of workers. There is no concept of partitions or leases.
+
+## Concurrency configuration
+
+Each task hub worker can execute multiple orchestration events and activity tasks concurrently. The actual number of events or tasks that execute concurrently is configurable and is one of the key factors that impacts scalability. For in-process .NET apps, you can specify concurrency settings in the `SqlOrchestrationServiceSettings` class. The following example code configures both the maximum number of concurrent activity tasks and orchestrator events to be the number of cores on the VM.
+
+```csharp
+var settings = new SqlOrchestrationServiceSettings
+{
+    MaxConcurrentActivities = Environment.ProcessorCount,
+    MaxActiveOrchestrations = Environment.ProcessorCount,
+};
+
+var service = new SqlOrchestrationService(settings);
+var worker = new TaskHubWorker(service);
+```
+
+When using Azure Functions, these values are inferred from the existing `maxConcurrentOrchestratorFunctions` and `maxConcurrentActivityFunctions` settings in the [host.json file](https://docs.microsoft.com/azure/azure-functions/durable/durable-functions-bindings#host-json), as shown in the following example:
+
+```json
+{
+  "version": "2.0",
+  "extensions": {
+    "durableTask": {
+      "maxConcurrentOrchestratorFunctions": 8,
+      "maxConcurrentActivityFunctions": 8,
+      "storageProvider": {
+        "type": "MicrosoftSQL",
+        "connectionStringName": "SQLDB_Connection"
+      }
+    }
+  }
+}
+```
+
+The values you select will vary depending on your expected workload. For example, if your activities are CPU-intensive or consume lots of memory, then you'll likely want to configure a smaller value for activity concurrency. Similarly, if your orchestrations have large history payloads (because of large inputs, outputs, etc.) then you should consider smaller orchestration concurrency configuration values. Choosing this configuration carefully is important to ensure your app has the right balance of performance and reliability.
+
+?> Future versions of the Durable SQL provider may support automatic concurrency configuration based on available CPU, memory, and other metrics. However, until this support is available, it is recommended that you use performance and scale testing to determine the right concurrency configuration values for your expected workload.
+
+## Worker auto-scale
+
+The Durable SQL provider makes worker scale-out and scale-in recommendations based on the number of active and pending orchestration and activity tasks at any given time. The recommended number of workers is determined by dividing the current task backlog by the configured maximum per-worker concurrency settings. The basic formula looks like the following pseudocode:
+
+```pseudocode
+live_activities = rowcount(dt.NewTasks)
+live_orchestrators = rowcount(dt.Instances WHERE #events > 0)
+recommended_activity_workers = ceil(live_activities / max_concurrent_activities)
+recommended_orchestrator_workers = ceil(live_orchestrators / max_concurrent_orchestrators)
+recommended_worker_count = recommended_activity_workers + recommended_orchestrator_workers
+```
+
+Here are the English definitions of the variables mentioned in this algorithm:
+
+| Variable | Description |
+|-|-|
+| *live_activities* | The number of rows in the `dt.NewTasks` table. This represents both activity tasks being actively processed and those waiting to be processed. |
+| *max_concurrent_activities* | The maximum number of activities that can run concurrently on a single worker. This number is [configurable](#concurrency-configuration). |
+| *recommended_activity_workers* | The number of worker replicas needed to handle all active and pending activities (i.e. `live_activities`). |
+| *live_orchestrators* | The number of orchestration instances that are either active in memory or have events pending in the `dt.NewEvents` table. This does not include timer events scheduled in the future. |
+| *max_concurrent_orchestrators* | The maximum number of orchestrations that can run concurrently (i.e. active in memory, not idle) on a single worker. This number is [configurable](#concurrency-configuration). |
+| *recommended_orchestrator_workers* | The number of worker replicas needed to handle all active and pending orchestrator events (i.e. `live_orchestrators`). Each orchestrator must run on a single worker at a time so the actual number of events per orchestrator does not matter. |
+| *recommended_worker_count* | The total number of workers needed to handle all activity tasks and orchestrator events. |
+
+This value can be calculated automatically using either the `dt.GetScaleRecommendation` SQL function, which takes concurrency settings as parameters, or the `SqlOrchestrationService.GetScaleRecommendation` .NET API, which discovers the concurrency settings from configuration. The final number can then be given to an auto-scale compute component to change the number of allocated worker replicas.
+
+If you're using the Durable SQL provider with [Azure Durable Functions](https://docs.microsoft.com/azure/azure-functions/durable) running on the [Elastic Premium Plan](https://docs.microsoft.com/azure/azure-functions/functions-premium-plan), then auto-scaling the number of app instances is managed automatically if you enable runtime scale monitoring as described [here](https://docs.microsoft.com/azure/azure-functions/functions-networking-options#premium-plan-with-virtual-network-triggers). Note that this doesn't require you to configure any virtual networking features.
+
+!> The Azure Functions Consumption plan does not yet support Durable Functions apps configured with the Durable SQL provider.
+
+If you are running your app in Kubernetes and have [KEDA](https://keda.sh) installed in your cluster, you can use the [MSSQL](https://keda.sh/docs/scalers/mssql/) scaler to automatically scale your app deployment instances. The following is an example `ScaledObject` configuration that can be used.
+
+```yml
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: mssql-scaledobject
+spec:
+  scaleTargetRef:
+    name: durabletask-mssql-app
+  triggers:
+  - type: mssql
+    metadata:
+      connectionStringFromEnv: SQLDB_Connection
+      targetValue: "1"
+      query: "SELECT dt.GetScaleRecommendation(8, 8)"
+```
+
+?> Note that the two parameters for the `dt.GetScaleRecommendation` SQL function are values for `@MaxOrchestrationsPerWorker` and `@MaxActivitiesPerWorker` respectively.
+
+The `targetValue` should always be `"1"` when using the `dt.GetScaleRecommendation` SQL function in the `query` property. This ensures there is a 1:1 mapping between workers and deployment replicas.
+
+!> Make sure that the database credentials used by the `ScaledObject` are the same as those used by the app. Otherwise, the `dt.GetScaleRecommendation` function might return incorrect recommendations. See the [Multitenancy](multitenancy.md) topic for more information about how database credentials are mapped to task hubs.
+
+## SQL database scale-out
+
+The current version of the Durable SQL provider supports connecting to a single database instance. In many cases, the database will be the primary performance bottleneck. The recommended way to scale up the database compute capacity is to increase the number of cores allocated to the SQL Server instance. Instructions for scaling up a SQL Server instance are out of scope for this article. However, if you are using [Azure SQL Database](https://docs.microsoft.com/azure/azure-sql/database/sql-database-paas-overview), you have the option of using the [Serverless tier](https://docs.microsoft.com/azure/azure-sql/database/serverless-tier-overview), which auto-scales the database based on CPU usage.
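
As an illustration of the worker-count formula in the "Worker auto-scale" section of `docs/scaling.md` above, the pseudocode could be expressed in C# roughly as follows. This is only a sketch: the `ScaleMath` class and its method are hypothetical names that are not part of the provider, which exposes the calculation through `dt.GetScaleRecommendation` and `SqlOrchestrationService.GetScaleRecommendation`.

```csharp
using System;

// Hypothetical helper for illustration only; not part of DurableTask.SqlServer.
public static class ScaleMath
{
    // Mirrors the pseudocode: each backlog is divided by its per-worker
    // concurrency limit, rounded up, and the two results are summed.
    public static int RecommendWorkerCount(
        int liveActivities,
        int liveOrchestrators,
        int maxConcurrentActivities,
        int maxConcurrentOrchestrators)
    {
        if (maxConcurrentActivities <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentActivities));
        }

        if (maxConcurrentOrchestrators <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentOrchestrators));
        }

        int activityWorkers = CeilingDivide(liveActivities, maxConcurrentActivities);
        int orchestratorWorkers = CeilingDivide(liveOrchestrators, maxConcurrentOrchestrators);
        return activityWorkers + orchestratorWorkers;
    }

    // Integer ceiling division; assumes a non-negative dividend.
    static int CeilingDivide(int dividend, int divisor) =>
        (dividend + divisor - 1) / divisor;
}
```

For example, with 20 live activities, 5 live orchestrators, and both limits set to 8, `ScaleMath.RecommendWorkerCount(20, 5, 8, 8)` returns 4: three workers for activities plus one for orchestrators. An empty backlog yields a recommendation of 0.
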
diff --git a/docs/sidebar.md b/docs/sidebar.md
@@ -1,5 +1,6 @@
 * [Introduction](introduction.md "Durable Task SQL Provider")
 * [Getting started](quickstart.md)
 * [Architecture](architecture.md)
+* [Scaling](scaling.md)
 * [Task Hubs](taskhubs.md)
 * [Multitenancy](multitenancy.md)
diff --git a/docs/taskhubs.md b/docs/taskhubs.md
@@ -12,7 +12,9 @@ Task hubs are also the primary unit of isolation within a database. Each table i
 
 ## Configuring task hub names
 
-Tasks hubs can be configured explicitly in the SQL provider configuration or can be inferred by details of the SQL connection string. For self-hosted DTFx apps, you can configure the task hub directly in the `SqlProviderOptions` class.
+Task hubs can be configured explicitly in the SQL provider configuration or can be inferred by details of the SQL connection string. By default, the name of a task hub is the name of the database user. For more information, see the [Multitenancy](multitenancy.md) topic.
+
+For self-hosted DTFx apps that opt out of multitenant mode, you can configure the task hub directly in the `SqlProviderOptions` class.
 
 ```csharp
 var options = new SqlProviderOptions
@@ -22,7 +24,7 @@ var options = new SqlProviderOptions
 };
 ```
 
-For Durable Functions apps, the task hub name can be configured in the `extensions/durableTask/hubName` property of the **host.json** file.
+For Durable Functions apps, explicit task hub names are configured in the `extensions/durableTask/hubName` property of the **host.json** file.
 
 ```json
 {
@@ -39,9 +41,7 @@ For Durable Functions apps, the task hub name can be configured in the `extensio
 }
 ```
 
-Task hub names can alternatively be inferred from database user credentials. For more information, see [Multitenancy](multitenancy.md).
-
-?> Task hub names are limited to 50 characters. If the specified task hub name exceeds 50 characters, the configured task hub name will be truncated and suffixed with an MD5 hash of the full task hub name to keep it within 50 characters.
+?> Task hub names are limited to 50 characters. If the specified task hub name exceeds 50 characters, it will be truncated and suffixed with an MD5 hash of the full task hub name to keep it within 50 characters. This behavior applies both to task hubs inferred from database usernames and explicitly configured task hub names.
 
 ## Case sensitivity
 

diff --git a/src/DurableTask.SqlServer.AzureFunctions/AssemblyInfo.cs b/src/DurableTask.SqlServer.AzureFunctions/AssemblyInfo.cs
@@ -0,0 +1,6 @@
+// Copyright (c) .NET Foundation. All rights reserved.
+// Licensed under the MIT License. See LICENSE in the project root for license information.
+
+using System.Runtime.CompilerServices;
+
+[assembly: InternalsVisibleTo("DurableTask.SqlServer.AzureFunctions.Tests, PublicKey=0024000004800000940000000602000000240000525341310004000001000100fd8328dce03cd2e3033a411da400c391864fb4896f1265b2e46914ae677f9268e57ce00fe5ab144bf1746670c16798821c1e821dc3bc0ebce8374c20de809e7ae1b613b71a0a2a5680782e0458cec6c520bc77a90b2c5b00425da400b611d110a43219a9db52e89ce52705e8d11e68ca536f9d5dbe1de8c054d4f70161984de3")]
diff --git a/src/DurableTask.SqlServer.AzureFunctions/DurableTask.SqlServer.AzureFunctions.csproj b/src/DurableTask.SqlServer.AzureFunctions/DurableTask.SqlServer.AzureFunctions.csproj
@@ -17,7 +17,7 @@
 
   <ItemGroup>
     <PackageReference Include="Microsoft.Azure.Functions.Extensions" Version="1.0.0" />
-    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.4.1" />
+    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.4.2" />
   </ItemGroup>
 
   <ItemGroup>

diff --git a/src/DurableTask.SqlServer.AzureFunctions/SqlDurabilityOptions.cs b/src/DurableTask.SqlServer.AzureFunctions/SqlDurabilityOptions.cs
@@ -10,7 +10,7 @@ namespace DurableTask.SqlServer.AzureFunctions
     using Microsoft.Extensions.Logging.Abstractions;
     using Newtonsoft.Json;
 
-    public class SqlDurabilityOptions
+    class SqlDurabilityOptions
     {
         [JsonProperty("connectionStringName")]
         public string ConnectionStringName { get; set; } = "SQLDB_Connection";
@@ -27,6 +27,7 @@ public class SqlDurabilityOptions
         internal ILoggerFactory LoggerFactory { get; set; } = NullLoggerFactory.Instance;
 
         internal SqlOrchestrationServiceSettings GetOrchestrationServiceSettings(
+            DurableTaskOptions extensionOptions,
             IConnectionStringResolver connectionStringResolver)
         {
             if (connectionStringResolver == null)
@@ -58,6 +59,16 @@ internal SqlOrchestrationServiceSettings GetOrchestrationServiceSettings(
                 WorkItemBatchSize = this.TaskEventBatchSize,
             };
 
+            if (extensionOptions.MaxConcurrentActivityFunctions.HasValue)
+            {
+                settings.MaxConcurrentActivities = extensionOptions.MaxConcurrentActivityFunctions.Value;
+            }
+
+            if (extensionOptions.MaxConcurrentOrchestratorFunctions.HasValue)
+            {
+                settings.MaxActiveOrchestrations = extensionOptions.MaxConcurrentOrchestratorFunctions.Value;
+            }
+
             return settings;
         }
     }

diff --git a/src/DurableTask.SqlServer.AzureFunctions/SqlDurabilityProvider.cs b/src/DurableTask.SqlServer.AzureFunctions/SqlDurabilityProvider.cs
@@ -9,34 +9,39 @@ namespace DurableTask.SqlServer.AzureFunctions
     using System.Threading.Tasks;
     using DurableTask.Core;
     using Microsoft.Azure.WebJobs.Extensions.DurableTask;
+    using Microsoft.Azure.WebJobs.Host.Scale;
     using Newtonsoft.Json;
     using Newtonsoft.Json.Linq;
 
     class SqlDurabilityProvider : DurabilityProvider
     {
-        readonly SqlDurabilityOptions options;
+        public const string Name = "mssql";
+
+        readonly SqlDurabilityOptions durabilityOptions;
         readonly SqlOrchestrationService service;
 
+        SqlScaleMonitor? scaleMonitor;
+
         public SqlDurabilityProvider(
             SqlOrchestrationService service,
-            SqlDurabilityOptions options)
-            : base("SQL Server", service, service, options.ConnectionStringName)
+            SqlDurabilityOptions durabilityOptions)
+            : base(Name, service, service, durabilityOptions.ConnectionStringName)
         {
-            this.options = options;
-            this.service = service;
+            this.service = service ?? throw new ArgumentNullException(nameof(service));
+            this.durabilityOptions = durabilityOptions;
         }
 
         public SqlDurabilityProvider(
             SqlOrchestrationService service,
-            SqlDurabilityOptions options,
+            SqlDurabilityOptions durabilityOptions,
             IOrchestrationServiceClient client)
-            : base("SQL Server", service, client, options.ConnectionStringName)
+            : base(Name, service, client, durabilityOptions.ConnectionStringName)
         {
-            this.options = options;
-            this.service = service;
+            this.service = service ?? throw new ArgumentNullException(nameof(service));
+            this.durabilityOptions = durabilityOptions;
         }
 
-        public override JObject ConfigurationJson => JObject.FromObject(this.options);
+        public override JObject ConfigurationJson => JObject.FromObject(this.durabilityOptions);
 
         public override async Task<IList<OrchestrationState>> GetOrchestrationStateWithInputsAsync(string instanceId, bool showInput = true)
         {
@@ -96,5 +101,16 @@ public override async Task<IList<OrchestrationState>> GetOrchestrationStateWithI
 
             return value.ToString();
         }
+
+        public override bool TryGetScaleMonitor(
+            string functionId,
+            string functionName,
+            string hubName,
+            string storageConnectionString,
+            out IScaleMonitor scaleMonitor)
+        {
+            scaleMonitor = this.scaleMonitor ??= new SqlScaleMonitor(this.service, hubName);
+            return true;
+        }
     }
 }
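
The `SqlScaleMonitor` class returned by `TryGetScaleMonitor` is not shown in this portion of the diff. As a rough illustration of how a scale monitor could turn a worker-count recommendation into a scale decision, the sketch below compares the recommended count with the current count. The `Vote` enum and `ScaleVoting` class are hypothetical stand-ins declared locally so the example compiles without the WebJobs packages. They are not the actual implementation.

```csharp
// Hypothetical stand-in for a scale vote type, declared locally for illustration.
public enum Vote
{
    None,
    ScaleOut,
    ScaleIn,
}

// Hypothetical helper; assumes the recommendation comes from something like
// dt.GetScaleRecommendation or SqlOrchestrationService.GetScaleRecommendation.
public static class ScaleVoting
{
    public static Vote GetVote(int currentWorkerCount, int recommendedWorkerCount)
    {
        if (recommendedWorkerCount > currentWorkerCount)
        {
            // More workers are needed to drain the backlog.
            return Vote.ScaleOut;
        }

        if (recommendedWorkerCount < currentWorkerCount)
        {
            // Fewer workers can handle the current load.
            return Vote.ScaleIn;
        }

        // The current allocation already matches the recommendation.
        return Vote.None;
    }
}
```

Under these assumptions, an app running 2 workers with a recommendation of 4 would vote to scale out, and an app whose worker count already matches the recommendation would vote for no change.
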