High Availability and Disaster Recovery in Serverless Solutions: Does It Matter?

When I talk to customers about Serverless architectures, the topic of High Availability and Disaster Recovery often comes up. The usual perception is that in a Serverless world we don’t need to care about these because, well, we are Serverless!

TL;DR: In this post, we discuss design and planning considerations for building robust serverless solutions. The goal is to focus on “what” you should plan for when it comes to serverless and to provide an overview of “how” to design around these considerations. We cover an array of topics, including Ingestion Endpoint(s), Storage, Security and, of course, High Availability and Disaster Recovery. We also discuss governance considerations such as Compliance, Developer Experience and the Release pipeline in the serverless world.

Note that while this post refers to Azure services, the concepts can be applied to any cloud provider and its offerings.

Let’s get started!

Characteristics of a robust solution

A robust solution can mean many things. To scope and level-set our discussion, we focus on the following characteristics of a solution:

  • Reliability: The system should continue to work correctly even in the face of faults and failures.
  • Scalability: The system should be able to grow while maintaining the same level of performance, if not better.
  • Maintainability: The system should be organized so that teams can work on it productively and modify it in the future.

Martin Kleppmann describes these characteristics in great detail in his book Designing Data-Intensive Applications, a highly recommended read if you have anything to do with building scalable, high-quality software.

The Planning Sprint

The first question that comes up when thinking about Serverless is: do I need to plan for Serverless? The answer: if you are building a production-quality solution, then absolutely yes.

The planning considerations, however, differ from what you would do in a typical data-center-oriented architecture. The focus is not on how many servers you need to scale or how you will handle replication, but rather on what the thresholds of the provider service are and how much reliability it can provide. In my opinion, this is the right definition of being serverless: your focus has shifted to an abstraction of the underlying infrastructure, and you worry more about the service capabilities and thresholds than about the underlying hardware or virtualization.

Serverless Planning = Plan for service capabilities and thresholds and not for the infrastructure that runs the service.

I highly recommend that you run a planning sprint to determine your requirements and how they map onto the provider service constraints. First, a planning sprint (or Sprint 0) gives you an opportunity to decide whether Serverless makes sense for your workloads (the discussion of Serverless vs. Containers should happen here). Second, it allows you to analyze the capabilities of each service to determine whether you are choosing the right service for the job. Finally, it addresses concerns about geographical reach, compliance, data sovereignty and the future scale of the solution.

The what of Serverless

Below are the areas to focus on during the planning sprint. Asking these questions helps surface the things to consider when building serverless solutions:

These are guidelines and you may have more categories and questions based on unique requirements.

Ingestion Endpoint(s):

Understand how public and internal endpoints will handle requests.

  • Typical size of a message (average, max)?: Knowing the average and the maximum helps you understand two aspects of scale: how many messages you should expect over a sustained period, and the largest message you need to accommodate during processing. This can affect which service you choose for consuming and processing these messages. For example, Azure IoT Hub today supports message sizes up to 256 KB, so if you want to send larger messages you will have to employ other techniques such as file upload or Azure Storage Blobs. Understanding the size also helps you decide whether to split messages before sending them, or whether to trim the message itself so that unnecessary bits are not sent over the wire, improving bandwidth usage and processing times.

To learn more about service throttles and limits for all Azure services, refer here.
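As a rough illustration of the size planning above, here is a minimal sketch that splits a payload into chunks that fit under a service limit. The 256 KB figure is the IoT Hub limit mentioned earlier; the function name `split_payload` is hypothetical and not part of any Azure SDK. A real implementation would also have to leave room for headers and add a sequence number so that the receiver can reassemble the chunks.

```python
from typing import List

MAX_MESSAGE_BYTES = 256 * 1024  # IoT Hub message size limit cited in this post


def split_payload(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> List[bytes]:
    """Split a payload into consecutive chunks of at most `limit` bytes."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(payload) <= limit:
        # Small payloads (including empty ones) are sent as a single message.
        return [payload]
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]


if __name__ == "__main__":
    chunks = split_payload(b"x" * (600 * 1024))
    print(len(chunks), [len(c) for c in chunks])
```

Whether splitting is the right choice, as opposed to trimming the message or uploading it as a file, still depends on the workload.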

  • Who will send the messages?: Understand the clients of your services or application:
    • Plan for each type of client separately when required: Planning for a passive client like a browser requires fewer scale considerations than planning for a device that can send continuous streams of data and quickly hit the throttle of a provider service.
    • Is a direct connection (socket) possible?: This determines how many active connections you need to configure on your public endpoint and whether the provider services will be able to handle them. It also gives you an opportunity to tune your services for optimum scale. For example, you can use the following host.json configuration in Azure Functions to tune the incoming request pipeline and the number of concurrent requests:
{
  "http": {
    "routePrefix": "api",
    "maxOutstandingRequests": 20,
    "maxConcurrentRequests": 10,
    "dynamicThrottlesEnabled": false
  }
}
    • Is there an intermediary between the client and the endpoint?: This could be a gateway server on-premises, an ISP VPN or a server running on the Edge. Message processing and identity for messages coming from a gateway will differ from those of a direct connection. Additionally, you may need IP whitelisting or VNET configuration, and hence need to understand whether the provider service supports such functionality (refer to Offloading tasks to multiple services below for details on this topic).
    • Burst mode: An additional thing to consider here is burst-mode scenarios, where all the clients start sending messages at the same time, increasing the load significantly and hitting a threshold of a downstream service. While true serverless services like Azure Functions are built to handle such scenarios, there can be a lag while additional consumption units are provisioned, which may also result in time-outs. In such a case you may want to split your workloads or move specific clients onto a dedicated tier to better distribute the incoming requests.
    • Offloading tasks to multiple services: In a Serverless Compute model we deal with Functions, which can be thought of as small, single-purpose tasks that together make up a piece of business functionality. Since functions are lightweight, it is a good approach to offload some of the work required in message exchanges to other components. For example, when building REST APIs you can offload tasks like load balancing, SSL termination, vanity URLs and DNS resolution to an API gateway. Similarly, you can offload authentication and authorization to an identity service like Azure AD, and the management of services to a service like API Management. Finally, geo-redundancy can be achieved using a service like Azure Traffic Manager. By using a bouquet of services, the Serverless Compute layer (in our case, Azure Functions) can focus solely on responding to triggers or handling events, and the remaining ecosystem can work on ensuring the robustness of the solution.
  • What message format(s) are we dealing with?: The consideration here is whether the downstream services support the message formats you want to send. For example, Azure IoT Hub today allows you to send binary data, but if you are analyzing data using Azure Stream Analytics (ASA), it supports CSV, JSON and Avro as message formats today. So if you are sending data in BSON or a proprietary format, you will have to transform the payload before ASA can process the messages. You can use Azure Logic Apps to do the transformation, but now your architecture has changed and has more moving parts to manage.
  • Can we do batching at the site?: Batching small messages (e.g., telemetry coming from a thermostat) is generally recommended since it saves bandwidth and optimizes parallelism in downstream services. Batch when possible, but do consider the size limits of the service. Another consideration is whether the downstream service can process messages in batches, since this can impact the load levelling of the solution. In most cases this should not be a problem, but it is worth checking each service’s ability to process batches before making the decision.
  • Are there any workflow requirements that define a correlation between the data?: In the Serverless world, we are building an event-driven architecture for our solution. While the event model provides great loose coupling between our services, it also makes it difficult to manage things like transactions, workflows, and correlation between components. Plan for how incoming messages will be processed when state needs to be transferred across services. Use an orchestrator like Azure Logic Apps or Azure Durable Functions when component-to-component interaction is required. Additionally, leverage cloud patterns like Retry, Circuit Breaker, and Saga to ensure you can replay or roll back events in case of failures.
  • Are there any quality of service (QoS) attributes that apply to messages?: Most services offered by a cloud provider guarantee at-least-once delivery. This is primarily because of the considerations around the CAP theorem and the cost involved in building infrastructure that provides higher guarantees. At-least-once works well for most interactions, especially if your service employs appropriate retry logic and message handling. If you have scenarios where message reliability is a must, first think twice about why you need such guarantees; in most cases you won’t! If you are still convinced that you need a higher guarantee like exactly-once, isolate the payloads that require it and use it sparingly.
  • Are any specific protocols required to send data to the cloud?: Many IoT systems use custom protocols instead of the popular HTTP protocol. Consider the protocols supported by the provider services when assessing technical feasibility. If the protocol is not supported, you may have to build your own protocol gateway layer, which can affect your decision to use a Serverless service vs. building a custom component.
  • How frequently are messages being sent?: In a provider service world, you are bound by the units that you deploy to a service (e.g., IoT Hub units, Cosmos DB Request Units, Streaming Units, etc.). The idea follows a pattern known as the Scale Unit pattern, which provides predictable monitoring and deployment of service units as you scale up, out or down. Since each service is packaged in a unit-based model, you need to consider how the incoming messages will affect the units you have allocated for your service. But in a true serverless world this should not matter, since the platform should automatically scale up or out, right? While that is true for serverless services like Azure Functions (Consumption plan), it does not apply to all services today. Also, even with core serverless services, there is going to be some degradation or lag when a new consumption unit gets deployed based on your load. While this lag is usually minimal (in ms), it can impact your response time if you are running a mission-critical application.
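The at-least-once guarantee discussed above means a handler can receive the same message more than once, so the handling logic should be idempotent. Below is a minimal sketch of one way to do that, assuming every message carries a unique `id` field. The class name `IdempotentConsumer` and the in-memory set are hypothetical; a real system would keep the processed ids in a durable store shared across instances.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Wraps a handler so that a redelivered message is processed only once."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Assumption: in-memory tracking, for illustration only.
        self._seen: Set[str] = set()

    def receive(self, message: Dict) -> bool:
        """Process the message unless its id was already handled.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        message_id = message["id"]
        if message_id in self._seen:
            return False
        self._handler(message)
        # Record the id only after the handler succeeds, so a failed
        # attempt can be retried when the message is redelivered.
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer(lambda m: print("processing", m["id"]))
    consumer.receive({"id": "42", "temperature": 21.5})
    consumer.receive({"id": "42", "temperature": 21.5})  # duplicate, skipped
```

With this in place, at-least-once delivery combined with retries behaves much like exactly-once from the point of view of the business logic, which is usually enough.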

Storage

Understand how data is stored and processed by downstream services.

  • Real-time vs. batch routing: Determine how the downstream systems will process the incoming messages. Your choice of service needs to align with how soon the data needs to be processed. For example, if you are processing data in motion, you need a near real-time service such as Azure Stream Analytics to process the data before it goes to other downstream systems. Similarly, if you are processing records over a period, you would instead want to employ Azure Time Series Insights for processing. It is recommended to design your system conceptually using a model like the Lambda architecture and then decide which platform services best match your requirements.
  • Does the incoming data have any tags to classify or categorize it?: Platform services are getting smarter as they learn about the needs of customers. You should explore features within these services that provide out-of-the-box solutions to complicated processing logic, and then enrich your incoming messages to enable the use of those features. For example, if you want to route your incoming device data based on message properties or the message body, IoT Hub provides a feature called Message Routing, which can send messages to different downstream services based on a parameter in your message. It is handy if you are employing Hot Path vs. Cold Path analytics, since the same stream can now be sent to multiple downstream services without writing a single line of code.
  • Retention policies and archival: Planning for archival can often be challenging, but if you know how your data is growing and how much of it will move into cold storage, you can employ some neat features provided by the platform services to reduce your cost and improve performance. For example, Azure Blob Storage supports access tiers that let you move data between the Hot, Cool and Archive tiers. The pricing of each tier varies significantly, which lets you reduce data cost instead of using a single plan for both current and archival data.
  • Storage used by Serverless Compute: Azure Functions uses storage accounts, especially Blobs and Tables, for its internal operations. This means that the performance of your Azure Functions can be affected by storage limits and IOPS. Also, while developing Azure Functions, you need to plan for the associated storage accounts, including separating them per Function App and handling logs separately. Azure Durable Functions additionally leverages Azure Storage Queues for state management, so you will need to consider these additional implications when using it.
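To make the retention and archival planning above a little more concrete, here is a small sketch of an age-based tiering rule. The tier names match the Hot, Cool and Archive tiers mentioned above, but the 30-day and 180-day thresholds and the `choose_tier` function are assumptions made up for illustration; your own thresholds should come from how your data is actually accessed.

```python
from datetime import datetime, timedelta, timezone

COOL_AFTER = timedelta(days=30)      # hypothetical threshold
ARCHIVE_AFTER = timedelta(days=180)  # hypothetical threshold


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier based on how long ago the data was last accessed."""
    age = now - last_accessed
    if age >= ARCHIVE_AFTER:
        return "Archive"
    if age >= COOL_AFTER:
        return "Cool"
    return "Hot"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 45, 365):
        print(days, choose_tier(now - timedelta(days=days), now))
```

A rule like this would typically run as a scheduled job or be expressed as a policy on the storage service itself rather than in application code.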

Security

Leverage Security Threat models and Fault-Tolerant guidelines to prevent malicious attacks on your solution.

  • Transport and Messaging: Consider security at both layers
    • Almost all provider services provide a secure transport channel (HTTPS, AMQPS, etc.) for communication by default. Leverage this as a standard.
    • Consider your authentication scheme and whether the service supports negotiation through that scheme. For example, Azure Functions by default provides key-based authorization levels (Anonymous, Function, and Admin); if you need additional security such as OAuth, you can leverage services like Azure API Management that enable more secure scenarios.
    • When thinking about using third-party authentication providers (e.g., Facebook, Google, etc.), consider their SLAs. If you rely solely on a provider that has no SLA, your users may get locked out if the external service goes down.
    • Think end-to-end security and not just public endpoints. With a Serverless architecture, you will end up with a bunch of services that talk to each other to provide a solution. Some of these services may not have a public endpoint; however, they still need to be secured to ensure the end-to-end protection of the solution.
  • Encryption and Encoding: Two key considerations if you have a system that encrypts or encodes data when passing events between systems:
    • Custom processing will be required if you are handling these messages with platform services, since they support only standard formats.
    • The message size will increase and can impact overall persistence targets as well as response times because of encrypting/decrypting and encoding/decoding procedures.

    Most platform services are secured at the transport layer, so use these techniques sparingly, for specific workloads where data security is a must. Note that I am not recommending that you loosen your security procedures, but rather that you spend time choosing your workloads and classifying which messages need encryption. One way to plan for this is to build a Security and Fault Tolerance model and determine which messages would have a significant impact if the system were compromised.

  • PII data: Whether your application has a public or an internal endpoint, if users are accessing it you need to think about their privacy. This becomes a little tricky when using platform services, since your solution is deployed on an infrastructure where the provider also has a privacy policy. Understand the privacy policies described by the platform and align them with your own policy.
  • MTTR: Build a Mean Time To Respond strategy when it comes to security. You cannot always stop hackers from fiddling with your public services (especially if you are famous), and with a service provider your organization has even less control. Plan a response strategy that limits the attack surface in the worst-case scenario, where your service or the platform provider’s service gets compromised. For example, have proper monitoring in place and use analytics to detect variations in patterns; when a change is detected, block the impacted users and devices and issue patches through an automated build to limit the spread of the issue.

Availability and Disaster Recovery

  • Availability out of the box: The good thing about living in the serverless world is that you get availability out of the box. All services provide high availability capabilities and, in most cases, either autoscale or provide easy configuration to handle workloads. So technically, most of it is taken care of. However, when thinking about availability, don’t restrict yourself to the SLA provided by a “single” service; instead, focus on the end-to-end solution. Since we are dealing with multiple services that depend on each other, the aggregate SLA of the combined provider services will be lower than that of any single service, so ensure it still meets your solution’s uptime target.
  • Transient Fault Handling: Serverless services provide some level of protection against transient failures through internal implementations of the Retry and Circuit Breaker patterns. For example, the WebJobs SDK which is the basis of Azure Functions provides these as part of the platform runtime.
    • In addition to the default services, you can also use frameworks like Polly in your custom code to enable implementation of such patterns.
    • Not all services provide transient fault handling capabilities, so ensure you have appropriate measures on the calling end of the services (e.g., Event Hubs triggers today do not have automatic retry, so the calling function needs to implement retry logic).
  • Disaster Recovery (DR): The platform services provide minimal DR capabilities today, so if you are looking for a complete DR solution you will have to do extensive planning. Let’s break the DR conversation into the following components:
    • Serverless Compute: Azure Functions falls under this umbrella, as does any custom code that you run as part of your solution. A recommendation here is to use stateless functions as much as possible; once you do that, you can enable disaster recovery by leveraging a service like Azure Traffic Manager. As of today, Azure Functions can be configured as an App Service endpoint in Traffic Manager, which allows you to use any routing strategy. Watch out for my next post on how to configure DR for Azure Functions to get more details.
    • Data Replication: All data storage services in Azure, including Azure SQL, Cosmos DB and Azure Storage, provide geo-replication of data across Azure data centers. This can be enabled to ensure that all data at rest is copied to a different, paired region and is available in case of a data center failure. Note that you will have to plan the consistency model for the data based on your workloads; for example, if you choose eventual consistency, there is a possibility of data loss due to asynchronous replication.
    • In-stream Processing: By in-stream processing, I refer to message queues and job pipelines like Azure Stream Analytics. This is the tricky part when it comes to using provider services. Almost none of these services provide a message replication solution, and even when they do, there are minimal guarantees against data loss. A few ways to approach such situations are:
      • Identify your workloads and see if they can live with in-stream message data loss, that is, losing messages that are in the queue or currently being processed. This requires a robust client that can replay messages, which is not possible in all scenarios.
      • Create an active-active cluster where the same message is directed to both data centers. While this ensures message replication, it can create problems around data duplication.
      • Some services like Service Bus provide a mechanism where you can create namespace pairing to ensure primary data is copied to a secondary region asynchronously.
    • Service Availability: Last but not least, ensure that the services you’re leveraging are available in paired regions to enable a DR scenario. For example, Azure Application Insights is currently available in Southeast Asia but not in its paired region, East Asia.
  • Throttling: Up until now we have been discussing how to ensure the service is up and running. However, in some scenarios you want to assign thresholds to your service so that you can deny requests instead of continuing to process them. The Throttling pattern is a great way to ensure your service stays healthy and does not exceed the internal thresholds that you have set for service performance. In the case of Serverless, a lot of this is done for you by default. For example, based on the unit model you select, the provider service will automatically have a threshold defined and will return HTTP 429 responses when the threshold is reached. Additionally, when using Azure Functions on a Consumption plan you can set a throughput threshold per function to define when to throttle your endpoints. Plan for throttling and time-outs on your service to ensure that clients have a predictable experience and can handle such responses gracefully.
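To see why the aggregate SLA matters, here is a small worked sketch. It assumes every service in the chain must be up for a request to succeed and that failures are independent, in which case the combined availability is the product of the individual ones. The SLA figures and the function names are illustrative assumptions, not actual Azure SLAs.

```python
import math
from typing import Sequence


def composite_sla(slas: Sequence[float]) -> float:
    """Availability of a chain of services that must all be up: the product."""
    return math.prod(slas)


def redundant_sla(sla: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments is enough."""
    return 1.0 - (1.0 - sla) ** copies


if __name__ == "__main__":
    # Illustrative figures only: e.g. a gateway, a compute layer and a database.
    chain = [0.9995, 0.999, 0.9999]
    single_region = composite_sla(chain)
    print(f"single region: {single_region:.4%}")
    print(f"two regions:   {redundant_sla(single_region, 2):.6%}")
```

With these example numbers the chain comes out at roughly 99.84%, which is lower than the weakest individual service at 99.9%. Deploying the whole chain to a second region behind something like Traffic Manager is one way to win that availability back.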

Maintenance

  • Tooling: One of the key considerations when it comes to Serverless is whether there is sufficient tooling available for the development team to build an end-to-end solution. Several things to consider here:
    • Programming Language: The choice of language will depend on whether the platform supports it. This becomes especially important when you have a development team with existing skills; for example, Go is not supported by Azure Functions today. Also, some languages may have only experimental support and not be ready for production (e.g., TypeScript).
    • Dependency Frameworks: The versions of the runtime frameworks that you need for your solution are also important. For example, the Azure Functions 1.x runtime supports Node 6.5.0 for production deployment, whereas the current Node.js release is 9.6+.
    • Cross-platform support: Development teams that need to deploy on Linux and Windows need to ensure the runtime and client SDKs are supported on the required OS distributions.
    • IDE support: Check whether the development tools are available and integrated into the IDE. If not, look for third-party extensions for your scenarios.

      A note on Visual Studio Code: if you are developing Azure Functions, it is perhaps the best cross-platform IDE available today, with an intuitive Azure Functions extension that makes development and deployment to Azure a breeze. If you have not checked it out, download it here.

    • DevOps: The DevOps cycle will be significantly impacted if you don’t have the right tools in hand. Ensure that the service supports not just Portal deployment but also the command line and integration with CI/CD tools like Jenkins, VSTS, etc.
    • Azure pre-compiled vs. scripted functions: Many samples and videos out there use the Azure Portal for development; a function developed in the Portal is called a scripted function. While scripted functions are fine for sample scenarios, when developing a production system I recommend that you create a pre-compiled function using an IDE and deploy it using Azure tooling. A key reason is that scripted functions do not support versioning, and every time you run such a function the runtime creates an assembly and deploys it per function. This not only impacts scale but also makes change management difficult for future iterations.
  • Monitoring: Another important aspect of maintenance is how you monitor the system. The better the monitoring, the quicker you can find errors and issues and keep the system healthy. A few considerations when it comes to monitoring:
    • End-to-End Telemetry: Most provider services have built-in monitoring, which includes capturing events as well as monitoring dashboards such as Azure Monitor. While this is great from the perspective of a particular service, when dealing with the entire solution you need data about event flows across the system and not just within individual services. Services such as Log Analytics and OMS greatly help with aggregating logs and then displaying meaningful insights about the solution instead of just a single service. Additionally, Application Insights can be used to send custom logging data to these log aggregators so that end-to-end telemetry of the system can be obtained.
    • Additionally, for custom logging scenarios leverage Semantic Logging frameworks that can assist with the integration of multiple sinks and channels without making changes to your logging API.
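One simple technique that makes end-to-end telemetry possible is to stamp every log entry with a correlation id that travels with the message from service to service, so that a log aggregator can stitch the flow back together. The sketch below uses only the Python standard library; the field names and the `log_event` helper are assumptions for illustration and are not an Application Insights or Log Analytics API.

```python
import json
import logging
import uuid


def new_correlation_id() -> str:
    """Create an id at the ingestion endpoint and pass it along with the message."""
    return str(uuid.uuid4())


def log_event(logger: logging.Logger, correlation_id: str, service: str, event: str) -> str:
    """Emit one structured (JSON) log line and return it."""
    line = json.dumps(
        {"correlation_id": correlation_id, "service": service, "event": event},
        sort_keys=True,
    )
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("telemetry")
    cid = new_correlation_id()
    # The same id is logged by each service that touches the message.
    log_event(log, cid, "ingestion", "message received")
    log_event(log, cid, "processing", "message transformed")
    log_event(log, cid, "storage", "message persisted")
```

Filtering the aggregated logs on a single correlation id then shows the path of one message across all services.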

Compliance

  • Standards and Policies: A Serverless solution is your solution running on a provider’s infrastructure, so it is important to understand the implications around compliance and how much control and configuration you have.
    • Provider Lock-In: The idea behind a Serverless solution is to host your solution in a provider environment. This by default encourages provider lock-in, since all of the services used by the solution will be specific to the vendor. But is that a bad thing? I would say it depends. In my experience, customers who commit to a cloud do not move away often, and usually only when they experience serious limitations or see significant cost benefits elsewhere. Since this is an infrequent event, I would suggest embracing the vendor services instead of being conservative and thinking about generic approaches. I do not say that because I work for a cloud provider; rather, I have seen customers go down the rabbit hole of being generic and limiting their use of a provider service’s capabilities, resulting in a solution that could have been much better had they committed to the provider service. This is a big decision for an organization, though, so carefully assess how you want to proceed.

      Azure Functions is leading the way in an open direction by open-sourcing the Functions runtime; this enables sophisticated hybrid scenarios as well as portability to other clouds. Hopefully, other cloud vendors will be able to provide a standard runtime, so that at least the custom development on serverless can become portable.

    • Regulations: In addition to lock-in, consider any legal implications of using the services.
      • Are there any standards or policies that are required to be adhered to for the data that is being persisted?
      • Are there any security standards that need to be respected to ensure data security and compliance at rest?
      • Are there any requirements to ensure data is available in a specific region (e.g., all data must be persisted within a country)?

      Some of the above questions can shape or limit the use of provider services depending on their availability in a region, so read the fine print carefully.

Understand the platform constraints

Apart from the customer requirements, it is essential to understand the limitations and throttles of the Serverless platform. This is critical since you are dealing with a bouquet of services, and you will want to look at the end-to-end execution of operations to ensure you get performance, scale, and fault tolerance across the stack and not just for a specific service.

The Azure team has done a great job of providing best practices for most Azure services; you can check them out here:

I hope this post gave you an in-depth tour of the considerations for a Serverless architecture. Finally, remember that as you delve into Serverless solutions you will realize you have choices, but you need to be cognizant of each one, since it can affect the long-term scalability, availability, and reliability of your solution.

I would love to hear your thoughts and comments. If you have guidelines or practices for developing serverless architectures, please do share 🙂 …