capsule.adrianhesketh.com

Idempotency in Lambda - 1 - What is it and why should I care?

This is part 1 of a 3 part series.

Part 1 - What's the problem and why should I care?

Part 2 - Using DynamoDB to manage once-only functionality

Part 3 - AWS and Stripe APIs that support idempotency.

What is it, and why should I care?

> Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application.

> https://en.wikipedia.org/wiki/Idempotence

Real-world processes can often be modelled as "finite state machines". For example, if a retail order is "created" by some process, some work needs to be done to get that order through to a "complete" stage. Not all transitions of state are valid - we can't take a `created` order and get to the `complete` stage without going through some extra states first, like `payment_started`, `payment_completed`, `picking_started`, `picking_completed`, `dispatching_started`, `dispatching_complete`, `delivery_started` and `delivery_complete`.
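As a minimal sketch of that idea, the states listed above can be written down as a transition table. The names `ORDER_STATES`, `NEXT_STATE`, `InvalidTransition` and `transition` are made up for illustration, and the sketch assumes a strictly linear flow where each state has exactly one valid successor.

```python
# The order states from the text, in the order they are expected to occur.
ORDER_STATES = [
    "created",
    "payment_started",
    "payment_completed",
    "picking_started",
    "picking_completed",
    "dispatching_started",
    "dispatching_complete",
    "delivery_started",
    "delivery_complete",
    "complete",
]

# Assumption: each state may only move to the state that follows it in the list.
NEXT_STATE = dict(zip(ORDER_STATES, ORDER_STATES[1:]))


class InvalidTransition(Exception):
    """Raised when a state change skips or repeats a step."""


def transition(current: str, target: str) -> str:
    """Return the target state if moving to it from current is allowed."""
    if NEXT_STATE.get(current) != target:
        raise InvalidTransition(f"cannot move from {current} to {target}")
    return target
```

With this table, `transition("created", "payment_started")` is allowed, while `transition("created", "complete")` raises `InvalidTransition`.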

The movement from `_started` to `_completed` usually requires something to happen - making an API call to a 3rd party, moving some physical items, etc. - and some data is generated as a result: maybe a tracking code from the 3rd party API, the date and time when things started or completed, and what did the work.

We can use event-driven systems to manage this, starting with an API call.

Step 1 - handle an API request and store the state change.

Receive an "order created" API call or `order_created` event.

Write a database transaction that has no effect or throws an error if the order already exists.

Writing the database transaction causes an `order_created` event to be published, e.g. via DynamoDB Streams to EventBridge.
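Step 1 can be sketched with an in-memory stand-in for the database. `InMemoryOrders`, `OrderAlreadyExists` and `handle_create_order` are hypothetical names, and the `published` list only imitates a change stream. With DynamoDB, the "fail if it already exists" rule would typically be expressed as a conditional write, for example a condition using `attribute_not_exists` on the key.

```python
class OrderAlreadyExists(Exception):
    """Raised when an order with the same ID has already been stored."""


class InMemoryOrders:
    """Hypothetical stand-in for a table that supports conditional writes."""

    def __init__(self):
        self.items = {}
        self.published = []  # imitates events emitted by a change stream

    def create(self, order_id: str, details: dict) -> None:
        # The existence check and the write happen as one step here. A real
        # database has to enforce this condition atomically on its side.
        if order_id in self.items:
            raise OrderAlreadyExists(order_id)
        self.items[order_id] = {"state": "created", **details}
        self.published.append(("order_created", order_id))


def handle_create_order(orders: InMemoryOrders, order_id: str, details: dict) -> str:
    """Create the order once; a duplicate request has no further effect."""
    try:
        orders.create(order_id, details)
        return "created"
    except OrderAlreadyExists:
        return "already_exists"
```

Calling `handle_create_order` twice with the same order ID stores one order and publishes one `order_created` event.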

Step 2 - start payment with the Stripe API.

It's tempting to make an API call out to Stripe in the same function that creates the order, but executing multiple actions in a single execution unit makes it possible that some of the actions succeed, but others fail or don't execute, without rolling back. This is a partially committed transaction - we could find ourselves with an order that started, but hasn't got a payment in place, and no way of retrying to get that to happen.

So, let's keep it to one side-effect or state change per unit of execution by creating a new Lambda function to handle the `order_created` event. The Lambda function does one thing - calls the Stripe API.

This has several benefits: it moves the execution out of the synchronous API call, which reduces the latency of the "create order" API call; it stops "create order" from failing if Stripe is down or there's a network problem; and it enables automatic retries of the Stripe API call if there's a problem.

Receive the `order_created` event.

Use the Stripe API to create a payment intent.

Step 3 - receive asynchronous updates via Webhooks.

The Stripe API uses Webhooks to report back state changes, so we'll receive them and use them to change our state.

Receive the Stripe `payment_intent.created` Webhook.

Write a database transaction that updates the state from `order_created` to `payment_started` - fail if the order is not in `order_created` state, and don't update the state if the order is already in a `payment_completed` or later state.

Writing the database transaction successfully causes a `payment_started` event to be published, e.g. via DynamoDB Streams to EventBridge.

Once the user completes the process, we'll receive the `payment_intent.succeeded` webhook, but there's also a chance that we get the `payment_intent.succeeded` webhook before we get the `payment_intent.created` webhook. To make sure we're processing things in the right order, we can reject the webhook by returning an error until we've received the `payment_intent.created` webhook, which forces Stripe to retry later. Or, we could just accept that we've skipped a notification.

Receive a `payment_intent.succeeded` webhook from Stripe.

Write a database transaction that updates the state from `payment_started` to `payment_completed` and stores information about the event.

Make sure that the transaction fails if the order is not in the `payment_started` state, or that the transaction has no effect if a duplicate event has been received.

Writing the database transaction causes a `payment_completed` event to be published, e.g. via DynamoDB Streams to EventBridge.
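The guarded state change in the steps above could look something like the following sketch. It uses a plain dictionary in place of a database record, and the `seen_events` set, the `WrongState` exception and the `apply_transition` function are all assumptions made for illustration. In a real database the check and the update would need to be a single conditional write.

```python
class WrongState(Exception):
    """Raised when the order is not in the state the transition expects."""


def apply_transition(order: dict, expected: str, target: str, event_id: str) -> bool:
    """Move the order from expected to target.

    Returns True if the state changed, False if the event was a duplicate.
    Raises WrongState if the order has not reached the expected state.
    """
    # A duplicate delivery of the same event has no effect.
    if event_id in order["seen_events"] or order["state"] == target:
        return False
    # Out-of-order delivery: fail so that the sender can retry later.
    if order["state"] != expected:
        raise WrongState(f"order is {order['state']}, expected {expected}")
    order["state"] = target
    order["seen_events"].add(event_id)
    return True
```

Raising on the wrong state matches the "reject and let Stripe retry" option, while returning `False` for duplicates keeps a repeated webhook from producing a second state change.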

This kind of asynchronous processing is ideal because it enables automated retries and keeps processing simple - your code is receiving an event and making an API call, or is receiving an event and updating the state (which causes another event to be sent).

Asynchronous vs synchronous APIs

However, not all APIs are asynchronous. In some cases, we will be forced to call synchronous APIs - APIs which rely on the client storing a value provided by the API. In these cases, we need to do two things - call an API and save the data. This leads to a potential error state where we call the API successfully, but are unable to save the data.

In some APIs, this is fine. Let's imagine we upload a file to an S3 bucket with a random filename. If we do that 10 times, we'll spend a bit more on S3 storage, but it's not really a problem. However, if we've just emailed a customer or spent a lot of money because of that API call, it's not so fine. Since Lambda events are retried on failure, a database failure that follows a successful API call is a problem, because the retry will call the API again.

To complicate matters, some synchronous APIs (e.g. AWS and Stripe APIs) have idempotency features built in that enable retries to be safe in limited circumstances - if you pass the same idempotency tokens into them, you always get the same output (terms and conditions apply). If the idempotency window aligns with your need, this sometimes enables a shortcut to be taken by making it safe to make an API call followed by a database save operation, but it's not a pattern that can be applied safely everywhere - care must be taken to do it right. These APIs are described in part 3.
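To make the idea of an idempotency token concrete, here is a small simulation. `FakePaymentApi` is not a real client; it is a hypothetical stand-in that remembers the response for each key, and it ignores the expiry window that real APIs apply. The `key_for_order` helper is also an assumption. The point it illustrates is that the key has to be derived from the event being processed, so that a retry sends the same key.

```python
class FakePaymentApi:
    """Hypothetical third-party API that honours idempotency keys."""

    def __init__(self):
        self.responses_by_key = {}
        self.charges_made = 0

    def create_charge(self, amount: int, idempotency_key: str) -> dict:
        # A repeated key returns the stored response without charging again.
        if idempotency_key in self.responses_by_key:
            return self.responses_by_key[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": f"ch_{self.charges_made}", "amount": amount}
        self.responses_by_key[idempotency_key] = response
        return response


def key_for_order(order_id: str) -> str:
    # Derived from the order, not generated randomly per invocation,
    # so that a retried invocation reuses the same key.
    return f"order-payment-{order_id}"
```

If the key were generated randomly inside the handler, each retry would look like a new request and the protection would be lost.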

In a situation where we're not provided with an idempotent API by a 3rd party, we can use "once, and only once" processing to turn it into an idempotent API and protect the underlying API from being called multiple times, at the cost of some extra database calls and the management overhead of dealing with errors. This is described in part 2.

Idempotency in Lambda

Even without database failures, if we're using non-idempotent APIs, we may run into issues because events delivered by AWS services such as SQS, EventBridge and Kinesis to Lambda have "at least once" delivery. This means that Lambda functions or other systems subscribed to these sources may end up receiving a message twice, sometimes within a few milliseconds of each other.

AWS has a guide on dealing with this, but at the time of writing, the guide at [0] provides example logic that doesn't cover all of the possible edge cases that can result in duplicate processing.

[0]

The guide suggests the following:

1. Extract the value of a unique attribute of the input event. (For example, a transaction or purchase ID.)

2. Check if the attribute value exists in a control database (such as an Amazon DynamoDB table).

3. If a unique value exists (indicating a duplicated event), gracefully terminate the execution (that is, without throwing an error). If a unique value doesn't exist, continue the execution normally.

4. When the function work finishes successfully, include a record in the control database.

5. Finish the execution.

One problem with this is that it's possible for Lambda functions to be invoked within milliseconds of each other with the same payload. Checking to see if a value exists at the start of the invocation, and then only preventing other invocations from doing the same work after the current invocation has completed the work can result in a race condition - a situation where two Lambda invocations both believe they're the only invocation carrying out the work.
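The race can be reproduced deterministically by interleaving the steps by hand. In this sketch a Python set stands in for the control database and `run_interleaved` is a hypothetical helper. Both invocations perform the existence check before either one has written its record.

```python
def run_interleaved(event_id: str) -> list:
    """Simulate two invocations that both check before either one writes."""
    control_db = set()  # stand-in for the control database
    work_log = []       # records which invocations carried out the work

    # Both invocations check for the record within milliseconds of each other.
    first_sees_record = event_id in control_db
    second_sees_record = event_id in control_db

    checks = (
        ("invocation-1", first_sees_record),
        ("invocation-2", second_sees_record),
    )
    for name, sees_record in checks:
        if sees_record:
            continue  # would terminate gracefully as a duplicate
        work_log.append(name)     # the function's work runs
        control_db.add(event_id)  # the record is only written after the work
    return work_log
```

Because neither check saw a record, the returned log contains both invocations, which means the work ran twice for one event.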

Another problem is that it assumes something about the Lambda function - it assumes the Lambda function is only executing idempotent APIs. That is, it assumes that it's safe to run the function if there's no value in the control database. Let's look at some examples of where it wouldn't be.

Failed API call

In this example, there's API Call A, and API Call B in the same Lambda function. Here's what happens when there's a failure and a message is retried.

Lambda invocation 1

Control database get: No token found

API Call A: Success (call 1)

API Call B: Error, quit with Lambda error

Control database write: N/A, we already quit

Lambda invocation 2 (the retry)

Control database get: No token found

API Call A: Success (call 2)

API Call B: Success

Control database write: Written successfully

If API Call A is not idempotent, then we may have introduced a serious problem.
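The sequence above can be replayed as a small simulation. `guide_handler` is a hypothetical function that follows the check, work, record order, a set stands in for the control database, and API Call B is rigged to fail on its first call only.

```python
from collections import Counter


def guide_handler(event_id, control_db, api_a, api_b):
    """Check the control database, do the work, then record the event."""
    if event_id in control_db:
        return "skipped"
    api_a()
    api_b()
    control_db.add(event_id)
    return "processed"


def simulate_failed_api_call():
    calls = Counter()
    control_db = set()

    def api_a():
        calls["A"] += 1

    def api_b():
        calls["B"] += 1
        if calls["B"] == 1:
            raise RuntimeError("API Call B failed")

    # Invocation 1: API Call B fails, so nothing is written to the control database.
    try:
        guide_handler("evt_1", control_db, api_a, api_b)
    except RuntimeError:
        pass
    # Invocation 2 (the retry): no token is found, so API Call A runs again.
    result = guide_handler("evt_1", control_db, api_a, api_b)
    return result, dict(calls)
```

After the retry, the counter shows that API Call A was made twice even though the event was only processed successfully once.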

Failed database write after processing

The same problem would occur if the control database write failed for some reason, even if we only had one API call in the Lambda function.

Lambda invocation 1

Control database get: No token found

API Call A: Success (call 1)

API Call B: Success (call 1)

Control database write: Failed to write

Lambda invocation 2 (the retry)

Control database get: No token found

API Call A: Success (call 2)

API Call B: Success (call 2)

Control database write: Written successfully

In this case, it's even worse. API Call A and API Call B were both called twice, so if either of them is not idempotent, we potentially have a problem.
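A sketch of this scenario, using a hypothetical `FlakyControlDb` whose first write fails, shows the same effect. The `handler` and `simulate_failed_control_write` names are made up, and the list of calls stands in for the real API calls.

```python
class FlakyControlDb:
    """Hypothetical control database whose first write attempt fails."""

    def __init__(self):
        self.tokens = set()
        self.write_attempts = 0

    def exists(self, token: str) -> bool:
        return token in self.tokens

    def write(self, token: str) -> None:
        self.write_attempts += 1
        if self.write_attempts == 1:
            raise IOError("control database write failed")
        self.tokens.add(token)


def handler(token: str, db: FlakyControlDb, calls: list) -> None:
    if db.exists(token):
        return  # duplicate: terminate gracefully
    calls.append("A")  # API Call A
    calls.append("B")  # API Call B
    db.write(token)    # only recorded after the work has been done


def simulate_failed_control_write() -> list:
    db = FlakyControlDb()
    calls = []
    for _ in range(2):  # the first invocation, then the retry
        try:
            handler("evt_1", db, calls)
        except IOError:
            pass
    handler("evt_1", db, calls)  # a later duplicate is now skipped
    return calls
```

The returned list is `["A", "B", "A", "B"]`: the deduplication only starts working once the control record has been written, which is after both APIs have already been called twice.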
