A resilient payment system

The hardware and database architecture of Settle has some characteristics that API client implementors need to be aware of. The advantage of this architecture is that Settle is very resilient against the failure of individual servers or network connections. The flip side is that an HTTP 5xx response code (most often 500 or 503) is more common than with a more traditional system that runs on only a few machines which are assumed not to fail.

A 5xx error normally does not indicate that something is wrong. It simply indicates that the API client should try again. This behaviour is fully expected, and the API is carefully designed to make sure retrying works in all cases.

Idempotency and retries

The most important concept in the Settle API is that it is idempotent, which simply means that every request can safely be submitted multiple times. For instance, assume you submit a payment request to a customer by calling the POST /payment_request/ endpoint. In return you get a transaction ID. If you then send the same request again with the same pos_id and pos_tid, you will always get the same ID back, and your customer will see only a single payment request. To submit a new payment request, you have to change pos_tid to something else.
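The behaviour described above can be illustrated with a toy model. The following is a minimal sketch, not the Settle backend: the class FakePaymentServer, its method name, and the sample pos_id and pos_tid values are all hypothetical, and the transaction ID format is invented for the illustration. It only shows the idea of keying a request on (pos_id, pos_tid).

```python
import uuid


class FakePaymentServer:
    """Toy in-memory stand-in for an idempotent endpoint (hypothetical)."""

    def __init__(self):
        # Maps (pos_id, pos_tid) to the transaction ID issued for it.
        self._requests = {}

    def create_payment_request(self, pos_id, pos_tid, amount):
        key = (pos_id, pos_tid)
        # Only the first submission creates a payment request;
        # any repeat with the same key returns the existing ID.
        if key not in self._requests:
            self._requests[key] = uuid.uuid4().hex
        return self._requests[key]


server = FakePaymentServer()
first = server.create_payment_request("pos-1", "order-1001", "10.00")
retry = server.create_payment_request("pos-1", "order-1001", "10.00")
other = server.create_payment_request("pos-1", "order-1002", "10.00")

print(first == retry)  # same pos_id and pos_tid: same transaction ID
print(first == other)  # different pos_tid: a new payment request
```

In this model a retried submission is harmless because the second call finds the existing entry and creates nothing new.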

A call to Settle can fail for 3 reasons:

  1. The call from the API client to Settle never arrives at Settle due to network failure. In this case nothing has been done.
  2. The response back from Settle never reaches the API client due to network failure. In this case the request has been processed successfully, but the API client does not know it.
  3. The Settle backend returns a 5xx error. In this case the request may have been processed successfully, or it may not.

In summary, whether the API call results in a timeout or a 5xx error, the API client does not know whether the call was successful. The appropriate response is to try again. If the first call was successful, the second call will have no additional effect, and no harm is done.

Warning

If you get 5xx, the request may either have succeeded or failed. Treat a 5xx the same way as a timeout caused by network problems, where you do not know what has happened. The correct response is to retry the request until you get either a 2xx or 4xx return code back.

Guidelines

The simplest strategy is to retry the API call with a one-second pause between attempts until the call gives a 2xx or a 4xx HTTP return code. This simple strategy should be sufficient for implementations of the Settle API.
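The simple strategy could be sketched as follows. This is an illustration under stated assumptions rather than an official client: send is a hypothetical zero-argument callable that performs one HTTP request and returns its status code, and it is assumed to raise TimeoutError or ConnectionError when no response arrives. A real HTTP library will raise its own exception types, so the except clause would need adapting.

```python
import time


def call_with_retry(send, pause_seconds=1.0, sleep=time.sleep):
    """Call send() until it yields a 2xx or 4xx status code.

    send: hypothetical callable returning an HTTP status code, raising
          TimeoutError or ConnectionError on network failure (assumption).
    sleep: injectable so the pause can be replaced in tests.
    """
    while True:
        try:
            status = send()
        except (TimeoutError, ConnectionError):
            # No response: the outcome is unknown, same as a 5xx.
            status = None
        if status is not None and status // 100 in (2, 4):
            return status
        sleep(pause_seconds)
```

Because every request is idempotent, the loop does not need to distinguish between a request that was never processed and one whose response was lost.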

If the merchant submits "bills" that should stay in the customer's timeline for many days, timing is usually not essential, and one should normally just continue to retry each request until it goes through (perhaps switching to a 5-minute delay between retries after one minute has passed).
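The delay schedule suggested for bills can be written as a small function. This is only a sketch: the function name retry_delay is hypothetical, and the one-second delay during the first minute is carried over from the simple strategy above.

```python
def retry_delay(elapsed_seconds):
    """Seconds to wait before the next retry of a non-urgent request.

    elapsed_seconds: time since the first attempt was made.
    """
    if elapsed_seconds < 60:
        return 1      # first minute: retry every second
    return 300        # afterwards: retry every 5 minutes


print([retry_delay(t) for t in (0, 30, 59, 60, 3600)])
# → [1, 1, 1, 300, 300]
```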

In the situation of an interactive payment, whether in a webstore or in a physical shop, the following guidelines apply:

  • Set a short expiration on payment requests (a few minutes).
  • Until the point where a CAPTURE has been submitted to the PUT /payment_request/<tid>/ endpoint, the money will always be automatically refunded after 3 days. If a CAPTURE has not been submitted, it is OK to give up and assume the money will not arrive. If the user has paid, the money will be refunded.
  • If an AUTH has been received by the merchant, it is OK to give the goods to the customer. Settle guarantees that the authorization will be valid for 3 days, so one can put the CAPTURE request in a queue and retry it every hour until the network, the merchant's systems, and Settle are all available again.
  • If a 5xx or timeout happens when submitting the CAPTURE, one must retry until success. This is the only case where it is not safe to simply give up. However, as noted above, it is OK to hand out goods and delay the retry of the CAPTURE to some later time (such as end-of-day closing of cash registers).
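The deferred CAPTURE retry mentioned in the last two guidelines could be organised as a queue that is flushed periodically (for example every hour, or at end-of-day closing). The sketch below is an assumption-laden illustration: flush_captures and send_capture are hypothetical names, and send_capture is assumed to return the HTTP status code, or None when no response arrived.

```python
from collections import deque


def flush_captures(pending, send_capture):
    """Try each queued CAPTURE once; keep the ones with unknown outcome.

    pending: deque of transaction IDs awaiting CAPTURE.
    send_capture: hypothetical callable, tid -> status code or None.
    Returns a list of (tid, status) pairs that received a final answer.
    """
    finished = []
    for _ in range(len(pending)):
        tid = pending.popleft()
        status = send_capture(tid)
        if status is not None and status // 100 in (2, 4):
            # 2xx: captured. 4xx: retrying will not help, handle separately.
            finished.append((tid, status))
        else:
            # 5xx or no response: outcome unknown, retry on the next flush.
            pending.append(tid)
    return finished
```

A CAPTURE stays in the queue until it gets a final answer, which matches the rule that this is the one case where giving up is not safe. In practice the queue would need to be stored durably so that it survives a restart of the merchant's own systems.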

Note

In the vast majority of cases, when you get a 5xx error, a second attempt one second later will go through. The guidelines above also cover far more extreme situations. By following them you will be resilient not only against unexpected failures on the part of Settle, but also against network failures and outages of your own systems, with payments still being processed correctly.

Background reading: The Settle backend architecture

A common way to host payment services is to have either a single node or a small handful of highly reliable nodes, with redundant power supplies and scheduled hardware maintenance.

Settle runs on Google App Engine, which has an architecture very different from this:

  • It runs on tens or hundreds of slower commodity servers instead of a few high-powered servers.
  • Servers are not assumed to stay available. All code is written so that a server can go down at any time without disturbing operations.
  • Hardware maintenance happens on the fly, without disturbing operations.

This architecture does, however, put some responsibility for retrying requests on the API client. If one of the servers is suddenly unplugged for maintenance, or suffers hard drive corruption, or there is too much load on one part of the database, the API client will see either a request timeout (within 60 seconds) or a 5xx error. This is normal, and part of how the system is supposed to work.