Wednesday, June 29, 2016

Why in Oracle BPM/SOA Suite Attaching a Fault Policy to a Synchronous Service Is Not a Good Idea

In this posting I will explain why in the Oracle BPM/SOA Suite you should not attach fault policies to synchronous services.

This posting has been updated on July 1 2016 to explain why the recovery of ServiceA failed.

The other day I investigated some BPM process instance that had an unrecoverable error. It was calling a synchronous service, that on its turn was calling another synchronous service, that on its turn was calling a external, synchronous service exposed through the Oracle Service Bus. That latter call failed (due to a timeout). As all the composites were having a fault policy attached to them, it was expected that the instance was recoverable. Instead there was some JTA transaction error that rolled back all the way up to last dehydration point in the (top level) BPM process, and from there was retried two times before it finally gave up and went into a coma. Big surprise!

What I recommended to them to prevent this in the future, was the following:

  1. For all synchronous services detach the fault policy. Only attach fault policies to asynchronous and fire&forget services,
  2. Where possible, do asynchronous calls from the BPM process (instead of synchronous), 
  3. Wherever possible make all synchronous services idempotent or make them asynchronous / fire&forget.

If you want to understand why, read on!

Let's assume we have the following chain of services:

I created a FailingService that I can let succeed or fail depending on the input.

Now what happens when BPMProcess calls ServicePS with request 'fail' is the following:

  1. ServiceA errors because of the fault thrown by FailingService
  2. BPMProcess and ServicePS both error because of a timeout
  3. Because of fault policy attached to all them, all go into recoverable state (human intervention)

As ServiceA is in a recoverable state, why not try to recover and see if that will fix the flow?

Now what happens when a retry is done from ServiceA with payload 'normal' is the following:

  1. ServiceA completes successfully
  2. However, BPMProcess & ServicePS are still in recoverable state

The explanation for this is that, because ServicePS was already in a recoverable state, it will not receive the response from ServiceA, as it is no longer listening.

Now let's see what happens when we try to recover the instance of ProcessPS and change the payload from 'fail' to 'normal'

  1. Although the payload was changed to 'normal', we still end up with a new errored instance of ServiceA, as the request of the call from A to FailingService did not change with it (i.e. still 'fail')

In the meantime I understand why ServiceA still called the FailingService with payload 'fail' (I picked the wrong call to retry from the drop-down), but even if the call would have been successful, the FailingService would have been called twice, and let's just hope it is idempotent!

To prevent we get more of these duplicate calls, we recover the (top level) BPM instance instead.

Now what happens when a retry is done from BPMProces with payload 'normal' is the following:

  1. There are new (successful) calls to ServicePS -> ServiceA -> FailingService
  2. However, there are still running instances of ServicePS and ServiceA (they are still in a recoverable state)

The explanation being that these instances were still running after the previous (failed) attempt. So now we still have to abort these running instances to prevent duplicate calls. All in all not very convenient.

The solution is to never let an asynchronous service use a fault policy that either initiates human intervention or does 1 or more retries. The point being that in the meantime the consumer will have timed out, and never receive the response even it succeeds later on.

The best layer to handle errors with synchronous services is a layer that has 'knowledge' about the context of the process. Normally that is the business process itself. The reason being that a policy that fits one process may not fit another.

On the other hand, there are some good arguments for not letting system errors bubble all the way up to a business process. Instead you should consider handling it in the next layer below it - in this case being ServicePS - by making all calls from the business process to ServicePS either asynchronous, or fire&forget (the latter when successful continuation of the process is not depending upon the call). ServicePS will then handles the error using fault policies. You have two options to recover:

  • You recover the instance of ProcessPS, or (when that fails for whatever reason) 
  • Abort the instance of ProcessPS, and do an alter flow on the business process by moving the token from the Receive back to the Send activity. 

As a matter of fact, this customer actually created this ServicePS as a process-specific layer that sits in between the business process and any other service. A similar layer may not be feasible in your case, in which the solution would be to let the error bubble all the way up to the process instance and handle it there (using fault policies).