Operational Acceptance Testing (OAT)
The purpose of OAT is to prove the aspects of the system that do not affect the functionality but can still have a profound effect on how it is managed and supported.
OAT concentrates on areas such as resiliency, recoverability, integrity, manageability and supportability, with the specific exclusions of Performance, Security and Disaster Recovery, which are areas of speciality in their own right.
The required level of OAT is determined using Change Driven Risk Management (CDRM), whose output recommends the risk mitigation strategy for all phases of the project. This enables the OAT phase to focus on mitigating the operational risks.
The following mitigation methods form the OAT phase:
- Backup & Recovery
- Change Implementation
- Change Back-out
- Component Failure
- Shutdown & Resumption
- Operational Support & Procedure
- Alerts
All methods must be performed based on the CDRM technique and TS standards in a managed non-functional test environment that is an accurate reflection of production.
Categories of OAT
Backup & Recovery
To prove both the backup and recovery processes. The testing will prove the operation, operability and integrity of backup procedures to ensure that the operating systems and data can be restored successfully at the same site and also at another site if applicable. The recovery testing includes the build and configuration of a component. These tests will ensure build quality and guarantee subsequent builds of components are to the same standard.
The testing should prove that:
· Service can be restored to an agreed recovery point utilising appropriate TS standard backup and restore methods.
· Backups taken at one site can be recovered to the same site.
· Backups taken at one site can be recovered to another site.
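One way to check the integrity of a restore is to compare checksums of the source data with the restored copy. The sketch below is illustrative only: the function names and the idea of comparing two directory trees are assumptions made for this example, not part of any TS standard backup tooling.

```python
import hashlib
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Whole-file read keeps the sketch short; large files would
            # normally be hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests


def restore_differences(source: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing, extra or changed after a restore."""
    before = checksum_tree(source)
    after = checksum_tree(restored)
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

An empty result indicates that the restored tree matches the source file for file. The same comparison could be run for a same-site and an other-site restore.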
Change Implementation
To prove that the implementation into the production environment will be successful and not adversely affect the existing production services.
The testing should prove that:
· The implementation into the live production environment will not adversely affect the integrity of the current production services.
· The implementation process can be replicated by using valid documentation that includes the time required for each step and the order of implementation.
Change Back-out
To prove the back-out of a failed change from the production environment will be successful and will not adversely affect existing production services.
The testing should prove that:
· All the required steps to successfully back out a change are valid.
· The time required for each step of the back-out is known and documented.
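The documented timings for an implementation or back-out procedure can be checked against a rehearsal in the test environment. The following is a minimal sketch, assuming each documented step has an order number and an allowed duration; the `Step` type and both function names are hypothetical and are not taken from any TS standard.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Step:
    order: int             # position in the documented sequence
    description: str
    documented: timedelta  # time the documentation states for the step


def overruns(steps: list[Step], measured: dict[int, timedelta]) -> list[Step]:
    """Return steps, in documented order, that were not timed during the
    rehearsal or that took longer than the documentation states."""
    return [
        step
        for step in sorted(steps, key=lambda s: s.order)
        if step.order not in measured or measured[step.order] > step.documented
    ]


def total_documented(steps: list[Step]) -> timedelta:
    """Total documented duration of the whole procedure."""
    return sum((s.documented for s in steps), timedelta(0))
```

Any step returned by `overruns` suggests that the documentation needs its timing corrected before the procedure is relied on in production.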
Component Failure
To prove that the infrastructure has been designed to cope with unplanned outages. Following failure and repair, it should be possible to recover the failed components into the infrastructure in line with TS Recovery Management processes and timescales.
The testing should prove that:
- The service can continue after the failure of individual components (outside its core operating environment), while issuing appropriate error messages. The system should be designed to offer transparent failover where possible. On a terminal error on the active platform (usually identified by a heartbeat failure), the failover infrastructure should be activated automatically. Ultimately, this covers the ability to continue operation at an alternative facility after a failure at the primary facility. This should be proven for new and amended components.
- The system can automatically adjust itself to availability of system resources.
- If fail-over is invoked, fail-back can be performed successfully and recovery to the original state is achievable. When component failures are resolved the service should fully recover itself with no customer impact. Any non-automated actions should be documented.
- If several components have been affected by a failure, there should be a proven plan showing the recommended order of restart, time to complete, etc.
- Failure to complete a unit of work does not result in data corruption or inconsistency and all services must handle any failures while preserving data integrity.
- Any impact on the E2E service by the failure of individual components is understood and documented.
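Where several components are affected, a recommended restart order can be derived from the dependencies between components. The sketch below assumes the dependencies are known and can be written as a mapping from each component to the components it depends on. The function name is hypothetical; `graphlib` is part of the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter


def restart_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order components so that each one starts only after everything it
    depends on has started.

    Raises graphlib.CycleError if the dependencies are circular, which
    would mean no valid restart order exists.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, calling `restart_order({"app": {"db"}, "db": {"storage"}})` with these invented component names places `storage` before `db` and `db` before `app`. Components with no dependency between them may appear in either order, so the result is a starting point for the documented plan rather than the plan itself.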
Shutdown & Resumption
To prove that the system can be shut down and restarted cleanly, either without service disruption or within an agreed window of scheduled downtime.
The testing should prove that:
· Each component can be shut down and resumed successfully within the agreed time scale.
· The order of resumption of the components, if applicable, is valid and documented.
Operational Support & Procedure
To prove that all components of a service are capable of being supported to TS standards.
The testing should prove that:
· Diagnostic information produced in failure situations is of sufficient quality to support any manual or, ideally, automatic corrective actions.
· Any recovery documentation produced or amended, including Service Diagrams, is valid. This should be handed over to the relevant support areas.
· Documentation for each element which covers restart / recovery, error conditions, alerts, etc. must be provided.
· Full remote control capability to resolve error conditions should be proven for all new components and tools.
· It should be possible to perform maintenance of the components without disruption to the service or within an agreed outage as per the SLA. It should be possible to start, shut down and control the service to support maintenance.
Alerts
To prove that alerts are raised in the event of a component failure, error condition or if a threshold is breached.
The testing should prove that:
· Event Monitoring - All critical alerts go to the TEC and reference the correct resolution document. Any system that fails at an infrastructure or application level alerts on failure or is addressed by Heartbeat functionality.
· Threshold Monitoring - Alerts are in place and issued if agreed thresholds are exceeded, e.g. for disk utilisation, CPU or memory.
· Heartbeat Monitoring (End to End) - This mimics customer experience on a regular basis. An alert will be issued if response times breach a threshold predetermined by the business, or if the heartbeat fails an agreed number of times consecutively. The object of the heartbeat is to prove that key business functionality is available and performing to an acceptable standard. If end-to-end heartbeat is not appropriate, then component heartbeats should be applied.
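The heartbeat alerting rule can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not a description of the TEC or of any monitoring product: the class name, field names and the use of seconds are invented for the example. It also assumes that a slow response raises an alert immediately, whereas outright failures raise an alert only once the agreed consecutive count is reached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    max_response_seconds: float    # threshold agreed with the business (assumed unit)
    max_consecutive_failures: int  # agreed number of consecutive failures
    consecutive_failures: int = 0

    def record(self, succeeded: bool, response_seconds: float = 0.0) -> Optional[str]:
        """Record one heartbeat result; return an alert message or None."""
        if not succeeded:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                return f"heartbeat failed {self.consecutive_failures} times in a row"
            return None
        # A successful heartbeat resets the failure streak.
        self.consecutive_failures = 0
        if response_seconds > self.max_response_seconds:
            return (
                f"response time {response_seconds:.2f}s exceeds "
                f"{self.max_response_seconds:.2f}s"
            )
        return None
```

An OAT test could feed the monitor a scripted sequence of results and confirm that an alert appears exactly when the agreed conditions are met.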
Change Driven Risk Management
What is it?
Change Driven Risk Management (CDRM) is a technical framework to assess and manage risk over the project lifecycle. Using a spreadsheet-based tool, CDRM presents the relationship between change and risk mitigation methods as a customised checklist.
It is part of the bank's testing compliance process and must be completed by the Project Management and Test Management communities within Technology Services during the Commence stage of the project.
When CDRM is combined with a programme- or application platform-level Test Strategy it removes the need for a project-level Test Strategy. Note that CDRM is mandatory, while a Test Strategy is not.
Why should projects use CDRM?
CDRM is a technique which projects can use to document a risk-based assessment of the planned changes for their project. In this way it will enable projects to target testing only where there is a significant business or technical risk based on change.
The technique also allows for continuous evaluation of executed tests, the results of which may modify the risk level of non-executed planned tests and effect their removal from the test plan. This will remove unnecessary testing, and help reduce project costs and deliver changes faster, while still controlling risk.
When do you need to start CDRM?
The Project Manager and the Test Manager are required to provide CDRM input and analyse CDRM output during the Commence stage in the project lifecycle. They must provide feedback, via the use of the CDRM comments fields, on the advice given within the CDRM checklist.
As the project progresses through Analysis, Design, Construction, Testing and Implementation the Project Manager must annotate the checklist via the status and date fields to indicate progress with regard to the scheduling and execution of mitigation methods, and revisit the answers to the input questions as more information becomes available.
What inputs are needed?
A good understanding of the scope of the project being undertaken is needed, much of which can be extracted from the Business Change Request, Terms of Reference and high-level Project Plan.
Subsequently, when the CDRM questions and checklist are revisited throughout the development lifecycle, additional input will be required in the form of key lifecycle deliverables such as the Requirements Specification, the Application Design and the Component Designs.
There is no requirement to input any explicit relationships between change, risk and mitigation as the CDRM spreadsheet is “rule based”, and contains a generic set of relationships between types of change, categories of risks and risk mitigation methods. Application-specific knowledge is required to detail any application-specific risks that overrule the generic rules mentioned above.
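The rule-based idea can be illustrated with a small lookup. The sketch below is not the CDRM spreadsheet: the change types, the pairing of risks with methods, and the function name are invented for illustration. Only the risk areas and mitigation method names are borrowed from the OAT description above. Application-specific knowledge is modelled as additions to, and exclusions from, the generic result.

```python
# Illustrative rule tables only; the real CDRM rule set is not reproduced here.
CHANGE_TO_RISKS = {
    "new server build": {"recoverability", "supportability"},
    "failover reconfiguration": {"resiliency"},
}
RISK_TO_METHODS = {
    "recoverability": {"Backup & Recovery"},
    "supportability": {"Operational Support & Procedure", "Alerts"},
    "resiliency": {"Component Failure", "Shutdown & Resumption"},
}


def checklist(changes, extra_methods=frozenset(), excluded_methods=frozenset()):
    """Return the sorted mitigation methods suggested for the given changes.

    extra_methods and excluded_methods stand in for application-specific
    risks that overrule the generic rules.
    """
    methods = set()
    for change in changes:
        for risk in CHANGE_TO_RISKS.get(change, ()):
            methods |= RISK_TO_METHODS.get(risk, set())
    return sorted((methods | set(extra_methods)) - set(excluded_methods))
```

In this sketch a project only states which types of change it is making. The relationships between change, risk and mitigation come from the tables, which mirrors why no explicit relationships need to be entered by the project.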
The CDRM spreadsheet must not be embedded or included in any other document; it can only be linked to. This is required because the CDRM tool will be updated throughout the project lifecycle, which is only practical if it is kept as a separate document.
How to go about it?
As projects come in various sizes, some of which can be similar in size and complexity to a programme, it is important to note that CDRM is very effective on an application by application basis. If a programme or project is changing a number of applications then CDRM output must be produced for each application.
To date, over seventy mitigation methods have been identified, of which under half are testing methods. Reviews can also be used to mitigate risk, and CDRM can help you identify effective methods you may not be familiar with.
The Testing Types matrix indicates where each of the mitigation methods is placed in relation to the development lifecycle stages. Note that this diagram corresponds to the standard testing V-model.
Types of Reviews and Tests
There are many different types of testing - not all of them applicable to any one project. Selecting which tests will be used for a project forms part of the strategy for testing on a project. In order to select the right types of tests to carry out it is necessary to apply the Change Driven Risk Management (CDRM) technique. CDRM uses the relationship between types of change and types of review and test to generate a customised checklist of methods to use throughout the lifecycle.