On Risk Management, Business Continuity, and Security
26 July 2017
A single unfortunate incident (whether it be a major hurricane or a flood caused by a burst pipe) can cause significant losses to a company or, in some cases, wipe it out entirely. Sensible managers take steps to identify and manage these risks. (In some industries, such as banking and healthcare, they generally have a statutory duty to do so.)
When faced with a risk we have a number of strategies available:
When deciding what strategy to adopt, we need to answer three questions:
The answers to these questions are normally presented in two formal documents:
The Risk Assessment Toolkit gives you these and many more reports. In particular it gives you an estimated timeline for every single possible disruption, as well as timelines if planned recovery objectives are not met.
Above all, we need quantitative estimates for these answers. Qualitative assessments (simply grouping probabilities and impacts into categories of high, medium, or low) are easy to create, but they do not answer the fundamental question: how much does it make sense to spend on this? Knowing only that a threat's probability or impact is high, medium, or low, it is impossible to decide how much insurance you need or whether to spend $100,000 to mitigate the threat.
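To make the spending question concrete, here is a minimal sketch in Python. It assumes that the expected annual loss from a threat can be approximated as its yearly frequency multiplied by the cost of one incident, and that a mitigation is worth considering when the loss it removes each year exceeds its own yearly cost. The function names and all figures other than the $100,000 are hypothetical and are not taken from the Risk Assessment Toolkit.

```python
def expected_annual_loss(incidents_per_year: float, cost_per_incident: float) -> float:
    """Rough expected yearly loss: frequency multiplied by cost of one incident."""
    return incidents_per_year * cost_per_incident


def mitigation_is_worthwhile(loss_before: float, loss_after: float,
                             annual_mitigation_cost: float) -> bool:
    """True when the yearly loss avoided exceeds the yearly cost of mitigating."""
    return (loss_before - loss_after) > annual_mitigation_cost


# Hypothetical threat: once per decade, $2,000,000 per incident.
before = expected_annual_loss(0.1, 2_000_000)

# Hypothetical mitigation: halves the frequency and costs $100,000,
# spread over an assumed five-year useful life.
after = expected_annual_loss(0.05, 2_000_000)
annual_cost = 100_000 / 5

worthwhile = mitigation_is_worthwhile(before, after, annual_cost)
```

With "high", "medium", or "low" in place of the two numeric inputs, the comparison in mitigation_is_worthwhile could not be made at all.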
Your first task (once you have management agreement) should be to determine what activities or processes your organization performs. (We will prefer the term activity over process here, because a business process generally involves a number of activities performed at different times and places by different people.)
How do you determine these activities?
The easiest method is to start with an organization chart, which will typically list all the departments and the people responsible for them. Create a representation of the organization hierarchy within the Risk Assessment Toolkit by right-clicking on My Organization in the tree view on the left-hand side and selecting New / Organization Unit or New / Organization Sub-Unit. If you go wrong, you can drag and drop organization units to reorganize them. If you drop an organization unit to the right of another, it becomes a sub-unit of that organization unit; if you drop it on top of it, it becomes a sibling. (If you look at the status bar at the bottom of the window, you can see what the effect of dragging and dropping will be.)
You now need to interview somebody from every department (not necessarily the head of the department). For estimation purposes, assume that you need to interview most of these people for about an hour, and that you will need to interview half of them twice. You should also assume that you will be sending material for review or approval to each of these people.
The basic questions you need to ask are:
While you are recording this information, make sure you have the basic "tombstone" information: Who did you talk to? When did you interview them? How can you contact them if you need more information? You will need this information when you start finding inconsistencies and omissions in the information you have gathered.
As you gather this information, enter it into the Risk Assessment Toolkit. You do this in a similar manner to creating Organization Units: right-click on an Organization Unit to add an Activity for which it is responsible; right-click on a Location to add a resource at that location, or to create a location within it.
Once you have put together this information, you will have a good understanding of the activities of your organization.
The Risk Assessment Toolkit includes a built-in list of common threats which most businesses and organizations should consider to see whether they apply to them. Do not regard this list as exhaustive. In particular, the built-in list will not include any threats to specific industries or threats related to particular equipment or business processes. [Tip: Ask interviewees an open-ended question about threats first. Only use this list as a checklist after you have asked people about specific and historical threats to the activities they are responsible for. When presented with a list of threats, people can sometimes find it difficult to think of threats that have not been included.]
Some of the most common threats facing any business are loss of internet, loss of electricity, loss of computer operations (e.g. equipment failure, operator error, virus infection), local flood (e.g. burst pipe), fire, and theft. Unless your business involves handling (or proximity to) hazardous materials, these types of threat are quite likely to dominate your analysis.
In building up the above picture, you have also identified the locations where everything occurs. You now need to consider specific threats against locations. Is the main office located in an area prone to hurricanes? Tornadoes? Flash flooding? Earthquakes? You can probably find some government estimates of the likelihood of these happening. Remember to associate each of these threats only with the highest "level" of location that it applies to; otherwise you may double-count the frequency of a threat. If your organization has two sites, one in New York and one in San Francisco, then the threat of an earthquake should be associated with each site separately. If, however, the sites were both located on the San Andreas fault, then the threat should be associated with a "Southern California Sites" location, since if it occurs it is likely to affect both sites at the same time.
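The double-counting point can be illustrated with a small sketch. It assumes a simple tree of locations in which each threat is recorded once, at the highest location it applies to, and is inherited by everything inside that location. The Location class, the site names, and the frequencies are hypothetical and do not describe how the Risk Assessment Toolkit stores its data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Location:
    name: str
    parent: Optional["Location"] = None
    # Threats recorded directly at this level: threat name -> occurrences per year.
    threats: Dict[str, float] = field(default_factory=dict)

    def effective_threats(self) -> Dict[str, float]:
        """Threats recorded here plus those inherited from enclosing locations."""
        inherited = self.parent.effective_threats() if self.parent else {}
        return {**inherited, **self.threats}


# Hypothetical figures: one earthquake per 100 years for the whole region.
region = Location("Southern California Sites", threats={"earthquake": 0.01})
site_a = Location("Site A", parent=region)
site_b = Location("Site B", parent=region, threats={"burst pipe": 0.2})

# Each threat is counted once, where it was recorded.
total_frequency = sum(
    rate for loc in (region, site_a, site_b) for rate in loc.threats.values()
)

# Recording the earthquake on both sites instead would count it twice.
double_counted = 0.01 + 0.01 + 0.2
```

Both sites still "see" the earthquake through effective_threats(), but the organization-wide total includes it only once.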
It can also be helpful to distinguish between threats which have varying estimated durations. For example, short power disruptions (several minutes) might be common, but long power disruptions (several hours) may be rare. These are best modeled as two different threats as they typically have very different consequences.
Each activity and resource has associated with it a Recovery Time Objective (RTO). This is the maximum amount of time you are planning to allow after an incident before the activity is resumed at some minimum service level. Typically you will make a recovery time objective as short as you can, keeping in mind that the shorter the recovery time objective the more pre-planning, preparation and redundant facilities you will have to pay for. (Want your Accounts Payable department to have a recovery time objective measured in minutes? Create a parallel department in a different city standing by just in case the first one is disrupted. Horrendously expensive, but doable.)
The Risk Assessment Toolkit will warn you if your assigned RTOs are inconsistent: e.g., an activity has a shorter RTO than another activity it depends upon.
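A consistency check of this kind can be sketched in a few lines. The sketch assumes that an inconsistency means an activity is due back sooner than something it depends upon, since it cannot resume before its dependency does. The function name, the activities, and the hours are hypothetical; this is not the Toolkit's own implementation.

```python
from typing import Dict, List, Tuple


def find_rto_conflicts(rto_hours: Dict[str, float],
                       depends_on: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (activity, prerequisite) pairs where the prerequisite's RTO
    is longer than the RTO of the activity that relies on it."""
    conflicts = []
    for activity, prerequisites in depends_on.items():
        for prerequisite in prerequisites:
            if rto_hours[prerequisite] > rto_hours[activity]:
                conflicts.append((activity, prerequisite))
    return conflicts


# Hypothetical recovery time objectives, in hours.
rto_hours = {
    "Accounts Payable": 4,
    "Accounting Server": 24,
    "Office Network": 2,
}
depends_on = {
    "Accounts Payable": ["Accounting Server", "Office Network"],
    "Accounting Server": ["Office Network"],
}

conflicts = find_rto_conflicts(rto_hours, depends_on)
```

Here Accounts Payable is expected back within 4 hours, but the server it needs is allowed 24 hours, so that pair is reported. Either the activity's RTO must be relaxed or the server's tightened.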
The Recovery Point Objective (RPO) is a measure of how much data (expressed as units of time) you are prepared to lose if an incident occurs. Content with off-site overnight backups? Then your Recovery Point Objective is 24 hours. Can't afford to lose any data? Then you need synchronous replication (replicating all data off-site automatically as a transaction occurs). The shorter the duration, the higher the cost (and the more disruption to day-to-day operations). For analysis purposes it is important to make this value visible so that it can be discussed (and any appropriate technical measures adopted).
Both the RTO and the RPO figures need to be decided in consultation with the person responsible for the activity or resource and while reviewing the timelines associated with the activity or resource.
Because of the range of numbers involved, the normal unit used for expressing the frequency or probability of a threat in an analysis is the Annualized Rate of Occurrence or ARO. This is an estimate of how many times an event is expected to occur in a single year. Most people find thinking in terms of frequencies easier than thinking in terms of probabilities. Frequencies also have the useful property that they sum. If we estimate the ARO of a one-hour power cut as 2, and the ARO of a small fire in the production area as 0.5, then the combined frequency of these two events (assuming that one does not cause the other) is 2.5. The Risk Assessment Toolkit allows you to enter frequencies such as 3 times per 100 years, or 1 time per decade.
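The arithmetic above is simple enough to sketch directly. The helper names below are hypothetical. The conversion from a frequency to a probability is an added assumption that is not in the text: it treats incidents as independent events arriving at a constant average rate (a Poisson process).

```python
import math


def aro(times: float, per_years: float) -> float:
    """Annualized Rate of Occurrence from a statement like '3 times per 100 years'."""
    return times / per_years


power_cut = aro(2, 1)    # one-hour power cut, twice a year
small_fire = aro(1, 2)   # small fire, once every two years (ARO 0.5)

# Frequencies of events that do not cause one another simply add.
combined = power_cut + small_fire


def probability_of_at_least_one(rate: float, years: float = 1.0) -> float:
    """Chance of one or more occurrences in the period, ASSUMING a Poisson process."""
    return 1.0 - math.exp(-rate * years)


rare = aro(3, 100)       # 3 times per 100 years
decade = aro(1, 10)      # 1 time per decade
p_rare_this_year = probability_of_at_least_one(rare)
```

For rare events the frequency and the yearly probability are nearly equal (0.03 against roughly 0.0296 here); for frequent events they diverge, which is one reason frequencies are the more convenient unit.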
You can probably arrive at a ball-park estimate of a frequency fairly easily. (Even if you are out by 50% or 100%, you are still more accurate than if you had estimated a frequency as "low", "medium", or "high".) If you make random errors in your estimates (some are high, some are low) then the errors will tend to cancel out when you consider the actual probability of disruption of an activity. For infrequent threats you should try to make sure your estimates have the correct order of magnitude, rather than getting too concerned about whether the chance of an earthquake is once in a hundred years or twice in a hundred years.
Once again, orders of magnitude are more important than highly precise values. Does it cost $1000/hour or $1250/hour if this operation is disrupted? It probably isn't enough of a difference for you to change your plans as to how to deal with the disruption. But $10,000/hour vs $1,000/hour may well make a difference. The important thing here is to have figures that are accurate enough that nobody will dispute them as being wildly inaccurate at a later date. Round the numbers to one or two significant figures: if you say it will cost approximately $2,100 to replace a piece of equipment, that will be approximately correct for a while. But $2,132.78? That was only true yesterday.
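Rounding to a fixed number of significant figures is easy to get slightly wrong by hand, so a small helper may be useful. This is a minimal sketch; the function name is hypothetical, and the eight-hour outage used at the end is an invented example.

```python
import math


def round_to_significant(value: float, figures: int = 2) -> float:
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit: 3 for 2132.78, -2 for 0.0347.
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - magnitude)


replacement_cost = round_to_significant(2132.78)    # two significant figures
hourly_cost = round_to_significant(1250, 1)         # one significant figure

# Hypothetical eight-hour disruption at the two hourly rates from the text.
outage_hours = 8
cost_at_low_rate = outage_hours * 1_000
cost_at_high_rate = outage_hours * 10_000
```

At one significant figure, $1,000/hour and $1,250/hour become the same number, which matches the point that the difference is unlikely to change a plan, while the tenfold gap between the two outage costs clearly could.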
First of all, make sure you pass back all the information you obtained to the people involved for review. You need people to agree with your results. They are unlikely to do that if they are presented with a fait accompli from somebody they don't know well. Ask for feedback and corrections for individual activities and resources. (You probably don't want everybody to comment on everything: you are less likely to get knowledgeable feedback, and you would be presenting everybody with an unreasonably large task.)
Once you have completed the picture, you can review the Risk Register and Business Impact Analysis with senior managers to agree on how you will deal with the threats to your organization. Is business insurance sufficient? Is the cover high enough? Does it cover the threat and the full impact? Should you spend money to reduce a specific threat? Should you spend time and effort to create detailed plans to cope with the disruption of a particular activity so that it can be recovered in hours rather than days? You now have the information at your disposal to make strategic decisions about risk management and business continuity.
And when an incident occurs, you also have a list of direct and indirect dependencies along with a timeline which will help you determine the priority, time, money, and resources which should be dedicated to restoring that activity.
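As an illustration of how direct and indirect dependencies can guide restoration, the sketch below collects everything an activity ultimately relies on and then derives an order in which prerequisites come back first. The activity names and the dependency map are hypothetical, and this is one possible approach rather than the Toolkit's method. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def all_dependencies(activity: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Everything the activity relies on, directly or through other dependencies."""
    seen: Set[str] = set()
    stack = list(depends_on.get(activity, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


# Hypothetical map: each key depends on the items in its set.
depends_on = {
    "Invoicing": {"Accounting Server"},
    "Accounting Server": {"Office Network", "Electricity"},
    "Office Network": {"Electricity"},
}

needed = all_dependencies("Invoicing", depends_on)

# static_order() lists every item after all of the items it depends on.
restore_order = list(TopologicalSorter(depends_on).static_order())
```

Invoicing names only one direct dependency, but needed also contains the network and the electricity supply, and restore_order places those before the server and the server before Invoicing.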
© Albion Research Ltd. 2017