Microsoft Operations Framework
White Paper
Published: December 2000 version 1.0
For information on Microsoft Operations Framework, see
http://www.microsoft.com/business/services/mcsmof.asp
Risk Model for Operations
Contents
Abstract
Introduction
Why Operations Needs Risk Management
Overview of the Risk Model for Operations
The Five Steps of Risk Management
Relating the Risk Model to MOF
Comparing the Risk Model for Operations to Other Risk Models
Examples
Conclusion
Additional Information
Appendix A: Glossary
Appendix B: Detailed Examples
© 2000 Microsoft Corporation. All rights reserved.
The information contained in this document represents the current view of Microsoft Corporation on the
issues discussed as of the date of publication. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot
guarantee the accuracy of any information presented after the date of publication.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS
OR IMPLIED, IN THIS DOCUMENT.
Microsoft and Windows are either registered trademarks or trademarks of Microsoft in the United States
and/or other countries.
Abstract
This white paper is one of a series about Microsoft® Operations Framework (MOF).
For a complete list of these publications, please see the MOF Web site at
http://www.microsoft.com/mof.
This white paper is intended for all IT staff whose work involves operations and service
management. It explains why risk management is increasingly important in operations,
describes the risk model for operations, relates it to Microsoft Operations Framework,
and illustrates its applicability to real-world operations risks. An appendix details the
management of several risks, and a glossary explains key terms.
Readers should already have read the Microsoft Operations Framework
“Executive Overview” white paper, which contains important background
information for this topic.
Introduction
Executive Summary
Information technology professionals who are responsible for mission-critical systems
have seen their work change in ways that make risk management increasingly
important. The business relies more on information technology (IT) than it used to,
which raises the impact of failure; the IT environment has more moving parts than it
used to, which raises the probability of problems; more people notice IT problems and
react to them, which creates additional consequences for failure; and more of the
infrastructure is outside the IT group’s direct control. At the same time, the IT group has
less time to react, and is less able to manage risk by applying tight change-control
measures. Combined, these trends mean that operations groups are facing larger
challenges with a smaller tool kit. The risk model for operations helps expand the tool
kit.
This white paper explains the core principles and components of the risk model for
operations. It is intended for operations staff at all levels, as well as Microsoft
Consulting Services and partner consultants. The model is applicable in nearly all
organizations, and the examples illustrate situations commonly found at service
providers, “dot-coms” and “e-businesses,” and IT groups of large organizations.
First, the paper makes the case that risk management is becoming more important and
more difficult. It then describes the risk model for operations: a process for managing
risks with a proactive approach that embeds risk management practices into every IT
team role and into every IT process. The paper concludes with examples that show how
the model can be applied to real-world operations risks.
Why Operations Needs Risk Management
Greater Risk of IT Failure
A risk is the possibility of suffering a loss, and risk management is essentially the
process of identifying risks and deciding what to do about them. Risk management is
increasingly important to IT in general, and to operations groups in particular, because
the business is more likely than before to suffer losses as a result of IT decisions.
Both the number and the severity of potential IT failures (specifically the ones related to
IT operations) are rising over time:
• Business transactions are increasingly dependent on IT, so failures in IT are more
likely to impact the business, and that impact is more likely to be severe.
• The IT environment is increasingly complex, so even if the environment stays the
same size, the number of potential failure points is rising.
• IT directly controls less of the infrastructure, so managing the possibility of failure is
more important because IT has less ability to react after the failure occurs.
• When an IT failure occurs, there is less time between the failure and its impact on
the business.
• IT failures are increasingly visible outside the data center, so more people react
negatively when a failure occurs.
In short, IT today has more potential to enable business than ever before, but failures in
IT have more potential to disable business.
At the same time, the traditional risk management strategy of tight change control is less
often available, and less often effective.
As a result, operations needs a larger set of risk management tools.
The next sections provide a closer examination of trends in IT failure.
Business Is More Dependent on IT
Today, more of the systems that IT manages are critical to the business. For example, 10
years ago many companies’ communications were based on non-IT services such as
paper memos, an internal mailroom service, an external postal service, and the
telephone. Today, IT is responsible for communication hubs such as e-mail service,
intranets, and Internet sites: systems that were not considered business-critical a decade
ago. Companies involved in e-commerce are at even greater risk from IT failures,
because those businesses’ core processes (such as value chain, supply chain,
business-to-customer, business-to-business, and business-to-employee) now rely on IT
for their success.
Because business is increasingly dependent on IT services, those services are
increasingly a source of risk to the business: failures in the IT group are more likely to
cause failure in the business as a whole.
The Environment Is More Complex
Simply put, the IT environment includes more “moving parts” today than it did in the
past. There are more desktops, more servers, more connections, more systems
integration, and more end-to-end services. This is partly due to the move from
centralized computing, to client/server computing, to the vision of Microsoft .NET, in
which all objects are distributed. As that progression takes place, the number of items in
the infrastructure increases even if the scope of the infrastructure stays the same.
The diversity of the infrastructure has also increased. For example, IT groups that used
to worry about the links between the terminals and a handful of hosts now keep track of
local area networks (LANs) and wide area networks (WANs); land lines and dial-up
access and wireless links; internal networks as well as connections to the Internet. Client
systems are another example—in the past IT dealt with terminals, but today the client
hardware could range from desktops and laptops, to handheld computers, to wireless
information appliances, to Internet-enabled phones and pagers.
The number of users is also increasing. In the early days, a few operators interacted with
a mainframe, then later the pool of users grew to include a few dozen clerks, then a few
hundred knowledge workers on the mainframe and on personal computers. Today, even
more customers reach e-commerce sites from their home systems. In addition to the
number of users, their autonomy is increasing as well. Mainframe users didn’t upgrade
software on their own, but home users do this all the time.
Because the environment is more complex and diverse, the IT group is more likely to
fail the business than it was in the past.
Traditional IT Directly Controls Less of the Infrastructure
More of the systems that are part of IT services are managed outside of the company.
For example, a retailer that receives orders on its Web site might rely on other
companies’ systems for credit verification, warehousing, and shipping.
The “virtual IT environment” does not necessarily increase the potential for failure, and
in fact it can decrease risk by outsourcing a service to specialists who are best able to
operate it and most able to prevent it from failing. This trend is important for risk
management because the customer may still expect the company to provide the
end-to-end service.
In other words, a virtual IT environment can reduce a company’s control over the entire
infrastructure without reducing the company’s responsibility for keeping the
infrastructure running.
Less Time Between Failure and Impact
If a service fails, there is a window of time during which the IT group can attempt to
recover the service before the failure impacts the business. If a system that prints
customers’ bills were to fail, that window might be hours or days in length because
there is often a three-week delay between printing a bill and expecting to receive the
customer’s payment. If IT can recover the service within that time then the customers
receive bills on time, or close enough that most won’t notice the delay, and the
company’s revenue stream won’t be interrupted. However, an e-commerce site’s
customers may expect transactions to complete within one minute, and expect e-mail
confirmation of each transaction within another five minutes.
In the past, the strategy of “fix-on-fail” was more feasible because there was time to
make the fix, and that’s less often true today.
Failure Is More Visible
Years ago, IT managers might have wondered, “If a service fails in the data center and
no one notices, is it a crisis?” That question has become irrelevant to many IT
organizations because so many more failures are immediately noticeable outside the
data center. Five years ago, if your company’s Web site was unavailable for an hour, the
few people who noticed were your own IT staff. Today, the list of people who would
notice that failure might include hundreds of customers, a dozen competitors, and every
analyst who tracks your company’s stock.
Visibility is important because people not only notice failures, they also react. A case in
point is a well-publicized day-long service outage suffered by an online auction site.
Customers noticed it; to satisfy them, the site’s parent company refunded all the fees it
collected for every auction in progress, a sum reportedly equal to one-third of the
company’s quarterly profits. Analysts and investors noticed the problem, too—the
company lost 25 percent of its market capitalization in two days.
The first four trends make failure more likely and more severe; the visibility of
mission-critical systems outside the company amplifies the severity of failure.
Traditional Techniques Less Useful
The trends above make IT failure a greater risk to the business. At the same time, some
traditional risk management tools are less often applicable.
IT operations staff traditionally managed risks to the production infrastructure by using
change management practices. Changes to the infrastructure were either denied or they
were managed with a strict process and a long timeline. This ensured stability, but
reduced organizational flexibility.
Today, the business environment is changing more rapidly, and the business can adapt
only as quickly as its systems. Business management is more likely to tell IT what to
implement, rather than ask, so IT can less often manage risk by denying the change
requests. An IT group that used to reduce risk through six-week change cycles now
might find itself forced to make changes in six days, six hours, or even less.
Even if business management doesn’t limit the effectiveness of change control, the
technology itself can. For example:
• In the past an IT group could limit risk by announcing that no changes to the
network infrastructure would be permitted for the next 30 days. That edict can’t be
enforced when the Internet is a part of the network.
• In the past an IT group could limit risk by standardizing the hardware and software
of all order entry systems. That’s not an option when customers use their own
computers to order goods from a Web site.
Implications
Over time, environmental complexity increases the probability of failures, dependence
on IT increases the impact of those that occur, and increased visibility amplifies that
impact. As the number and impact of potential failures are rising, IT directly controls
less of the infrastructure, has less time to react, and is less able to apply traditional risk
management methods to deal with the possibility of failure. What’s an IT manager to
do?
Microsoft Operations Framework recommends that operations should integrate risk
management into decision-making the way it has already integrated other critical factors
such as time, money, and labor:
• Risk management should be integrated into operations decision making in every job
function and every role.
• Risk management should be taken seriously and given an appropriate amount of
effort.
• Risk management should be done continuously to ensure that operations is dealing
with the risks that are relevant today, not just the ones that were relevant last quarter.
Fortunately, formalizing risk management practices is an achievable goal. Risk
management is a well-understood discipline, and it is readily applicable to IT
operations, as described in the next section.
Overview of the Risk Model for Operations
Benefits
The risk model for operations applies proven risk management techniques to the
problems that operations staff face every day. There are many models, frameworks, and
processes for managing risks. They’re all about planning for an uncertain future, and the
risk model for operations is no exception. However, it offers greater value than many
others through its key principles, a customized terminology, a structured and repeatable
five-step process, and integration into a larger operations framework. All of these
elements are detailed below.
Origin of the Model
The risk model for operations was developed in response to customer requests for a
framework to help organizations that run their businesses on the Microsoft platform to
manage risk while operating and managing those services. Microsoft Solutions
Framework (MSF) defined a widely applicable risk model whose description is
customized to address risk management during projects, especially software
development and deployment projects. The risk model for operations is based on the
MSF risk model, with extensions and customizations to address the needs of operations
groups.
Characteristics of Risk
Risk has some basic aspects that most people don’t understand or don’t think about, and
a risk management model has to acknowledge them to be successful. Some aspects are
as follows:
• Risk is a fundamental part of operations. The only environment that has no risk is
one whose future has no uncertainty: no question of whether or when a particular
hard disk will fail; no question of whether a Web site’s usage will spike or when or
how much; no question of whether or when illness will leave the help desk
short-staffed. Such an environment does not exist.
• Risk is neither good nor bad. A risk is the possibility of a future loss, and although
the loss itself may be seen as “bad,” the risk as a whole is not. It may help to realize
that an opportunity is the possibility of a future gain. There is no risk without
opportunity, and no opportunity without risk.
• Risk is not something to fear, but something to manage. Because risk is not bad,
it is not something to avoid. Operations teams deal with risks by recognizing and
minimizing uncertainty and by proactively addressing each identified risk. If a loss
is one possible future outcome, then the other possible outcomes are gains, smaller
losses, or larger losses. Risk management lets the team change the situation to favor
one outcome over the others.
Principles of Successful Risk Management
The risk model for operations advocates these principles:
• Assess risks continuously. This means the team never stops searching for new
risks, and it means that existing risks are periodically reevaluated. If either part does
not happen, risk management will not benefit the company.
• Integrate risk management into every role and every function. At a high level,
this means that every IT role shares part of the responsibility for managing risk, and
every IT process is designed with risk management in mind. At a more concrete
level, it means that every process owner:
    - Identifies potential sources of risk.
    - Assesses the probability of the risk occurring.
    - Plans to minimize the probability.
    - Understands the potential impact.
    - Plans to minimize the impact.
    - Identifies indicators that show the risk is imminent.
    - Plans how to react if the risk occurs.
For example, the support manager with overall responsibility for the help desk
function will perform all of these tasks to manage the risks that are most
important for the help desk. Other people in that manager’s extended team may
perform a subset of those tasks: Everyone will help identify new risks, but
perhaps only one or two people will be responsible for estimating probability or
making plans to minimize impact.
• Treat risk identification positively. For risk management to succeed, team
members must be willing to identify risk without fear of retribution or criticism. The
identification of a risk means the team faces one less unpleasant surprise. Until a
risk is identified, the team cannot prepare for it.
• Use risk-based scheduling. Maintaining an environment often means making
changes in a sequence, and where possible the team should make the riskiest
changes first. An example is beta-testing an application. If the company wants 10
features to work, and two of them are so important that the lack of either would
prevent the application’s adoption, test those two first. If they were to be tested last
and either was to fail, then the team would have lost the resources invested in testing
the first eight.
• Establish an acceptable level of formality. Success requires a process that the team
understands and uses. This is a balancing act. If the process has too little structure,
people may use it but the outputs won’t be useful; if it is too prescriptive, people
probably won’t use it at all.
These principles are summarized in the word proactive. A team that practices
proactive risk management acknowledges that risk is a normal part of operations,
and instead of fearing it the team views it as an opportunity to safeguard the
future. Team members demonstrate a proactive mindset by adopting a visible,
measurable, repeatable, continuous process through which they objectively evaluate
risks and opportunities, and take action that addresses risks’ causes as well as
symptoms.
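The risk-based scheduling principle above can be illustrated with a minimal sketch. The `Feature` class, its `critical` flag, and the `schedule_tests` function are hypothetical names introduced only for this illustration; the white paper does not prescribe any implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    critical: bool  # hypothetical flag: would a failure block adoption?


def schedule_tests(features):
    """Return features ordered so adoption-blocking ones are tested first.

    sorted() is stable, so features keep their original relative order
    within the critical group and within the non-critical group.
    """
    return sorted(features, key=lambda f: not f.critical)


# Ten features, two of which are critical (which two is an arbitrary choice).
features = [Feature(f"feature-{i}", critical=i in (4, 9)) for i in range(1, 11)]
ordered = schedule_tests(features)
print([f.name for f in ordered[:2]])
```

If either of the first two tests fails, the team can stop before spending effort on the remaining eight, which is the saving the beta-testing example describes.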
Process Overview
When operations teams use proactive risk management, they assess risks continuously
and use them for decision making. The team carries the risks forward and deals with
them until they are no longer important, or until they occur and become known
problems.
The following diagram illustrates the five steps of the risk management process:
identify, analyze, plan, track, and control. It is important to understand that each risk
goes through all of these steps at least once, and often cycles through numerous times.
Also, each risk has its own timeline, so multiple risks might be in each step at any point
in time.
Later sections will detail these steps.
Figure 1 - The proactive risk management process (steps: 1 Identify, 2 Analyze, 3 Plan,
4 Track, 5 Control; artifacts: Risk Assessment Document, Top n Risks, Retired Risk List)
Risk Lists
The simplest view of the process is that the five steps feed information into and out of
three lists of risks: the master risk list, the top risks list, and the retired risk list.
Understanding the three lists makes the five steps easier to learn.
The Master Risk List
Figure 2
During each step in the process, the team gathers information about a particular risk and
adds that information into the master risk list. Each subsequent step builds on the
previous ones by adding more elements of the risk, or it draws on the current elements
to support decision making. For example, the analyzing step initially adds information
about the risk’s impact and probability. The process is cyclic, so future passes through
the analyzing step may review and revise those impact and probability estimates.
The master risk list is technology-independent. It could be as crude as a set of index
cards, though that would make certain functions (such as sorting and linking) very
labor-intensive. The list can be implemented simply as a Microsoft® Word document or
a Microsoft® Excel spreadsheet, or it can be as complex as a multitiered database
application.
Note that the size of the master risk list is more an indicator of the team’s thoroughness
than an indicator of the IT group’s health or stability.
Top Risks List
Figure 3
Managing risk takes time and effort away from daily operations activities, so it is
important for the team to balance the overhead of risk management against the expected
savings. This usually means identifying a small number of major risks that are most
deserving of the team’s limited time and resources.
One way to think of this is that the master risk list is prioritized, and the risks at the top,
the ones that are important enough to be actively managed, make up a separate top risks
list. The size of this list will vary between IT groups, and within one IT group it is likely
to vary over time.
Retired Risk List
Figure 4
The master risk list holds all the risks that the team has identified, whether they’re
important enough to appear on the top risks list or not. Some of those risks never go
away, such as those related to natural disasters. Others reach a point where they’re no
longer relevant. For example, the team might reduce the probability of the risk to zero.
Or, the source of the risk may leave the environment. Risks specific to an outdated
software application are no longer relevant after that application has been completely
phased out.
Whenever a risk becomes irrelevant, it is moved from the master risk list to the retired
risk list. This list serves as a historical reference from which the team can draw in the
future. For example, if the team has previously tracked risks related to help desk
processes, and then the help desk function is outsourced to another company, some of
the help desk-related risks might be retired. If the help desk function is later brought
back in-house, the team can draw on the retired risk list for guidance. Also, people may
consult this list as a starting point for identifying new risks. Finally, if the team lowers a
risk’s probability or impact to zero, then the notes about what the team did may benefit
other people facing similar risks.
When thinking about retiring risks, it can be useful to consider risks as having multiple
instances. For example, a corporate merger introduces the risk of severe IT budget and
staff cuts. If the group survives one merger it can retire that instance of the risk, but
other instances remain because subsequent mergers might happen.
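The relationship among the three lists can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Risk` and `RiskRegister` classes and the numeric `priority` field are hypothetical, and the paper leaves the storage technology open (index cards, a document, a spreadsheet, or a database would all serve).

```python
from dataclasses import dataclass, field


@dataclass
class Risk:
    description: str
    priority: float  # hypothetical ranking value; higher means more important
    notes: list = field(default_factory=list)


class RiskRegister:
    """Holds the master risk list and the retired risk list."""

    def __init__(self):
        self.master = []
        self.retired = []

    def add(self, risk):
        self.master.append(risk)

    def top_risks(self, n):
        """Return the n highest-priority risks from the master list."""
        return sorted(self.master, key=lambda r: r.priority, reverse=True)[:n]

    def retire(self, risk, note):
        """Move a risk that is no longer relevant to the retired list."""
        self.master.remove(risk)
        risk.notes.append(note)  # keep the history for future reference
        self.retired.append(risk)


register = RiskRegister()
disk = Risk("Hard disk fails on the Web server", priority=8.0)
legacy = Risk("Outdated application causes failures", priority=3.0)
quake = Risk("Earthquake damages the data center", priority=5.0)
for r in (disk, legacy, quake):
    register.add(r)

top = register.top_risks(2)
register.retire(legacy, "Application completely phased out")
```

Here the top risks list is derived from the master list rather than stored separately, which is one way to reflect the description of it as the highest-priority portion of the master list.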
Process-wide Best Practices
These suggestions apply to all five steps of the risk model.
Adopting Risk Management
If an operations group does not already have a culture of risk management, then
adopting it can be a significant change. The biggest obstacle to that change is the
complexity of the process. People who are not yet doing risk management in a
structured way generally don’t see the need to change, and if the risk management
process is too complex then people are likely to dismiss it as unproductive busywork.
Keep this in mind when considering the best practices in this white paper. They make
risk management more effective, but some also increase the complexity.
Chief Risk Officers
A growing practice is to appoint chief risk officers (CROs) and to have a risk
management team within the IT organization that is separate from the operations teams.
In these cases, the division of responsibilities between the risk management team and
the operations team should be very clear.
At first glance, the CRO position might seem to contradict the key principle of
integrating risk management into all job roles and functions. The distinction is whether
everyone plays a part in risk management. If so, then having a specific role such as a
CRO focus on risk management full time can be very helpful, acting as a specialist and
mentor, and coordinating risk management activities that might otherwise be inefficient
or even contradictory.
Emergency Response Teams
Large organizations often have emergency response teams (ERTs) that react to critical
failures and disasters. They are trained to respond by following established emergency
response and contingency plans. These teams need to be included during all phases of
risk management, and especially contingency planning.
Human Resources and Training
Risk management is very much dependent on operations personnel. It should commence
from the day an employee is hired. Ideally, make risk-management skills a factor when
hiring people into the IT group. Give everyone access to risk management training.
Also, make sure that everyone receives proper job training. The better people
understand a job, the more effectively they will identify and address its risks.
The Five Steps of Risk Management
Step 1: Risk Identification
Figure 5
Risk identification is the first step in the proactive risk management process. It provides
the opportunities, cues, and information that allow the team to raise major risks before
they adversely affect operations and hence the business.
This step is closely related to the IT Infrastructure Library (ITIL) term
“classification”—formally identifying incidents, problems, and known errors by origin,
symptoms, and causes.
In this step, the team identifies the components of the risk statement:
• Condition
• Operations consequence
• Business consequence
• Source of risk
• Mode of failure
Condition and Consequences
An intuitive way to discuss the future is “if-then” statements: The condition is the “if”
part of the statement, and the consequence is the “then” part. For example, “If the Web
server’s sole power supply fails, then the company’s Web site will be unavailable.”
Note that there can be a many-to-many relationship between condition and
consequence. A single condition can cause numerous consequences, as the following
table shows.
Condition:
    If the Web server’s sole power supply fails …
Consequences:
    … then the server will be unavailable.
    … then the operations team will incur the cost of a replacement power supply.
    … then someone must interrupt their regular work to install the new power supply.
Or, numerous conditions can cause the same consequence, as shown below.
Conditions:
    If the Web server’s sole power supply fails …
    If a technician takes the server offline to install an application patch …
    If a construction crew cuts the buried cable that links the data center to the
    rest of the company …
Consequence:
    … the server will be unavailable.
It is important to separate the consequence into two parts during identification: the
operations consequence and the business consequence. Continuing the previous
example, the operations consequences include the cost of a power supply and the labor
to install it; the business consequences might include the damage to the company’s
reputation, and lost revenue if the site was being used for e-commerce. Distinguishing
between these is critical later in the process when the team ranks risks to ensure the
most important ones get the attention they deserve, because a risk may have a high
operational consequence but a low business consequence, or vice versa.
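The if-then structure and the split between operations and business consequences can be captured in a small data structure. The sketch below is illustrative only: the `RiskStatement` class and the `conditions_leading_to` helper are hypothetical names, and filing "the server will be unavailable" under operations consequences is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RiskStatement:
    condition: str  # the "if" part of the statement
    operations_consequences: list = field(default_factory=list)
    business_consequences: list = field(default_factory=list)

    def sentences(self):
        """Render one if-then sentence per consequence."""
        consequences = self.operations_consequences + self.business_consequences
        return [f"If {self.condition}, then {c}." for c in consequences]


def conditions_leading_to(statements, consequence):
    """List every condition that can cause the given consequence."""
    return [
        s.condition
        for s in statements
        if consequence in s.operations_consequences + s.business_consequences
    ]


unavailable = "the server will be unavailable"  # assumed operations consequence
statements = [
    RiskStatement(
        "the Web server's sole power supply fails",
        operations_consequences=[
            unavailable,
            "the operations team will incur the cost of a replacement power supply",
        ],
        business_consequences=["the company may lose e-commerce revenue"],
    ),
    RiskStatement(
        "a technician takes the server offline to install an application patch",
        operations_consequences=[unavailable],
    ),
]

print(statements[0].sentences()[0])
print(conditions_leading_to(statements, unavailable))
```

One condition yields several sentences, and one consequence maps back to several conditions, which is the many-to-many relationship described above.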
Source of Risk
There are four main sources of risk in IT operations:
• People. Everyone makes mistakes, and even if the group’s processes and technology
are flawless these human errors can put the business at risk.
• Process. Flawed or badly documented processes can put the business at risk even if
they are followed perfectly.
• Technology. The IT staff may perfectly follow a perfectly designed process, yet fail
the business because of problems with the hardware, software, and so on.
• External. Some factors are beyond the IT group’s control but can still harm the
infrastructure in a way that fails the business. Natural events such as earthquakes
and floods fall into this category, as do externally generated, man-made problems
such as civil unrest, computer virus attacks, and changes to government regulations.
These are broad categories, and at face value they overlap. For example, if a newly
hired operator undergoes training on the backup software, and a week later makes a
mistake that causes the backup to fail, is the source of risk “people” or “process”? If the
company relies on a telecommunications company (a “telco”) for Internet access and
that telco’s hardware fails, is that “technology” or “external”?
There are many ways to decide which category a risk fits in, and it is more important to
define one way and stick to it, rather than spend time seeking a “perfect” way. One
option is to ask whether the IT group has any control over the risk’s cause. If not, the
source is external. This would define a telco’s hardware problem as “external.” For the
other three sources, would the problem have occurred if the person had been different,
or if the process had been different, or if the technology had been different? This would
define the operator’s failure as “people” if the operator didn’t pay attention during
training or forgot the lesson, or as “process” if the training was incomplete or
badly designed.
Why worry about the source of risk? Because it will affect the way the team manages
the risk in later steps of the process. For example, the team will deal with the possibility
of inattentive trainees differently than the possibility of poor-quality training materials.
Mode of Failure
There are four main ways in which operations can fail the business:
 Cost. The infrastructure can work properly, but at too high a cost, causing too little
return on investment.
 Agility. The infrastructure can work properly, but be unable to change quickly
enough to meet the business needs. Capacity problems are the most obvious case.
For example, someone might have a dozen new servers ready to support increased
processing needs, but forget that the cooling systems in the data center were already
at peak capacity, and upgrading those systems will take a month.
 Performance. The infrastructure can fail to meet users’ expectations, either because
the expectations were set wrong, or because the infrastructure performs incorrectly.
 Security. The infrastructure can fail the business by not providing enough protection
for data and resources, or by enforcing so much security that legitimate users can’t
access data and resources.
Best Practices
These best practices will be beneficial during the risk identification step.
Continual Identification
When a group adopts risk management, the first step is often a brainstorming session to
identify risks. Identification does not end with that meeting. Identification happens as
often as changes are able to impact the IT infrastructure—which is to say, it happens
every day.
Discussions
Identification discussions are very important, and a key to success is representing all
relevant viewpoints, including stakeholders as well as different parts of the operations
staff. This is a powerful way to expose assumptions and differing viewpoints.
The ultimate goal of the identification discussion is to improve the team’s risk
management. Measure progress against that goal by the substance of the discussion, not
by the number of words in the risk statement it produces, or by how many minutes it
takes to create each risk statement. Sometimes the most valuable discussions take the
most time and yet produce the fewest words. This is especially true when the team first
starts using the risk model for operations.
Thinking about risk is a skill that takes time to develop, and it is far easier to develop in
group discussions than alone in an office.
Source-Mode Matrix
The set of all possible conditions is nearly infinite, and the sheer volume can make it
hard for the team to focus on one at a time, especially during brainstorming. An
effective solution, and one that has benefits later in the process, is to subdivide all of the
possible conditions into a table with one row for each of the four sources of risk, and
one column for each of the four modes of failure:
                              Mode of failure
Source of risk    Cost    Agility    Performance    Security
People
Process
Technology
External
The team can then focus on one cell of the table at a time. For example, team members
might ask themselves, “How might people in the operations group make mistakes that
would cause us to do the right work at too high a cost?” Or they might ask, “How could
our technology fail to meet customers’ performance expectations? Or more specifically,
how might hardware problems cause the sales group’s order entry system to bog
down?”
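The cell-by-cell walk described above can be sketched in a few lines of Python. This is only an illustration: the names SOURCES, MODES, and brainstorm_cells, and the wording of the generated prompt, are assumptions and not part of MOF.

```python
from itertools import product

# The four sources of risk and four modes of failure named in this paper.
SOURCES = ["People", "Process", "Technology", "External"]
MODES = ["Cost", "Agility", "Performance", "Security"]


def brainstorm_cells():
    """Yield one (source, mode, prompt) tuple for each cell of the matrix."""
    for source, mode in product(SOURCES, MODES):
        # Hypothetical prompt wording; a facilitator would tailor it.
        prompt = (f"How might a {source.lower()} condition "
                  f"cause a {mode.lower()} failure?")
        yield source, mode, prompt


# Print all 16 cells, row by row, so the team can take them one at a time.
for source, mode, prompt in brainstorm_cells():
    print(f"[{source} / {mode}] {prompt}")
```

Iterating row by row keeps the team on one source of risk until all four modes of failure have been considered for it.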
The Risk Statement
Before a team can manage a risk, the team must clearly express it, and in practice this
can be a bigger challenge than it seems at first:
 Phrasing a risk often requires rethinking assumptions about a situation and
reevaluating the elements that are most important.
 Writing down risks is critical, yet for various reasons the risks a person has thought
through often stay locked inside his or her mind. The team can’t manage a risk that
isn’t shared.
The risk statement should include all parts diagrammed in the example below. Note that
the condition and the two consequences reflect the risk’s source of risk and mode of
failure, respectively.
Figure 6 - The risk statement
(Diagram: the source of risk (People, Process, Technology, or External) characterizes
the condition, “If people do this ...”. The mode of failure (Cost, Agility, Performance,
or Security) characterizes the operations consequence, “... then operations will suffer
this loss of performance ...”, and the business consequence, “... which will harm the
business in this way...”.)
Risk Statement Form
A helpful way to present the information gathered during this step is through a risk
statement form, which may add information that will be valuable later, during the risk
tracking step. In addition to the five parts of the risk statement (condition, source of risk,
mode of failure, operational consequence, business consequence), the statement form
should include the following:
 Role or function. The service management function most directly involved with the
risk situation.
 Risk context. A paragraph containing additional background information that helps
to clarify the risk situation.
 Related risks.
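As a rough sketch, the risk statement form could be held in a small record type. The class name RiskStatement, its field names, and the sample business consequence below are assumptions made for illustration; the paper prescribes the parts of the form, not a data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RiskStatement:
    # The five parts of the risk statement.
    condition: str
    source_of_risk: str           # People, Process, Technology, or External
    mode_of_failure: str          # Cost, Agility, Performance, or Security
    operational_consequence: str
    business_consequence: str
    # Additional fields from the risk statement form.
    role_or_function: str = ""
    risk_context: str = ""
    related_risks: List[str] = field(default_factory=list)

    def sentence(self) -> str:
        """Render the statement in its if / then / which form."""
        return (f"If {self.condition}, then {self.operational_consequence}, "
                f"which {self.business_consequence}.")


# Hypothetical example; the business consequence is invented for illustration.
backup_risk = RiskStatement(
    condition="a hard disk fails",
    source_of_risk="Technology",
    mode_of_failure="Performance",
    operational_consequence="its data cannot be recovered from tape backup",
    business_consequence="leaves the business without the lost records",
    role_or_function="Storage management",
)
print(backup_risk.sentence())
```

Keeping the five required parts as mandatory fields means a statement cannot be recorded until the team has expressed all of them.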
Risk Factor Charts
A risk factor chart helps the group quickly determine the exposure it faces in general
categories of risk. One line of such a chart might look like this:
Risk:
If a hard disk fails, its data cannot be recovered from tape backup.
Cues of High Exposure:
No one is formally accountable for performing backups.
Only one operator has been trained on the new version of the software.
The backup operator who has been trained cannot be reached except during his/her shift.
Cues of Medium Exposure:
Managers ensure that backups are made every day, but making them is a low-status job
assigned to operators with the least seniority.
All backup operators attend a one-hour class, but that training covers only the backup
software User’s Guide and it has no hands-on exercises.
Cues of Low Exposure:
Each week’s tapes are sampled and restored to verify integrity.
Two backup operators are on shift at all times.
Only backup operators who have vendor certification are allowed to make backups
without supervision.
Step 2: Risk Analysis
Figure 7
Risk analysis builds on the risk information generated in the identification step,
converting it into decision-making information. In the analyzing step, the team adds
three more elements to the risk’s entry on the master risk list: the risk’s probability,
impact, and exposure. These elements allow the team to rank risks, which in turn allows
the team to put the most energy into managing the most important risks.
Risk Probability
This is the likelihood that the condition will actually occur. Risk probability must be
greater than zero, or the operations risk does not pose a threat to the business. Likewise,
the probability must be less than 100 percent or the risk is a certainty—in other words, it
is a known problem.
The probability can also be seen as the likelihood of the consequence, because if the
condition occurs, the probability of the consequence is assumed to be 100 percent.
Risk Impact
Risk impact measures the severity of adverse effects, or the magnitude of a loss, caused
by the consequences.
Deciding how to estimate losses is not a trivial matter. The most effective solution is a
numeric scale: the larger the number, the greater the impact. As a rule of thumb, the
scale should go at least as high as three, in order to produce a range of exposure values.
However, note that the higher the scale goes, the more time people spend picking exactly
the right number, without producing much real additional accuracy.
Risk Exposure
The exposure is the result of multiplying the probability by the impact. Sometimes a
high-probability risk has low impact and can be safely ignored; sometimes a
high-impact risk has low probability and can be safely ignored. The risks that have high
probability and high impact are the ones most worth managing, and they’re the ones that
produce the highest exposure values.
When estimating probability and impact, it is often valuable to note your confidence
level. For example, if a risk might result in a million-dollar loss but the confidence that
the data are accurate is only 20 percent, document it so that the people who review the
risk analysis can put this estimate in proper perspective.
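The exposure calculation and the resulting ranking can be sketched as follows. The class AnalyzedRisk, the function rank, the 1-to-3 impact scale, and all sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnalyzedRisk:
    name: str
    probability: float  # strictly between 0 and 1 (0 is no threat, 1 is a known problem)
    impact: int         # position on the team's numeric scale, here assumed 1 to 3

    def __post_init__(self):
        if not 0.0 < self.probability < 1.0:
            raise ValueError("probability must be greater than 0 and less than 1")
        if self.impact < 1:
            raise ValueError("impact must be at least 1")

    @property
    def exposure(self) -> float:
        """Exposure is probability multiplied by impact."""
        return self.probability * self.impact


def rank(risks):
    """Return the risks ordered from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Hypothetical sample values.
risks = [
    AnalyzedRisk("Server delivered late", probability=0.25, impact=2),
    AnalyzedRisk("Backup unrecoverable", probability=0.5, impact=3),
    AnalyzedRisk("Disk full", probability=0.75, impact=1),
]
for r in rank(risks):
    print(f"{r.exposure:.2f}  {r.name}")
```

In this sample the high-probability, low-impact risk ("Disk full") ranks below the risk that scores moderately on both dimensions, which matches the reasoning above.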
Best Practices
These best practices will be beneficial during the risk analysis step.
Settle Differences of Opinion
It is unlikely that a team will agree on risk ranking because team members with
different experiences or viewpoints will rate probability and impact differently.
Discussions can easily turn emotional, or at least political, and to maintain objectivity in
the discussion and to limit arguments, be sure to decide as a team how to resolve these
differences before starting this step. Options include a majority-rule vote, picking the
worst-case estimate, or siding with the person who has the longest experience dealing
with the situation in which the risk occurs.
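The three options can each be written as a simple rule, which may help a team agree on one in advance. The function names and sample estimates below are hypothetical, and the tie-breaking behavior of the vote is an assumption.

```python
from collections import Counter


def majority_vote(estimates):
    """Return the most common estimate (ties go to the value seen first)."""
    return Counter(estimates).most_common(1)[0][0]


def worst_case(estimates):
    """Return the highest, most pessimistic estimate."""
    return max(estimates)


def most_experienced(estimates_by_experience):
    """Given (years_of_experience, estimate) pairs, return the estimate
    from the person with the most years of experience."""
    return max(estimates_by_experience, key=lambda pair: pair[0])[1]


# Hypothetical impact estimates on a 1-to-3 scale from four team members.
impact_estimates = [2, 3, 2, 1]
print(majority_vote(impact_estimates))               # 2
print(worst_case(impact_estimates))                  # 3
print(most_experienced([(2, 2), (10, 1), (5, 3)]))   # 1
```

The three rules can give three different answers for the same inputs, which is why the choice is best made before the estimates are on the table.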
Measure Financial Impact
It is often helpful to roughly estimate the impact in financial terms, and record this in
addition to the impact’s numeric estimate. If several risks have the same exposure value
then the financial estimate can help determine which one is most important. Also, the
financial data helps in the planning step to ensure that the cost of preventing a risk is
lower than the cost of incurring the consequences.
It might seem that the financial estimate is preferable, and could be used in place of a
numeric value. However, in practice, financial impact values tend to be a much more
labor-intensive way to produce the same top risks list.
If you decide to use a monetary scale for impact, use it for all risks. If one risk’s impact
uses a numeric scale and another’s uses a monetary scale, then the two can’t be
compared to each other, so there’s no way to rank one over the other.
Perform a Business Impact Analysis
Perform a business impact analysis, for example using a questionnaire that IT users fill
out, estimating the importance and impact of service outages. This can help IT
understand the services’ perceived value, and this might be a factor to consider when
ranking risks.
Record the Impact’s Classification
Some IT groups find it useful to categorize the nature of the impact, such as capital
expenditure, legal, labor, and so on.
Step 3: Risk Action Planning
Figure 8
The planning step turns risk information into decisions and actions. Planning involves
developing actions to address individual risks, prioritizing the actions related to each
risk, and creating an integrated risk management plan.
Key tasks within this step include defining three more elements of the risk: mitigations,
triggers, and contingencies.
Mitigations
Mitigations are steps the team can take before the condition occurs, and each has one of
three effects on the risk:
 Reduce. Risk reduction minimizes the risk’s probability or its impact, or both. For
example, redundancy generally reduces the impact of failure. If one component fails
there is no impact because the redundant component is still working. Keeping track
of those components’ expected lifespan and replacing them before they’re expected
to fail reduces the probability of the failure. Ideally, a reduction method reduces
probability or impact to zero, but this is not always possible.
 Avoid. Risk avoidance prevents the team from taking actions that increase exposure
too much to justify the benefit. An example is upgrading an unimportant, rarely used
application on all 50,000 desktops of an enterprise. In most cases, the benefit doesn’t
justify the exposure, so IT avoids the risk by not upgrading the application.
 Transfer. Whereas the avoidance strategy eliminates a risk, the transference
strategy often leaves the risk intact but shifts responsibility for it to another group.
For example, a company with an e-commerce site might outsource credit
verification to another company. The risks still exist, but they become the outsource
partner’s responsibility. However, if the outsource partner is better able to perform
credit verification, then transferring the risks can also reduce them.
It is vitally important to assign an owner to every mitigation plan, and it is helpful to
define the plan’s milestones in order to track its progress, and its success metrics in
order to track whether it is having the desired effect.
Triggers
Triggers are indicators that tell the team a condition is about to occur, or has occurred,
and therefore it is time to put the contingency plan into effect.
When defining the risk elements, it can be difficult to distinguish between consequences
and triggers. Ideally, the trigger becomes true before the consequences occur. It may
help to think of them as warning lights that illuminate while there is still time to avoid
danger. For example, if the condition is that the server runs out of hard disk space, the
trigger might be that the server’s disk has reached 95 percent of its capacity and is
trending upward.
In some cases, the triggers may be date-driven. For example, if the condition is that a
newly ordered server might not arrive in time to support the launch of a mission-critical
application, a trigger might be set for the latest date on which the server could safely
arrive.
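Both kinds of trigger reduce to a simple true/false test. In the sketch below, the function names are hypothetical, "trending upward" is assumed to mean that the latest reading is higher than the previous one, and the date-driven trigger is assumed to fire only after the latest safe date has passed.

```python
from datetime import date


def disk_trigger(samples, threshold=0.95):
    """samples: chronological disk-usage fractions between 0.0 and 1.0.

    True when the latest reading is at or above the threshold and is
    higher than the previous reading (the assumed meaning of "trending upward").
    """
    if len(samples) < 2:
        return False
    previous, latest = samples[-2], samples[-1]
    return latest >= threshold and latest > previous


def delivery_trigger(today, latest_safe_arrival, arrived):
    """True when the latest safe arrival date has passed and the server
    has still not arrived."""
    return (not arrived) and today > latest_safe_arrival


print(disk_trigger([0.90, 0.93, 0.96]))   # True
print(disk_trigger([0.97, 0.96]))         # False: above threshold but falling
print(delivery_trigger(date(2000, 12, 2), date(2000, 12, 1), arrived=False))  # True
```

A real monitoring tool would likely smooth the readings over a longer window before declaring a trend; the two-sample comparison is only the simplest possible stand-in.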
Contingencies
A contingency is a step the team takes if the condition occurs or a trigger becomes true.
The contingency plan documents the set of contingencies the team will use when
reacting to a particular condition.
Continuing the previous example, if the server does not arrive in time and the trigger
becomes true, one contingency might be to borrow an existing server from a
less-important service.
If the condition is that the server runs out of disk space, a trigger might be set to notify
operators when only 5 percent of the disk is still free. One contingency might be to free
disk space by moving little-used files to another server. Another contingency might be
to shut down non-critical applications so that a mission-critical one has no competition
for the remaining 5 percent of the disk’s space.
Best Practices
This best practice will be beneficial during the risk action planning step.
Prioritize
A mitigation plan might have several actions, and the sequence might affect the
mitigation’s success at reducing, avoiding, or transferring the risk, so it is important to
prioritize the steps in this plan.
A contingency plan is essentially a description of how to shift away from normal
operations when a condition occurs. Especially if the consequences disrupt many
services, it may be valuable to bring some services back on line first. Decide beforehand
the order in which to restore service, and decide how long each part can be offline.
Step 4: Risk Tracking
Figure 9
During the tracking step, the team gathers information about how risks are changing;
this information supports the decisions and actions that will be made in the next step
(control).
This step monitors three main changes:
 Trigger values. If a trigger becomes true, the contingency plan needs to be
executed.
 The risk’s condition, consequences, probability, and impact. If any of these
change (or are found to be inaccurate), they need to be reevaluated.
 The progress of a mitigation plan. If the plan is behind schedule or isn’t having the
desired effect, it needs to be reevaluated.
This step monitors the above changes on three main timeframes:
 Constant. Many risks in operations can be monitored constantly, or at least many
times each day. For example, automated tools can monitor a Web server’s
bandwidth usage every few seconds.
 Periodic. The team periodically reviews the top risks list, looking for changes in the
major elements. This often happens at team meetings, change advisory board
meetings, and so on.
 Ad-hoc. In some cases, someone simply notices that part of a risk has changed.
Risk Status Reporting
For operations reviews, the team should show the major risks and the status of risk
management actions. If operations reviews are regularly scheduled (monthly or at major
milestones), it helps to show the previous ranking of risks as well as the number of
times a risk was in the “top risk” list.
Best Practices
This best practice will be beneficial during the risk tracking step.
Review Routinely
Make risk review a part of regular work. This typically means making it a permanent
agenda item for any recurring meeting. The review can be highly effective without
taking very much time. This is the key to managing risks continuously.
Review Triggers
If the team has highly visible triggers that are automated and constantly monitored, it
can be easy to focus on them and overlook triggers that can’t be automated. Forgetting
to review triggers during a team meeting means that if one of them has become true, it
won’t be noticed until the next meeting, further delaying the contingency plan, and
often compounding the consequences.
Review Trends
Look for trends in risk data. For example, if a particular risk’s probability has increased
5 percent every week for the last month, then even though the probability is still low,
the trend may justify ranking the risk higher on the top risks list.
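A trend check of this kind can be sketched as a test for a steadily rising series. The function name rising_trend and the four-week window are assumptions; a team would pick its own window and its own definition of a trend.

```python
def rising_trend(weekly_probabilities, min_weeks=4):
    """True when the probability rose in each of the last `min_weeks` weeks.

    weekly_probabilities: chronological list of weekly probability estimates.
    """
    # min_weeks consecutive increases need min_weeks + 1 readings.
    recent = weekly_probabilities[-(min_weeks + 1):]
    if len(recent) < min_weeks + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical history: still a low probability, but rising every week.
history = [0.05, 0.10, 0.15, 0.20, 0.25]
print(rising_trend(history))  # True
```

A result of True does not change the risk's exposure by itself; it only flags the risk for discussion at the next review.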
Step 5: Risk Control
Figure 10
The previous step (tracking) gathers information about a risk, and when something
changes, the controlling step executes a planned reaction to the change:
 If a trigger value has become true, then execute the contingency plan.
 If a risk has become irrelevant, then retire the risk.
 If the condition or a consequence has changed, then redirect to the identification step
to reevaluate that element.
 If the probability or impact has changed, then redirect to the analyzing step to update
the analysis.
 If a mitigation plan is no longer on track, then redirect to the planning step to review
and revise the plan.
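The five planned reactions above amount to a lookup from the kind of change detected to the action taken. The sketch below shows that lookup; the enum Change, its member names, and the function control are hypothetical.

```python
from enum import Enum, auto


class Change(Enum):
    TRIGGER_TRUE = auto()
    RISK_IRRELEVANT = auto()
    CONDITION_OR_CONSEQUENCE_CHANGED = auto()
    PROBABILITY_OR_IMPACT_CHANGED = auto()
    MITIGATION_OFF_TRACK = auto()


# One planned reaction per kind of change reported by the tracking step.
CONTROL_ACTIONS = {
    Change.TRIGGER_TRUE: "execute the contingency plan",
    Change.RISK_IRRELEVANT: "retire the risk",
    Change.CONDITION_OR_CONSEQUENCE_CHANGED: "redirect to the identification step",
    Change.PROBABILITY_OR_IMPACT_CHANGED: "redirect to the analyzing step",
    Change.MITIGATION_OFF_TRACK: "redirect to the planning step",
}


def control(change: Change) -> str:
    """Return the planned reaction for a change detected during tracking."""
    return CONTROL_ACTIONS[change]


print(control(Change.MITIGATION_OFF_TRACK))  # redirect to the planning step
```

Writing the mapping down in one place makes it easy to check that every kind of change has exactly one planned reaction.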
At first this step may not seem necessary, and the distinction between it and the tracking
step may be unclear. In practice, the need to act is often detected by a tool, or by people
who don’t have the required responsibility, authority, or expertise to react on their own.
The controlling step ensures that the right people act at the right time.
For example:
 An automated tool might constantly monitor a Web server’s bandwidth usage. A
trigger has been defined so that if the usage jumps 10 percent in 10 minutes, then the
tool pages an operator who can execute a contingency plan by allocating more
bandwidth to the server. Detecting the change is part of the tracking step, paging the
operator is the transition from tracking to control, and the operator’s action is the
controlling step.
 The IT group might not have the expertise to operate certain applications involved
with e-commerce, and that lack of expertise creates certain risks. Suppose that
another company is contracted to manage those applications, so some of those risks
may no longer be relevant. Realizing that some risks might now be irrelevant is part
of the tracking step. The controlling step starts the reevaluation, and if the risk is
found to be irrelevant, the controlling step retires the risk.
 Suppose someone researches security problems and finds that the impact of a known
risk may be much smaller than its current estimate. Realizing the need to reevaluate
the risk is the tracking step. In the controlling step someone is assigned to reevaluate
the impact, and control of the risk passes to the analyzing step where the
impact can be reviewed.
 Suppose a mitigation plan was designed to reduce a particular risk’s probability by
20 percent within three months. A periodic review of the mitigation plan shows that
in the first two months it has reduced the probability only 5 percent, and the risk
owner is told to investigate the shortfall. The periodic review is the tracking step.
The controlling step consists of notifying the owner that the mitigation plan needs
reevaluating.
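The first example above, the bandwidth trigger, can be sketched as a rate-of-change test. This is a minimal illustration under stated assumptions: the function name usage_jumped is hypothetical, and "jumps 10 percent in 10 minutes" is read as a relative increase of at least 10 percent over the oldest reading taken within the last 10 minutes.

```python
def usage_jumped(samples, window_minutes=10, jump=0.10):
    """samples: chronological list of (minute, usage) pairs.

    True when the latest usage is at least `jump` (as a fraction) higher
    than the oldest reading taken within the last `window_minutes`.
    """
    if not samples:
        return False
    now, current = samples[-1]
    in_window = [usage for minute, usage in samples if now - minute <= window_minutes]
    baseline = in_window[0]
    if baseline <= 0:
        return False
    return (current - baseline) / baseline >= jump


# Hypothetical readings: (minute, bandwidth usage in arbitrary units).
readings = [(0, 100), (5, 104), (10, 111)]
if usage_jumped(readings):
    # In the example, this is the point where the tool pages an operator,
    # marking the transition from tracking to control.
    print("page the operator")
```

Detecting the jump belongs to the tracking step; what the function's caller does with a True result is where the controlling step begins.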
Best Practices
The controlling step relies heavily on effective communication, both to receive
notification that parts of risks and plans have changed, and to ensure that the right
people take action at the right time. The controlling step can’t be effective unless
communication within IT is also effective.
Relating the Risk Model to MOF
One of Three Core Models
MOF is a collection of best practices, principles, and models. It provides comprehensive
technical guidance for achieving mission-critical production system reliability,
availability, supportability, and manageability on Microsoft’s products and
technologies.
MOF is composed of three core models that are closely integrated with each other:
 The process model for operations
 The team model for operations
 The risk model for operations
The process model defines a set of service management functions (SMFs) and four
review milestones that provide operational integrity in the IT infrastructure. The team
model consists of a set of operations team role clusters that efficiently support the
operations process. The associated risk model manages the risks inherent in the IT
operations.
Additionally, MOF is based on and recognizes the current industry best practices for IT
service management that have been documented in the Central Computer and
Telecommunications Agency’s (CCTA) IT Infrastructure Library.
The MOF Process Model
MOF simplifies the complex set of dynamics involved in the IT operations
infrastructure into a framework that is easy to understand and whose principles and
practices are easy to incorporate and apply. The power of this simplified approach will
enable the operations staff of an enterprise of any size, regardless of maturity level, to
realize tangible benefits to the existing, or proposed, operations.
The MOF process model has four main concepts that are key to understanding the
model:
 IT service management, like software development, has a life cycle.
 The life cycle is made up of distinct, logical phases that run concurrently.
 Each phase has an operations review process. Operations reviews must be release
based and time based.
 IT service management touches every aspect of the enterprise.
With this understanding, the MOF process model consists of four integrated phases or
“quadrants”:
 Changing
 Operating
 Supporting
 Optimizing
These quadrants form a spiral life cycle that can be applied to a specific application, a
data center, or an entire operations environment with multiple data centers, including
outsourced operations and hosted applications.
Each quadrant culminates with a review milestone specifically tailored to assess the
operational effectiveness of the preceding quadrant. These quadrants, coupled with their
designated review milestones, work together to meet organizational goals and
objectives.
The following diagram illustrates the MOF process model and the relationship of the
life cycle quadrants, the reviews following each quadrant, and the concept of IT service
management at the core of the model. The diagram depicts each quadrant of the IT
operation connected in a continuous spiral life cycle.
Figure 11 - MOF and IT service management functions
Service Management Functions
IT SMFs are the core of the MOF process model. Although no SMF is exclusive to a
given quadrant in MOF, each SMF has a “home” quadrant or primary planning and
execution quadrant. Grouping SMFs with a primary MOF quadrant is a more intuitive
way to introduce an SMF in the context of the process model. The following is a
comprehensive list of MOF SMFs along with their description.
Changing
Change management. Responsible for managing any change in the organization.
Configuration management. Responsible for identifying, recording, tracking, and
reporting key IT components or assets.
Release management. Facilitates the introduction of software and hardware releases
into the managed IT environment.
Operating
Security administration. Responsible for maintaining a safe computing environment.
System administration. Responsible for keeping the enterprise systems running.
Network administration. Designs and manages all networks within an organization.
Service monitoring and control. Allows operations to observe the health of an IT
service in real time.
Directory services administration. Allows users and applications to find network
resources such as users, servers, applications, tools, services, and other information over
the network.
Storage management. Deals with on-site and off-site data storage for the purposes of
data restoration and historical archiving.
Job scheduling. Deals with assigning batch processing tasks at different times to
maximize the use of system resources while not compromising business and system
functions.
Print and output management. Deals with all data that is printed or compiled into
reports that are distributed to various members of the organization.
Supporting
Service desk management. Responsible for first-line support to the user community for
problems associated with the use of IT services.
Incident management. Manages and controls faults and disruptions in the use or
implementation of IT services as reported by customers or IT partners.
Problem management. Investigates and resolves the root causes of faults and
disruptions that affect large numbers of users.
Failover and recovery. Ensures that if a failure occurs, services are available in
accordance with the service continuity plan and service level agreements.
Optimizing
Service level management. Responsible for planning, coordinating, drafting, agreeing,
monitoring, and reporting on service level agreements (SLAs), and the ongoing review
of service achievements to ensure that IT and business are aligned and that service
quality is cost justifiable.
Capacity management. Ensures that appropriate IT resources are available to meet
business requirements.
Availability management. Concerned with the availability and reliability of the overall
system.
Financial management. Provides sound management of monetary resources in support
of organizational goals.
Workforce management. Recommends best practices to continuously assess key
aspects of the IT.
IT service continuity management. Focuses on preventing service outages and also on
recovery planning.
With regard to risk management, it is worth noting that some SMFs traditionally do
quite a bit of risk management. The obvious example is IT service continuity
management (formerly known as contingency planning), which focuses on disaster
recovery but employs risk management techniques. Availability management also relies
on risk management to ensure that changes in the environment don’t impact service
availability.
The MOF Team Model
The MOF team model offers guidelines for IT service management based on a set of
consistent, quality goals that exist in successful IT operations at organizations of various
sizes, from large corporate IT departments to smaller, growing e-business data centers
and application service providers.
The team model describes:
 How to structure operations teams.
 The key activities, tasks, and skills of each of the role functions.
 What guiding principles to uphold to be most successful at running and operating
distributed computing environments on the Microsoft platform.
The following diagram illustrates the MOF team model and each team’s associated
SMFs.
Figure 12 - Example Function Teams within Ops Team Model Roles
(Diagram: six role clusters, each with example function teams. Center label: Communication.)
Release: change management; release/systems engineering; configuration control/asset
mgmt; software distribution/licensing; quality assurance
Infrastructure: enterprise architecture; infrastructure engineering; capacity mgmt;
cost/IT budget mgmt; resource & long range planning
Support: service desk/helpdesk; production/product support; problem management;
service level management
Operations: messaging ops; database ops; network admin; monitoring/metrics;
availability mgmt
Partner: maintenance vendors; environment support; managed services, outsourcers,
trading partners; software/hardware suppliers
Security: intellectual property protection; network & system security; intrusion
detection; virus protection; audit and compliance admin; contingency planning
Comparing the Risk Model for Operations to Other Risk Models
Beyond Security
Many risk models focus on security and view management of risks from the perspective
of maintaining hardware and data security and integrity. One example is the CCTA’s
Risk Analysis and Management Methodology (CRAMM). This is a valuable approach,
but the risk model for operations broadens the scope of potential risks beyond security
to include risks related to people, process, and technology.
CRAMM
Sanctioned by ITIL, CRAMM was developed by Insight Consulting. CRAMM is a
structured method for assessing risks to information systems and identifying appropriate
countermeasures. As such, MOF acknowledges the value of this approach.
When comparing the security aspects of risk management, CRAMM’s structured
approach, embodied in a software package, is a three-step process that allows users to
identify the valuation of their assets, assess the threats and vulnerabilities, and then
apply recommended countermeasures to their IT infrastructure.
Comparison with Risk Model for Operations
MOF risk management as applied to security provides guidance and stresses continual
review of security risks in five steps. MOF risk management emphasizes the continual
process of identifying, analyzing, planning, tracking, and controlling security measures
because new security threats are continually surfacing.
Moreover, MOF recognizes that security management is just one component of
managing risks in the operations environment. Where the MOF risk model differs from
other risk models is that it takes on a comprehensive view of risk management that
includes risks associated with agility, performance, and cost in addition to security.
From the business perspective, an IT operation can have a tight security structure that
takes into account and manages potential security threats, but it still could fail if it
doesn’t address the risks inherent in agility, performance, and cost.
Examples
Risk Management in Each Role
The first section of this white paper made the case that business is changing and
operations needs new risk management tools in order to adjust. The second section
described the theory behind the risk model for operations. This section demonstrates
that the risk model for operations actually works when applied to real-world operations
risks.
A basic principle of the risk model for operations is that risk management should be
integrated into every role in the MOF team model. The examples below are organized
according to those roles. For each role, a representative SMF has been selected.
MOF Team Role     SMF
Release           Configuration management
Infrastructure    Capacity management
Support           Service desk
Operations        Availability management
Partner           Financial management
Security          Security
Please note that:
• All of the SMFs in MOF are important and operations needs to manage risks in
each. One has been selected here for each role simply to avoid presenting an
unmanageable number of examples.
• The examples below have been chosen to demonstrate the model’s applicability to
the widest audience. They are not intended to be comprehensive or exhaustive.
Microsoft Consulting Services consultants and partner consultants can help
demonstrate the application of these principles to a particular operations
environment.
• Detailed analysis of each risk below is presented in Appendix B.
Release Role
The configuration management SMF is most commonly associated with this role.
Configuration management is responsible for the identification, recording, tracking, and
reporting of key IT components or assets. The goal of configuration management is to
ensure that only authorized configuration items are used in the IT environment and that
all changes to configuration items are recorded and tracked through their component life
cycle.
Suppose someone is doing configuration management work related to the deployment
of Microsoft® Windows® 2000. Recently, the team has debated the level of detail to
collect on the systems being upgraded. If the team collects too much information it may
fall behind schedule, but what if it collects too little? An operational consequence might
be that the other SMFs don’t have the information they need to perform correctly, so
problems occur that would have been easy to prevent. Once the team recognizes this
risk, it could mitigate it by collecting additional detail.
Infrastructure Role
The capacity management SMF is most commonly associated with this role.
Suppose someone is doing capacity management work at an application service
provider (ASP). This person spends considerable time analyzing statistics generated by
various tools. Everyone in the group is impressed by the volume of detail that a new
tool provides, so much that it can be hard to find the most important measurements.
What if it becomes too hard to spot them? One consequence might be that outages and
bottlenecks seem to occur without warning, which severely impacts customer
satisfaction. Mitigations include reconfiguring the user interface, upgrading the tool, or
replacing it with one that does not pose this risk. If none of these are an option, or if
they will take time to implement, a contingency is to add capacity in hopes of staying
ahead of demand.
Support Role
The service desk SMF is typically associated with the support role from the MOF team
model.
The service desk function is responsible for first-line support to the user community for
problems associated with the use of IT services. The service desk also attempts to
identify a problem or rectify a known error through discrepancies or incidents
communicated by users. A service desk may be an organizational unit composed of
multiple service groups—for example, a call center and one or more site support teams.
Suppose that service desk personnel are working to recover service levels at a
business-to-customer (B2C) e-commerce site. The customers are encountering problems and
reporting them, and the incident management staff are trying to gather all the relevant
information, but the automated tool they use (a database front end) wasn’t designed to
accept information that is critical to the problem. For example, the front end might
include hard-coded fields for items like the customer’s computer make and model, but
not the bandwidth of the line they use to connect to the Internet. This would prevent the
correct information from reaching the problem management group, which would slow
its response to problems and leave customers with the impression that the company isn’t
committed to providing good service. The group can try to change the tool if possible,
or work around the problem by repurposing existing fields to store the newly needed
data.
Operations Role
The operations role from the MOF team model often provides availability management.
Availability management is concerned with the availability and reliability of the overall
system. The goal of availability management is to ensure optimal availability of IT
services with the correct use of resources, methods, and technology.
Someone in the operations role, performing availability management, might realize that
if the group’s staffing were to drop, the group would not be able to meet the required
service levels. In particular, there are rumors of a merger with another company. If that
happens, then some IT staff roles are likely to be cut. If the staff levels are cut, the
messaging service would experience more frequent failures and it would take longer to
recover from each, which would reduce the company’s internal productivity, and
damage its reputation with other companies. Automated tools might reduce the impact
of the staff cuts, and if the cuts take place the availability management group might
borrow resources from other departments or hire contingent staff, and reset customer
expectations about service levels.
Partner Role
The partner role includes a broad collection of IT partners, service suppliers, and
outsource vendors who work as virtual members of the IT staff in providing hardware,
software, networking, hosting, and support services. The degree to which an IT
organization utilizes supplier services varies widely from business to business,
depending on the size, location, industry type, and the strategic goals of the business.
Internet e-businesses, for example, will focus on their core competencies of building
and running an e-commerce site, while they might outsource their customer service and
product fulfillment, hardware support, and possibly other functions.
The partner role is most closely associated with the financial management SMF.
Financial management encompasses many of the same accounting principles found in
use today across a wide variety of industries. In common practice today, cost
management for IT includes budgeting, cost accounting, cost recovery, cost allocations,
charge-back models, and revenue accounting. The key aspect of financial management
that ITIL and MOF address is its linkage to other service management functions.
This role is often responsible for service level agreements and maintenance contracts,
such as the ones that govern how one company might outsource part of its operations
work to another company. Outsourcing often makes financial sense, and it can be a risk
management strategy by itself: It transfers certain risks to the vendor or contractor.
However, this doesn’t mean the company is immune to external sources of risk, such as
a tightening of the labor market. One consequence of that risk is that operational costs
go up, which can impact other IT budgets. Labor-saving tools can somewhat reduce
impacts such as high turnover, less-skilled staff, and decreased staff availability
resulting from the need for additional training. The group might react to this market by
implementing non-monetary reward systems to encourage staff retention.
Security Administration Role
Security administration is responsible for maintaining a safe computing environment.
Security is an important part of the infrastructure of the enterprise.
This role and SMF are critical in nearly every environment. Someone in this role might
face many risks while managing security in a business-to-business e-commerce site that
one company uses to facilitate transactions with its business partners. For example, if
business management understands the needs and the costs of maintaining security, then
this role’s job is easier. What might happen if management does not have a realistic
understanding of security costs, and underfunds this role? One possible consequence is
that an under-funded staff does not properly protect partner data, leading to lawsuits.
The security staff can mitigate the risk by convincing upper management that additional
funding is required, and if that does not prevent the condition from occurring, the staff
should at least prioritize their work to ensure they do the best job they can with the
limited resources.
Conclusion
Making Risk Management Easier
Most IT groups have seen the changes described above: business becoming more reliant
on IT, computing environments becoming more complex, visibility to the outside world
increasing, and IT groups having less control. The risks are getting bigger, but the risk
model for operations makes them easier to manage through the principles of proactive
management, and embedding risk management in all processes and all roles.
The example risk statements above are intended to prove this point, to demonstrate how
the risk model for operations can be applied to real-world situations. A larger set of
specific risk statements will be made available, especially through guides released by
OpsCentral, and through the personalized assistance of MCS and Microsoft’s consulting
partners.
Additional Information
Courses
For course availability, see http://www.microsoft.com/es.
A MOF course is being developed and will be available shortly.
Books
The following book serves as a bibliography for this paper or as recommended reading
to further understand the concepts contained herein:
IT Service Management. IT Service Management Forum/CCTA. ITIMF Ltd., 1995.
Web Sites
For more information on Microsoft’s enterprise frameworks and offerings, see:
http://www.microsoft.com/business/services/mcsmsf.asp
http://www.microsoft.com/mrf
http://www.microsoft.com/business/services/mcsmof.asp
http://www.microsoft.com/es
For more information on the MSF risk model, see the “MSF Risk Management Process”
white paper, http://www.microsoft.com/business/whitepapers/riskmgmt.doc
For more information on ITIL, see http://www.itil.co.uk/.
For more information on the Help Desk Institute, see http://www.helpdeskinst.com/.
Appendix A: Glossary
Using Consistent Terminology
The basics of risk management are simple and intuitive, to the point that most people do
it all day long without conscious thought. To understand the five-step process at the
heart of the risk model for operations, and to apply it effectively, it is important that
team members adopt a consistent terminology so that they can discuss and understand
the nuances that they usually don’t think about.
Different people and different organizations often represent the same idea using
different terms, or represent different ideas using the same terms. There is no
universally agreed-upon set of terms for describing risk. The risk model for operations
uses the terms below because they are the most common ones. If your organization uses
a different term than one of these, don’t panic: Using the words consistently is more
important than using any particular set.
The following list presents words in the order in which each definition becomes
important in the risk-management process:
• ITIL. Information Technology Infrastructure Library. A set of comprehensive,
consistent, and coherent codes of best practice for IT service management.
Developed by the Central Computer and Telecommunications Agency (CCTA) in
the United Kingdom.
• Risk. The possibility of suffering a loss; an event that may or may not happen. If an
event is guaranteed then it is not a risk—it is a known problem that you can plan for.
The loss is relative. Failure to achieve the maximum possible gain is considered to
be a loss. The opposite of a risk is an opportunity: the possibility of experiencing a
gain.
• Risk management. Sets forth a discipline and environment of embedded decisions
and actions to assess continuously what can go wrong, determine what risks are
important to deal with, and implement strategies to deal with those risks.
• Identifier. A name that the team uses to uniquely identify and track a particular risk.
• Sources of risk. Related to the ITIL term “category.” There are four main sources of
risk in IT operations:
   - People. Even if the group’s processes and technology are flawless, everyone
     makes mistakes, and these mistakes can put the business at risk.
   - Process. Flawed or badly documented processes can put the business at risk even
     if they are followed perfectly.
   - Technology. The IT staff may perfectly follow a perfectly designed process, yet
     the business can fail because of problems with the hardware, software, and so on.
   - External. Some factors are beyond the IT group’s control but can still harm the
     infrastructure in a way that causes business failure. Natural events such as
     earthquakes and floods fall into this category, as do externally generated,
     man-made problems such as civil unrest, computer virus attacks, and changes to
     government regulations.
• Mode of operational failure. There are four main ways in which IT operations
problems can cause failure:
   - Cost. The infrastructure can work properly, but at too high a cost, causing too
     little return on investment.
   - Agility. The infrastructure can work properly, but be unable to adapt to changing
     circumstances.
   - Performance. The infrastructure can fail to meet users’ expectations, either
     because the expectations were set wrong, or because the infrastructure performs
     incorrectly.
   - Security. The infrastructure can fail the business by not providing enough
     protection for data and resources, or by enforcing so much security that
     legitimate users can’t access data and resources.
• Risk condition. A description of a possible future event that could result in a loss.
• Operational consequence. A description of the way in which the condition would
affect the IT environment. The mode of failure typically influences the operational
consequence.
• Business consequence. A description of the way in which the operational
consequence would affect the business as a whole.
• Risk statement. The combination of the elements of a risk that the identification
step produces: source of risk, mode of failure, condition, operations consequence,
business consequence.
• Probability. The likelihood that the condition will occur. Note that this is not the
likelihood of the consequence. It is assumed that if the condition happens, the
consequence is a guaranteed result. Probability is measured on a numeric scale, and
it is never zero (because a risk that can’t happen isn’t something to manage) and
never 100 percent (because that condition would be guaranteed: a known problem,
not a risk).
• Impact. The degree of loss that the business consequence would cause. This is
measured on a numeric scale: the higher the impact, the higher the number. This is
closely related to the ITIL meaning of this term: the business criticality of an
incident.
• Exposure. The result of multiplying the probability by the impact. For example, if
the probability is 20 percent and the impact is 3, then the exposure is 0.6.
• Mitigation. Action the team can take prior to the condition and/or consequence
occurring. A mitigation may reduce the probability, or the impact, or both, or
transfer the risk to another party, or avoid the risk altogether. A single condition may
have multiple mitigations, or one, or sometimes none.
• Trigger. A measurement threshold that indicates that the condition is about to occur.
It is a value that is either true or false. When it shifts from false to true, the team
executes the contingency plan.
• Contingency plan. Action the team takes if the risk condition occurs or if the trigger
is activated. A single consequence may have multiple contingencies, or one, or none.
• IT service continuity management, availability management. Two of the service
management functions in Microsoft Operations Framework that rely heavily on risk
management practices.
Appendix B: Detailed Examples
Overview
This appendix presents the same risks as the Examples section of the white paper, but
presents more detail on each.
Note that the risk model details all parts of each risk, including the parts identified in the
analyzing phase (impact, probability, and exposure). This data is critical in the real
world, but it is also very situation-dependent, and listing specific values would distract
more than it would enlighten, so it has been omitted in the examples below.
Also, most relationships between elements of a risk are many-to-many, but for the sake
of brevity these examples focus on only one element. Each example lists one trigger,
when in the real world there might be several.
Release Role
The context for this risk is that someone in the release role is performing configuration
management work related to a Microsoft® Windows® 2000 deployment. That person is
trying to assess the best amount of data to collect for each part of the infrastructure
affected by the deployment.
Risk component
    Statement
Source of risk:
    Process
Mode of failure:
    Performance
Condition: If the future turns out this way …
    Configuration management team does not record enough detail about each
    configuration item (CI) during the deployment, so the information never reaches
    the configuration management database (CMDB).
Operations Consequence: … then operations will be hurt in this manner …
    Other SMFs have insufficient information and are not able to perform their jobs
    effectively.
Business Consequence: … and the business as a whole will be hurt in this manner …
    Employee productivity suffers due to undocumented anomalies in the configuration,
    anomalies that incident management and problem management would have detected
    quickly had the relevant information been in the CMDB.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Reduce both impact and probability by beginning to add detail to each of the CIs
    in the database.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    The CMDB might indicate that everyone in one department runs a particular
    application, but the users complain that they can’t share data, and the cause
    turns out to be the mixed versions in use: a problem that wasn’t apparent because
    the configuration management team tracked only the names of installed
    applications, not the version numbers.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    Add levels of attribute detail to the affected CIs.
In this example the trigger may seem a bit contrived, but it illustrates the reason that this
is a thorny operations problem. It’s very hard to know when the information you have is
no longer enough to do the job. The trigger listed here is just one example of how the
lack of needed information might manifest itself. The team managing this risk would
likely produce a more generic trigger, or would define several other specific triggers.
Infrastructure Role
The context for this risk is that someone acting in the infrastructure role at an
application service provider (ASP) is performing capacity management, and speculating
about the risks related to a new tool. The tool may eliminate some risks related to older
tools, but it may introduce the following risk as well.
Risk component
    Statement
Source of risk:
    People
Mode of failure:
    Performance
Condition: If the future turns out this way …
    The capacity management staff uses monitoring tools whose user interfaces are so
    complex that it is easy to overlook relevant information.
Operations Consequence: … then operations will be hurt in this manner …
    Capacity management is faced with outages and bottlenecks that seem to occur
    without warning.
Business Consequence: … and the business as a whole will be hurt in this manner …
    Customers are dissatisfied with the ASP’s inability to support the demand for
    service, and the customers react by switching to a competing ASP.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Simplify the user interface by reconfiguring the existing tools, or by installing
    an upgraded version of the tool, or by replacing the current tool with a better
    one from a different vendor.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    The ASP finds itself unable to meet service level agreements because of
    inaccurate capacity utilization forecasts.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    Add capacity in hopes of staying ahead of demand.
Everything in this risk hinges on the source of risk: people. Presumably, the problem
stems from people misusing a tool that is functioning correctly. Under different
circumstances, the risk management team might have decided that this is a technology
issue, especially if the tool cannot be reconfigured to reduce the volume of data it
presents. Taking it one step further, the risk management team might have asked
whether the people knew that the tool could be reconfigured. If they did not because that
topic wasn’t covered in training, the failure in training would define “process” as the
source of risk.
The distinctions are relevant for four reasons:
• The real source of the problem (people, process, technology) greatly affects the
mitigation. For example, altering the training won’t prevent problems caused by
defects in the tool.
• The team may analyze current risks by grouping them according to source of risk.
This might, for example, expose a set of risks related to poor training, or tools from
a certain vendor.
• This illustrates how valuable diverse viewpoints can be during the identification
step. Many people who consider the condition by itself would focus on one
particular source of risk, potentially missing the others.
• This illustrates why precision is important when documenting a risk. If this
condition’s wording were changed slightly, none of the other elements of the risk
would make sense.
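Grouping risks by source of risk, as the second point above suggests, is easy to automate once the risks are recorded consistently. The Python sketch below shows one way to do it, using the sources of risk assigned to the six examples in this appendix. The risk identifiers and the function name are invented for illustration.

```python
from collections import defaultdict

# (identifier, source of risk) pairs. The identifiers are hypothetical;
# the sources are the ones assigned to the six examples in this appendix.
risks = [
    ("REL-cmdb-detail", "Process"),
    ("INF-tool-complexity", "People"),
    ("SUP-tool-fields", "Technology"),
    ("OPS-staff-cuts", "People"),
    ("PAR-labor-market", "External"),
    ("SEC-underfunding", "People"),
]


def group_by_source(entries):
    """Return a dict mapping each source of risk to its risk identifiers."""
    groups = defaultdict(list)
    for identifier, source in entries:
        groups[source].append(identifier)
    return dict(groups)


grouped = group_by_source(risks)
for source in sorted(grouped):
    print(f"{source}: {', '.join(grouped[source])}")
```

A view like this makes clusters visible at a glance. Here, three of the six example risks share "People" as their source, which is the kind of pattern that might prompt a closer look at training or staffing.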
Support Role
The context for this risk is that service desk personnel use tools to collect information
from customers who log complaints regarding the company’s business-to-consumer
e-commerce Web site. The tools seem to collect the right information today, but what if
future problems arise that require additional information, which the tools weren’t
designed to collect?
Risk component
    Statement
Source of risk:
    Technology
Mode of failure:
    Performance
Condition: If the future turns out this way …
    Customers using the B2C site encounter problems and report them, but the incident
    management tools do not collect all relevant data.
Operations Consequence: … then operations will be hurt in this manner …
    Problem management doesn’t receive the information they need to track down the
    underlying problems.
Business Consequence: … and the business as a whole will be hurt in this manner …
    Customers perceive the slow response to problem reports as a sign the business
    doesn’t take customer service seriously, and the customers switch to a
    competitor’s B2C site.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Change the incident tracking tool if possible, and if not, attempt to store the
    additional required information in an unused field.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    Problem management realizes that the information needed to solve a problem isn’t
    being collected by incident management.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    Change the incident tracking tool if possible, and if not, attempt to store the
    additional required information in an unused field.
Note that in this case, the mitigation and contingency are the same. In some cases, the
difference between them is timing. The contingency plan executed for today’s condition
may serve as a mitigation, reducing the impact or probability of the condition in the
future.
Operations Role
The context for this risk is that someone acting in the operations role, and doing
availability management for the company’s messaging service, realizes that staff cuts
resulting from a merger would prevent the group from meeting its requirements.
Risk component
    Statement
Source of risk:
    People
Mode of failure:
    Performance
Condition: If the future turns out this way …
    There are too few people in availability management to properly manage messaging
    service availability.
Operations Consequence: … then operations will be hurt in this manner …
    The messaging service has low availability due to low mean time between failures
    (MTBF) and high mean time to repair (MTTR).
Business Consequence: … and the business as a whole will be hurt in this manner …
    Reduced internal productivity; reduced ability to communicate with external
    partners, which causes those partners to lose confidence in your organization.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Deploy automated system-monitoring tools to compensate for lack of staff.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    Service level agreements aren’t being met. The merger is announced and upper
    management states that layoffs are possible.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    Hire contingent staff, or borrow staff from other departments, or reset partner
    expectations.
This is a good example of a single condition that can trigger multiple contingency plans.
Downsizing may affect many other teams as well as availability management, and each
team may have different plans.
Also, consider the case in which availability management is the only team being
downsized. That team is uniquely positioned to anticipate this risk condition. This is one
case in which the principle of integrated risk management (performing risk management
in every role and every job function) can be vitally important.
Finally, it is worth noting that availability management is one of the SMFs in the
optimizing quadrant of the MOF life cycle, and SMFs in that quadrant tend to focus on
planning for the future. That often makes it difficult to identify specific triggers for the
risks they face.
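The operations consequence in this example rests on two measurements: how often the messaging service fails and how long each recovery takes. A common way to relate them is the steady-state approximation availability = MTBF / (MTBF + MTTR). The Python sketch below applies it. The formula is a standard reliability approximation, not something this paper defines, and the hour values are invented for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the service is up.

    mtbf_hours: mean time between failures (must be positive).
    mttr_hours: mean time to repair (must not be negative).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: with full staffing, failures are rare and repairs quick.
fully_staffed = availability(mtbf_hours=500.0, mttr_hours=2.0)

# Hypothetical figures after staff cuts: failures are more frequent (lower MTBF)
# and each recovery takes longer (higher MTTR).
after_cuts = availability(mtbf_hours=200.0, mttr_hours=8.0)

print(f"Fully staffed: {fully_staffed:.4f}")
print(f"After cuts:    {after_cuts:.4f}")
```

Comparing the two figures against the availability promised in the service level agreement gives the team a concrete way to estimate the impact of the staffing risk before the condition occurs.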
Partner Role
The context for this example is that someone acting in the partner role is doing financial
management work, thinking about the impacts on IT if the labor market tightens.
Risk component
    Statement
Source of risk:
    External
Mode of failure:
    Cost
Condition: If the future turns out this way …
    The labor market tightens, making it harder to retain qualified IT staff.
Operations Consequence: … then operations will be hurt in this manner …
    The partner role faces increased costs for recruiting, training, and retaining
    staff.
Business Consequence: … and the business as a whole will be hurt in this manner …
    Cost overruns for staffing cause cuts elsewhere in the IT budget (which can
    affect the business) or in non-IT budgets.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Implement automated labor-saving tools.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    The staff turnover rate reaches a particular threshold; the cost of new labor
    contracts reaches a particular threshold.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    Implement non-monetary reward systems to encourage staff retention.
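The trigger in this example is a measurement threshold, which makes it straightforward to monitor automatically. The Python sketch below shows one way to detect the moment a trigger shifts from false to true, which is when the glossary says the team executes the contingency plan. The class name, the 15 percent threshold, and the turnover readings are all hypothetical.

```python
class ThresholdTrigger:
    """A trigger that is true while a measurement is at or above a threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.active = False  # current true/false value of the trigger

    def update(self, measurement: float) -> bool:
        """Record a new measurement.

        Returns True only when the trigger shifts from false to true,
        so the contingency plan is started once per shift, not on every
        reading above the threshold.
        """
        now_active = measurement >= self.threshold
        fired = now_active and not self.active
        self.active = now_active
        return fired


# Hypothetical threshold: annualized staff turnover of 15 percent.
turnover_trigger = ThresholdTrigger(threshold=0.15)

# Hypothetical periodic turnover readings.
readings = [0.08, 0.11, 0.16, 0.18, 0.12, 0.17]
fired = [turnover_trigger.update(r) for r in readings]

for reading, shifted in zip(readings, fired):
    if shifted:
        print(f"Turnover {reading:.0%}: trigger activated, start contingency plan")
```

Reporting only the false-to-true shift, instead of every reading above the threshold, avoids restarting a contingency plan that is already under way.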
Security Role
The context for this risk is that people performing the security role worry that the
company’s upper management will not fund security management well enough for the
staff to do their jobs.
Risk component
    Statement
Source of risk:
    People
Mode of failure:
    Security
Condition: If the future turns out this way …
    Management does not take security risks seriously enough to adequately fund
    security management.
Operations Consequence: … then operations will be hurt in this manner …
    Under-funded security staff fails to protect partners’ data on the B2B site, so
    one partner is able to view a second partner’s data, and the second partner
    discovers this fact.
Business Consequence: … and the business as a whole will be hurt in this manner …
    The second partner sues the business for failure to protect privileged
    information, and in addition to the legal judgment the company suffers from
    negative press coverage.
Mitigation: Prior to the condition occurring, we’ll try to reduce the impact and/or
probability by …
    Provide management with security audits to prove that additional work needs to be
    funded.
Trigger: If the condition is imminent (but hasn’t yet occurred) we’ll know because this
happens…
    Success of denial-of-service attacks; compromised passwords; e-mail bombing;
    confidential data discovered in public view.
Contingency: If we’re unable to prevent the condition, we’ll respond to the trigger in
this way …
    If it is not possible to get additional funding, spend the time analyzing
    security risks to ensure the limited money is spent as effectively as possible.
The contingency shows that risk management can be part of the risk management plan.
In some cases, becoming more rigorous about risk management is a good response to a
potential loss. In fact, that’s a key reason that operations groups need to become better
at performing risk management. The business environment is changing and there are
more risks, with higher probabilities and impacts, than ever before.