What is Copy Data?
Prepared by: George Crump, Lead Analyst
Prepared: October 2014
What is Copy Data?
Copy Data is the term used to describe the copies of primary data made for data protection,
testing, archives, eDiscovery and analytics. The typical focus of copy data is data protection:
recovering data when something goes wrong. The problem is that each type of recovery requires a
different copy of the data. Recovery from corruption requires snapshots. Recovery from server
failure requires disk backup. Protecting the disk backup itself requires tape. Finally, recovery from a
site disaster requires that all of these copies also exist off-site. Added to the data protection copies
are all the copies being made for test/development, archives, eDiscovery and, increasingly, analytics. The
end result: copy data is about much more than data protection, and providing the capacity to
store and manage all of these copies has become a significant challenge for the data center.
If the demand for primary storage capacity is growing, copy data capacity demands are
exploding. Storage Switzerland believes the "copy" data set is outpacing primary storage by as
much as 20X. The challenge for the IT planner is to get a handle on this copy data and make
sure that the cost of the copy data infrastructure does not outpace the cost of
primary storage.
Where does Copy Data come from?
Data centers have always had to manage copy data. In the past, when tape was the backup medium for
all data, off-site protection required a duplicate copy of those tapes. This produced an immediate 2X
growth in copy data. Tape backup also retained many versions of those backups for some
period of time, so the tape capacity required was often close to 4X primary storage. And because
most backup administrators run full backups once a week, once a month, or even once a
quarter, retention compounds the problem: monthly fulls kept for two years, for example, amount to
roughly 24 full copies, so the 4X could grow to 24X. The only saving
grace for tape was that its cost per GB was, and still is, very low compared to disk.
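To make this arithmetic concrete, here is a back-of-envelope sketch in Python. The capacity, version counts and the monthly-full reading of the 24X figure are illustrative assumptions, not measured values:

```python
# Back-of-envelope tape copy data arithmetic (illustrative parameters).

primary_tb = 100.0                 # primary storage capacity, TB (assumed)

# Two retained full backups, each duplicated off-site -> the ~4X cited.
versions_retained = 2
tape_sets = 2                      # on-site set plus off-site duplicate
print(primary_tb * versions_retained * tape_sets / primary_tb)   # 4.0

# One reading of the 24X figure: monthly fulls kept for a two-year
# retention period -> 24 full copies of primary storage.
monthly_fulls_retained = 24
print(monthly_fulls_retained * primary_tb / primary_tb)          # 24.0
```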
Next, organizations began to look at disk to support faster application recovery. The first form of
this was synchronous disk mirroring and asynchronous data replication. An application could
use these copies for rapid recovery with minimal data loss, but the downside was that they consumed
expensive primary-class storage. This type of copy represented 4X growth, or 6X when also
replicated off-site for disaster recovery.
Data centers then started to look at disk-based backup to improve data protection performance.
At first these were raw disk systems with no data efficiency capabilities. These systems later
added technologies like deduplication to reduce costs, but they still had to start with a baseline of
disk capacity, which typically adds another 2X. Unlike tape, deduplication does eliminate full backup
redundancy, but it still stores the unique data that changes between those backups. As a result, deduplicated
backup copy data can be as much as 10X the size of primary storage; if replicated for DR, that
jumps to 20X.
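The same back-of-envelope approach applies to deduplicated disk backup. The daily change rate and retention window below are assumptions chosen purely to illustrate how the 10X and 20X figures can arise:

```python
# Back-of-envelope deduplicated backup capacity (assumed parameters).

def dedupe_backup_multiplier(daily_change_rate: float,
                             retention_days: int,
                             dr_replica: bool) -> float:
    """Initial ~2X disk baseline plus the unique changed data
    retained between backups; DR replication doubles the total."""
    baseline = 2.0
    unique_changes = daily_change_rate * retention_days
    total = baseline + unique_changes
    return total * 2 if dr_replica else total

# ~2% daily change retained for ~400 days lands near 10X,
# doubling to ~20X once replicated for disaster recovery.
print(dedupe_backup_multiplier(0.02, 400, dr_replica=False))  # 10.0
print(dedupe_backup_multiplier(0.02, 400, dr_replica=True))   # 20.0
```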
Recovery requirements and demands only continued to increase. Even with disk backup,
transferring that data back over the network took too long. The next step was for
IT planners to move to storage systems with snapshot capabilities, which became broadly
available at about the same time that deduplicating backup disk came to market.
Snapshots allow many point-in-time views of primary storage. When first taken, a snapshot
does not immediately require a 2X storage increase the way a mirrored copy does. Over time,
however, as the active data set changes and the storage system has to maintain the older views,
capacity consumption grows: the system is effectively holding two or more versions of any
actively changing data. Even so, this copy method is one of the most space-efficient available
to the data center.
In general, the more snapshots that are created and the longer they are retained, the more
primary storage capacity will be consumed. To make matters worse, some storage systems also
require that the storage administrator hard-provision a snapshot reserve, which can be as much
as 30% of total production capacity. Despite the efficiencies of snapshots, IT planners should
assume a capacity requirement of 1.5X the system's total capacity to leave room for snapshot
data.
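A small sketch contrasting the two planning approaches, assuming a hypothetical 100 TB production system:

```python
# Snapshot capacity planning sketch (assumed 100 TB production system).

def snapshot_capacity(production_tb: float,
                      reserve_fraction: float = 0.30,
                      planning_factor: float = 1.5) -> dict:
    """Contrast a hard-provisioned snapshot reserve with the blanket
    1.5X planning assumption suggested above."""
    return {
        "hard_reserve_tb": production_tb * reserve_fraction,
        "planned_total_tb": production_tb * planning_factor,
    }

print(snapshot_capacity(100.0))
# {'hard_reserve_tb': 30.0, 'planned_total_tb': 150.0}
```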
Copy Data Caused By More Than Just Data Protection
The above examples revolve around the data protection process, but processes other than data
protection create copy data as well. A long-time creator of copy data is the test/development process
(test/dev). It needs near real-time copies of production data; the closer to real-time
production the data is, the higher quality the test results will be. It is common for test/dev to
count on several copies of production data so that various iterations can be verified. IT planners
should assume that test/dev data accounts for as much as 6X the size of primary
storage.
The other non-data-protection process that creates a need for copy data is data analytics.
While many organizations are only beginning to experiment with analytics processing, it is
already a rapidly growing source of copy data. It is hard to quantify just how much analytics
will cause the copy data set to grow; however, it is safe to assume that analytics may require at
least another 2X added to the production data set.
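Pulling these multipliers together gives a rough picture of the total footprint. The mix below is a naive, illustrative sum; few environments carry every copy type at full weight, which is why a real total may land nearer the 20X headline estimate than this sum:

```python
# Naive roll-up of the copy data multipliers discussed above
# (illustrative mix; real environments keep only a subset).

multipliers = {
    "snapshots": 1.5,          # 1.5X planning figure for snapshot space
    "dedupe_backup_dr": 20.0,  # deduplicated disk backup replicated for DR
    "test_dev": 6.0,           # several refreshed test/dev copies
    "analytics": 2.0,          # at least another 2X for analytics
}

total = sum(multipliers.values())
print(f"copy data could reach ~{total:.1f}X primary storage")
```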
The Cost of Copy Data
The cost of maintaining copy data can be significant. First, there is the obvious cost of
acquiring the raw capacity. But this is just the tip of a titanic-sized iceberg. The bigger problem is
management: most of these copies live on separate storage silos, each managed by its own
process.
Just from the examples above, copy data could involve at least six different storage systems, and
six different processes to create that data. With so many processes running in parallel, the
chance of human error rises significantly. Each of these separate processes typically needs a
direct interface with the primary data set to make its copies, which can degrade application
performance and potentially even cause data corruption.
Insight is another big challenge. All these processes count on working with the most recent copy
of production data. How can systems managers be assured that they are always getting the
most recent image? Also, how does the storage management team know that the test/dev team
will prune excess copies when their work is complete?
Copy Data Management is Copy Data Convergence
An investment in copy data management can have a big impact on the cost of storage in the
data center. And unlike other storage management processes, it does so without impacting or
changing production storage. A copy data management solution converges all of the above
copies into a single tree that can have many branches.
Companies like Catalogic Software are providing solutions that build on the efficiency of
snapshots but do so across storage systems: essentially, they 'catalog' all files, snapshots,
vaults and mirrors, whether they reside on local, remote or cloud storage. This helps eliminate a key
vulnerability of snapshot technology: the outright failure of the storage system and the resulting
loss of access to its snapshot copies. In addition, copy management solutions add in concepts from
backup, such as data cataloging, which allows for rapid data search and retrieval. This second
copy, updated in near real time, can then be snapshotted and/or replicated
many times over to feed the various copy data processes outlined above. Another
unique attribute of a copy data management system is its ability to make these secondary
snapshots writable, which is a requirement for test/dev and analytics environments.
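A minimal sketch of the "single tree, many branches" idea follows; the class and branch names are hypothetical illustrations of the concept, not Catalogic's product API:

```python
# Hypothetical model of a copy data tree: one capture from production
# feeds every downstream consumer (all names are illustrative only).

from dataclasses import dataclass, field

@dataclass
class CopyNode:
    name: str
    writable: bool = False
    children: list["CopyNode"] = field(default_factory=list)

    def branch(self, name: str, writable: bool = False) -> "CopyNode":
        child = CopyNode(name, writable)
        self.children.append(child)
        return child

# One interface to production; everything else hangs off the golden copy.
golden = CopyNode("near-real-time copy of production")
golden.branch("DR replica")                     # read-only protection copy
golden.branch("backup catalog vault")           # searchable, read-only
golden.branch("test/dev clone", writable=True)  # writable, as test/dev needs
golden.branch("analytics clone", writable=True) # writable, for analytics
```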
The result is a secure storage system that leverages the efficiencies of snapshots to feed all
the processes that require access to copy data. A copy data management solution could lead to
a dramatic reduction in secondary data and improved operational efficiency.
Summary
An increasing number of processes in the data center use copy data, leading to unabated data
growth. The cost to procure and maintain copy data storage is out of control. Moreover, the
current copy data "infrastructure" is often just a hodge-podge of technology thrown at the
problem in hopes it will stem the tide. In reality, it makes matters worse.
It's time for IT to take a step back and consider a holistic solution to the problem: one where a
single process interfaces with production storage and then feeds all the consumers
of copy data. Doing so will create a more stable production environment and a more cost-effective
secondary environment that provides higher quality data.
Storage Switzerland is the leading storage analyst firm focused on the emerging storage categories of
memory-based storage (Flash), big data, virtualization, cloud computing and data protection. The firm is
widely recognized for its blogs, white papers, and videos on current technologies such as all-flash
arrays, deduplication, software-defined storage, backup appliances, and storage networking. The
"Switzerland" in the firm's name indicates our pledge to provide neutral analysis of the storage
marketplace, rather than focusing on a single vendor or approach.