Session A3
Paper #85
Disclaimer—This paper partially fulfills a writing requirement for first year (freshman) engineering students at the
University of Pittsburgh Swanson School of Engineering. This paper is a student, not a professional, paper. This paper is
based on publicly available information and may not provide complete analyses of all relevant data. If this paper is used for
any purpose other than these authors’ partial fulfillment of a writing requirement for first year (freshman) engineering
students at the University of Pittsburgh Swanson School of Engineering, the user does so at his or her own risk.
DNA DATA STORAGE: THE BIG DATA SOLUTION
Marlo Garrison, [email protected], 3pm Mena, Chris Kolimago, [email protected], 3pm Mena
Abstract—DNA data storage is a developing technology that
uses DNA base pair sequencing to store information inside of
a cell. Once DNA data storage is fully developed, it will have
the potential to optimize how companies store data. Instead
of using ones and zeros in binary coding to store information,
DNA data storage utilizes the four different nucleotide bases
of DNA. These synthetic DNA strands are stored in a cell for
future retrieval of data. DNA data storage is an important
breakthrough because it solves the accumulating big data
problem. Electronic storage devices are not improving as
quickly as the amount of data produced. Since one cell has
the capacity to store an entire genome, DNA as a medium of
storage can be used in place of current magnetic tape
technology. This paper describes the rapidly growing big
data problem and explains the technology behind DNA data
storage. It highlights the advantages of DNA data storage as
an industrial storage medium. The advancement of DNA data
storage is significant to any data user because it has the
potential to improve the efficiency and sustainability of the
digital world.
Key Words—Big data problem, Data Storage, DNA, DNA
data storage capacity and density, DNA sequencing,
Magnetic tape, Microsoft
DNA DATA STORAGE: BRINGING
INFORMATION TO LIFE
A developing technology that has the potential to
change the way companies store mass quantities of
information is deoxyribonucleic acid (DNA) data storage.
DNA data storage takes the need for large scale data storage
and combines it with the powerful properties of DNA to
create an efficient storage medium. The information is stored
directly in DNA, like a genetic code, and kept for future
retrieval. This medium can be the solution to the big data
problem that is growing exponentially. People are creating
more data than there is space to store it, so DNA data storage
presents a compressible alternative to current large scale data
storage methods, like magnetic tape. The technology and
bioengineering of DNA data storage involve base sequencing
and encoding, storage, and retrieval. Aspects like
compressibility and durability make DNA data storage an
advantageous alternative to current storage mediums, while
University of Pittsburgh Swanson School of Engineering 1
Submission Date 03.03.2017.
issues with reliability and error are obstacles for researchers
to tackle during this developmental stage. DNA data storage
will have an important societal impact because of advantages
in conserving environmental and economic resources, thus
making data storage sustainable for future generations.
Companies like Microsoft have taken on the role of
conducting research and development on DNA data storage
technology. The future of DNA data storage relies on solving
reliability issues and reducing cost so that companies and
institutions can utilize this medium efficiently and on a large
scale. DNA data storage has the potential to improve and
redefine information storage in the commercial and
institutional world.
THE BIG DATA PROBLEM AND CURRENT
SOLUTION
The Big Problem with Data
Big data is best understood by drawing a parallel
between its technical definition and the growing technology it
represents. Just as new information is being created all the
time, “Big Data” is a dynamic term. As explained by
researchers S. Kaisler et al. in a paper presented at the forty-sixth Hawaii International Conference on System Sciences,
originally, “Big Data” meant the volume of data that could
not be processed efficiently by traditional database methods
and tools. Now, it is defined as the amount of data just
beyond technology’s capability to store, manage and process
efficiently [1]. The key feature of both definitions is that big
data cannot be processed efficiently. Imagine a scenario in
which a major city is running out of space for building
parking lots, while the rate of new commuters and residents
coming into the city is constant. Now picture the same
scenario, but replace the cars with Excel spreadsheets and
parking spots with hard drives. However, Excel spreadsheets
are a small fraction of the data that is processed daily. IBM,
one of the world’s largest information technology companies,
claims that every day 2.5 quintillion bytes of data are created,
from sources such as posts to social media sites, digital
pictures and videos, purchase transaction records and cell
phone GPS signals, to name a few [2]. Even this short list
demonstrates how easy it is to take data management for
is greater than that of other storage mediums, such as Seagate’s sixty
TB solid state drive (2016), it is still not enough to manage
the zettabytes of data required [7]. The second disadvantage
is the size of each cartridge (Sony’s is 5.25 inches in length)
and the size of the data centers that hold and process the tapes
[6]. Think about how many of these buildings it would take to
manage 44 zettabytes of information.
The underlying issue is finding a medium that can
effectively manage the voluminous, inactive cold data. Due to
its capacity limitations, magnetic tape cannot accomplish this
task. Instead, the solution lies in DNA data storage, which
has the potential to replace unsustainable magnetic tape
facilities. The advantages over other current storage
mediums, and the features that make DNA look so promising,
are its compressibility and durability, which will be detailed
in later sections.
granted; think about how different life would be without
these basic functions.
International Data Corporation (IDC) Research Inc., a
source that has been tracking the terabytes of disk storage
shipped each year, claims that by 2020, the digital
universe will reach forty-four zettabytes, or forty-four trillion
gigabytes (GB) [3]. To put this into perspective, in the same
paper that defined big data, S. Kaisler et al. estimate that
current disk technology limits are about four terabytes, or
4,096 GB, per disk [1]. This means that it would require over
ten billion disk drives to contain all the world’s data in 2020;
the real issue is not that it is impossible to store all the
world’s data, but rather that data is produced at a much faster
rate than storage capacity grows. The solution lies in the following
question: What if there was a medium that could replace the
quality of soon-to-be obsolete storage technology and contain
the growing quantity of data?
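As a quick check on the arithmetic above, the following sketch divides the projected forty-four zettabytes by the 4,096 GB disk capacity quoted from [1]. It assumes, as the text does, that one zettabyte equals one trillion gigabytes; the function and constant names are illustrative only.

```python
# Rough check of the disk-count estimate, using only figures quoted in the text.
GB_PER_ZETTABYTE = 10**12   # assumption from the text: 44 ZB = 44 trillion GB
GB_PER_DISK = 4096          # about four terabytes per disk [1]


def disks_needed(zettabytes: int, gb_per_disk: int = GB_PER_DISK) -> int:
    """Return the number of whole disks needed to hold the given volume."""
    total_gb = zettabytes * GB_PER_ZETTABYTE
    # Ceiling division: a partly filled disk still counts as one disk.
    return -(-total_gb // gb_per_disk)


if __name__ == "__main__":
    print(f"{disks_needed(44):,} disks")
```

Under these assumptions the result is just under eleven billion disks, which is consistent with the "over ten billion" estimate.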
THE BUILDING BLOCKS OF DNA DATA
STORAGE
Magnetic Tape: The Current Technology
Society’s need for speed is a driving force behind many
technological advances, including the evolution of
computing. The Economist Journal demonstrates how data
management is no exception to this obsession. In the 1980s
and 1990s, magnetic tape cassettes and cartridges ruled large
computer systems. Then, fast spinning disks and even faster
spinning hard drives almost drove magnetic tape to
extinction. Today, users expect immediate storage and
retrieval of data with cloud computing systems, named for
their nearly complete elimination of hardware [4]. Recently,
however, the demand for more efficient “cold data” storage
has sparked the revitalization of magnetic tape, a technology
that was almost labeled obsolete.
Cold data is the type of data that lies in long-term
storage; it is the type of data that piles up over time and
remains mostly unused. A few key features of magnetic tape
give it an edge over other available storage mediums for
storing this type of data. First, tapes do not require constant
power to preserve the data held on them, whereas other
mediums that operate by rotating a disk may fail if power
stops being provided [4]. Storage mediums that, like magnetic
tape, do not have this disadvantage are important
because they address the sustainability issue of reducing
energy consumption and related costs. Second, as the
volume of data increases, so does the cost of storing it, and
tape keeps that cost lower than the alternatives. B. Rossi, Vitesse Media’s editorial
director, explains why magnetic tape is a better option than
cloud storage for storing large amounts of data. He says that
as the quantity increases, “the price per gigabyte per month
model [of cloud storage] falls down fairly quickly. Also, this
does not include the in-and-out charges that can really spike
the cost of cloud storage” [5].
On the other hand, as reported by S. Anthony of VR
World, an online technology news source, in his article
“Seagate’s new 60TB SSD is world’s largest,” the highest tape
cartridge capacity is 185 TB (185,000 GB) [6]. Although this
Base Sequencing and Encoding
A major component of DNA data storage is the
manipulation of base sequencing, which usually codes for the
genetic information in DNA, to code for any type of
information. According to Genetics Generation, the backbone
of the double helix of DNA is composed of a five-carbon
sugar and phosphate group. A nitrogenous base remains in
the center of the double helix. The four types of nitrogenous
bases are adenine, cytosine, guanine and thymine. Each base
has a complementary pair—adenine to thymine and cytosine
to guanine. These nitrogenous bases form hydrogen bonds
with their complementary base pair to hold the double helix
together. It is these base pairs that form long strings that code
for a certain genetic trait [8]. The current data storage
technique relies on binary coding, which uses ones and zeros
that code for information. Combining these two ideas creates
the fundamental concept behind DNA data storage. The
information needs to be encoded, or converted into a code
form, with some type of uniform key that can be applied on a
massive scale. Instead of using the ones and zeros of binary
coding, DNA data storage uses adenine, cytosine, guanine
and thymine as its encoding key. The European
Bioinformatics Institute (EBI) says, “binary zero could be
encoded by the bases adenine or cytosine, and a binary one
could be represented by guanine or thymine.” This technique
allowed EBI to successfully encode five files, including
Shakespeare's sonnets and a part of Martin Luther King's 'I
have a dream' speech [9]. The words of the sonnets were
matched to a certain pattern of nucleotide sequencing and
converted into a new language. This new base pair language
was then stored inside of DNA, and when retrieved, was
decoded for the same, original sonnet.
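The mapping quoted from EBI above (zero as adenine or cytosine, one as guanine or thymine) can be sketched in a few lines of Python. This is a minimal illustration, not EBI's actual encoding scheme: the rule of alternating between the two allowed bases by position is an assumption made here, chosen only to show how the extra freedom could be used to avoid repeating the same base twice in a row. All function names are hypothetical.

```python
# Toy encoder for the mapping quoted in the text: 0 -> A or C, 1 -> G or T.
# Assumption: the base is picked by position parity (not taken from [9]).
ZERO_BASES = ("A", "C")
ONE_BASES = ("G", "T")


def text_to_bits(text: str) -> str:
    """Convert text to a string of '0' and '1' characters (UTF-8, 8 bits per byte)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def encode_bits(bits: str) -> str:
    """Map each bit to a base, alternating between the two bases allowed for it."""
    strand = []
    for position, bit in enumerate(bits):
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        options = ZERO_BASES if bit == "0" else ONE_BASES
        strand.append(options[position % 2])
    return "".join(strand)


def decode_strand(strand: str) -> str:
    """Recover the bit string: A or C reads as 0, G or T reads as 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in strand)


def bits_to_text(bits: str) -> str:
    """Convert a bit string back to text, eight bits at a time."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


if __name__ == "__main__":
    strand = encode_bits(text_to_bits("sonnet"))
    print(strand)
    print(bits_to_text(decode_strand(strand)))
```

Because even positions draw from A or G and odd positions from C or T, two neighboring bases are never identical in this sketch, and decoding depends only on which pair a base belongs to.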
Next, the sequences are chemically manufactured; there
are many methods and options of manufacturing still being
developed. For example, one method the University of
Washington research uses involves a silicon-based DNA
synthesis substrate that creates different sequences in parallel
[10]. After manufacturing, the DNA is put in test tubes and
dehydrated for storage. Some researchers use the DNA of live
E. coli bacteria to store the sequenced data. Thus, the idea of
replacing binary coding with base pair sequencing brought
DNA data storage into the realm of possibility. These
nucleotides act as a new coding language that can properly
convert information into a code that can be stored and later
retrieved. Encoding with nucleotide base pairing creates a
storage medium that is compatible with a living structure.
Once the information is properly encoded, it must be stored
securely.
al., the retrieval process is broken into three parts: payload,
address, and primer [12]. The payload is the actual string of
nucleotide encoded information that a user wishes to retrieve.
The address is a block of base pairing that identifies the
location of the information. Additionally, much like a stop
and start codon in regular DNA processes which indicate
where a section begins and ends, two primers enclose the
information. Primers are nucleotide sequence placeholders at
the beginning and end of a string of information that allow
certain sections of information to be retrieved instead of an
entire string; see Figure 1 below.
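The strand layout described above can be modeled as a small data structure. The sketch below is an assumption-laden illustration: the order of the fields inside the strand, the fixed address length, and the example sequences are all hypothetical and are not taken from [12].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strand:
    """One stored strand: two primers enclosing a payload and an address."""
    primer_start: str
    payload: str      # the encoded information the user wants back
    address: str      # identifies where the payload belongs
    primer_end: str

    def sequence(self) -> str:
        # Assumed field order; the real layout may differ.
        return self.primer_start + self.payload + self.address + self.primer_end


def parse(sequence: str, primer_start: str, primer_end: str,
          address_len: int) -> Optional[Strand]:
    """Split a raw sequence back into its parts, or return None if the primers do not match."""
    if not (sequence.startswith(primer_start) and sequence.endswith(primer_end)):
        return None
    body = sequence[len(primer_start):len(sequence) - len(primer_end)]
    if len(body) < address_len:
        return None
    split = len(body) - address_len
    return Strand(primer_start, body[:split], body[split:], primer_end)


if __name__ == "__main__":
    strand = Strand("ACGT", "GGCCTTAA", "CAT", "TGCA")
    print(strand.sequence())
    print(parse(strand.sequence(), "ACGT", "TGCA", address_len=3))
```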
Storing Data in DNA
Another major component of DNA data storage is its
storage capacity. After the DNA has encoded the information,
it safely stores the information until retrieval. In an article
about how DNA storage works, G. Templeton describes the
components of storage. Chromatin is a DNA protein system
that makes up chromosomes. The chromatin fibers consist of
small proteins called histones and DNA. The DNA wraps
around nucleosome structures to condense, and then these
nucleosome structures, each consisting of about one hundred
and fifty base pairs, fold to produce the chromatin fiber that
coils to form a chromosome. This allows DNA to pack
tightly, but also unravel quickly for easy access to
information. Chromatin stores an entire genome of
information into a single chromosome, creating an incredibly
dense structure, and therefore a dense storage volume [11].
Similar to when the body requests access to these DNA
patches, encoded information can also be requested in the
same fashion. Like a chromosome, information will be stored
in these tight and coiled DNA protein structures, and the
structures will uncoil upon request. According to EBI, DNA,
with its ability to store an entire genome in its microscopic
region, has a storage capacity of approximately ten quintillion
bits per cubic centimeter [9]. DNA’s dense storage capacity is
an important advantage for commercial use that will be
analyzed further in the compressibility section. The DNA itself also
has optimal storage capability. As long as DNA is kept cool
and dry, it can remain intact for hundreds of thousands of
years. DNA data storage has the same durability as DNA
found in ancient fossils, a feature that will be discussed in the
section on durability. Using DNA data storage would make
the information stored as timeless as fossils themselves and
companies will benefit from storage mediums that can
withstand the test of time.
FIGURE 1 [12]
Nucleotide sequence structure input and output
This figure comes from the research paper of J. Bornholt et
al. and illustrates the nucleotide sequence structure with
primers, payload, and address. The article goes on to describe
the importance of this process because it leads to the ability
of random access. Random access means that sequencing can
be performed on a selected group of strands instead of the
entire structure [12]. Companies can create unique primer
nucleotide sequences to section off the data that is stored.
This is how data can be sorted and retrieved in an organized
manner. It would be inefficient to recall all stored information
each time it is retrieved. DNA’s random access capability
makes it comparable to a computer’s Random Access
Memory, or RAM. According to M. Nichols in her article
about how DNA data storage works, this means that data in
the operating system is stored in RAM for immediate access.
RAM evades waiting for a computer to retrieve information
from its hard drive, and essentially means the physical
location of the data does not matter, as long as it can be
immediately accessed [13]. The random access of DNA and
computer RAM are analogous in concept, which is why DNA
data storage can be an advantageous replacement.
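The idea of random access by primer can be illustrated with a simple filter over a pool of strands. This is only a conceptual sketch: in the laboratory the selection is done chemically, whereas here it is modeled as a string match, and the primer pairs and pool contents below are made-up examples.

```python
# Conceptual model of primer-based random access: keep only the strands
# enclosed by one chosen primer pair and return what lies between the primers.
# All sequences below are hypothetical examples.
POOL = [
    "ACGT" + "CCATG" + "TGCA",   # belongs to the first primer pair
    "GGAA" + "ATATC" + "TTCC",   # belongs to the second primer pair
    "ACGT" + "GATTA" + "TGCA",   # belongs to the first primer pair
]


def random_access(pool, primer_start, primer_end):
    """Return the inner content of every strand enclosed by the given primers."""
    selected = []
    for strand in pool:
        long_enough = len(strand) >= len(primer_start) + len(primer_end)
        if long_enough and strand.startswith(primer_start) and strand.endswith(primer_end):
            selected.append(strand[len(primer_start):len(strand) - len(primer_end)])
    return selected


if __name__ == "__main__":
    print(random_access(POOL, "ACGT", "TGCA"))
```

Only the strands tagged with the requested primer pair are returned, so the rest of the pool never has to be read.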
Random access functions via a polymerase chain
reaction (PCR), which involves DNA polymerase. Per Learn
Genetics, a genetic science learning center, “DNA
Polymerase is a naturally occurring complex of proteins
whose function is to copy a cell's DNA before it divides in
two” [14]. PCR amplifies a single copy of DNA, or in the
case of DNA data storage, a single piece of data, and
generates numerous copies. These copies can act as a backup
file for the information being stored. Random access not only
allows the user to retrieve specific data, but it also allows the
user to make thousands of copies. DNA data storage
encompasses existing data storage concepts, like RAM and
Retrieving Information
The most important aspect of the DNA data storage
system is its reliable retrieval of data. In a research paper
about DNA-Based Archival Storage System by J. Bornholt et
replication, but adds the advantages of using living material
to make a worthy alternative.
solution, Escherichia coli, and DNA in general, is the clear
winner.
Durability
ADVANTAGES AND DISADVANTAGES OF
DNA
In March of 2016, Massachusetts Institute of
Technology (MIT) researchers announced they had partially
reconstructed the genomes of ancient humans whose bones
had been in a Spanish cave for more than 400,000 years [17].
The amount of information on dinosaurs and other prehistoric
creatures that has been collected from fossils, a source of
ancient DNA, is massive; imagine the amount of new
information just waiting to be discovered at the bottom of the
sea, deep under the ground, or lying in a cave. These
discoveries are possible due to one of DNA’s most
advantageous properties: durability.
The half-life of DNA, which paleogeneticists M.
Allentoft at the University of Copenhagen and M. Bunce at
Murdoch University in Perth, Australia, have estimated to be
about 521 years, is largely based on two factors [18]. The
first is the condition that the DNA is stored in. According to a
simulation done by J. Bornholt et al., storage in lower
temperatures significantly increases the half-life of single
stranded DNA [12]. This makes sense when considering the
environment that ancient fossils rest in before they are
discovered, such as a cool, dark cave. The second factor is the
ease of replication that DNA possesses. By copying the
natural replication process of DNA, Harvard genetics
professor Dr. G. Church claims that billions of copies of
DNA can be reproduced quickly and cheaply; see Figure 2
below [19].
Compressibility
Recall the reasons for why magnetic tape will not be the
final solution for big data management, including storage
limitations, physical size and durability. Instead consider a
solution that lies on the inside. Recently, researchers have
been considering a certain type of bacteria that can be found
in the intestines of people and animals, called Escherichia
coli (E. coli); they discovered that E. coli contains a special
structure which can record and store information. According
to a research paper by F. Farzadfard and T. K. Lu, this
beneficial structure, known as a retron, is an extra single-stranded cache of DNA that exists separate from the cell’s
regular set of double stranded DNA. By introducing certain
chemicals, these single DNA strands preserve an accurate
record of the chemical environment of the cell; these
chemicals can increase the rate at which the single stranded
DNAs recombine with the main double stranded DNA [15].
In other words, such a chemical could trigger the single-stranded E. coli DNA to begin “recording” data, which the
bacteria would then store in its secure double-stranded DNA.
This is significant because any organism with retrons,
typically bacteria, will have these capabilities.
A few additional factors which give DNA its high
compressibility are size and three-dimensionality. An
analysis done by V. Zhirnov et al. of Nature Materials, an
international scientific journal, shows that an E. coli cell has a
volume of 1 µm³ (1 µm = 1 × 10⁻⁶ m) and 9.2 Mbit (1 Mbit =
125,000 bytes) of memory [16]. A volume of one cubic micrometer
is vastly smaller than that of a magnetic tape cartridge, which
contains dimensions measured in centimeters (1 cm = 0.01
m). Furthermore, the cell’s 9.2 Mbit of memory equates to
about 1,150,000 bytes, or 0.00115 GB in storage capacity.
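The cell volume and memory figures above imply a volumetric density, which the following sketch computes by unit conversion. It uses only the numbers quoted from [16]; the constant and function names are illustrative.

```python
# Unit-conversion check for the E. coli figures quoted in the text.
BITS_PER_CELL = 9.2e6          # 9.2 Mbit of memory per cell [16]
CELL_VOLUME_UM3 = 1.0          # cell volume in cubic micrometers [16]
UM3_PER_CM3 = (1e4) ** 3       # 1 cm = 10^4 micrometers, so 10^12 per cubic cm
CM3_PER_M3 = (1e2) ** 3        # 1 m = 100 cm, so 10^6 cubic cm per cubic m
BITS_PER_BYTE = 8


def bytes_per_cell() -> float:
    return BITS_PER_CELL / BITS_PER_BYTE


def bits_per_cm3() -> float:
    return BITS_PER_CELL / CELL_VOLUME_UM3 * UM3_PER_CM3


def bytes_per_m3() -> float:
    return bits_per_cm3() / BITS_PER_BYTE * CM3_PER_M3


if __name__ == "__main__":
    print(f"{bytes_per_cell():,.0f} bytes per cell")
    print(f"{bits_per_cm3():.2e} bit/cm^3")
    print(f"{bytes_per_m3():.2e} bytes/m^3")
```

The conversion gives about 9.2 × 10¹⁸ bits per cubic centimeter, which is on the order of 10¹⁹, and about 1.15 × 10²⁴ bytes per cubic meter before any rounding.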
Combining these properties gives a single E. coli cell an
astounding 3-dimensional memory density of ~10¹⁹ bit/cm³
(1.25 × 10²⁴ bytes/m³) [16]. The third dimension is important
compared to drive storage, which is mostly two-dimensional
and much larger in volume. Society will benefit from the
valuable compressibility that DNA data storage offers
because less space used for storing data means more room for
other investments. In an environmental aspect, resources used
to harbor data will be conserved by reducing the volume of
data storage centers. Instead of a large building or room—
requiring steel, cement, electricity, valuable time, and
workers to construct—companies may only need to dedicate
mere inches of space to data storage. Therefore, DNA data
storage shows tremendous potential as being a sustainable
solution for environmentally conscientious companies who
wish to conserve resources. In the race for a big data
FIGURE 2 [12]
Average number of copies of sequences required to
ensure a desired reliability over time.
This graph is presented in a research paper by J. Bornholt et
al. pertaining to a realistic architecture for a DNA-based
archival system. Using a simulation, they tested a new
encoding scheme and compared features such as redundancy,
reliability and density to current standards. Figure 2 shows
that in the simulation, researchers calculated that a 99.99%
chance of recovering encoded data only requires 10 copies of
strands for that value. Additionally, they estimated that after
1000 years, only 1000 copies will be required for the same
near-perfect rate [12]. Therefore, the durability of DNA
storage is based on the level of environmental regulation and
the amount of strand replication.
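The link between half-life and replication can be made concrete with a simple exponential decay model. The sketch below is not the simulation of J. Bornholt et al. and is not expected to reproduce the values in Figure 2; it assumes that each copy independently remains intact with a probability equal to the surviving fraction after a given time, using the 521-year half-life estimate from [18]. The function names are hypothetical.

```python
import math

HALF_LIFE_YEARS = 521.0   # half-life estimate quoted from [18]


def surviving_fraction(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of DNA expected to remain intact after the given time."""
    return 0.5 ** (years / half_life)


def prob_at_least_one_survives(copies: int, years: float,
                               half_life: float = HALF_LIFE_YEARS) -> float:
    """Chance that at least one of several independent copies is still intact."""
    p = surviving_fraction(years, half_life)
    return 1.0 - (1.0 - p) ** copies


def copies_needed(target: float, years: float,
                  half_life: float = HALF_LIFE_YEARS) -> int:
    """Smallest number of copies whose combined survival chance meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    p = surviving_fraction(years, half_life)
    if p >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for years in (521, 1042, 2084):
        print(years, copies_needed(0.9999, years))
```

In this toy model, a longer storage time or a shorter half-life (for example, from warmer storage) raises the number of copies needed for the same reliability, which mirrors the qualitative trend described above.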
Further research conducted on the durability of DNA
supports the aforementioned claim. Dr. R. Grass, a lecturer at
ETH Zurich, ran analyses on DNA encapsulated in silica. His
team exposed the encapsulated DNA to 70 degrees Celsius
for a week, which is equivalent to four decay half-lives, and
could recover the original data without final error. From this
data, he predicted that digital information could be stored
encapsulated in silica at the Global Seed Vault (-18 °C) for
over two million years [20]. These findings are phenomenal
when compared to magnetic tape, which must be regulated
similarly and lasts only a few decades at most. Like magnetic
tape, as explained in the Magnetic Tape section, DNA is
more cost effective for storing data than other
conventional mediums because it does not require constant
power. In fact, according to V. Zhirnov et al. in Nature
Materials, hard disks, flash memory and RAM use up to 0.04
watts per GB, whereas cellular DNA requires less than 1 × 10⁻¹⁰
watts per GB [16]. As mentioned before, lower
temperatures dramatically increase the durability of DNA. As
opposed to DNA, which can last millions of years under ideal
conditions, magnetic tape data must be copied onto a new
cartridge every few decades; regardless of whether cartridge
manufacturers use recycled or raw materials, this process
raises costs. Therefore, the sustainable solution involves a
tradeoff between storage costs of DNA and replication costs
of traditional storage mediums. Data management companies
can continue to consume resources, squander money on
copying data more frequently, and produce more waste, or
invest in DNA data storage, despite the high price tag, which
leads to advancements in the technology and future
reductions in costs. Clearly, DNA shows much greater
potential for preserving data than any other storage mediums,
due to its long half-life, its ability to survive under a wide
range of conditions and its ease of replication.
advantage of this by selling bases for up to ten cents each
[17]. To estimate the total cost of synthetic materials,
consider that a gene consists of at least thousands of base
pairs.
Another factor that must be considered is accuracy,
which J. Bornholt et al. determined by comparing two
synthesis methods. In the first, each sequence was
synthesized individually; this resulted in low error rates, but
higher cost and time. In the second, they used a more
common array-based synthesis process; the results were less
accurate, but the costs were much lower; see Figure 3 below
[12].
FIGURE 3 [12]
Distribution of DNA errors from two synthesis
technologies.
Using the same simulation referenced below Figure 2 in the
Durability section, these researchers were able to gather data
on the cost and error for each method. According to the
results displayed in Figure 3, they found that synthesis error
was about the same for both methods, and could therefore be
ignored [12]. This meant that most of the error had to come
from sequencing.
Retrieval error is another issue that DNA data storage
faces. Selecting the best retrieval process involves a choice
between reliability and density, which J. Bornholt et al. have
demonstrated by comparing two types of encoding processes.
The first process, called Naive encoding, has the lower
reliability because it does not involve duplicating strands.
The second is called Goldman encoding; it is tunable,
meaning that its redundancy can be manipulated [12]. The
advantage of Goldman encoding lies in the fact that the more
copies of DNA that are made, the lower the likelihood that every
single one will fail. Intuitively, Naive encoding should
produce more errors because there are no backups in place. The
simulation proved this hypothesis and brought a new issue to
light: additional redundancy increases accuracy, but affects
density negatively [12]. Clearly, each copy will decrease the
amount of space available for new information. Before DNA
data storage is ready for large scale industrial use, choices
must be made involving each step of the process.
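The tension between redundancy and density described above can be shown with a toy calculation. This is not an implementation of the Naive or Goldman encodings from [12]; it simply assumes that each copy of a piece of data fails independently with the same probability and that every extra copy uses the same amount of space. The failure rate used in the example is an arbitrary illustrative value.

```python
def redundancy_tradeoff(per_copy_failure: float, max_copies: int):
    """List (copies, chance that all copies fail, relative density) for 1..max_copies."""
    rows = []
    for copies in range(1, max_copies + 1):
        all_fail = per_copy_failure ** copies    # every independent copy must fail
        relative_density = 1.0 / copies          # same data spread over more material
        rows.append((copies, all_fail, relative_density))
    return rows


if __name__ == "__main__":
    # 0.1 is a made-up per-copy failure rate, used only for illustration.
    for copies, all_fail, density in redundancy_tradeoff(0.1, 4):
        print(f"{copies} copies: loss chance {all_fail:.4f}, density {density:.2f}x")
```

Each added copy lowers the chance of losing the data but also lowers the amount of new information that fits in the same space, which is the choice the paragraph above describes.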
Obstacles to Industrialization
Despite the numerous advantages that DNA has over
other conventional storage mediums, there are some
drawbacks to consider. Since DNA data storage is a relatively
new technology, costs are high and funding is devoted to
research and development. So far, per MIT Technology
Review, the greatest costs come from synthesizing and
retrieving the DNA [17]. Net Industries, an educational
source and provider of licensed material originally published in
print form, describes synthesis as a process in which strands
of nucleic acids are created and preserved for sequencing
[21]. While this process occurs naturally in all living
organisms, a man-made version is possible through genetic
engineering. Companies like Twist Bioscience take
us to better understand what kinds of errors crop up and how
to deal with them” [10]. For now, researchers will continue
with the developmental stage of this technology, better
understand its weaknesses, and make improvements to
progress it towards the goal of commercial use.
DEVELOPING RESEARCH
As many companies race to develop DNA data storage
technology first, cementing themselves as the pioneers,
Microsoft seems to be in the lead. According to an MIT
Technology Review article by A. Rosenblum, on July 7, 2016,
Microsoft reported that it had written roughly 200 megabytes of
data into DNA with the help of researchers from the
University of Washington (UW). The stored information
included War and Peace and ninety-nine other literary
classics. This is the largest amount of data successfully stored
into DNA and retrieved to date, and Microsoft continues to
develop this technology; see Figure 4 below.
FUTURE APPLICATIONS
Usage and Industrialization
Once DNA data storage is fully functional and ready for
the world, the possibilities of its use are huge. If Microsoft is
the first to develop the operational technology, it can patent it
and sell it to the public like any other software program. This
would push DNA data storage towards a commercial future,
being sold to big companies through Microsoft for large scale
data storage; this would condense and replace the huge
magnetic tape data storage centers. Any company or
institution with massive amounts of information to keep track
of will benefit from DNA data storage. Take the University
of Pittsburgh, for example, with its 35,000 students, hundreds
of thousands of alumni, and records to keep for everyone,
past, present, and future. A university like Pitt can invest in
DNA data storage technology and, in the future, store these
records in DNA. The government could also benefit from
DNA data storage technology because of its security. As of
now, there is no way to hack into a living piece of material,
so agencies like the Central Intelligence Agency
could store top secret information securely in DNA. This
information would then have to be kept safe from being
physically stolen, but could not be compromised like a
computer.
Another example of DNA data storage’s use is in
hospitals. It is vital that hospitals keep accurate records of
thousands of patients, whether someone visits once or visits
daily, for decades at a time. This information accumulates
rapidly. Hospitals already have the storage capabilities to
house the DNA data storage in a cold and dry environment,
just like any other DNA sample the hospital handles. The
space saving compressibility of DNA gives hospitals a
sustainable method for storing data. As detailed in the
compressibility section, DNA data storage technology can
store patient records in a marginal fraction of the space.
Drawing a similar conclusion as before, extra space in the
hospital could be used for additional check-up and operating
rooms; therefore, this reduction of patient records would lead
to more lives being saved. In addition, once DNA data
storage technology decreases in price, it will be more cost
efficient for hospitals to convert to DNA storage. This cost
reduction means more investment can be made in medical
technology, research, and the welfare of the people instead of
the maintenance of data.
FIGURE 4 [17]
200 MB DNA storage by Microsoft
In the above image from A. Rosenblum’s article about
Microsoft’s research, the pink smear on the tip of the tube is
the 200 megabytes of stored data Microsoft created, with a
pencil tip to reference its size [17].
This project required about 1.5 billion bases and was
synthesized by Twist Bioscience—a company Microsoft has
partnered with to develop DNA data storage. As of the 200-megabyte development, commercially available synthesis
costs as little as 0.04 cents per base. Furthermore, reading one
million bases costs roughly a penny [17]. Although Microsoft
has not disclosed its expenditures for this research, the 200-megabyte storage is estimated to cost roughly 60 million
dollars. If Microsoft successfully develops DNA data storage
technology for commercial use and can sell it as a large data
storage medium to other companies, the hope is that the price
of reading and writing DNA will plunge in the upcoming
years of development. DNA data storage must cost less than
magnetic tape for companies to invest in it. As Microsoft and
other companies continue to develop DNA data storage
research, it is projected that the cost of commercial DNA data
storage will significantly decrease, much like most biotech
research in this rapidly developing era. Per L. Ceze, the UW
project lead researcher, “This experiment led to several
important breakthroughs that improved our ability to
manipulate more complex pools of synthetic DNA. It allowed
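The cost figures quoted for the 200-megabyte project can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constants are the values quoted from [17], a megabyte is assumed to mean one million bytes, and the function and variable names are hypothetical.

```python
# Figures quoted in the text from [17]; all names here are illustrative.
BASES_SYNTHESIZED = 1_500_000_000    # "about 1.5 billion bases"
DATA_BYTES = 200 * 10**6             # 200 megabytes (assumes 1 MB = 10^6 bytes)
SYNTHESIS_CENTS_PER_BASE = 0.04      # "as little as .04 cents per base"
READ_CENTS_PER_MILLION_BASES = 1.0   # "roughly a penny" per million bases


def synthesis_cost_dollars(bases, cents_per_base):
    """Cost in dollars of writing `bases` nucleotides at a per-base price."""
    return bases * cents_per_base / 100


def reading_cost_dollars(bases, cents_per_million_bases):
    """Cost in dollars of reading `bases` nucleotides once."""
    return bases / 1_000_000 * cents_per_million_bases / 100


def bits_per_base(data_bytes, bases):
    """Average number of stored data bits carried by each synthesized base."""
    return data_bytes * 8 / bases


write_cost = synthesis_cost_dollars(BASES_SYNTHESIZED, SYNTHESIS_CENTS_PER_BASE)
read_cost = reading_cost_dollars(BASES_SYNTHESIZED, READ_CENTS_PER_MILLION_BASES)
density = bits_per_base(DATA_BYTES, BASES_SYNTHESIZED)

print(f"synthesis: ${write_cost:,.0f}")
print(f"one full read: ${read_cost:,.2f}")
print(f"density: {density:.2f} bits per base")
```

Under these assumptions, synthesis at 0.04 cents per base comes to about 600,000 dollars, and a single read of all 1.5 billion bases comes to about 15 dollars, with slightly more than one bit of stored data per base. The total is very sensitive to the unit price: at 4 cents per base the same calculation gives 60 million dollars.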
The results in Figure 5 are based on the simulation by J.
Bornholt et al. introduced below Figure 2 and referenced in
Figure 3. As shown in the graph in Figure 5, only a small
percentage of strands were at the target length of 120
nucleotides after synthesis. Therefore, focusing on reducing
the number of shortened strands will decrease both the cost
of synthesis and sequencing error.
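The waste caused by truncated strands can be illustrated with a short sketch. It assumes, as described in the text, that only strands at the 120-nucleotide target length are usable for data recovery. The strand lengths in the example are invented for illustration and are not the data behind Figure 5; the function name is hypothetical.

```python
TARGET_LENGTH = 120  # target strand length in nucleotides, per the text


def summarize_strands(lengths, target=TARGET_LENGTH):
    """Report how much of a synthesized pool is usable at the target length.

    Strands of any other length are treated as excluded from data recovery,
    so every base in them counts as wasted synthesis.
    """
    if not lengths:
        raise ValueError("lengths must contain at least one strand")
    full_length_strands = sum(1 for n in lengths if n == target)
    total_bases = sum(lengths)
    wasted_bases = sum(n for n in lengths if n != target)
    return {
        "full_length_fraction": full_length_strands / len(lengths),
        "wasted_base_fraction": wasted_bases / total_bases,
    }


# Hypothetical pool: these lengths are made up, not measured values.
example_lengths = [120, 120, 95, 60, 118, 120, 40, 110]
summary = summarize_strands(example_lengths)
print(summary)
```

In this made-up pool, three of the eight strands reach the target length, so more than half of the synthesized bases are paid for but never used. Raising the full-length fraction lowers the effective cost per stored bit without changing the price per base.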
Random access retrieval methods have been shown to
increase efficiency by allowing a user to retrieve specific
data, rather than reading the entire sequence. Currently, the
best way to reduce retrieval error is by creating numerous
copies of the sequence. Therefore, an effective way to reduce
overall cost would be to reduce the cost involved with
replicating DNA. As previously described under the retrieval
section, PCR is a laboratory technique for the exponential
replication of DNA, modeled on the replication that occurs
naturally in the body. Roche Molecular Diagnostics, a medical
technology provider, explains that in the lab, a PCR
reaction can be induced, creating more than one billion copies
of the target DNA in thirty to forty cycles [23]. Focused
research on this aspect can bring down the cost of
synthesizing and reading DNA.
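The figure of one billion copies in thirty to forty cycles follows from exponential growth. The sketch below assumes an idealized reaction in which every cycle exactly doubles the number of copies; real reactions are less efficient, so this is an upper bound rather than a lab protocol. The function names are hypothetical.

```python
def copies_after(cycles, starting_copies=1):
    """Copies present after `cycles` rounds, assuming perfect doubling."""
    return starting_copies * 2 ** cycles


def cycles_to_exceed(target_copies, starting_copies=1):
    """Smallest number of cycles whose idealized yield exceeds the target."""
    cycles = 0
    while copies_after(cycles, starting_copies) <= target_copies:
        cycles += 1
    return cycles


print(cycles_to_exceed(1_000_000_000))  # cycles needed to pass one billion
print(copies_after(30))                 # idealized yield after 30 cycles
print(copies_after(40))                 # idealized yield after 40 cycles
```

With perfect doubling, a single strand passes one billion copies at cycle 30, since 2^30 is 1,073,741,824. The extra cycles in the quoted range of thirty to forty leave room for reactions that fall short of doubling each round.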
Advances such as these demonstrate how the field of
genetics is continuously improving; this is a good predictor
for the future success of DNA data storage. Semiconductor
Research Corporation director and chief scientist V. Zhirnov
is optimistic that the cost of synthesis can be orders of
magnitude below today's levels. “There are no fundamental
reasons why it's high,” he says [9]. DNA data storage will not
become the dominant storage medium overnight. N.
Goldman, of the European Bioinformatics Institute, sums it up
perfectly. He says, “Six magnitudes is no big deal in
genomics. You just wait a bit” [9]. The greatest
improvements in DNA data storage will come with time,
especially by focusing on lowering rates of error and costs.
As scientists take the time to develop the technology, costs
will decrease. The payoff for all this time, money and effort:
one final solution to the big data problem.
In conclusion, once DNA data storage is fully
developed, it can change the way the world stores
information. It can combat the big data problem with its
advantageous compressibility and durability once the
technology overcomes the obstacles of industrialization. Its
environmental and economic sustainability will aid the next
generation of data and benefit the society that depends on it.
In the future, it may be normal to find information at a
hospital or university stored inside live DNA. DNA data
storage is important to society because it will increase the
capacity of data storage and the efficiency of the digital
world, and important to engineering because it will lead to
new opportunities in the fields of bioengineering and
genetics.
Bridging the Gap
Despite the obstacles on the road to success, DNA
research is continuously being conducted by organizations like
Microsoft, Twist Bioscience, and the United States
Government. These backers are helping to reduce the
issues with DNA data storage and prepare it for use in
industry. Since the completion of the Human Genome Project
in 2003, the cost of sequencing DNA has dropped from $100
million per genome to less than $10,000 per genome [19].
The National Human Genome Research Institute describes
how, as of 2015, the most common routine for sequencing
DNA involves generating a “draft”, or unfinished, sequence,
whose quality is dependent on the amount of base
redundancy [22]. Since these drafts are unfinished and can
utilize information from previously created sequences, the
cost is much lower. Reducing the cost of sequencing is
especially important, since sequencing currently accounts for
such a large portion of the error. A significant aspect to
consider in sustainability is the economic cost of the product.
In order for DNA data storage to be a sustainable alternative
to current media, its cost must be competitive with the
alternatives. A
slightly higher cost could be justified as a tradeoff for
benefits like compressibility, reliability, and other
advantageous qualities that make DNA data storage
worthwhile to a company. Initially, it would fall on the
company to decide whether the advantages of DNA data
storage are worth a higher cost. However, as the cost of DNA
technology, including sequencing, continues to decrease,
DNA data storage will become a sustainable economic option
that more efficiently conserves resources and energy, making
it the smartest choice.
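The scale of the drop in sequencing cost cited above can be made concrete with a short calculation. The sketch below uses the two per-genome costs quoted from [19]. The twelve-year span is an assumption taken from the 2003 completion date and the 2015 publication date of [19], and the function names are hypothetical.

```python
import math

OLD_COST_DOLLARS = 100_000_000  # per genome, quoted in the text
NEW_COST_DOLLARS = 10_000       # upper bound on the newer cost, quoted in the text
YEARS_ELAPSED = 12              # assumption: 2003 to 2015


def fold_reduction(old_cost, new_cost):
    """How many times cheaper the new cost is than the old one."""
    return old_cost / new_cost


def orders_of_magnitude(old_cost, new_cost):
    """The same reduction expressed as a power of ten."""
    return math.log10(old_cost / new_cost)


def average_yearly_factor(old_cost, new_cost, years):
    """Constant yearly factor that would produce the reduction over `years`."""
    return (old_cost / new_cost) ** (1 / years)


print(fold_reduction(OLD_COST_DOLLARS, NEW_COST_DOLLARS))
print(orders_of_magnitude(OLD_COST_DOLLARS, NEW_COST_DOLLARS))
print(round(average_yearly_factor(OLD_COST_DOLLARS, NEW_COST_DOLLARS, YEARS_ELAPSED), 2))
```

The quoted figures amount to at least a 10,000-fold reduction, or four orders of magnitude. Spread evenly over twelve years, that corresponds to the cost a little more than halving every year.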
J. Bornholt et al. describe a process for increasing the
efficiency of synthesis that can also lead to lower costs.
Low-cost synthesis methods can produce truncated strands, which
standard sequencing processes exclude [12]. Since these
truncated strands are ignored, they are not used for data
recovery and represent waste; see Figure 5 below.
FIGURE 5 [12]
Variations in strand length
SOURCES
[1] S. Kaisler, F. Armour, J. A. Espinosa and W. Money.
“Big Data: Issues and Challenges Moving Forward.” 2013
46th Hawaii International Conference on System Sciences.
Accessed 01.10.2017.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6479953&isnumber=6479821
[2] “What is big data?” IBM. Accessed 02.22.2017.
https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
[3] “Executive Summary.” IDC. 04.2014. Accessed
02.22.2017.
https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
[4] Economist.com. “Tape rescues big data.” The Economist.
11.26.2013. Accessed 02.23.2017.
http://www.economist.com/blogs/babbage/2013/09/information-storage
[5] B. Rossi. “The rise, fall and re-rise of magnetic tape.”
Information Age. 01.15.2015. Accessed 02.23.2017.
http://www.information-age.com/rise-fall-and-re-rise-magnetic-tape-123458854/
[6] “Sony’s New 185 TB Tape Drive is Not a Cassette.” VR
World. 05.05.2014. Accessed 02.23.2017.
https://vrworld.com/2014/05/05/sonys-new-185-tb-tape-drive-cassette/
[7] S. Anthony. “Seagate’s new 60TB SSD is world’s
largest.” Ars Technica. 08.11.2016. Accessed 03.02.2017.
https://arstechnica.com/gadgets/2016/08/seagate-unveils-60tb-ssd-the-worlds-largest-hard-drive/
[8] “Nucleotides and Bases.” Genetics Generation. Accessed
01.10.2017. http://knowgenetics.org/nucleotides-and-bases/
[9] A. Extance. “How DNA could store all the world’s data.”
Nature.com. 08.31.2016. Accessed 01.10.2017.
http://www.nature.com/news/how-dna-could-store-all-the-world-s-data-1.20496
[10] J. Langston. “UW, Microsoft Researchers Break Record
for DNA Data Storage.” University of Washington.
07.07.2016. Accessed 02.27.2017.
http://www.washington.edu/news/2016/07/07/uw-microsoft-researchers-break-record-for-dna-data-storage/
[11] G. Templeton. “How DNA Data Storage Works.”
ExtremeTech. 07.08.2016. Accessed 03.01.2017.
https://www.extremetech.com/extreme/231343-how-dna-data-storage-works-as-scientists-create-the-first-dna-ram
[12] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G.
Seelig and K. Strauss. “A DNA-Based Archival Storage
System.” Computer Science & Engineering University of
Washington. Accessed 01.10.2017.
https://homes.cs.washington.edu/~bornholt/papers/dnastorage-asplos16.pdf
[13] M. Nichols. “How DNA Data Storage Works.”
Colocation America. 09.12.2016. Accessed 02.20.2017.
https://www.colocationamerica.com/blog/dna-data-storage
[14] “PCR.” Learn Genetics. Accessed 02.27.2017.
http://learn.genetics.utah.edu/content/labs/pcr/
[15] J. Hewitt. “MIT can now use E. coli DNA tape recorders
for living and replicating data storage.” 11.17.2014. Accessed
03.01.2017.
https://www.extremetech.com/extreme/194116-mit-can-now-use-e-coli-dna-to-store-up-to-455-exabytes-of-self-replicating-data-per-gram
[16] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church
and W. L. Hughes. “Table 1: Comparison between baseline
memory technologies and DNA memory.” 03.23.2016.
Accessed 03.01.2017.
http://www.nature.com/nmat/journal/v15/n4/fig_tab/nmat4594_T1.html
[17] A. Rosenblum. “Microsoft Reports a Big Leap Forward
for DNA Data Storage.” MIT Technology Review.
06.07.2016. Accessed 03.01.2017.
https://www.technologyreview.com/s/601851/microsoft-reports-a-big-leap-forward-for-dna-data-storage/
[18] M. Kaplan. “DNA has a 521-year half-life.” Nature.com.
10.10.2012. Accessed 03.01.2017.
http://www.nature.com/news/dna-has-a-521-year-half-life-1.11555
[19] E. Munson. “This Harvard scientist is coding an entire
movie onto DNA.” Public Radio International. 08.10.2015.
Accessed 03.01.2017.
https://www.pri.org/stories/2015-08-10/harvard-scientist-coding-entire-movie-dna
[20] R. N. Grass, R. Heckel, M. Pudda, D. Paunescu and W.
J. Stark. “Robust Chemical Preservation of Digital
Information on DNA in Silica with Error-Correcting Codes.”
02.04.2015. Accessed 03.01.2017.
http://onlinelibrary.wiley.com/doi/10.1002/anie.201411378/full
[21] “DNA Synthesis.” Net Industries. Accessed 03.02.2017.
http://science.jrank.org/pages/2133/DNA-Synthesis.html
[22] “The Cost of Sequencing a Human Genome.” National
Human Genome Research Institute. 06.06.2016. Accessed
03.02.2017. https://www.genome.gov/sequencingcosts/
[23] “What is PCR?” Roche Molecular Diagnostics.
Accessed 03.02.2017.
https://molecular.roche.com/innovation/pcr/what-is-pcr/
ACKNOWLEDGMENTS
We would like to acknowledge our peer advisor, Grace
Bova, for her work and collaboration in the creation of this
research paper. A special thanks to the Swanson School of
Engineering for this blessed learning opportunity. Finally,
thank you to our families for their support.