Session A3
Paper #85

Disclaimer: This paper partially fulfills a writing requirement for first year (freshman) engineering students at the University of Pittsburgh Swanson School of Engineering. This paper is a student, not a professional, paper. This paper is based on publicly available information and may not provide complete analyses of all relevant data. If this paper is used for any purpose other than these authors’ partial fulfillment of a writing requirement for first year (freshman) engineering students at the University of Pittsburgh Swanson School of Engineering, the user does so at his or her own risk.

DNA DATA STORAGE: THE BIG DATA SOLUTION

Marlo Garrison, 3pm Mena
Chris Kolimago, 3pm Mena

Abstract: DNA data storage is a developing technology that uses DNA base pair sequencing to store information inside of a cell. Once DNA data storage is fully developed, it will have the potential to optimize how companies store data. Instead of using ones and zeros in binary coding to store information, DNA data storage utilizes the four different nucleotide bases of DNA. These synthetic DNA strands are stored in a cell for future retrieval of data. DNA data storage is an important breakthrough because it solves the accumulating big data problem. Electronic storage devices are not improving as quickly as the amount of data produced. Since one cell has the capacity to store an entire genome, DNA as a medium of storage can be used in place of current magnetic tape technology. This paper describes the rapidly growing big data problem and explains the technology behind DNA data storage. It highlights the advantages of DNA data storage as an industrial storage medium. The advancement of DNA data storage is significant to any data user because it has the potential to improve the efficiency and sustainability of the digital world.
Key Words: Big data problem, Data Storage, DNA, DNA data storage capacity and density, DNA sequencing, Magnetic tape, Microsoft

University of Pittsburgh Swanson School of Engineering
Submission Date 03.03.2017

DNA DATA STORAGE: BRINGING INFORMATION TO LIFE

A developing technology that has the potential to change the way companies store mass quantities of information is deoxyribonucleic acid (DNA) data storage. DNA data storage takes the need for large scale data storage and combines it with the powerful properties of DNA to create an efficient storage medium. The information is stored directly in DNA, like a genetic code, and kept for future retrieval. This medium can be the solution to the big data problem that is growing exponentially. People are creating more data than there is space to store it, so DNA data storage presents a compressible alternative to current large scale data storage methods, like magnetic tape. The technology and bioengineering of DNA data storage involve base sequencing and encoding, storage, and retrieval. Aspects like compressibility and durability make DNA data storage an advantageous alternative to current storage mediums, while issues with reliability and error are obstacles for researchers to tackle during this developmental stage. DNA data storage will have an important societal impact because of advantages in conserving environmental and economic resources, thus making data storage sustainable for future generations. Companies like Microsoft have taken on the role of conducting research and development on DNA data storage technology. The future of DNA data storage relies on solving reliability issues and reducing cost so that companies and institutions can utilize this medium efficiently and on a large scale. DNA data storage has the potential to improve and redefine information storage in the commercial and institutional world.
THE BIG DATA PROBLEM AND CURRENT SOLUTION

The Big Problem with Data

Big data is best understood by drawing a parallel between its technical definition and the growing technology it represents. Just as new information is being created all the time, “Big Data” is a dynamic term. As explained by researchers S. Kaisler et al. in a paper presented at the forty-sixth Hawaii International Conference on System Sciences, originally, “Big Data” meant the volume of data that could not be processed efficiently by traditional database methods and tools. Now, it is defined as the amount of data just beyond technology’s capability to store, manage and process efficiently [1]. The key feature of both definitions is that big data cannot be processed efficiently. Imagine a scenario in which a major city is running out of space for building parking lots, while the rate of new commuters and residents coming into the city is constant. Now picture the same scenario, but replace the cars with Excel spreadsheets and parking spots with hard drives. However, Excel spreadsheets are a small fraction of the data that is processed daily. IBM, one of the world’s largest information technology companies, claims that every day 2.5 quintillion bytes of data are created, from sources such as posts to social media sites, digital pictures and videos, purchase transaction records and cell phone GPS signals, to name a few [2]. Even this short list demonstrates how easy it is to take data management for granted; think about how different life would be without these basic functions.

International Data Corporation (IDC) Research Inc., a source that has been tracking the terabytes of disk storage shipped each year, claims that by 2020, the digital universe will reach forty-four zettabytes, or forty-four trillion gigabytes (GB) [3]. To put this into perspective, in the same paper that defined big data, S. Kaisler et al. estimate that current disk technology limits are about four terabytes, or 4,096 GB, per disk [1]. This means that it would require over ten billion disk drives to contain all the world’s data in 2020; the real issue is not that it is impossible to store all the world’s data, but rather that data is produced at a much faster rate than storage mediums. The solution lies in the following question: What if there was a medium that could replace the quality of soon-to-be obsolete storage technology and contain the growing quantity of data?

Magnetic Tape: The Current Technology

Society’s need for speed is a driving force behind many technological advances, including the evolution of computing. The Economist Journal demonstrates how data management is no exception to this obsession. In the 1980s and 1990s, magnetic tape cassettes and cartridges ruled large computer systems. Then, fast spinning disks and even faster spinning hard drives almost drove magnetic tape to extinction. Today, users expect immediate storage and retrieval of data with cloud computing systems, named for their nearly complete elimination of hardware [4]. Recently, however, the demand for more efficient “cold data” storage has sparked the revitalization of magnetic tape, a technology that was almost labeled obsolete. Cold data is the type of data that lies in long-term storage; it is the type of data that piles up over time and remains mostly unused. A few key features of magnetic tape give it an edge over other available storage mediums for storing this type of data. First, tapes do not require constant power to preserve the data held on them, whereas other mediums that operate by rotating a disk may fail if power stops being provided [4]. Like magnetic tape, storage mediums that do not have this disadvantage are important because they address the sustainability issue of reducing energy consumption and related costs. Furthermore, as the volume of data increases, so does the cost of storing it, and cost is another area where magnetic tape has an edge. B. Rossi, Vitesse Media’s editorial director, explains why magnetic tape is a better option than cloud storage for storing large amounts of data. He says that as the quantity increases, “the price per gigabyte per month model [of cloud storage] falls down fairly quickly. Also, this does not include the in-and-out charges that can really spike the cost of cloud storage” [5].

On the other hand, as reported by S. Anthony of VR World, an online technological news source, in his article, “Seagate’s new 60TB SSD is world’s largest,” the highest tape cartridge capacity is 185 TB (185,000 GB) [6]. Although this is greater than other storage mediums, such as Seagate’s sixty TB solid state drive (2016), it is still not enough to manage the zettabytes of data required [7]. The second disadvantage is the size of each cartridge (Sony’s is 5.25 inches in length) and the size of the data centers that hold and process the tapes [6]. Think about how many of these buildings it would take to manage 44 zettabytes of information.

The underlying issue is finding a medium that can effectively manage the voluminous, inactive cold data. Due to its capacity limitations, magnetic tape cannot accomplish this task. Instead, the solution lies in DNA data storage, which has the potential to replace unsustainable magnetic tape facilities. The advantages over other current storage mediums, and the features that make DNA look so promising, are its compressibility and durability, which will be detailed in later sections.

THE BUILDING BLOCKS OF DNA DATA STORAGE

Base Sequencing and Encoding

A major component of DNA data storage is the manipulation of base sequencing, which usually codes for the genetic information in DNA, to code for any type of information. According to Genetics Generation, the backbone of the double helix of DNA is composed of a five-carbon sugar and phosphate group.
A nitrogenous base remains in the center of the double helix. The four types of nitrogenous bases are adenine, cytosine, guanine and thymine. Each base has a complementary pair: adenine to thymine and cytosine to guanine. These nitrogenous bases form hydrogen bonds with their complementary base pair to hold the double helix together. It is these base pairs that form long strings that code for a certain genetic trait [8]. The current data storage technique relies on binary coding, which uses ones and zeros that code for information. Combining these two ideas creates the fundamental concept behind DNA data storage. The information needs to be encoded, or converted into a code form, with some type of uniform key that can be applied on a massive scale. Instead of using the ones and zeros of binary coding, DNA data storage uses adenine, cytosine, guanine and thymine as its encoding key. The European Bioinformatics Institute (EBI) says, “binary zero could be encoded by the bases adenine or cytosine, and a binary one could be represented by guanine or thymine.” This technique allowed EBI to successfully encode five files, including Shakespeare's sonnets and a part of Martin Luther King's 'I have a dream' speech [9]. The words of the sonnets were matched to a certain pattern of nucleotide sequencing and converted into a new language. This new base pair language was then stored inside of DNA, and when retrieved, was decoded back into the original sonnets. Next, the sequences are chemically manufactured; many manufacturing methods and options are still being developed. For example, one method used in University of Washington research involves a silicon-based DNA synthesis substrate that creates different sequences in parallel [10]. After manufacturing, the DNA is put in test tubes and dehydrated for storage. Some researchers use the DNA of live E. coli bacteria to store the sequenced data.
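The EBI mapping quoted above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the encoding that EBI or any research group actually used: the function names are hypothetical, and the rule for choosing between the two bases allowed for each bit (alternating by position, so that no base appears twice in a row) is an arbitrary choice made here for illustration.

```python
# Illustrative sketch only; names and the tie-breaking rule are assumptions.
ZERO_BASES = "AC"  # binary 0 -> adenine or cytosine (per the EBI description)
ONE_BASES = "GT"   # binary 1 -> guanine or thymine


def text_to_bits(text: str) -> str:
    """Turn text into a string of 0s and 1s (UTF-8, 8 bits per byte)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def bits_to_bases(bits: str) -> str:
    """Encode each bit as one base, alternating the two options by position."""
    return "".join(
        (ZERO_BASES if bit == "0" else ONE_BASES)[i % 2]
        for i, bit in enumerate(bits)
    )


def bases_to_bits(bases: str) -> str:
    """Decode: A or C reads as 0, G or T reads as 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in bases)


message = "To be"
strand = bits_to_bases(text_to_bits(message))
recovered = bits_to_text(bases_to_bits(strand))
print(strand)
print(recovered)
```

Because each bit becomes exactly one base, a message of n bytes yields a strand of 8n bases in this sketch. Real schemes add addressing and error protection on top of a mapping like this one.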
Thus, the idea of replacing binary coding with base pair sequencing brought DNA data storage into the realm of possibility. These nucleotides act as a new coding language that can properly convert information into a code that can be stored and later retrieved. Encoding with nucleotide base pairing creates a storage medium that is compatible with a living structure. Once the information is properly encoded, it must be stored securely.

al. the retrieval process is broken into three parts: payload, address, and primer [12]. The payload is the actual string of nucleotide encoded information that a user wishes to retrieve. The address is a block of base pairing that identifies the location of the information. Additionally, much like a stop and start codon in regular DNA processes which indicate where a section begins and ends, two primers enclose the information. Primers are nucleotide sequence placeholders at the beginning and end of a string of information that allow certain sections of information to be retrieved instead of an entire string; see Figure 1 below.

Storing Data in DNA

Another major component of DNA data storage is its storage capacity. After the DNA has encoded the information, it safely stores the information until retrieval. In an article about how DNA storage works, G. Templeton describes the components of storage. Chromatin is a DNA protein system that makes up chromosomes. The chromatin fibers consist of small proteins called histones and DNA. The DNA wraps around nucleosome structures to condense, and then these nucleosome structures, each consisting of about one hundred and fifty base pairs, fold to produce the chromatin fiber that coils to form a chromosome. This allows DNA to pack tightly, but also unravel quickly for easy access to information. Chromatin stores an entire genome of information into a single chromosome, creating an incredibly dense structure, and therefore a dense storage volume [11].
Similar to when the body requests access to these DNA patches, encoded information can also be requested in the same fashion. Like a chromosome, information will be stored in these tight and coiled DNA protein structures, and the structures will uncoil upon request. According to EBI, DNA, with its ability to store an entire genome in its microscopic region, has a storage capacity of approximately ten quintillion bits per cubic centimeter [9]. DNA’s dense storage capacity is an important advantage for commercial use that will be analyzed further in the section on compressibility. The DNA itself also has optimal storage capability. As long as DNA is kept cool and dry, it can remain intact for hundreds of thousands of years. DNA data storage has the same durability as DNA found in ancient fossils, a feature that will be discussed in the section on durability. Using DNA data storage would make the information stored as timeless as fossils themselves, and companies will benefit from storage mediums that can withstand the test of time.

FIGURE 1 [12] Nucleotide sequence structure input and output

This figure comes from the research paper of J. Bornholt et al. and illustrates the nucleotide sequence structure with primers, payload, and address. The article goes on to explain that this structure matters because it enables random access. Random access means that sequencing can be performed on a selected group of strands instead of the entire structure [12]. Companies can create unique primer nucleotide sequences to section off the data that is stored. This is how data can be sorted and retrieved in an organized manner. It would be inefficient to recall all stored information each time it is retrieved. DNA’s random access capability makes it comparable to a computer’s Random Access Memory, or RAM. According to M. Nichols in her article about how DNA data storage works, this means that data in the operating system is stored in RAM for immediate access.
RAM avoids waiting for a computer to retrieve information from its hard drive, and essentially means the physical location of the data does not matter, as long as it can be immediately accessed [13]. The random access of DNA and computer RAM are analogous in concept, which is why DNA data storage can be an advantageous replacement. Random access in DNA works through a polymerase chain reaction (PCR), which involves DNA polymerase. Per Learn Genetics, a genetic science learning center, “DNA Polymerase is a naturally occurring complex of proteins whose function is to copy a cell's DNA before it divides in two” [14]. PCR amplifies a single copy of DNA, or in the case of DNA data storage, a single piece of data, and generates numerous copies. These copies can act as a backup file for the information being stored. Random access not only allows the user to retrieve specific data, but it also allows the user to make thousands of copies. DNA data storage encompasses existing data storage concepts, like RAM and replication, but adds the advantages of using living material to make a worthy alternative.

Retrieving Information

The most important aspect of the DNA data storage system is its reliable retrieval of data. In a research paper about DNA-Based Archival Storage System by J. Bornholt et

solution, Escherichia coli, and DNA in general, is the clear winner.

ADVANTAGES AND DISADVANTAGES OF DNA

Durability

In March of 2016, Massachusetts Institute of Technology (MIT) researchers announced they had partially reconstructed the genomes of ancient humans whose bones had been in a Spanish cave for more than 400,000 years [17]. The amount of information on dinosaurs and other prehistoric creatures that has been collected from fossils, a type of ancient DNA, is massive; imagine the amount of new information just waiting to be discovered at the bottom of the sea, deep under the ground, or lying in a cave.
These discoveries are possible due to one of DNA’s most advantageous properties: durability. The half-life of DNA, which paleogeneticists M. Allentoft at the University of Copenhagen and M. Bunce at Murdoch University in Perth, Australia, have estimated to be about 521 years, is largely based on two factors [18]. The first is the condition that the DNA is stored in. According to a simulation done by J. Bornholt et al., storage in lower temperatures significantly increases the half-life of single-stranded DNA [12]. This makes sense when considering the environment that ancient fossils rest in before they are discovered, such as a cool, dark cave. The second factor is the ease of replication that DNA possesses. By copying the natural replication process of DNA, Harvard genetics professor Dr. G. Church claims that billions of copies of DNA can be reproduced quickly and cheaply; see Figure 2 below [19].

Compressibility

Recall the reasons why magnetic tape will not be the final solution for big data management, including storage limitations, physical size and durability. Instead consider a solution that lies on the inside. Recently, researchers have been considering a certain type of bacteria that can be found in the intestines of people and animals, called Escherichia coli (E. coli); they discovered that E. coli contains a special structure which can record and store information. According to a research paper by F. Farzadfard and T. K. Lu, this beneficial structure, known as a retron, is an extra single-stranded cache of DNA that exists separate from the cell’s regular set of double-stranded DNA. When certain chemicals are introduced, these single DNA strands preserve an accurate record of the chemical environment of the cell; these chemicals can increase the rate at which the single-stranded DNAs recombine with the main double-stranded DNA [15]. In other words, such a chemical could trigger the single-stranded E.
coli DNA to begin “recording” data, which the bacteria would then store in its secure double-stranded DNA. This is significant because any organism with retrons, typically bacteria, will have these capabilities. A few additional factors which give DNA its high compressibility are size and three-dimensionality. An analysis done by V. Zhirnov et al. of Nature Materials, an international scientific journal, shows that an E. coli cell has a volume of 1 µm^3 (1 µm = 1×10^-6 m) and 9.2 Mbit (1 Mbit = 125,000 bytes) of memory [16]. A volume of one cubic micrometer is vastly smaller than that of a magnetic tape cartridge, whose dimensions are measured in centimeters (1 cm = 0.01 m). Furthermore, the cell’s 9.2 Mbit of memory equates to about 1,150,000 bytes, or 0.00115 GB in storage capacity. Combining these properties gives a single E. coli cell an astounding 3-dimensional memory density of ~10^19 bit/cm^3 (1.25×10^24 bytes/m^3) [16]: since 1 µm^3 is 10^-12 cm^3, 9.2×10^6 bits in that volume comes to about 9.2×10^18, or roughly 10^19, bits per cubic centimeter. The third dimension is important compared to drive storage, which is mostly two-dimensional and much larger in volume. Society will benefit from the compressibility of DNA data storage because less space used for storing data means more room for other investments. From an environmental standpoint, resources used to harbor data will be conserved by reducing the volume of data storage centers. Instead of a large building or room (requiring steel, cement, electricity, valuable time, and workers to construct), companies may only need to dedicate mere inches of space to data storage. Therefore, DNA data storage shows tremendous potential as a sustainable solution for environmentally conscientious companies who wish to conserve resources. In the race for a big data

FIGURE 2 [12] Average number of copies of sequences required to ensure a desired reliability over time.

This graph is presented in a research paper by J. Bornholt et al. pertaining to a realistic architecture for a DNA-based archival system.
Using a simulation, they tested a new encoding scheme and compared features such as redundancy, reliability and density to current standards. Figure 2 shows that in the simulation, researchers calculated that a 99.99% chance of recovering encoded data only requires 10 copies of strands for that value. Additionally, they estimated that after 1000 years, only 1000 copies will be required for the same near-perfect rate [12]. Therefore, the durability of DNA storage is based on the level of environmental regulation and the amount of strand replication.

Further research conducted on the durability of DNA supports the aforementioned claim. Dr. R. Grass, a lecturer at ETH Zurich, ran analyses on DNA encapsulated in silica. His team exposed the encapsulated DNA to 70 degrees Celsius for a week, which is equivalent to four decay half-lives, and could recover the original data without final error. From this data, he predicted that digital information could be stored encapsulated in silica at the Global Seed Vault (-18 °C) for over two million years [20]. These findings are phenomenal when compared to magnetic tape, which must be regulated similarly and lasts only a few decades at most.

Like the magnetic tape described in the Magnetic Tape section, DNA is more cost effective for storing data than other conventional mediums because it does not require constant power. In fact, according to V. Zhirnov et al. of Nature Materials, hard disks, flash memory and RAM use up to 0.04 watts per GB, whereas cellular DNA requires less than 1×10^-10 watts per GB [16]. As mentioned before, lower temperatures dramatically increase the durability of DNA. As opposed to DNA, which can last millions of years under ideal conditions, magnetic tape data must be copied onto a new cartridge every few decades; regardless of whether cartridge manufacturers use recycled or raw materials, this process raises costs.
Therefore, the sustainable solution involves a tradeoff between storage costs of DNA and replication costs of traditional storage mediums. Data management companies can continue to consume resources, squander money on copying data more frequently, and produce more waste, or invest in DNA data storage, despite the high price tag, which leads to advancements in the technology and future reductions in costs. Clearly, DNA shows much greater potential for preserving data than any other storage medium, due to its long half-life, its ability to survive under a wide range of conditions and its ease of replication.

advantage of this by selling bases for up to ten cents each [17]. To estimate the total cost of synthetic materials, consider that a gene consists of at least thousands of base pairs. Another factor that must be considered is accuracy, which J. Bornholt et al. determined by comparing two synthesis methods. In the first, each sequence was synthesized individually; this resulted in low error rates, but higher cost and time. In the second, they used a more common array-based synthesis process; the results were less accurate, but the costs were much lower; see Figure 3 below [12].

FIGURE 3 [12] Distribution of DNA errors from two synthesis technologies.

Using the same simulation referenced below Figure 2 in the Durability section, these researchers were able to gather data on the cost and error for each method. According to the results displayed in Figure 3, they found that synthesis error was about the same for both methods, and could therefore be ignored [12]. This meant that most of the error had to come from sequencing. Retrieval error is another issue that DNA data storage faces. Selecting the best retrieval process involves a choice between reliability and density, which J. Bornholt et al. have demonstrated by comparing two types of encoding processes.
The first process, called Naive encoding, has the lower reliability because it does not involve duplicating strands. The second is called Goldman encoding; it is tunable, meaning that its redundancy can be manipulated [12]. The advantage of Goldman encoding lies in the fact that the more copies of DNA that are made, the lower the likelihood that every single one will fail. Intuitively, Naive encoding should produce more errors because there are no backups in place. The simulation supported this hypothesis and brought a new issue to light: additional redundancy increases accuracy, but affects density negatively [12]. Clearly, each copy will decrease the amount of space available for new information. Before DNA data storage is ready for large scale industrial use, choices must be made involving each step of the process.

Obstacles to Industrialization

Despite the numerous advantages that DNA has over other conventional storage mediums, there are some drawbacks to consider. Since DNA data storage is a relatively new technology, costs are high and funding is devoted to research and development. So far, per MIT Technology Review, the greatest costs come from synthesizing and retrieving the DNA [17]. Net Industries, an educational source and provider of licensed material originally published in print form, describes synthesis as a process in which strands of nucleic acids are created and preserved for sequencing [21]. While this process occurs naturally in all living organisms, a man-made version is possible through genetic engineering. Companies like Twist Bioscience take

us to better understand what kinds of errors crop up and how to deal with them” [10]. For now, researchers will continue with the developmental stage of this technology, better understand its weaknesses, and make improvements to progress it towards the goal of commercial use.
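The tradeoff between redundancy and density described in this section for Naive and Goldman encoding can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Goldman scheme or the simulation of J. Bornholt et al.: it models plain replication, assumes each copy is lost independently, and uses a hypothetical 10% loss probability per copy chosen only for illustration.

```python
# Simplified model (an assumption, not the published encoding): every piece
# of data is stored as `copies` identical strands, each lost independently.

def recovery_probability(p_copy_lost: float, copies: int) -> float:
    """Chance that at least one of the copies survives."""
    return 1.0 - p_copy_lost ** copies


def relative_density(copies: int) -> float:
    """Fraction of the stored strands that carries unique data."""
    return 1.0 / copies


P_COPY_LOST = 0.10  # hypothetical per-copy loss probability

for copies in (1, 2, 4, 8):
    print(
        f"copies={copies}  "
        f"recovery={recovery_probability(P_COPY_LOST, copies):.8f}  "
        f"density={relative_density(copies):.3f}"
    )
```

Under these assumptions, each added copy raises the chance of recovery while cutting the share of the medium available for new information, which is the same direction of tradeoff the simulation brought to light.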
DEVELOPING RESEARCH

As many companies race to develop DNA data storage technology first, cementing themselves as the pioneers, Microsoft seems to be in the lead. According to an MIT Technology Review article by A. Rosenblum, on July 7, 2016, Microsoft reported that it had written roughly 200 megabytes of data into DNA with the help of researchers from the University of Washington (UW). The stored information included War and Peace and ninety-nine other literary classics. This is the largest amount of data successfully stored into DNA and retrieved to date, and Microsoft continues to develop this technology; see Figure 4 below.

FUTURE APPLICATIONS

Usage and Industrialization

Once DNA data storage is fully functional and ready for the world, the possibilities of its use are huge. If Microsoft is the first to develop the operational technology, it can patent it and sell it to the public like any other software program. This would push DNA data storage towards a commercial future, being sold to big companies through Microsoft for large scale data storage; this would condense and replace the huge magnetic tape data storage centers. Any company or institution with massive amounts of information to keep track of will benefit from DNA data storage. Take the University of Pittsburgh, for example, with its 35,000 students, hundreds of thousands of alumni, and records to keep for everyone, past, present, and future. A university like Pitt can invest in DNA data storage technology and, in the future, store these records in DNA. The government could also benefit from DNA data storage technology because of its security. As of now, there is no way to hack into a living piece of material, so agencies like the Central Intelligence Agency could store top secret information securely in DNA. This information would then have to be kept safe from being physically stolen, but could not be compromised like a computer. Another example of DNA data storage’s use is in hospitals.
It is vital that hospitals keep accurate records of thousands of patients, whether someone visits once or visits daily, for decades at a time. This information accumulates rapidly. Hospitals already have the storage capabilities to house the DNA data storage in a cold and dry environment, just like any other DNA sample the hospital handles. The space saving compressibility of DNA gives hospitals a sustainable method for storing data. As detailed in the compressibility section, DNA data storage technology can store patient records in a marginal fraction of the space. Drawing a similar conclusion as before, extra space in the hospital could be used for additional check-up and operating rooms; therefore, this reduction of patient records would lead to more lives being saved. In addition, once DNA data storage technology decreases in price, it will be more cost efficient for hospitals to convert to DNA storage. This cost reduction means more investment can be made in medical technology, research, and the welfare of the people instead of the maintenance of data.

FIGURE 4 [17] 200 MB DNA storage by Microsoft

In the above image from A. Rosenblum’s article about Microsoft’s research, the pink smear on the tip of the tube is the 200 megabytes of stored data Microsoft created, with a pencil tip to reference its size [17]. This project required about 1.5 billion bases and was synthesized by Twist Bioscience, a company Microsoft has partnered with to develop DNA data storage. As of the 200-megabyte development, commercially available synthesis costs as little as 0.04 cents per base. Furthermore, reading one million bases costs roughly a penny [17]. Although Microsoft has not disclosed its expenditures for this research, the 200-megabyte storage is estimated to cost roughly 60 million dollars.
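As a rough worked check, the per-base prices cited in this paper can be multiplied out for a project of about 1.5 billion bases. The Python sketch below is only back-of-the-envelope arithmetic: the helper name is hypothetical, the upper price is the ten cents per base this paper attributes to Twist Bioscience [17], and the calculation assumes every base is synthesized once and read once.

```python
from fractions import Fraction

BASES = 1_500_000_000  # about 1.5 billion bases [17]


def dollars(cents_per_base: Fraction, bases: int) -> Fraction:
    """Total cost in dollars for a given price per base in cents."""
    return cents_per_base * bases / 100


# Prices quoted in this paper; exact fractions avoid rounding surprises.
synthesis_low = dollars(Fraction(4, 100), BASES)    # 0.04 cents per base
synthesis_high = dollars(Fraction(10), BASES)       # 10 cents per base
read_once = dollars(Fraction(1, 1_000_000), BASES)  # 1 cent per million bases

print(f"synthesis at 0.04 cents/base: ${float(synthesis_low):,.0f}")
print(f"synthesis at 10 cents/base:   ${float(synthesis_high):,.0f}")
print(f"reading every base once:      ${float(read_once):,.2f}")
```

At these rates, synthesis falls between about 600 thousand and 150 million dollars, while a single read costs about 15 dollars, which shows why synthesis dominates the bill. Neither bound reproduces the roughly 60 million dollar estimate quoted above; the assumptions behind that figure are not stated, so this sketch does not attempt to reconstruct it.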
If Microsoft successfully develops DNA data storage technology for commercial use and can sell it as a large data storage medium to other companies, the hope is that the price of reading and writing DNA will plunge in the upcoming years of development. DNA data storage must cost less than magnetic tape for companies to invest in it. As Microsoft and other companies continue to develop DNA data storage research, it is projected that the cost of commercial DNA data storage will significantly decrease, much like most biotech research in this rapidly developing era. Per L. Ceze, the UW project lead researcher, “This experiment led to several important breakthroughs that improved our ability to manipulate more complex pools of synthetic DNA. It allowed

The results from Figure 5 are based on the simulation by J. Bornholt et al. introduced below Figure 2 and referenced in Figure 3. As shown in the graph above, a small percentage of strands were at the target length of 120 nucleotides after synthesis. Therefore, a focus on reducing the number of shortened strands will decrease both the cost of synthesis and sequencing error. Random access retrieval methods have been shown to increase efficiency by allowing a user to retrieve specific data, rather than reading the entire sequence. Currently, the best way to reduce retrieval error is by creating numerous copies of the sequence. Therefore, an effective way to reduce overall cost would be to reduce the cost involved with replicating DNA. As previously described under the retrieval section, PCR replicates DNA exponentially, using the same DNA polymerase that copies DNA naturally in cells. Medical technology provider Roche Molecular Diagnostics explains that in the lab, a PCR reaction can be induced, creating more than one billion copies of the target DNA in thirty to forty cycles [23]. Focus and research on this aspect can bring down the cost of synthesizing and reading DNA.
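The figure of more than one billion copies in thirty to forty cycles follows from doubling. The Python sketch below works through the arithmetic under an idealized assumption that every strand is copied exactly once per cycle; real reactions are less than perfectly efficient, and the function names are hypothetical.

```python
def copies_after(cycles: int, starting_copies: int = 1) -> int:
    """Idealized PCR: the number of copies doubles every cycle."""
    return starting_copies * 2 ** cycles


def cycles_to_reach(target_copies: int, starting_copies: int = 1) -> int:
    """Smallest number of ideal cycles that yields at least target_copies."""
    cycles = 0
    while copies_after(cycles, starting_copies) < target_copies:
        cycles += 1
    return cycles


print(copies_after(30))        # copies from one strand after 30 ideal cycles
print(cycles_to_reach(10**9))  # ideal cycles needed to pass one billion
```

Starting from a single strand, thirty ideal doublings give 2^30, or 1,073,741,824 copies, so the lower end of the quoted range is the theoretical minimum for passing one billion.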
Advances such as these demonstrate how the field of genetics is continuously improving; this is a good predictor for the future success of DNA data storage. Semiconductor Research Corporation director and chief scientist V. Zhirnov is optimistic that the cost of synthesis can be orders of magnitude below today's levels. “There are no fundamental reasons why it's high,” he says [9].

DNA data storage will not become the dominant storage medium overnight. N. Goldman, of the European Bioinformatics Institute, sums it up well: “Six magnitudes is no big deal in genomics. You just wait a bit” [9]. The greatest improvements in DNA data storage will come with time, especially by focusing on lowering error rates and costs. As scientists take the time to develop the technology, costs will decrease. The payoff for all this time, money, and effort is a lasting solution to the big data problem.

In conclusion, once DNA data storage is fully developed, it can change the way the world stores information. It can combat the big data problem with its advantageous compressibility and durability once the technology overcomes the obstacles of industrialization. Its environmental and economic sustainability will aid the next generation of data and benefit the society that depends on it. In the future, it may be normal to find information at a hospital or university stored inside live DNA. DNA data storage is important to society because it will increase the capacity of data storage and the efficiency of the digital world, and important to engineering because it will lead to new opportunities in the fields of bioengineering and genetics.

Bridging the Gap

Despite the obstacles on the road to success, research on DNA is continuously being conducted by companies like Microsoft and Twist Bioscience and by the United States government. These benefactors are helping to reduce the issues with DNA data storage and prepare it for use in industry.
Since the completion of the Human Genome Project in 2003, the cost of sequencing DNA has dropped from $100 million per genome to less than $10,000 per genome [19]. The National Human Genome Research Institute describes how, as of 2015, the most common routine for sequencing DNA involves generating a “draft,” or unfinished, sequence, whose quality depends on the amount of base redundancy [22]. Since these drafts are unfinished and can use information from previously created sequences, the cost is much lower. Reducing the cost of sequencing is especially important, since sequencing currently accounts for such a large portion of error.

A significant aspect to consider in sustainability is the economic cost of the product. For DNA data storage to be a sustainable alternative to current mediums, its cost must be lower than that of the competition. A slightly higher cost could be justified as a tradeoff for benefits like compressibility, reliability, and other advantageous qualities that make DNA data storage worthwhile to a company. Initially, it would fall on each company to decide whether the advantages of DNA data storage are worth a higher cost. However, as the cost of DNA technology, including sequencing, continues to decrease, DNA data storage will become a sustainable economic option that conserves resources and energy more efficiently, making it the smartest choice.

J. Bornholt et al. describe a process for increasing the efficiency of synthesis that can also lead to lower costs. Low-cost synthesis methods can produce truncated strands, which standard sequencing processes exclude [12]. Since these truncated strands are ignored, they are not used for data recovery and represent waste; see Figure 5 below.

FIGURE 5 [12]
Variations in strand length

[13] M. Nichols. “How DNA Data Storage Works.” Colocation America. 09.12.2016. Accessed 02.20.2017. https://www.colocationamerica.com/blog/dna-data-storage
[14] “PCR.” Learn Genetics.
Accessed 02.27.2017. http://learn.genetics.utah.edu/content/labs/pcr/
[15] J. Hewitt. “MIT can now use E. coli DNA tape recorders for living and replicating data storage.” 11.17.2014. Accessed 03.01.2017. https://www.extremetech.com/extreme/194116mit-can-now-use-e-coli-dna-to-store-up-to-455-exabytes-ofself-replicating-data-per-gram
[16] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church and W. L. Hughes. “Table 1: Comparison between baseline memory technologies and DNA memory.” 03.23.2016. Accessed 03.01.2017. http://www.nature.com/nmat/journal/v15/n4/fig_tab/nmat4594_T1.html
[17] A. Rosenblum. “Microsoft Reports a Big Leap Forward for DNA Data Storage.” MIT Technology Review. 06.07.2016. Accessed 03.01.2017. https://www.technologyreview.com/s/601851/microsoftreports-a-big-leap-forward-for-dna-data-storage/
[18] M. Kaplan. “DNA has a 521-year half-life.” Nature.com. 10.10.2012. Accessed 03.01.2017. http://www.nature.com/news/dna-has-a-521-year-half-life1.11555
[19] E. Munson. “This Harvard scientist is coding an entire movie onto DNA.” Public Radio International. 08.10.2015. Accessed 03.01.2017. https://www.pri.org/stories/2015-0810/harvard-scientist-coding-entire-movie-dna
[20] R. N. Grass, R. Heckel, M. Pudda, D. Paunescu and W. J. Stark. “Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes.” 02.04.2015. Accessed 03.01.2017. http://onlinelibrary.wiley.com/doi/10.1002/anie.201411378/full
[21] “DNA Synthesis.” Net Industries. Accessed 03.02.2017. http://science.jrank.org/pages/2133/DNA-Synthesis.html
[22] “The Cost of Sequencing a Human Genome.” National Human Genome Research Institute. 06.06.2016. Accessed 03.02.2017. https://www.genome.gov/sequencingcosts/
[23] “What is PCR?” Roche Molecular Diagnostics. Accessed 03.02.2017. https://molecular.roche.com/innovation/pcr/what-is-pcr/

SOURCES

[1] S. Kaisler, F. Armour, J. A. Espinosa and W. Money. “Big Data: Issues and Challenges Moving Forward.”
2013 46th Hawaii International Conference on System Sciences. Accessed 01.10.2017. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6479953&isnumber=6479821
[2] “What is big data?” IBM. Accessed 02.22.2017. https://www-01.ibm.com/software/data/bigdata/what-is-bigdata.html
[3] “Executive Summary.” IDC. 04.2014. Accessed 02.22.2017. https://www.emc.com/leadership/digitaluniverse/2014iview/executive-summary.htm
[4] Economist.com. “Tape rescues big data.” The Economist. 11.26.2013. Accessed 02.23.2017. http://www.economist.com/blogs/babbage/2013/09/information-storage
[5] B. Rossi. “The rise, fall and re-rise of magnetic tape.” Information Age. 01.15.2015. Accessed 02.23.2017. http://www.information-age.com/rise-fall-and-re-risemagnetic-tape-123458854/
[6] “Sony’s New 185 TB Tape Drive is Not a Cassette.” VR World. 05.05.2014. Accessed 02.23.2017. https://vrworld.com/2014/05/05/sonys-new-185-tb-tapedrive-cassette/
[7] S. Anthony. “Seagate’s new 60TB SSD is world’s largest.” Ars Technica. 08.11.2016. Accessed 03.02.2017. https://arstechnica.com/gadgets/2016/08/seagate-unveils60tb-ssd-the-worlds-largest-hard-drive/
[8] “Nucleotides and Bases.” Genetics Generation. Accessed 01.10.2017. http://knowgenetics.org/nucleotides-and-bases/
[9] A. Extance. “How DNA could store all the world’s data.” Nature.com. 08.31.2016. Accessed 01.10.2017. http://www.nature.com/news/how-dna-could-store-all-theworld-s-data-1.20496
[10] J. Langston. “UW, Microsoft Researchers Break Record for DNA Data Storage.” University of Washington. 07.07.2016. Accessed 02.27.2017. http://www.washington.edu/news/2016/07/07/uw-microsoftresearchers-break-record-for-dna-data-storage/
[11] G. Templeton. “How DNA Data Storage Works.” ExtremeTech. 07.08.2016. Accessed 03.01.2017. https://www.extremetech.com/extreme/231343-how-dnadata-storage-works-as-scientists-create-the-first-dna-ram
[12] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig and K. Strauss.
“A DNA-Based Archival Storage System.” Computer Science & Engineering, University of Washington. Accessed 01.10.2017. https://homes.cs.washington.edu/~bornholt/papers/dnastorage-asplos16.pdf

ACKNOWLEDGMENTS

We would like to acknowledge our peer advisor, Grace Bova, for her work and collaboration in the creation of this research paper. A special thanks to the Swanson School of Engineering for this blessed learning opportunity. Finally, thank you to our families for their support.