Download big and small data - Raremark: about us

January 2017 Tackling rare disease with big and small data raremark.com 1 Big data promises to create valuable insights in rare disease. Technologies such as next-generation sequencing and natural-language processing, alongside whole-exome analyses and other novel scientific approaches, are helping clinicians treat patients who previously had no therapeutic options. At the same time, deep interrogation of smaller patient samples can provide information of great benefi t to developers of orphan drugs. Realizin g the full potential of big data will require models that can also integrate intelligence from datasets that are small, writes Pete Chan 2 raremark.com In 2012, a group of researchers the predictions of a panel of leading organized a crowdsourcing competition ALS clinicians, and they both picked to shed new light on amyotrophic lateral up prize money of US$20,000 (Küffner sclerosis (ALS), a rare neurodegenerative et al., 2015; Zach et al., 2015). One disease. Participants were given three algorithm discriminated perfectly mon ths of data from ALS patients who between individuals with slow and fast- had taken part in clinical trials and progressing ALS: potentially useful asked to predict how the disease would insight for the stratification of patient progress in the same individuals over the cohorts in clinical trials. The organizers following nine months. More than 1,000 of the competition, known as the ALS teams from over 60 countries stepped Prediction Prize, estimated that by up to the challenge. Two winning groups modeling the progression of disease in created algorithms that outperformed individual ALS patients, the two Image by geralt on Pixabay raremark.com 3 Table 1: Clinical trials in PRO-ACT 1 2 3 4 Clinical Clinical Clinical Clinical trial trial trial trial of of of of arimoclomol in ALS creatine in ALS celecoxib in ALS gabapentin in ALS 10 11 12 13 14 6 Clinical trial of lithium in combination with riluzole in ALS Clinical trial of rHBDNF in ALS 7 Clinical trial of rHCNTF in ALS 16 8 Clinical trial of riluzole in ALS 17 9 Clinical trial of riluzole in the treatment of advanced ALS 5 15 Clinical trial of TCH346 in ALS Clinical trial of talampanel in ALS Clinical trial of topiramate in ALS French prospective observational study in ALS Clinical trial of vitamin E in ALS Clinical trial of xaliproden in ALS: first Phase III trial Clinical trial of xaliproden in ALS: second Phase III trial Unpublished clinical trial of xaliproden in advanced ALS Abbreviation: PRO-ACT = Pooled Resource Open-Access ALS Clinical Trials database Source: Atassi et al., 2014 Table 2: PRO-ACT in numbers Data category No. subjects No. records No. values Adverse events 8,628 74,545 748,566 ALSFRS(R) 6,844 60,775 791,473 Concomitant medications 7,656 111,848 376,098 Death report 4,633 4,634 8,033 Demographics 10,723 10,723 39,107 Family history 1,007 1,071 2,452 Forced vital capacity 8,848 48,856 200,200 Laboratory data 8,342 2,445,059 9,659,191 Riluzole use 8,817 8,817 17,633 Slow vital capacity 2,717 9,525 25,532 Subject ALS history 9,394 12,058 35,967 Treatment group 9,640 9,640 16,830 Vital signs 9,973 72,422 717,715 Abbreviations: ALS = amyotrophic lateral sclerosis; ALSFRS(R); = revised ALS Functional Rating Scale; PRO-ACT = Pooled Resource Open-Access ALS Clinical Trials database Source: PRO-ACT, 2011 4 raremark.com algorithms cou ld help reduce the number of patients required for a hypothetical clinical study by up to 20%. (In diseases with a wide range in natural rates of progression, cl inical trials need larger numbers of patients to help discern the effects of the investigational drug.) Great expectations from big data The pioneering application of machinelearning algorithms to ALS research was PRO-ACT and the research projects it made possible by PRO-ACT, an open- has enabled illustrate how big data access repository of longitudinal clinical approaches can be applied to biomedical trial data (Atassi et al., 2014; PRO-ACT, research in rare disease. They will 2011). At the time, it held data on more inspire those who are convinced of the than 8,600 people who had taken part in role of big data in the orphan drug Phase II/III ALS studies between 1990 sector, not just in clinical trials but also and 2010. Rival teams were given some R&D more broadly. Their excitement is sample data to design their algorithms, understandable. On the one hand, they before putting these to work on the ALS are faced with the familiar challenges of Prediction Prize dataset. PRO-ACT was rare disease research, including: s mall officially launched in December 2012 patient cohorts; poor understanding of with eight million data points, growing epidemiology; lack of natural history since then to more than 10 million (see studies; and the variable quality of Tables 1&2). More than 400 researchers, patient registries. On the other, they including representatives of around 40 are bombarded with a steady stream pharma companies, requested access to of health-related, big data success PRO-ACT within two years of its launch stories, with benefits ranging from (Zach et al., 2015). the prediction of patient responses to drugs and side effects through to better patient segmentation and the delivery of personalized medicine. They hope big data might do for rare disease what it has delivered in common medical conditions. raremark.com 5 Delve a little deeper, though, and you’re wrote (Mayer-Schönberger and Cukier, just as likely to find skeptics who argue 2013). Their argument is that data that big data and rare disease research practitioners shouldn’ t get hung up on are two different and incompatible the number of data points they gather; worlds. Why the divergent views? instead they should view big data as using “as much of the entire datas et The main reason is lack of consensus as feasible”. By their logic, sequencing about how to define big data: the UC the entire genome of a person with Berkeley School of Information lists a rare disease and using the data to no fewer than 43 definitions (Dutcher, help that individual qualifies as a big 2014). The definition that resonates with data approach. PRO-ACT, now bringing most is the principle that big data should together 25 years’ worth of longitudinal have three Vs: volume, velocity and data, is the single largest effort to variety. Critics say rare disease data – assemble the entire dataset of clinical collected from small patient populat ions, trials in ALS. but difficult to source and often of dubious quality – certainly fails on the Daphna Laifenfeld, Director, Personalized volume measure, and possibly the other Medicine and Pharmacogenomics two as well. at Teva Pharmaceutical Industries, defines big data as the combinatio n of A more helpful perspective comes from genetics, omics [a term used to describe Viktor Mayer-Schönberger and Kenneth disciplines of biology such as proteomics Cukier, whose 2013 book, Big Data: and transcriptomics], patient-reported A Revolution That Will Transform How We and clinical data (Laifenfeld, 2016). Live, Work, and Think, helped stoke big- Her list could be expanded further but, for researchers, it ’s a helpful guide for dividing into familiar categories what Big data should have three Vs: volume, velocity and variety. people really mean when they talk about data fervor among the masses. “When we Viewed through this lens, it ’s apparent talk about big data, we mean ‘big’ less in that innovative big data methodologies absolute than in relative terms: relative are being implemented in rare disease. to the comprehensive set of data,” they Key applications include: drug discovery; 6 big data in medicine. raremark.com the discovery of disease-related genes, In a stellar example of international genetic mutations and biomarkers; academic collaboration, the Exome matchmaking of rare disease cases Aggregation Consortium (ExAC) has to help diagnose patients; and drug aggregated genetic sequencing data repurposing. from around 20 separate research studies, creating an open-access Going deep into the genome database of genetic variants in more than 60,000 people; in other words, the genetic variation we might expect to find in a normal population (see Figure 1). Writing in Nature, the ExAC team described their undertaking as “the It ’s a diverse list. But even a cursory most comprehensive catalogue (to our glance at the literature reveals that most knowledge) of human protein-coding efforts are focussed on genomics. There genetic variation to date” (Lek et al., are scientific and economic drivers at 2016). Since the launch of the ExAC work here: the advent of next-generation database in 2014, researchers the world sequencing has made it feasible to over have interrogated the resource, sequence the entire genomes of humans including the 10 million identified at a reasonable cost. The other factor variants, principally to better understand is specific to rare disease: the fact that the genetic variations seen in rare 80% of the known orphan conditions disease patients. result from genetic defects, and tha t the majority of these are monogenic. In a Broad Institute public lecture, Daniel MacArthur, the researcher who led But, to understand which genetic the ExAC consortium, said: “We’ve now variants are implicated in rare diseases, sequenced in our lab more than 1,000 researchers first need to filter out those families affected by a rare disease . For variants that o ccur normally. This is no more than 400 of those families, we’ve trivial task, given the tens of thousands been able to give them back a diagnosis of genetic variants that occur in a typical and, for several dozen of those families, exome (the 1-2% of a genome that codes it ’s been possible to convert what had for proteins). And it is here that big data previously been an untreatable disease analysis has proven invaluable. into a disease where it ’s actually possible raremark.com 7 Figure 1: ExAC in numbers No. exomes in final dataset after filtering for quality No. natural genetic variants identified No. international research studies donating data No. contributing authors in Nature paper No. exomes in raw dataset Sources: Broad Institute, 2016; Lek et al., 2016 to give a medication to alleviate at least within six months, a dozen other families some of those symptoms. That small with the same mutation were identified, fraction will grow as we begin to develop creating a small network of people who more and better drugs to treat rare had been medically isolated just a year diseases.” (Broad Institute, 2016) before. One of Dr MacArthur ’s case studies Reflecting on the limitations of their involved two sisters with a rare condition resource, Dr MacArthur and colleagues that led to extreme weakness in the explained that most ExAC samples facial muscles. Before the girls’ DNA are not accompanied by detailed was sent to Broad, the family had phenotypic data; that is, information been through nine years of muscle on the symptoms and other observable biopsies, pathology tests and other properties in an individual (Lek et al., procedures; none of which identified 2016). This is an important point. The the cause of their disease. Thanks to ability to link genotypic and phenotypic the availability of the ExAC database, data is precisely what ’s needed if Dr MacArthur managed to trace it back the troves of big data generated by to two extremely rare mutations in a DNA sequencing are to be interpreted gene known as LMOD3. The sisters were correctly, and translated into patient diag nosed with nemaline myopathy. And benefit in clinical settings. Genotype-to- 8 raremark.com phenotype connections need to be made not only in individual cases, but also in unrelated people if scientists are to be confident in a condition’s genetic ca use. The ability to link genotypic and phenotypic data is precisely what’s needed if the troves of big data generated by DNA sequencing are to be interpreted correctly, and translated into patient benefit in clinical settings. Matchmaking for clinicians in rare disease Going some way to bridge this gap, a Canadian-led team has created PhenomeCentral, an online matchmaking service for clinicians and researchers working in rare disease, often those Spyros Mousses, founder and president whose patients have yet to receive a of Systems Imagination, a data analytics diagnosis. PhenomeCentral aggregates company, says researchers are routinely phenotypic and genotypic data from “looking at billions of measurements FORGE Canada, CARE for RARE, the US from an individual’s genome”: activities NIH Undiagnosed Diseases Project and he calls “deep genotyping” (RARECast, other rare disease-focused consortia 2016). But in his view, the depth of (Buske et al., 2015). PhenomeCentral analysis being performed in genomics is users query the database by submitting absent from phenomes. “We’re measuring a patient record that includes clinical not billions but dozens of traits and symptoms and any available information clinical phenotypes,” said Dr Mousses. on patients’ genetic variants. Andrew Morris, a director of the Farr PhenomeCentral’s algorithms mine the Institute, a UK-based specialist in health phenotypic data held in the repository, informatics, wants to see the health-data identifying patients most likely to have debate shift towards “deep phenotyping” the same condition, and predicting (Morris, 2016). which genes or genetic variants might be responsible. Users are then able to contact others whose patient cases match theirs, hopefully leading to a posit ive diagnosis. In 2015, PhenomeCentral raremark.com 9 held records on more than 1,000 deeply- factors, hospital records, and other phenotyped rare disease patients. Most data gathered during the life course of had had their exomes sequenced and patients (Hill, 2016). In other words, remained undiagnosed. deep phenotypic and longitudinal data on the sort of scale that the country’s Achieving scale is an acknowledged National Health Service (NHS), among challenge in rare disease, but comparable systems globally, is uniquely PhenomeCentral will surely be aided in placed to provide. this respect by its decision to join The MatchMaker Exchange (MME), a network And working at international level, of matchmaking services, each with its RD-Connect is an EU FP7-funded project own cohort of users (Philippakis et al., that aims to break down historical data 2015). Under t his model, researchers silos in rare disease. A key objective is have the option of querying not just to make it easier for the rare disease PhenomeCentral but also other members research community to share data. of the MME network at the same time. To this end, it is creating a platform to More shots on goal give them a better integrate patient registries, biobanks chance of finding a patient match. and databases of genomic, phenotypic, natural history and clinical trial data Elsewhere, two high-profile initiatives (McCormack, 2016; Thompson et al., promise to integrate many more diverse 2014). sources of data beyond genotypes and phenotypes, and have received plenty of RD-Connect piloted its model by attention for their big data ambitions. pulling in data from two European Later this year, the UK’s 100,000 research consortia: NeurOmics, with Genomes Proje ct is expected to have a focus on rare neurodegenerative sequenced the genomes of 25,000 cancer and neuromuscular disorders, and patients and around 17,000 people with EURenOmics in the field of rare kidney rare diseases, as well as their families disorders; with each contributing around (Genomics England, 2015). Alongside 1,000 sequenced exomes. The Broad genomic data, the project will also Institute, Newcastle University and other collect clinical data, pathology and international partners have come on histopathology results, imaging results, board more recently (see Figure 2). information on treatments and risk 10 raremark.com Figure 2: Initiatives contributing exome data to RD-Connect E URenOmic s (EU) SeqNMD (US) NeurOmic s (EU) NCNP Japan M YO-S EQ (UK) Key 1,000 exomes 500 exomes 300 exomes CNAG Rare (Spain) CMG Slovenia Source: McCormack, 2016 gameshow Jeopardy! in 2011. Since then, Cognitive assistant for digital doctors it has gone on to capture the imagination of the data science community with its ability to analyze large quantities of data, to understand complex questions posed in natural language, and to propose evidence-based answers. The Boston team fed medical literature Meanwhile, one of the world’s best-known and clinical data relating to SRNS into artificial intelli gence systems is being Watson, before adding genomic data piloted in two rare disease projects, with from patients retrospectively. This is the aim of creating what some describe the first time Watson has been used as a digital doctor ’s assistant. to help doctors diagnose rare disease and identify treatment options – and For the past year, orphan disease the results will be eagerly awaited. If researchers at Boston Children’s Hospital it proves successful in SRNS, the plan have been training IBM Watson, the tech is to extend the approach to neurologic company’s flagship cognitive computing disorders and other rare pediatric platform, to understand steroid-resistant diseases studied at Boston Children’s. nephrotic syndrome (SRNS), a rare kidney disease (IBM, 2015). Watson first And at the end of 2016, researchers in gain ed notoriety by winning the Germany kicked off their own 12-month raremark.com 11 pilot project with Watson, to evaluate its potential to diagnose any rare disea se (IBM, 2016; Marks, 2016). The Center for Undiagnosed and Rare Diseases at the University Hospital Marburg has been contacted by more than 6,000 Time to downsize patients since it opened in 2013. Most patients have brought with them years All well and good. Yet a limitation that of unstructured data from their medical is common to virtually all the initiatives histories, inclu ding: lab test results; described above is the absence of the clinical reports; pathology reports; and views of patients. This is an impor tant drugs they’ve been prescribed. For the missed opportunity, given that rare Marburg researchers to review all this disease patients and families are in many information and combine it with their cases experts in their own conditions, own knowledge and the medical literature capable of interacting with health to reach a diagnosis typically takes providers on a professional level, and several days for each patient. contributing insights that only they possess. The hope is that Watson will be able to automate and accelerate the process, Addressing this issue requires acceptance quickly presenting physicians with that while great insights can be gleaned a list of possible hypotheses from from huge datasets, equally valuable and which they can make their own data - complementary intelligence can be driven diagnoses. In a further test of Watson’s capabilities in natural-language processing, the Marburg pilot will require patients’ medical histories recorded in German to be matched up with the body of rare disease-related literature published in English. A limitation that is common to virtually all current big datafocussed initiatives is the absence of the views of patients. derived from rigorous interrogation of datasets that are relatively small. As it happens, a small data movement has also emerged in the past few years; its loudest cheerleader being Martin 12 raremark.com Lindstrom, the Danish author of Sm all doing so, and the implications for the Data: The Tiny Clues That Uncover Huge patient community. In line with the Trends (Lindstrom, 2016). Mr Lindstrom’s small data model, we posed a series world is that of marketing and branding, of well-defined questions to small but it doesn’ t take a huge leap to apply groups of patients, both online and his principles of keen observation of over the phone. The study sample small samples to people living with rare comprised Raremark users with disease. an interest in three rare diseases: adrenoleukodystrophy, myasthenia And recent work in the field of patient- gravis and Sanfilippo syndrome. Work reported outcomes (PROs) has provided conducted from November 2016 to evidence that patient-generated medical January 2017 revealed an understanding data can be of comparable quality to of the importance of data sharing for data gathered from traditional sources. the benefit of others, and a willingness A group of US researchers conducted a to do so: 94% of participants said proof-of-conce pt study using the chronic they would feel comfortable sharing lymphocytic leukemia (CLL) community selected health-related information about of PatientsLikeMe, a patient-powered themselves with the community and the research network. There are several PRO pharmaceutical industry. instruments specific to CLL, meaning the supporting lite rature contain data the Raremark’s findings reflect the results researchers could use as comparators. of a larger RD-Connect study that Using a combination of online surveys included similar themes. As long as and telephone interviews, they found the right governance systems are in good alignment between the symptoms place, RD-Connect discovered, the rare that members of PatientsLikeMe’s CLL disease patient community generally community said were important to them, has a positive view on the sharing and those identified through traditional of data to support medical research. interviews and patient focus groups “All the participants understood the (McCarrier et al., 2016). incentive for [rare disease] in sharing data and samples; in fact, there were Raremark has also been exploring how several pleas for research systems to be to involve patients in the area of data standardised across the EU in order to sharing and donation, the reasons for make data sharing easier,” the authors raremark.com 13 wrote in the European Journal of Human Learning from the ALS Prediction Prize Genetics (McCormack et al., 2016). case study, in which, remarkably, fourfifths of competitors had virtually no previous experience in the condition, Intelligence: from artificial to human injections of fresh thinking from smart people from non-health disciplines may reveal exciting possibilities yet to be imagined. Machines will not be able to model some truly human things, such as how to explain to another human what it ’s like Watch a presentation by Dr Mousses of to live day to day with a rare medical Systems Imagination and you’ll be left condition, or whether a drug’s supposed with big data-driven visions of the future. benefits deliver outcomes that are Machines will be able to gather medical meaningful to them. For these insights, data, create their own models and test the only true source will be patients. hypotheses in vast numbers without the help of humans. They will also be able For big data-derived intelligence to to look at medical images and extract translate into real benefit for the rare billions of features for interpretation: disease community, we need workable a level of resolution that would simply be models for combining very large datasets impossible for pathologists. Pointing out with the very small. that traditional evidence-based med icine has failed in rare disease, he uses the term “intelligence-based medicine” Pete Chan is Head of Research & Analysis to describe the mining of deep data at Raremark. from rare disease patients – genomic, phenotypic and biometric – before these Email: [email protected]. are integrated, using machine learning, and analyzed for the benefit of those individuals (Global Genes, 2016). 14 raremark.com References Atassi, N. et al. (2014) ‘The PRO-ACT Global Genes (2016) Big data and database: Desi gn, initial analyses, intelligence-based medicine. Available and predictive features’, Neurology, at: https://www.youtube.com/ 83(19), pp. 1719–1725. doi: 10.1212/ watch?v=cTTCDReujdE&feature=youtu.be wnl.0000000000000951. (Accessed: 7 January 2017). Broad Institute (2016) Midsummer Hill, S. (2016) Beyond 100,000 Genomes: nights’ science : Using big data to Transforming the NHS into a personalised understand rar e diseases. Available medicine service. BioData World Congress at: https://www.youtube.com/ 2016. Cambridge, UK. 27 October 2016. watch?v=GFNn7z7OWU8&feature=youtu. be (Accessed: 7 January 2017). IBM (2015) Boston Children’s Hospital to tap IBM Watson to tackle rare Buske, O.J. et al. (2015) pediatric diseases. Available at: ‘PhenomeCentral: A portal for phenotypic http://www-03.ibm.com/press/us/en/ and genotypic matchmaking of patients pressrelease/48031.wss (Accessed: with rare genetic diseases’, Human 7 January 2017). Mutation, 36(1 0), pp. 931–940. doi: 10.1002/humu.22851. IBM (2016) Rhön-Klinikum hospitals to study how IBM Watson can support Dutcher, J. (2014) What is big data? doctors in the diagnosis of rare diseases. Available at: https://datascience. Available at: https://www-03.ibm.com/ berkeley.edu/what-is-big-data (Accessed: press/us/en/pressrelease/50803.wss 7 Ja nuary 2017). (Accessed: 7 January 2017). Genomics England (2015) Genomics Küffner, R. et al. (2015) ‘Crowdsourced England and the 100,000 Genomes analysis of clinical trial data to Project. Available at: https://www. predict amyotrophic lateral sclerosis genomicsengland.co.uk/wp-content/ progression’, Nature Biotechnology, 33, uploads/2015/05/Genomics-Englad- pp. 51–57. Narrative-May-20152.pdf (Accessed: 7 Ja nuary 2017). raremark.com 15 Laifenfeld, D. (2016) Preventive and McCormack, P. (2016) RD-Connect: predictive genetics: towards personalised Big data for rare disease. Available medicine. BioData World Congress 2016. at: http://www.geneticdisordersuk. Cambridge, UK. 26 October 2016. org/static/media/up/GDLS2016_ PaulineMccormack_RDConnect.pdf Lek, M. et al. (2016) ‘Analysis of protein- (Accessed: 7 January 2017). coding genetic variation in 60,706 humans’, Nature, 536(7616), pp. 285– McCormack, P. et al. (2016) ‘“You should 291. doi: 10.1038/nature19057. at least ask”. The expectations, hopes and fears of rare disease patients on Lind strom, M. (2016) Small Data: The large-scale data and biomaterial sharing Tiny Clues That Uncover Huge Trend s. for genomics research’, European Journal London, United Kingdom: John Murray of Human Genetics, 24(10), pp. 1403– Learning. 1408. doi: 10.1038/ejhg.2016.30. Marks, P. (2016) Dr House goes digital Morris, A. (2016) Options and as IBM’s Watson diagnoses rare diseases. opportunities for health data scien ce in Available at: https://www.newscientist. the UK. BioData World Congress 2016. com/article/2109354-dr-house-goes- Cambridge, UK. 27 October 2016. digital-as-ibms-watson-diagnoses-rarediseases/ (Accessed: 7 January 2017). Philippakis, A. et al. (2015) ‘The Matchmaker Exchange: A platform for Mayer-Schönberger, V. and Cukier, K. rare disease gene discovery’, Human (2013) Big data: A Revolution That Will Mutation, 36(10), pp. 915–921. doi: Transform How We Live, Work, and Think. 10.1002/humu.22858. London: John Murray Publishers. PRO-ACT (2011) Available at: https:// McCarrier, K.P. et al. (2016) ‘Concept nctu.partners.org/ProACT/ (Access ed: elicitation within patient-powered 7 January 2017). research networks: A feasibility study in chronic lymphocytic leukemia’, Value in Health , 19(1), pp. 42–52. doi: 10.1016/j. jval.2015.10.013. 16 raremark.com RARECast (2016) Harnessing big data to work for rare disease patients by RARECast Global Genes. Available at: https://soundcloud.com/rarecast/ harnessing-big-data-to-work-for-raredisease-patients (Accessed: 7 January 2017). Thompson, R. et al. (2014) ‘RD-Connect: An integrated platform connecting databases, registries, biobanks and clinical bioinformatics for rare disea se research’, Journal of General Internal Medicine, 29(S3), pp. 780–787. doi: 10.1007/s11606-014-2908-8. Zach, N. et al. (2015) ‘Being PROACTive: What can a clinical trial dat abase reveal about ALS?’, Neurotherapeutics, 12(2), pp. 417–423. doi: 10.1007/ s13311-015-0336-z. raremark.com 17 raremark.com 18 raremark.com

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download big and small data - Raremark: about us