soundbyte -- Newsletter of the Department of Computer Science & Engineering, a department of the Institute of Technology

CS&E on Data Science: Minnesota CS&E emerges as national leader in big data

Computer Science has been at the forefront of the digital information age over the past four decades, shaping the way many fields pursue their goals and objectives. Now another revolution is under way that promises new transformations. In the 1980s, the so-called ``Computer as King'' era ushered in computational simulation as a new paradigm for science and engineering. In the 1990s, the Internet boom and the World Wide Web brought us the ``Network as King'' era, built upon advances in networking protocols, systems, and applications. Today we are squarely in the era of ``Big Data,'' or ``Data as King,'' and Computer Science is once again at the heart of this revolution. Realizing long ago that the amount of data being collected far exceeds what humans can analyze without assistance, Computer Science has invented new methods to automatically analyze very large-scale, multi-dimensional, dynamically changing, and heterogeneous datasets. This has allowed us to build models that help explain the phenomena underlying the data, whether in the physical sciences, social sciences, life sciences, engineering, or business. A number of academic institutions are participating in this data science revolution, and Minnesota CS&E is one of the undisputed leaders, having established its credentials years before the data science field became fashionable. One of the most important indicators of our prominence is Microsoft Academic Search (academic.research.microsoft.com), which ranks institutions by the quality of their research reputation.
For data mining, Minnesota is ranked 8th worldwide out of 4,796 institutions, including academia, industry, and government labs, and is ranked 4th worldwide amongst academic institutions. This leadership is reflected in the high profile of our faculty, who are sought-after keynote speakers at major conferences, are invited to serve on national and international advisory boards in government and industry, have received major recognitions and awards for their scientific contributions, and have authored major textbooks that are used worldwide. The department has also been a leader in the development of applications and software of great importance to industry, and in the training of a generation of new Data Scientists. Hence it should be no surprise that CS&E faculty have received well over $40 million in federal, state, and industry funding for research in data science over the past 5 years alone. Minnesota CS&E is a leader not only in the foundations of data science -- including data mining, machine learning, data visualization, smart storage, and computing infrastructure for handling big data -- but also in its application to problems of national importance. We are leading the effort in a number of high-profile multi-disciplinary, multi-institutional collaborations that have garnered national and international attention. For years, CS&E faculty have embraced ``Data as King'' as a guiding research principle. What stands out is the extensive breadth and depth of our research in addressing the core challenges across the entire ``big data pipeline,'' covering the spectrum of algorithms for data analysis, infrastructure, and applications.

Data Analysis Methods

To yield insights hidden in the vast data across a variety of domains, Minnesota CS&E faculty have developed a wide range of powerful data analysis methods. Faculty working on data analysis methods span several overlapping areas that include data mining, machine learning, optimization, and visual analytics (Professors Arindam Banerjee, Daniel Boley, Vicki Interrante, George Karypis, Dan Keefe, Rui Kuang, Vipin Kumar, Yousef Saad, Shashi Shekhar, and Jaideep Srivastava).

Data mining: As data mining has evolved over the past two decades, pioneering work on several aspects of data mining has been led by our faculty. Frequent pattern mining, where the goal is to find salient and persistent patterns in data, and anomaly detection, where the goal is to find unusual or outlying data points within a large data set (``a needle in a haystack''), have emerged as key data mining methodologies with wide-ranging applications. Professors Banerjee, Karypis, Kumar, Shekhar, and Srivastava have contributed key advances to the fundamental algorithms of these methodologies, including scalable algorithms capable of handling enormous web-scale data sets and of handling novel data types such as temporal and spatial patterns, differential patterns, and relational and graph patterns. Algorithms developed here for univariate and multivariate time-series anomalies, spatial and spatiotemporal anomalies (e.g., hotspots, change footprints), and anomalies in discrete sequences have been particularly successful. Some of the most influential and extensively cited review articles on anomaly detection as well as spatial outlier detection have been written by our faculty and students.

Clustering: Clustering is arguably the most widely used approach for exploratory data analysis. The goal is to find groups of similar data points, according to suitable distance/similarity measures, geometric properties, or other representations of data objects. Professors Banerjee, Boley, Karypis, and Kumar have produced some of the most innovative and widely used clustering methods and software. Many data sets can be represented as graphs, and finding cohesive partitionings of graphs is a key task. Our faculty has developed highly scalable, high-quality graph partitioning and clustering algorithms, including Metis and Cluto. The methods have been generalized to work with hypergraphs, along with parallel implementations that scale to large datasets with ease. Another prominent approach to data clustering is the k-means family of methods, which simultaneously estimates the cluster structure as well as a representative, or centroid, for each cluster. Our faculty have unified the vast literature on this theme using Bregman clustering, establishing connections to statistical mixture models. Clusters have also been used to approximate large data sets, yielding scalable, approximate large-scale matrix and tensor approximations. Spectral approaches to data analysis and clustering have also been heavily used due to their solid theoretical basis and strong empirical performance, and our faculty have made significant contributions to this approach, such as PDDP (Principal Direction Divisive Partitioning). All these methods have been widely adopted by the community, both in industry and academia, for a wide array of applications like text analysis, recommendation systems, bioinformatics, and social network analysis. Several of our papers in this area are amongst the most cited papers on the topic.

Large Scale Optimization: In the era of big data, scaling up models and methods to billions of data points has emerged as a key challenge. Optimization is a core technology at the heart of many machine learning and data mining problems. Our faculty has been developing parallel optimization methods for learning machine learning models from data that can seamlessly scale to large and possibly streaming datasets.
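To give a flavor of the kind of method involved -- this is a generic textbook sketch, not the group's software, and the function name is ours -- the alternating direction method of multipliers (ADMM) splits a problem into simple subproblems that can be solved, and parallelized, separately. A minimal example solves the one-dimensional lasso problem: minimize (1/2)(x - a)^2 + lam*|z| subject to x = z:

```python
def admm_lasso_1d(a, lam, rho=1.0, iters=100):
    """Minimize (1/2)(x - a)^2 + lam*|z| subject to x = z, via ADMM."""
    x = z = u = 0.0  # primal variables x, z and scaled dual variable u
    for _ in range(iters):
        # x-update: minimize the smooth term plus the quadratic penalty
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: soft-thresholding handles the non-smooth |z| term
        t = x + u
        z = max(t - lam / rho, 0.0) if t >= 0 else min(t + lam / rho, 0.0)
        # dual update: accumulate the constraint violation x - z
        u += x - z
    return z

# Closed-form answer is soft-thresholding of a: sign(a) * max(|a| - lam, 0)
print(round(admm_lasso_1d(3.0, 1.0), 4))  # prints 2.0
```

The same split-and-coordinate pattern is what makes ADMM-style methods attractive at scale: each subproblem stays cheap, and the x-updates for many data blocks can run in parallel with only the dual variables coordinating them.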
Certain key theoretical advances in parallel optimization, especially for the alternating direction method of multipliers (ADMM), have been made by our faculty in recent years. Great promise in handling big data has been shown with examples such as solving constrained optimization problems -- including linear programs with a quarter billion variables -- in around a minute, far beyond the capacity of any existing commercial package. The Big Data Message Passing Interface (BDMPI) and other developments are pushing the envelope on large-scale, data-intensive analysis.

Predictive Analytics: Modern predictive modeling often encounters high-dimensional problems, where the number of possible features/factors affecting a response variable is large, possibly running into the millions. In recent years, important advances have been made in sparse and structured estimation for such high-dimensional problems, which can correctly estimate statistical dependencies (not just correlations) even with a small number of examples. Professors Banerjee, Boley, Karypis, Kuang, and Saad have been working on both the computational and statistical aspects of such developments, including a unified theory of such estimation problems and applications to various real-world problems.

Visualization and Visual Analytics: Visualization is a key tool for extracting patterns and intuition from large, complex data sets. Visuals are the fastest way to convey ideas and patterns in big data to people, but turning visuals into a powerful tool for discovering new patterns requires that users be able to explore their data. Professors Interrante and Keefe are developing interactive data visualization systems that tightly integrate computer graphics with interactive techniques for querying and exploring data. These systems make use of emerging technologies such as hardware-accelerated 3D computer graphics, virtual reality, multitouch user interfaces, haptics, and 3D gestural user interfaces. Interdisciplinary collaborations include using clinical and experimental motion capture data to analyze the biomechanics of the neck, using supercomputer-based simulation to design more effective medical devices, and working with artists to design creative new data visualizations of scientific climate data simply by sketching on a computer.

NEW! Master of Science in Data Science
• A rigorous new degree for the modern digital age
• A strong foundation in the science of Big Data
• One single program combining data collection and management, data analytics, scalable data-driven pattern discovery, and fundamental algorithmic and statistical concepts
Interested? Visit www.datascience.umn.edu for more information, or email [email protected]. Now accepting applications for admission!

Data Analysis Infrastructure

To cope with the massive amounts of data inherent in data science domains, Minnesota CS&E faculty have developed the computer systems infrastructure needed to enable scalable storage, transmission, and computation over data in support of data analysis methods. The infrastructure group consists of systems faculty (Professors Chandra, Weissman, and Tripathi), networking faculty (Professors Zhang and He), database faculty (Professors Mokhbel, Srivastava, and Shekhar), and storage faculty (Professor Du), all working on portions of the data pipeline: from capture, to storage, to computation, to analysis.

Storage: In the big data storage area, Professor Du's current research focuses on the technology needed to handle and preserve large-volume data, including research into new memory/storage technologies like NVRAM (Non-Volatile RAM), SSDs (Solid State Drives), and SWDs (Shingled Write Disks). Tiera is a next-generation cloud storage system developed by Professors Chandra and Weissman. Tiera spans not only the different storage tiers within a cloud data center but may also span multiple data centers or cloud providers in the wide area. Using Tiera, an application designer can easily specify the desired data management requirements in terms of performance or reliability, and the system will take care of the rest. Once specified, the application sees a uniform data interface that hides the diversity and complexity of storage systems and geographic distribution. Tiera will also enable in-situ computation on its data to further enhance performance. To date, Tiera has been ported to both MySQL and HDFS, greatly improving their underlying performance.

SpatialHadoop (spatialhadoop.cs.umn.edu) is a full-fledged MapReduce framework with native support for spatial data, designed by Professor Mohammed Mokhbel. It is built inside Hadoop as a comprehensive extension to the Hadoop base code that pushes spatial constructs and spatial data awareness into Hadoop's core functionality. As a result, MapReduce programs and frameworks running on top of SpatialHadoop can use its embedded spatial functionality to achieve orders-of-magnitude better performance. SpatialHadoop is open source and is used extensively worldwide. The first version was released in March 2013 and a second version in October 2013; both versions have been downloaded more than 75,000 times thus far.

Computation: CS&E faculty are working on the computational infrastructure needed to support distributed data-intensive computing. To support computation on widely distributed data, a new cloud infrastructure called Nebula has been developed. With a simple click in a Chrome browser, users around the globe can join a Nebula, contributing computational or storage resources. Nebula is a form of distributed cloud that allows computation to occur near the source of data at the network edge, dramatically improving performance.
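Systems like SpatialHadoop and Nebula build on the MapReduce programming model. As a generic illustration of that model only -- a toy in-memory sketch, not the actual Hadoop or Nebula APIs -- here is word count expressed as a map phase, a shuffle, and a reduce phase:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy in-memory MapReduce: map each input, group by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):   # map phase emits (key, value) pairs
            groups[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1); the reducer sums the ones.
docs = ["big data at scale", "big data analytics"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["big"], counts["data"])  # prints: 2 2
```

In a real deployment the map and reduce tasks run in parallel on machines near the data, which is exactly the property Nebula exploits at the network edge; the sketch shows only the programming model.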
It also allows data applications to be located near end users, improving latency. A Nebula prototype that runs across the globe has been developed and is currently operational; it supports data-intensive computing, such as MapReduce, on data scattered across the world. Tiera and Nebula are part of the distributed computing systems group led by Professors Jon Weissman and Abhishek Chandra. [Figure: The Nebula Architecture]

Modern analytics services require the analysis of large streams of data generated from disparate geo-distributed sources, such as users, devices, sensors, and servers located around the globe. In order to extract the most timely and valuable information from such data, many applications require a combination of both real-time and historical analysis, resulting in complex trade-offs between cost, performance, and information quality. Professor Chandra is examining fundamental systems and resource management issues in streaming analytics. These issues include determining where, when, and at what quality level to process and store the data in order to optimize the desired metrics. [Figure: Analyzing biomechanics datasets via visualizations driven by high-dimensional data clustering algorithms]

To support transactions on big data, Professor Anand Tripathi develops scalable transaction management techniques for NoSQL cloud data storage systems. His group has developed transaction management techniques for Hadoop/HBase supporting multi-key transactions. Another focus of his research is developing techniques for scalable transaction management of geo-replicated data across different cloud data centers. The goal is to support a spectrum of data consistency models in such environments, ranging from strong consistency, as in ACID transactions, to weaker consistency levels such as snapshot isolation, causal consistency, and eventual consistency. Based on this optimistic transaction execution model, his group has also developed a parallel programming system called Beehive for graph data analytics applications on cluster computing platforms.

Networking: A key part of data science infrastructure is the network transport of large data. The past few years have seen the widespread popularity and expansive growth of large-scale online content distribution services -- especially video streaming services such as Netflix, Hulu, and YouTube. It is estimated that Netflix represents the single largest source of Internet traffic, consuming 29.7% of peak downstream traffic in North America in 2011, and Cisco projects that by 2015 there will be nearly 1 million minutes of video crossing the Internet per second. Large-scale online content delivery requires a vast, complex, and costly infrastructure that employs huge data centers with enormous computing and storage capacities, and relies on content distribution networks (CDNs) with a large number of geographically dispersed edge servers to achieve quality delivery performance, e.g., low latency and high availability. Large-scale content distribution also involves a variety of entities and actors -- content creators/owners, content providers, CDNs, ISPs, advertisers, and so forth -- with intricate relationships. CS&E Professor Zhi-Li Zhang focuses on understanding the complex interactions among these different entities, with the objective of providing better architectural solutions to facilitate those interactions. His work promises to help guide the evolution of future Internet services, resulting in better quality of experience (QoE) for users and greater system efficiencies for the entities in the ecosystem.

Data Science Applications

Minnesota CS&E faculty actively work on big data applications as lead collaborators in many areas, including environmental science (Arindam Banerjee, Vipin Kumar, Shashi Shekhar, and Michael Steinbach), genomics (Professors Dan Boley, Dan Knights, Rui Kuang, Vipin Kumar, and Chad Myers), social networks (Arindam Banerjee and Jaideep Srivastava), social computing and business intelligence (Brent Hecht, George Karypis, Joe Konstan, Jaideep Srivastava, Loren Terveen, and the late Professor John Riedl), and smart health (Vipin Kumar, Jaideep Srivastava, and Michael Steinbach). They lead big data projects in collaboration with scientists from the medical school, business school, and school of biological sciences at the University of Minnesota and other institutions.

Genomics: The biotechnologies recently developed for massive biological data collection are transforming genomics research into a quantitative science based on informatics. CS&E faculty apply big data analytics to various genomic data to study biomedical questions such as the evolution of pathogens affecting humans, gut microbiomes, cancer biology, and chemical-genetic interactions in drug design. The influenza virus is a rapidly evolving pathogen affecting humans as well as animals such as swine and poultry. Professor Dan Boley's lab has used advanced data mining techniques to develop high-throughput methods to track the evolution of the flu virus over the last century. This analysis has led to novel ways to model the distinct effect a vaccination program has on the evolution of the virus, as well as novel scalable methods to uncover the possible sources of new strains based on their genetic make-up. [Figure: Evolution of the human flu virus discovered by unsupervised methods from gene sequence data.]

Our bodies are home to trillions of microbes, the majority of them living in our guts.
Most of these bacteria don't grow easily in the lab, but by sequencing their DNA in massive quantities and using big data analytics, Professor Dan Knights and his peers have learned that the ``microbiomes'' living in us contain hundreds of different species, and that an imbalance, or dysbiosis, of the gut microbiome can lead to various human diseases. The focus of Professor Knights's research is to develop a statistical and experimental framework for defining, diagnosing, and treating dysbiosis in human gastrointestinal diseases. He combines expertise in big data mining and biological experimentation to carry out this interdisciplinary research, using machine learning to find patterns in these microbial metagenomes that link to human health, and using those patterns to help develop new diagnostic tools and therapeutic interventions.

Professor Kuang's lab develops machine learning and network analysis algorithms to detect cancer biomarkers and disease phenotype-gene associations, in collaboration with medical doctors and biologists. The machine learning methods focus on summarizing molecular signals from massive amounts of short reads of genomic sequencing data to make predictions that improve cancer treatment. The network analysis methods further integrate and explore the modular relations among high-dimensional molecular signals to understand disease molecular mechanisms.

Professor Myers's lab is developing computational approaches to mine complex biological networks in a variety of organisms, including yeast, several plant species, and humans. One major focus is understanding genetic interactions: instances where variants at multiple locations in a genome combine to produce a surprising effect on an organism. Many complex traits, including disease, are thought to be the result of such interactions. The Myers lab has established a productive collaboration with geneticists at the University of Toronto who are generating large-scale data measuring the effects of millions of combinatorial genetic perturbations in the model organism yeast. Computational approaches developed by the lab for these data have revealed several fundamental principles about how genes interact to carry out biological functions in yeast, and also how these principles can be applied to discover genetic interactions to diagnose or treat disease in humans. The lab is also working on methods for large-scale mapping of chemical-genetic interactions, with the goal of establishing new big-data-driven technology for rapid elucidation of how uncharacterized chemicals interact with cells. The ultimate impact of this technology could be a safer, faster, and cheaper paradigm for drug discovery. [Figure: A global genetic map of a yeast cell, with colored points capturing predicted chemical-protein interactions of 1,000 uncharacterized compounds.]

Professor Kumar's lab has been working to analyze the abundance of next-generation sequencing data and help researchers in biology and medicine advance their understanding of cancer genetics. In particular, within the same tumor there can be different groups of cells (called subpopulations), each defined by its own set of mutations. The novel algorithms being developed will characterize intra-tumor genetic heterogeneity for all types of mutations and assemble ``personalized'' reference sequences to represent highly mutated tumor genomes.

Smart Health: As a result of a national mandate for health organizations to implement interoperable electronic health records (EHRs), personal health information on millions of individuals has become available for researchers investigating the health care patterns of patients and the effectiveness of various medical interventions. This data poses many challenges, as the health information in EHRs is relatively unstructured and represents an irregular and incomplete sampling of information about a patient's health issues and the way in which they are treated. Vipin Kumar, Jaideep Srivastava, and Michael Steinbach have been collaborating with researchers in the University of Minnesota's health sciences units (including the Institute for Health Informatics, the School of Nursing, and the Cancer Center) to analyze EHR data. Specific projects include analysis of groups of patients with specific health issues (e.g., diabetes, mobility impairment) to understand the differences between patients with the same condition but different outcomes, and building models of preventable events such as hospitalization. Achieving these goals is driving the development of new analysis techniques for summarizing data, analyzing irregular time series, identifying the relative risks of various disease factors, and building predictive models for sparse, incomplete, and temporal data sets. [Figure: Patterns of risk factors associated with improvement (in red) and no improvement (in blue) of mobility impairment outcomes for the different patient subgroups who underwent Home Health Care interventions.]

Social computing and business intelligence: Minnesota CS&E is a leader in the data-intensive field of social computing. In particular, the GroupLens Research Lab has been a long-time innovator in important areas such as recommender systems, geosocial systems, and peer-production environments (e.g., wikis). Drs.
John Riedl and Joe Konstan helped to invent recommender systems, which are responsible for the movie recommendations you get on Netflix, the products Amazon.com suggests to you, the personalized house-finding features in websites like Zillow, and many more important applications. The lab is also responsible for some of the key findings behind our understanding of how and why online communities like Wikipedia work (and fail), and of how and why people share their locations, among other topics in the social computing space.

Professor Jaideep Srivastava is a co-inventor and co-founder of Ninja Metrics, a software startup that can analyze data to identify key traits among massive multiplayer online gaming communities. Using this data, game creators can identify each player's psycho-social motivations and take action to help ensure an enhanced user experience. The startup relies on novel data mining techniques, developed in part by Minnesota CS&E, that extract key user traits from the massive pool of data being collected from online gaming platforms. The potential of the technology has earned the interest of a number of major players in the online gaming industry.

Environmental Sciences: The wealth of climate and ecosystem data now available from satellite and ground-based sensors, and from climate model simulations, offers huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. As part of a 5-year, $10 million NSF-funded project, Vipin Kumar, Arindam Banerjee, Shashi Shekhar, and Michael Steinbach are developing a range of data-driven methods for an improved understanding of the complex nature of the Earth system and the impact of climate change. Advances in data science include novel methods for identifying relationships in spatio-temporal data, sparse predictive models that can handle the high dimensionality of climate data sets, and automated methods for tracking unlabeled spatio-temporal objects. Highlights of climate science contributions include the discovery of new climate phenomena, robust methods for evaluating and combining the output of different climate models, and the development of a comprehensive open-source ocean eddy dataset that is being used by oceanography groups worldwide to understand global ocean dynamics and its interaction with climate change. [Figure: Sparse dependencies in South American regional temperatures identified by multi-task sparse structure learning.] [Figure: Global ocean eddy tracks constructed using satellite altimetry data.]

As part of a DOE-funded project, Arindam Banerjee is collaborating with Peter Reich and other ecologists to improve global land models, a critical component of the Earth system models used for future projections of climate, by shifting from the current plant-functional-type-based approach to one that better utilizes what is known about the importance, patterns, and variability of plant traits -- such as leaf lifespan, leaf nitrogen content, respiration, and photosynthesis -- based on TRY db, the world's largest database of plant trait information, and other datasets. In essence, the project will develop a quantitative characterization of plant functional diversity, leading to a better understanding of terrestrial ecosystems and improved land surface models.

In collaboration with scientists from NASA and the Planetary Skin Institute, Vipin Kumar's research group has been developing novel data mining methods that have dramatically advanced the state of the art in monitoring global land cover using satellite data. By applying these methods on a global scale, they have been able to create comprehensive histories of large-scale changes in the ecosystem due to fires, logging, droughts, floods, farming, etc. This research has been featured in the Economist, which lauded the role of data mining algorithms developed at the University of Minnesota in the automated monitoring of global forest cover that is urgently needed to enable the use of forests for economic carbon sink management and to study the natural and human impacts on ecosystems.

Leadership in Data Science

Minnesota CS&E is a leader not only in the technical aspects of data science but also in the growth and expansion of the field through numerous initiatives and highly visible national collaborations.

Data Science Initiatives: The department is bringing our strength in data science into focus with a number of new initiatives. In Fall 2015, we will welcome the first batch of students in the Data Science MS program (datascience.umn.edu), which will expose students to the cutting-edge methods and theory that will form the basis for the next generation of big data technology. A collaboration between our department, the Department of Electrical and Computer Engineering, the School of Statistics, and the Division of Biostatistics, the program is being led by Professor Dan Boley, who is serving as its inaugural Director of Graduate Studies.

Our department has also founded and is co-leading (with the Carlson School of Management) the Social Media and Business Analytics Collaborative (SOBACO), a major initiative building bridges between the University and local companies (sobaco.umn.edu). Through these industry-academia partnerships, Minnesota CS&E researchers are simultaneously creating new knowledge in data science and helping to solve immediate problems faced by major companies.

The CS&E department is a key participant in the University of Minnesota Informatics Institute (UMII), a brand new University-wide center led by department faculty member Dr. Claudia Neuhauser (sites.google.com/a/umn.edu/informatics-institute/home). UMII was founded to foster data-intensive research in agriculture, arts, design, engineering, environment, health, humanities, and the social sciences. UMII's vision includes advancing data analytics, enhancing the University of Minnesota's competitiveness in data-intensive research across all disciplines, and partnering with industry to harness the power of big data for economic growth and development.

Large-Scale Data Science Collaborations: CS&E faculty are also playing a leading role in numerous high-profile large-scale collaborations. One such project, ``Understanding Climate Change: A Data Driven Approach,'' is funded by NSF's Expeditions in Computing program, which is aimed at pushing the boundaries of computer science research. This prestigious 5-year, $10 million multi-institution, multi-disciplinary project, led by CS&E faculty (Vipin Kumar, Arindam Banerjee, Shashi Shekhar, and Michael Steinbach), involves collaborators from the School of Statistics, the Institute on the Environment, and the College of Food, Agricultural and Natural Resource Sciences at Minnesota, as well as collaborators at NC State, Northwestern, Northeastern, and NC A&T universities (climatechange.cs.umn.edu). The project aims to advance the science of climate change using novel data science methods.

TerraPop (Profs. Interrante, Shekhar, and Srivastava, together with colleagues from Geography, Library Sciences, Environmental Sciences, and History) is a project developing the infrastructure needed to make it easier for researchers to use data describing people, along with data describing the places they inhabit, at global scale (www.terrapop.org).

Professor Jaideep Srivastava is a leading player in the Virtual Worlds Observatory (VWO). This collaboration is developing novel computational techniques for analyzing large-scale networks, which will have applicability across a wide variety of domains. Along with Minnesota, Northwestern University, the University of Southern California, the University of Illinois, and others are involved (www.vwobservatory.com).

Our department hosts the National Science Foundation Center for Research in Intelligent Storage (CRIS), led by Professor Du, a partnership between universities and industry (cris.cs.umn.edu). CRIS is pushing the boundaries of file and storage systems by exploring and developing new technologies and techniques, improving the usability, scalability, security, reliability, and performance of storage systems. The Center has leveraged strong ties to the storage system industry in the Twin Cities, long a major center of the storage industry in the United States. Current industrial support includes 14 sponsorships from 10 companies (Seagate, HGST, HP, Dell, LSI, NetApp, Symantec, Xyratex, SGI, and FedCentric).

Data Science Alumni: The CS&E Department has had a large contingent of faculty working in areas related to data science for years, and has graduated well over 100 PhD students over the past 10 years in areas related to data science and big data. These graduates are in high demand at top companies like Google, Microsoft, Amazon, Apple, eBay, IBM, Facebook, Twitter, and Yahoo!, and our PhDs are sought after as faculty members at institutions around the world. We are especially proud that many of our graduates serve in leadership roles in the professional community, for example as program chairs of major conferences and as editorial board members of major journals. Many are high-profile alumni who have made substantial contributions in the data science area, both in industry and academia (see the featured alumni in this newsletter for a few examples). It is also noteworthy that the University of Minnesota Best Dissertation Awards in Science and Engineering for the past two years have been won by graduates of our department (Gang Fang, 2013, and James Faghmous, 2014) for their work on big data. This is proof positive that we are attracting the very top students, and training them well to make their mark in the world and do us proud.

An extended and enriched version of this article can be found at https://datascience.umn.edu/research/