soundbyte
Newsletter of the Department of Computer Science & Engineering
A department of the Institute of Technology
CS&E on Data Science
Minnesota CS&E emerges as national leader in big data
Computer Science has been at the forefront of the digital information age over the past four decades, shaping the way many fields pursue their goals and objectives. Now, another revolution is under way that promises new transformations. In the 1980s, the so-called “Computer as King” era ushered in the use of computational simulation as a new paradigm for science and engineering. In the 1990s, the Internet boom and World Wide Web brought us the “Network as King” era built upon many advances in networking software protocols, systems, and applications. Today, we are squarely in the era of “Big Data” or “Data as King”, and Computer Science is once again at the heart of
this revolution. Realizing long ago that the amount
of data being collected far exceeds what humans
can analyze without assistance, Computer Science
has invented new methods to automatically analyze
very large scale, multi-dimensional, dynamically
changing, and heterogeneous datasets. This has
allowed us to build models that help explain the
underlying phenomena behind the data, whether
it be physical sciences, social sciences, life
sciences, engineering or business.
A number of academic institutions are
participating in this data science revolution.
Minnesota CS&E is one of the undisputed leaders, having
established its credentials years before the data science field
became fashionable. One of the most important indicators of
our prominence is Microsoft Academic Search
(academic.research.microsoft.com), which ranks institutions by
the quality of their research reputation. For data mining,
Minnesota is ranked 8th worldwide out of 4,796 institutions,
including academia, industry and government labs, and is
ranked 4th world-wide amongst academic institutions.
This leadership is reflected in the high profile of our
faculty who are sought-after keynote speakers at major
conferences, are invited to be on national and international
advisory boards in the government and in industry, have
received major recognitions and awards for their scientific
contributions, and have authored major text books that are
used world-wide. The department has also been a leader
in the development of applications and software of great
importance to industry and in the training of a generation
of new Data Scientists. Hence it should be no surprise that
CS&E faculty have received well over $40 Million in federal,
state, and industry funding for research in data science over 5
years alone.
Minnesota CS&E is a leader not only in the foundations
of data science -- including data mining, machine learning,
data visualization, smart storage, computing infrastructure
for handling big data -- but also in its application to problems of national importance. We are leading the effort in a number of high-profile multi-disciplinary, multi-institutional collaborations that have garnered national and international attention.
For years, CS&E faculty have embraced “Data as King” as a guiding research principle. What stands out is the extensive breadth and depth of our research in addressing the core challenges across the entire “big data pipeline” that covers the spectrum of algorithms for data analysis, infrastructure, and applications.
Data Analysis Methods
To yield insights hidden in the vast data across a variety of domains, Minnesota CS&E faculty have developed a wide range of powerful data analysis methods. Faculty working on data analysis methods span several overlapping areas that include data mining, machine learning, optimization, and visual analytics (Professors Arindam Banerjee, Daniel Boley, Vicki Interrante, George Karypis, Dan Keefe, Rui Kuang, Vipin Kumar, Yousef Saad, Shashi Shekhar, and Jaideep Srivastava).
Data mining: As data mining has evolved over the past two decades, pioneering work on several aspects of data mining has been led by our faculty. Frequent pattern mining, where the goal is to find salient and persistent patterns in data, and anomaly detection, where the goal is to find unusual or outlying data points within a large data set (“a needle in a haystack”), have emerged as key data mining methodologies with wide ranging applications. Professors Banerjee, Karypis, Kumar, Shekhar, and Srivastava have contributed key advances to the fundamental algorithms for these methodologies, including scalable algorithms capable of handling enormous web-scale data sets and novel data types such as temporal and spatial patterns, differential patterns, and relational and graph patterns. Algorithms developed here for univariate and multivariate time-series anomalies, spatial and spatiotemporal anomalies (e.g., hotspots and change footprints), and anomalies for discrete sequences have been particularly successful. Some of the most influential and extensively cited review articles on anomaly detection as well as spatial outlier detection have been written by our faculty and students.
Clustering: Clustering is arguably the most widely used approach for exploratory data analysis. The goal is to find groups of similar data points, according to suitable distance/similarity measures, geometric properties, or other representations of data objects. Professors Banerjee, Boley, Karypis, and Kumar have produced some of the most innovative and widely used clustering methods and software.
Many data sets can be represented as graphs, and finding cohesive partitionings of graphs is a key task. Our faculty has developed highly scalable and high quality graph partitioning and clustering algorithms, including Metis and Cluto. The methods have been generalized to work with hypergraphs, along with parallel implementations which scale to large datasets with ease. Another prominent approach to data clustering is the k-means family of methods, which simultaneously estimate the cluster structure as well as a representative or centroid for each cluster. Our faculty have unified the vast literature on the theme using Bregman clustering, establishing connections to statistical mixture models. Clusters have been used to approximate large data sets to obtain scalable large-scale matrix and tensor approximations. Spectral approaches to data analysis and clustering have also been heavily used due to their solid theoretical basis and strong empirical performance, and our faculty have made significant contributions to this approach such as PDDP (Principal Direction Divisive Partitioning). All these methods have been widely adopted by the community, both in industry and academia, for a wide array of applications like text analysis, recommendation systems, bioinformatics, and social network analysis. Several of our papers in this area are amongst the most cited papers on the topic.
Large Scale Optimization: In the era of big data, scaling up models and methods to billions of data points has emerged as a key challenge. Optimization is a core technology at the heart of many machine learning and data mining problems. Our faculty has been developing parallel optimization methods for machine learning models from data, which can seamlessly scale to large and possibly streaming datasets. Certain key theoretical advances in parallel optimization, especially to the alternating direction method of multipliers (ADMM), have been made by our faculty in recent years. Great promise in handling big data has been shown, with examples such as solving constrained optimization problems such as linear programming with a quarter billion variables in around a minute, which is far beyond the capacity of any existing commercial package. The Big Data Message Passing Interface (BDMPI) and other developments are pushing the envelope on large scale intensive big data analysis.
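To give a feel for how the alternating direction method of multipliers works, here is a minimal sketch on a toy problem. It is not the faculty's solver and not a linear program: it assumes a small l1-regularized denoising objective chosen only because every update has a closed form. The names `soft_threshold`, `admm_l1_denoise`, `lam`, and `rho` are made up for this illustration.

```python
def soft_threshold(v, kappa):
    """Shrink v toward zero by kappa (proximal operator of the l1 norm)."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def admm_l1_denoise(a, lam=1.0, rho=1.0, iters=200):
    """Toy ADMM: minimize 0.5*||x - a||^2 + lam*||z||_1 subject to x = z.

    x, z are the two copies of the variable; u is the scaled dual variable.
    """
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: quadratic term plus penalty, solved in closed form.
        x = [(a[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: l1 term plus penalty, solved by soft-thresholding.
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # Dual update: accumulate the remaining disagreement between x and z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


solution = admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0)
print([round(v, 6) for v in solution])
```

For this toy objective the exact answer is the input soft-thresholded by `lam`, so the iterates approach 2.0, 0.0, and 0.2. Each coordinate is updated independently of the others, which hints at why splitting methods of this kind lend themselves to parallel execution.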
Predictive Analytics: Modern predictive
modeling often encounters high-dimensional
problems, where the number of possible
features/factors affecting a response variable
is large, possibly running into millions. In recent
years, important advances have been made
in sparse and structured estimation problems
for such high dimensional problems, which can
correctly estimate statistical dependencies
(not just correlation) even with a small number of
examples. Professors Banerjee, Boley, Karypis,
Kuang, and Saad have been working on both
computational and statistical aspects of such
development, including a unified theory of such
estimation problems and applications to various
real world problems.
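One standard way to write down a sparse estimation problem of the kind described above is the lasso. The newsletter does not say which estimators the faculty use, so the formulation below is only an illustrative assumption; the symbols $X$, $y$, $\beta$, and $\lambda$ are introduced here for the example.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}} \;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda\,\lVert \beta \rVert_1 ,
\qquad X \in \mathbb{R}^{n \times p},\quad y \in \mathbb{R}^{n},\quad n \ll p .
```

The $\ell_1$ penalty drives many coordinates of $\hat{\beta}$ to exactly zero, which is what makes it possible to fit a model with a very large number of candidate features from relatively few examples when only a handful of features matter.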
Visualization and Visual Analytics:
Visualization is a key tool to extract patterns
and intuition from large complex data sets.
Visuals are the fastest way to convey ideas
and patterns in big data to people. To turn
visuals into a powerful tool for discovery of new
patterns requires users to be able to explore
their data. Professors Interrante and Keefe are
developing interactive data visualization systems
that tightly integrate computer graphics with
interactive techniques for querying and exploring
data. These systems make use of emerging
technologies, such as hardware-accelerated 3D
computer graphics, virtual reality, multitouch user interfaces, haptics, and 3D gestural user interfaces. Interdisciplinary collaborations include using clinical and experimental motion capture data to analyze the biomechanics of the neck, using supercomputer-based simulation to design more effective medical devices, and working with artists to design creative new data visualizations of scientific climate data simply by sketching on a computer.
NEW! Master of Science in Data Science
• A rigorous new degree for the modern digital age
• A strong foundation in the science of Big Data
• One single program combining data collection and management, data analytics, scalable data-driven pattern discovery, and fundamental algorithmic and statistical concepts
Interested? Visit www.datascience.umn.edu for more information, or email [email protected]
Now accepting applications for admissions!
Data Analysis Infrastructure
To cope with the massive amounts of data inherent in data science domains, Minnesota CS&E faculty have developed computer systems infrastructure to enable the scalable storage, transmission, and computing of data to support data analysis methods. The infrastructure group consists of systems faculty (Professors Chandra, Weissman, and Tripathi), networking faculty (Professors Zhang and He), database faculty (Professors Mokbel, Srivastava, and Shekhar), and storage faculty (Professor Du), all working on portions of the data pipeline: from capture, to storage, to computation, to analysis.
Storage: In the big data storage area, Professor Du’s current research focus is on the technology needed to handle and preserve large-volume data including research into new memory/storage technologies like NVRAM (Non-Volatile RAM), SSD (Solid State Drives) and SWD (Shingled Write Disks). Tiera is a next generation cloud storage system developed by Professors Chandra and Weissman. Tiera spans not only the different storage tiers within a cloud data center but may also span multiple data centers or cloud providers in the wide-area. Using Tiera, an application designer can easily specify their desired data management requirements in terms of performance or reliability, and the system will take care of the rest. Once specified, the application sees a uniform data interface that hides the diversity and complexity of storage systems and geographic distribution. Tiera will also enable in-situ computation on its data to further enhance performance. To date, Tiera has been ported to both MySQL and HDFS, greatly improving their underlying performance.
SpatialHadoop is a full-fledged MapReduce framework with native support for spatial data designed by Professor Mohamed Mokbel (spatialhadoop.cs.umn.edu). It is built inside Hadoop as a comprehensive extension to Hadoop
base code that pushes spatial constructs and spatial
data awareness inside Hadoop core functionality.
This results in allowing MapReduce programs and
frameworks running on top of SpatialHadoop to
make use of its embedded spatial functionality to
achieve orders of magnitude better performance.
SpatialHadoop is open-source and is being used
extensively world-wide. The first version was
released in March 2013, and a second version in
October 2013. Both versions have been downloaded
more than 75,000 times thus far.
Computation: CS&E faculty are working
on the computational infrastructure needed to
support distributed data-intensive computing.
To support computation on widely distributed
data, a new cloud infrastructure called Nebula
has been developed. With a simple click in
a Chrome browser, users around the globe
can join a Nebula, contributing computational
or storage resources. Nebula is a form of
distributed cloud that allows computation to
occur near the source of data at the network
edge, dramatically improving performance. It also
allows data applications to be located near end-users, improving latency. A Nebula prototype
that runs across the globe has been developed
and is currently operational. It supports data-intensive computing such as MapReduce on data
scattered across the world. Tiera and Nebula
are part of the distributed computing systems
group led by Professors Jon Weissman and
Abhishek Chandra.
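The idea of running computation near the data and shipping only small summaries across the wide area can be sketched in a few lines. This is a generic MapReduce-style word count, not Nebula's actual interface. The site names, the sample records, and the functions `map_and_combine` and `reduce_counts` are all hypothetical.

```python
from collections import Counter


def map_and_combine(records):
    """Runs at the site that holds the data: count words locally."""
    counts = Counter()
    for line in records:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Runs centrally: merge the small per-site summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical geo-distributed sites, each holding its own raw records.
sites = {
    "minneapolis": ["big data big ideas", "data science"],
    "tokyo": ["data near the edge"],
    "berlin": ["edge computing for big data"],
}

# Only the per-site Counter objects would cross the network, not the raw lines.
partials = [map_and_combine(records) for records in sites.values()]
totals = reduce_counts(partials)
print(totals["data"], totals["big"], totals["edge"])
```

In this sketch each site sends one compact summary instead of its raw records, which is the kind of saving that placing computation at the network edge is meant to achieve.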
Modern analytics services require the analysis
of large streams of data generated from disparate
geo-distributed sources, such as users, devices,
sensors, and servers located around the globe.
In order to extract the most timely and valuable information from such data, many applications require a combination of both real-time and historical analysis, resulting in complex tradeoffs between cost, performance, and information quality. Professor Chandra is examining fundamental systems and resource management issues in streaming analytics. These issues include determining where, when, and at what quality level to process and store the data in order to optimize the desired metrics.
Figure: Analyzing biomechanics datasets via visualizations driven by high-dimensional data clustering algorithms.
Figure: The Nebula Architecture.
To support transactions on big data, Professor Anand Tripathi develops scalable transaction management techniques for NoSQL cloud data storage systems. His group has developed transaction management techniques for Hadoop/HBase supporting multi-key transactions. Another focus of his research is on developing techniques for supporting scalable transaction management for geo-replicated data across different cloud data centers. The goal is to support a spectrum of different data consistency models in such environments, ranging from strong consistency as in ACID transactions to weaker consistency levels such as snapshot isolation, causal consistency, and eventual consistency. Based on this optimistic transaction execution model, his group has also developed a parallel programming system called Beehive for graph data analytics applications on cluster computing platforms.
Networking: A key part of data science infrastructure is the network transport of large data. The past few years have seen the widespread popularity and expansive growth of large-scale online content distribution services -- especially video streaming services such as Netflix, Hulu and YouTube. It is estimated that Netflix represents the single largest source of Internet traffic, consuming 29.7% of peak downstream traffic in North America in 2011. Cisco projects that by 2015 there will be nearly 1 million minutes of video crossing the Internet per second. Large-scale online content delivery requires a vast, complex and costly infrastructure that employs huge data centers with enormous computing and storage capacities, and relies on content distribution networks (CDNs) with a large number of geographically dispersed edge servers to achieve quality delivery performance, e.g., low latency and high availability. Large scale content distribution also involves a variety of entities and actors -- such as content creators/owners, content providers, CDNs, ISPs, advertisers, and so forth -- with intricate relationships. CS&E Professor Zhi-Li Zhang focuses on understanding the complex interactions among these different entities with the objective of providing better architectural solutions to facilitate those interactions. His work promises to help guide the evolution of future Internet services, resulting in better quality-of-experience (QoE) for the users and greater system efficiencies for the entities in the ecosystem.
Data Science Applications
Minnesota CS&E faculty actively work on big data applications as lead collaborators in many areas, including: environmental science (Arindam Banerjee, Vipin Kumar, Shashi Shekhar, and Michael Steinbach), genomics (Professors Dan Boley, Dan Knights, Rui Kuang, Vipin Kumar and Chad Myers), social networks (Arindam Banerjee and Jaideep Srivastava), social computing and business intelligence (Brent Hecht, George Karypis, Joe Konstan, Jaideep Srivastava, Loren Terveen, and the late Professor John Riedl), and smart health (Vipin Kumar, Jaideep Srivastava, and Michael Steinbach). They lead big data projects in collaboration with scientists from the medical school, business school and school of biological sciences at the University of Minnesota and other institutions.
Genomics: The biotechnologies recently developed for massive biological data collection are transforming genomics research into a quantitative science based on informatics. CS&E faculty work on big data analytics of various genomic data to study biomedical applications such as the evolution of pathogens affecting humans, gut microbiomes, cancer biology, and chemical-genetic interactions in drug design.
The influenza virus is a rapidly evolving pathogen affecting humans, as well as animals such as swine and poultry. Professor Dan Boley’s lab has used advanced data mining techniques to develop high throughput methods to track the evolution of the flu virus over the last century. This analysis has led to novel ways to model the distinct effect a vaccination program has on the evolution of the virus, as well as novel scalable methods to uncover the possible sources of new strains based on their genetic make-up.
Figure: Evolution of the human flu virus discovered by unsupervised methods from gene sequence data.
Our bodies are home to trillions of
microbes, the majority of them living in our
guts. Most of these bacteria don’t grow
easily in the lab, but by sequencing their
DNA in massive quantities and using big
data analytics, Professor Dan Knights and his
peers have learned that the ``microbiomes’’
living in us contain hundreds of different
species, and that an imbalance, or
dysbiosis, of the gut microbiome can lead
to various human diseases. The focus of
Professor Knights’s research is to develop
a statistical and experimental framework for
defining, diagnosing, and treating dysbiosis
in human gastrointestinal diseases. He
combines expertise in big data mining and
biological experimentation to carry out this
interdisciplinary research. He uses machine
learning to find patterns in these microbial
metagenomes that link to human health,
and uses those patterns to help develop
new diagnostic tools and therapeutic
interventions.
Professor Kuang’s lab develops machine
learning and network analysis algorithms
to detect cancer biomarkers and disease
phenotype-gene associations in collaboration
with medical doctors and biologists.
The machine learning methods focus
on summarizing molecular signals from
massive amounts of short reads of genomic
sequencing data to make predictions for
improvement of cancer treatment. The
network analysis methods further integrate
and explore the modular relations among
high-dimensional molecular signals to
understand disease molecular mechanisms.
Professor Myers’s lab is developing computational approaches to mine complex biological networks in a variety of organisms including yeast, several plant species, and humans. One major focus is to understand genetic interactions, which are instances where variants at multiple locations in a genome combine to produce a surprising effect on an organism. Many complex traits, including disease, are thought to be the result of such interactions. The Myers lab has established a productive collaboration with geneticists at the U. of Toronto who are generating large-scale data measuring the effects of millions of combinatorial genetic perturbations in the model organism yeast. Computational approaches developed by the lab for these data have revealed several fundamental principles about how genes interact to carry out biological functions in yeast, and also how these principles can be applied to discover genetic interactions to diagnose or treat disease in humans. The lab is also working on methods for large-scale mapping of chemical-genetic interactions, with the goal of establishing new big-data driven technology for rapid elucidation of how uncharacterized chemicals interact with cells. The ultimate impact of this technology could be a safer, faster, and cheaper paradigm for drug discovery.
Figure: A global genetic map of a yeast cell with colored points capturing predicted chemical-protein interactions of 1000 uncharacterized compounds.
Professor Kumar’s lab has been working to analyze the abundance of next-generation sequence data and help researchers in biology and medicine advance their understanding of cancer genetics. In particular, within the same tumor there can be different groups of cells (called subpopulations) that are each defined by their own set of mutations. The novel algorithms being developed will characterize intra-tumor genetic heterogeneity for all types of mutations, and assemble “personalized” reference sequences to represent highly-mutated tumor genomes.
Smart Health: As a result of a national mandate for health organizations to implement interoperable electronic health records (EHRs), personal health information on millions of individuals has become available for researchers to investigate the health care patterns of patients and the effectiveness of various medical interventions. This data poses many challenges, as the health information in EHRs is relatively unstructured and represents an irregular and incomplete sampling of information about a patient’s health issues and the way in which they are treated.
Figure: Patterns of risk factors associated with improvement (in red) and no improvement (in blue) of mobility impairment outcomes for the different patient subgroups who underwent Home Health Care interventions.
Vipin Kumar, Jaideep Srivastava, and Michael Steinbach have been collaborating with researchers at the University of Minnesota’s Health Science Cancer (including the Institute for Health Informatics, the School of Nursing, and Cancer center) to analyze EHR data. Specific projects include analysis of groups of patients with specific health issues (e.g., diabetes, mobility impairment)
to understand the differences between patients
with the same condition but different outcomes,
and building models of preventable events such
as hospitalization. Achievement of these goals is
driving the development of new analysis techniques
for summarizing data, analyzing irregular time
series, identifying the relative risks of various
disease factors, and building predictive models for
sparse, incomplete, and temporal data sets.
Social computing and Business intelligence:
Minnesota CS&E is a leader in the data intensive
field of social computing. In particular, the
GroupLens Research Lab has been a long-time
innovator in important areas such as recommender
systems, geosocial systems, and peer-production
environments (e.g. wikis). Drs. John Riedl and
Joe Konstan helped to invent recommender
systems, which are responsible for the movie recommendations you get on Netflix, the products Amazon.com suggests to you, the personalized house-finding features in websites like Zillow, and many more important applications. The lab is also responsible for some of the key findings behind our understanding of how and why online communities like Wikipedia work (and fail), how and why people share their locations, among other topics in the social computing space.
Professor Jaideep Srivastava is a co-inventor and co-founder of Ninja Metrics, a software startup that can analyze data to identify key traits among massive multiplayer online gaming communities. Using this data, game creators can identify each player’s psycho-social motivations, and take action to help ensure an enhanced user experience. The startup relies on novel data mining techniques, developed in part by Minnesota CS&E, that extract key user traits from a massive pool of data being collected from online gaming platforms. The potential for the technology has earned the interest of a number of major players in the online gaming industry.
Environmental Sciences: The wealth of climate and ecosystem data now available from satellite and ground-based sensors and climate model simulations offers huge potential for monitoring, understanding, and predicting the behavior of the Earth’s ecosystem and for advancing the science of climate change. As part of a 5-year, $10 Million NSF funded project, Vipin Kumar, Arindam Banerjee, Shashi Shekhar, and Michael Steinbach are developing a range of data driven methods for an improved understanding of the complex nature of the earth system and the impact of climate change.
Advances in data science include novel methods for identifying relationships in spatio-temporal data, sparse predictive models that can handle high dimensionality of climate data sets, and automated methods for tracking of unlabeled spatio-temporal objects. Highlights of climate science contributions include discovery of new climate phenomena, robust methods for evaluating and combining output of different climate models, and development of a comprehensive open-source ocean eddy dataset that is being used by oceanography groups world-wide to understand global ocean dynamics and its interaction with climate change.
Figure: Sparse dependencies in South American regional temperatures identified by multi-task sparse structure learning.
In collaboration with scientists from NASA and Planetary Skin Institute, Vipin Kumar’s research group has been developing novel data mining methods that have dramatically advanced the state of the art in the monitoring of global land cover using satellite data. By applying these methods on a global scale, they have been able to create comprehensive histories of large-scale changes in the ecosystem due to fires, logging, droughts, flood, farming, etc. This research has been featured in the Economist that lauded the role of data mining algorithms developed at the University of Minnesota for automated monitoring of the global forest cover that is urgently needed to enable the use of forests for economic carbon sink management and to study the natural and human impacts on the ecosystems.
As part of a DOE funded project, Arindam Banerjee is collaborating with Peter Reich and other ecologists to improve global land models, a critical component of Earth system models used for future projections of climate, by shifting from the current plant functional type based approach to one that better utilizes what is known about the importance, patterns and variability of plant traits, such as leaf lifespan, leaf nitrogen content, respiration, and photosynthesis, based on TRY db, the world’s largest database on plant trait information, and other datasets. In essence, the project will develop a quantitative characterization of plant functional diversity leading to better understanding of terrestrial ecosystems and improved land surface models.
Leadership in Data Science
Minnesota CS&E is a leader not only in the technical aspects of data science but also in the growth and expansion of the field through numerous initiatives and highly visible national collaborations.
Data Science Initiatives: The department is
bringing our strength in data science into focus
with a number of new initiatives. In Fall 2015,
we will welcome the first batch of students in the
Data Science MS program (datascience.umn.edu)
that will expose students to cutting-edge methods
and theory that will form the basis for the next generation of big data technology. A collaboration between our department, the Department of Electrical and Computer Engineering, the School of Statistics, and the Division of Biostatistics, the program is being led by Professor Dan Boley, who is serving as its inaugural Director of Graduate Studies.
Our department has also founded and is co-leading (with Carlson School of Management) the Social Media and Business Analytics Collaborative (SOBACO), a major initiative building bridges between the University and local companies (sobaco.umn.edu). Through these industry-academia partnerships, Minnesota CS&E researchers are simultaneously creating new knowledge in data science and helping to solve immediate problems faced by major companies.
The CS&E department is a key participant in the University of Minnesota Informatics Institute (UMII), a brand new University-wide center led by department faculty member Dr. Claudia Neuhauser (sites.google.com/a/umn.edu/informatics-institute/home). UMII was founded to foster data-intensive research in agriculture, arts, design, engineering, environment, health, humanities, and social sciences. UMII’s vision includes advancing data analytics, enhancing the University of Minnesota’s competitiveness in data-intensive research across all disciplines, and partnering with industry to harness the power of big data for economic growth and development.
Our department hosts the National Science Foundation Center for Research in Intelligent Storage (CRIS) led by Professor Du, a partnership between universities and industry (cris.cs.umn.edu). CRIS is pushing the boundaries of file and storage systems by exploring and developing new technologies and techniques, improving the usability, scalability, security, reliability, and performance of storage systems. The Center has leveraged strong ties to the storage system industry in the Twin Cities, long a major center of the storage industry in the United States. The current industrial support includes 14 sponsorships from 10 companies (Seagate, HGST, HP, Dell, LSI, NetApp, Symantec, Xyratex, SGI, and FedCentric).
Large-Scale Data Science Collaborations: CS&E faculty are also playing a leading role in numerous high-profile large-scale collaborations. One such project “Understanding Climate Change: A Data Driven Approach” is funded by NSF’s Expeditions in Computing program that is aimed at pushing the boundaries of computer science research. This prestigious 5-year, $10 million multi-institution multi-disciplinary project led by CS&E faculty (Vipin Kumar, Arindam Banerjee, Shashi Shekhar, Michael Steinbach) involves collaborators from School of Statistics, the Institute on the Environment, and College of Food, Agricultural and Natural Resource Sciences at Minnesota, and collaborators at NC State, Northwestern, Northeastern and NC A&T Universities (climatechange.cs.umn.edu). This project aims to advance the science of climate change using novel data science methods.
Figure: Global ocean eddy tracks constructed using satellite altimetry data.
TerraPop (Profs. Interrante, Shekhar, Srivastava together with colleagues from Geography, Library Sciences, Environmental Sciences, and History) is a project developing the infrastructure needed to make it easier for researchers to use data describing people along with data describing the places they inhabit at global scale (www.terrapop.org).
Professor Jaideep Srivastava is a leading player in the Virtual Worlds Observatory (VWO). This collaboration is developing novel computational techniques for analyzing large-scale networks, which will have applicability across a wide variety of domains. Along with Minnesota, Northwestern University, University of Southern California, University of Illinois, and others are involved (www.vwobservatory.com).
Data Science Alumni
The CS&E Department has had a large contingent of faculty working in areas related to data science for years, and has graduated well over 100 PhD students over the past 10 years in areas related to data science and big data. These graduates are in high demand in top companies like Google, Microsoft, Amazon, Apple, eBay, IBM, Facebook, Twitter, and Yahoo!, and our PhDs are sought after as faculty members in institutions around the world. We are especially proud of the fact that many of our graduates serve in leadership roles in the professional community, for example, as program chairs of major conferences and editorial board members of major journals. Many of these are high-profile alumni who have made substantial contributions in the data science area both in industry and in academia (see featured alumni in this newsletter for a few examples). It is also noteworthy that the University of Minnesota Best Dissertation Awards in Science and Engineering for the past two years have been won by the graduates of our department (Gang Fang - 2013 and James Faghmous - 2014) for their work on Big Data. This is proof positive that we are attracting the very top students, and training them well to make their mark in the world, and do us proud.
An extended and enriched version of this article can be found at https://datascience.umn.edu/research/