Open Science, Open Data,
Open Source Projects for
Undergraduate Research
Experiences
Kam D. Dahlquist, Ph.D.
Department of Biology
Loyola Marymount University
BioQUEST/HHMI/CaseNet Summer Workshop
June 13, 2015
Outline
• An open science ecosystem enhances student learning
• Quick example: XMLPipeDB project in a Biological
Databases course
• Longer example: GRNmap project in Biomathematical
Modeling course
• Potential research projects for BioQUEST participants
• Challenges are also opportunities
– Computer literacy
– Data literacy
– Information literacy
Open Science Ecosystem
• Open Access (creative commons)
• Open Pedagogy
• Open Source Code
• Open Data
• Open Science (open process)
• Reproducible Research
• Research Integrity
• Citizen Science
With thanks to John Jungck
Open Science Pedagogy Adds Open Source Values
and Tools to Problem Spaces
• Students solve an authentic research problem.
• They investigate large, publicly available datasets.
• They return the products of their research to the
scholarly community.
Image: http://www.bioquest.org/bedrock/problem_spaces/
Official Open Source Definition
(http://opensource.org)
• Free redistribution
• Source code
• Derived works
• Integrity of the author’s source code
• No discrimination against persons or groups
• No discrimination against fields of endeavor
• Distribution of license
• License must not be specific to a product
• License must not restrict other software
• License must be technology-neutral
Open Source Values Mirror STEM Curricular Reform
| Open Source Values | Active Learning Pedagogy | Open Source Practices & Tools |
| Source code is available, modifiable, and long-lived | Authentic problem to solve with realistic complexity | Central code repository; version control; provenance of code |
| Accountability to a developer and user community | Participatory and collaborative work; peer review | Task and bug trackers; continuous integration; test-driven workflows |
| Responsibilities accompany rights | Responsibility and ownership of the learning process | Documentation: inline, user manual, web site, wiki |
Pedagogy Implemented on Course Wikis
• Team-taught and cross-listed
− BIOL/CMSI 367: Biological Databases
https://xmlpipedb.cs.lmu.edu/biodb/fall2013/index.php/Main_Page
− BIOL/MATH 388: Biomathematical Modeling
http://www.openwetware.org/wiki/BIOL398-04/S15
• Single instructor
− BIOL 368: Bioinformatics Laboratory
http://www.openwetware.org/wiki/BIOL368/F14
− BIOL 478: Molecular Biology of the Genome
(wet lab, mostly offline)
data analysis: http://www.openwetware.org/wiki/BIOL478/S15:Microarray_Data_Analysis
• Weekly assignments leading up to final research project
• All projects involve exploration of DNA microarray data
Biological Databases
Team Final Project: create a gene database for a bacterial species
http://xmlpipedb.cs.lmu.edu/
PostgreSQL intermediate database → GenMAPP-compatible Gene Database → visualize microarray data
Each Student on the Team is
Assigned a Specific Role
• Coder
• Project Manager
• Quality Control
• Data Analysis
Student Products Are Shared with the
Scientific Community
http://sourceforge.net/projects/xmlpipedb/
Systems Biology Workflow
• DNA microarray data: wet lab-generated or published
• Statistical analysis, clustering, Gene Ontology term enrichment
• Generate gene regulatory network
• Modeling dynamics of the network
• Visualizing the results
• New experimental questions
Central Dogma of Molecular Biology (simplified)
DNA → (transcription) → mRNA → (translation) → Protein
Freeman (2003)
And Now in the “omics” Era…
Genome → (transcription) → Transcriptome → (translation) → Proteome
Freeman (2002)
Budding Yeast, Saccharomyces cerevisiae, is
an Ideal Model Organism for Systems Biology
• Small genome of ~6000 genes
• Extensive genome-wide datasets readily accessible
• Molecular genetic tools available
Alberts et al. (2004)
Environmental Changes and Stresses
• All organisms must respond to changes in the environment
– pH
– oxygen availability
– pressure
– osmotic stress
– temperature (heat and cold)
• Some changes in the environment cause cellular
damage and trigger a “stress response”
– damage from reactive oxygen species
– damage from UV radiation
– sudden and/or large change in temperature (increase or
decrease)
Cold Shock Is an Environmental Stress
that Is Not Well-Studied
• Increases in temperature (heat shock)
– response very well-characterized
– proteins denature due to heat
– induction of heat shock proteins (chaperonins) that assist in protein folding
– conserved in all organisms (prokaryotes, eukaryotes)
• Decreases in temperature (cold shock)
– response less well-characterized
– decrease fluidity of membranes
– stabilize DNA and RNA secondary structures
– impair ribosome function and protein synthesis
– decrease enzymatic activities
– no equivalent set of cold shock proteins that are conserved in all organisms
Yeast Respond to Cold Shock by
Changing Gene Expression
• Cold shock temperature range for yeast is 10-18°C
• Previous studies indicate that the cold shock response
can be divided into:
• Late response genes – 12 to 60 hours
– General environmental stress response genes (ESR) are induced
– Regulated by the Msn2/Msn4 transcription factors
• Early response genes – 15 minutes to 2 hours
– Genes unique to cold shock are induced, such as genes involved
in ribosome biogenesis and membrane fluidity
– Which transcription factors regulate this response is unknown
Transcription Factors Control Gene Expression
by Binding to Regulatory DNA Sequences
• Activators increase gene expression
• Repressors decrease gene expression
• Transcription factors are themselves proteins
that are encoded by genes
Experimental Design and Methods
Yeast Cells Were Harvested for Microarrays Before,
During, and After a Cold Shock and During Recovery
• 4 replicates of each experiment with dye swaps
• wt and transcription factor deletion strains
DNA Microarray
• Mixture of labeled cDNA from two samples
• One spot = one gene
• Green = decreased relative to control
• Red = increased
• Yellow = no change in gene expression
Freeman (2002)
Gene Expression Changes Due to Cold Shock
Return to Pre-shock Levels During Recovery
Figure panels: t30/t0 cold shock, t60/t0 cold shock, t90/t0 recovery, t120/t0 recovery
• Four sets of biological replicates were performed
• Dye orientation was swapped for two sets of replicates
Steps Used to Analyze DNA Microarray Data
1. Quantitate the fluorescence signal in each spot
2. Calculate the ratio of red/green fluorescence
3. Log2 transform the ratios
4. Normalize the ratios on each microarray slide
5. Normalize the ratios for a set of slides in an experiment
6. Perform statistical analysis on the ratios
7. Compare individual genes with known data
8. Pattern finding algorithms/clustering
9. Modeling the dynamics of the gene regulatory network
10. Visualizing the results
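Steps 2 through 5 above can be sketched in a few lines of Python. The slides do not say which normalization methods were used, so centering each slide on zero and scaling each slide to unit standard deviation are assumptions made for illustration, as are the function names and the intensity values.

```python
import math
from statistics import mean, stdev


def log2_ratios(red, green):
    """Steps 2-3: red/green ratio for each spot, then log2 transform."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def normalize_within_slide(log_ratios):
    """Step 4 (assumed method): center one slide's log ratios on zero."""
    center = mean(log_ratios)
    return [x - center for x in log_ratios]


def normalize_across_slides(slides):
    """Step 5 (assumed method): scale every slide to unit standard deviation."""
    scaled = []
    for slide in slides:
        spread = stdev(slide)
        scaled.append([x / spread for x in slide])
    return scaled


# Hypothetical intensities for three spots on one slide.
red = [200.0, 100.0, 50.0]
green = [100.0, 100.0, 100.0]
ratios = log2_ratios(red, green)
centered = normalize_within_slide(ratios)
scaled = normalize_across_slides([centered])
```

A log2 ratio of 1 means a two-fold increase and -1 a two-fold decrease, which is why the transform makes induced and repressed genes symmetric around zero.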
Systems Biology Workflow
Statistical analysis, clustering, Gene Ontology term enrichment (tools: Excel, stem)
And so on…
Within-strain ANOVA Reveals How Many
Genes Had Significant Changes in
Expression at Any Timepoint
| ANOVA | wt | Δgln3 |
| p < 0.05 | 2378/6189 (38.42%) | 1864/6189 (30.11%) |
| p < 0.01 | 1527/6189 (24.67%) | 1008/6189 (16.29%) |
| p < 0.001 | 860/6189 (13.90%) | 404/6189 (6.53%) |
| p < 0.0001 | 460/6189 (7.43%) | 126/6189 (2.04%) |
| B-H p < 0.05 | 1656/6189 (26.76%) | 913/6189 (14.75%) |
| Bonferroni p < 0.05 | 228/6189 (3.68%) | 26/6189 (0.42%) |
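The two multiple-testing corrections named in the ANOVA results, Bonferroni and Benjamini-Hochberg (B-H), can be sketched as follows. This is a minimal illustration of the standard procedures, not the spreadsheet workflow used in the course; the p values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Adjust p values to control the false discovery rate."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for five genes.
p = [0.001, 0.008, 0.02, 0.03, 0.6]
n_bonferroni = sum(q < 0.05 for q in bonferroni(p))
n_bh = sum(q < 0.05 for q in benjamini_hochberg(p))
```

On these five values Bonferroni keeps two genes and B-H keeps four, the same direction of difference seen in the ANOVA counts: Bonferroni is the more stringent correction.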
A Modified T Test Was Used to Determine Significant
Changes in Gene Expression at Each Timepoint
wild type: number of genes whose expression changes
| | Cold shock t15 | Cold shock t30 | Cold shock t60 | Recovery t90 | Recovery t120 |
| Increased, p < 0.05 | 439 (7%) | 668 (11%) | 609 (10%) | 398 (6%) | 191 (3%) |
| Decreased, p < 0.05 | 331 (5%) | 517 (8%) | 411 (7%) | 249 (4%) | 59 (1%) |
| Total, p < 0.05 | 770 (12%) | 1185 (19%) | 1020 (17%) | 647 (10%) | 250 (4%) |
Short Time Series Expression Miner (stem)
Software Clusters Genes with Similar Profiles
Figure axes: Expression (log2 fold change) versus Time (minutes)
Gene Ontology categories assigned to clusters:
• Ribosome biogenesis
• Zinc ion homeostasis
• Hexose transport
• Endomembrane system
• Protein and vesicle transport
• Negative regulation of nitrogen compound process
The Transcription Factor Gln3 Regulates Genes
Involved in Nitrogen Metabolism
• Yeast differentiate between preferred and
non-preferred nitrogen sources.
• When the nitrogen source is poor, Gln3 localizes
to the nucleus and activates genes required to
utilize the poor nitrogen source.
• The Δgln3 strain is impaired for growth at cold
temperatures:
− Doubling time at 13°C of 15 hours vs. 8.3 hours for wild type.
• A microarray experiment was performed on the
Δgln3 strain.
Gln3 Target Genes Were Extracted from the YEASTRACT Database
37 out of 164 (23%) have significantly different expression profiles in the wild type versus the Δgln3 strain
Systems Biology Workflow
Generate gene regulatory network (tools: YEASTRACT, Excel)
Genome-wide Location Analysis has Determined
the Relationships between Transcription Factors
and their Target Genes in Yeast
• Does not show
whether activation or
repression occurs
• Shows topology, but
not the behavior of the
network over time
• Data found in
YEASTRACT database
Lee et al. (2002)
A Transcriptional Network Controlling
the Cold Shock Response
Assumptions made in our model:
• Each node represents one gene encoding a transcription factor.
• When a gene is transcribed it is immediately translated into protein;
a node represents both the gene and the protein it encodes.
• An edge drawn between two nodes represents a regulation
relationship, either activation or repression, depending on the sign
of the weight.
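The modeling assumptions above map onto a simple data structure: a set of directed edges, each carrying a signed weight. The sketch below is illustrative only; the gene names and weights are hypothetical placeholders, not values from the cold shock network.

```python
# Hypothetical network: names and weights are illustrative, not fitted values.
# Each key is (regulator, target); each node stands for a gene and its protein.
weights = {
    ("GENE_A", "GENE_B"): 1.5,
    ("GENE_B", "GENE_C"): -0.8,
    ("GENE_C", "GENE_C"): 0.3,  # a transcription factor can regulate itself
}


def edge_type(weight):
    """Classify a regulatory relationship by the sign of its weight."""
    if weight > 0:
        return "activation"
    if weight < 0:
        return "repression"
    return "no regulation"


def regulators_of(target, weights):
    """Map each regulator of `target` to the type of regulation."""
    return {
        regulator: edge_type(w)
        for (regulator, tgt), w in weights.items()
        if tgt == target
    }


print(regulators_of("GENE_C", weights))
```

Storing only the edges that exist keeps the topology (which comes from the location analysis data) separate from the weights (which are estimated later).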
Systems Biology Workflow
Modeling dynamics of the network (tool: GRNmap, Windows-only)
GRNmap: Gene Regulatory Network Modeling
and Parameter Estimation
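$$\frac{dx_i(t)}{dt} = \frac{P_i}{1 + \exp\left(-\left[\sum_j w_{ij}\, x_j(t) - b_i\right]\right)} - d_i\, x_i(t)$$
Figure: sigmoidal production function plotted for repression and for activation (axis marks at 0, 0.5, 1, and 1/w)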
• Parameters are estimated from
DNA microarray data from wild
type and transcription factor
deletion strains subjected to
cold shock conditions.
• Weight parameter, w, gives the
direction (activation or
repression) and magnitude of
regulatory relationship.
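A forward simulation of this kind of model can be sketched in plain Python. The sketch assumes the sigmoidal form dx_i/dt = P_i / (1 + exp(-(Σ_j w_ij x_j - b_i))) - d_i x_i, with P the production rates, w the weights, b the thresholds, and d the degradation rates. It uses a fixed-step Euler method for brevity; GRNmap itself runs in MATLAB, and nothing here reproduces its solver or its parameter values.

```python
import math


def production(i, x, P, w, b):
    """Sigmoidal production rate of gene i given all expression levels x."""
    total = sum(w[i][j] * x[j] for j in range(len(x)))
    return P[i] / (1.0 + math.exp(-(total - b[i])))


def derivative(x, P, w, b, d):
    """Right-hand side of the model: production minus degradation."""
    return [production(i, x, P, w, b) - d[i] * x[i] for i in range(len(x))]


def simulate(x0, P, w, b, d, dt, steps):
    """Fixed-step Euler integration; returns the list of states over time."""
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(steps):
        dx = derivative(x, P, w, b, d)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        trajectory.append(list(x))
    return trajectory


# Hypothetical one-gene example with no regulation (weight 0, threshold 0):
# production is P/2 = 1, so expression relaxes toward P / (2 d) = 1.
trajectory = simulate([0.0], P=[2.0], w=[[0.0]], b=[0.0], d=[1.0],
                      dt=0.1, steps=200)
```

A positive weight pushes the sigmoid toward its maximum as the regulator's expression rises (activation); a negative weight pushes it toward zero (repression).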
The “Worst” Rate Equation is:
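$$\frac{d\,\mathrm{PHD1}}{dt} = \frac{P_{\mathrm{PHD1}}}{1 + \exp\left(-\left[w_5(\mathrm{CIN5}) + w_{10}(\mathrm{FHL1}) + w_{23}(\mathrm{PHD1}) + w_{30}(\mathrm{SKN7}) + w_{35}(\mathrm{SKO1}) + w_{41}(\mathrm{SWI4}) + w_{43}(\mathrm{SWI6}) - b_{14}\right]\right)} - d_{\mathrm{PHD1}}\,\mathrm{PHD1}$$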
Optimization of the 92 Parameters Requires
the Use of a Regularization (Penalty) Term
$$E = \frac{1}{Q}\sum_{r=1}^{Q}\left[z^{d}(t_r) - z^{c}(t_r)\right]^2 + \alpha\,\lVert\theta\rVert^2$$
• Plotting the least squares error function showed that not all the graphs had clear minima.
• We added a penalty term so that MATLAB’s optimization algorithm would be able to minimize the function.
• θ is the combined production rate, weight, and threshold parameters.
• α is determined empirically from the “elbow” of the L-curve.
Figure: L-curve of Least Squares Residual versus Parameter Penalty Magnitude
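The penalized error function can be written as a short Python function. This sketch assumes the penalty is the weighting constant times the sum of squared parameters, reads the two z terms as observed data and model output, and uses hypothetical names and numbers throughout; it is not the MATLAB code used in GRNmap.

```python
def penalized_error(observed, computed, theta, alpha):
    """Mean squared residual plus alpha times the squared parameter norm."""
    Q = len(observed)
    residual = sum((zd - zc) ** 2 for zd, zc in zip(observed, computed)) / Q
    penalty = alpha * sum(t ** 2 for t in theta)
    return residual + penalty


# Hypothetical values: two data points, two parameters.
error = penalized_error(
    observed=[1.0, 2.0],
    computed=[1.0, 4.0],
    theta=[1.0, 2.0],
    alpha=0.5,
)
```

Here the residual is (0 + 4) / 2 = 2.0 and the penalty is 0.5 × 5 = 2.5. Raising the weighting constant trades a worse fit for smaller parameters, and the L-curve plots that trade-off.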
Forward Simulation of the Model Fits
the Microarray Data
Systems Biology Workflow
Visualizing the results (tool: GRNsight)
GRNsight Rapidly Generates GRN graphs Using
Our Customizations to the Open Source D3 Library
Adobe Illustrator: several
hours to create
GRNsight: 10 milliseconds to
generate, 5 minutes to
arrange
GRNsight: colored edges for
weights reveal patterns in
data
The First Round of Modeling Has Suggested
Future Experiments
http://www.openwetware.org/wiki/Dahlquist:BioQUEST_Summer_Workshop_2015
95% of Bioinformatics is Getting Your Data
into the Correct File Format
• Exposes deficiencies in computer literacy skills in
so-called “digital natives”
• When you leave your comfort zone, it is,
by definition, uncomfortable
• Emphasis on research process
− Teamwork
− Electronic lab notebook
− Keeping track of files and code
− Troubleshooting problems that arise in the research process: bugs, data issues, etc.
Summary
• An open science ecosystem enhances student learning
• Quick example: XMLPipeDB project in a Biological
Databases course
• Longer example: GRNmap project in Biomathematical
Modeling course
• Potential research projects for BioQUEST participants
• Challenges are also opportunities
– Computer literacy
– Data literacy
– Information literacy
Acknowledgments
Ben G. Fitzpatrick
LMU Math
John David N. Dionisio
LMU Computer Science
Special thanks to
John Jungck &
Sam Donovan
Juan Carrillo, Natalie Williams, K. Grace Johnson, Kevin Wyllie, Kevin McGee
Monica Hong, Nicole Anguiano, Anindita Varshneya, Trixie Roque, (Tessa Morris)