4th year
1.
a. Describe the basic procedure of acquisition and warehousing of gene
expression data.
b. List two of the commonly used methods in the analysis of gene expression data
and describe their main functions.
c.
i. Explain how hierarchical clustering algorithms work. Make sure your
answer describes what is meant by a linkage method and how it is used.
ii. Based on a Euclidean similarity measure, calculate the similarity matrix
between the following observations:

Gene ID   T1   T2
G0001      1    1
G0002      1    4
G0003      5    1
G0004      5    4
d. Given a gene expression profile of two populations (gene expressions in
untreated tissue samples vs. gene expressions in its corresponding treated
tissue samples), the volcano plot provides a visualisation of significance of
changes in expression values between two populations using the hypothesis
test method.
i. Draw an analysis workflow diagram for the required data analysis.
ii. Explain the steps in each of the stages in your workflow.
The four parts carry, respectively, 15%, 15%, 30% and 30%.
© University of London 2003 Paper
3rd year
a. Give a brief description of the concept of gene expression and its commonly
used analysis methods.
b. Describe the application of supervised learning (classification) and
unsupervised learning (clustering) in gene expression analysis.
c. Given a gene expression profile of drug treatment measured at 8 different time
points.
i. Design a workflow to cluster the co-expressed genes.
ii. Explain the steps in each of the stages in your workflow.
d. The following table shows the distance matrix between five genes:

      G1   G2   G3   G4   G5
G1     0
G2     9    0
G3     3    7    0
G4     6    5    9    0
G5    11   10    2    8    0

i. Based on a complete linkage method, show the distance matrix between
the first formed cluster and the other data points.
ii. Draw a dendrogram showing the full tree for the five points based on
complete linkage.
iii. Draw a dendrogram showing the full tree for the five points based on
single linkage.
The four parts carry, respectively, 20%, 20%, 30% and 30% of the marks.
Sample Answers (4th Year)
1.
a. Model answer should mention microarray chips and experiments, where each
spot on a chip measures the expression of one gene under a given environmental
condition. The output data is stored for each gene together with the experiment
context and reference information, usually from public sources. Commonly used
analysis methods include statistical analysis (hypothesis testing), visualisation
techniques and data mining.
[3 marks]
b. i. Clustering, to find groups of genes with similar or correlated behaviour
under different environmental conditions. ii. Differential gene expression
analysis, to study gene behaviour in different states (e.g. diseased and normal
states).
[3 marks]
c. Standard book question: A similarity (distance) matrix is constructed that
holds the distance between every pair of points. The pair with the shortest
distance is merged into one cluster (or new point). The process is repeated
until all points are merged, resulting in a dendrogram or tree shape.
[2 marks]
Different linkage methods define how distances between clusters are measured
(e.g. single linkage, complete linkage, average linkage).
[2 marks]
ii.
        G0001   G0002             G0003               G0004
G0001   -       sqrt(0+3*3)=3     sqrt(4*4+0)=4       sqrt(4*4+3*3)=5
G0002           -                 sqrt(4*4+3*3)=5     sqrt(4*4+0*0)=4
G0003                             -                   sqrt(0+3*3)=3
G0004                                                 -
[4 marks]
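The arithmetic in the matrix above can be checked with a short script. This is only a sketch: the (T1, T2) values are copied from the question table, while the names `observations`, `euclidean` and `distance_matrix` are made up for illustration.

```python
import math

# Expression values (T1, T2) copied from the question table.
observations = {
    "G0001": (1, 1),
    "G0002": (1, 4),
    "G0003": (5, 1),
    "G0004": (5, 4),
}

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(points):
    """Return {(id_a, id_b): distance} for every ordered pair of ids."""
    return {
        (a, b): euclidean(pa, pb)
        for a, pa in points.items()
        for b, pb in points.items()
    }

matrix = distance_matrix(observations)
for a in observations:
    # One row of the full (symmetric) matrix per gene.
    print(a, [matrix[(a, b)] for b in observations])
```

The script prints the full symmetric matrix; the model answer lists only the upper triangle because d(a, b) = d(b, a) and d(a, a) = 0.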
d.
i.
Cleaned Gene Exp Table
  -> Calc. Fold Changes for each gene
  -> Calc. t-test between groups
  -> Calc. Significance & Effect
  -> Draw scatter plot
  -> Highlight Genes with high Significance & Effect
[3 marks]
ii.
a) Starting with a table of gene expression values for the two populations, we
first calculate the fold change for each gene between every two time
points in the time series as (ln t2 - ln t1).
b) On the newly calculated fold-change table we apply a t-test between
the two populations, from which we obtain the significance (p-value)
of the changes between them.
c) We calculate the effect of the change as the difference between the
logged means for a gene.
d) Genes that have both a high effect and a high significance are deemed
to be interesting genes.
[3 marks]
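Steps b) and c) can be sketched for a single gene as follows. This is a minimal illustration under stated assumptions, not the marking scheme's method: it uses Welch's form of the two-sample t statistic (the answer only says "t-test"), the expression values are invented, and the function name `volcano_coordinates` is hypothetical. Turning the t statistic into a p-value needs the t distribution from a statistics library and is left out here.

```python
import math
from statistics import mean, variance

def volcano_coordinates(untreated, treated):
    """Return (effect, t_statistic) for one gene.

    effect: difference between the means of the natural-log expression values.
    t_statistic: Welch's two-sample t statistic on the logged values.
    """
    log_u = [math.log(x) for x in untreated]
    log_t = [math.log(x) for x in treated]
    effect = mean(log_t) - mean(log_u)
    # Standard error of the difference in means (unequal variances allowed).
    se = math.sqrt(variance(log_u) / len(log_u) + variance(log_t) / len(log_t))
    return effect, effect / se

# Invented expression values (untreated, treated), for illustration only.
genes = {
    "gene_A": ([100, 110, 90, 105], [400, 380, 420, 410]),
    "gene_B": ([50, 55, 45, 52], [51, 49, 56, 47]),
}
for name, (untreated, treated) in genes.items():
    effect, t_stat = volcano_coordinates(untreated, treated)
    print(f"{name}: effect={effect:.2f}, t={t_stat:.2f}")
```

In a volcano plot each gene becomes one point, with the effect on the horizontal axis and the significance on the vertical axis; a gene such as the invented gene_A, with a large effect and a large t statistic, would land in an outer upper corner.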
Sample Answers (3rd Year)
a. In every organism, different genes are expressed in different cell and
tissue types (spatial differences) and at different developmental stages
(temporal differences). Analysis of these variations in gene expression can
lead to a better understanding of disease states, targeting of drugs to specific
cells, tissues or individuals, development of agricultural products, etc. Model
answer should mention microarray chips and experiments, where each spot on a
chip measures the expression of one gene under a given environmental
condition.
[4 marks]
b. Model answer should describe using clustering to find groups of genes with
similar behaviour under different environmental conditions, and differential
gene expression analysis to study gene behaviour in different states (e.g.
diseased and normal states).
[4 marks]
c.
Gene Exp Table
  -> Clean data from noise (e.g. by flooring), scaling etc.
  -> Cluster Data Points
  -> Validate Meaning of Clusters for functions, pathways etc.
[3 marks]
The data is presented in a table, where each row contains a gene ID and a time
series of measurements. The data is then cleaned of noise, e.g. using floor
functions. A clustering algorithm with an appropriate distance measure is
applied, treating each time series as a vector (one point) in the feature
space. The generated clusters are then examined to interpret the
significance/meaning of the genes assigned to the same cluster; this can be
done by querying remote databases for their functions.
[3 marks]
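The cleaning stage of this workflow might look like the sketch below. Everything specific in it is an assumption: the floor value, the choice of z-scoring as the scaling step, the Euclidean distance, and the three invented profiles over 8 time points.

```python
import math
from statistics import mean, pstdev

FLOOR = 20.0  # hypothetical noise floor; a real value depends on the platform

def clean_profile(series, floor=FLOOR):
    """Raise values below the floor to the floor, then z-score the series."""
    floored = [max(x, floor) for x in series]
    mu, sigma = mean(floored), pstdev(floored)
    if sigma == 0:
        return [0.0] * len(floored)  # flat profile: no shape to compare
    return [(x - mu) / sigma for x in floored]

def euclidean(p, q):
    """Distance between two cleaned profiles treated as vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Invented profiles over 8 time points, for illustration only.
profiles = {
    "gene_A": [5, 30, 60, 120, 240, 200, 150, 100],
    "gene_B": [10, 60, 120, 240, 480, 400, 300, 200],
    "gene_C": [300, 250, 200, 150, 100, 60, 30, 12],
}
cleaned = {gene: clean_profile(series) for gene, series in profiles.items()}

d_ab = euclidean(cleaned["gene_A"], cleaned["gene_B"])
d_ac = euclidean(cleaned["gene_A"], cleaned["gene_C"])
print(f"A-B: {d_ab:.2f}, A-C: {d_ac:.2f}")
```

Scaling each profile before measuring distance makes genes with the same shape but different magnitude (gene_A and gene_B here) fall close together, which is usually what "co-expressed" is meant to capture.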
d.
i.
The first cluster will be formed from G3 and G5 since they have the
minimum distance.
[1 mark]

       G35   G1   G2   G4
G35     0
G1     11    0
G2     10    9    0
G4      9    6    5    0
[1 mark]
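The update that produces the G35 row can be written out in a few lines. This is a sketch only: the distances are copied from the question's table, and the names `dist`, `d`, `first_pair` and `to_cluster` are chosen for illustration.

```python
# Pairwise distances copied from the question's table.
dist = {
    ("G1", "G2"): 9, ("G1", "G3"): 3, ("G1", "G4"): 6, ("G1", "G5"): 11,
    ("G2", "G3"): 7, ("G2", "G4"): 5, ("G2", "G5"): 10,
    ("G3", "G4"): 9, ("G3", "G5"): 2,
    ("G4", "G5"): 8,
}

def d(a, b):
    """Look up a distance regardless of the order of the pair."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

# The closest pair forms the first cluster.
first_pair = min(dist, key=dist.get)

# Complete linkage: the distance from a point to the cluster is the
# largest of its distances to the cluster's members.
others = [g for g in ("G1", "G2", "G3", "G4", "G5") if g not in first_pair]
to_cluster = {g: max(d(g, m) for m in first_pair) for g in others}
print(first_pair, to_cluster)
```

Replacing `max` with `min` in the last step would give the single-linkage version of the same row.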
ii, iii.
[Figures: dendrogram for single linkage; dendrogram for complete linkage]
[4 marks]
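The merge order and merge heights that such dendrograms encode can be recomputed from the distance matrix in the question. The following is a naive sketch, not an efficient implementation, and the names `DIST`, `point_distance` and `agglomerate` are hypothetical. Passing `min` as the linkage gives single linkage and `max` gives complete linkage.

```python
from itertools import combinations

# Pairwise distances copied from the question's table.
DIST = {
    ("G1", "G2"): 9, ("G1", "G3"): 3, ("G1", "G4"): 6, ("G1", "G5"): 11,
    ("G2", "G3"): 7, ("G2", "G4"): 5, ("G2", "G5"): 10,
    ("G3", "G4"): 9, ("G3", "G5"): 2,
    ("G4", "G5"): 8,
}

def point_distance(a, b):
    """Look up a distance regardless of the order of the pair."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def agglomerate(points, linkage):
    """Merge the two closest clusters until one remains.

    `linkage` is min (single linkage) or max (complete linkage).
    Returns the merges in order as (cluster_a, cluster_b, height).
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            a, b = pair
            return linkage(point_distance(x, y) for x in a for y in b)
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = ["G1", "G2", "G3", "G4", "G5"]
single = agglomerate(points, min)
complete = agglomerate(points, max)
for name, merges in (("single", single), ("complete", complete)):
    print(name, [(a + b, height) for a, b, height in merges])
```

With this table, single linkage merges at heights 2, 3, 5 and 6 (G3+G5, then G1 joins them, then G2+G4, then the two groups), while complete linkage merges at heights 2, 5, 9 and 11 (G3+G5, G2+G4, then G1 joins G2+G4, then the two groups).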