MEDG520
Block 5
Bioinformatics
Concepts:
 Sensitivity vs. Specificity
 Curated vs. Comprehensive databases
 Supervised vs. Unsupervised Machine Learning Methods
o How do the terms apply to microarray analysis?
 Accessing Gene Information
o Genome Browser
o Gene Catalogs
 Classification of Gene Orthologs
 Assessing the Data
 Are DNA sequences random?
 What is the difference between a local and global alignment?
 Over-Training
 Cross-Validation versus Independent Test Sets
 Assigned Papers
 Profile Models
 Assigned Papers
 Web Resources
 Sample Questions
Sensitivity vs. Specificity – Know the meaning of each term and how they are applied to
discuss the performance of computational methods. Presented with specific examples,
suggest which could be limiting the utility.
 Sensitivity is the ability to correctly identify those who have the disease.
 Specificity is the ability to correctly identify those who do not have the disease.
 Ideally, a test should have 100% sensitivity and 100% specificity. In other words,
the test always correctly identifies the disease state of the people tested.
                     Result of Screening
Disease State        Positive          Negative
Disease              True Positive     False Negative
No Disease           False Positive    True Negative
People who have the disease and test positive are the "true positives." People who do
not have the disease and test negative are the "true negatives."
Sensitivity = (true positives / (true positives + false negatives)) x 100
(total actually diseased)

When the number of "false negatives" is small relative to the number of "true
positives", sensitivity approaches 100%.
Specificity = (true negatives / (true negatives + false positives)) x 100
(total actually not diseased)

When the number of "false positives" is small relative to the number of "true
negatives", specificity approaches 100%.
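The two formulas can be checked with a few lines of Python. This is a minimal sketch; the function names and the screening counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """Percent of truly diseased cases that the test calls positive."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """Percent of truly non-diseased cases that the test calls negative."""
    return 100.0 * tn / (tn + fp)


# Hypothetical screening counts, chosen only for illustration.
tp, fn, fp, tn = 90, 10, 30, 870
print(sensitivity(tp, fn))  # 90 of the 100 diseased are detected
print(specificity(tn, fp))  # 870 of the 900 non-diseased are cleared
```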
Curated vs. Comprehensive databases
Curated
 All data reviewed by actual human biologist or trained curators.
Comprehensive
 Involves application of algorithms to genome. Not necessarily curated.
Supervised vs. Unsupervised Machine Learning Methods – Know the difference
between the two. Presented with a specific example, classify the training method as
supervised or unsupervised. Indicate how specific supervision could enhance
performance.
Supervised
 The process of building data mining models using a known dependent variable,
also referred to as the target (the thing to be predicted). Classification techniques
are supervised.
 Example: Shipp et al (2002) used a supervised learning method to identify genes
which will be predictive for lymphoma outcome. A human being chose genes from the
microarray data (8 to 16 genes were chosen) and then tested those genes' ability to
accurately predict outcome. This is the supervision.
Unsupervised
 The process of building data mining models without the guidance (supervision) of
a known, correct result. Clustering and association rules are unsupervised mining
functions.
 Take data from instrument and pass straight to algorithm.
 Example: Any microarray study where a competitive hybridization is conducted
on your array and you simply normalize the ratios and perform hierarchical
clustering (for example) on the data to identify clusters of coexpressed genes.
How do the terms apply to microarray analysis?
Supervised
 analysis to determine genes that fit a predetermined pattern
 Usually used to find genes with expression levels that are significantly different
between groups of samples or finding genes that accurately predict a
characteristic of the sample.
 Example: find gene or genes that accurately distinguish one type of cancer from
another or a metastatic tumour from a non-metastatic one.
 Two popular supervised techniques would be nearest-neighbour analysis and
support vector machines. Decision trees or neural networks are other examples.
Unsupervised
 analysis to characterize the components of a data set without a priori input or
knowledge of a training signal
 Try to find internal structure or relationships in data without trying to predict
some ‘correct answer’.
 Three classes:
1. Feature determination
o Look for genes with interesting patterns
o Eg. Principal-components analysis
2. Cluster determination
o determine groups of genes with similar expression patterns
o eg. Nearest-neighbour clustering, self-organizing maps, k-means
clustering, 2d hierarchical clustering
3. Network determination
o determine graphs representing gene-gene or gene-phenotype interactions.
o Eg. Boolean networks, Bayesian networks, relevance networks
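To make the distinction concrete, here is a small sketch that runs both kinds of analysis on the same data. The expression values, the labels and the helper names are all invented; a nearest-centroid classifier stands in for the supervised methods and a two-cluster k-means for the unsupervised ones.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda k: math.dist(point, centres[k]))


# Hypothetical expression profiles (two genes per sample).
samples = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (3.9, 3.8), (4.1, 4.0)]
labels = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]

# Supervised: the known labels define one centroid per class.
classes = sorted(set(labels))
class_centres = [
    centroid([s for s, lab in zip(samples, labels) if lab == c]) for c in classes
]


def classify(profile):
    return classes[nearest(profile, class_centres)]


# Unsupervised: k-means (k=2) sees only the profiles, never the labels.
def kmeans(points, centres, rounds=10):
    for _ in range(rounds):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        centres = [centroid(g) if g else c for g, c in zip(groups, centres)]
    return [nearest(p, centres) for p in points]


clusters = kmeans(samples, centres=[samples[0], samples[3]])
print(classify((3.7, 4.1)))  # tumour
print(clusters)              # [0, 0, 0, 1, 1, 1]
```

The clustering recovers two groups but cannot name them; only the supervised version can say which group is "tumour".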
Accessing Gene Information - Know the difference between genome browsers and gene
catalogs. Be able to suggest which would have greater utility for users with specific
problems to solve.
Genome Browsers
 Display sequences and annotations, alignments, etc for all sequences in genomes
sequenced to date.
 Useful for:
o looking for new genes or potential genes of interest.
o Understanding gene ‘neighborhood’ for your gene of interest. Eg.
proximal regulatory elements (promoters, enhancers, etc.), Neighboring
genes, intron-exon structure, snps, repeats, etc.
o Can be used to visualize practically any feature of a gene or region of
genome, including homologies to other sequences.
o Allows assessment of quality and status of whole genome assembly and
degree to which a genome is ‘finished’.
o Can perform automated assembly and annotation (eg. gene prediction) of
genome as individual sequences are created and submitted to various
public databases all over the world.
Examples
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
Ensembl Genome Browser (http://www.ensembl.org)
Vista Genome Browser (http://pipeline.lbl.gov/)
Gene Catalogs
 Databases containing information from various sources for each known gene.
Can be curated (eg. GenBank) or comprehensive (eg. GeneCards).
 Contains info like:
o Official name
o Synonyms
o Gene IDs for other gene-based resources (eg. LocusLink).
o Cytogenetic locus of the gene, genomic region, gene coordinates.
o Name, functions, expression patterns, of protein product.
o Similarities to other sequences, proteins.
o Involvement in diseases, medical applications.
o Papers published on gene.
o Sequences that gene information is based upon.
o Etc.
o Links to sources of all information above.
 A way of organizing and presenting the sometimes overwhelming amount of
information that is being produced for genes.
Examples
NCBI – Genbank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
GeneCards (http://bioinformatics.weizmann.ac.il/cards/)
LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/)
Classification of Gene Orthologs
Homolog
 Any member of a set of genes, DNA sequences or protein sequences whose
nucleotide sequences show a high degree of one-to-one correspondence
Ortholog
 Homologous sequences in different species that arose from a common ancestral
gene during speciation; may or may not be responsible for a similar function.
Paralog
 Homologous sequences within a single species that arose by gene duplication
Why do orthology studies tend to use protein sequence, not nucleotide?
 More likely to be conserved
 Some species favour different codon structures or average base compositions.
 Example: species living in very hot climates might favour a base composition less
prone to denaturation (high GC percent).
Be able to explain two examples of why one might need to know which genes are
orthologous.
 Stuart et al. (2003) recently demonstrated that by considering orthologous genes
from several species as metagenes together with expression data you can focus in
on gene coexpression networks more likely to represent biologically significant
interactions (instead of just the statistically significant networks found with
expression data alone).
 Most studies of disease in animal models make use of orthologous genes.
o In some cases, a gene will be identified in the disease model by linkage or
association and then the orthologous gene in humans needs to be found to
see if the study can be moved to humans
o In other cases, a gene is first linked to the disease in humans but then an
animal model is needed for further study (eg. to test different therapies).
Again, there must be a known ortholog in the animal model for it to be
useful.
 Many phylogenetic studies that determine how species are related and how they
diverged depend on the degree of sequence similarity between orthologous genes
and regulatory elements to draw conclusions.
Indicate limitations of ortholog-classification methods that are based only on
BLAST comparisons.
 Should consider more than just base or amino acid differences. Synonymous
changes are less significant than non-synonymous and conservative changes less
significant than non-conservative.
 Does not account for functions of “orthologs”. In many cases, an analysis will be
based on the assumption that orthologs (determined by sequence homology) have
the same function. But, this is not necessarily the case. For example, you might
look for regulatory motifs in the upstream region of orthologous genes on the
assumption that genes with shared function are likely to share regulatory control.
However, if one of the genes has assumed a new function over evolutionary time,
then your analysis will be unsuccessful or misleading.
Assessing the Data – Given specific examples, explain how the available data for
training an algorithm could limit performance.
 One problem is overtraining resulting from small or biased data sets.
 In some cases, there may be no consistent informative characteristics that can be
used reliably as a predictor, for the kind of test you are performing. For a
microarray analysis, it may be that there are no transcripts consistently,
differentially expressed between controls and cancer patients.
 Biological (eg. natural variations in transcript levels) and experimental variability
(eg. differences in dye concentration) can be too great to detect the subtle
differences required for the condition with statistical confidence.
For a trivial example, a method developed to predict patient survival based on
microarray results for PSA-positive prostate tumors might not be relevant for
analysis of expression data from PSA-negative tumors.
 PSA-positive tumors and PSA-negative tumors could represent very different
expression environments. They could represent patients with very different
lifestyles (eg. smokers vs non-smokers). Therefore, a model trained on only the
one data set will not necessarily work for the other.
 The data must be representative of all possible situations so that predictive
characteristics can be found that will work for any patient or patient group.
Realistically, this is very difficult to achieve.
DNA as a random string of letters - Are DNA sequences random? - If not, how might
the non-random character impact the performance of a specific software bioinformatics
method.
 DNA sequences are not random.
 Many models and statistics assume a random and uniform distribution of the four
bases. However, this is far from the true situation. Many regions are biased by
certain motifs, repetitive elements, etc. The genome as a whole is generally
GC-poor but contains some GC-rich regions (eg. CpG islands).
 This could impact many bioinformatics software methods. For example, consider an
algorithm that finds transcription factor binding sites using a binding matrix. If the
matrix is for a GC-rich binding motif, the algorithm will return more false positives
for GC-rich regions like CpG islands because the motif is more likely to occur by
chance in these regions (and not represent a true binding site).
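A rough sketch of the size of this effect, assuming independent bases. The motif GGGCGG and the GC fractions (0.41 genome-wide, 0.65 in a CpG island) are assumptions picked for illustration, not values from the course material.

```python
def chance_probability(motif, gc_content):
    """Probability that a random position matches `motif` exactly,
    assuming independent bases with the given GC fraction."""
    p_gc = gc_content / 2.0          # P(G) = P(C)
    p_at = (1.0 - gc_content) / 2.0  # P(A) = P(T)
    prob = 1.0
    for base in motif:
        prob *= p_gc if base in "GC" else p_at
    return prob


motif = "GGGCGG"  # a GC-rich motif, used here only as an illustration
genome_wide = chance_probability(motif, 0.41)  # assumed average GC content
cpg_island = chance_probability(motif, 0.65)   # assumed CpG-island GC content
print(cpg_island / genome_wide)  # (0.65 / 0.41) ** 6, roughly 16-fold
```

Under these assumptions the same motif turns up by chance about 16 times more often per base in the GC-rich region, so a single score threshold will not behave the same everywhere in the genome.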
What is the difference between a local and global alignment?
Global alignment - An alignment that assumes that the two proteins are basically similar
over the entire length of one another. The alignment attempts to match them to each other
from end to end, even though parts of the alignment are not very convincing.
Local alignment - An alignment that searches for segments of the two sequences that
match well. There is no attempt to force entire sequences into an alignment, just those
parts that appear to have good similarity, according to some criterion. Using the same
sequences as above, one could get:
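As a stand-in illustration with made-up sequences (not the ones from the original example), the sketch below computes the best global and the best local alignment score by dynamic programming. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.
    local=False: global (Needleman-Wunsch style), end to end.
    local=True:  local (Smith-Waterman style), best-matching segments only."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)    # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


a = "TTTTACGTACGT"
b = "ACGTACGTCCCC"
print(alignment_score(a, b, local=True))   # 8: the shared ACGTACGT segment
print(alignment_score(a, b, local=False))  # lower: the ends must be aligned too
```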
Over-Training – What does it mean? Given an example, suggest how over-training
could occur.
 Using an unsupervised approach can lead to overtraining (what about supervised
approaches?).
 Usually overtraining results from using too small a training data set, and failure
to cross-validate or test your model against an independent data set.
 You can even overtrain to your independent data set.
o Example: When the genome was first sequenced, everyone wrote gene-finding
algorithms that were validated against the same independent data set of 150 genes
that were not used to train their models. But, because everyone used the same
independent data set (and competed to identify as many of them as possible), we
ended up with a bunch of algorithms that were only good at finding genes with
characteristics similar to those in the independent set.
 Overtraining generally results when you apply a huge number of variables (eg.
15,000 genes in a microarray) to a small number of samples (eg. 50 lymphoma
tissue samples). You are bound to find some genes which correlate with almost
anything for this sample (eg. birthdays).
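The "many variables, few samples" problem is easy to simulate. In this sketch every number is random noise: the trait, the gene values, and the choice of 2,000 genes (fewer than a real array, to keep the run short) are all assumptions for illustration.

```python
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)           # fixed seed so the run is repeatable
n_samples, n_genes = 50, 2000

# A trait with no biological meaning at all (eg. birthday as day of the year).
trait = [rng.uniform(1, 365) for _ in range(n_samples)]

# Pure-noise "expression" values: no gene is truly related to the trait.
best = 0.0
for _ in range(n_genes):
    gene = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = max(best, abs(pearson(gene, trait)))

print(f"strongest chance correlation: {best:.2f}")
```

Even though nothing here is related to anything else, the best of the noise genes shows a sizeable correlation with the trait, which is exactly what an overtrained model latches onto.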
Cross-Validation versus Independent Test Sets - Know the difference between
cross-validation and independent test sets for measuring the performance of bioinformatics
methods. Be able to suggest how cross-validation could be problematic for specific
examples.
Cross-validation
 Used to estimate generalization error.
 A method for evaluating a statistical model or algorithm that has free parameters.
Divide the training data into several parts, and in turn use one part to test the
procedure fitted to the remaining parts. It can be used for model selection or for
parameter estimation when there are many parameters. It approximates predictive
estimation. Jackknifing is a similar, but slightly different, technique.
 Example: Take-one-away (leave-one-out) method. Remove one individual from the
data set and use the remaining data to predict what the one taken away was. Suppose
we are using the expression profiles of 100 patients (50 with cancer and 50 controls).
We would take one out, use the remaining 99 to assess which genes are predictive
of cancer, and then see if we can correctly predict the state of the patient who was
removed. Then repeat, taking a different patient out of the data set each time, until
all 100 have been done.
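The take-one-away loop can be sketched as follows. The classifier (nearest class mean on a single gene) and the eight expression values are assumptions chosen to keep the example short; any trainable model could take its place.

```python
def nearest_centroid_predict(train, labels, sample):
    """Predict the label whose training mean is closest to `sample`
    (a single expression value, to keep the sketch short)."""
    means = {}
    for lab in set(labels):
        values = [x for x, l in zip(train, labels) if l == lab]
        means[lab] = sum(values) / len(values)
    return min(means, key=lambda lab: abs(sample - means[lab]))


def leave_one_out_accuracy(values, labels):
    correct = 0
    for i in range(len(values)):
        # Withhold sample i and train on everything else.
        train = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train, train_labels, values[i]) == labels[i]:
            correct += 1
    return correct / len(values)


# Hypothetical expression of one gene in 4 cancer and 4 control samples.
values = [5.1, 4.8, 5.5, 3.1, 2.0, 2.2, 1.9, 2.9]
labels = ["cancer"] * 4 + ["control"] * 4
print(leave_one_out_accuracy(values, labels))  # 0.875: the 3.1 sample is missed
```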
Jackknifing
 Similar to but not exactly the same as Cross-validation. Both involve a
leave-one-out method but jackknifing is used to estimate the bias of the statistic
instead of the generalization error.
Bootstrapping
 A technique for simulating new data sets, to assess the robustness of a model or to
produce a set of likely models. The new data sets are created by re-sampling with
replacement from the original training set, so each datum may occur more than
once.
 A way of getting a confidence level when you don’t have enough data for
validation of your results.
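A minimal sketch of re-sampling with replacement, here used to put a rough percentile interval around a mean. The data values, the 1,000 resamples and the function name are assumptions for illustration.

```python
import random


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Sorted means of data sets re-sampled with replacement from `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]  # same size, repeats allowed
        means.append(sum(resample) / len(resample))
    return sorted(means)


# Hypothetical measurements (eg. log expression ratios for one gene).
data = [2.1, 1.8, 2.5, 2.2, 1.9, 2.8, 2.0, 2.4]
means = bootstrap_means(data)
low, high = means[25], means[974]  # approximate 95% percentile interval
print(low, high)
```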
Independent test sets
 Any time that you train a model with a training set, you should validate your
model with an independent test set. This will minimize the chance of
overtraining.
 Independent test sets are necessary to confirm that your model or algorithm has
general biological validity beyond your training data.
Why do we need cross-validation and/or independent test sets?
 The problem is numbers.
 Example: We have a small number of people with lymphoma but a large number
of characteristics (10,000s of genes with expression measurements on the chip).
Therefore, you can expect to find by chance good predictive genes for one group
of individuals that don’t work for another group.
 Cross-validation is good but you should also validate against an independent data
set.
Bayesian Statistics
 Assigns a probability to a certain event based on existing knowledge.
 Eg. Polybayes looks at a multiple sequence alignment of chromatograms. Takes
into account sequence reads and past knowledge of SNPs and sequencing errors
(eg. A -> T is a common sequencing error in a polyA region). Uses the prior
knowledge and calculates a probability of a SNP being real using Bayesian
statistics.
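The idea of combining prior knowledge with the observed data can be sketched with Bayes' rule. This is not Polybayes's actual model; every probability below is invented to show how the error context changes the answer.

```python
def posterior_snp(prior_snp, p_data_given_snp, p_data_given_error):
    """Bayes' rule: P(SNP | observed mismatch)."""
    evidence = (p_data_given_snp * prior_snp
                + p_data_given_error * (1.0 - prior_snp))
    return p_data_given_snp * prior_snp / evidence


# All numbers below are invented for illustration.
prior = 0.001  # assumed prior probability that a site is polymorphic

# Mismatch seen in high-quality reads: a sequencing error is unlikely.
print(posterior_snp(prior, p_data_given_snp=0.5, p_data_given_error=0.0001))

# Same mismatch in an error-prone context (eg. a polyA run): error is likely.
print(posterior_snp(prior, p_data_given_snp=0.5, p_data_given_error=0.05))
```

With these made-up inputs the first posterior is above 0.8 and the second is about 0.01, so the same observation is called a SNP in one context and an error in the other.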
Specific Methods from Class
Profile Models – Know how a matrix-type profile is used to generate a score for the
match between a given sequence and a characteristic motif.
Given a SNP, how can we score the binding strength of a known transcription factor
for each allele?
- Take each sequence and pass it to a motif-scanning algorithm with the known TF’s
matrix and a background model and compare the binding strength score returned.
Essentially, this just compares the TF matrix to each sequence, and whichever is a “better fit” gets a
better score.
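A sketch of that comparison, assuming a log-odds scoring scheme with a pseudocount. The count matrix, the background frequencies and the two alleles are hypothetical, and real motif-scanning tools may score differently.

```python
import math


def log_odds_score(seq, counts, background, pseudocount=1.0):
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    total = 0.0
    for pos, base in enumerate(seq):
        column = {b: counts[b][pos] + pseudocount for b in "ACGT"}
        p_motif = column[base] / sum(column.values())
        total += math.log2(p_motif / background[base])
    return total


# Hypothetical count matrix for a 5-bp TATAA-like motif (10 known sites).
counts = {
    "A": [0, 9, 0, 8, 9],
    "C": [1, 0, 0, 0, 0],
    "G": [0, 0, 1, 0, 1],
    "T": [9, 1, 9, 2, 0],
}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # assumed base frequencies

allele_1 = "TATAA"
allele_2 = "TAGAA"  # SNP at the third position
print(log_odds_score(allele_1, counts, background))
print(log_odds_score(allele_2, counts, background))
```

The allele that matches the matrix better gets the higher score; the difference between the two scores is the predicted effect of the SNP on binding.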
Describe a matrix profile for predicting transcription factor binding sites:
Are there tools that facilitate such studies?
 There are two kinds of tools out there:
1. Motif-discovery tools
2. Motif-scanning tools – use known TFs
3. A third approach is search for modules or clusters of motifs using either of the
methods above.
What are the limitations of such predictions? Are predicted binding sites likely to
be real?
 A portion of them are but there tends to be lots of false positives and/or missed
sites.
 The short length and degenerate nature of transcription-factor-binding sites
accounts for most of these false-positives.
 Eg. The unambiguous sequence TATAA is expected once every 1,024 bp (4^5) by
chance. In a mammalian genome of roughly 3 billion bp, we would thus predict
about 3 million potential binding sites on one strand alone. Most predicted binding
sites in mammalian genome sequence are biologically non-functional.
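The arithmetic behind this estimate can be checked directly. The sketch assumes equally likely, independent bases and a genome of 3 billion bp; both are simplifications.

```python
def expected_chance_sites(motif_length, genome_size, both_strands=False):
    """Expected exact matches to one fixed sequence, assuming the four
    bases are independent and equally likely (a simplification)."""
    per_position = 0.25 ** motif_length
    strands = 2 if both_strands else 1
    return genome_size * per_position * strands


genome = 3_000_000_000  # assumed size of a mammalian genome, in bp
print(1 / 0.25 ** 5)                     # 1024.0 bp between chance matches
print(expected_chance_sites(5, genome))  # 2929687.5, ie. about 2.9 million
```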
 Binding strength of a TF to a TFBS depends on more than just the structure of the
TF and the sequence of the TFBS.
o chromatin imposes complex rules on TF access to a binding site
o TF binding may depend on the presence of other TFs, pH, temperature,
chemical concentrations, cellular localization, etc.
 A large number of binding sites are missed (false-negatives).
o Only a fraction of TFs and TFBSs are known.
How can we reduce the number of false-positive predictions?
 Phylogenetic footprinting – assume that important regulatory elements (like
TFBSs) will be conserved across related species and look for binding sites only in
highly conserved sequences.
 Phylogenetic shadowing – multiple sequence comparisons are made between
orthologous genes across short evolutionary distances, taking relationships into
account. Ie. Attempt to differentiate between sequences shared because of recent
ancestry versus functional importance.
 Bird (1995) suggests that the trouble is not turning genes on at the right time but
keeping them from turning on at the wrong time. Therefore if a binding sequence
arises or is present where it shouldn’t be evolution will favor non-conservative
mutations at these positions. Conservative mutations will be more common in
true binding sequences.
 Genes shown to be coexpressed by gene expression studies are more likely to be
coregulated. Therefore, regions around these genes are more likely to contain the
same binding sequences.
 Characterize more TFs and promoter regions and try to develop rules or
commonalities. Eg. ~50% of promoters are found in association with a CpG
island and 93% are associated with CpG islands of a certain length and CpG
dinucleotide frequency.
 Consider the sequence context. For the TATAA example, higher statistical scores
can be assigned if it is found within 30bp of a predicted transcription start site.
 Nature of transcription factor binding can be considered. Eg. If it’s a dimer, you
would be looking for two similar adjacent binding sites.
 Look for clustered or composite binding sites.
 Consider protein-protein interactions. Similar idea to looking in upstream regions
of coexpressed genes. Proteins that interact are more likely to belong to the same
pathway and to be coregulated by the same TFs and TFBSs.
 Combine as many of the above as possible. Eg. rVista makes use of sequence
conservation and motif clustering (modules) to accurately identify only the TFBSs
most likely to be functional.
 Develop statistical measures and thresholds which reliably separate false-positives
from real motifs while minimizing false-negatives.
Fig: Using sequence conservation to help find TFBSs
Assigned Papers
Students should be prepared to cite specific examples (from the readings) that
demonstrate important aspects of these problems. Of the papers we read, the
following are most likely to be relevant to the questions:
Shipp MA, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression
profiling and supervised machine learning. Nat Med. 2002 Jan;8(1):68-74. PMID:
11786909
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11786909&dopt=Abstract
 Developed a supervised learning prediction method to identify cured versus fatal
or refractory lymphomas.
 Using microarray data for 77 samples, and 6817 genes, they found 30 predictor
genes that could distinguish the 58 diffuse large B-cell lymphomas (DLBCL)
from the 19 follicular lymphomas (FL) with a 91% performance rate as measured
by a take-one-out cross-validation test.
o Ie. 1 of 77 samples is withheld and the remaining 76 used to train a
gene-expression based model (determine predictor genes) and predict the class
of the withheld sample.
 For the 58 DLBCL patients, a similar supervised learning approach was used to
determine predictors for outcome.
 Found 13 predictors that divided patients into two groups: those predicted to be
cured and those predicted to have fatal/refractory disease. Patients predicted to be
cured did have significantly improved survival.
 The main concept here is the use of supervised learning. This is a pretty classic
example. As stated in the definition of supervised learning, any kind of
classification study is supervised.
 They did attempt to validate their results against an independent data set (other
published microarray data for DLBCL patients). But, the two experiments did not
have all the same genes on the microarray, and therefore, their method requires
additional validation to see if it will be applicable for other samples.
Cowles CR, Hirschhorn JN, Altshuler D, Lander ES. Detection of regulatory variation in
mouse genes. Nat Genet. 2002 Nov;32(3):432-7. Epub 2002 Oct 15. PMID: 12410233
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12410233&dopt=Abstract
 Look at possible effects of variations (eg. SNPs) within cis-acting regulatory
elements on gene expression.
o A difficult challenge because it is often difficult even to recognize the
regulatory regions of most genes, which can be thousands of bases away
from the transcriptional unit.
o Also difficult to predict which nucleotide changes in regulatory elements
might have an effect on expression.
o Impossible to tell if differences in expression between individuals are due
to cis-acting regulatory regions, trans-acting factors, or environmental
factors.
 Compared expression of alleles from two mouse strains (A and B) in an F1 hybrid
mouse (A X B) to control for trans effects and environmental influences (ie.
assume the same trans factors and environmental influences in littermates).
 Distinguished between transcripts of A and B using another SNP marker (not the
regulatory variant).
 Compared the levels of the two alleles in genomic DNA and mRNA using PCR
and RT-PCR and a DNA sequence detector. Considered anything greater than a
60:40 ratio to be a significant difference in expression between the two alleles.
 This method identifies transcripts whose expression is affected by variation in
regulatory regions but cannot pinpoint for sure which variation is responsible.
 Suggest there are probably many genes whose expression is affected by variations
in regulatory elements.
Sachidanandam R, et al. A map of human genome sequence variation containing 1.42
million single nucleotide polymorphisms. Nature. 2001 Feb 15;409(6822):928-33.
PMID: 11237013
 Describe a map of 1.42 million SNPs distributed throughout the human genome
(~1 SNP / 1.9 kb).
 In the human population, most variant sites are rare, but the small number of
common variants explain the bulk of heterozygosity.
 It should therefore be possible to define common haplotypes using a dense set of
polymorphic markers, and to evaluate each haplotype for association with disease.
 Identified SNPs using sequences from the human genome sequencing project and
algorithms like Polybayes (see Bayesian statistics above).
Bosma PJ, Chowdhury JR, Bakker C, Gantla S, de Boer A, Oostra BA, Lindhout D,
Tytgat GN, Jansen PL, Oude Elferink RP, et al. The genetic basis of the reduced
expression of bilirubin UDP-glucuronosyltransferase 1 in Gilbert's syndrome. N Engl J
Med. 1995 Nov 2;333(18):1171-5. PMID: 7565971
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=7565971&dopt=Abstract
 Proof of principle paper that changes in regulatory elements can change
phenotype leading to disease.
 Sequenced the coding and promoter regions of the gene for bilirubin
UDP-glucuronosyltransferase 1 (enzyme responsible for bilirubin glucuronidation) in
10 unrelated patients with Gilbert’s syndrome, 16 members of a kindred with a
history of Crigler-Najjar syndrome, and 55 normal subjects.
 People with Gilbert’s have mild, chronic unconjugated hyperbilirubinemia
(jaundice).
 Found that Gilbert’s patients carried a mutation for an extra TA in the TATAA
element of the 5’ promoter region of the gene.
 The presence of the longer TATAA element resulted in reduced expression and
was proposed as a necessary factor for Gilbert’s syndrome.
Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC
Bioinformatics. 2003 Sep 11;4(1):41. PMID: 12969510
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12969510&dopt=Abstract
 Describe COG database and addition of eukaryotic version of COGs (KOGs).
 Clusters of Orthologous Groups of proteins
 A way of assigning function to proteins for newly sequenced genomes based on
knowledge of orthologous genes in other species.
 A COG is a 3-way reciprocal best match using a protein sequence comparison
(blastp?)
 What is the problem with this method of determining orthologs?
o Will lose proteins that have a best reciprocal match with one of the other
members of the COG but not the third.
o Assignment to a COG could be based on shared domain despite being a
drastically different protein.
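The 3-way reciprocal best match, and the first problem listed above, can be sketched with toy best-hit tables. The genome names, protein names and table layout are hypothetical, and the real COG procedure involves more steps than this.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def three_way_groups(best):
    """Triangles of reciprocal best hits across genomes X, Y and Z.
    `best[(P, Q)]` maps each protein in genome P to its best hit in Q."""
    xy = reciprocal_best_hits(best[("X", "Y")], best[("Y", "X")])
    yz = reciprocal_best_hits(best[("Y", "Z")], best[("Z", "Y")])
    xz = reciprocal_best_hits(best[("X", "Z")], best[("Z", "X")])
    return {(x, y, z) for x, y in xy for y2, z in yz if y == y2 and (x, z) in xz}


# Hypothetical best-hit tables (eg. parsed from all-against-all blastp output).
best = {
    ("X", "Y"): {"x1": "y1", "x2": "y2"},
    ("Y", "X"): {"y1": "x1", "y2": "x2"},
    ("Y", "Z"): {"y1": "z1", "y2": "z2"},
    ("Z", "Y"): {"z1": "y1", "z2": "y2"},
    ("X", "Z"): {"x1": "z1", "x2": "z9"},  # x2's best hit in Z is not z2
    ("Z", "X"): {"z1": "x1", "z2": "x2"},
}
print(three_way_groups(best))  # {('x1', 'y1', 'z1')}
```

The x2, y2, z2 family is dropped because one of its three pairwise matches is not reciprocal, even though the other two are.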

Web Resources:
Of those discussed in class, the following web resources are most likely to be referred to
in the examination:
Gene Info
UCSC Genome Browser: genome.ucsc.edu
GeneCards: http://bioinfo.weizmann.ac.il/cards/
SNP Databases
HGV Base: http://hgvbase.cgb.ki.se/
Genetics
OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
GeneTests: http://www.geneclinics.org/
Sample Questions
1. Given the specific matrix model for the binding sites of a TF (see below),
assign a score to the sample sequence. Will predictions (high-scoring
sequences) of potential binding sites be limited primarily by poor sensitivity
or specificity? Why might predicted binding sites generated with this model
occur more frequently near promoters (transcription start sites) in genes?
MATRIX (rows are bases, columns are motif positions 1 to 6)
A   0 0 0 0 0 9
C   7 9 3 8 0 0
G   2 0 5 1 8 0
T   0 0 1 0 1 0
Sample Sequence:
5’TGCCCG3’
Answer:
Given the specific matrix model for the binding sites of a TF (see above),
assign a score to the sample sequence.
Highest scoring possible sequence would be: 5’CCGCGA3’
Score = 7 + 9 + 5 + 8 + 8 + 9 = 46
Sample Sequence:
5’TGCCCG3’
Score = 0 + 0 + 3 + 8 + 0 + 0 = 11
Reverse Complement:
5’CGGGCA3’
Score = 7 + 0 + 5 + 1 + 8 + 9 = 30
Will predictions (high-scoring sequences) of potential binding sites be limited
primarily by poor sensitivity or specificity?
 Sensitivity here refers to the ability of the model to minimize false-negatives
relative to true-positives. There will be poor sensitivity if a lot of actual binding
sites are missed.
 Specificity here refers to the ability of the model to minimize false-positives
relative to true-negatives. There will be poor specificity if a lot of predicted
sites are not actual binding sites.
 It is likely that there will be both false-positives and false-negatives and
thus less than perfect sensitivity and specificity. But specificity will
likely be worse, because there will be a lot more false-positives
than false-negatives. For a motif of 6 bases, the probability of any one
sequence is 1/4^6 (1 in 4,096). For a genome of 3 billion bases, this means you
can expect any 6 base sequence roughly 730,000 times. A binding motif
probably represents several sequences that would be considered high
scoring. Therefore, we can expect several million high-scoring motifs
throughout the genome by chance. Only a tiny fraction of these will be
true transcription factor binding sites. This is terrible specificity.
 The nature of the matrix will have an effect. Matrices that are more
specific (ie. specify fewer potential high-scoring sequences) will have less
sensitivity (might miss a binding motif that is just a little too different
from matrix) but improved specificity (fewer false-positives). Conversely,
if the matrix is too general (specifies numerous potential high-scoring
sequences) the sensitivity will improve at the expense of specificity.
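The trade-off can be made concrete by counting how many 6-mers clear a given score threshold. The matrix is the assumed six-column reading of the question's matrix, and the thresholds and the 3-billion-bp genome are illustrative choices.

```python
from itertools import product

# Assumed six-column reading of the question's matrix (positions 1 to 6).
MATRIX = {
    "A": [0, 0, 0, 0, 0, 9],
    "C": [7, 9, 3, 8, 0, 0],
    "G": [2, 0, 5, 1, 8, 0],
    "T": [0, 0, 1, 0, 1, 0],
}


def score(seq):
    return sum(MATRIX[base][pos] for pos, base in enumerate(seq))


def sequences_at_or_above(threshold):
    """How many of the 4**6 possible 6-mers score at least `threshold`."""
    return sum(1 for kmer in product("ACGT", repeat=6) if score(kmer) >= threshold)


genome = 3_000_000_000  # assumed genome size in bp
for threshold in (46, 40, 30):
    n = sequences_at_or_above(threshold)
    expected = genome * n / 4 ** 6  # chance hits, one strand, uniform bases
    print(threshold, n, round(expected))
```

Lowering the threshold accepts more sequences (better sensitivity) but multiplies the number of chance hits expected across the genome (worse specificity).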
2. Dr. Zany reports that his advanced neural network algorithm is able to
predict the birthdays of all cancer patients based on microarray expression
profiles of tumor samples. Fifty tumor samples were analyzed and a full
probabilistic Bayesian model was created. The model is based on only 15
genes from the 15,000 represented on the array. The performance (100%
success) was assessed using leave-one-out cross-validation. Dr. Zany claims
that the model is meaningful because the predictions are independent of
tumor type. Why might one be concerned about this system? Explain your
rationale.
Answer:
 The problem with this method is that the model is highly overtrained (overfit).
 Dr. Zany has applied a huge number of variables (15,000 genes) to a small
number of samples (50). Out of 15,000 genes, there are bound to be some
that correlate well with any other variable (e.g. birthday) by chance.
 The fact that the predictions are independent of tumour type seems
meaningless to me.
 The leave-one-out cross-validation step would give him an estimate of the
error within his sample, but it does not give much confidence that the predictions
will be accurate outside his sample, particularly if the 15 genes were chosen
using all 50 samples before the cross-validation was run.
 The method must be validated against an independent data set to see if the
15 genes are truly predictive of birthday.
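The point about chance correlations can be illustrated with a small simulation. The sketch below is not Dr. Zany's method: it generates purely random "expression" values and a random target (standing in for birthday), then reports the strongest correlations found. All names (`pearson`, `strongest_chance_correlations`) are hypothetical, and the default demo uses 2,000 genes rather than 15,000 only to keep the run short.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_chance_correlations(n_genes, n_samples, top_k=15, seed=0):
    """Return the top_k largest |r| between random genes and a random target."""
    rng = random.Random(seed)
    # Random target with no relationship to any gene (e.g. a scaled birthday).
    target = [rng.random() for _ in range(n_samples)]
    scores = []
    for _ in range(n_genes):
        # Random "expression profile" for one gene across all samples.
        expression = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        scores.append(abs(pearson(expression, target)))
    return sorted(scores, reverse=True)[:top_k]


if __name__ == "__main__":
    top = strongest_chance_correlations(n_genes=2000, n_samples=50)
    print("Strongest chance |r| values:", [round(r, 2) for r in top])
```

Because the data are random, any correlation reported here is noise. Picking the best 15 genes and then testing them on the same 50 samples would still look impressive, which is why an independent data set is needed.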
3. John Smith has recently been informed that his child has a rare genetic
disorder (Shucks Syndrome) that results in the inability to yawn. Using
Google, he has identified the following resources that might contain
information that will help him understand the syndrome. He is a molecular
biologist and wishes to understand the mechanisms of the disease and
whether his next child is likely to suffer from the same problems. He will be
visiting a genetic counselor next week, but he wants to enter the conversation
prepared. Explain the intended uses of each of the following web resources:
OMIM, hgvBASE, UCSC Genome Browser and GeneTests.
OMIM – Online Mendelian Inheritance in Man.
(http://www.ncbi.nlm.nih.gov/omim/)
 This database is a catalog of human genes and genetic disorders.
 For a given disorder, it will provide: DESCRIPTION, CLINICAL
FEATURES, INHERITANCE, CYTOGENETICS, MAPPING,
MOLECULAR GENETICS, HETEROGENEITY, PATHOGENESIS,
DIAGNOSIS, CLINICAL MANAGEMENT, POPULATION
GENETICS, EVOLUTION, GENETIC VARIABILITY, ANIMAL
MODEL, HISTORY, REFERENCES, CLINICAL SYNOPSIS.
 Very useful for learning about a genetic disease.
hgvBASE (http://hgvbase.cgb.ki.se/)
 The objective of HGVbase (the Human Genome Variation Database) is to
provide an accurate, high utility and ultimately fully comprehensive
catalog of normal human gene and genome variation, useful as a research
tool to help define the genetic component of human phenotypic variation.
All records are highly curated and annotated, ensuring maximal utility and
data accuracy.
 Can search by sequence, genome position, gene name/ID, variation ID, or
keyword (e.g. cystic fibrosis).
 Useful if you want to know about all known variations for your gene of
interest.
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
 The UCSC Genome Browser provides a rapid and reliable display of any
requested portion of genomes at any scale, together with dozens of aligned
annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG
islands, assembly gaps and coverage, chromosomal bands, mouse
homologies, and more).
 The user can look at a whole chromosome to get a feel for gene density,
open a specific cytogenetic band to see a positionally mapped disease gene
candidate, or zoom in to a particular gene to view its spliced ESTs and
possible alternative splicing. The Genome Browser itself does not draw
conclusions; rather, it collates all relevant information in one location,
leaving the exploration and interpretation to the user.
 The Genome Browser supports text and sequence based searches that
provide quick, precise access to any region of specific interest. Secondary
links from individual entries within annotation tracks lead to sequence
details and supplementary off-site databases.
 Clicking on an individual item within a track opens a details page
containing a summary of properties and links to off-site repositories such
as PubMed, GenBank, LocusLink, and OMIM.
 This would not be very useful for someone trying to understand a disease
for a practical purpose like visiting the doctor.
GeneTests (http://www.genetests.org/)
 Provides current, authoritative information on genetic testing and its use in
diagnosis, management, and genetic counseling. GeneTests promotes the
appropriate use of genetic services in patient care and personal decision
making.
 GeneTests is a medical genetics information resource for physicians and
other healthcare providers. The site comprises:
o GeneReviews: Expert-authored, peer-reviewed, disease-specific
Reviews describing the application of genetic testing to the
diagnosis, management, and genetic counseling of patients and
families with hereditary disorders
o Laboratory Directory: A database of US and international
laboratories performing genetic testing
o Clinic Directory: A database of US and international clinics
providing genetic counseling
o Educational Materials: Basic information on the use of genetic
services, teaching materials for genetics professionals, and an
illustrated glossary of terms used in the GeneReviews
 Can search by disease, gene, locus, product, feature, OMIM, author, titles,
or text.
 Searching for information about a disease will give you information about:
Diagnosis, Clinical Description, Differential Diagnosis, Management,
Genetic Counseling, and Molecular Genetics.
 An excellent source of information for someone who wants to prepare for
a meeting with a genetic counselor, since the counselor might check the
very same site.
GeneCards (http://bioinformatics.weizmann.ac.il/cards/)
 Integrates a subset of the information stored in major data sources dealing
with human genes and their products (with a major focus on medical
aspects).
 A search for your disease of interest will return any known genes
associated with it. If you know which gene you are interested in,
GeneCards will provide information, or links to information, on: Aliases and Additional
Descriptions, Chromosomal Location, Proteins, Protein Domains/
Families/Ontologies, Sequences, Expression in Human Tissues, Similar
Genes in Other Organisms, Related Human Genes, SNPs/Variants,
Disorders & Mutations, Medical News, Research Articles, and links to
many other databases (e.g. LocusLink, Ensembl, GeneLynx, etc.).
 Useful to both researchers and laymen if you know which gene you are
interested in.
Which are most relevant to Mr. Smith? Why?
 All of the above sources could potentially have some useful information
for Mr. Smith.
 However, OMIM and GeneTests would be the most useful to him. These
are the most disease-specific and contain the most general information
relevant to someone dealing with the actual disease.
o OMIM would give him information about the genetics of the
disease and how it is inherited. This will help him to determine the
chances of his next offspring inheriting the disease.
o GeneTests will give him an idea of what kind of things the
counselor will tell him and give him a chance to look up any
terminology he may not be familiar with.