Download Clustering for Accuracy, Performance, and Alternative

Genetics and Molecular Biology Tutorial II -- Computational Perspective

The goal is to introduce some topics to individuals with a minimal background in genetics/biology, while providing examples that maintain the interest of individuals with extensive biological/genetics backgrounds.

Outline
- Gene structure
  – genomic structure vs mRNA structure
  – coding and noncoding exons
  – introns
  – primary transcript processing (aside -- nonsense mediated mRNA degradation)
  – alternative splicing and differential polyadenylation
  – evolutionary conservation of coding and noncoding sequences

Outline…
- Genomic structure
  – repetitive sequences (LINEs and SINEs)
  – example -- Y chromosome palindromes
  – C value paradox
  – genomes of model organisms (example: yeast genome and gene-chip)
  – single/double knockouts
  – cross-species sequence similarities for putative function identification (example: "chaperonin")

Fundamental Genetics and Probability Concepts
- meiosis and sampling
- patterns of inheritance
- monogenic and complex inheritance
  – phenocopy
  – reduced penetrance
- DNA variation
  – polymorphisms, SNPs, and mutations
- positional cloning

Gene Structure [figure]

Transcript Processing
- DNA -> pre-mRNA -> mRNA -> protein

Nonsense mediated mRNA degradation
  – unknown mechanism
  – more rapidly degrades mRNA containing premature stop (nonsense) codons
  – Lykke-Andersen, "mRNA quality control: Marking the message for life or death." Current Biology, 11, 2001.
Nonsense Mediated mRNA Degradation [figure]

Genome Structure -- repeat classes

Class (blocks)                     Size of repeat   Chr locations
Megasatellite (100s of kb)         several kb       various locations
  RS447                            4.7 kb           ~50-70 copies on 4, several on 8
  untitled                         2.5 kb           ~400 copies on 4 and 19
  untitled                         3.0 kb           ~50 copies on X
Satellite (100 kb to Mbs)          5-171 bp         centromeric
  alphoid                          171 bp           centromeric hetero, all chrs
  Sau3A family                     68 bp            centromeric hetero, 1 9 13 14 15 21 22 6
  satellite 1 (AT rich)            25-48 bp         centromeric hetero, most chrs
  satellites 2 and 3               5 bp             most chrs
Minisatellite (0.1-20 kb)          6-64 bp          at or close to telomeres
  telomeric family                 6 bp             all telomeres
  hypervariable family             9-64 bp          all chrs, often near telomeres
Microsatellite (<150 bp)           1-4 bp           dispersed through all chromosomes

C-Value Paradox
Hartl, "Molecular melodies in high and low C," Nat. Rev. Genetics, Nov 2000.
- refers to the massive, counterintuitive and seemingly arbitrary differences in genome size observed in eukaryotic organisms
  – Drosophila melanogaster 180 Mb
  – Podisma pedestris 18,000 Mb
  – difference is difficult to explain in view of apparently similar levels of evolutionary, developmental, and behavioral complexity

Alternative Splicing
Every conceivable pattern of alternative splicing is found in nature. Exons can have multiple 5' or 3' splice sites that are used alternatively (a, b). Single cassette exons can reside between 2 constitutive exons such that the alternative exon is either included or skipped (c). Multiple cassette exons can reside between 2 constitutive exons such that the splicing machinery must choose between them (d). Finally, introns can be retained in the mRNA and become translated.
Graveley, "Alternative splicing: increasing diversity in the proteomic world." Trends in Genetics, Feb., 2001.

Classic View of Gene No Longer Valid -- Strachan pg 185

Mechanism                        Frequency        Examples
multigenic transcription units   rare             18S, 28S, and 5.8S rRNA; mitochondria
alternative promoters            common           dystrophin gene (8)
alternative splicing             very frequent    slo gene (8 cassettes), >500 mRNAs
alternative polyadenylation      common           calcitonin gene (2)
RNA editing                      extremely rare   apolipoprotein B gene (tissue specific editing, codon changed)
post-translational cleavage      rare             may generate functionally related polypeptides (hormones); insulin

Alternative Splicing Example -- Graveley 2001 [figure]

Alternative Polyadenylation
- common in human RNA (Edwards-Gilbert 1997)
- in many genes, 2 or more poly-A signals in 3' UTR
  – alternative transcripts can show tissue specificity
- alternative poly-A signals may be brought into play following alternative splicing
Edwards-Gilbert. Nucleic Acids Res, 13, 1997.

Evolution of the mitochondrial genome and origin of eukaryotic cells [figure]

Evolutionary Conservation of Coding and Noncoding Sequences
- Sequencing of H. sapiens and model organisms is the basis for comparative genomics
- Generally, functional solutions (encoded as genes) are shared across organisms, which allows us to compare gene sequences and infer function
- protein functional/structural region == "domains"
- Intergenic regions are generally not conserved (always exceptions)

Example - MKKS (UniGene Clusters)
  human / rat     87.4 %
  human / mouse   84.9 %
  human / cow     87.1 %
  mouse / rat     97.8 %
  rat / cow       91.0 %
  mouse / cow     85.1 %
  frog / rat      62.5 %

Example - MKKS [figure]

Computational Approach to Using Conserved Regions
- Problem -- want to screen genes for mutations
- Conventional approach -- screen all exons of a single gene
- Alternative -- identify domains within multiple genes, and screen domains first, to optimize screening time and resources

Cross-Species Similarities
- yeast
  – gene chip for hybridization/expression
  – complete genome (first eukaryote)
  – single knockouts and double knockouts

Fundamental Genetics
- meiosis
  – H. sapiens (Hs) are diploid
  – meiosis produces haploid gametes
  – mechanism for transmission of genetic material to offspring
  – recombination by cross-over (Holliday structure) or by independent segregation of homologous pairs

Fundamental Genetics (Background for Linkage Analysis)
- Rule of Segregation
  – offspring receive ONE allele (genetic material) from the pair of alleles possessed by EACH parent
- Rule of Independent Assortment
  – alleles of one gene can segregate independently of alleles of other genes
  – (Linkage Analysis relies on the violation of the Independent Assortment Rule)

Genetic Marker … Prelude to LA
  – A genetic marker allows for the observation of the genetic state at a particular genomic location (locus). A genotype is the measured state of a genetic marker.
  – May never be feasible to sequence cases directly.
  – An "informative" marker is often "heterozygous," or "polymorphic," and enables the observation of the inheritance of genetic material.

Monogenic and Polygenic Diseases
  – monogenic (Mendelian) -- one gene
    "simple" (dominant and recessive) Mendelian inheritance
    direct correspondence between one gene mutation and one disorder
    majority of disease genes found are monogenic
  – polygenic (complex) -- multiple genes
    heterogeneity and epistasis
    combinatorics
    no longer have direct correspondence between one gene and disorder
    majority of disorders are probably polygenic
  – complexity of organisms and observed pathways

...Monogenic and Polygenic Diseases
- phenocopy
- reduced penetrance
  – Example -- sickle cell anemia
    "classic" recessive disorder
    defect in red blood cells (hemoglobin)
    but… infant hemoglobin gene can "leak"
    wide range of phenotypes

Examples [figure]

Examples [figure]

Example [figure]

BBS4 Pedigree [figure]

Hardy-Weinberg Equilibrium
- Rule that relates allelic and genotypic frequencies in a population of diploid, sexually reproducing individuals if that population has random mating, large size, no mutation or migration, and no selection
- Assumptions
  – allelic frequencies will not change in a population from one generation to the next
  – genotypic frequencies are determined in a predictable way by allelic frequencies
  – the equilibrium is neutral -- if perturbed, it will reestablish within one generation of random mating at the new allelic frequency

H-W
- two alleles (frequencies p, q): f(AA) = p^2, f(Aa) = 2pq, f(aa) = q^2, the terms of (p + q)^2
- three alleles (frequencies p, q, r): p^2 + q^2 + r^2 + 2pq + 2pr + 2qr = (p + q + r)^2

Dominant and Recessive Penetrance
Modeled penetrance = P(pt | gt)

Model                           DD    Dd    dd
dominant                        1     1     0
recessive                       0     0     1
dominant, reduced penetrance    0.9   0.9   0.0
recessive, reduced penetrance   0     0     0.8

D-R Heterogeneous, DD Epistatic

D-R heterogeneous:
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    1    1    1

DD epistatic:
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    0    0    0

- reduced penetrance
- 3, 9, 27, 81, 243 … 3^n

Dom-Rec Heterozygous
- Screen genes A, B?, b

Uninformative Marker [figure]

Informative Marker [figure]

- Given the following observations: family structure, affection status, genotypes, and disease allele frequencies. Assuming a model for the disease, can we calculate the probability that these observations "fit" the assumed model?

Linkage [figure]

Linkage Analysis
- Goal: find a marker "linked" to a disease gene.
- LOD score = log of likelihood ratio
- LR[θ; data] == k P[data; θ]
- theta = estimate of genetic distance (recombination fraction) between marker and disease = proportion of recombinant gametes / total gametes

…Linkage Analysis
- Linkage analysis calculates the likelihood that the inheritance pattern of the phenotype (disease) is supported by the observed inheritance patterns (genotypes) in a pedigree.
  – few monogenic models, easy to test
  – more difficult to find models explaining inheritance in polygenic models
  – parameter maximization

Linkage Analysis Programs
- FASTLINK - 2 point
  – O(n^2), where n = number of markers
- GeneHunter - multipoint, 2 point
  – O(n^2), where n = number of people

Allele Sharing
- tries to show that affected family members inherit the same chromosomal regions more often than expected by chance

Allele Sharing Example
Needs at least sibs.
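Returning to the LOD score from the Linkage Analysis slide, the idea can be sketched numerically. This is a minimal sketch under simplifying assumptions that are not in the slides: every meiosis is fully informative and phase-known, so the likelihood is just theta^R * (1 - theta)^(N - R) for R recombinants among N meioses, the log is base 10, and the comparison is against free recombination (theta = 0.5). The function names and the counts in the usage note are hypothetical; real programs such as FASTLINK or GeneHunter work on full pedigree likelihoods.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD at theta: log10 of L(theta) / L(0.5) for phase-known meioses.

    Assumes each meiosis is independently a recombinant with probability theta.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie in [0, meioses]")

    def log_term(count: int, prob: float) -> float:
        # count * log10(prob), with 0 * log10(0) treated as 0
        if count == 0:
            return 0.0
        if prob == 0.0:
            return float("-inf")
        return count * math.log10(prob)

    non_recombinants = meioses - recombinants
    log_l_theta = log_term(recombinants, theta) + log_term(non_recombinants, 1.0 - theta)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(recombinants: int, meioses: int, steps: int = 50) -> tuple[float, float]:
    """Grid search over theta in [0, 0.5]; returns (best theta, LOD at that theta)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    best = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)
```

For example, `max_lod(2, 10)` searches the grid for the recombination fraction that best explains 2 recombinants in 10 meioses; with these made-up counts the maximum falls at the observed proportion of recombinant gametes, which is the "parameter maximization" step mentioned above in its simplest form.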
Association Studies
- "Allelic association studies provide the most powerful method for locating genes of small effect contributing to complex diseases and traits." Daniels, Am J Hum Genet 62:1189-1197, 1998.
- Linkage analysis
  – genome wide screen, 400 markers ~ 10 cM (10 Mb), association needs 4000+ polymorphic markers
  – generally need nuclear family or larger
- Association finds "linkage disequilibrium"

Association Studies
- "Association is simply a statistical statement about the co-occurrence of alleles or phenotypes. Allele A is associated with disease D if people who have D also have A more (or maybe less) often than would be predicted from the individual frequencies of D and A in the population." Pg. 286, Human Molecular Genetics 2, Tom Strachan

Examples
- HLA-DR4 (antigen marker)
  – 36% in UK
  – 78% with rheumatoid arthritis
- CF (RFLP markers XV2.c (X1, X2), KM19 (K1, K2))

  Marker alleles   CF (case)   Normal (control)
  X1, K1           3           49
  X1, K2           147         19
  X2, K1           8           70
  X2, K2           8           25

  – CF associated with X1, K2 in '89 (Strachan)

Linkage Disequilibrium
- linkage equilibrium (aka Hardy-Weinberg) is true if
  – P(gt1,gt1';gt2,gt2') = P(gt1,gt1') * P(gt2,gt2') where [P(haplotype)]
- case vs controls
- TDT (heterozygous marker transmitted), HRR (untransmitted alleles as control)
- allelic associations (outbred populations) maintained at only <= 1 cM

Equilibrium [figure]

"SNPs" Single-Nucleotide Polymorphisms
- 1 every 1000 bp (estimated)
- 2,972,052 SNPs submitted to dbSNP
  – dbSNP summary link
  – 50% of all SNPs are in question
  – 10% of UTRs have SNPs
- 100,000 - 500,000 SNPs needed
- Why don't we do this?
 – $$$ 52 Homozygosity Mapping 53 Positional Cloning 54 Disease Gene Identification SSCP -- single strand conformational polymorphism  PCR -- polymerase chain reaction  – primers amplify template sequence  direct sequencing  BBS2 (Bardet-Biedl Syndrome) 55 BBS2 genetic mapping C16 1 2 3 4 5 6 7 8 9 10 11 12 56 BBS2 genetic mapping affected unaffected C16 1 2 3 4 5 6 7 8 9 10 11 12 57 BBS4 Gene (Direct Sequencing) (Hs.26471) 58 BBS4 Deletion (by PCR) exons 3 4 59 BBS4 Mutations (direct sequencing) (R295P) 60 Summary  Disease Gene Identification – challenges – interval localization  genotyping and genetic markers, linkage analysis, allele sharing, association studies (“SNiPs”), homozygosity mapping – disease gene identification techniques  Take home – A complex disorder (with interacting genes) has yet to be characterized 61 Demo -- installing a database A database organizes data  Most common  – relational database (oracle, sybase) – perceived as a collection of tables, – where table is an unordered collection of rows – each row has a fixed number of fields, and each field can store a predefined type of data value (date, integer, string, etc.)  simplest – flat file 62 Databases NCBI  BLAST  Amazon  Yahoo  Several of our own  – genotypes – rat ESTs – eye clones from differential display – micro-array data 63 This space intentionally left blank 64
