Structure and function of nucleic acids.

DNA structure. History:
• 1868 Miescher – discovered nuclein.
• 1944 Avery – experimental evidence that DNA is a constituent of genes.
• 1953 Watson & Crick – double-helical nature of DNA.
• 1980 – X-ray structure of more than a full turn of B-DNA.

Five types of bases.

Nucleotides and the phosphodiester bond.

Complementarity of bases – the basis for the double-stranded helical structure.

Double helical structure of DNA.
A- and B-DNA – right-handed helix; Z-DNA – left-handed helix.
B-DNA – fully hydrated DNA in vivo, 10 base pairs per turn of helix.
Sugar-phosphate backbones form ridges on the edges of the helix.
Copyright © Ramaswamy H. Sarma 1996

Hydration of B-DNA. From R. Dickerson, Structure & Expression.

Differences between DNA & RNA:
• T is replaced by U
• Extra –OH group at the 2′ position of the pentose sugar
• Sugar is ribose, not deoxyribose

RNA as a structural molecule (rRNA), information transfer molecule (mRNA), and information decoding molecule (tRNA).

Classwork I.
1. Go to http://ndbserver.rutgers.edu/.
2. Select Crystal structure of B-DNA, resolution >= 2 Angstroms.
3. Select Crystal structure of single-stranded RNA with mismatch base pairing with resolution >= 2 Angstroms.

RNA secondary structure prediction.
Assumptions used in predictions:
- The most likely structure is the most stable one.
- The energy associated with a given position depends only on the local sequence/structure.
- The structure is formed without knots.

Minimum energy method of RNA secondary structure prediction.
• Self-complementary regions can be found in a dot matrix.
• The energy of each structure is estimated by the nearest-neighbor rule.
• The most energetically favorable conformations are predicted by a method similar to dynamic programming.

Classwork II: Predict secondary structure for RNA “ACGUGCGU”.
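
The dynamic-programming idea behind the prediction can be illustrated with a simpler scoring scheme than free energy. The sketch below is not the minimum energy method itself: it swaps in Nussinov-style base-pair maximization, where every allowed pair scores 1 and the structure must be knot-free. The names fold, PAIRS and MIN_LOOP, the inclusion of G-U pairs, and the minimum hairpin loop of 3 bases are assumptions made for illustration.

```python
# Simplified stand-in for energy minimization: maximize the number of
# nested (knot-free) base pairs with dynamic programming.
# Assumptions: Watson-Crick and G-U pairs are allowed, every pair scores 1,
# and a hairpin loop must contain at least MIN_LOOP unpaired bases.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3


def fold(seq, min_loop=MIN_LOOP):
    """Return (pair_count, dot_bracket) for one optimal nested structure."""
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score

    # Trace back one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to hold a pair
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                right = best[k + 1][j] if k + 1 <= j else 0
                if best[i][j] == 1 + best[i + 1][k - 1] + right:
                    structure[i], structure[k] = "(", ")"
                    stack.append((i + 1, k - 1))
                    stack.append((k + 1, j))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


print(fold("ACGUGCGU"))  # (2, '((....))')
```

Under these assumptions the classwork sequence folds into a two-pair stem (A1-U8, C2-G7) closing a four-base loop. Replacing the unit pair score with nearest-neighbor stacking energies and loop penalties would turn the same recursion into an energy-based prediction.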
Stacking energies for base pairs (kcal/mol)

        A/U    C/G    G/C    U/A    G/U    U/G
A/U    -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G    -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C    -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A    -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U    -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G    -1.0   -1.9   -2.1   -1.1   -1.5   -0.4

Destabilizing energies for loops (kcal/mol)

Number of bases     1      5     10     20     30
Internal            -    5.3    6.6    7.0    7.4
Bulge             3.9    4.8    5.5    6.3    6.7
Hairpin             -    4.4    5.3    6.1    6.5

Prediction of most probable structure.
Probability of forming a base pair: $P \propto \exp(-\Delta G / kT)$
For a double-stranded structure, the probability is the product of the Boltzmann factors for each of the stacked base pairs.

Sequence covariation method.
Some positions from different species can covary because they are involved in pairing.
$f_m(B_1)$ – frequencies in column m; $f_n(B_2)$ – frequencies in column n; $f_{m,n}(B_1, B_2)$ – joint frequencies of two nucleotides in two columns.
$f_{m,n}(B_1, B_2) / (f_m(B_1) f_n(B_2))$

Seq 1   ---G------C---
Seq 2   ---C------G---
Seq 3   ---A------C---
Seq 4   ---A------T---

Ribozymes.
• RNAs of self-splicing group I introns; they contain 4 sequence elements and form specific secondary structures
• RNAs of self-splicing group II introns
• RNAs from viral and plant satellite RNAs
• Ribosomal RNAs

Gene prediction.
Gene – DNA sequence encoding protein, rRNA, tRNA (snRNA, snoRNA)…
The gene concept is complicated:
- Introns/exons
- Alternative splicing
- Genes-in-genes
- Multisubunit proteins

Gene identification
• Homology-based gene prediction
  – Similarity searches (e.g. BLAST, BLAT)
  – Genome browsers
  – RNA evidence (ESTs)
• Ab initio gene prediction
  – Prokaryotes
    • ORF identification
  – Eukaryotes
    • Promoter prediction
    • PolyA-signal prediction
    • Splice site, start/stop-codon predictions

Prokaryotic genes – searching for ORFs.
- Small genomes have high gene density (Haemophilus influenzae – 85% genic)
- No introns
- Operons: one transcript, many genes
- Open reading frame (ORF) – a contiguous set of codons that starts with a Met codon and ends with a stop codon.

Prediction of eukaryotic genes.

Ab initio gene prediction.
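
In its simplest (prokaryotic) form, ab initio prediction is the ORF scan defined above: find a Met codon and read in frame until a stop codon. The following is a minimal sketch of that scan, assuming the standard genetic code, ATG as the only start codon, and the forward strand only. The function name find_orfs and the min_codons threshold are hypothetical and chosen for illustration.

```python
# Minimal forward-strand ORF scan.
# Assumptions: standard stop codons, ATG as the only start codon,
# no reverse-complement strand, no handling of ambiguous bases.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) pairs, end exclusive, stop codon included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


print(find_orfs("ATGAAATAG"))         # [(0, 9)]
print(find_orfs("CCATGAAATGATAGCC"))  # [(2, 11)]
```

A real gene finder would also scan the reverse complement and apply a much larger length threshold, since short ORFs occur frequently by chance.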
Predictions are based on the observation that gene DNA sequence is not random:
- Each species has a characteristic pattern of synonymous codon usage.
- Every third base tends to be the same.
- Non-coding ORFs are very short.
Programs: GeneMark (HMMs), GenScan, Grail II (neural networks) and GeneParser (dynamic programming).

Gene preference score – an important indicator of a coding region.
Observation: the occurrence of codon pairs in coding regions is not random.
The probability of an exon starting at base 1:
$P = a_1 / (a + C n_1)$
$a_1$ – the score for an exon starting at base 1; $a$ – the sum of all scores for base 1, base 2 and base 3; $n_1$ – the score for a noncoding region starting at base 1; $C$ – the ratio of coding to noncoding bases in the organism.

Confirming gene location using EST libraries.
• Expressed Sequence Tags (ESTs) – sequenced short segments of cDNA. They are organized in the database “UniGene”.
• If a region matches ESTs with high statistical significance, then it is a gene or pseudogene.

Gene prediction accuracy.
Factors which influence the accuracy:
- the genetic code of a given genome may differ from the universal code
- one tissue can splice an mRNA differently from another
- mRNA can be edited

True positives (TP) – nucleotides which are correctly predicted to be within the gene.
Actual positives (AP) – nucleotides which are located within the actual gene.
Predicted positives (PP) – nucleotides which are predicted to be in the gene.
Sensitivity = TP / AP
Specificity = TP / PP

GenScan website.

Common difficulties
• First and last exons are difficult to annotate because they contain UTRs.
• Smaller genes are not statistically significant, so they are thrown out.
• Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.

GenBank – an annotated collection of all publicly available DNA sequences.

Gene prediction: classwork III.
• Go to http://www.ncbi.nlm.nih.gov/mapview/ and view all hemoglobin genes of H. sapiens
• Find 6 hemoglobin genes on chromosome 11, view the DNA sequence of this chromosome region
• Submit this sequence to the GenScan server at http://genes.mit.edu/GENSCAN.html

Genome analysis.
Genome – the sum of genes and intergenic sequences of a haploid cell.

The value of genome sequences lies in their annotation
• Annotation – characterizing genomic features using computational and experimental methods
• Genes: four levels of annotation
  – Gene prediction – where are the genes?
  – What do they look like?
  – What do they encode?
  – What proteins/pathways are they involved in?
Koonin & Galperin

Accuracy of genome annotation.
• In most genomes, functional predictions have been made for the majority of genes (54-79%).
• The sources of errors in annotation:
  - overprediction (hits which are statistically significant in the database search are not checked)
  - multidomain proteins (similarity is found to only one domain, but the annotation is extended to the whole protein).
The error of genome annotation can be as large as 25%.

Sample genomes

Species           Size       Genes    Genes/Mb
H. sapiens        3,200 Mb   35,000   11
D. melanogaster   137 Mb     13,338   97
C. elegans        85.5 Mb    18,266   214
A. thaliana       115 Mb     25,800   224
S. cerevisiae     15 Mb      6,144    410
E. coli           4.6 Mb     4,300    934

List of 68 eukaryotes, 141 bacteria, and 17 archaea at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html

So much DNA – so “few” genes … (figure labels: Genic, Intergenic)

Human Genome project.

Comparative genomics – comparison of gene number, gene content and gene location in genomes. Campbell & Heyer “Genomics”

Analysis of gene order (synteny).
Genes with a related function are frequently clustered on the chromosome.
Ex: E. coli genes responsible for the synthesis of Trp are clustered, and their order is conserved between different bacterial species.
Operon: a set of genes transcribed simultaneously with the same direction of transcription.
Koonin & Galperin “Sequence, Evolution, Function”
• The order of genes is not very well conserved if the % identity between prokaryotic genomes is < 50%.
• The gene neighborhood can be conserved so that all neighboring genes belong to the same functional class.
• Functional prediction can be based on gene neighborhood.

COGs – Clusters of Orthologous Genes.
Orthologs – genes in different species that evolved from a common ancestral gene by speciation.
Paralogs – genes related by duplication within a genome.
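
If orthologs are given the same label in each genome, the gene-neighborhood comparison described above reduces to finding pairs of genes that are adjacent in both genomes. The sketch below shows this under strong simplifying assumptions: each genome is a single linear list of ortholog labels, strand and distance are ignored, and the second gene order is invented toy data. The function names adjacent_pairs and conserved_neighbors are hypothetical.

```python
# Toy gene-neighborhood comparison.
# Assumptions: each genome is one linear list of ortholog labels;
# strand, intergenic distance and circular chromosomes are ignored.


def adjacent_pairs(gene_order):
    """Unordered pairs of genes that sit next to each other."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(order_a, order_b):
    """Gene pairs that are neighbors in both genomes."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Trp biosynthesis genes as the illustration; genome_b is invented,
# with unrelated genes x1 and x2 inserted.
genome_a = ["trpE", "trpD", "trpC", "trpB", "trpA"]
genome_b = ["x1", "trpE", "trpD", "x2", "trpB", "trpA"]

for pair in sorted(sorted(p) for p in conserved_neighbors(genome_a, genome_b)):
    print(pair)
```

With this toy input the pairs trpD/trpE and trpA/trpB are reported as conserved neighbors, while trpC has no conserved neighbor because it is absent from the second list.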