Genome annotation and search for homologs

Genome of the week
• Discuss the diversity and features of selected microbial genomes.
• Link to the paper describing the genome on the MMG433 website.

Bacillus subtilis
• Gram-positive soil bacterium
• Genetically tractable, well-studied
• Developmental pathways (sporulation, genetic competence)
• Industrial and agricultural importance
• 4.2 Mb genome (sequence completed 1997)

B. subtilis genome features
• 4,106 protein-coding genes
• 10 rRNA operons
• Nearly 50% of the genome consists of paralogous genes.
  – 77 ABC transporter binding proteins
• 10 phage-like regions - horizontal transfer. Low GC regions in the genome.
• 18 sigma factors - initiate transcription.
• 34 two-component regulatory systems.

Sequencing of genomes
• Hierarchical or contig-based sequencing
  – Clone smaller segments of the genome.
  – Labor intensive, slow
  – Not needed for sequencing microbial genomes
• Shotgun method
  – Randomly clone and sequence 1.5-2 kb fragments of DNA. 5-10 fold coverage.
  – Computationally intensive.

Sequence assembly
• Focus of this week’s lab exercise
• Algorithms to align and edit multiple sequences
• Phrap and Consed
• Sequencher (commercial) for lab.

Finding functional features in a microbial genome.
• Genes
• rRNA operons, tRNAs - programs available
• Origin of replication - oriC - near dnaA gene
• Promoters
• Transcription terminators
• Horizontally transferred DNA
  – GC content

Gene finding
• Easy relative to eukaryotic genomes
  – No introns
  – 80-90% of DNA encodes genes. 5% in eukaryotes.
• Find open reading frames (ORF scanning).
  – Scan from start codons (mostly ATG, not always) to stop codons. Smallest ORFs - usually 300 nt in length.
  – Additional features. Good Shine-Dalgarno sequence (ribosome binding site). AGGAGG. Not essential.
  – Similarity matches to genes in other genomes.
  – Effective way of searching for ORFs.

Gene finding programs
• Genefinder, Grail, Glimmer (TIGR), etc.
• ORF finder from NCBI
  – Will use in a future lab exercise and in the final annotation project

Annotating genes
• How to assign preliminary functions to genes.
• Automated programs.
• Similarity searches
  – BLAST and PSI-BLAST
  – COGs, Pfam, CDD, other databases
  – Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome.
• Functional characterization (individual genes)
  – Gene knockouts
  – Overexpression
• In most cases computer annotation will only be able to predict function - NOT assign function.
  – The biological function of many genes has not been determined, even in model systems.
  – As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.
• Molecular function - activity of a protein at the molecular level.
  – Examples would be ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate.
• Biological function - cellular role of the protein.
  – Examples would be translation initiation, adapting to environmental changes, glycolysis.

Homologs, orthologs, and paralogs.
• Homologous genes are genes that share a common evolutionary ancestor.
  – Orthologs are genes found in different organisms that arose from a common ancestor.
  – Paralogs are genes found in the same organism that arose from a common ancestor by gene duplication. Duplication could have occurred in the species or earlier.

Using BLAST to predict gene function.
• BLAST the predicted protein sequence against the non-redundant database.
• Determine the best hits.
• Automated annotation programs will often assign the best hit function to the gene being searched.
• Must manually confirm automated annotations.

Assessment of BLAST output
• What is the level of identity and similarity of the best hits?
  – The higher the identity, the more likely the proteins have similar functions.
• Does the area of similarity occur over the entire protein? Or just part of the protein? (fig. 2.19)
  – Often you will find hits to only part of your protein. A GTP-binding domain, for example.
• Have any of the best hits been characterized experimentally?
  – With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.

Databases used in protein function analysis.
• COGs - Clusters of Orthologous Groups - proteins that are best hits against each other when comparing two genomes.
• Pfam - Protein families - more likely to identify conserved domains rather than full-length proteins.
• TIGRfam - strives to find equivalogs - “proteins that are conserved with respect to FUNCTION since their last common ancestor”
• SMART - Simple Modular Architecture Research Tool.
• PROSITE - Protein motifs
• CDD - Conserved domain database - linked to BLAST - Pfam, SMART, COGs.
• InterPro - A database that brings together many of the above databases so that you can search them all at once.

Bottom line on databases
• They are useful tools for assigning possible functions.
• Be careful about annotations.
  – Example: proteins in the same COG can be orthologs that have evolved different functions.
  – Many annotations are not backed up by experimental data.
  – Some databases are automated - have not been checked for accuracy.

Examples: YqeH and DnaA

Protein function
• Molecular function
  – YqeH - GTPase
  – DnaA - ATPase, DNA binding
• Biological function
  – YqeH - Unknown
  – DnaA - DNA replication initiation
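
Sketch: GC content as a hint for horizontally transferred DNA

The slides list GC content as a way to spot horizontally transferred DNA, and note low GC regions in the B. subtilis genome. A minimal Python sketch of that idea follows. The window size, step, and the "drop" cutoff below the genome-wide average are illustrative assumptions, not values from the lecture, and the function names are hypothetical.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int = 1000, step: int = 500) -> List[Tuple[int, float]]:
    """GC fraction of each full window; returns (window start, GC fraction) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def low_gc_windows(seq: str, window: int = 1000, step: int = 500,
                   drop: float = 0.10) -> List[Tuple[int, float]]:
    """Windows whose GC fraction is more than `drop` below the whole-sequence GC.

    `drop` is an arbitrary illustrative cutoff (assumption), not a standard value.
    """
    overall = gc_fraction(seq)
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if gc < overall - drop]


if __name__ == "__main__":
    # Toy sequence: a GC-rich background with an AT-rich island in the middle.
    toy = "GC" * 50 + "AT" * 50 + "GC" * 50
    print(low_gc_windows(toy, window=100, step=100))
```

A flagged window is only a candidate region. Other evidence (phage-like genes, for example) would be needed before calling it horizontally transferred.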
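
Sketch: ORF scanning

The ORF scanning step described under "Gene finding" (scan from a start codon to the next in-frame stop codon, keep ORFs of about 300 nt or more) can be sketched in Python as below. This is a simplified illustration, not how Glimmer or NCBI's ORF finder work internally. It only reports the first start codon in each open stretch, ignores Shine-Dalgarno sequences, and reports minus-strand coordinates relative to the reverse-complemented sequence. The function names and the option to pass extra start codons are assumptions.

```python
from typing import List, Sequence, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300,
              start_codons: Sequence[str] = ("ATG",)) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for ORFs of at least `min_len` nucleotides.

    Returns (strand, start, end, length) tuples. Coordinates are 0-based,
    end-exclusive, and given on the strand that was scanned. The length
    includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first start codon in the current open stretch
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon in start_codons:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy example: one short ORF, so the length cutoff is lowered for the demo.
    toy = "ATG" + "GCT" * 10 + "TAA"
    print(find_orfs(toy, min_len=30))
```

Since start codons are "mostly ATG, not always", alternative starts can be passed in, for example find_orfs(seq, start_codons=("ATG", "GTG", "TTG")).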
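
Sketch: assessing BLAST hits by identity and coverage

The questions under "Assessment of BLAST output" (how high is the identity, and does the similarity cover the entire protein or just part of it?) can be turned into a rough filter. The sketch below assumes BLAST results saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The query length is supplied separately because it is not among those default columns. The thresholds (30% identity, 80% coverage), the function names, and the hit names in the demo are illustrative assumptions only.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str
    pident: float   # percent identity over the aligned region
    qstart: int     # 1-based start of the alignment on the query
    qend: int       # 1-based end of the alignment on the query
    evalue: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse 12-column tabular BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10])))
    return hits


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query protein covered by this alignment."""
    return (abs(hit.qend - hit.qstart) + 1) / query_len


def classify(hit: Hit, query_len: int,
             min_ident: float = 30.0, min_cov: float = 0.8) -> str:
    """Label a hit 'weak', 'partial' (e.g. a single shared domain) or 'full-length'."""
    if hit.pident < min_ident:
        return "weak"
    if query_coverage(hit, query_len) >= min_cov:
        return "full-length"
    return "partial"


if __name__ == "__main__":
    # Hypothetical hits for a 366-residue query; the names are made up.
    rows = [
        ["query1", "hitA", "62.50", "360", "135", "0", "1", "360", "1", "360", "1e-150", "450"],
        ["query1", "hitB", "45.00", "120", "66", "0", "10", "129", "5", "124", "2e-20", "95.0"],
    ]
    text = "\n".join("\t".join(r) for r in rows)
    for hit in parse_tabular(text):
        print(hit.subject, classify(hit, query_len=366))
```

A "full-length" label still only predicts function. Whether any of the best hits has been characterized experimentally has to be checked by hand.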