Bioinformatics Overview
School of B&I, TCD
May 2010

Who, me?
• Andrew Lloyd
• [email protected]
• 087-225-9850, 053-9255717, 01-896-2450
• Director INCBI 1993-2000
• Population genetics, evolution
• Whole genome analysis
• Immunology, chickens, FIRM

Definition/scope
• Storage, retrieval and analysis of biological (sequence) information.
• Insert better definition here
• Case can be made for microarray analysis
• NOT
  – ecoinformatics (ecology)
  – Image analysis
  – Bar-coding hospital sheets

Philosophy
“Nothing worth learning can be taught” Oscar Wilde

Getting bioinformation
• Type it in: A,T,C,C,G,T,C,A (1991)
• Access databases
  – Literature (Pubmed)
  – Medical (OMIM)
  – DNA sequence (EMBL/GenBank)
  – Protein sequence (UniProt, SwissProt, PIR)
  – 3-D structure (PDB)

Annotation
• In any DB, half is data and half context.
  – Gene ontology (language)
  – Parsing sequence (ORF, RBS, Intron, α-helix)
  – Recognising similar sequences (evolution!)
  – Complementary info: DB cross-referencing
• (DNA -> Protein -> 3D structure -> motifs)

Secondary databases
• Protein motifs, domains, families
• RNA structures (16S ribosomal RNA…)
• Taxonomy/classification
• Metabolic pathways (KEGG)
• Enzymes (Brenda, TCD, Ireland)
• SNPs: mutations and variants
• Disease DBs (OMIM)
• Immuno, epitope DBs

Complete genomes
• Ensembl (complex, basically vertebrate)
  – Uniform look-and-feel; cross-refs
• UCSC GoldenPath browser
• Plants
• Bacterial genomes
  – Including mitochondrial, chloroplast
  – Eubacteria vs Archaea vs Eukaryotes

Annotated/known genes
• What does my gene do?
• Blast (fasta) against the DB
• SRS/Entrez to access databases
  – Neighboring (similar things in same DB)
• DB cross-references
  – full picture of attributes
  – What biochemical pathway?

The territory
[Diagram of linked databases: OMIM; Maps & Genomes; FullText Journals; GenBank/EMBL (DNA sequence); PubMed; UniProt (protein sequence); Prosite; Pfam; PSSM; Taxonomy; PDB (3-D struct)]

Databases
• BIG
• EMBL/GenBank 200Gbp, 100m entries, 2500 complete genomes, 200K species
• Encycl.
  Britannica 180m letters, 40m words
• EMBL 1km of Britannica Volumes
• Doubling every 14-18 mo
• Human genome is X bp?

Intrinsic vs Context
Internal
• DNA, protein sequence
  – DNA: Purine/Pyrimidine
  – AAs: small, hydrophobic, aromatic, polar
  – Variants: SNPs, Indels, Alt Splicing
• 2ndry structure
  – DNA: stem/loops
  – Protein: helix, sheet, turn, loop

Intrinsic vs Context
External, context for your molecule
• In other species (homologs, phylog trees)
• In which cell
• In which cellular location (GO)
• Molecular complex (dimers)
• Which pathway (KEGG)
• Where in genome (neighbors, synteny)

New Unknown Gene
• Blast homology searching
• Genomic location/neighboring genes
• Where is it expressed?
• How regulated (control sequences)
• Intron/exon structure
• Domain structure
• Restriction sites etc.
• Primer design

DNA/gene structure
• Four bases A T C G U
  – 2 pyrimidine, 2 purine
  – LOTS of them: how many?
• Open reading frame
• 5’ signals, 3’ signals
• Introns/exons
• Neighbours (operons)

Two sequences
• Alignment
  – Local
  – Global
• Dotplot
• Threading

One seq vs many
• Homology search vs database
• Special case of 2-seq alignment
• Blast vs fasta
• Limit by species/taxon
• Substitution matrices
• Low complexity masking

Multiple sequence alignment
• MSA
• Progressive alignment
• ClustalW or (better) T-Coffee

Phylogenetic trees
• Computationally intensive
• Distance matrix methods
  – Neighbor-joining (NJ)
  – UPGMA
• Minimum evolution
• Maximum parsimony
• Maximum likelihood
  – Bayesian methods

Genefinding
• Special case of DNA analysis
• How to annotate a genome
• Bacterial
  – Find open reading frames (ORFs)
  – With start/stop codons
  – With promoter, RBS, CAAT, TATA
• Eukaryotic
  – As above PLUS
  – Introns/exons
  – Alternative splicing

Typical mammalian gene structure
[Figure: DNA drawn 5’ to 3’, labelled Control Region, miRNAs?, Start (ATG), Exon 1, Exon 2, Exon 3, Exon 4, Introns (gt.. …ag) and Stop; introns are “spliced out” and discarded to give the RNA]
Stop: TAG, TGA, TAA
ATGCCCAGGAGATTTGGA . . .
PROTEIN MetProArgArgPheGly . . .
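
The gene-structure slide pairs the sequence ATGCCCAGGAGATTTGGA with the protein MetProArgArgPheGly. As a minimal sketch of how that reading works, the Python below builds the standard genetic code and translates codon by codon, stopping at TAG, TGA or TAA. The names `CODON_TABLE`, `THREE_LETTER` and `translate` are illustrative choices, not part of the course material, and the sketch assumes an already-spliced coding sequence read in frame from its first base.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter to three-letter amino acid names, to match the slide's notation.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate(dna):
    """Translate a coding DNA sequence in frame, stopping at a stop codon."""
    dna = dna.upper()
    residues = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # TAA, TAG or TGA
            break
        residues.append(THREE_LETTER[aa])
    return "".join(residues)


print(translate("ATGCCCAGGAGATTTGGA"))  # MetProArgArgPheGly
```

Real gene finding is harder than this: introns have to be removed first and the correct reading frame has to be found, which is why the Genefinding slide treats it as a special case of DNA analysis.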

Protein substructure
• DNA makes protein and protein (enzymes) make everything else.
• 20 Amino acids
• Amino acid properties
• Motifs
• Domains
• Biological units

Amino acid properties again

… and again and again

Protein 3-D structure
• Relationship between sequence & structure
• Secondary structure
  – Alpha helix
  – Beta sheet
  – Coil
  – Turn
• Threading sequence to homologous structure

Gene Expression
• EST
• SAGE
• MicroArray
• Clustering of same expressed genes

Genomics
• Complete DNA seq for a species
• Gene order
• Gene clusters/operons
  – Missing operons
• Gene duplication
• Whole genome duplication (WGD)

SNPs
• Key issue in genetics is that two organisms are both the same and different:
  – Humans vs chimps vs mouse
  – Parent vs offspring vs co-national vs human
• Single nucleotide polymorphisms
• Variation between individuals
• Pharmacogenetics
  – Personal tailored medicine

Summary/take home
• Course designed to give you access to databases, software tools
• …and ways of thinking about data
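
As a small worked example for the SNPs slide, the sketch below lists the positions at which two already-aligned sequences differ by a single base. The function name `snp_positions` and the second sequence are hypothetical, made up for illustration; the first sequence is the one from the gene-structure slide. The sketch assumes the two sequences are the same length and that `-` marks an alignment gap.

```python
def snp_positions(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes the sequences are already aligned; columns containing a gap ('-')
    are skipped because they are indels rather than substitutions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(pairs, start=1)
        if a != b and "-" not in (a, b)
    ]


individual_1 = "ATGCCCAGGAGATTTGGA"
individual_2 = "ATGCCCAGGAGGTTTGGA"  # hypothetical variant, for illustration only
print(snp_positions(individual_1, individual_2))  # [(12, 'A', 'G')]
```

In this made-up case the change turns the codon AGA into AGG, and both encode Arg, so the two individuals differ at the DNA level while the protein stays the same.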