Bioinformatics - University of Hawaii
Bioinformatics
ABE 2007
Kent Koster
Group 3

Why bioinformatics?
“Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.”

Outline
- Bioinformatics Defined
- Evolution of Bioinformatics
- Bioinformatics History
- Common Uses of Bioinformatics
- Procedures and Tools of Bioinformatics
- Our Procedure
- Our Results
- Resources

Bioinformatics Defined
- Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
- It differs from “computational biology”: computational biology uses computer technology to answer a single, hypothesis-based question, whereas bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences.
- i.e. converting “data” to “information.”

The nebulous genesis of bioinformatics
- 1977 – Φ-X174 phage genome sequenced
- 1990 – Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
- 1990s – Software used to find fragment overlaps for the Human Genome Project
- 1992 – NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents

The nebulous genesis of bioinformatics
- 1994 – “Entrez” Global Query Cross-Database Search System allows users to search the GenBank database
- 1995 – Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.)
in the sequenced Haemophilus influenzae genome
- 1996 – NCBI-BLAST created to provide powerful heuristic searches against the GenBank database

Genomics to Proteomics through Bioinformatics
- Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the “product” science made possible by bioinformatics
- A proteome is the collection of all proteins expressed in a cell at a given time
- Every organism has one genome, but many proteomes
- In addition to “high throughput” protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
- Proteomics represents a methodical addition of “large scale biology” to traditional molecular biology, made possible by bioinformatics

Common Uses of Bioinformatics
- Homology and Comparative Modeling
  - Protein or gene homology is nucleotide or amino acid sequences or domains shared between different genes or proteins, whether from the same or different organisms
- Gene or Protein Identification
  - Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples

So, how do ya do it?
- DNA Sequencing
- Sequence Formats
- Sequence Homology Software Tools
- Aligning Tools
- Annotated Information
- Protein Folding

DNA Sequencing
- Sanger Method
  - New DNA chains being synthesized by DNA polymerase are terminated when dideoxynucleotides (ddNTPs, added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain

DNA Sequencing
- Fluorescent dyes are bound to the ddNTPs, allowing each terminated molecule to be detected when it is excited by a laser
- Terminated DNA chains are run on a gel, and fragments are resolved by size
- By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed

Example Sequence Chromatogram

Sequence Analysis
- First Things First – Sequence File Formats:
- Most common for nucleotides: FASTA / Multi-FASTA
  - “>” followed by any text; the entire line is read as the sequence title
  - Line break followed by the continuous 5’-3’ nucleotide sequence, or a protein sequence using 1-letter codes
- Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
GTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATC
TAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAA

Sequence Homology Software
- NCBI-BLAST
  - Run by the National Center for Biotechnology Information
  - BLAST uses a heuristic algorithm that approximates the Smith-Waterman local alignment algorithm
  - The algorithm searches the database for a short string (“word”) from the query (default length 11 for nucleotide searches); when it detects a match, it compares the nucleotides on either side of this seed to extend the match
  - Gaps are taken into account, then the matches are presented in order of statistical significance
  - http://www.ncbi.nlm.nih.gov/BLAST/

Different Types of BLAST
- Nucleotide-nucleotide BLAST (BLASTN):
  - Basic nucleotide sequence searches
  - The BLAST that you used for your sequences
- Protein-protein BLAST (BLASTP):
  - Similar technology used to search amino acid sequences
- Position-Specific Iterative BLAST (PSI-BLAST):
  - A more advanced
protein BLAST useful for analyzing relationships between divergently evolved proteins

Different Types of BLAST
- BLASTX and TBLASTN:
  - Use six-frame translation of the nucleotide query (BLASTX) or of the nucleotide database (TBLASTN) so that the search is carried out at the protein level
- MegaBLAST:
  - Used for BLASTing several sequences at once to cut down on processing load and server reporting time

Interpreting BLAST Results
- Max/Total Score
  - Calculated from the number of matches and gaps
  - Higher relative to your query length is better
- E Value: $E = K m n\, e^{-\lambda S}$
  - m is the query length, n is the database length, S is the alignment score, and K and λ are constants of the scoring system
  - Translation: the E value is the number of matches scoring at least this well that would be expected to turn up by random chance in a database of this size
  - e.g. E = 1e-6 means that about one chance match this good would be expected per 1,000,000 searches like this one
  - Smaller E values are better
  - Values larger than E = 1e-5 are too likely to be due to chance

Interpreting BLAST Results
- Query Coverage
  - The percent of the query sequence matched by the database entry
- Max Ident
  - The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g.
deletions or additions reduce this value)

Sequence Aligning Software
- Clustal (free)
  - ClustalX – Software
  - ClustalW – Web
- DNAStar ($$$)
- Functionality is similar; the differences are in interface, tools, and speed of algorithms
- http://www.ebi.ac.uk/clustalw/

SMART
- Simple – Modular – Architecture – Research – Tool
- Run by EMBL (European Molecular Biology Laboratory)
- While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains

PFAM
- Protein domain database
- Manually curated, trading volume for quality
- Uses “hidden Markov models” for domain pattern recognition
- Run by the Sanger Institute in the UK
- Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
- http://www.sanger.ac.uk/Software/Pfam/

Interpro
- Database of protein domains and functional sites
- Best source of annotation
- Other tools sometimes draw annotation from Interpro
- Run by the European Bioinformatics Institute
- http://www.ebi.ac.uk/interpro/

Protein Folding
- Lowest energy state folding
  - Ab initio: tremendously resource heavy, can only be done for tiny proteins
  - Distributed computing is used for mid-sized proteins
    - Folding@Home
    - Human Proteome Folding Project
    - Rosetta@Home
    - Predictor@Home

Protein Folding
- Software-assisted manual folding
  - Use knowledge of biochemistry to fold the protein into a predicted structure, then software to find the lowest energy state
  - Commercial Programs:
    - Protein Shop
    - Profold
- Manual Motif Verification
  - Ramachandran Plot – plot of the ψ against φ backbone dihedral angles of each amino acid residue

Our Procedure
- Colonies were selected from nutrient plates
- Each group selected two colonies to sequence
- Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
- Presence of the PDI insert was expected to disrupt ccdB (lethal protein) and LacZα gene expression in the vector plasmid
- LacZα expression resulted in some blue colonies, as the colonies were able to
cleave X-Gal substrate into blue product

Initial Questions Guiding Colony Selection
- How did some blue colonies survive?
- Did all blue colonies come from the PCR product?
- Did the white colonies contain the PDI inserts?
- Were some colonies able to survive without the ampicillin resistance plasmid?
- What was the actual sequence of the commercial positive control insert?
- Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?

Procedure
- Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
- Samples were sent to UH Manoa lab for sequencing
- Chromatogram results were viewed with Finch TV to determine quality

Procedure
- Sequences were trimmed at 5’ and 3’ ends, then Finch TV was used to try to locate the restriction enzyme sites on the vector

Procedure
- Sequences were exported in FASTA format
- Procedure was repeated for the other strands
- Pair-wise alignment was performed for both strands of each sample with EBI’s tools
- Consensus sequence from pair-wise alignment was searched for in BLAST
- Gene information was located from BLAST annotation and TAIR website

Results

General Remarks
- Because colonies were selected prior to the identity of the positive control insert being questioned, no control colonies were sequenced
- All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
- Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E.
coli growing as one colony

Group 3 Results
- Sequenced 1 blue and 1 white colony from same plate
- Colonies were transformed with PCR product, not gel-recovered DNA
- White colonies had PDI insert
- Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZα gene to be expressed

Group 3 White Colony
- T7 strand definitively showed the presence of a PDI insert

Group 3 White Colony
- T3 and T7 strand consensus sequence also showed PDI gene presence

Group 3 Blue Colony
- Blue colony T3 showed multiple signals

Group 3 Blue Colony
- However, T7 strand was salvageable
- A 154 nucleotide sequence was found between the restriction sites

Group 1 Results
- White colony from PCR product showed PDI gene in both T3 and T7 strands
- White colony from gel purification:
  - T7 strand sequenced as multiple signals
  - T3 strand sequenced excellently

Group 1 Gel White Colony
- T3 sequence showed only nucleotides 1540-2320 of the vector

Group 2 Results
- White colony from gel purification
  - White colonies sequenced with PDI gene
- Blue w/ white ring colony from PCR
  - Both T3 and T7 strand sequencing showed consistent multiple signals

Group 4 Results
- 1 white colony from PCR and 1 white colony from gel purification were sequenced
- Both showed PDI gene

Final Remarks
- All white colonies had the PDI gene, except one with a modified vector
- All blue colonies were transformed with the direct PCR product (not gel purified)
- Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
- Some blue colonies with white rings could be 2 separate lines living together
- Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
- How could bacteria without the insert survive both ccdB expression and ampicillin selection
in broth?
  - ccdB gene could be lost due to mutation
  - Bacteria could have cut the plasmid, deleting ccdB but retaining the ampicillin resistance gene and possibly LacZ
- No group sequenced the positive control insert – sequence still a mystery!

Resources
- http://www.bioinformatics.org
- http://syntheticbiology.org/Tools.html
- NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
- SMART: http://smart.embl-heidelberg.de/
- PFAM: http://www.sanger.ac.uk/Software/Pfam/
- Interpro: http://www.ebi.ac.uk/interpro/
- Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
- Finch TV: http://www.geospiza.com/finchtv/
- EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
- TAIR: http://www.arabidopsis.org
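
Worked example: E value

As a rough illustration of the E value formula from the “Interpreting BLAST Results” slides, the sketch below evaluates E = K·m·n·e^(-λS) for a few raw scores and applies the 1e-5 rule of thumb quoted there. The function name and all numeric inputs (K, λ, the database length, the scores) are assumptions chosen for illustration only; they are not taken from the project's BLAST runs, and real values of K and λ depend on the scoring scheme that BLAST reports with each search.

```python
import math


def expected_chance_hits(score, query_len, db_len, k, lam):
    """Return E = K * m * n * exp(-lambda * S).

    score     -- raw alignment score S
    query_len -- query length m
    db_len    -- total database length n
    k, lam    -- constants K and lambda of the scoring system
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical inputs, for illustration only (not from the slides).
K = 0.711                # assumed scoring-system constant
LAMBDA = 1.374           # assumed scoring-system constant
QUERY_LEN = 154          # m: a query the size of the 154-nucleotide insert
DB_LEN = 1_000_000_000   # n: assumed database size in nucleotides

for raw_score in (20, 30, 40, 50):
    e_value = expected_chance_hits(raw_score, QUERY_LEN, DB_LEN, K, LAMBDA)
    # Rule of thumb from the slides: E above 1e-5 is too likely to be chance.
    verdict = "worth a look" if e_value <= 1e-5 else "too likely to be chance"
    print(f"S = {raw_score:2d}   E = {e_value:.2e}   {verdict}")
```

Because the score sits in the exponent, a modest increase in S shrinks E by orders of magnitude, while doubling the database size only doubles E. That is why the same alignment can look less significant against a larger database.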