Bioinformatics - University of Hawaii
Bioinformatics
ABE 2007
Kent Koster
Group 3

Why bioinformatics?
- “Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.”

Outline
- Bioinformatics Defined
- Evolution of Bioinformatics
- Bioinformatics History
- Common Uses of Bioinformatics
- Procedures and Tools of Bioinformatics
- Our Procedure
- Our Results
- Resources

Bioinformatics Defined
- Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
- It differs from “computational biology”: computational biology is the use of computer technology to solve a single, hypothesis-based question, whereas bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences.
- In other words, bioinformatics converts “data” into “information.”

The nebulous genesis of bioinformatics
- 1977 – Φ-X174 phage genome sequenced
- 1990 – Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
- 1990s – Software used to find fragment overlap for the Human Genome Project
- 1992 – NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents
- 1994 – “Entrez” Global Query Cross-Database Search System allows users to search the GenBank database
- 1995 – Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.)
  in the sequenced Haemophilus influenzae genome
- 1996 – NCBI-BLAST created to provide powerful heuristic searches against the GenBank database

Genomics to Proteomics through Bioinformatics
- Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the “product” science made possible by bioinformatics
- A proteome is the collection of all proteins expressed in a cell at a given time
- Every organism has one genome, but many proteomes
- In addition to “high throughput” protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
- Proteomics represents a methodical addition of “large scale biology” to traditional molecular biology, made possible by bioinformatics

Common Uses of Bioinformatics
- Homology and Comparative Modeling
  - Protein or gene homology refers to nucleotide or amino acid sequences or domains shared between different proteins, whether they come from the same or different organisms
- Gene or Protein Identification
  - Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples

So, how do ya do it?
- DNA Sequencing
- Sequence Formats
- Sequence Homology Software Tools
- Aligning Tools
- Annotated Information
- Protein Folding

DNA Sequencing
- Sanger Method
  - New nucleotide chains of DNA being replicated by DNA polymerase are stopped when dideoxynucleotides (added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain
- Fluorescent dyes are bound to the ddNTPs, allowing the molecule to be detected when it is excited by a laser
- Terminated DNA chains are run on a gel, and fragments are resolved by size
- By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed

Example Sequence Chromatogram

Sequence Analysis
- First Things First – Sequence File Formats:
- Most common for nucleotides: FASTA / Multi-FASTA
- “>” followed by any Unicode text; the entire line is read as the sequence title
- Carriage return followed by a continuous 5’-3’ nucleotide sequence, or a protein sequence using 1-letter codes
- Example:
  >E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
  ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
  CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
  GTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATC
  TAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAA

Sequence Homology Software
- NCBI-BLAST
  - Run by the National Center for Biotechnology Information
  - BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm
  - The algorithm searches the database for a small string (the seed) within the query (default length 11 for nucleotide searches); when it detects a match, it searches for shared nucleotides at each end of the seed to extend the match
  - Gaps are taken into account, then the matches are presented in order of statistical significance
  - http://www.ncbi.nlm.nih.gov/BLAST/

Different Types of BLAST
- Nucleotide-nucleotide BLAST (BLASTN):
  - Basic nucleotide sequence searches
  - The BLAST that you used for your sequences
- Protein-protein BLAST (BLASTP):
  - Similar technology used to search amino acid sequences
- Position-Specific Iterative BLAST
  (PSI-BLAST):
  - A more advanced protein BLAST, useful for analyzing relationships between divergently evolved proteins
- BLASTX and TBLASTN:
  - Use six-frame translation of the nucleotide query (BLASTX) or of the nucleotide database (TBLASTN), so that the comparison is made at the protein level
- MegaBLAST:
  - Used for BLASTing several sequences at once to cut down on processing load and server reporting time

Interpreting BLAST Results
- Max/Total Score
  - Calculated from the number of matches and gaps
  - Higher relative to your query length is better
- E Value: E = K·m·n·e^(-λS)
  - Here m and n are the lengths of the query and of the database, S is the alignment score, and K and λ are statistical parameters of the scoring system
  - Translation: the E value is the number of matches scoring at least this well that would be expected by random chance in a database of this size
  - e.g. E = 1e-6 means that a chance match this good would be expected about once in every 1,000,000 searches of this kind
  - Smaller E values are better
  - Values larger than E = 1e-5 are too likely to be due to chance
- Query Coverage
  - The percent of the query sequence matched by the database entry
- Max Ident
  - The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g.
    deletions or additions reduce this value)

Sequence Aligning Software
- Clustal (free)
  - ClustalX – Software
  - ClustalW – Web
- DNAStar ($$$)
- Functionality is similar; the difference is in interface, tools, and speed of algorithms
- http://www.ebi.ac.uk/clustalw/

SMART
- Simple Modular Architecture Research Tool
- Run by EMBL (European Molecular Biology Laboratory)
- While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains

PFAM
- Protein domain database
- Manually curated, trading volume for quality
- Uses “hidden Markov models” for domain pattern recognition
- Run by the Sanger Institute in the UK
- Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
- http://www.sanger.ac.uk/Software/Pfam/

Interpro
- Database of protein domains and functional sites
- Best source of annotation
- Other tools sometimes draw annotation from Interpro
- Run by the European Bioinformatics Institute
- http://www.ebi.ac.uk/interpro/

Protein Folding
- Lowest energy state folding
  - Ab initio: tremendously resource heavy, can only be done for tiny proteins
  - Distributed computing is used for mid-sized proteins
    - Folding@Home
    - Human Proteome Folding Project
    - Rosetta@Home
    - Predictor@Home
- Software-assisted manual folding
  - Use knowledge of biochemistry to fold the protein into a predicted structure, then use software to find the lowest energy state
  - Commercial Programs:
    - Protein Shop
    - Profold
- Manual Motif Verification
  - Ramachandran Plot – plot of the backbone dihedral angles Ψ against Φ for each residue

Our Procedure
- Colonies were selected from nutrient plates
  - Each group selected two colonies to sequence
  - Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
  - Presence of the PDI insert was expected to disrupt ccdB (lethal protein) and LacZα gene expression in the vector plasmid
  - LacZα expression resulted in some blue
    colonies, as the colonies were able to cleave X-Gal substrate into blue product

Initial Questions Guiding Colony Selection
- How did some blue colonies survive?
- Did all blue colonies come from the PCR product?
- Did the white colonies contain the PDI inserts?
- Were some colonies able to survive without the ampicillin resistance plasmid?
- What was the actual sequence of the commercial positive control insert?
- Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?

Procedure
- Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
- Samples were sent to the UH Manoa lab for sequencing
- Chromatogram results were viewed with Finch TV to determine quality
- Sequences were trimmed at the 5’ and 3’ ends, then we attempted to locate the restriction enzyme sites on the vector with Finch TV
- Sequences were exported in FASTA format
- The procedure was repeated for the other strands
- Pair-wise alignment was performed for both strands of each sample with EBI’s tools
- The consensus sequence from the pair-wise alignment was searched for in BLAST
- Gene information was located from the BLAST annotation and the TAIR website

Results
General Remarks
- Because colonies were selected before the identity of the positive control insert was questioned, no control colonies were sequenced
- All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
- Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E.
  coli growing as one colony

Group 3 Results
- Sequenced 1 blue and 1 white colony from the same plate
- Colonies were transformed with PCR product, not gel-recovered DNA
- White colonies had the PDI insert
- Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in frame and allowing a partially functional LacZα gene to be expressed

Group 3 White Colony
- T7 strand definitively showed the presence of a PDI insert
- T3 and T7 strand consensus sequence also showed PDI gene presence

Group 3 Blue Colony
- Blue colony T3 showed multiple signals
- However, the T7 strand was salvageable
- A 154-nucleotide sequence was found between the restriction sites

Group 1 Results
- White colony from PCR product showed the PDI gene in both T3 and T7 strands
- White colony from gel purification:
  - T7 strand sequenced as multiple signals
  - T3 strand sequenced excellently

Group 1 Gel White Colony
- T3 sequence showed only nucleotides 1540-2320 of the vector

Group 2 Results
- White colony from gel purification
  - White colonies sequenced with the PDI gene
- Blue w/ white ring colony from PCR
  - Both T3 and T7 strand sequencing showed consistent multiple signals

Group 4 Results
- 1 white colony from PCR and 1 white colony from gel purification were sequenced
- Both showed the PDI gene

Final Remarks
- All white colonies had the PDI gene, except one with a modified vector
- All blue colonies were transformed with the direct PCR product (not gel purified)
- Group 3 showed that a small (154 bp) insert that stays in frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
- Some blue colonies with white rings could be 2 separate lines living together
  - Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
- How could bacteria without the insert survive both ccdB expression
  and ampicillin selection in broth?
  - The ccdB gene could be lost due to mutation
  - Bacteria could have cut the plasmid, deleting ccdB but retaining the ampicillin resistance gene and possibly LacZ
- No group sequenced the positive control insert – sequence still a mystery!

Resources
- http://www.bioinformatics.org
- http://syntheticbiology.org/Tools.html
- NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
- SMART: http://smart.embl-heidelberg.de/
- PFAM: http://www.sanger.ac.uk/Software/Pfam/
- Interpro: http://www.ebi.ac.uk/interpro/
- Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
- Finch TV: http://www.geospiza.com/finchtv/
- EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
- TAIR: http://www.arabidopsis.org
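
Worked sketch: FASTA parsing, seed matching, and the E value

The sequence-analysis slides describe three mechanical steps: reading a FASTA file, finding a short exact "seed" match and extending it, and turning a score into an E value. The Python sketch below illustrates each step under loud simplifications. It is not NCBI's implementation: the function names are made up for this example, the extension step only walks along exact matches (real BLAST tolerates mismatches and gaps and scores them), and the K and λ values passed to e_value are placeholder numbers, not parameters of any real scoring system.

```python
import math


def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA or multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined; internal spaces are dropped.
            chunks.append(line.replace(" ", "").upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) for every exact word_size-long match."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_seed(query, subject, i, j, word_size=11):
    """Grow a seed in both directions while bases are identical (no gaps).

    Returns (query_start, subject_start, match_length).
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


def e_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S); K and lam must come from the scoring system."""
    return K * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    core = "ACCTGATCACAAATGCGAT"          # 19 bases shared by both sequences
    query = "TTTT" + core + "GGG"
    subject = "GGG" + core + "CCC"
    first_hit = find_seeds(query, subject)[0]
    print(extend_seed(query, subject, *first_hit))
    # Placeholder K and lambda, chosen only to show the shape of the formula.
    print(e_value(score=19, m=len(query), n=len(subject), K=0.1, lam=1.0))
```

Doubling either the query length m or the database length n doubles E, while each extra unit of score S shrinks E by a factor of e^λ. This is why a long, high-scoring match has a very small E value even in a large database.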