What is Bioinformatics?

• Bioinformatics: collection and storage of biological information
• Computational biology: development of algorithms and statistical models to analyze biological data

Jobs for bioinformaticians

Databases make biological data available to scientists

• As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously.
  – Nucleotide, protein sequences
  – Protein structure
  – Expression data
  – Gene/protein networks

Nucleotide Databases

• EMBL www.ebi.ac.uk/embl/
  – The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.

Nucleotide Databases cont.

• GenBank: maintained by the National Center for Biotechnology Information (NCBI); searchable through Entrez, which gives access to nucleotides, proteins, annotations, etc. www.ncbi.nlm.nih.gov/Genbank/
• UniGene: a non-redundant set of gene-oriented clusters www.ncbi.nlm.nih.gov/UniGene/

Protein Databases

• SWISS-PROT www.expasy.ch/sprot/
  – SWISS-PROT is a protein sequence database that provides a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

Protein Databases

• PIR http://pir.georgetown.edu/
  – The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It collaborates with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). Release 67.00 (31 Dec 2000) contains 198,801 entries.

Sequence Motif Databases

• Pfam www.sanger.ac.uk/Software/Pfam/
  – Pfam is a database of protein families defined as domains (contiguous segments of entire protein sequences).
    For each domain, it contains a multiple alignment of a set of defining sequences (the seeds) and the other sequences in SWISS-PROT that can be matched to that alignment.

3D-Structure Databases

• PDB www.rcsb.org/pdb/
  – The PDB is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. Structural biologists usually deposit their structures in the PDB on publication, and some scientific journals require this before accepting a paper. It also accepts the experimental data used to determine the structures.

How to get sequences?

• The Entrez database provides nucleotide and protein sequences in different formats.
• One of these formats is FASTA.

FASTA FORMAT

• Each sequence begins with a description line that starts with '>'.

A protein in FASTA format

>HBA_ALLMI
VLSMEDKSNVKAIWGKASGHLEEYGAEALEMF
CAYPQTKIYFPHFDMSHNSAQIRAHGKKVFSA
LHEAVNHIDDLPGALCRLSELHAHSLRVDPVNF
KFLAHCVLVVFAIHHPSALSPEIHASLDKFLCAV
SAVLTSKYR

• The first line is the description line. It starts with the character '>', which signals that a description line follows. The string following the '>' and ending at the first space (' ') is the sequence id (HBA_ALLMI).

A DNA sequence in FASTA format

>X sequence
ATGAATAGCACAGAGAGACCAAGAGAG
AGAGAGAGACCCAGATATATCAGATAGA
GA

Why align sequences?

• Find evolutionary relationships between species and/or genes.
• Identify novel genes and define similar genes in other species.
• Study genomes and how they change.

Sequence Alignment

• Homology means that two (or more) sequences have a common ancestor.
• An example of a sequence alignment: Sequence 1, Sequence 2

CLUSTALW: software for aligning sequences
http://www.ebi.ac.uk/clustalw/

Genome Databases

• www.ensembl.org

Genome Databases: Gene Prediction

• Define the location of genes (coding sequences, regulatory regions).
• Gene prediction using software based on rules and patterns: find Open Reading Frames (ORFs), with additional criteria for a good start sequence for a gene.
• Gene identification through alignment with known proteins and EST sequences (Expressed Sequence Tags; mRNA sequences).
• Gene prediction through similarity with proteins or ESTs in other organisms.
• Gene prediction through comparison with other genomes; conserved regions are probably coding or regulatory regions.

Genome Databases: Annotation

• Annotation of the genes: compare with genes/proteins of known function in other organisms.
• Functional classification: broad groups of functional characterization, such as 'ribosomal proteins', 'nucleotide metabolism', 'signal transduction'.

Genome Databases: Evolution

• Evolutionary history
• Genome duplications
• Gene loss

Transcription Databases

• Microarrays can analyze thousands of transcripts simultaneously.
  – They allow analysis of genes whose expression is high or low between, for example, normal and disease samples.
• Microarray databases contain large amounts of expression data.
  – Stanford Microarray Database

Signaling & Metabolic Pathways

• Analyze how genes/proteins interact and learn about the function of genes
  – KEGG: Kyoto Encyclopedia of Genes and Genomes
  – http://www.genome.ad.jp/kegg/
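
Two ideas from the slides above, the FASTA format and rule-based ORF finding, can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function names parse_fasta and find_orfs, the min_codons threshold, and the use of ATG as the only start codon are choices made here, not part of the original slides. Real gene-prediction software applies further criteria and also scans the reverse strand.

```python
from typing import Dict, Iterable, List, Optional, Tuple

START = "ATG"                    # assumption: only ATG is treated as a start codon
STOPS = {"TAA", "TAG", "TGA"}    # the three standard stop codons


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the text after '>' up to the first space.
            current_id = line[1:].split(" ", 1)[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[current_id] += line.upper()
    return records


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Scan the three forward reading frames for ORFs.

    Returns (start, end, sequence) tuples with a 0-based start and an
    exclusive end; the sequence includes the stop codon. min_codons counts
    the codons from the start codon up to, but not including, the stop.
    """
    orfs: List[Tuple[int, int, str]] = []
    for frame in range(3):
        start: Optional[int] = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    # The DNA example from the FASTA slide.
    fasta_text = """>X sequence
ATGAATAGCACAGAGAGACCAAGAGAG
AGAGAGAGACCCAGATATATCAGATAGA
GA"""
    for seq_id, seq in parse_fasta(fasta_text.splitlines()).items():
        for start, end, orf in find_orfs(seq):
            print(seq_id, start, end, orf)
```

Splitting the description line at the first space mirrors the rule stated on the FASTA slide, so ">X sequence" yields the id "X". Keeping the ORF scan to the forward strand keeps the example short; a fuller version would repeat the scan on the reverse complement.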