Please use Linux today if possible!

Introduction to Molecular Biology Databases
Alinda Nagy & Hedi Hegyi, PhD @ Institute of Enzymology, Budapest
The BioSapiens Permanent School of Bioinformatics
Budapest, Sept 4-8, 2006

Databases

What is a database?
• A database is a structured collection of information (an organized array of information).
• A database consists of basic objects called records or entries.
• Each record consists of fields, which hold defined data related to that record.
• For example, a protein database would typically have proteins as records and protein properties as fields (e.g. name, length, sequence, taxonomic origin).
Credit: Noam Kaplan

What is a database? cont’d
• A database is searchable (index) -> table of contents
• A database is updated periodically (release) -> new edition
• A database is cross-referenced (hyperlinks) -> links with other databases

Why Databases?
• The purpose of databases is not merely to collect and organize data, but mainly to allow advanced data retrieval.
• A query is a method to retrieve information from the database.
• The organization of each record into predetermined fields allows us to run queries on fields.
• Example: Find all human proteins that are enzymes and have a length of 1000-1200 aa.
Credit: Noam Kaplan

Databases on the Internet
• Biological databases often have a web interface, which allows the user to send queries to the database.
• Some databases can be accessed by different web servers, each offering a different interface.
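The field-based query from the "Why Databases?" slide can be sketched in a few lines of Python. This is only an illustration of records, fields and a query: the ProteinRecord class, the query function and the four toy records are hypothetical and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One record; each attribute is a field (all names here are hypothetical)."""
    name: str
    organism: str
    length: int
    is_enzyme: bool


# Toy records invented for illustration only.
RECORDS = [
    ProteinRecord("toy_kinase", "Homo sapiens", 1105, True),
    ProteinRecord("toy_structural", "Homo sapiens", 1150, False),
    ProteinRecord("toy_protease", "Mus musculus", 1010, True),
    ProteinRecord("toy_ligase", "Homo sapiens", 420, True),
]


def query(records, organism, min_length, max_length):
    """Return the enzyme records of one organism within a length range."""
    return [
        r for r in records
        if r.organism == organism
        and r.is_enzyme
        and min_length <= r.length <= max_length
    ]


# "Find all human proteins that are enzymes and have a length of 1000-1200 aa."
hits = query(RECORDS, "Homo sapiens", 1000, 1200)
print([r.name for r in hits])
```

Because every record has the same predetermined fields, the query is just a filter over those fields. A real database does the same thing with indexes so that it does not have to scan every record.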
[Figure: User, Web server and Database server, with arrows labelled request, query, result and web page]
Credit: Noam Kaplan

Databases on the Internet: four layers, with examples
• Information system: the UBC library, Google, Entrez (NCBI), SRS (Sequence Retrieval System)
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
Credit: Francis Ouellette

Database download
• Nearly all biological databases are available for download as simple text files.
• A local version of the database removes limitations on how you process the data.
• Processing data in files requires some minimal computer-programming skills.
  – Perl is an easy programming language that can be used for extraction and analysis of data from files.
Credit: Noam Kaplan

Tour of the major molecular biology databases
• There is a tremendous amount of information about biomolecules in publicly available databases.
• Today, we will just look at some of the main databases and what kind of information they contain.
• Exercises will give you a little practice at browsing databases.

List of molecular biology databases
• Nucleic Acids Research publishes an annual database issue.
The 2006 update of the online Molecular Biology Database Collection includes 858 databases.
• http://www3.oup.co.uk/nar/database/c/

[Figure: Large Growth in the Number of Biological Databases. Number of databases in the NAR Database Issue by year, 1996-2006]

Molecular biology data types
• Organisms
• Genome maps (example: mouse chromosome X, from the Mouse Genome Informatics project, http://www.informatics.jax.org/)
• DNA motifs
• DNA sequences (example: ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...)
• RNA sequences
• RNA structures
• RNA expression
• Protein sequences (example: ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...)
• Protein structures (example: PDB entry 1CIS; P.Osmark, P.Sorensen, F.M.Poulsen)
• Protein motifs
Credit: Lei Liu

Types of molecular biology databases
14 main NAR categories:
• Nucleotide Sequence
• RNA sequence
• Protein sequence
• Structure
• Genomics (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression
• Proteomics Resources
• Other
• Organelle
• Plant
• Immunological

Resources are Becoming More Diverse
[Figure: NAR – Database Categories. Two pie charts of database types, 2004 and 2006]

NAR 2006 – A Closer Look
[Figure: pie chart of the 2006 NAR database types]
• Genome scale databases have proliferated
• Traditional sequence databases are now a small part
• Databases around new specific data types are emerging
• Pathway and disease orientated databases are emerging

Database searches

Using a database
• How to get information out of a database:
  – Summaries: how many entries, average or extreme values
  – Browsing: no targeted information to retrieve
  – Search: looking for particular information
• Searching a database:
  – Must have a key that identifies the element(s) of the database that are of interest.
    • Name of gene
    • Sequence of gene
    • Other information
Credit: Larry Hunter

Searching sequence databases
• Start from sequence, find information about it
• Many kinds of input sequences
  – Could be amino acid or nucleotide sequence
  – Genomic or mRNA/cDNA or protein sequence
  – Complete or fragmentary sequences
• Exact matches are rare (even uninteresting in many cases), so the goal is often to retrieve a set of similar sequences.
  – Both small (mutations) and large (required for function) differences within “similar” can be interesting.
Credit: Larry Hunter

What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
  – Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  – Structural (known structure? structural domains? etc.)
  – Functional (molecular, cellular & disease)
• Evolutionary information:
  – Is this gene found in other organisms?
  – What is its taxonomic tree?
Credit: Larry Hunter

What can be discovered about a gene by a database search?
• A little or a lot, depending on the gene
  – Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
  – Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
  – Structural information: associated protein structures, fold types, structural domains
  – Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
  – Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
Credit: Larry Hunter

NCBI and Entrez
• One of the most useful and comprehensive sources of databases is the NCBI (National Center for Biotechnology Information), part of the NIH (National Institutes of Health).
• NCBI provides interesting summaries, browsers for genome data, and search tools.
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
Credit: Larry Hunter

BLAST: Searching with a sequence
• The goal is to find other sequences that are more similar to the query than would be expected by chance (and are therefore likely to be homologous).
• Can start with a nucleotide or amino acid sequence, and search for either (or both)
• Many options
  – E.g. ignore low information (repetitive) sequence, set the critical value for significance
  – Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
• Major choices:
  – Translation
  – Database
  – Filters
  – Restrictions
  – Matrix
Credit: Larry Hunter

[Screenshot: Close hit: Rat ADH alpha]
[Screenshot: Distant hit: Human sorbitol dehydrogenase]
[Screenshot: Parameters (at bottom!)]
Click on: [screenshot not preserved]
Credit: Larry Hunter

BLAST searches online
• http://www.ncbi.nlm.nih.gov/BLAST/
• Sequences:

>ENSP00000002501 pep:known chr:NCBI36:16:88598804:88613382
MEPPEGAGTGEIVKEAEVPQAALGVPAQGTGDNGHTPVEEEVGGIPVPAPGLLQVTERRQ
PLSSVSSLEVHFDLLDLTELTDMSDQELAEVFADSDDENLNTESPAGLHPLPRAGYLRSP
SWTRTRAEQSHEKQPLGDPERQATVLDTFLTVERPQED

>ENSP00000314902 chr:18 gene:ENSG00000176890 tr:ENST00000323250
MPVAGSELPRRPLPPAAQERDAEPRPPHGELQYLGQIQHILRCGVRKDDRTGTGTLSVFG
MQARYSLRDYSGQGVDQLQRVIDTIKTNPDDRRIIMCAWNPRDLPLMALPPCHALCQFYV
VNSELSCQLYQRSGDMGLGVPFNIASYALLTYMIAHITGLKPGDFIHTLGDAHIYLNHIE
PLKIQLQREPRPFPKLRILRKVEKIDDFKAEDFQIEGYNPHPTIKMEMAV

[Screenshots: BLAST output for ENSP00000002501]
[Screenshots: BLAST output for ENSP00000314902]

Take home messages
• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well crosslinked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!
Credit: Larry Hunter

Protein sequence databases
• General sequence databases (e.g. UniProt)
• Protein properties (e.g. PFD – Protein Folding Database)
• Protein localization and targeting (e.g. NPD – Nuclear Protein Database)
• Protein sequence motifs and active sites (e.g. BLOCKS, InterPro, PROSITE, PRINTS)
• Protein domain databases; protein classification (e.g. InterPro, ProDom, SMART, Pfam)
• Databases of individual protein families (e.g. Histone Database)
http://www3.oup.co.uk/nar/database/cat/1

UniProt (The Universal Protein Resource)
http://www.uniprot.org/
ftp://ftp.uniprot.org/pub/databases/
Wu CH et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006 Jan 1;34(Database issue): D187-91.

Margaret Dayhoff
• The first protein database was created by Margaret Dayhoff and was called the Atlas of Protein Sequence and Structure.
• It was a book.

The Atlas of Protein Sequence and Structure
• Dayhoff had the idea that a compilation of all protein sequences in the literature into one resource would be a useful research tool.
• She and her co-workers collected all known sequences and published them together.
• Then, when a new sequence was obtained, there was a single resource available for determining its relationship to other known sequences.

What is UniProt
• The world's most comprehensive catalog of information on proteins.
• Central repository of protein sequence and function.
• Created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
• Collaboration between EBI (European Bioinformatics Institute), SIB (Swiss Institute of Bioinformatics) and PIR (Protein Information Resource); DDBJ to join.
• Funded mainly by NIH.
• Three database components:
  – UniProt Knowledgebase (UniProtKB)
  – UniProt Reference Clusters (UniRef)
  – UniProt Archive (UniParc)

What is UniProt, cont’d
1. UniProt Knowledgebase (UniProtKB): the central access point for extensive curated protein information, including function, classification, and cross-references. It comprises the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section.
2. UniProt Reference Clusters (UniRef): combines closely related sequences into a single record. This compresses sequence space and so speeds up similarity searches, by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical.
3. UniProt Archive (UniParc): a comprehensive repository that stores all publicly available protein sequences and reflects the history of the sequence data, with links to the source databases.

What is UniProt, cont’d
The UniProt databases collect both protein sequences obtained through experimental determination and protein sequences derived from the translation of nucleotide sequences (which were predicted or determined to code for a protein).
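The idea of merging 100% identical sequences into one record while keeping links to the source databases can be sketched as below. This is a toy illustration, not UniProt's actual procedure: the function merge_identical, the TOY identifiers and the example accessions are all hypothetical, and the 90% and 50% identity levels of UniRef would need a real clustering method that is not shown here.

```python
from collections import defaultdict


def merge_identical(entries):
    """Merge entries whose sequences are identical over their full length.

    entries: iterable of (source_db, accession, sequence) tuples.
    Returns one record per distinct sequence, each listing its sources.
    """
    sources_by_sequence = defaultdict(list)
    for source_db, accession, sequence in entries:
        # Normalise case so that 'mktay' and 'MKTAY' count as the same sequence.
        sources_by_sequence[sequence.upper()].append((source_db, accession))

    merged = []
    for number, (sequence, sources) in enumerate(sources_by_sequence.items(), start=1):
        merged.append({
            "id": f"TOY{number:04d}",  # hypothetical identifier scheme
            "sequence": sequence,
            "sources": sources,
        })
    return merged


# Toy input: two source entries share one sequence, a third is different.
toy_entries = [
    ("db_a", "A0001", "MKTAYIAKQR"),
    ("db_b", "B0042", "mktayiakqr"),
    ("db_a", "A0002", "MSTAVLENPG"),
]
for record in merge_identical(toy_entries):
    print(record["id"], len(record["sources"]), record["sequence"])
```

Each distinct sequence appears once, and the list of sources plays the role of the cross-references back to the original database entries.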
[Figure: sources of UniProt data. Labels: amino acid sequences determined through experimental analysis; nucleotide sequence databases (GenBank, EMBL, DDBJ); protein sequence databases (PIR, Swiss-Prot, TrEMBL); validated; enriched with specific information; UniProt]

UniProt Goals
• High level of annotation
• Minimal redundancy
• High level of integration with other databases
• Complete and up-to-date

Annotation concepts
• UniParc: No annotation
• UniProtKB: Annotated
• UniRef: No annotation, just the description line of the UniProtKB or UniParc master entry in the cluster, for use in FASTA files

Minimal redundancy
• UniParc: All sequences that are 100% identical over their entire length are merged into a single entry, regardless of species. UniParc represents each protein sequence once and only once, assigning it a unique identifier. UniParc cross-references the accession numbers of the source databases.
• UniProtKB: Aims to describe in a single record all protein products derived from a certain gene (or genes, if the translation from different genes in a genome leads to indistinguishable proteins) from a certain species.
• UniRef: Merges sequences automatically.

Integration with other databases
• UniParc: Linked back to source records
• UniProtKB: Linked to >60 other databases
• UniRef: UniRef clusters link back to the UniProtKB and UniParc records in the cluster

Complete and up-to-date
• UniParc: All publicly available protein sequences, updated every 2 weeks (05/06, Rel 8.0: 7,116,519 entries)
• UniProtKB: All suitable stable protein sequences, updated every 2 weeks (05/06, Rel 8.0: 3,170,612 entries)
• UniRef: All protein sequences in UniProtKB and in UniParc that are useful for sequence similarity searches, updated every 2 weeks (05/06, Rel 8.0: 3,511,676 UniRef100, 2,254,474 UniRef90, 1,148,123 UniRef50 entries)

An example
[Screenshots of an example entry, not preserved]

Exercise 1 – Text search
1. Go to ExPASy. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then search for human cochlin. Notice that there is a wealth of information about this protein.
Furthermore, there are many links to sequence analysis tools (some of which you will learn about later) and some other nice features. Note that this is merely a graphical display of the original UniProtKB/Swiss-Prot database entry (which is in text).
2. Try to answer all of the questions below.
   1. In which year was the NMR structure of the LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the protein?

Exercise 2 – BLAST search
1. Go to ExPASy. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then "BLAST".
2. Copy the following human amino acid sequence.

MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDK
RSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRH
GQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDF
LGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGD
SIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVL
DNTQQLKILADSINSEIGILCSALQKIK

3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and a database. For simplicity, use e.g. the UniProtKB database.
4. Run the search and identify the protein. Use the link provided to see the UniProtKB/Swiss-Prot report.
5. Now, try to answer all of the questions below.
   1. What is the Swiss-Prot primary accession number?
   2. What is the common name of the protein?
   3. What is the gene called?
   4. In which year was the crystal structure of the catalytic domain determined? Name the first author.
   5. Does the enzyme require a co-factor to function? If so, which one?
   6. Name the most common disease that arises as a result of deficiency of this enzyme.
   7. How many amino acid residues are there in the protein?
   8. What is the molecular weight of the protein?
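Questions 7 and 8 of Exercise 2 can be cross-checked against the database entry with a few lines of Python. The sketch below counts residues and sums average residue masses plus one water molecule. The function name sequence_stats is hypothetical, the mass table uses commonly tabulated average values and should be treated as approximate, and the value reported in a database entry may differ slightly (for example if it refers to a processed form of the protein).

```python
# Average residue masses in daltons (commonly tabulated values; approximate).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.0153  # one water is added for the free N- and C-termini


def sequence_stats(raw_sequence):
    """Return (number of residues, approximate molecular weight in Da)."""
    # Remove spaces and line breaks left over from copy and paste.
    sequence = "".join(raw_sequence.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = sorted(set(sequence) - set(RESIDUE_MASS))
    if unknown:
        raise ValueError(f"unsupported residue codes: {unknown}")
    mass = sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS
    return len(sequence), mass


# Small demonstration on the dipeptide Gly-Ala.
length, mass = sequence_stats("G A")
print(length, round(mass, 2))
```

To use it for the exercise, pass the pasted sequence (line breaks included) to sequence_stats and compare the result with the values shown in the UniProtKB report.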
Patterns and Profiles, Protein Motifs and Domains
• InterPro - an integrated database of protein families, domains, motifs and functional sites.
• Blocks - multiply aligned ungapped segments for the most highly conserved regions of proteins.
• Motif - a server that scans databases to find motifs or patterns and that can generate sequence profiles.
• Pfam - multiple sequence alignments and HMMs of protein domains and families.
• PRINTS - database of groups of conserved motifs, or protein fingerprints.
• ProDom - protein domain families automatically generated from SWISS-PROT and TrEMBL.
• PROSITE - database of protein families and domains defined by functional sites, patterns and profiles.
• SMART - Simple Modular Architecture Research Tool for the identification of domains.
• COGs database - clusters of sequences determined by comparing sequences from whole genomes.

InterPro (Integrated resource of Protein Families, Domains and Sites)
• http://www.ebi.ac.uk/interpro/
• ftp://ftp.ebi.ac.uk/pub/databases/interpro
• Mulder NJ et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res. 33 (Database Issue): D201-5.

What is InterPro
• Secondary protein databases on functional sites and domains are vital resources for identifying distant relationships in novel sequences, and hence for predicting protein function and structure.
• Unfortunately, these signature databases do not share the same formats and nomenclature, and each database has its own strengths and weaknesses.
• Thus, for best results, search strategies should ideally combine all of them.

What is InterPro, cont’d
– InterPro is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, nonredundant characterization of a given protein family, domain or functional site.
– Integrates the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER databases; the addition of others is scheduled.
– Has cross-references to the BLOCKS database as well as many specialized protein family and protein structure databases.

InterPro
• The latest release of InterPro (12.1) contains 12,953 entries, with 78% coverage of all proteins in UniProtKB.
• Each entry has annotation provided in the name, GO mapping and abstract fields, and all matches against the Swiss-Prot and TrEMBL components of UniProt are precomputed and available for viewing in different formats.
• Protein 3D structural information is integrated from MSD, CATH and SCOP, and this data is available in the match views to provide an at-a-glance comparison of sequence and structural domains.

[Figure: InterPro dataflow scheme]
[Screenshot: InterProScan result]

PROSITE
http://www.expasy.org/prosite/
Database of protein families and domains
• Consists of a large collection of biologically meaningful signatures, described as patterns or profiles, that help to reliably identify which known protein family (if any) a new sequence belongs to.
• The latest version (release 19.11) contains 1329 patterns and 552 profile entries.
• Each signature is linked to a documentation entry providing information on the protein family or domain detected by the signature: origin of its name, taxonomic occurrence, domain architecture, function, 3D structure, main characteristics of the sequence, domain size and some references.

PRINTS
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
• The PRINTS database is a compendium of protein fingerprints.
• A fingerprint is a group of conserved sequence motifs that together provide diagnostic signatures for protein families.
• Fingerprints are diagnostically more powerful than single motifs because they make use of the biological context inherent in a multiple-motif method.
• The fingerprinting method is a reliable technique for detecting members of large, highly divergent protein super-families.
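To make the notion of a PROSITE pattern more concrete, the sketch below translates a pattern written in PROSITE's pattern notation into a Python regular expression and scans a sequence with it. The notation is not described in the slides, so treat the following summary as background: elements are separated by "-", "x" matches any residue, "[ST]" lists allowed residues, "{P}" lists forbidden residues, "(n)" or "(n,m)" gives a repeat count, and "<" or ">" anchors the pattern to the start or end of the sequence. The function names and the toy sequence are hypothetical, and rarer constructs (such as ">" inside square brackets) are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")

        # Optional repeat count, e.g. x(3) or x(2,4).
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element = counted.group(1)
            repeat = "{" + counted.group(2) + "}"

        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # forbidden residues
        else:
            core = element                       # one residue or an [ABC] set

        parts.append(("^" if anchored_start else "") + core + repeat
                     + ("$" if anchored_end else ""))
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    # The lookahead lets the scan report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# The N-glycosylation site pattern, applied to a made-up toy sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("MNGTANPSA", "N-{P}-[ST]-{P}."))
```

A short pattern like this one matches many proteins by chance, which is why ScanProsite offers an option to exclude patterns with a high probability of occurrence. Profiles and fingerprints use more context than a single short pattern.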
PFAM
http://www.sanger.ac.uk/Software/Pfam/
• Database of multiple sequence alignments and HMMs of protein domains and families.
• Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family.
• The construction and use of Pfam is tightly tied to the HMMER software package.

PFAM, cont’d
• Composed of two sets of families:
  – Pfam-A: the curated part, containing over 8296 protein families
  – Pfam-B: an automatically generated supplement containing a large number of small families, taken from the PRODOM database, that do not overlap with Pfam-A (lower quality)

PFAM, cont’d
Each family has the following data:
• A seed alignment, which is a hand-edited multiple alignment representing the family.
• Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is in ls mode (global), the other is in fs mode (local).
• A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
• Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.

[Screenshots: a PFAM entry; a PFAM entry, cont’d; PFAM searches; PFAM results]

PRODOM
http://www.toulouse.inra.fr/prodom.html
• Database of protein domain families automatically generated from the SWISS-PROT and TrEMBL databases by sequence comparison.
• Useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins.
• Contains (release 2003.1) 144,444 domain families containing two or more individual domains.

SMART
http://smart.embl-heidelberg.de/
Simple Modular Architecture Research Tool
• Allows the identification and annotation of protein domains and the analysis of domain architectures.
• The current release has more than 600 domain families represented among nuclear, signalling and extracellular proteins.
• Extensive annotation is available for each domain family, providing information on function, subcellular localization, phyletic distribution and tertiary structure, with links to OMIM in cases where a human disease is associated with one or more mutations in a particular domain.

Exercise 3 – Domain search
1. Go to the PROSITE site.
2. Under "Tools for PROSITE" choose ScanProsite.
3. Paste the sequence below into the box and tick the option "Exclude patterns with a high probability of occurrence" (finding very common patterns will not tell you much about your protein).

MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH

4. Start the scan. Which motifs are found?

Exercise 4 – Domain search
1. Go to the Pfam site.
2. Click "Search by protein name or sequence".
3. Paste the sequence below into the box and choose "Both Global and Fragment Pfam search".
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH

4. Search Pfam.
   1. Which domains are found?
   2. What may be the function of this protein?

Exercise 5: Blast searches on your computer
1. Download the file blast-2.2.14-ia32-linux.tar.gz from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST
2. Make a subdirectory in your home directory:
   mkdir ~/blast
3. Move the blast file there:
   mv blast-2.2.14-ia32-linux.tar.gz ~/blast/
4. Go to the blast directory:
   cd ~/blast/
5. Unzip the file:
   gunzip blast-2.2.14-ia32-linux.tar.gz
6. Unpack it:
   tar -xvf blast-2.2.14-ia32-linux.tar

Exercise 5: Blast searches, cont’d
7. Get the first 100 human proteins in Swissprot:
   - go to http://www.expasy.org/srs5/
   - click on Start
   - unmark TREMBL, to search only in Swissprot
   - press Continue
   - in the first Info line select “Organism” and type in “human”
   - press “Do Query”; this will retrieve all human proteins in Swissprot in batches of 100
   - press Save
   - change the view to FastaSeqs
   - change the Sequence Format to fasta
   - press SAVE
8. Save the file, e.g. as 100seq.fa
9. Format your database of 100 sequences to make it searchable by blast:
   ~/blast/blast-2.2.14/bin/formatdb -i 100seq.fa
10. Now that you have a searchable database, you can search it with an input sequence of your choice. For example, make a file from the first sequence in 100seq.fa: select the first sequence with the mouse, type
   cat > seq1.fa
   paste the sequence into the terminal, then press <Ctrl-d>.
11. Now you have an input sequence and a database. Type:
   ~/blast/blast-2.2.14/bin/blastall -p blastp -i seq1.fa -d 100seq.fa -o seq1-vs-100seq.blastp
12. When the run has finished (it takes almost no time), your output is in the file seq1-vs-100seq.blastp. If you invoke the blastall program without the “switches”, it will list all the options you can use.
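As an alternative to copying the first sequence by hand with the mouse, a short Python sketch can pull the first record out of a FASTA file and write it to a new file. The file names 100seq.fa and seq1.fa are the ones used in the exercise; the function names read_fasta and write_first_record are hypothetical, and the parser assumes a plain FASTA file in which each record starts with a ">" header line.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_first_record(in_path, out_path, width=60):
    """Copy the first FASTA record of in_path to out_path.

    Returns (header, sequence length) of the copied record.
    """
    records = read_fasta(in_path)
    try:
        first = next(records, None)
    finally:
        records.close()  # stop reading and close the input file
    if first is None:
        raise ValueError(f"no FASTA records found in {in_path}")
    header, sequence = first
    with open(out_path, "w") as out:
        out.write(f">{header}\n")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
    return header, len(sequence)
```

In the exercise directory, calling write_first_record("100seq.fa", "seq1.fa") would produce the query file used by the blastall command. The same read_fasta loop can also be used to count the records, to confirm that the download really contains 100 sequences.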