Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
United Kingdom National DNA Database wikipedia , lookup
Epitranscriptome wikipedia , lookup
Non-coding RNA wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Metagenomics wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Helitron (biology) wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Lab7 QRNA, HMMER, PFAM Sean Eddy’s Lab • http://selab.janelia.org/software.html HMMER Introduction • HMMER2 is an implementation in UNIX (Linux, MacOS) platform of profile hidden Markov model, whose source code, executables, and user guide can be downloaded from http://hmmer.janelia.org/ • The experiment of HMMER is to look for known domains in a query sequence by searching a single sequence again a library of HMMs. • One such library is PFAM, and you can also create your own library using HMMER HMMER executables 1. hmmalign ‐ Align sequences to an existing model. 2. hmmbuild ‐ Build a model from a multiple sequence alignment. 3. hmmcalibrate ‐ Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E‐values). 4. hmmconvert ‐ Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles. 5. hmmemit ‐ Emit sequences probabilistically from a profile HMM. 6. hmmfetch ‐ Get a single model from an HMM database. 7. hmmindex ‐ Index an HMM database. 8. hmmpfam ‐ Search an HMM database for matches to a query sequence. 9. hmmsearch ‐ Search a sequence database for matches to an HMM. Installation • Simple installation – – – • Download the current version of HMMER “hmmer‐2.3.2.bin.intel‐linux.tar.gz” from http://hmmer.janelia.org/#download; Unpack the software by typing “tar –xvf hmmer‐2.3.2.bin.intel‐linux.tar.gz” in the command line. You will see a new directory “hmmer‐2.3.2.bin.intel‐linux”. Enter the directory of hmmer‐2.3.2.bin.intel‐linux. You will see NINE executables ready in the subdirectory “/binaries”, and also nine files in the subdirectory “/tutorial”; Installation from source code – – – – – – – – – Download the current HMMER source code version “hmmer‐2.3.2.tar.gz” from http://hmmer.janelia.org/#download; Create a new directory in your Linux account and upload or move the software package to the directory; Unpack the software by typing “tar –xvf hmmer‐2.3.2.tar.gz” in the command line; Type “cd hmmer‐2.3.2” to enter the software directory; Type “./configure” to configure for your system and build the programs; Type “make” to generate the executables; Type “make check” to run the automated test suite; (This is optional but recommended, and all these tests should pass); Please note that by default programs are in “/usr/local/bin/” and man pages are in “/usr/local/man/man1”; Type “make install” to install all executables; Hmmbuild : build a profile HMM from an aignment • hmmbuild [options] hmmfile alignfile • • hmmbuild test.hmm test.aln hmmbuild -h • hmmbuild reads a multiple sequence alignment file alignfile , builds a new profile HMM, and saves the HMM in hmmfile. • alignfile may be in ClustalW, GCG MSF, or SELEX alignment format. • By default, the model is configured to find one or more non-overlapping alignments to the complete model. • To configure the model for a single global alignment, use the -g option; • To configure the model for multiple local alignments, use the -f option; • To configure the model for a single local alignment (standard Smith/Waterman), use the -s option. Hmmcalibrate: calibrate HMM search statistics • hmmcalibrate [options] hmmfile • • hmmcalibrate test.hmm Hmmcalibrate -h • hmmcalibrate reads an HMM file from hmmfile, scores a large number of synthesized random sequences with it, fits an extreme value distribution (EVD) to the histogram of those scores, and re-saves hmmfile now including the EVD parameters. • This step is optional, but it will increase the sensitivity of your database search • hmmcalibrate may take several minutes (or longer) to run. While it is running, a temporary file called hmmfile.xxx is generated in your working directory. • If you abort hmmcalibrate prematurely (ctrl-C, for instance), your original hmmfile will be untouched, and you should delete the hmmfile.xxx temporary file. Hmmsearch: - search a sequence database with a profile HMM • hmmsearch [options] hmmfile seqfile • • hmmsearch test.hmm query.faa > query.faa.domain hmmsearch -h • hmmsearch reads an HMM from hmmfile and searches seqfile for significantly similar sequence matches. • hmmsearch may take minutes or even hours to run, depending on the size of the sequence database. It is a good idea to redirect the output to a file. • The output consists of four sections: • • • • • a ranked list of the best scoring sequences, a ranked list of the best scoring domains, alignments for all the best scoring domains, and a histogram of the scores. A sequence score may be higher than a domain score for the same sequence if there is more than one domain in the sequence; the sequence score takes into account all the domains. All sequences scoring above the -E and -T cutoffs are shown in the first list, then every domain found in this list is shown in the second list of domain hits. If desired, E-value and bit score thresholds may also be applied to the domain list using the -domE and -domT options. PFAM Pfam 23.0 (July 2008, 10340 families) • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. • The identification of domains that occur within proteins can therefore provide insights into their function. • There are two components to Pfam: Pfam-A and Pfam-B. – – Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. – • Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. (see PfamC) Sequence analysis with HMM • ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam23.0/ to download files “Pfam_fs.gz” and “Pfam_ls.gz” – Pfam_ls - All global (ls mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. – Pfam_fs - All local (fs mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. • Data location – /home/kwchoi/public_html/I529-09-lab/Lab7/Data/PFAM_data/ – Copy to your working directory or make symbolic link • To search for domains in “test.faa” in the global sequence database, type – hmmpfam Pfam_fs test.faa > test.faa.pfam” • The results is logged into an output file “test.faa.pfam”; QRNA QRNA • QRNA is a prototype structural noncoding RNA genefinder tools for detecting novel structural RNA genes. • It uses three probabilistic "pair-grammars": • a pair stochastic context free grammar modeling alignments constrained by structural RNA evolution, • a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and • a pair hidden Markov model modeling a null hypothesis of positionindependent evolution. • Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes), it classify the alignment into the coding, RNA, or null class according to the posterior probability of each class. Local Installation • The latest version of QRNA (qrna-2.0.3c.tar.gz ) can be download from ftp://selab.janelia.org/pub/software/qrna/ • Configure QRNA and install • • • • • • • • tar -xvf qrna-2.0.3c.tar cd qrna-2.0.3c cd squid make cd ../squid02 make cd ../src make • QRNA is installed in • /home/kwchoi/Installed/qrna-2.0.3c/ • Data location: – /home/kwchoi/public_html/I529-09lab/Lab7/Data/QRNA_data/ • blastn2qrnadepth.pl – /home/kwchoi/Installed/qrna2.0.3c/src/scripts/blastn2qrnadepth.pl -g human HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.gff • HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.rep • Simple test – Set the running enveriment and run a simple example: • export QRNADB=/home/kwchoi/Installed/qrna-2.0.3c/lib • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a 5s_rRNA.q > 5s_rRNA.q.eqrna QRNA demo • Option -C shuffles the columns of the pairwise alignment while maintaining the gap and conserved structure of the original alignment. Compare the two results of using -C and without using -C: – • Example using the scanning version with a window. Consider file “Scerevisiae orf v other yeasts.q” which contains an alignment of a S. cerevisiae ORF The alignment has 514 nucleotides, and we would like to score it with eqrna using a window of 150 nucleotides, and moving the window 50 nucleotides each time: – • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -w 150 -x 50 Scerevisiae_orf_v_other_yeasts.q > Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna example start with a blastn output: – • /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a -C 5s_rRNA.q > 5s_rRNA.q.con_shuffle.eqrna /home/kwchoi/Installed/qrna-2.0.3c/src/eqrn -w 50 -x 50 HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.W150.X50.eqrna The results files are shown as following: – – – 5s_rRNA.q.eqrna 5s_rRNA.q.con_shuffle.eqrna Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna