August 4, 2014
Sea fan transcriptome work: Blasting the data

Before this step, Steven did the de novo assembly in CLC (a program, as opposed to Trinity, which is free)
- Look up: http://en.wikipedia.org/wiki/De_novo_transcriptome_assembly

Get on to ipython notebook
- Open Terminal, type "ipython notebook" and press enter

Blasting the transcriptome
- The overall goal is to make the annotations clear in a good figure (later we'll think about that)
- mRNA was isolated and 6 libraries were made from 3 control and 3 infected individuals
- A bunch of fastq files are combined to make contigs
  o CLC, Trinity (freeware), etc.; usually takes time
  o Today we'll start with the unknown contigs and try to annotate them
  o Using Terminal (Mac) to blast 30K sequences as we can't do it indirectly; you can do almost anything through this
- Commands, and more, are posted online*; you can find more by just googling them
  o Write "!" before all commands
  o fgrep: counts the lines that contain a given string; !fgrep -c ">" filename.fa = -c means Count, and in quotations is what you're counting (the greater-than sign, which starts every sequence header in a fasta file)
  o !wc = word count (with -l it counts lines)
  o !head = shows the first 10 lines of a file
  o !awk '{print $3, $1}' filename | sort -g => prints columns 3 and 1 of every line and sorts them numerically; if column 3 holds the sequence length, this lists all the sequences by length
    awk can do a lot of other things, too
  o !perl on a fasta file allows you to bin it by sizes (the lengths of the contigs in a histogram)
- Public files can be made available via nbviewer, i.e. via GitHub
- Create a blast database called "Db", so we can blast against this database within the computer to get at those 30K sequences (otherwise we'd have to split it)
  o Create by doing "mkdir" (make directory)
  o Doesn't allow you to do it in the blast Applications folder, so do it on the desktop; you have to do the command "cd /Users/fhlguest/Desktop" exactly like that
  o Then we moved it to maxene (server) under the Allison folder and did the cd [drag file] to set the directory again
  o Maxene: diseases; biol533 password
- dhcp157:~ fhlguest$ cd .. = go up a directory
- Launch iPython notebook from your own directory on maxene
  o Then every notebook in ipython will be saved there
  o Make sure you're in the Allison folder in maxene
  o Type "ipython notebook" and press enter to LAUNCH

In ipython
  o Click "new notebook"
  o Use iPython to run scripts and commands, but you can also type annotations by setting a cell to markdown or heading; Steven uses headings to keep things separate
    Type text; choose heading or markdown; press "Shift+Enter"

What do we want to blast against?
  o E.g. Swiss-Prot UniProt: go to uniprot.org in the browser and find it
  o Use as a first pass; it's a protein database and we want to use that for annotations since there's more conservation at the protein level
    Can download directly, OR
    Make a directory called blast (mkdir blast) and use the weblink to the "fasta" file (orange button) for SwissProt to get the address
    Load into ipython: "!curl -O [website]"
    Then it loads
    IF you make a mistake, go to "kernel" on the toolbar in ipython and press "interrupt" to stop it; or go back to terminal and type ctrl+c to kill the server
    Or you can download the fasta.gz file and put it directly into the "db" folder under Maxene
  o Make directories in Allison (using iPython): mkdir Output; mkdir Query

Go back into blast in Applications
  o "cd /Applications/blast" and go into the bin folder "cd bin"
  o We should be able to run blast from there, but on some computers in this lab you have to use the direct path: "!/Applications/blast/bin/blastx -h"
    We set a variable to make this easier next time (a shortcut): bd='/Applications/blast/bin/'
  o Now that works, we want to make a blast database out of the SwissProt database that we downloaded
    Unzip the fasta file of SwissProt in maxene
    !{bd}makeblastdb -h = gives you all the required and optional command elements
    -help will give you more information
    Required: the input file, that it's a protein sequence, and the output directory
    Troubleshooting: put a backslash after each line* (see code example)
      Make sure to have a space before each backslash
      No space after the "!"
      And if it still doesn't work, check for extra spaces in the wrong place*
    Success => creates 3 files in the correct output
  o Now do the blastx
    Required
      Query = the file you're going to blast, in fasta format
      Db = the database we just created with "makeblastdb"; referenced by its output name uniprot_sprot
      Output = goes to the blast folder in the Allison folder
    Optional but important
      Output format: 6 means tab-delimited txt file
      E-value at 1E-20
      Max_target_seqs = 1, the best match; good since we have 30,000
      Num_threads = how many threads are processing it (2 of either 2 or 4 CPUs depending on the computer); processing power/speed
  o If running it alone you could try 8 CPUs
  o Run the file and wait (days to run)
    Makes a file in the specified directory; can open with TextWrangler
    Indicates it's RUNNING with an asterisk next to the input line
- Used Steven's files for Blast results instead of waiting 3 days
  o Cleaned up the output file in some steps
  o We went through those steps ("!sed …" command) and the new output is in Lauren's directory = seastar_clc_uniprot_sprot_1_new.tab
  o If you don't know what a command is, put it into explainshell.com!

Working with the cleaned, annotated file
  o What might we want to know? We can group them by where they came from
  o But more importantly, group by what the genes DO; that's information that's coming in from the Swiss Prot database
  o JOIN (like the offset command in Excel: join the files where two things match, usually an ID)
    Google SQLShare, go to "TrySQLShare", sign in with gmail or UW id
    Find the file and separate by "pipe" (the vertical straight line)
    Take Lauren's file with pipes, output
    Use the SwissProt ID as the common marker to JOIN the two files (e.g. sp|P25001|), but you need to make the ID itself its OWN column without the extra text
    E.g. "sp|P25001|COX1_PISOC…" needs to be "sp|P25001|"

Annotation

This is my notebook.

In [1]: pwd
Out[1]: u'/Volumes/Students/diseases/Allison'

In [2]: mkdir Output

In [3]: mkdir Query

In [4]: mkdir blast

In [5]: pwd
Out[5]: u'/Volumes/Students/diseases/Allison'

In [6]: cd blast
/Volumes/Students/diseases/Allison/blast

In [7]: cd db
[Errno 2] No such file or directory: 'db'
/Volumes/Students/diseases/Allison/blast

In [8]: cd ..
/Volumes/Students/diseases/Allison

In [9]: cd db
/Volumes/Students/diseases/Allison/db

In [10]: !curl -O ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 78.2M  100 78.2M    0     0   284k      0  0:04:41  0:04:41 --:--:--  275k
100 78.2M  100 78.2M    0     0   283k      0  0:04:42  0:04:42 --:--:--  283k

Now we've downloaded the SwissProt into db

Now we're going to blast

In [11]: cd bin
[Errno 2] No such file or directory: 'bin'
/Volumes/Students/diseases/Allison/db

In [12]: cd /Users/fhlguest

In [13]: cd /Volumes/Students/diseases/Allison
/Volumes/Students/diseases/Allison

In [14]: cd db
/Volumes/Students/diseases/Allison/db

In [15]: cd bin
[Errno 2] No such file or directory: 'bin'
/Volumes/Students/diseases/Allison/db

In [16]: cd /Applications/blast
/Applications/blast

In [17]: cd bin
/Applications/blast/bin

In [18]: ls
blast_formatter*     blastx*              makembindex*     tblastn*
blastdb_aliastool*   convert2blastmask*   makeprofiledb*   tblastx*
blastdbcheck*        deltablast*          psiblast*        update_blastdb.pl*
blastdbcmd*          dustmasker*          rpsblast*        windowmasker*
blastn*              legacy_blast.pl*     rpstblastn*
blastp*              makeblastdb*         segmasker*

In [19]: cd /Applications/blast
/Applications/blast

In [20]: ls
ChangeLog   README   doc/
LICENSE     bin/     ncbi_package_info

In [21]: cd bin
/Applications/blast/bin

In [22]: !blastx h
/bin/sh: blastx: command not found

In [23]: !blastx -h
/bin/sh: blastx: command not found

In [24]: pwd
Out[24]: u'/Applications/blast/bin'

In [25]: !/Applications/blast/bin/blastx -h
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments

This was a work-around to use the direct path because it wouldn't let us do it the first way

Now we want to make a blast database out of the SwissProt database that we downloaded

In [26]: bd='/Applications/blast/bin/'

In [27]: bd
Out[27]: '/Applications/blast/bin/'

bd is a shortcut (or a "variable") so we can just type bd to get that

In [28]: !{bd}makeblastdb -h
USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-taxid TaxID] [-taxid_map TaxIDMapFile]
    [-logfile File_Name] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.2.29+

Use '-help' to print detailed descriptions of command line arguments

In [29]: !{bd}makeblastdb \
         -in /Volumes/Students/diseases/Allison/uniprot_sprot.fasta \
         -dbtype prot \
         -out /Volumes/Students/diseases/Allison/Output/uniprot_sprot
Building a new DB, current time: 08/04/2014 09:51:28
New DB name:   /Volumes/Students/diseases/Allison/Output/uniprot_sprot
New DB title:  /Volumes/Students/diseases/Allison/uniprot_sprot.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 546000 sequences in 176.893 seconds.

In [32]: !{bd}blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.2.29+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION; ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number; ignore other arguments

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand <String, `both', `minus', `plus'>
   Query strand(s) to search against database/subject
   Default = `both'
 -query_gencode <Integer, values between: 1-6, 9-16, 21-25>
   Genetic code to use to translate query (see user manual for details)
   Default = `1'

 *** General search options
 -db <String>
   BLAST database name
    * Incompatible with: subject, subject_loc
 -out <File_Out>
   Output file name
   Default = `-'
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = `10'
 -word_size <Integer, >=2>
   Word size for wordfinder algorithm
 -gapopen <Integer>
   Cost to open a gap
 -gapextend <Integer>
   Cost to extend a gap
 -max_intron_length <Integer>
   Length of the largest intron allowed in a translated nucleotide sequence
   when linking multiple distinct alignments (a negative value disables
   linking)
   Default = `0'
 -matrix <String>
   Scoring matrix name (normally BLOSUM62)
 -threshold <Real, >=0>
   Minimum word score such that the word is added to the BLAST lookup table
 -comp_based_stats <String>
   Use composition-based statistics:
       D or d: default (equivalent to 2)
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t: Composition-based score adjustment as in Bioinformatics
   21:902-911, 2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
   2005, unconditionally
   Default = `2'

 *** BLAST-2-Sequences options
 -subject <File_In>
   Subject sequence(s) to search
    * Incompatible with: db, gilist, seqidlist, negative_gilist,
   db_soft_mask, db_hard_mask
 -subject_loc <String>
   Location on the subject sequence in 1-based offsets (Format: start-stop)
    * Incompatible with: db, gilist, seqidlist, negative_gilist,
   db_soft_mask, db_hard_mask, remote

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with: max_target_seqs
 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with: max_target_seqs
 -html
   Produce HTML output?

 *** Query filtering options
 -seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 *** Restrict search or results
 -gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with: negative_gilist, seqidlist, remote, subject,
   subject_loc
 -seqidlist <String>
   Restrict search of database to list of SeqId's
    * Incompatible with: gilist, negative_gilist, remote, subject,
   subject_loc
 -negative_gilist <String>
   Restrict search of database to everything except the listed GIs
    * Incompatible with: gilist, seqidlist, remote, subject, subject_loc
 -entrez_query <String>
   Restrict search with the given Entrez query
    * Requires: remote
 -db_soft_mask <String>
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with: db_hard_mask, subject, subject_loc
 -db_hard_mask <String>
   Filtering algorithm ID to apply to the BLAST database as hard masking
    * Incompatible with: db_soft_mask, subject, subject_loc
 -culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with: best_hit_overhang, best_hit_score_edge
 -best_hit_overhang <Real, (>=0 and =<0.5)>
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with: culling_limit
 -best_hit_score_edge <Real, (>=0 and =<0.5)>
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with: culling_limit
 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with: num_descriptions, num_alignments

 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -max_hsps <Integer, >=0>
   Set maximum number of HSPs per subject sequence to save (0 means no limit)
   Default = `0'
 -sum_statistics
   Use sum statistics

 *** Search strategy options
 -import_search_strategy <File_In>
   Search strategy to use
    * Incompatible with: export_search_strategy
 -export_search_strategy <File_Out>
   File name to record the search strategy used
    * Incompatible with: import_search_strategy

 *** Extension options
 -xdrop_ungap <Real>
   X-dropoff value (in bits) for ungapped extensions
 -xdrop_gap <Real>
   X-dropoff value (in bits) for preliminary gapped extensions
 -xdrop_gap_final <Real>
   X-dropoff value (in bits) for final gapped alignment
 -window_size <Integer, >=0>
   Multiple hits window size, use 0 to specify 1-hit algorithm
 -ungapped
   Perform ungapped alignment only?

 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with: remote
 -remote
   Execute search remotely?
    * Incompatible with: gilist, seqidlist, negative_gilist, subject_loc,
   num_threads
 -use_sw_tback
   Compute locally optimal Smith-Waterman alignments?
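
The default tabular columns named in the help text above ('qseqid sseqid pident ... evalue bitscore') can be attached to the rows of a "-outfmt 6" file once blast finishes. This is only a rough sketch, not something we ran in class: the function name parse_outfmt6 and the short query name "contig_4" are made up for illustration, and it assumes every row has all 12 default columns.

```python
# Default column order for -outfmt 6, copied from the blastx help text.
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_outfmt6(lines):
    """Yield one dict per hit from tab-delimited BLAST output lines.

    Hypothetical helper: assumes the 12 default columns are all present.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_COLUMNS, line.split("\t")))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        yield hit


# One row in the default layout; "contig_4" is a shortened, made-up query name.
sample = [
    "contig_4\tsp|P25001|COX1_PISOC\t88.39\t517\t48\t1\t7061\t5547\t1\t517\t0.0\t749\n",
]
hits = list(parse_outfmt6(sample))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["evalue"])
```

To read the real file, pass an open file handle instead of the sample list, e.g. parse_outfmt6(open(path)) with path pointing at the .tab output.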
In [ ]: !{bd}blastx \
        -query /Volumes/Students/diseases/steven/Phel_transcriptome_clc.fa \
        -db /Volumes/Students/diseases/Allison/Output/uniprot_sprot \
        -outfmt 6 \
        -out /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab \
        -evalue 1E-20 \
        -max_target_seqs 1 \
        -num_threads 2

In [ ]: !head Phel_transcriptome_clc.fa

word count with -l = number of lines blast has written to the output so far

In [ ]: !wc -l /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab

In [ ]:

Common codes and cleaning up the Blasted file

Here we're using the blasted transcriptome and checking it with different codes

wc is word count (with -l it counts lines), cd is change directory

In [1]: !wc -l /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab
31 /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab

In [2]: cd /Volumes/Students/diseases/steven
/Volumes/Students/diseases/steven

fgrep is a way to grab information; this command counts the lines in the file that contain the > sign

In [11]: !fgrep -c ">" /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
0

In [13]: !wc -l /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
8913 /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab

In [14]: !perl -h
Usage: perl [switches] [--] [programfile] [arguments]
  -0[octal]         specify record separator (\0, if no argument)
  -a                autosplit mode with -n or -p (splits $_ into @F)
  -C[number/list]   enables the listed Unicode features
  -c                check syntax only (runs BEGIN and CHECK blocks)
  -d[:debugger]     run program under debugger
  -D[number/list]   set debugging flags (argument is a bit mask or alphabets)
  -e program        one line of program (several -e's allowed, omit programfile)
  -E program        like -e, but enables all optional features
  -f                don't do $sitelib/sitecustomize.pl at startup
  -F/pattern/       split() pattern for -a switch (//'s are optional)
  -i[extension]     edit <> files in place (makes backup if extension supplied)
  -Idirectory       specify @INC/#include directory (several -I's allowed)
  -l[octal]         enable line ending processing, specifies line terminator
  -[mM][-]module    execute "use/no module..." before executing program
  -n                assume "while (<>) { ... }" loop around program
  -p                assume loop like -n but print line also, like sed
  -s                enable rudimentary parsing for switches after programfile
  -S                look for programfile using PATH environment variable
  -t                enable tainting warnings
  -T                enable tainting checks
  -u                dump core after parsing program
  -U                allow unsafe operations
  -v                print version, subversion (includes VERY IMPORTANT perl info)
  -V[:variable]     print configuration summary (or a single Config.pm variable)
  -w                enable many useful warnings (RECOMMENDED)
  -W                enable all warnings
  -x[directory]     strip off text before #!perl line and perhaps cd to directory
  -X                disable all warnings

Head shows the first 10 lines of the file - useful to check what it looks like. This needs to be cleaned up!

In [1]: !head /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_4   sp|P25001|COX1_PISOC   88.39   517   48   1   7061   5547   1   517   0.0   749
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_7   sp|Q33818|CYB_ASTPE   79.94   329   66   0   993   7   214
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_9   sp|Q0MVN8|QOR_PIG   45.61   239   129   1   796   80   22   99.8
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_18   sp|P96202|PPSC_MYCTU   30.81   714   438   15   5407   3386   1414   2111   6e-76   286
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_20   sp|P46058|EDSP_CYNPY   31.03   348   218   8   1731   703   149   450
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_24   sp|P63245|GBLP_RAT   80.77   312   60   0   1032   97
339 51 379 90 327 4 334 1 312 7061 4862 1177 5547 4407 638

Lauren cleaned it up using the sed command - I still have to figure out how that works

In [2]: !head /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab
Phe1_clc_contig_4    sp|P25001|COX1_PISOC    88.39   517   48    1
Phe1_clc_contig_8    sp|P68037|UB2L3_MOUSE   76.97   152   35    0
Phe1_clc_contig_17   sp|Q6DGL8|RT15_DANRE    35.00   180   107   3
Phe1_clc_contig_20   sp|P46058|EDSP_CYNPY    31.03   348   218   8   1731   703
Phe1_clc_contig_24   sp|P63245|GBLP_RAT      80.77   312   60    0   1032   97

In [8]: !fgrep -c "RAT" /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab
573

That counts all the lines that contain the string RAT - and you can group sequences this way for any other string

Use SQLShare to join a bunch of different big tables - in the cloud instead of on someone's computer

Joining files will allow us to link the SwissProt IDs to the transcriptome file to assign gene functions

This file is cleaned up to have the Swiss Prot ID (e.g. sp|P25001|) separate from the other information, so we can match it to the ID in the database and join the files to attach Swiss Prot gene function information to our blasted transcriptome sequences

In [ ]: !tr '|' "\t" < /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab > /Volumes/Students/diseases/Allison/seastar_clc_uniprot_sprot_3.tab

We'll pick up next time with joining the files using SQLShare

In [ ]:
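
For reference, the "make the ID its own column, then JOIN on it" idea can also be sketched in plain Python instead of SQLShare. This is an illustration only, not the SQLShare workflow: the function names are made up, and the annotation table and its text are hypothetical placeholders (the real one would come from the Swiss Prot download).

```python
def split_swissprot_id(sseqid):
    """Split 'sp|P25001|COX1_PISOC' into (database, accession, entry name)."""
    database, accession, entry_name = sseqid.split("|")
    return database, accession, entry_name


def join_hits_to_annotations(hits, annotations):
    """Inner join: keep only hits whose accession is in the annotation table."""
    joined = []
    for contig, sseqid in hits:
        _, accession, entry_name = split_swissprot_id(sseqid)
        if accession in annotations:
            joined.append((contig, accession, entry_name, annotations[accession]))
    return joined


# Contig names and Swiss Prot IDs as they appear in the cleaned blast table.
hits = [
    ("Phe1_clc_contig_4", "sp|P25001|COX1_PISOC"),
    ("Phe1_clc_contig_24", "sp|P63245|GBLP_RAT"),
]

# Hypothetical annotation table keyed by accession; the text is a placeholder.
annotations = {"P25001": "placeholder annotation text"}

rows = join_hits_to_annotations(hits, annotations)
for row in rows:
    print("\t".join(row))
```

Only contig_4 comes out of the join here, because the made-up annotation table has no entry for P63245; that is how an inner join on the ID column behaves.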