August 4, 2014
Sea fan transcriptome work: Blasting the data
Before this step, Steven did the de novo assembly in CLC (a commercial program, as opposed to
Trinity, which is free)
- Look up - http://en.wikipedia.org/wiki/De_novo_transcriptome_assembly
Get on to ipython notebook
- Open Terminal, type "ipython notebook" and press enter
Blasting the transcriptome
- The overall goal is to make the annotations clear in a good figure (later we’ll
think about that)
- mRNA isolated and 6 libraries from 3 control and 3 infected individuals
- a bunch of fastq files, which are combined to make contigs
o CLC, Trinity (freeware), etc. – usually takes time
o Today we’ll start with the unknown contigs and try to annotate them
o Using Terminal (Mac) to blast 30K sequences as we can’t do it
indirectly – you can do almost anything through this
- Commands, and more posted online*; and you can find more by just googling
them
o Write “!” before all commands
o fgrep – finds the lines that contain a given string;
“!fgrep -c ">" filename.fa” = -c means Count, and what you're counting
goes in quotation marks (here the greater-than sign, so you get the number of sequences)
o !wc = word count (with -l it counts lines)
o !head = shows the first 10 lines of a file
o !awk '{print $3, $1}' filename | sort -g => prints columns 3 and 1 of each line,
sorted numerically; we used it to list the lengths of all the sequences
 awk can do a lot of other things, too
o !perl on a fasta file allows you to bin it by sizes (a histogram of the
contig lengths)
- Public files can be made available via nbviewer – i.e. via GitHub
- Create a blast database called “Db” – so we can blast against this database
within the computer to get at those 30K sequences (otherwise we’d have to
split it)
o Create by doing “mkdir” – make directory
o Doesn’t allow you to do it in the blast Applications folder, so do it on
the desktop – you have to do the command “cd
/Users/fhlguest/Desktop” exactly like that
o Then we moved it to maxene (server) under Allison folder and did the
cd [drag file] to set directory again
o Maxene – diseases; biol533 password
- dhcp157:~ fhlguest$ cd .. = go up a directory
- Launch iPython notebook from your own directory on maxene
o Then every notebook in ipython will be saved there
o Make sure you’re in the Allison folder in maxene
o Type “ipython notebook” and press enter to LAUNCH
In ipython
o Click “new notebook”
o Use iPython to run scripts and commands, but you can also type
annotations by making a cell markdown or a heading – Steven uses
headings to keep things separate
 Type text; choose heading or markdown; press “Shift+Enter”
What do we want to blast against?
o E.g. Swiss-Prot (UniProt) – go to uniprot.org in browser and find it
o Use as a first pass; it’s a protein database and we want to use that for
annotations since there’s more conservation at the protein level
 Can download directly OR
 Make a directory called blast (mkdir blast) and use the weblink
to the “fasta” file (orange button) for SwissProt to get the address
 Load into ipython: “!curl -O [website]”
 Then it loads
 IF you make a mistake – go to “kernel” on toolbar in
ipython and press “interrupt” to stop it; or go back to
terminal and type ctrl+c to kill the server
 Or you can download the fasta.gz file and put it directly
in to the “db” folder under Maxene
o Make directories in Allison (using iPython) – mkdir Output; mkdir
Query
Go back into blast in Applications
o “cd /Applications/blast” and go into bin folder “cd bin”
o We should be able to run blast from there, but in some computers in
this lab you have to use the direct path:
“!/Applications/blast/bin/blastx –h”
 We set a variable to make this easier next time (a shortcut) bd='/Applications/blast/bin/'
o Now that works, we want to make a blast database out of the
SwissProt database that we downloaded
 Unzip fasta file of SwissProt in maxene
 !{bd}makeblastdb -h = gives you all the required and optional
command elements
 -help will give you more information
 Required – the input file, that it’s a protein sequence, and the
output directory
 Troubleshooting –
 Put a backslash at the end of each line except the last* - see code example
 Make sure to have a space before each backslash
 No space after the “!”
 And if it still doesn’t work, check for extra spaces in
the wrong place*
 Success => creates 3 files in the correct output
o Now do the blastx
 Required
 Query = the file you’re going to blast, in fasta format
 Db = the database we just created with “makeblastdb”;
referenced by its output name uniprot_sprot
 Output = goes to the blast folder in the Allison folder
 Optional but important
 Output format – 6 means tab-delimited txt file
 E-value at 1E-20
 Max_target_seqs = 1, the best match; good since we
have 30,000
 Num_threads = how many threads process it, i.e. processing
power/speed (we used 2; the computers have either 2 or 4
CPUs)
o If running it alone you could try 8 CPUs
o Run the file and wait: (days to run)
 Makes file in the specified directory; can open with text
wrangler
 Indicates it’s RUNNING with an asterisk next to the input line
Used Steven’s files for Blast results instead of waiting 3 days
o Cleaned up output file in some steps
o We went through those steps – “!sed …” command – and new output
in Lauren’s directory = seastar_clc_uniprot_sprot_1_new.tab
o If you don’t know what a command does, put it into explainshell.com!
Working with the cleaned, annotated file
o What might we want to know? We can group them by where they
came from
o But more importantly, group by what the genes DO – that’s
information that’s coming in from the Swiss Prot database
o JOIN
 (Like the offset command in Excel – join the files where two
things match, usually an ID)
 Google SQLShare, go to “TrySQLShare”, then sign in with gmail
or UW id
 Find the file and separate by “pipe” (the vertical straight
line)
 Take Lauren’s file with pipes, output
 Use the SwissProt ID as the common marker to JOIN the two
files (e.g. sp|P25001|) – but you need to make the ID itself its
OWN column without the extra text

E.g. “sp|P25001|COX1_PISOC…” needs to be
“sp|P25001|”
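A minimal sketch of the JOIN idea in plain Python, assuming the hits and the annotations are already loaded as small in-memory tables. The helper name `accession_key` and the annotation text are hypothetical placeholders, not real UniProt data; SQLShare would do the same matching with SQL.

```python
def accession_key(subject_id):
    """Reduce 'sp|P25001|COX1_PISOC' to 'sp|P25001|' (database tag plus accession)."""
    db, accession = subject_id.split("|")[:2]
    return f"{db}|{accession}|"


# Hypothetical miniature tables; the descriptions are placeholders.
blast_hits = [
    ("Phe1_clc_contig_4", "sp|P25001|COX1_PISOC"),
    ("Phe1_clc_contig_24", "sp|P63245|GBLP_RAT"),
]
annotations = {
    "sp|P25001|": "placeholder description A",
    "sp|P63245|": "placeholder description B",
}

# Inner join: keep a hit only when its key is present in the annotation table.
joined = [
    (contig, accession_key(subject), annotations[accession_key(subject)])
    for contig, subject in blast_hits
    if accession_key(subject) in annotations
]

for row in joined:
    print("\t".join(row))
```

The key column has to be identical in both tables, which is why the extra text after the second pipe is dropped before matching.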
Annotation
This is my notebook.
In [1]:
pwd
Out[1]:
u'/Volumes/Students/diseases/Allison'
In [2]:
mkdir Output
In [3]:
mkdir Query
In [4]:
mkdir blast
In [5]:
pwd
Out[5]:
u'/Volumes/Students/diseases/Allison'
In [6]:
cd blast
/Volumes/Students/diseases/Allison/blast
In [7]:
cd db
[Errno 2] No such file or directory: 'db'
/Volumes/Students/diseases/Allison/blast
In [8]:
cd ..
/Volumes/Students/diseases/Allison
In [9]:
cd db
/Volumes/Students/diseases/Allison/db
In [10]:
!curl -O ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
  % Total    % Received % Xferd  Average Speed   Time     Time     Time     Current
                                 Dload  Upload   Total    Spent    Left     Speed
100 78.2M  100 78.2M    0     0   284k      0   0:04:41  0:04:41  --:--:--  275k
100 78.2M  100 78.2M    0     0   283k      0   0:04:42  0:04:42  --:--:--  283k
Now we've downloaded the SwissProt into db
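The download is a .gz file, and the notes say to unzip it before building the database. A minimal sketch of that step with Python's standard gzip module; the function name `gunzip` and the toy file are assumptions for illustration, and the real file would be uniprot_sprot.fasta.gz.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(gz_path):
    """Decompress 'name.gz' next to itself as 'name' and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # strips only the trailing '.gz'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Toy demonstration in a temporary directory (not the real SwissProt file).
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy.fasta.gz"
    with gzip.open(toy, "wt") as handle:
        handle.write(">seq1\nMKT\n")
    fasta = gunzip(toy)
    print(fasta.name)
    print(fasta.read_text())
```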
Now we're going to blast
In [11]:
cd bin
[Errno 2] No such file or directory: 'bin'
/Volumes/Students/diseases/Allison/db
In [12]:
cd
/Users/fhlguest
In [13]:
cd /Volumes/Students/diseases/Allison
/Volumes/Students/diseases/Allison
In [14]:
cd db
/Volumes/Students/diseases/Allison/db
In [15]:
cd bin
[Errno 2] No such file or directory: 'bin'
/Volumes/Students/diseases/Allison/db
In [16]:
cd /Applications/blast
/Applications/blast
In [17]:
cd bin
/Applications/blast/bin
In [18]:
ls
blast_formatter*    blastx*             makembindex*        tblastn*
blastdb_aliastool*  convert2blastmask*  makeprofiledb*      tblastx*
blastdbcheck*       deltablast*         psiblast*           update_blastdb.pl*
blastdbcmd*         dustmasker*         rpsblast*           windowmasker*
blastn*             legacy_blast.pl*    rpstblastn*
blastp*             makeblastdb*        segmasker*
In [19]:
cd /Applications/blast
/Applications/blast
In [20]:
ls
ChangeLog  LICENSE  README  bin/  doc/  ncbi_package_info
In [21]:
cd bin
/Applications/blast/bin
In [22]:
!blastx h
/bin/sh: blastx: command not found
In [23]:
!blastx -h
/bin/sh: blastx: command not found
In [24]:
pwd
Out[24]:
u'/Applications/blast/bin'
In [25]:
!/Applications/blast/bin/blastx -h
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments
This was a work-around to use the direct path because it wouldn't let us do it the first way
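The "command not found" message suggests that /Applications/blast/bin is not on the shell's PATH on this computer. A small sketch, assuming that explanation, of how one could check from Python and fall back to the direct path; `resolve_tool` is a hypothetical helper name.

```python
import os
import shutil


def resolve_tool(name, fallback_dir):
    """Return the tool's path from PATH if found, else the path inside fallback_dir."""
    found = shutil.which(name)
    if found is not None:
        return found
    return os.path.join(fallback_dir, name)


# Prints wherever blastx is found on PATH, or the direct path otherwise.
print(resolve_tool("blastx", "/Applications/blast/bin"))
```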
Now we want to make a blast database
out of the SwissProt database that we
downloaded
In [26]:
bd='/Applications/blast/bin/'
In [27]:
bd
Out[27]:
'/Applications/blast/bin/'
bd is a shortcut (or a "variable") so we can just type bd to get that
In [28]:
!{bd}makeblastdb -h
USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-taxid TaxID] [-taxid_map TaxIDMapFile]
    [-logfile File_Name] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.2.29+

Use '-help' to print detailed descriptions of command line arguments
In [29]:
!{bd}makeblastdb \
-in /Volumes/Students/diseases/Allison/uniprot_sprot.fasta \
-dbtype prot \
-out /Volumes/Students/diseases/Allison/Output/uniprot_sprot
Building a new DB, current time: 08/04/2014 09:51:28
New DB name: /Volumes/Students/diseases/Allison/Output/uniprot_sprot
New DB title: /Volumes/Students/diseases/Allison/uniprot_sprot.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 546000 sequences in 176.893 seconds.
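The notes say success means three files appear in the output location. A hedged sketch of a check for that, assuming the three files are the .phr, .pin and .psq files that this generation of BLAST+ writes for a single-volume protein database. The helper name `missing_db_files` is hypothetical, and the demonstration uses empty stand-in files in a temporary directory.

```python
import tempfile
from pathlib import Path

# Assumed file suffixes for a single-volume protein BLAST database.
PROTEIN_DB_SUFFIXES = (".phr", ".pin", ".psq")


def missing_db_files(db_prefix):
    """List expected protein-database files that are absent for this prefix."""
    prefix = str(db_prefix)
    return [
        prefix + suffix
        for suffix in PROTEIN_DB_SUFFIXES
        if not Path(prefix + suffix).exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "uniprot_sprot"
    print(len(missing_db_files(prefix)))  # nothing has been created yet
    for suffix in PROTEIN_DB_SUFFIXES:
        Path(str(prefix) + suffix).touch()  # empty stand-ins for real files
    print(missing_db_files(prefix))
```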
In [32]:
!{bd}blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.2.29+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION; ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number; ignore other arguments
 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand <String, `both', `minus', `plus'>
   Query strand(s) to search against database/subject
   Default = `both'
 -query_gencode <Integer, values between: 1-6, 9-16, 21-25>
   Genetic code to use to translate query (see user manual for details)
   Default = `1'
 *** General search options
 -db <String>
   BLAST database name
    * Incompatible with: subject, subject_loc
 -out <File_Out>
   Output file name
   Default = `-'
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = `10'
 -word_size <Integer, >=2>
   Word size for wordfinder algorithm
 -gapopen <Integer>
   Cost to open a gap
 -gapextend <Integer>
   Cost to extend a gap
 -max_intron_length <Integer>
   Length of the largest intron allowed in a translated nucleotide sequence
   when linking multiple distinct alignments (a negative value disables
   linking)
   Default = `0'
 -matrix <String>
   Scoring matrix name (normally BLOSUM62)
 -threshold <Real, >=0>
   Minimum word score such that the word is added to the BLAST lookup table
 -comp_based_stats <String>
   Use composition-based statistics:
       D or d: default (equivalent to 2 )
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
           21:902-911, 2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
           2005, unconditionally
   Default = `2'
 *** BLAST-2-Sequences options
 -subject <File_In>
   Subject sequence(s) to search
    * Incompatible with: db, gilist, seqidlist, negative_gilist,
   db_soft_mask, db_hard_mask
 -subject_loc <String>
   Location on the subject sequence in 1-based offsets (Format: start-stop)
    * Incompatible with: db, gilist, seqidlist, negative_gilist,
   db_soft_mask, db_hard_mask, remote
 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
     qseqid means Query Seq-id
     qgi means Query GI
     sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
     sgi means Subject GI
     sallgi means All subject GIs
     sacc means Subject accession
     saccver means Subject accession.version
     sallacc means All subject accessions
     slen means Subject sequence length
     qstart means Start of alignment in query
     qend means End of alignment in query
     sstart means Start of alignment in subject
     send means End of alignment in subject
     qseq means Aligned part of query sequence
     sseq means Aligned part of subject sequence
     evalue means Expect value
     bitscore means Bit score
     score means Raw score
     length means Alignment length
     pident means Percentage of identical matches
     nident means Number of identical matches
     mismatch means Number of mismatches
     positive means Number of positive-scoring matches
     gapopen means Number of gap openings
     frames means Query and subject frames separated by a '/'
     qframe means Query frame
     sframe means Subject frame
     btop means Blast traceback operations (BTOP)
     staxids means unique Subject Taxonomy ID(s), separated by a ';'
       (in numerical order)
     sscinames means unique Subject Scientific Name(s), separated by a ';'
     scomnames means unique Subject Common Name(s), separated by a ';'
     sblastnames means unique Subject Blast Name(s), separated by a ';'
       (in alphabetical order)
     sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
       (in alphabetical order)
     qcovs means Query Coverage Per Subject
     qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with: max_target_seqs
 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with: max_target_seqs
 -html
   Produce HTML output?
 *** Query filtering options
 -seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?
 *** Restrict search or results
 -gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with: negative_gilist, seqidlist, remote, subject,
   subject_loc
 -seqidlist <String>
   Restrict search of database to list of SeqId's
    * Incompatible with: gilist, negative_gilist, remote, subject,
   subject_loc
 -negative_gilist <String>
   Restrict search of database to everything except the listed GIs
    * Incompatible with: gilist, seqidlist, remote, subject, subject_loc
 -entrez_query <String>
   Restrict search with the given Entrez query
    * Requires: remote
 -db_soft_mask <String>
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with: db_hard_mask, subject, subject_loc
 -db_hard_mask <String>
   Filtering algorithm ID to apply to the BLAST database as hard masking
    * Incompatible with: db_soft_mask, subject, subject_loc
 -culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with: best_hit_overhang, best_hit_score_edge
 -best_hit_overhang <Real, (>=0 and =<0.5)>
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with: culling_limit
 -best_hit_score_edge <Real, (>=0 and =<0.5)>
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with: culling_limit
 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with: num_descriptions, num_alignments
 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -max_hsps <Integer, >=0>
   Set maximum number of HSPs per subject sequence to save (0 means no limit)
   Default = `0'
 -sum_statistics
   Use sum statistics
 *** Search strategy options
 -import_search_strategy <File_In>
   Search strategy to use
    * Incompatible with: export_search_strategy
 -export_search_strategy <File_Out>
   File name to record the search strategy used
    * Incompatible with: import_search_strategy
 *** Extension options
 -xdrop_ungap <Real>
   X-dropoff value (in bits) for ungapped extensions
 -xdrop_gap <Real>
   X-dropoff value (in bits) for preliminary gapped extensions
 -xdrop_gap_final <Real>
   X-dropoff value (in bits) for final gapped alignment
 -window_size <Integer, >=0>
   Multiple hits window size, use 0 to specify 1-hit algorithm
 -ungapped
   Perform ungapped alignment only?
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with: remote
 -remote
   Execute search remotely?
    * Incompatible with: gilist, seqidlist, negative_gilist, subject_loc,
   num_threads
 -use_sw_tback
   Compute locally optimal Smith-Waterman alignments?
In [ ]:
!{bd}blastx \
-query /Volumes/Students/diseases/steven/Phel_transcriptome_clc.fa \
-db /Volumes/Students/diseases/Allison/Output/uniprot_sprot \
-outfmt 6 \
-out /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab \
-evalue 1E-20 \
-max_target_seqs 1 \
-num_threads 2
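The same blastx call can also be assembled as an argument list in Python, which avoids the backslash and stray-space problems of a multi-line shell command. This is only a sketch: `build_blastx_command` is a hypothetical helper, it uses just the flags from the cell above, and it prints the command rather than running it.

```python
import shlex


def build_blastx_command(bd, query, db, out,
                         evalue="1E-20", max_target_seqs=1, num_threads=2):
    """Return the blastx call as a list of arguments (nothing is executed here)."""
    return [
        bd + "blastx",
        "-query", query,
        "-db", db,
        "-outfmt", "6",  # tab-delimited output
        "-out", out,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-num_threads", str(num_threads),
    ]


cmd = build_blastx_command(
    bd="/Applications/blast/bin/",
    query="/Volumes/Students/diseases/steven/Phel_transcriptome_clc.fa",
    db="/Volumes/Students/diseases/Allison/Output/uniprot_sprot",
    out="/Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab",
)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it, one could pass cmd to subprocess.run(cmd, check=True).
```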
In [ ]:
!head Phel_transcriptome_clc.fa
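The notes mention counting the ">" lines of a fasta file and binning the contigs by length. A rough Python sketch of both, using a made-up three-contig string in place of the real contig file; the function names and the bin size are assumptions for illustration.

```python
from collections import Counter
from io import StringIO


def read_fasta_lengths(handle):
    """Return {record_id: sequence_length} for a FASTA-formatted text handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is the first whitespace-separated token.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def bin_lengths(lengths, bin_size=500):
    """Count contigs per length bin; each key is the lower edge of its bin."""
    return Counter((n // bin_size) * bin_size for n in lengths.values())


# Hypothetical toy data standing in for the real contig file.
toy_fasta = StringIO(">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n>contig_3\nACGTACGTAC\n")
lengths = read_fasta_lengths(toy_fasta)
print(len(lengths))  # number of sequences, like fgrep -c ">"
print(sorted(bin_lengths(lengths, bin_size=5).items()))
```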
Word count with -l = number of lines (hits) that blast has written to the output so far
In [ ]:
!wc -l /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab
In [ ]:
Common codes and cleaning up the
Blasted file
Here we're using the blasted transcriptome and checking it with different codes
wc is word count (with -l it counts lines), cd is change directory
In [1]:
!wc -l /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab
31 /Volumes/Students/diseases/Allison/blast/Phel_trans_blastx_uniprot.tab
In [2]:
cd /Volumes/Students/diseases/steven
/Volumes/Students/diseases/steven
fgrep is a way to grab information; this command searches the file for the > sign
In [11]:
!fgrep -c ">" /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
0
In [13]:
!wc -l /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
8913 /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
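For comparison, the two counts above can be reproduced in Python. This sketch uses a made-up two-line table instead of the real file, and `count_lines` is a hypothetical helper; it also shows why searching a .tab results file for ">" gives 0, since that character only marks headers in fasta files.

```python
from io import StringIO


def count_lines(handle, needle=None):
    """Count all lines (like wc -l) or only lines containing needle (like fgrep -c)."""
    return sum(1 for line in handle if needle is None or needle in line)


# Toy stand-in for a tab-delimited blast result file.
toy_table = "contig_4\tsp|P25001|COX1_PISOC\t88.39\ncontig_24\tsp|P63245|GBLP_RAT\t80.77\n"

print(count_lines(StringIO(toy_table)))               # every line
print(count_lines(StringIO(toy_table), needle=">"))   # lines containing ">"
```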
In [14]:
!perl -h
Usage: perl [switches] [--] [programfile] [arguments]
  -0[octal]         specify record separator (\0, if no argument)
  -a                autosplit mode with -n or -p (splits $_ into @F)
  -C[number/list]   enables the listed Unicode features
  -c                check syntax only (runs BEGIN and CHECK blocks)
  -d[:debugger]     run program under debugger
  -D[number/list]   set debugging flags (argument is a bit mask or alphabets)
  -e program        one line of program (several -e's allowed, omit programfile)
  -E program        like -e, but enables all optional features
  -f                don't do $sitelib/sitecustomize.pl at startup
  -F/pattern/       split() pattern for -a switch (//'s are optional)
  -i[extension]     edit <> files in place (makes backup if extension supplied)
  -Idirectory       specify @INC/#include directory (several -I's allowed)
  -l[octal]         enable line ending processing, specifies line terminator
  -[mM][-]module    execute "use/no module..." before executing program
  -n                assume "while (<>) { ... }" loop around program
  -p                assume loop like -n but print line also, like sed
  -s                enable rudimentary parsing for switches after programfile
  -S                look for programfile using PATH environment variable
  -t                enable tainting warnings
  -T                enable tainting checks
  -u                dump core after parsing program
  -U                allow unsafe operations
  -v                print version, subversion (includes VERY IMPORTANT perl info)
  -V[:variable]     print configuration summary (or a single Config.pm variable)
  -w                enable many useful warnings (RECOMMENDED)
  -W                enable all warnings
  -x[directory]     strip off text before #!perl line and perhaps cd to directory
  -X                disable all warnings
Head shows the first 10 lines of the file - useful to check what it looks like. This needs to be cleaned up!
In [1]:
!head /Volumes/Students/diseases/steven/seastar_clc_uniprot_sprot_1.tab
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_4  sp|P25001|COX1_PISOC  88.39  517  48  1  7061  5547  1  517  0.0  749
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_7  sp|Q33818|CYB_ASTPE  79.94  329  66  0  993  7  214
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_9  sp|Q0MVN8|QOR_PIG  45.61  239  129  1  796  80  22  99.8
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_18  sp|P96202|PPSC_MYCTU  30.81  714  438  15  5407  3386  1414  2111  6e-76  286
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_20  sp|P46058|EDSP_CYNPY  31.03  348  218  8  1731  703  149  450
3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_24  sp|P63245|GBLP_RAT  80.77  312  60  0  1032  97  339
51
379
90
327
4
334
1
312
7061
4862
1177
5547
4407
638
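Each row of the tabular output follows the default column order listed in the blastx help ('qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'). A sketch of turning one row into named fields, using the values of the first hit above with the long contig name shortened to a hypothetical "contig_4"; `parse_hit` is an illustrative helper, not part of BLAST.

```python
STD_FIELDS = ("qseqid sseqid pident length mismatch gapopen "
              "qstart qend sstart send evalue bitscore").split()

FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_hit(line):
    """Split one tab-delimited hit line into a dict keyed by the default column names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(STD_FIELDS):
        raise ValueError(f"expected {len(STD_FIELDS)} columns, got {len(values)}")
    hit = dict(zip(STD_FIELDS, values))
    for name in FLOAT_FIELDS:
        hit[name] = float(hit[name])
    for name in INT_FIELDS:
        hit[name] = int(hit[name])
    return hit


# First hit from the output above; the query name is shortened for readability.
row = "\t".join(["contig_4", "sp|P25001|COX1_PISOC", "88.39", "517", "48", "1",
                 "7061", "5547", "1", "517", "0.0", "749"])
hit = parse_hit(row)
print(hit["sseqid"], hit["pident"], hit["evalue"] <= 1e-20)
```

With named fields it becomes easy to filter on evalue or to notice that qstart is larger than qend here, which is what a hit on the reverse strand looks like.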
Lauren cleaned it up using the sed command - I still have to figure out how that works
In [2]:
!head /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab
Phe1_clc_contig_4   sp|P25001|COX1_PISOC   88.39  517  48   1
Phe1_clc_contig_8   sp|P68037|UB2L3_MOUSE  76.97  152  35   0
Phe1_clc_contig_17  sp|Q6DGL8|RT15_DANRE   35.00  180  107  3
Phe1_clc_contig_20  sp|P46058|EDSP_CYNPY   31.03  348  218  8
Phe1_clc_contig_24  sp|P63245|GBLP_RAT     80.77  312  60   0
In [8]:
!fgrep -c "RAT" /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab
573
That counts all the lines that contain the string RAT (573 of them) - and you can count any other group the same way
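A plain text search for "RAT" matches those letters anywhere in the line, so a slightly stricter way to group hits by source organism is to take the code after the last underscore of the Swiss-Prot entry name. A sketch using the five subject IDs shown in the cleaned file; `species_code` is a hypothetical helper.

```python
from collections import Counter


def species_code(subject_id):
    """Return the part after the last underscore of the entry name, e.g. 'RAT'."""
    entry_name = subject_id.split("|")[2]
    return entry_name.rsplit("_", 1)[-1]


subject_ids = [
    "sp|P25001|COX1_PISOC",
    "sp|P68037|UB2L3_MOUSE",
    "sp|Q6DGL8|RT15_DANRE",
    "sp|P46058|EDSP_CYNPY",
    "sp|P63245|GBLP_RAT",
]

counts = Counter(species_code(s) for s in subject_ids)
for code, n in sorted(counts.items()):
    print(code, n)
```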
Use SQLShare to Join a bunch of different big tables - cloud instead of someone's computer
Joining files will allow us to link the SwissProt IDs to the transcriptome file to assign gene functions
This file is cleaned up to have the Swiss Prot ID (e.g. sp|P25001|) separate from the other information so
we can match it to the ID in the database and join the files to attach Swiss Prot gene function information to
our blasted transcriptome sequences
In [ ]:
!tr '|' "\t" < /Volumes/Students/diseases/lauren/seastar_clc_uniprot_sprot_1_new.tab > /Volumes/Students/diseases/Allison/seastar_clc_uniprot_sprot_3.tab
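What the tr command does to one line can be seen with a small Python sketch: every pipe becomes a tab, so the Swiss-Prot accession ends up in its own column. The sample line uses only the first six columns of the first hit in the cleaned file, and `pipes_to_tabs` is an illustrative name.

```python
def pipes_to_tabs(line):
    """Replace every '|' with a tab, as tr '|' "\\t" does."""
    return line.replace("|", "\t")


line = "Phe1_clc_contig_4\tsp|P25001|COX1_PISOC\t88.39\t517\t48\t1"
columns = pipes_to_tabs(line).split("\t")
print(columns)
```

After the conversion the accession (P25001) sits in the third column, which is the column to match against the Swiss-Prot table when joining.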
We'll pick up next time with joining the files using SQL Share
In [ ]: