GBIO009-1 - Bioinformatics
Homework 1 and 2 review session
Presented by
Kirill Bessonov
November 2012
____________________________________________________________________________________________________________________
HW1: classical Q & A (GenomeGraphs) (1)
• The first two questions were on Bioconductor libraries. There are 608 BioC packages
• To get the citation for a particular library, use
citation("library_name")
• You were asked to get genomic data on a specific gene
library(GenomeGraphs)
#connect to the Ensembl BioMart and select the human genes dataset
ensembl_Human_Genes = useMart("ensembl", dataset="hsapiens_gene_ensembl")
#get info on the gene from the database using its Ensembl ID
gene <- makeGene(id = "ENSG00000115145", type = "ensembl_gene_id",
                 biomart = ensembl_Human_Genes)
#get info on the transcripts of the same gene
transcript <- makeTranscript(id = "ENSG00000115145", type = "ensembl_gene_id",
                             biomart = ensembl_Human_Genes)
#plot the gene together with its transcripts
gdPlot(list("gene" = gene, "transcripts" = transcript))
#retrieve info from the database, displaying the first 25 entries
getBM(c("ensembl_gene_id", "hgnc_symbol", "description"),
      filters = c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
      values = list(TRUE, TRUE, TRUE),
      mart = ensembl_Human_Genes)[1:25,]
____________________________________________________________________________________________________________________
HW1: classical Q & A (GenomeGraphs) (2)
• What is the gene name (i.e. hgnc_symbol) and function represented by the Ensembl ID ENSG00000115145?
geneInfo = getBM(c("ensembl_gene_id", "hgnc_symbol", "description"),
                 filters = c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
                 values = list(TRUE, TRUE, TRUE), mart = ensembl_Human_Genes)
> geneInfo[geneInfo$ensembl_gene_id == "ENSG00000115145", ]
     ensembl_gene_id hgnc_symbol description
4829 ENSG00000115145       STAM2 signal transducing adaptor molecule (SH3 domain and ITAM motif) 2
• How many exons does the Ensembl ID ENSG00000115145 have? 51 exons
attr(gene, "ens")
   ensembl_gene_id ensembl_transcript_id ensembl_exon_id exon_chrom_start exon_chrom_end rank strand        biotype
1  ENSG00000115145       ENST00000263904 ENSE00001351655        153032117      153032506    1     -1 protein_coding
2  ENSG00000115145       ENST00000263904 ENSE00002888710        153006659      153006743    2     -1 protein_coding
……
48 ENSG00000115145       ENST00000494589 ENSE00002785037        153004538      153004636    3     -1 protein_coding
49 ENSG00000115145       ENST00000494589 ENSE00002808134        153003676      153003822    4     -1 protein_coding
50 ENSG00000115145       ENST00000494589 ENSE00002929781        153001402      153001471    5     -1 protein_coding
51 ENSG00000115145       ENST00000494589 ENSE00001828491        153000503      153000527    6     -1 protein_coding
____________________________________________________________________________________________________________________
HW1: classical Q & A (GenomeGraphs) (3)
• Execute the following command. How many chromosomes do you see?
> getBM("chromosome_name","","", ensembl_Human_Genes)[c(1:22,433:435),1]
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22"
[16] "3" "4" "5" "6" "7" "8" "9" "MT" "X" "Y"
25 chromosomes: 22 autosomes, the two sex chromosomes (X and Y) and one mitochondrial chromosome (MT)
• Why is the number of chromosomes in this Ensembl dataset greater than 23 chromosome pairs? What do “MT”, “X” and “Y” refer to?
“MT” is the mitochondrial chromosome; “X” and “Y” are the sex chromosomes. The dataset has more than 23 entries because it includes MT and lists X and Y separately, although X and Y can be grouped into a single pair
____________________________________________________________________________________________________________________
HW2: Pairwise alignments (classical Q&A)
____________________________________________________________________________________________________________________
HW2: Pairwise alignments (classical Q&A) Q1
• Please globally align the following DNA sequences using the Needleman–Wunsch algorithm.
• Use the following scoring rules: a) gap -5; b) match between two bases +5; c) mismatch between two bases +3
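To check a hand-made alignment against these rules, the score can be summed column by column. The following is a minimal sketch, assuming two already aligned strings of equal length and a fixed penalty per gap column. The two sequences are hypothetical (the homework sequences are not reproduced in this transcript), and `score_alignment` is only an illustrative helper name.

```r
# Score an existing alignment under the Q1 rules (gap -5, match +5, mismatch +3)
score_alignment <- function(aln_a, aln_b, match = 5, mismatch = 3, gap = -5) {
  a <- strsplit(aln_a, "")[[1]]
  b <- strsplit(aln_b, "")[[1]]
  stopifnot(length(a) == length(b))     # aligned strings must have equal length
  is_gap      <- a == "-" | b == "-"    # columns containing a gap
  is_match    <- !is_gap & a == b       # identical bases
  is_mismatch <- !is_gap & a != b       # different bases
  sum(is_gap) * gap + sum(is_match) * match + sum(is_mismatch) * mismatch
}

# Hypothetical alignment: 3 matches, 1 mismatch, 1 gap
example_score <- score_alignment("ACG-T", "ACCAT")
print(example_score)  # 3*5 + 1*3 + 1*(-5) = 13
```

Summing by column like this only verifies a given alignment; finding the best one still requires filling the full Needleman–Wunsch matrix.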
____________________________________________________________________________________________________________________
HW2: Pairwise alignments (classical Q&A) Q3
• Do a local protein alignment of the sequences HEAGAWGHEE and PAWHAE using the BLOSUM62 matrix. The scoring rules are: gap -8; match and mismatch scores are taken from the BLOSUM62 matrix.
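A local alignment of this kind can be sketched with the Smith-Waterman recursion, which is the Needleman–Wunsch recursion with every cell floored at zero. The sketch below is a minimal base-R version under stated assumptions: `blosum62_sub` is a hand-copied subset of BLOSUM62 covering only the residues A, E, G, H, P and W, and should be checked against the full matrix before being relied on; `local_score_matrix` is an illustrative name.

```r
# Assumed subset of BLOSUM62 for the residues occurring in the two sequences
aa <- c("A", "E", "G", "H", "P", "W")
blosum62_sub <- matrix(c(
   4, -1,  0, -2, -1, -3,
  -1,  5, -2,  0, -1, -3,
   0, -2,  6, -2, -2, -2,
  -2,  0, -2,  8, -2, -2,
  -1, -1, -2, -2,  7, -4,
  -3, -3, -2, -2, -4, 11),
  nrow = 6, byrow = TRUE, dimnames = list(aa, aa))

# Smith-Waterman score matrix: rows follow x, columns follow y
local_score_matrix <- function(x, y, S, gap) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  H <- matrix(0, nrow = length(xs) + 1, ncol = length(ys) + 1)
  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      H[i + 1, j + 1] <- max(0,                          # restart the alignment
                             H[i, j] + S[xs[i], ys[j]],  # align the two residues
                             H[i, j + 1] + gap,          # gap in y
                             H[i + 1, j] + gap)          # gap in x
    }
  }
  H
}

H <- local_score_matrix("HEAGAWGHEE", "PAWHAE", blosum62_sub, gap = -8)
best_local_score <- max(H)
print(best_local_score)
```

With the values assumed above, the highest cell is 19, which corresponds to aligning AWGHEE with AW-HAE (4 + 11 - 8 + 8 - 1 + 5). The trace-back starts at that highest cell and stops at the first cell equal to zero.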
____________________________________________________________________________________________________________________
HW2: Pairwise alignments (classical Q&A) Q5
Produce a dot plot of the Human and Mouse p53 proteins from the previous question and paste the plot below. Complete the lines of R code to get the dot plot.
Are both proteins similar?
Yes, very similar, since we see a clear diagonal covering >90% of the sequence length
Where do the region(s) of greatest variation occur?
Between positions 50 and 100
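The idea behind a dot plot can be sketched in base R: put one sequence on each axis and mark every pair of positions that carry the same residue. The sketch below uses two hypothetical toy peptides that differ at one position, not the real p53 sequences, and `dot_matrix` is an illustrative name rather than the function used in the homework.

```r
# Logical matrix: TRUE where residue i of seq_a equals residue j of seq_b
dot_matrix <- function(seq_a, seq_b) {
  a <- strsplit(seq_a, "")[[1]]
  b <- strsplit(seq_b, "")[[1]]
  outer(a, b, "==")
}

# Hypothetical toy peptides that differ at position 6 only
toy_a <- "MEEPQSDPSV"
toy_b <- "MEEPQADPSV"
dots <- dot_matrix(toy_a, toy_b)

# Number of identical positions on the main diagonal
n_diagonal_matches <- sum(diag(dots))
print(n_diagonal_matches)

# Draw the plot only in an interactive session
if (interactive()) {
  image(seq_len(nrow(dots)), seq_len(ncol(dots)), dots * 1,
        col = c("white", "black"),
        xlab = "position in toy_a", ylab = "position in toy_b")
}
```

Similar sequences give a long unbroken diagonal; a break or shift in the diagonal marks a region where the sequences differ or where one of them has an insertion.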
____________________________________________________________________________________________________________________
HW2: Pairwise alignments (classical Q&A) Q7
• What global alignment score do you get for the two p53 proteins when you use the BLOSUM62 alignment matrix, a gap opening penalty of -10 and a gap extension penalty of -0.5? Answer: score of 1556
library(seqinr)          #provides query() and getSequence()
library(Biostrings)      #provides pairwiseAlignment()
choosebank("swissprot")  #open the protein database before querying it
query("p53_HUMAN", "AC=P04637");
p53_HUMAN_seq = getSequence(p53_HUMAN);
query("p53_MOUSE", "AC=P02340");
p53_MOUSE_seq = getSequence(p53_MOUSE);
globalAlign <- pairwiseAlignment(p53_HUMAN_seq, p53_MOUSE_seq, substitutionMatrix = "BLOSUM62",
                                 gapOpening = -10, gapExtension = -0.5)
• Errors: the R code was not included and the protein IDs (e.g. UniProt ID P04637) were not given
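With separate opening and extension penalties, the cost of a gap depends on its length. Tools differ in whether the first gap position pays the extension penalty as well as the opening penalty, so the convention of the alignment function in use should be checked in its documentation. The sketch below only illustrates the arithmetic for both conventions; `gap_cost` and its `first_extends` argument are hypothetical names.

```r
# Cost of one gap of length n with opening and extension penalties.
# first_extends = TRUE : every gap position pays the extension penalty
# first_extends = FALSE: the first position pays only the opening penalty
gap_cost <- function(n, open = -10, extend = -0.5, first_extends = TRUE) {
  if (first_extends) open + n * extend else open + (n - 1) * extend
}

cost_a <- gap_cost(3)                         # -10 + 3 * (-0.5) = -11.5
cost_b <- gap_cost(3, first_extends = FALSE)  # -10 + 2 * (-0.5) = -11
print(c(cost_a, cost_b))
```

Either way, one long gap is cheaper than several short ones, which is the reason for using two penalties instead of a single per-position gap score.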
____________________________________________________________________________________________________________________
HW2: Computer Style
Implementation of NW algorithm in R
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [1]
• Given the pseudo-code, implement the NW algorithm in R
– The algorithm has two parts
• Calculation of the alignment F-matrix
• Finding the optimal path(s) through the matrix
d = gap penalty score
i and j = positions in A & B sequences
for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j
for i=1 to length(A) {
  for j=1 to length(B) {
    Match ← F(i-1,j-1) + S(Ai, Bj)
    Delete ← F(i-1, j) + d
    Insert ← F(i, j-1) + d
    F(i,j) ← max(Match, Insert, Delete)
  }
}
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [2]
d = -8 #this is the gap penalty (also used later by the trace-back)
Fmatrix = function(A,B){
  fmatrix = matrix(0, nrow = nchar(A)+1, ncol = nchar(B)+1)
  for(i in 0 : nchar(A)){
    fmatrix[i+1,1] = d * i #populates the first column with gap penalties
  }
  for(j in 0 : nchar(B)){
    fmatrix[1,j+1] = d * j #populates the first row with gap penalties
  }
  for(i in 1 : nchar(A)){
    for(j in 1 : nchar(B)){
      #get the score for the pair of aa or nt at positions i and j
      score = rules(substr(A,i,i), substr(B,j,j))
      match = fmatrix[i,j] + score
      delete = fmatrix[i,j+1] + d
      insert = fmatrix[i+1,j] + d
      fmatrix[i+1,j+1] = max(match,delete,insert)
    }
  }
  colnames(fmatrix) = strsplit( paste(" " , B, sep=""), "")[[1]]
  rownames(fmatrix) = strsplit( paste(" " , A, sep=""), "")[[1]]
  return(fmatrix)
}
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [3]
#substitution scores: match +2, mismatch -1 (same order of rows and columns)
s.matrix <- matrix(rep(0,16), nrow = 4, ncol = 4, byrow = TRUE,
                   dimnames = list(c("A","C","G","T"), c("A","C","G","T")))
s.matrix["A",] = c(2,-1,-1,-1)
s.matrix["C",] = c(-1,2,-1,-1)
s.matrix["G",] = c(-1,-1,2,-1)
s.matrix["T",] = c(-1,-1,-1,2)
#returns the score for one pair of nucleotides a and b
rules = function(a,b){
  s.matrix[a,b]
}
> s.matrix
   A  C  G  T
A  2 -1 -1 -1
C -1  2 -1 -1
G -1 -1  2 -1
T -1 -1 -1  2
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [4]
• Check the F-matrix
A = "ATCG"
B = "TG"
fmatrix = Fmatrix(A, B)
> fmatrix
        T   G
    0  -8 -16
A  -8  -1  -9
T -16  -6  -2
C -24 -14  -7
G -32 -22 -12
• Start finding the optimal path(s) through the matrix
AlignmentA = ""
AlignmentB = ""
#start from the bottom right cell of the F-matrix
i = nchar(A) + 1
j = nchar(B) + 1
while(i > 1 && j > 1){
  #get score at current position of F-matrix
  CurrentScore = fmatrix[i,j]
  #what is around that F-matrix cell?
  ScoreDiag = fmatrix[i - 1, j - 1]
  ScoreUp = fmatrix[i, j - 1]
  ScoreLeft = fmatrix[i - 1, j]
  #(the body of the loop continues on the next slides)
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [5]
• Select the bottom right cell and start tracing back the path of the optimal alignment
AlignmentA = ""
AlignmentB = ""
while(i > 1 && j > 1){
  #which cell of the F-matrix am I in now?
  CurrentScore = fmatrix[i,j]
  ScoreDiag = fmatrix[i - 1, j - 1]
  ScoreUp = fmatrix[i, j - 1]
  ScoreLeft = fmatrix[i - 1, j]
  #considering the score came from the diagonal: diagonal cell + score of the pair
  #(row i of the matrix holds character i-1 of A, column j holds character j-1 of B)
  if (CurrentScore == ScoreDiag + s.matrix[substr(A,i-1,i-1), substr(B,j-1,j-1)]){
    AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "")
    AlignmentB = paste(substr(B,j-1,j-1),AlignmentB, sep = "")
    i = i - 1
    j = j - 1
  }
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [6]
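  #considering if the score comes from the left (introducing a gap in B)
  else if(CurrentScore == ScoreLeft + d){
    AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "")
    AlignmentB = paste( "-", AlignmentB, sep = "")
    i = i - 1
  }
  #considering if the score comes from the upper cell (introducing a gap in A)
  else if(CurrentScore == ScoreUp + d){
    AlignmentA = paste( "-", AlignmentA, sep = "")
    AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
    j = j - 1
  }
} #end of the while loop
#global alignment: pad with gaps whatever is left of A or B
while(i > 1){
  AlignmentA = paste(substr(A,i-1,i-1),AlignmentA, sep = "")
  AlignmentB = paste( "-", AlignmentB, sep = "")
  i = i - 1
}
while(j > 1){
  AlignmentA = paste( "-", AlignmentA, sep = "")
  AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
  j = j - 1
}
print(AlignmentA)
print(AlignmentB)
cat("Final score :", fmatrix[(nchar(A)+1),(nchar(B)+1)], "\n")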
____________________________________________________________________________________________________________________
HW2: Computer style (NW algorithm) [7]
• The scoring matrices could have been accessed through character indices, which requires no conversion and makes the code faster
• How would one output more than one of the best possible alignments?
• Please use more comments in your R code
• It would be nice to see the trace-backs visually
• Also, the scoring rules were not stated clearly
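One possible answer to the question about several best alignments is to follow every move that reproduces the current score during the trace-back, instead of stopping at the first one. The sketch below is a self-contained illustration, not the homework solution: the function names `nw_matrix` and `all_tracebacks`, the example sequences and the scores (match 2, mismatch -1, gap -2) are all assumptions chosen so that two equally good alignments exist.

```r
# Fill the Needleman-Wunsch score matrix; rows follow a, columns follow b
nw_matrix <- function(a, b, match, mismatch, gap) {
  f <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  f[, 1] <- gap * (0:length(a))
  f[1, ] <- gap * (0:length(b))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- if (a[i] == b[j]) match else mismatch
      f[i + 1, j + 1] <- max(f[i, j] + s, f[i, j + 1] + gap, f[i + 1, j] + gap)
    }
  }
  f
}

# Recursive trace-back that follows every move reproducing the current score
all_tracebacks <- function(a, b, f, match, mismatch, gap,
                           i = nrow(f), j = ncol(f), aln_a = "", aln_b = "") {
  if (i == 1 && j == 1) return(list(c(aln_a, aln_b)))  # reached the top left cell
  found <- list()
  if (i > 1 && j > 1) {                                # diagonal move
    s <- if (a[i - 1] == b[j - 1]) match else mismatch
    if (f[i, j] == f[i - 1, j - 1] + s) {
      found <- c(found, all_tracebacks(a, b, f, match, mismatch, gap, i - 1, j - 1,
                                       paste0(a[i - 1], aln_a), paste0(b[j - 1], aln_b)))
    }
  }
  if (i > 1 && f[i, j] == f[i - 1, j] + gap) {         # gap in b
    found <- c(found, all_tracebacks(a, b, f, match, mismatch, gap, i - 1, j,
                                     paste0(a[i - 1], aln_a), paste0("-", aln_b)))
  }
  if (j > 1 && f[i, j] == f[i, j - 1] + gap) {         # gap in a
    found <- c(found, all_tracebacks(a, b, f, match, mismatch, gap, i, j - 1,
                                     paste0("-", aln_a), paste0(b[j - 1], aln_b)))
  }
  found
}

# Hypothetical example with two equally good alignments
a <- strsplit("AAT", "")[[1]]
b <- strsplit("AT", "")[[1]]
f <- nw_matrix(a, b, match = 2, mismatch = -1, gap = -2)
alignments <- all_tracebacks(a, b, f, match = 2, mismatch = -1, gap = -2)
for (aln in alignments) cat(aln[1], aln[2], "", sep = "\n")
```

For this example both AAT over -AT and AAT over A-T reach the optimal score of 2. The number of optimal paths can grow quickly with sequence length, so for long sequences it may be more practical to cap the number of alignments returned.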