HPC and Bioinformatics COT 6930
Homework 2 (9 pts)
Due March 18
Part 1: Gene expression data analysis (4 pts)
The purpose of this assignment is for you to understand basic gene expression data
analysis techniques. We will use the WEKA data mining toolkit to perform two types of
gene expression data analysis:
1. Molecular classification of leukemia cancer. We will build a classifier to identify
whether a diseased tissue sample belongs to acute lymphoblastic leukemia (ALL)
or acute myeloid leukemia (AML), by using its gene expression data. This article
provides more information on molecular classification of leukemia cancer.
2. Selecting a small number of important genes to enhance the accuracy for
leukemia classification.
What you should do:
First, download and install WEKA from http://www.cs.waikato.ac.nz/ml/weka/. Read the
"WEKA Explorer User Guide" at
http://internap.dl.sourceforge.net/sourceforge/weka/ExplorerGuide.pdf.
This ppt file provides detailed step-by-step guidance on how to use WEKA explorer.
Next, download the leukemia gene expression data from here. The data is in ARFF file
format (the format WEKA requires). Memory issue: because the gene expression data is
relatively large, you may need to increase WEKA’s heap size by following the
instructions below:
1. Find WEKA’s home directory. It is normally under D:\program files\weka-3-4 (or on
the C: drive, depending on where you installed WEKA).
2. Open Runweka.ini.
3. Add “maxheap=256m” (without the double quotes) right above the line
“mainclass=weka.gui.GUIChooser”. If your computer has more than 256 MB of memory,
you may increase the heap size accordingly, for example maxheap=512m.
4. Save and close the Runweka.ini file, and restart WEKA.
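If you would rather script the heap-size edit than make it by hand, the Python sketch below shows one possible way. It works on the text of the file only. The function name add_maxheap and the sample file contents are assumptions made for illustration, and the real Runweka.ini may look different depending on your WEKA version, so check the result before saving it.

```python
def add_maxheap(ini_text, heap="256m"):
    """Return ini_text with one maxheap line placed right above the mainclass line."""
    setting = f"maxheap={heap}"

    def key(line):
        # Text before the first "=" identifies the setting.
        return line.split("=", 1)[0].strip().lower()

    # Drop any existing maxheap line so the setting is never duplicated.
    kept = [ln for ln in ini_text.splitlines() if key(ln) != "maxheap"]

    out, inserted = [], False
    for ln in kept:
        if not inserted and key(ln) == "mainclass":
            out.append(setting)
            inserted = True
        out.append(ln)
    if not inserted:
        # No mainclass line found (unexpected layout): append at the end.
        out.append(setting)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "# sample settings\nmainclass=weka.gui.GUIChooser\n"
    print(add_maxheap(sample, "512m"))
```

To apply it to a real file, read the file into a string, pass it through add_maxheap, and write the result back after keeping a backup copy.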
Then, use the WEKA Explorer to classify the data and compare the performance of
different classifiers and feature selection algorithms. You should choose the J4.8
classification algorithm (listed as J48 in WEKA). Compare the classifier's accuracy
obtained with 10-fold cross-validation in the following scenarios:
1. Apply the classifier directly without any feature (gene) selection.
2. Use a feature selection algorithm to select the top 5 ranked genes, and then apply
the classifier to the filtered data. To do this properly, you should use
ReliefFAttributeEval in the WEKA Explorer (with the default parameter settings). You
can find the details of the ReliefF algorithm in this article.
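To build some intuition for what a Relief-style ranking does, here is a small Python sketch of the basic Relief idea: a gene scores well when its value differs between a sample and its nearest neighbour from the other class (the "miss") but stays close to the nearest neighbour of the same class (the "hit"). This is a simplified two-class version with a single hit and a single miss per sample. It is not WEKA's ReliefFAttributeEval, which averages over several neighbours and handles more cases. The function names and the toy data are made up for illustration.

```python
def relief_scores(X, y):
    """Score each feature with a simplified Relief (one nearest hit, one nearest miss)."""
    n, m = len(X), len(X[0])

    # Range of each feature, used to scale per-feature differences into [0, 1].
    spans = []
    for j in range(m):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(m))

    weights = [0.0] * m
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(m):
            # Reward separation from the miss, penalise separation from the hit.
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return weights


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = relief_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    X = [[0.10, 5.0], [0.20, 1.0], [0.15, 3.0],
         [0.90, 4.0], [0.80, 2.0], [0.95, 1.5]]
    y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
    print(relief_scores(X, y))
    print(top_k_features(X, y, 1))
```

On the toy data the informative feature receives a positive weight and the uninformative one a negative weight, which is the behaviour the ranking relies on.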
What should be turned in?
Please follow the format of this article and turn in a written report with the
following content.
o The 10-fold cross-validation results of applying J4.8 to classify the
Leukemia.arff dataset without any feature selection (1 pt)
o A figure whose x-axis shows the number of selected genes (1, 3, 5, 10, 20, 30)
and whose y-axis shows the accuracy of the J4.8 classifier built from the
selected features (1 pt)
o A simple explanation of why selecting a small number of genes can help build a
more accurate classification model (2 pts)
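The accuracy figures requested above come from 10-fold cross-validation, which WEKA runs for you. The Python sketch below only illustrates the idea: the samples are split into 10 folds with similar class proportions, each fold is held out once, and every sample is predicted by a model that was not trained on it. The function names are assumptions, the fold assignment is not shuffled (WEKA randomizes it), and train_stump is a made-up stand-in for a real classifier, not J4.8.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[pos % k].append(idx)  # deal indices out like cards
            pos += 1
    return folds


def cross_val_accuracy(X, y, train_fn, k=10):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(y)


def train_stump(X_train, y_train):
    """Hypothetical two-class stand-in: threshold feature 0 midway between class means."""
    means = {}
    for c in set(y_train):
        values = [x[0] for x, t in zip(X_train, y_train) if t == c]
        means[c] = sum(values) / len(values)
    low, high = sorted(means, key=lambda c: means[c])
    cut = (means[low] + means[high]) / 2
    return lambda x: low if x[0] < cut else high


if __name__ == "__main__":
    # Toy, well-separated data: 10 samples per class, one feature.
    X = [[i * 0.1] for i in range(10)] + [[2 + i * 0.1] for i in range(10)]
    y = ["ALL"] * 10 + ["AML"] * 10
    print(cross_val_accuracy(X, y, train_stump, k=10))
```

For the figure, you would repeat the evaluation once per gene count (1, 3, 5, 10, 20, 30) and plot the resulting accuracies. Those numbers must come from your own WEKA runs.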
Part 2: Bioinformatics resources and search engines (5 pts)
Question 1 (4.0 pts)
Please read this article and go through the BLAST Guide at
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/guide.html. Given the following DNA
sequence (listed after the questions) and the NCBI BLAST tools at
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi, please answer the following questions:
1. Which version of the NCBI BLAST tool should we use to find similar sequences? (0.2 pt)
2. Now use the sequence as input and search against the database (using the tool you
selected). Please list the alignments, % identity, and E-values of the top 3 matches (0.5 pt)
3. Click the first matching sequence and state the name of this protein (0.2 pt), the
name of the gene that encodes this protein (0.2 pt), and the amino acid sequence of this
protein (0.2 pt), and briefly state the function of this protein (0.2 pt).
4. Assume that you are interested in finding more information about the first matching
protein retrieved in step 3, and you decide to first find the secondary structure of its
amino acid sequence. Use the PSIPRED protein structure prediction tool to find the
secondary structure information of the protein, attach the results, and report what you
retrieved (0.5 pt)
5. Now you are interested in finding proteins structurally similar to the first matching
protein from step 3, and you decide to use PSIPRED (the fold recognition tool
GenTHREADER). Please list the PDB_ID (Protein Data Bank ID) of each retrieved protein
(ignoring the last two characters) (0.5 pt)
6. Assume that you decide to use tblastn (search a translated nucleotide database using a
protein query) as your search tool to find proteins similar to the first matching protein
retrieved in step 3. Use the amino acid sequence of the first matching protein (from
step 3) as input and search the database. Select the first matching result and list the
protein IDs cross-referenced to the Protein Data Bank (0.5 pt). What are the SCOP classes
of the first three proteins (0.5 pt)? Please list the proteins common to the results of
step 5 and step 6 (0.5 pt)
acttgtcatg gcgactgtcc agctttgtgc caggagcctc gcaggggttg atgggattgg
ggttttcccc tcccatgtgc tcaagactgg cgctaaaagt tttgagcttc tcaaaagtct
agagccaccg tccagggagc aggtagctgc tgggctccgg ggacactttg cgttcgggct
gggagcgtgc tttccacgac ggtgacacgc ttccctggat tggcagccag actgccttcc
gggtcactgc catggaggag ccgcagtcag atcctagcgt cgagccccct ctgagtcagg
aaacattttc agacctatgg aaactacttc ctgaaaacaa cgttctgtcc cccttgccgt
cccaagcaat ggatgatttg atgctgtccc cggacgatat tgaacaatgg ttcactgaag
acccaggtcc agatgaagct cccagaatgc cagaggctgc tccccgcgtg gcccctgcac
cagcagctcc tacaccggcg gcccctgcac cagccccctc ctggcccctg tcatcttctg
tcccttccca gaaaacctac cagggcagct acggtttccg tctgggcttc ttgcattctg
ggacagccaa gtctgtgact tgcacgtact cccctgccct caacaagatg ttttgccaac
tggccaagac ctgccctgtg cagctgtggg ttgattccac acccccgccc ggcacccgcg
tccgcgccat ggccatctac aagcagtcac agcacatgac ggaggttgtg aggcgctgcc
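Before pasting the sequence into a search form, it can help to strip the spaces and line breaks and wrap it as a single FASTA record. The Python sketch below is one way to do that. The function name to_fasta, the record name "query", and the 60-character line width are choices made for this example, not requirements of the assignment.

```python
import re
import textwrap


def to_fasta(raw, name="query", width=60):
    """Strip whitespace and position numbers, then wrap as a FASTA record."""
    seq = re.sub(r"[\s\d]+", "", raw).lower()
    # Refuse anything that is not a plain nucleotide letter.
    bad = set(seq) - set("acgtn")
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    body = "\n".join(textwrap.wrap(seq, width))
    return f">{name}\n{body}\n"


if __name__ == "__main__":
    # Two blocks taken from the sequence in this handout, for illustration.
    print(to_fasta("acttgtcatg ggttttcccc", width=10))
```

Pass the full block of sequence text as raw to obtain one record that can be pasted as a query.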
Question 2 (1 pt)
Assume that you are asked to determine, from the sequences of pancreatic
ribonuclease from horse (Equus caballus), minke whale (Balaenoptera acutorostrata), and
red kangaroo (Macropus rufus), which two of these species are most closely related. The
sequence information is given below, and you decide to use the ClustalW
(http://www.ebi.ac.uk/clustalw/) multiple sequence alignment tool to find the answer.
Please summarize the alignment results (0.5 pt) and conclude which two species are most
closely related (0.5 pt).
>RNP_HORSE
kespamkfer qhmdsgstss snptycnqmm krrnmtqgwc kpvntfvhep ladvqaiclq
knitckngqs ncyqssssmh itdcrltsgs kypncayqts qkerhiivac egnpyvpvhf
dasvevst
>RNP_BALAC
respamkfqr qhmdsgnspg nnpnycnqmm mrrkmtqgrc kpvntfvhes ledvkavcsq
knvlckngrt ncyesnstmh itdcrqtgss kypncaykts qkekhiivac egnpyvpvhf
dnsv
>RNP_MACRU
etpaekfqrq hmdtehstas ssnycnlmmk ardmtsgrck plntfihepk svvdavchqe
nvtckngrtn cyksnsrlsi tncrqtgask ypncqyetsn lnkqiivace gqyvpvhfda
yv
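When summarizing the alignment, one common figure is the pairwise percent identity between aligned sequences. The Python sketch below shows how such a number can be computed from two rows of an existing alignment, plus a small reader for sequences written in blocks of ten as above. The function names are assumptions, the aligned strings in the example are toy data rather than a ClustalW result, and identity can be defined in more than one way, so the scores ClustalW reports may differ.

```python
def parse_fasta(text):
    """Return {name: sequence}, dropping the spaces that group residues in tens."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += "".join(line.split()).upper()
    return records


def percent_identity(a, b):
    """Percent of columns, gap-free in both rows, that hold the same residue."""
    if len(a) != len(b):
        raise ValueError("rows of an alignment must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Reading a sequence written in blocks of ten (excerpt from this handout).
    print(parse_fasta(">RNP_BALAC\nrespamkfqr qhmdsgnspg\n"))
    # Toy aligned rows, not real alignment output.
    print(percent_identity("ACDE-F", "ACDQGF"))
```

Computing this value for each of the three species pairs, using the rows of your own ClustalW output, gives a compact table to include in the summary.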