HPC and Bioinformatics COT 6930
Homework 2 (9 pts)
Due March 18

Part 1: Gene expression data analysis (4 pts)

The purpose of this assignment is for you to understand basic gene expression data analysis techniques. We will use the WEKA data mining toolkit to perform two types of gene expression data analysis:

1. Molecular classification of leukemia. We will build a classifier that uses a diseased tissue sample's gene expression data to identify whether the sample belongs to acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). This article provides more information on molecular classification of leukemia.
2. Selecting a small number of important genes to improve the accuracy of leukemia classification.

What you should do:

First, download and install WEKA from http://www.cs.waikato.ac.nz/ml/weka/. Read the "WEKA Explorer User Guide" at http://internap.dl.sourceforge.net/sourceforge/weka/ExplorerGuide.pdf. This ppt file provides detailed step-by-step guidance on how to use the WEKA Explorer.

Next, download the leukemia gene expression data from here. The data is in ARFF file format (the format WEKA requires).

Memory issue: Because the gene expression data is relatively large, you may need to increase WEKA's heap size by following the instructions below:

o Find WEKA's home directory, normally D:\program files\weka-3-4 (or on the C: drive, depending on where you installed WEKA).
o Open Runweka.ini.
o Add "maxheap=256m" (without the double quotes) right above the line "mainclass=weka.gui.GUIChooser". If your computer has more than 256 MB of memory, you may increase the heap size accordingly, say maxheap=512m.
o Save and close the Runweka.ini file, and restart WEKA.

Then, use the WEKA Explorer to classify the data and compare the performance of different classifiers and feature selection algorithms. You should choose the J4.8 classification algorithm (listed as J48 in WEKA).
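Optional sanity check: before loading the data into the WEKA Explorer, it can help to confirm how many attributes (genes plus the class label) and samples an ARFF file contains. The following is only a minimal sketch, assuming a dense ARFF file with unquoted attribute names. The function name summarize_arff and the toy file contents are made up for illustration and are not taken from the leukemia dataset.

```python
def summarize_arff(text):
    """Return (relation name, attribute names, number of data rows).

    Assumes a simple dense ARFF file: '%' comment lines, '@relation',
    '@attribute <name> <type>' with unquoted names, then '@data' rows.
    """
    relation = None
    attributes = []
    n_rows = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            n_rows += 1  # every remaining non-comment line is one sample
            continue
        lower = line.lower()  # ARFF keywords are case-insensitive
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip()
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, n_rows


# Toy ARFF text with invented values (NOT the assignment data).
sample = """\
% toy file for illustration only
@relation toy_expression
@attribute gene_1 numeric
@attribute gene_2 numeric
@attribute class {ALL,AML}
@data
-214,-153,ALL
-139,-73,AML
"""

relation, attributes, n_rows = summarize_arff(sample)
print(relation, len(attributes), n_rows)
```

To check a real file, pass its contents, for example summarize_arff(open(path).read()), where path is wherever you saved the download.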
Compare the classifier's accuracy obtained with 10-fold cross-validation in the following scenarios:

o Apply the classifier directly, without any feature (gene) selection.
o Use a feature selection algorithm to select the top 5 ranked genes, and then apply the classifier to the filtered data. To do this properly, you should use ReliefFAttributeEval in the WEKA Explorer (with the default parameter settings). You can find the details of the ReliefF algorithm in this article.

What should be turned in?

Please follow the format of this article and turn in a written report with the following content:

o The 10-fold cross-validation results of applying J4.8 to classify the Leukemia.arff dataset without any feature selection (1 pt)
o A figure whose x-axis shows the number of selected genes (1, 3, 5, 10, 20, 30) and whose y-axis shows the accuracy of the J4.8 classifier built from the selected features (1 pt)
o A simple explanation of why selecting a small number of genes can help build a more accurate classification model (2 pts)

Part 2: Bioinformatics resources and search engines (5 pts)

Question 1 (4.0 pts)

Please read this article and go through the BLAST Guide at http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/guide.html. Given the DNA sequence below and the NCBI BLAST tools at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi, please answer the following questions:

1. Which version of the NCBI BLAST tool should we use to find similar sequences? (0.2 pt)
2. Use the sequence as input and search against the database with the tool you selected. List the alignments, % identity, and E-values of the top 3 matches. (0.5 pt)
3. Click the first matching sequence. What is the name of this protein? (0.2 pt) What is the name of the gene that encodes this protein? (0.2 pt) What is the amino acid sequence of this protein? (0.2 pt) Briefly state the function of this protein. (0.2 pt)
4. Assume that you want more information about the first matching protein retrieved in step 3, and you decide to start with the secondary structure of its amino acid sequence. Use the PSIPRED protein structure prediction tool to find the secondary structure of the protein, attach the results, and report what you retrieved. (0.5 pt)
5. Now you are interested in finding proteins structurally similar to the first matching protein from step 3, and you decide to use PSIPRED (the GenTHREADER fold recognition tool). List the PDB_ID (Protein Data Bank ID) of the retrieved proteins, ignoring the last two characters. (0.5 pt)
6. Assume that you decide to use tblastn (search a translated nucleotide database using a protein query) to find proteins similar to the first matching protein retrieved in step 3. Use the amino acid sequence of the first matching protein (from step 3) as input and search the database. Select the first matching result and list the protein IDs cross-referenced to the Protein Data Bank (0.5 pt). What are the SCOP classes of the first three proteins? (0.5 pt)
7. Please list the proteins common to the results of step 5 and step 6. (0.5 pt)

DNA sequence for Question 1:

acttgtcatg ggttttcccc agagccaccg gggagcgtgc gggtcactgc aaacattttc
cccaagcaat acccaggtcc cagcagctcc tcccttccca ggacagccaa tggccaagac
tccgcgccat gcgactgtcc tcccatgtgc tccagggagc tttccacgac catggaggag
agacctatgg ggatgatttg agatgaagct tacaccggcg gaaaacctac gtctgtgact
ctgccctgtg ggccatctac agctttgtgc tcaagactgg aggtagctgc ggtgacacgc
ccgcagtcag aaactacttc atgctgtccc cccagaatgc gcccctgcac cagggcagct
tgcacgtact cagctgtggg aagcagtcac caggagcctc cgctaaaagt tgggctccgg
ttccctggat atcctagcgt ctgaaaacaa cggacgatat cagaggctgc cagccccctc
acggtttccg cccctgccct ttgattccac agcacatgac gcaggggttg tttgagcttc
ggacactttg tggcagccag cgagccccct cgttctgtcc tgaacaatgg tccccgcgtg
ctggcccctg tctgggcttc caacaagatg acccccgccc ggaggttgtg atgggattgg
tcaaaagtct cgttcgggct actgccttcc ctgagtcagg cccttgccgt ttcactgaag
gcccctgcac tcatcttctg ttgcattctg ttttgccaac ggcacccgcg aggcgctgcc

Question 2 (1 pt)

Assume that you were asked to determine, from the sequences of pancreatic ribonuclease from horse (Equus caballus), minke whale (Balaenoptera acutorostrata), and red kangaroo (Macropus rufus), which two of these species are most closely related. The sequences are given below, and you decide to use the ClustalW (http://www.ebi.ac.uk/clustalw/) multiple sequence alignment tool to find the answer. Please summarize the alignment results (0.5 pt) and conclude which two species are most closely related (0.5 pt).

>RNP_HORSE
kespamkfer qhmdsgstss snptycnqmm krrnmtqgwc kpvntfvhep ladvqaiclq
knitckngqs ncyqssssmh itdcrltsgs kypncayqts qkerhiivac egnpyvpvhf
dasvevst

>RNP_BALAC
respamkfqr qhmdsgnspg nnpnycnqmm mrrkmtqgrc kpvntfvhes ledvkavcsq
knvlckngrt ncyesnstmh itdcrqtgss kypncaykts qkekhiivac egnpyvpvhf
dnsv

>RNP_MACRU
etpaekfqrq hmdtehstas ssnycnlmmk ardmtsgrck plntfihepk svvdavchqe
nvtckngrtn cyksnsrlsi tncrqtgask ypncqyetsn lnkqiivace gqyvpvhfda
yv
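
When summarizing a multiple sequence alignment for Question 2, one simple measure of relatedness is pairwise percent identity: the fraction of aligned columns in which two sequences carry the same residue. The sketch below shows that calculation only. It is not ClustalW's own scoring, and the function names and the three toy sequences are invented for illustration. To apply it to the assignment, you would paste in the gapped, equal-length rows from your own alignment output.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identities(alignment):
    """Map each pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Toy, made-up aligned sequences (NOT the ribonuclease data above).
toy = {
    "SEQ_A": "MKT-AYIAK",
    "SEQ_B": "MKTQAYLAK",
    "SEQ_C": "MRT-GYIGK",
}

scores = pairwise_identities(toy)
closest = max(scores, key=scores.get)  # pair with the highest identity
for pair, score in scores.items():
    print(pair, round(score, 1))
print("most similar pair:", closest)
```

The pair with the highest identity is the natural candidate for "most closely related", but your written answer should still rest on the alignment and scores that ClustalW itself reports.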