* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download projects - Computer Science and Engineering
Survey
Document related concepts
Protein moonlighting wikipedia , lookup
Community fingerprinting wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
List of types of proteins wikipedia , lookup
Peptide synthesis wikipedia , lookup
Mass spectrometry wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Bottromycin wikipedia , lookup
Point mutation wikipedia , lookup
Deoxyribozyme wikipedia , lookup
SNP genotyping wikipedia , lookup
Ribosomally synthesized and post-translationally modified peptides wikipedia , lookup
Self-assembling peptide wikipedia , lookup
Molecular evolution wikipedia , lookup
Transcript
CSE280a: Projects Vineet Bafna CSE280 Vineet Bafna Project Logisitics • • • Research project (70%) Work individually, or in groups of 2 Two presentations: – Introductory presentation: Feb 1st week (20 minutes) (20% grade) • • • • • – One on one meeting with instructor (end February) (10% grade) • – Discuss preliminary results Final presentation (last 2-3 classes): (30% grade) • • CSE280 Describe the goals of the project Describe your (computational) formulation Summarize/critique reading assignment Present an algorithm Constructive criticism of other projects Submit a final report Final presentation Vineet Bafna Project 1: disease gene mapping • • Recall, Linkage Disequilibrium In the absence of recombination, – – • Correlation between columns The joint probability Pr[A=a,B=b] is different from P(a)P(b) With extensive recombination – CSE280 Pr(a,b)=P(a)P(b) Vineet Bafna Measures of LD • • Consider two bi-allelic sites with alleles marked with 0 and 1 Define – – • • P00 = Pr[Allele 0 in locus 1, and 0 in locus 2] P0* = Pr[Allele 0 in locus 1] Linkage equilibrium if P00 = P0* P*0 D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = … CSE280 Vineet Bafna LD can be used to map disease genes LD 0 1 1 0 0 1 D N N D D N • • LD decays with distance from the disease allele. By plotting LD, one can short list the region containing the disease gene. CSE280 Vineet Bafna Multiple loci LD 0 0 1 0 0 1 D N N D D N • 0 1 1 0 0 0 In complex diseases, multiple loci interact to confer disease susceptibility CSE280 Vineet Bafna Testing for multiple loci • Assume SNP matrix with n individuals, m loci. Testing for all sets of 5 SNPs implies a huge number of computations? m n 5 • Can you come out with computational strategies that can speed it up? CSE280 Vineet Bafna Speeding up multiple locus computations • • • • A filtering strategy? Input: a SNP matrix with one or more pairs that interactively associate Output: a set of SNP pairs that includes the interacting pair(s). Method should be fast, and should NOT consider all pairs. CSE280 Vineet Bafna Speeding up the computations k=2 • • • 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 1 1 Correlated SNPs should also have low hamming distance. Random SNPs should have high hamming distance. Strategy: select k individuals at random. – Hash each individual restricted to k individuals – Correlated SNPs should fall in the same bin with high probability CSE280 Vineet Bafna Project 2: mtDNA phylogeny • • In the absence of recombination, the history of mitochondrial DNA can be expressed by a tree. The goal of this project is to build a robust phylogeny using a heuristic modification of the perfect phylogeny. CSE280 Vineet Bafna The Genographic project https://www3.nationalgeographic.com/genographic/atlas.html • The genographic project aims to trace geographic origins of the human race using mitochondrial DNA. CSE280 Vineet Bafna Without recurrent mutations • Unique tree can explain the evolutionary history A B C D E 1 1 0 1 0 1 2 1 0 1 0 0 3 0 1 0 1 0 4 0 0 1 0 0 r 1 E 3 2 B 5 4 D C CSE280 Vineet Bafna A 5 0 0 0 1 0 With recurrent mutations • • • • Adding another individual F destroys perfect phylogeny Why? It is not so easy to place F Can you suggest a strategy? r A B C D E F 1 E 1 1 0 1 0 1 0 3 0 1 0 1 0 0 4 0 0 1 0 0 0 3 2 B 2 4 2 1 0 1 0 0 1 5 1 D C CSE280 Vineet Bafna A F 5 0 0 0 1 0 0 Tests of Selection • In class, we have discussed alleles that can be selectively neutral, or under active selection – • • Active selection may be positive or negative How do we identify regions under positive, or negative selection? Balancing selection: sometimes it is helpful for a population to CSE280 Vineet Bafna Adaptive Selection • • Selection leads to loss of heterozygosity (will be explained in detail in the next lecture). Can you come up with a test for selection? CSE280 Vineet Bafna Balancing selection • • • Sometimes both alleles are useful in a population, and it helps to have both around A simple example is when diversity is important (the two variants help maintain diversity) Bipolar disorder genes could be under balancing selection – – • • High creativity which might confer some selective/reproductive advantage. Depression offers a disadvantage If so, the tests for this disorder might be tricky. How can we identify regions under balancing selection? CSE280 Vineet Bafna Testing for Balancing Selection • • • Adaptive selection leads to loss of heterozygosity (will be explained in detail in the next lecture). Balancing selection leads to two dominant haplotypes Can you come up with a test for balancing selection? CSE280 Vineet Bafna Project: Primer design for cancer genomics CSE280 Vineet Bafna The Science behind Gleevec Fusions – observed in leukemia, lymphoma, and sarcomas • “Philadelphia Translocation” – CSE280 Drugs target this fusion protein Vineet Bafna Fluoroscent in situ hybridization • Cancer genomes show extensive structural variation CSE280 Vineet Bafna Assaying for tumor variants • • • Most tumors start off with a single cell, which then proliferate. Drugs like Gleevec are used well after cancer has taken hold. Can we detect the cancer early by detecting the genomic abnormality? – • If a very few cells in the person are cancerous, can we still detect it? Can we track a patient through his treatment? CSE280 Vineet Bafna Cancer genomics deletion • • In cancers, large genetic changes can occur, including deletions, inversions, and rearrangements of genomes In the early stages, only a few cells will show this CSE280 Vineet Bafna Polymerase Chain Reaction • • PCR is a technique for amplifying and detecting a specific portion of the genome Amplification takes place if the primers are ‘appropriate’ distance apart (<2kb) CSE280 Vineet Bafna Assaying for Rare Variants • PCR can be used to assay for a given genomic abnormality, even in a heterogenous population of tumor and normal cells PCR Extract Genomic DNA Tumor cell CSE280 Distance too large for amplification Vineet Bafna Detection Variant Variants • • What if the variant is the minority in the cell population? What if deletion boundaries are uncertain? Patient A Deletion Patient B Deletion Patient C Deletion CSE280 Vineet Bafna Observed variation in deletion size Sizes of homozygous deletions in cell lines from different human cancers. (scale is in megabases). CSE280 Vineet Bafna Primer Approximation Multiplex PCR (PAMP)* • Multiple primers are optimally spaced, flanking a breakpoint of interest – – • Upstream of breakpoint, forward primers Downstream of breakpoint, reverse primers The primers are run in a multiplex PCR reaction – Any pair can form a viable product Patient B Patient C Deletion CSE280 Deletion Vineet Bafna Experimental Design (500Kb region) • 10 sets of 25 primers: upstream and downstream – – 250 upstream 250 downstream • Primer-pairs closest to breakpoint amplified • Assay by oligo array Goal: Computational selection of an ‘optimal’ primer set CSE280 Vineet Bafna Goal • • • Input, a collection of primers Identify a subset of primers that do not cross-hybridize, are unique, yet cover the region completely Use combinatorial optimization, simulated annealing, integer linear programming….. CSE280 Vineet Bafna Spectral Networks Algorithms for De Novo Interpretation of Tandem Mass Spectra Nuno Bandeira, Ph.D. Department of Computer Science and Engineering, University of California, San Diego ProtIG seminar series September 21, 2007 CSE280 Vineet Bafna Proteins and their modifications Proteins are fundamental players in the regulation of biological processes. encodes for DNA Proteins regulate Knowing proteins involves knowing many things. This dissertation focuses on: - Identification - Sequencing - Post-translational modifications ( ) CSE280 Vineet Bafna Protein sequences and modifications From a computational perspective, a protein can be represented as a string over a weighted alphabet: Protein sequence: …AFSRLEMILGF… Amino acid Mass A 71 F 147 (obtained via enzymatic digestion) S 87 Modifications change amino acid masses: R 156 L 113 E 129 M 131 I 113 G 57 AFSRL SRLEMILGF EMILG SRLEMILGF Subsequences are called peptides Mass(M)=131 Mass(SRLEMILGF)=1047 SRLEM ILGF Mass( )=16 Mass(M )=147 CSE280 Mass(SRLEM ILGF)=1063 Vineet Bafna Nobel prize in chemistry, 2002 CSE280 Vineet Bafna What is mass spectrometry? Amino acid Mass A 71 F 147 S 87 … CSE280 http://nobelprize.org/chemistry/laureates/2002/chemadv02.pdf Vineet Bafna Tandem Mass Spectrometry (MS/MS) Protein Sequence: …THISISAVERYLARGESAMPLEPRTEINSEQENCE… Modified peptide LARG*E Prefix masses: b L R A L A L A L R A G Prefix masses : b L R A E G E L A R G E L A R G E L G* R A E R G* E R G* E G*E LA E L Intensity LA GE LAR RGE LARG ARGE E L Intensity Suffix masses : y Mass (m/z) CSE280 MS/MS spectrum E G* y: Suffix masses PM Modification: any event that changes the mass at a specific site. RG*E LARG* ARG*E Peptide LARGE Mass (m/z) Vineet Bafna Mass shifts Example of a real MS/MS spectrum b10 Symmetric CSE280 y12 Vineet Bafna Tandem Mass Spectrometry (MS/MS) Peptides Large set of MS/MS spectra Tandem Mass Spectrometry Enzymatic digestion Proteins … … Database search De novo sequencing Database of known peptides u MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, ALKIIMNVRT,SEQUENCE, SEQUENCE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. s e s f e q q n c e q Peptide SEQUENCE CSE280 Vineet Bafna u e u n e e n c c e es Mixture spectra Peptide A: NLAFFQLR Peptide B: ALDDILNLK Sometimes, the instrument generates a single spectrum from two or more peptides: ? CSE280 Vineet Bafna Mixture spectrum How to identify mixture spectra? CSE280 Vineet Bafna Proposed approach • • When identifying a mixture spectrum of peptides A,B, assume you have non-mixture spectra for the same peptides. Compare the non-mixture spectra of known peptides to putative mixture spectra to determine peptide identifications CSE280 Vineet Bafna Project description • Implement an algorithm to identify mixture spectra from pairs of peptides by combining previously identified spectra from isolated peptides. • Test the above implementation by simulating mixture spectra using an existing database of spectra from isolated peptides. • Propose a scoring procedure to separate correct from false identifications. Nuno Bandeira [email protected] CSE280 Vineet Bafna