* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Phylogenetic Tree Construction using Pathway Analysis
Survey
Document related concepts
Metagenomics wikipedia , lookup
Maximum parsimony (phylogenetics) wikipedia , lookup
Minimal genome wikipedia , lookup
Designer baby wikipedia , lookup
Ridge (biology) wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Gene expression profiling wikipedia , lookup
Microevolution wikipedia , lookup
Quantitative comparative linguistics wikipedia , lookup
Gene expression programming wikipedia , lookup
History of genetic engineering wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Mir-92 microRNA precursor family wikipedia , lookup
Wnt signaling pathway wikipedia , lookup
Transcript
Phylogenetic Tree Construction using Pathway Analysis Bioengineering 190C Project By: Harry Choi Nick Lin Gabe Kwong Li Yan Christina Yau Background Traditional Approach Comparison of single orthologs between organisms Distance matrix generation from similarity scores Hierarchical Clustering Tree Construction Disadvantage Sensitive to choice of gene for comparison Possible inconsistency of trees generated Our Approach N annotated organisms to be clustered Reference organism is chosen Pathway in the reference organism is chosen Pool of orthologs in the N organisms is generated by BLAST Analysis of pool of ortholog generated vector representing each organisms Distance calculation from vectors Hierarchical Clustering Tree Construction Rationale for Approach Pathway takes into account multiple genes Individual differences between genes not directly taken into account All genes considered are related to each other in cellular function Better conservation of actual function than sequence identities Better consistency in trees generated Program Design Modular design divided into following portions: BLAST Analysis of BLAST results Hierarchical Clustering Tree Construction Design allows for reuse of components in different applications with minor changes Design allows individual subrountines to be used recursively to generate desired results with minimal changes Step 1: BLAST Conserved proteins from Wnt pathway will be used as example Wnt pathway Wnt proteins form a family of highly conserved secreted signaling molecules that regulate cell-to-cell interactions during embryogenesis. Wnt genes and Wnt signaling are also implicated in cancer. Wnt pathway is found in many organisms such as: Drosophila, Caenorhabditis elegans, Xenopus, Chiecken, Mouse, Zebrafish, and Human. Wnt pathway (cont) Choose 6 most conserved proteins from this pathway as seed proteins: Wnt Frizzled Dsh Apc Axin Tcf (Roel Nusse, 2002) 5 organisms Drosophila: Mouse: C. elegans: Zebrafish: Xenopus: 54455 sequences 77143 sequences 62256 sequences 3069 sequences 5174 sequences Strategy Seed protein (pr1) from Organism 1 (O1) blast against 4 other organisms: O1 O2 O3 O4 O5 pr1 pr2 pr3 pr4 pr5 pr6 pr7 Pr8 pr9 Secondary seed proteins (pr 2, …, 9) blast against respective 4 other organism: O1 pr1 pr10 pr11 O2 pr2 O3 O4 O5 O1 O2 pr4 pr6 pr12 pr8 pr13 pr14 pr15 pr16 pr17 pr1 pr24 pr3 pr25 pr26 O3 pr4 O4 O5 pr7 pr14 pr15 pr9 pr16 pr17 pr23 O1 O2 O3 O4 O5 O1 O2 O3 O4 O5 pr11 pr18 pr19 pr3 pr5 pr6 pr20 pr7 pr13 pr21 pr22 pr9 pr23 : : : : : : pr5 : : : : : : Example input seed sequence for BLAST >gi|85190|pir||A29650 wingless (wg) protein precursor - fruit fly (Drosophila melanogaster) MDISYIFVICLMALCSGGSSLSQVEGKQKSGRGRGSMWWGIAKVGEPNNITPIMYMDPAIHS TLRRKQRRLVRDNPGVLGALVKGANLAISECQHQFRNRRWNCSTRNFSRGKNLFGKIVDRGC RETSFIYAITSAAVTHSIARACSEGTIESCTCDYSHQSRSPQANHQAGSVAGVRDWEWGGCS DNIGFGFKFSREFVDTGERGRNLREKMNLHNNEAGRAHVQAEMRQECKCHGMSGSCTVKTC WMRLANFRVIGDNLKARFDGATRVQVTNSLRATNALAPVSPNAAGSNSVGSNGLIIPQSGLV YGEEEERMLNDHMPDILLENSHPISKIHHPNMPSPNSLPQAGQRGGRNGRRQGRKHNRYHF QLNPHNPEHKPPGSKDLVYLEPSPSFCEKNLRQGILGTHGRQCNETSLGVDGCGLMCCGRGY RRDEVVVVERCACTFHWCCEVKCKLCRTKKVIYTCL (fasta format: start with “>”) Example output file from BLAST: 15 secondary seed proteins wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_85190 wg_celegans_7508752 wg_celegans_3880389 wg_celegans_17539494 wg_zebrafish_103816 wg_zebrafish_833600 wg_zebrafish_18859559 wg_zebrafish_139740 wg_xenopus_65236 wg_xenopus_69039 wg_xenopus_139748 wg_mouse_293671 wg_mouse_387388 wg_mouse_69037 wg_mouse_13529431 wg_mouse_139744 1.70e-41 1.70e-41 1.70e-41 1.20e-80 1.20e-80 1.20e-80 1.20e-80 1.40e-76 1.40e-76 1.40e-76 2.50e-78 2.50e-78 2.50e-78 2.50e-78 2.50e-78 Example output file from BLAST (cont) wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752 wg_celegans_7508752_drosophila_6537292 wg_celegans_7508752_drosophila_12018324 wg_celegans_7508752_xenopus_422628 wg_celegans_7508752_xenopus_313268 wg_celegans_7508752_xenopus_465484 wg_celegans_7508752_mouse_202406 wg_celegans_7508752_mouse_227507 wg_celegans_7508752_mouse_111253 wg_celegans_7508752_mouse_14789729 wg_celegans_7508752_mouse_6678599 wg_celegans_7508752_mouse_14424475 wg_celegans_7508752_zebrafish_1256778 wg_celegans_7508752_zebrafish_18859567 wg_celegans_7508752_zebrafish_2501662 1.30e-90 1.30e-90 1.10e-96 1.10e-96 1.10e-96 2.40e-96 2.40e-96 2.40e-96 2.40e-96 2.40e-96 2.40e-96 2.30e-94 2.30e-94 2.30e-94 Step 2: Analysis of BLAST Results ie. Metric Determination Metric Determination Common Algorithms used to calculate a distance metric from similarity scores include (1-%Identity) and S = e(-d/m2) (Shepard 1987). A different algorithm is used for this project. Rules Metric must Satisfy The distance between a gene and itself must be zero Dii = 0. Communitive property: Dij = Dji. Triangular inequality: Dij + Dik Djk. Dij j Djk k i Dik Our Algorithm Determine unique gene pool from all the organisms that meet the threshold for a particular gene in pathway. Celegans_17531491 Wg-Drosophila g1 g2 g3 Is g1 Unique? Yes g1 Is g2 unique? No g2 Gene pool g4 g3 g2 g4 Gene Vectors Drosophila Mouse Zebrafish... Genepool of entire Wnt pathway g1 g2 g3 . . . gn 1 0 0 . . . 1 0 1 1 . . . 0 0 0 0 . . . 1 No Homolog of gn found in Mouse. Homologous gn found in Zebrafish Euclidean Distance Vectors are in N dimensional space Determine Euclidean Distance by taking the root of the differences squared. Dij = = (Di1-Dj1)2 + …+ (Din-Djn)2 (1-0)2 + (1-1)2 + (0-1)2 + … Distance Matrix O1 O2 O3 . . . . . . . . . On O1 O2 O3 . . . On 0 D21 0 D31 D32 0 Since Euclidean distances commute Matrix is Triangular. . . . Dn1 0 Step 3: Hierarchical Clustering Hierarchical Clustering There are two types of clustering: Successive Fusions (Agglomerative Clustering) Separation (Divisive Clustering) Hierarchical Clustering In this project, agglomerative clustering algorithm has been employed Idea: The most similar objects are first grouped. These are then merged according to their similarities, until all are fused into one single cluster Hierarchical Clustering Input of the clustering program: Any N x N triangular matrix containing the pairwise distances between the organisms D = {djk} Hierarchical Clustering Feed the NXN matrix as the input and the clustering method will output a (N-1)X(N-1) matrix In this case, it will be a 4X4 matrix: Hierarchical Clustering New Distances are determined between the new group and each of the remaining organisms Hierarchical Clustering d d Continue with the clustering until all the organisms fused into one cluster = min {d (35)2, d 12} = min{7, 9} = 7 (135)4 = min {d (35)4, d 14} = min{8,6} = 6 (135)2 D(135)(24) = min {d (135)(2) , d(135)(4) } = min {7, 6} = 6 Hierarchical Clustering The outputs of each run: The names of the organisms that are grouped together The distance between the two organisms After N-1 number of iterations, the outputs are saved to a file and they will be used to draw the phylogenetics tree. Step 4: Phylogenetic Tree Construction Tree Construction Sample input Flat file of clusters and distances e.g. sample1.txt A B 4.5 B C 5.2 E D 5.8 C E 12.4 Or A A E A e.g. sample2.txt B 4.5 BC 5.2 D 5.8 BC E D 12.4 Tree Construction Sample Input (continued) Requirements for input file: Each line must represent one cluster First entries are leaves in the cluster Last entry is the distance No more than two new leaves can be added in a cluster Each entry must be delimited by a tab Flexibility File can have all leaves in the cluster or a new leaf and any leaf from previous clusters Subroutine can be reuse to generate tree from any file by modifiying one line of code Tree Construction Method Subclass intree Read in file line by line Add new leaves to Vector leaves Array elts tracks the number of leaves added to the vector in each cluster Array d tracks the distances between elements in the cluster Subclass treed Convert distances to pixels Draw tree and leaves in Jframe Draw scale of distance in Jframe Tree Construction Sample Output A B C D E 1 2 3 4 5 6 7 8