Download D - mbg

Transcript
COURSE OF BIOINFORMATICS A.A. 2013-2014
PHYLOGENY - PHYLOGENETICS
ANTONELLA LISA, IGM-CNR PAVIA

PHYLOGENY IS A DISCIPLINE THAT TRIES TO RECONSTRUCT THE HISTORY OF LIFE BY GROUPING ORGANISMS ACCORDING TO THEIR LEVEL OF SIMILARITY. THE MORE SIMILAR TWO SPECIES ARE, THE CLOSER THEY ARE TO THEIR COMMON ANCESTOR.

CHARACTERS              STATE
SEEDS                   0(NO)/1(YES)
SOFT INSIDE             0(NO)/1(YES)
SMALL SEEDS             0(NO)/1(YES)
LARGE CENTRAL STONE     0(NO)/1(YES)
THICK SKIN              0(NO)/1(YES)
SEGMENTED               0(NO)/1(YES)

ww.amnh.org/ology/features/treeoflife/pages/howtoreadclado.php

EVOLUTIONARY TREE
[FIGURE: EVOLUTIONARY TREE DRAWN ALONG A TIME AXIS]
NODES = SPECIATION EVENTS IN EVOLUTION
BRANCHES = PHYLOGENETIC RELATIONSHIPS
OTU = OPERATIONAL TAXONOMIC UNIT

MANY PHYLOGENIES ALSO INCLUDE AN OUTGROUP. ALL THE MEMBERS OF THE GROUP OF INTEREST ARE MORE CLOSELY RELATED TO EACH OTHER THAN THEY ARE TO THE OUTGROUP, SO THE OUTGROUP STEMS FROM THE BASE OF THE TREE.

EVOLUTIONARY TREES DEPICT CLADES. A CLADE IS A GROUP OF ORGANISMS THAT INCLUDES AN ANCESTOR AND ALL DESCENDANTS OF THAT ANCESTOR. YOU CAN THINK OF A CLADE AS A BRANCH ON THE TREE OF LIFE.
http://medsocnet.ncsa.illinois.edu/MSSW/moodle/AuthTut/vpage_beta.php?Cd=218&&pid=1055

THE BRANCHING PATTERN OF A TREE IS CALLED ITS TOPOLOGY.

UNROOTED TREE
AN UNROOTED TREE ONLY SPECIFIES THE RELATIONSHIPS AMONG THE OTUS BUT DOES NOT DEFINE THE EVOLUTIONARY PATH. UNROOTED TREES DO NOT MAKE ANY ASSUMPTIONS OR REQUIRE KNOWLEDGE ABOUT COMMON ANCESTORS.
[FIGURE: UNROOTED TREE OF OTUS 1-8]

ROOTED TREE
[FIGURE: ROOTED TREE OF OTUS 1-8, WITH THE ROOT AND A TIME AXIS LABELED]
IN A ROOTED TREE THERE IS A NODE, CALLED THE ROOT, FROM WHICH A UNIQUE PATH LEADS TO ANY OTHER NODE. THE DIRECTION OF EACH PATH CORRESPONDS TO EVOLUTIONARY TIME, AND THE ROOT IS THE COMMON ANCESTOR OF ALL THE OTUS UNDER STUDY. ANY UNROOTED TREE BECOMES ROOTED BY INTRODUCING AN APPROPRIATE OUTGROUP.

PHYLOGENETICS
PHYLOGENETICS IS THE COMPARISON OF EQUIVALENT GENES TO:
•  RECONSTRUCT A GENEALOGICAL TREE OF THE SPECIES
•  DETERMINE THE CLOSEST RELATIVES OF THE ORGANISM YOU ARE INTERESTED IN
•  DISCOVER THE FUNCTION OF A GENE
•  RETRACE THE ORIGIN OF A GENE

RELATIONSHIP BETWEEN AN ANCESTRAL SEQUENCE AND ITS DESCENDANTS (EVOLUTION OF A FAMILY OF SEQUENCES)
[FIGURE: EVOLUTIONARY TREE OF NS5B SEQUENCES OF HCV GENOTYPES (Tamura et al. 2007)]

HOW MANY TREES!!!!!
NUMBER OF ROOTED TREES (NR) AND UNROOTED TREES (NU):

$$N_R = \frac{(2m-3)!}{2^{m-2}\,(m-2)!} \qquad N_U = \frac{(2m-5)!}{2^{m-3}\,(m-3)!}$$
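The two counting formulas can be checked numerically. The sketch below is only an illustration: the function names `rooted_trees` and `unrooted_trees` are hypothetical and not part of the slides, and `m` is the number of OTUs (the rooted formula needs m >= 2, the unrooted one m >= 3).

```python
from math import factorial


def rooted_trees(m: int) -> int:
    """Number of rooted bifurcating trees for m OTUs: (2m-3)! / (2^(m-2) (m-2)!)."""
    if m < 2:
        raise ValueError("the rooted formula needs at least 2 OTUs")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


def unrooted_trees(m: int) -> int:
    """Number of unrooted bifurcating trees for m OTUs: (2m-5)! / (2^(m-3) (m-3)!)."""
    if m < 3:
        raise ValueError("the unrooted formula needs at least 3 OTUs")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))


# Print one row per number of OTUs: m, rooted count, unrooted count.
for m in range(3, 11):
    print(m, rooted_trees(m), unrooted_trees(m))
```

Note that the number of unrooted trees for m OTUs equals the number of rooted trees for m - 1 OTUs, because rooting a tree amounts to adding one more branch, which is what an outgroup does.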
m = NUMBER OF OTUS

POSSIBLE NUMBER OF TREES
# OTU    ROOTED      UNROOTED
3        3           1
4        15          3
5        105         15
6        945         105
7        10395       945
8        135135      10395
9        2027025     135135
10       34459425    2027025

DIFFERENT KINDS OF SEQUENCES = DIFFERENT EVOLUTIONARY INFORMATION

ORTHOLOGOUS AND PARALOGOUS GENES
[FIGURE: AN ANCESTRAL GENE DUPLICATES INTO GENE A AND GENE B. SPECIES 1 CARRIES GENE A1 AND GENE B1, SPECIES 2 CARRIES GENE A2 AND GENE B2. A1/A2 AND B1/B2 ARE ORTHOLOGOUS; A1/B1 AND A2/B2 ARE PARALOGOUS.]

HOMOLOGOUS GENES ARE GENES WITH A COMMON ANCESTOR
HOMOLOGOUS GENES = GENE TREE
ORTHOLOGOUS GENES = SPECIES TREE
PARALOGOUS GENES = GENE FAMILY TREE

IT IS DIFFICULT TO ESTABLISH WHETHER TWO GENES ARE ORTHOLOGOUS.
HIGH DIVERGENCE EVOLUTION DUPLICATION EVENTS GENE CONVERSION

IF IT IS SO DIFFICULT, WHY DO WE USE MOLECULAR PHYLOGENY?
•  THERE IS MORE INFORMATION AT THE MOLECULAR THAN AT THE MORPHOLOGICAL LEVEL
•  ANY DIFFERENCE INSIDE A MACROMOLECULE CAN BE CONSIDERED INDEPENDENT
•  MACROMOLECULES ARE LESS INFLUENCED BY THE ENVIRONMENT
•  CHARACTER CLASSIFICATION IS SIMPLER
•  SOME MOLECULES EVOLVE AT A CONSTANT RATE (TIME OF DIVERGENCE)

TO PERFORM PHYLOGENETIC ANALYSIS:
START FROM PROTEIN SEQUENCES, WHICH PERFORM BETTER IN THE MULTIPLE ALIGNMENT.
TRANSLATE THE AMINO ACID SEQUENCES BACK INTO DNA, MAINTAINING THE PREVIOUS ALIGNMENT.
SUBMIT THE ALIGNED SEQUENCES TO FURTHER ANALYSIS.

METHODS OF TREE RECONSTRUCTION
1.  DISTANCE MATRIX METHODS
2.  MAXIMUM PARSIMONY METHODS
3.  MAXIMUM LIKELIHOOD METHODS
4.  METHODS OF INVARIANTS

DISTANCE MATRIX METHODS
IN DISTANCE MATRIX METHODS, EVOLUTIONARY DISTANCES (USUALLY THE NUMBER OF NUCLEOTIDE OR AMINO ACID SUBSTITUTIONS BETWEEN SEQUENCES) ARE COMPUTED FOR ALL PAIRS OF TAXA, AND A PHYLOGENETIC TREE IS CONSTRUCTED BY USING AN ALGORITHM BASED ON SOME FUNCTIONAL RELATIONSHIPS AMONG THE DISTANCE VALUES.

UPGMA
UNWEIGHTED PAIR-GROUP METHOD WITH ARITHMETIC MEAN
UPGMA IS THE SIMPLEST METHOD FOR TREE RECONSTRUCTION.
IT CAN BE USED TO CONSTRUCT PHYLOGENETIC TREES IF THE RATES OF EVOLUTION ARE APPROXIMATELY CONSTANT AMONG THE DIFFERENT LINEAGES, SO THAT AN APPROXIMATELY LINEAR RELATION EXISTS BETWEEN EVOLUTIONARY DISTANCE AND DIVERGENCE TIME. THE UPGMA METHOD EMPLOYS A SEQUENTIAL CLUSTERING ALGORITHM.

OTU    A      B      C
B      dAB
C      dAC    dBC
D      dAD    dBD    dCD

OTU    A      B      C
B      0.23
C      0.87   0.59
D      0.73   1.12   1.17

1.  THE SMALLEST DISTANCE IS BETWEEN A AND B
2.  A AND B ARE SINGLED OUT AND JOINED INTO THE CLUSTER (AB)
3.  RECALCULATE THE DISTANCE MATRIX

$$\text{branching point of A and B} = \frac{d_{AB}}{2} = \frac{0.23}{2} = 0.115$$

$$d_{(AB)C} = \frac{d_{AC} + d_{BC}}{2} = \frac{0.87 + 0.59}{2} = 0.73$$

$$d_{(AB)D} = \frac{d_{AD} + d_{BD}}{2} = \frac{0.73 + 1.12}{2} = 0.925$$

OTU    (AB)      C
C      d(AB)C
D      d(AB)D    dCD

OTU    (AB)      C
C      0.730
D      0.925     1.17

THE SMALLEST DISTANCE IS NOW BETWEEN (AB) AND C:

$$\text{branching point of (AB) and C} = \frac{d_{(AB)C}}{2} = \frac{0.73}{2} = 0.365$$

$$d_{(ABC)D} = \frac{d_{AD} + d_{BD} + d_{CD}}{3} = \frac{0.73 + 1.12 + 1.17}{3} = 1.007$$

OTU    (ABC)
D      1.007

$$\text{root of the tree} = \frac{d_{(ABC)D}}{2} = \frac{1.007}{2} = 0.503$$

[FIGURE: RESULTING UPGMA TREE JOINING A AND B FIRST, THEN C, THEN D]
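The clustering steps of the worked example above can be reproduced with a short script. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `upgma` and the dictionary-based distance table are illustrative choices, and the only data used are the six pairwise distances from the example matrix.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA and return a list of (cluster_a, cluster_b, height).

    names: list of OTU labels.
    dist:  dict mapping frozenset({x, y}) to the distance between OTUs x and y.
    Height is the branching point, i.e. half the distance between the merged clusters.
    """
    clusters = [(n,) for n in names]  # each cluster is a tuple of its member OTUs
    d = {frozenset([(a,), (b,)]): dist[frozenset([a, b])]
         for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset([a, b])] / 2))
        new = a + b
        # Size-weighted average of the two old rows, which equals the
        # arithmetic mean over all OTU pairs between the two clusters.
        for c in clusters:
            if c in (a, b):
                continue
            d[frozenset([new, c])] = (
                len(a) * d[frozenset([a, c])] + len(b) * d[frozenset([b, c])]
            ) / len(new)
        clusters = [c for c in clusters if c not in (a, b)] + [new]
    return merges


# Pairwise distances from the worked example.
example = {
    frozenset("AB"): 0.23, frozenset("AC"): 0.87, frozenset("BC"): 0.59,
    frozenset("AD"): 0.73, frozenset("BD"): 1.12, frozenset("CD"): 1.17,
}
merges = upgma(["A", "B", "C", "D"], example)
for a, b, height in merges:
    print("+".join(a), "joins", "+".join(b), "at height", round(height, 3))
```

Each reported height is half the distance between the two clusters being joined, so the three merges land at 0.115, 0.365 and 0.503.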
NEIGHBOR-JOINING METHOD (N-J)
IN AN UNROOTED BIFURCATING TREE, TWO OTUS ARE SAID TO BE NEIGHBORS IF THEY ARE CONNECTED THROUGH A SINGLE INTERNAL NODE.
[FIGURE: EXAMPLE UNROOTED TREES WITH OTUS A-D AND A-E]

THE N-J METHOD FINDS NEIGHBORS SEQUENTIALLY SO THAT THE TOTAL LENGTH OF THE TREE IS MINIMIZED. ANY PAIR OF OTUS CAN BE SEPARATED, AND THERE ARE N(N-1)/2 WAYS OF CHOOSING THEM. AMONG THESE POSSIBLE PAIRS OF OTUS, THE ONE THAT GIVES THE SMALLEST SUM OF BRANCH LENGTHS IS CHOSEN. THIS PAIR OF OTUS IS REGARDED AS A SINGLE OTU, AND THE ARITHMETIC MEAN DISTANCES BETWEEN OTUS ARE COMPUTED TO FORM A NEW DISTANCE MATRIX. THE NEXT PAIR OF OTUS THAT GIVES THE SMALLEST SUM OF BRANCH LENGTHS IS THEN CHOSEN. THIS PROCEDURE IS CONTINUED UNTIL ALL N-3 INTERIOR BRANCHES ARE FOUND.

RELIABILITY TESTS
NOT ONLY CAN TREES BE ESTIMATED, BUT THEIR RELIABILITY OR ROBUSTNESS (I.E., ACCURACY) CAN BE EVALUATED AS WELL. RELIABILITY REFERS TO THE PROBABILITY THAT MEMBERS OF A CLADE WILL BE PART OF THE TRUE TREE. BOOTSTRAPPING IS THE MOST COMMON RELIABILITY TEST.

BOOTSTRAP
POSITIONS OF THE ALIGNED SEQUENCES ARE RANDOMLY SAMPLED FROM THE MULTIPLE SEQUENCE ALIGNMENT WITH REPLACEMENT. IT IS POSSIBLE THAT SOME POSITIONS WILL BE REPEATED IN THE SUBSAMPLE, WHILE SOME POSITIONS WILL BE LEFT OUT. THE SAMPLED POSITIONS ARE ASSEMBLED INTO NEW DATA SETS, THE SO-CALLED BOOTSTRAPPED SAMPLES. A NEW TREE IS RECONSTRUCTED FOR EACH NEW DATA SET. THE BOOTSTRAP SCORE COUNTS HOW MANY TIMES EACH GROUPING FROM THE ORIGINAL TREE OCCURS IN THE SAMPLE TREES.
•  THE CLOSER THE SCORE IS TO 100, THE MORE SIGNIFICANT THE GROUPING.
•  BOOTSTRAPPING CAN BE USED WITH DISTANCE, PARSIMONY AND LIKELIHOOD METHODS.
•  BELOW, THE BOOTSTRAP SCORES FOR PARTICULAR INTERNAL BRANCHES ARE SHOWN.

WHICH METHOD SHOULD BE USED? ALL THAT YOU CAN!
•  EACH METHOD HAS ITS OWN STRENGTHS
•  USE MULTIPLE METHODS FOR CROSS-VALIDATION
•  IN SOME CASES, NONE OF THE THREE GIVES THE CORRECT PHYLOGENY!

DETERMINE THE RELATEDNESS OF EIGHT VERTEBRATE SPECIES BASED ON DIFFERENCES IN THE BETA CHAIN OF THEIR HEMOGLOBIN

>Homo
VHLTPEKSAVTAGNVDEVGEGLVTQFESDLSTPDAVMGPK
>Rhesus
VHLTPEKNAVTTGNVDEVGEGLLTQFESDLSSPDAVMGPK
>Mouse
VHLTDAKAAVSCGNSDEVGEGLVTQYDSDLSSASAIMGAK
>Rat
VHLTDAKAAVNGGNPDDVGEGLVTQYDSDLSSASAIMGPK
>Duck
VHWTAEKQLITGGNVADCAEALITQFASNLSSPTAILGPM
>Goose
VHWTAEKQLITGGNVADCAEALITQFSSNLSSPTAILGPM
>Crocodile
ASFDPHKQLIGDHDVAHCGESMIKRYENDISNAQAIMHEK
>Alligator
ASFDAHRKFIVDADVAQCADSMIKRYEHKMCNAHDILHSK

Kosinski, R.J. 2006. An introduction to phylogenetic analysis. Pages 57-106, in Tested Studies for Laboratory Teaching, Volume 27 (M.A. O'Donnell, Editor).

SIMILARITY MATRIX (Identity matrix)
DISTANCE MATRIX (= 1 - similarity matrix)

ACCORDING TO THE DATA, WHICH TWO SPECIES ARE MOST CLOSELY RELATED?
WHY DO WE THINK THESE TWO LINEAGES DIVERGED THE SHORTEST TIME AGO?
WHICH SPECIES IS MOST CLOSELY RELATED TO HUMANS?
WHICH IS MOST DISTANTLY RELATED TO HUMANS?
***** DIFFERENCES BETWEEN LINEAGES START TO ACCUMULATE ONCE INTERBREEDING STOPS *****

Tcoffee: http://tcoffee.crg.cat/apps/tcoffee/index.html

CLUSTAL W (1.83) multiple sequence alignment

Homo         VHLTPEKSAVTAGNVDEVGEGLVTQFESDLSTPDAVMGPK 40
Rhesus       VHLTPEKNAVTTGNVDEVGEGLLTQFESDLSSPDAVMGPK 40
Mouse        VHLTDAKAAVSCGNSDEVGEGLVTQYDSDLSSASAIMGAK 40
Rat          VHLTDAKAAVNGGNPDDVGEGLVTQYDSDLSSASAIMGPK 40
Duck         VHWTAEKQLITGGNVADCAEALITQFASNLSSPTAILGPM 40
Goose        VHWTAEKQLITGGNVADCAEALITQFSSNLSSPTAILGPM 40
Crocodile    ASFDPHKQLIGDHDVAHCGESMIKRYENDISNAQAIMHEK 40
Alligator    ASFDAHRKFIVDADVAQCADSMIKRYEHKMCNAHDILHSK 40
. : : : . .:.::.:: .:... ::
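The similarity and distance matrices asked for in the hemoglobin exercise can be computed directly from the alignment. The sketch below assumes the simplest possible measure, the fraction of identical residues per aligned column, with distance = 1 - identity as stated in the exercise. The helper names `identity` and `nearest_to` are hypothetical and not part of the course material.

```python
from itertools import combinations

# Aligned beta-hemoglobin fragments from the exercise (40 columns, no gaps).
ALIGNMENT = {
    "Homo":      "VHLTPEKSAVTAGNVDEVGEGLVTQFESDLSTPDAVMGPK",
    "Rhesus":    "VHLTPEKNAVTTGNVDEVGEGLLTQFESDLSSPDAVMGPK",
    "Mouse":     "VHLTDAKAAVSCGNSDEVGEGLVTQYDSDLSSASAIMGAK",
    "Rat":       "VHLTDAKAAVNGGNPDDVGEGLVTQYDSDLSSASAIMGPK",
    "Duck":      "VHWTAEKQLITGGNVADCAEALITQFASNLSSPTAILGPM",
    "Goose":     "VHWTAEKQLITGGNVADCAEALITQFSSNLSSPTAILGPM",
    "Crocodile": "ASFDPHKQLIGDHDVAHCGESMIKRYENDISNAQAIMHEK",
    "Alligator": "ASFDAHRKFIVDADVAQCADSMIKRYEHKMCNAHDILHSK",
}


def identity(a: str, b: str) -> float:
    """Fraction of aligned columns in which both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Distance = 1 - similarity, for every unordered pair of species.
distance = {
    (s, t): 1.0 - identity(ALIGNMENT[s], ALIGNMENT[t])
    for s, t in combinations(ALIGNMENT, 2)
}

# The most closely related pair is the one with the smallest distance.
closest_pair = min(distance, key=distance.get)


def nearest_to(species: str) -> str:
    """Return the species with the smallest distance to the given one."""
    others = {
        (set(pair) - {species}).pop(): d
        for pair, d in distance.items()
        if species in pair
    }
    return min(others, key=others.get)


print(closest_pair, round(distance[closest_pair], 3))
print("closest to Homo:", nearest_to("Homo"))
```

On these 40 residues, Duck and Goose differ at a single position, and Homo and Rhesus differ at four. An identity-based distance ignores multiple substitutions at the same site, so it is only a first approximation of evolutionary distance.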
CLUSTALW2 PHYLOGENY
[FIGURE: CLADOGRAM AND PHYLOGRAM]

HTTP://WWW.PHYLOGENY.FR/
SHEEP MOOSE GIRAFFE CHEVROTAIN BELUGA SPERM_WHALE RORQUAL PIG PECCARY DROMEDARY TAPIR HORSE HYENA COYOTE HIPPO

GAMMA FIBRINOGEN ACCESSION CODES
O12957 O02672 O02683 O02690 O02681 O02687 O02673 O02688 O12959 O02677 O02689 O02682 O02676 O02680 O12954

3-HYDROXY-3-METHYLGLUTARYL COENZYME A SYNTHASE
XENOPUS1 MOUSE1 BOS1 HOMO1 GALLUS1 ZEBRAFISH1 HOMO2 RATTUS2 MOUSE2 GALLUS2 DROSOPHILA ARABIDOPSIS YEAST
AAI59180 AAH29693.1 NP_001193507 NP_002121.4 NP_990742.1 XP_005155547.1 CAG33131.1 NP_775117.2 NP_032282.2 XP_422225.3 NP_524711 AEE83053.1 P54874.1

A DUPLICATION EVENT
THE FORMATION OF CARBON-CARBON BONDS VIA AN ACYL-ENZYME INTERMEDIATE PLAYS A CENTRAL ROLE IN FATTY ACID, POLYKETIDE, AND ISOPRENOID BIOSYNTHESIS. UNIQUELY AMONG CONDENSING ENZYMES, 3-HYDROXY-3-METHYLGLUTARYL (HMG)-COA SYNTHASE (HMGS) CATALYZES THE FORMATION OF A CARBON-CARBON BOND BY ACTIVATING THE METHYL GROUP OF AN ACETYLATED CYSTEINE. THIS REACTION IS ESSENTIAL IN GRAM-POSITIVE BACTERIA, AND REPRESENTS THE FIRST COMMITTED STEP IN HUMAN CHOLESTEROL BIOSYNTHESIS.
HUMAN, MOUSE, RAT, AND CHICKEN HAVE TWO COPIES OF THE HMGCS ENZYME: ONE THAT ACTS IN THE CYTOSOL (HMGCS1) AND ONE THAT ACTS IN THE MITOCHONDRIA (HMGCS2) [31]. BY CONTRAST, FISHES, FROGS, AND SHARKS HAVE A SINGLE COPY OF THE ENZYME.

PHYLODENDRON PHYLOGENETIC TREE PRINTER
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
TREX-ONLINE
http://www.trex.uqam.ca
http://www.phylogeny.fr/
http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/

MAXIMUM PARSIMONY METHODS
THE PRINCIPLE OF MAXIMUM PARSIMONY SEARCHES FOR A TREE THAT REQUIRES THE SMALLEST NUMBER OF EVOLUTIONARY CHANGES TO EXPLAIN THE DIFFERENCES OBSERVED AMONG THE OTUS UNDER STUDY. CHARACTER STATES (E.G. THE NUCLEOTIDE OR AMINO ACID AT A SITE) ARE USED, AND THE SHORTEST PATHWAY LEADING TO THESE CHARACTER STATES IS CHOSEN AS THE BEST TREE.

INFORMATIVE SITES
A NUCLEOTIDE SITE IS PHYLOGENETICALLY INFORMATIVE IF IT FAVORS SOME TREES AND NOT OTHERS.

MAXIMUM LIKELIHOOD METHODS
IN MAXIMUM LIKELIHOOD METHODS ONE COMPUTES, FOR EACH POSSIBLE TREE, THE MAXIMUM LIKELIHOOD (ML) VALUE FOR THE CHARACTER STATE CONFIGURATIONS* AMONG THE SEQUENCES UNDER STUDY, AND CHOOSES THE TREE WITH THE LARGEST ML VALUE AS THE PREFERRED TREE.
*A NUCLEOTIDE CONFIGURATION AT A SITE MEANS THE PATTERN OF NUCLEOTIDE DIFFERENCES AT THAT SITE AMONG THE SEQUENCES INVOLVED.

METHODS OF INVARIANTS
THE METHODS OF INVARIANTS STUDY SOME PARTICULAR FUNCTIONS OF THE CHARACTER STATES THAT HAVE EXPECTED VALUE ZERO UNDER CERTAIN TREES BUT NONZERO EXPECTATION UNDER OTHER TREES.

NEIGHBOR-JOINING
•  USES ONLY PAIRWISE DISTANCES
•  MINIMIZES DISTANCE BETWEEN NEAREST NEIGHBORS
•  VERY FAST
•  EASILY TRAPPED IN LOCAL OPTIMA
•  GOOD FOR GENERATING A TENTATIVE TREE, OR CHOOSING AMONG MULTIPLE TREES

MAXIMUM PARSIMONY
•  USES ONLY SHARED DERIVED CHARACTERS
•  MINIMIZES TOTAL DISTANCE
•  SLOW
•  ASSUMPTIONS FAIL WHEN EVOLUTION IS RAPID
•  BEST OPTION WHEN TRACTABLE (<30 TAXA, HOMOPLASY RARE)
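The parsimony criterion and the idea of an informative site, both described earlier, can be made concrete by counting the minimum number of changes a single site requires on each candidate tree. The sketch below uses Fitch's counting procedure, which the slides do not name. It is an assumption chosen because it is the standard way to score one site on a fixed tree. Trees are written as nested tuples over four hypothetical OTUs A-D, and the site patterns are invented for illustration.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    tree:   an OTU name (leaf) or a pair (left_subtree, right_subtree).
    states: dict mapping each OTU name to its character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Both subtrees can agree on a state: no extra change is needed here.
        return shared, left_changes + right_changes
    # No common state: at least one change must occur below this node.
    return left_set | right_set, left_changes + right_changes + 1


# The three possible groupings of four OTUs.
TREES = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

# Illustrative site patterns (not taken from the slides).
informative_site = {"A": "G", "B": "G", "C": "A", "D": "A"}
uninformative_site = {"A": "G", "B": "A", "C": "A", "D": "A"}

informative_scores = {name: fitch(t, informative_site)[1] for name, t in TREES.items()}
uninformative_scores = {name: fitch(t, uninformative_site)[1] for name, t in TREES.items()}

print(informative_scores)
print(uninformative_scores)
```

The first pattern needs one change on the tree that groups A with B and two changes on the other two trees, so it favors one tree over the others. The second pattern needs exactly one change on every tree and therefore cannot discriminate among them.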