
CANDID: A candidate gene identification tool
Part 2
Janna Hutz
March 26, 2007

Review
• Literature
  – Well-characterized genes
• Protein domains
  – All genes
• Cross-species conservation
  – All genes

Today’s agenda
• Expression levels
• Linkage data
• Association data
• CANDID performance measures

Candidate lists vs. single candidates
• Candidate lists
  – Complex trait or disease
  – Disease with known heterogeneity
• Single candidates
  – Mendelian trait
  – New disease
  – Disease with clear, well-defined pathology

Candidate lists vs. single candidates
• Microarray
• SNP typing
• Sequencing
• Immunocytochemistry
• Knockout model
[Figure: SNP example, ACT[A/G]GGA]

Example 4
• Goiter - thyroid gland problem
• Iodine deficiency
• Genetic causes

Example 4
• Iodine is not supplied
• Iodine is present, but is not added to the molecule
• Which gene is mutated?

Expression data
• We know what tissue our gene is expressed in (thyroid).
• How can we use this knowledge to help identify the candidate?
• Wouldn’t it be nice if we had an expression database?

Expression databases
• Our ideal expression database would have:
  – Expression data for the same genes across many different tissues
  – As many tissues as possible
  – As many genes as possible
  – Good documentation
• Gene Atlas

Gene Atlas
• Genomics Institute of the Novartis Research Foundation
• 79 human tissues (160 samples)
• 2 arrays
  – Affymetrix HG-U133A
  – GNF1H (custom)
• 17,809 genes

Measure of gene expression
• Our thyroid gene:
  – Gene that is brightest on the thyroid array?
  – Gene that is brightest on the thyroid array, compared to all the other arrays.
[Figure: arrays for heart, brain, thyroid, lung]

Measures of gene expression
• Run CANDID, specifying that we’re interested in the thyroid.
  http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html
  User name: workshop
  Password: perl031907
• (We’ll need a tissue code for that.)

Example 4 - Results
• Our favorite genes:
• TP53 - rank is…
  – 16314th
• KRAS - rank is…
  – 5229th
• What genes are ranked most highly?
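The two candidate measures from the "Measure of gene expression" slide can be contrasted with a small sketch. This is only an illustration: the gene names and intensity values are made up, and the "relative" measure used here (a tissue's share of the gene's total signal) is an assumption chosen for simplicity, not CANDID's actual expression score.

```python
# Hypothetical array intensities (arbitrary units); NOT Gene Atlas data.
expression = {
    "GENE_A": {"heart": 120.0, "brain": 90.0, "thyroid": 2400.0, "lung": 150.0},
    "GENE_B": {"heart": 5000.0, "brain": 5200.0, "thyroid": 4800.0, "lung": 5100.0},
}


def raw_signal(profile, tissue):
    """Measure 1: how bright the gene is on the tissue's own array."""
    return profile[tissue]


def relative_signal(profile, tissue):
    """Measure 2 (assumed form): the tissue's share of the gene's total signal."""
    total = sum(profile.values())
    return profile[tissue] / total if total > 0 else 0.0


def rank_genes(expression, tissue, measure):
    """Return gene names sorted from highest to lowest score for one tissue."""
    return sorted(
        expression,
        key=lambda gene: measure(expression[gene], tissue),
        reverse=True,
    )


by_raw = rank_genes(expression, "thyroid", raw_signal)
by_relative = rank_genes(expression, "thyroid", relative_signal)
print("Ranked by raw thyroid signal:     ", by_raw)
print("Ranked by relative thyroid signal:", by_relative)
```

In this toy data, GENE_B is brighter on the thyroid array but is equally bright everywhere, while GENE_A is dimmer overall but concentrated in the thyroid, so the two measures rank them in opposite order.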
Example 4 - Results
• 192 genes with expression score of 1
• The TOP gene is actually responsible for the phenotype described earlier
  – Its expression score = 1

Prior evidence
• I’m not interested in examining all of the genes in the genome - just some of them.
• Linkage and association

Linkage
• CANDID can:
  – Weight regions with higher LOD scores
  – Limit analysis to certain regions
  – How does it do this?

Linkage scoring
$$\text{linkage score} = \frac{\text{gene's LOD score}}{\text{maximum genome-wide LOD score}}$$

Linkage files
• How does CANDID get this linkage information?
• CANDID takes two kinds of files
  – Unformatted output from GENEHUNTER and MERLIN
  – Custom linkage files

Custom linkage files
• Simple format
• Line 1 of the file must contain the word “custom” somewhere
• Subsequent lines: Chromosome (tab) cM (tab) LOD score
• But how do I get cM positions?

Mapmaker
• Inputs file as: Chromosome (tab) basepair (tab) LOD score
• Outputs new file in the format: Chromosome (tab) cM (tab) LOD score
• Will be available on the CANDID website soon

Example 5: pancreatic cancer
• Deletion on chromosome 13 between 23.65 cM and 25.08 cM.

Creating a custom linkage file
• Example:
  custom
  13    23.64    0
  13    23.65    3
  13    25.08    3
  13    25.08    0
[Figure: interval marked at 23.65 and 25.08]

Running CANDID
1. Try running CANDID using only the linkage criterion.
2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords)
  • Linkage weight = 1000
  • Literature weight = 1

Results
• From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

But linkage is so last season…

Association
• Increasing numbers of association studies
• Increasing numbers of SNPs in each study
• Can CANDID use this information, too?
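The custom linkage file format and the linkage scoring ratio from the linkage slides can be sketched as follows. This is a minimal illustration, not CANDID's code: the function names are hypothetical, the file is assumed to be sorted by position, and assigning a gene its LOD by linear interpolation between flanking points is an assumption (the slides do not say how a gene's LOD is looked up).

```python
EXAMPLE_FILE = "custom\n13\t23.64\t0\n13\t23.65\t3\n13\t25.08\t3\n13\t25.08\t0\n"


def parse_custom_linkage(text):
    """Parse 'Chromosome<TAB>cM<TAB>LOD' lines after a first line containing 'custom'."""
    lines = text.strip().splitlines()
    if not lines or "custom" not in lines[0]:
        raise ValueError("line 1 must contain the word 'custom'")
    points = []
    for line in lines[1:]:
        if not line.strip():
            continue
        chrom, cm, lod = line.split("\t")
        points.append((chrom.strip(), float(cm), float(lod)))
    return points


def lod_at(points, chrom, cm):
    """Assumed lookup: interpolate LOD at a position; 0 outside the covered range."""
    pts = [(c, l) for ch, c, l in points if ch == chrom]
    exact = [l for c, l in pts if c == cm]
    if exact:
        return max(exact)
    left = [(c, l) for c, l in pts if c < cm]
    right = [(c, l) for c, l in pts if c > cm]
    if not left or not right:
        return 0.0
    (c0, l0), (c1, l1) = left[-1], right[0]  # relies on the file being sorted
    return l0 + (l1 - l0) * (cm - c0) / (c1 - c0)


def linkage_score(gene_lod, max_lod):
    """Gene's LOD score divided by the maximum genome-wide LOD score."""
    return gene_lod / max_lod if max_lod > 0 else 0.0


points = parse_custom_linkage(EXAMPLE_FILE)
max_lod = max(lod for _, _, lod in points)
# Hypothetical gene positions in cM, chosen only to show inside vs. outside the region.
for name, cm in [("GENE_INSIDE", 24.0), ("GENE_OUTSIDE", 30.0)]:
    print(name, linkage_score(lod_at(points, "13", cm), max_lod))
```

With the Example 5 file, a gene inside the deleted interval scores 1 and a gene outside it scores 0, which is why a large linkage weight effectively restricts the analysis to that region.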
Association
• Database
  – dbSNP - 11.8 million human SNPs
  – Includes HapMap SNPs
  – Most comprehensive
  – Each SNP has a number prefixed with “rs”

Association
• How does CANDID accept association data?
• Custom file format - each line is: rs# (tab) p-value

Association scoring
• For each gene, take the best p-value for that gene’s SNPs
• Subtract that p-value from 1
• Unless you test SNPs in every gene, this can be kind of unfair…

Association scoring
• Tested 10 genes
• Gene 9 has a best p-value of 0.8 (bad)
• Gene X was not tested
• Should Gene 9 get a higher overall score than Gene X?

p-value threshold
• User defines a p-value threshold
• Let’s say it’s 0.1.
• Any SNPs with p-values above 0.1 are not considered.
• Now Gene 9 and Gene X have the same score (0).

Example 6
• Age-related Eye Disease Study
• Macular degeneration

Example 6
• Make custom association file
  rs3753396    0.0444
  rs543879     0.0494
  rs7724788    0.75
• Run CANDID with this association file

Results
  rs3753396    0.0444    } CFH
  rs543879     0.0494    }
  rs7724788    0.75      } SLC25A46

So just how well does this work anyway?

Preliminary evidence
• Online Mendelian Inheritance in Man
• 154 diseases linked to chromosome 1
• Literature, domains - chose keywords
• Conservation
• Expression - chose tissue codes

Ideal weights
• Tested all combinations of weights in those 4 categories
  – Possible weights: (0, 0.1, … , 0.9, 1)
• Which weight combination was the best, across all 154 diseases?

Top 10 weight combinations
1. Literature = 1, everything else = 0
2. Literature = 0.9, everything else = 0
3. Literature = 0.8, everything else = 0
4. Literature = 0.7, everything else = 0
5. …
10. Literature = 0.1, everything else = 0
11. Literature = 1, domains = 0.1

More specifics
• Literature only: average ranking = 425
  – 425/38697 = 98.9th percentile
  – 44/154 genes ranked #1 for at least one set of weights
• Chromosome 1: average ranking = 22
  – 22/2280 = 99th percentile
  – 84/154 genes ranked #1 for at least one set of weights

Analysis of results
• They make a lot of sense.
• Genes in OMIM are, by definition, well-characterized.
• Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

Next steps
• Separate OMIM analysis into simple and complex traits
  – Get new ideal weights
• See how well these ideal weights do in ranking candidates from chromosome 2.

Next steps
• CANDID’s databases were last compiled in November 2006.
• Find publications that have come out since then.
• How well does CANDID do in ranking those genes?

Next steps
• Many new whole-genome studies and microarray studies implicate lists of candidates.
• If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?

Next steps
• Any other suggestions?
• Any interesting data you have?

Any questions?

Acknowledgments
• Mike Province
• Howard McLeod
• Aldi Kraja
• Ingrid Borecki
• Qunyuan Zhang
• Ryan Christensen
• John Martin
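The association scoring rule from the earlier slides (best p-value per gene, subtracted from 1, with an optional user-defined threshold) can be sketched as below. This is a hedged illustration rather than CANDID's implementation: the function names are hypothetical, and the SNP-to-gene grouping is taken from the Example 6 results slide as read here (the first two SNPs under CFH, the third under SLC25A46); "GENE_X" stands in for any untested gene.

```python
ASSOCIATION_FILE = "rs3753396\t0.0444\nrs543879\t0.0494\nrs7724788\t0.75\n"

# SNP-to-gene grouping as read from the Example 6 results slide (an interpretation).
GENE_SNPS = {
    "CFH": ["rs3753396", "rs543879"],
    "SLC25A46": ["rs7724788"],
    "GENE_X": [],  # hypothetical gene with no tested SNPs
}


def parse_association(text):
    """Parse 'rs#<TAB>p-value' lines into a dict of SNP id -> p-value."""
    pvalues = {}
    for line in text.strip().splitlines():
        rs_id, p = line.split("\t")
        pvalues[rs_id.strip()] = float(p)
    return pvalues


def association_score(pvalues, snps, threshold=None):
    """Return 1 - best p-value for the gene's SNPs, or 0 if none qualify.

    SNPs with p-values above `threshold` are ignored, so a tested gene with
    only poor p-values scores the same (0) as a gene that was never tested.
    """
    tested = [pvalues[rs] for rs in snps if rs in pvalues]
    if threshold is not None:
        tested = [p for p in tested if p <= threshold]
    if not tested:
        return 0.0
    return 1.0 - min(tested)


pvalues = parse_association(ASSOCIATION_FILE)
for gene, snps in GENE_SNPS.items():
    print(
        gene,
        "no threshold:", association_score(pvalues, snps),
        "threshold 0.1:", association_score(pvalues, snps, threshold=0.1),
    )
```

Without a threshold, SLC25A46 (best p-value 0.75) would still outscore an untested gene; with the 0.1 threshold from the slide, both drop to 0 while CFH keeps its score.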
