Post-process of IMGAG M.t. 2.0 Release
Affymetrix Medicago Probe set – IMGAG 2.0 / MTGI 8.0 Mapping
Zhao Bioinformatics Lab
Plant Biology Division

IMGAG M.t. 2.0
Data downloaded from ftp://ftpmips.gsf.de/plants/medicago/MT_2_0/MT2.0_medicago_chrX_20080303_NoOverlap.xml.tar.gz
● Summary
- 38,844 TUs and 38,844 models, mapped one to one.
- 38,759 gene names, so 82 models are redundant in gene name.
- For 85 of the 38,844 models, the CDS region is not compatible with the FASTA file.
- 4644 models with 5'-UTR + CDS
- 5846 models with CDS + 3'-UTR
- 11656 models with 5'-UTR + CDS + 3'-UTR
- 16698 models with CDS only

Evidence Code
● F (5036 genes) full coverage/FL-cDNA: The complete gene model from translation start to translation stop is covered by expressed Medicago sequence, e.g. FL-cDNA or EST alignments across the full length of the coding sequence.
● E (14737 genes) expressed/EST matches: Expression of the gene is supported by Medicago EST sequence that matches the gene call (partially).
● H (14209 genes) homology/heterologous: The gene call is supported by similarity to Medicago or other ESTs, protein, FL-cDNA, genomic or other sequences with partial or full-length alignments.
● I (1375 genes) intrinsic/ab initio/inferred/hypothetical: The gene call is based only on intrinsic prediction tools such as FGENESH, Genscan or Eugene, and no significant alignments to other sequences are available. The length of the prediction is greater than 300 bp or there is a significant domain match in Interpro.
● L (3830 genes) 'low quality' gene calls: Gene calls not in F, E, or H, with no significant Interpro domain match and a length less than 300 bp, i.e., unsupported intrinsic predictions of short length that statistically contain many false predictions.
Total genes: 38334 NON-OVERLAPPED genes

Affymetrix Medicago Probe set – IMGAG gene Mapping
Two approaches
● A. Blast-based approach
(1) HSP length / Affymetrix probeset target length >= threshold1
(2) Matching identity length / Max_HSP length >= threshold2
● B. Affy probe-set level matching
(1) IMGAG gene sequences were matched to corresponding Affymetrix probe sets using a position-weighted scoring index in which mismatches near the middle of a probe were most heavily penalized, as follows: (1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1).
(2) A perfect match for a probe yields a score of 45. Matches were declared when at least 8 of the 11 probes in a probe set had scores of 43 or higher.

Statistics on Probe sets
Type                                             Num of probe sets   Percent in the Mtr. set   Notes
Unique probe sets, e.g. Mtr.10097.1.S1_at        44182               86.80                     unique to one gene
alternative (_a_), e.g. Mtr.10267.1.S1_a_at      116                 0.23                      alternative probe sets to one gene
shared (_s_), e.g. Mtr.10146.1.S1_s_at           4793                9.42                      common to multiple genes
others (_x_), e.g. Mtr.10093.1.S1_x_at           1809                3.55                      other probe sets with complicated mapping
Total                                            50900               100

Statistics on Approach A – scenario #1: less stringent
● Affy Probeset Target Blast against IMGAG cDNA
Threshold 1 = 0.7; Threshold 2 = 0.7
Num of cDNA   Matching probe-set   Percent
13717         0                    35.31
10054         1                    25.88
15073         >=2                  38.80
38844         total                100

Num of probe_sets   Matching cDNA   Percent
25190               0               49.49
15223               1               29.91
10487               >=2             20.60
50900               total           100

Statistics on Approach A – scenario #2: Perfect matches
● Affy Probeset Target Blast against IMGAG cDNA
Threshold 1 = 1.0; Threshold 2 = 1.0
Num of cDNA   Matching probe-set   Percent
28169         0                    72.52
8864          1                    22.82
1811          >=2                  4.66
38844         total                100

Num of probe_sets   Matching cDNA   Percent
39593               0               77.79
10344               1               20.32
963                 >=2             1.89
50900               total           100

Statistics of Original probe_set EST mapping
Num of EST   Matching probeset   Percent
6315         0                   17.12
29038        1                   78.74
1525         >=2                 4.14
36878        total               100

Statistics of our probe_set vs. EST mapping
Num of EST   Matching probe-set   Percent
3304         0                    8.96
29535        1                    80.09
4039         >=2                  10.95
36878        total                100
[Bar chart: percent of ESTs matching 0, 1, or 2 probe sets; series: Origin, Ours]
Overlap between our probe-set vs. EST mapping and the Affy original probe-set vs. EST mapping: 37872 ∩ 32108 = 32106.
Our method covered 32106/32108 = 99.9938% of the Affy original mapping.

Statistics on Approach B
● IMGAG cDNA versus Probe_set
Num of cDNA       Matching probe_set   Percent
19961             0                    51.39
12909             1                    33.23
5974 (3134 uni)   >=2                  15.38
38844             total                100

Probe sets map to IMGAG or ESTs
Item    Num of probe_sets   Matched To                                Percent   Total Percent
1       7494                None                                      14.72     14.72
2       21284               TC/EST only                               41.82     EST 41.82 (28.22)
3       14362               12866 TC/EST and unique IMGAGv2           25.28     28.22
                            1496 TC/EST and multiple IMGAGv2 +        2.94
4       7760                6500 Unique IMGAGv2 only                  12.77     IMGAG 15.25
                            1260 Multiple IMGAGv2 only ++             2.48
Total   50900                                                                   100

MTGI 8 vs. IMGAG gene Mapping
● Mt2.0 cDNA BLASTN against MTGI8 (expectation 1e-04);
● Further applied the filters below:
(a) HSP length / Unigene length
(b) Identity length / HSP length
● Result:
9333 (24.0%) cDNA are mapped to 9255 (25.1%) unigene (a>0.9 b>0.9);
11517 (29.6%) cDNA are mapped to 11383 (30.9%) unigene (a>0.8 b>0.8);
13284 (34.2%) cDNA are mapped to 13092 (35.5%) unigene (a>0.7 b>0.7);
9959 (25.64%) cDNA are mapped to 10543 (28.59%) unigene (a>0.8 b>0.95);
13063 (33.63%) cDNA are mapped to 14585 (39.55%) unigene (a>0.5 b>0.95);
● Total cDNA: 38844, Total unigene: 36878

MTGI 8 High Quality TC vs. IMGAG gene Mapping
● I. Retrieved 9,396 High Quality TC based on IMGAG's criteria
BLAST TIGR's High Quality TC vs. BAC:
(1) >95% identity over 80% of the TC length = 64.3% (current 2,500 BACs) -> 73.2% projected for 2,800 BACs to be sequenced
(2) >95% identity over 50% of the TC length = 68.6% (current 2,500 BACs) -> 77.0% projected for 2,800 BACs to be sequenced
● II. Our Mt2.0 cDNA BLASTN against 9396 MTGI8 High Quality TC (expectation 1e-04);
Further applied the filters below:
(a) HSP length / Unigene length
(b) Identity length / HSP length
Result:
3550 (9.14%) cDNA are mapped to 3294 (35.06%) unigene (a>0.8 b>0.95);
5052 (13.0%) cDNA are mapped to 4613 (49.10%) unigene (a>0.5 b>0.95);
Total cDNA: 38844, Total High Quality TC: 9396

Thank You!
● Suggestions / Comments
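
Backup: a minimal Python sketch of the two mapping rules described in the slides, offered as an illustration and not as the lab's actual pipeline. The function names (`passes_blast_filter`, `probe_score`, `probe_set_matches`) are hypothetical. The sketch assumes that each probe is 25 nt long (the weight vector has 25 entries summing to 45), that probe and target are compared position by position without gaps, and that the 8-of-11 rule counts probes within one probe set.

```python
from typing import Sequence

# Position weights quoted in the slides; the middle of the probe weighs most.
WEIGHTS = (1,) * 5 + (2,) * 5 + (3,) * 5 + (2,) * 5 + (1,) * 5


def passes_blast_filter(hsp_len: int, target_len: int, identity_len: int,
                        threshold1: float, threshold2: float) -> bool:
    """Approach A (and the MTGI filters a, b): both ratios must reach their thresholds."""
    coverage = hsp_len / target_len    # HSP length / target (or unigene) length
    identity = identity_len / hsp_len  # identity length / HSP length
    return coverage >= threshold1 and identity >= threshold2


def probe_score(probe: str, target: str) -> int:
    """Approach B: sum the weights at positions where probe and target agree.

    Assumption: an ungapped, equal-length comparison of 25-nt sequences.
    """
    if len(probe) != len(WEIGHTS) or len(target) != len(WEIGHTS):
        raise ValueError("probe and target must both be 25 nt long")
    return sum(w for w, p, t in zip(WEIGHTS, probe, target) if p == t)


def probe_set_matches(scores: Sequence[int], min_score: int = 43,
                      min_probes: int = 8) -> bool:
    """Declare a match when enough probes in the set reach the score cutoff."""
    return sum(s >= min_score for s in scores) >= min_probes


if __name__ == "__main__":
    perfect = "ACGTA" * 5
    centre_mismatch = perfect[:12] + "T" + perfect[13:]  # position 13 is 'G' in perfect
    print(probe_score(perfect, perfect))          # perfect match
    print(probe_score(perfect, centre_mismatch))  # one mismatch in the middle
    print(probe_set_matches([45] * 8 + [30] * 3))
```

Under these assumptions, a cutoff of 43 tolerates a penalty of at most 2: two mismatches in the outer five positions at either end, or one mismatch at a weight-2 position. A single mismatch in the central five positions costs 3 and drops the probe to 42, below the cutoff.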