IDENTIFICATION OF THE POWER-LAW COMPONENT IN THE HUMAN TRANSCRIPTOME

Vasily V. Grinev
Associate Professor
Department of Genetics, Faculty of Biology
Belarusian State University
Minsk, Republic of Belarus

DIVERSITY OF SPLICE SITES IN HUMAN GENOME/TRANSCRIPTOME

[Figure] Graphical representation of the traditional (linear) transcriptional model (A) and of the splice-site (B) and exon (C) splicing-graph models of human RCAN3 gene organisation.

DISCRETE POWER-LAW MODEL

Important equations

The probability mass function:
$$p(x) = C x^{-\alpha}$$

Normalization constant:
$$C = \frac{1}{\zeta(\alpha, x_{\min})}$$

Hurwitz zeta function:
$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

The cumulative distribution function:
$$\mathrm{P}(x) = 1 - \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}$$

The complementary cumulative distribution function:
$$P(x) = \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}$$

Determination of parameters

Estimation of the lower bound $x_{\min}$ by the Kolmogorov-Smirnov statistic:
$$D = \sup_{x} \left| \mathrm{P}(x) - P_{\mathrm{emp}}(x) \right|$$

Determination of the scaling parameter $\alpha$ by the maximum likelihood estimator, for $x_{\min} \geq 6$:
$$\alpha \cong 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min} - \frac{1}{2}} \right]^{-1}$$

Determination of the scaling parameter $\alpha$ by direct numerical maximization of the likelihood function itself, for $x_{\min} < 6$:
$$\mathcal{L}(\alpha) = -n \ln \zeta(\alpha, x_{\min}) - \alpha \sum_{i=1}^{n} \ln x_i$$

Clauset, A., Shalizi, C.R., Newman, M.E.J. (2009) Power-law distributions in empirical data. SIAM Rev., 51, 661-703.
Newman, M.E.J. (2005) Power laws, Pareto distributions and Zipf's law. Contemp. Phys., 46, 323-351.
Goldstein, M.L., Morris, S.A., Yen, G.G. (2004) Problems with fitting to the power-law distribution. Eur. Phys. J. B, 41, 255-258.

COMPETING STATISTICAL MODELS

The probability mass functions of the competing statistical models

1) Power-law: $p(x) = C x^{-\alpha}$
2) Truncated power-law: $p(x) = C x^{-\alpha} e^{-\lambda x}$
3) Yule-Simon: $p(x) = C \, \dfrac{\Gamma(x)}{\Gamma(x + \alpha)}$
4) Exponential: $p(x) = C e^{-\lambda x}$
5) Stretched exponential: $p(x) = C x^{\beta - 1} e^{-\lambda x^{\beta}}$
6) Log-normal: $p(x) = C \, \dfrac{1}{x} \, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}$
7) Poisson: $p(x) = C \, \dfrac{\mu^{x}}{x!}$

Comparison of alternative statistical models

1) Log-likelihood ratio test: $R = \dfrac{L_1}{L_2} = \prod_{i=1}^{n} \dfrac{p_1(x_i)}{p_2(x_i)}$
Vuong, Q.H. (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.
2) Akaike information criterion: $\mathrm{AIC} = 2k - 2 \ln L$
Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans. Automat. Control, 19, 716-723.
3) Bayesian information criterion: $\mathrm{BIC} = -2 \ln L + k \ln(n)$
Schwarz, G.E. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461-464.

STATISTICAL ANALYSIS CONFIRMS THE PRESENCE OF A POWER-LAW COMPONENT IN THE TRANSCRIPTOME OF KASUMI-1 CELLS

USAGE OF EXONS IN ALTERNATIVE SPLICING FOLLOWS A POWER-LAW IN HUMAN TRANSCRIPTOME

[Figure] Maximum values of splicing degrees from different models of human genes.

ARE THERE ANY SPECIFIC FEATURES ASSOCIATED WITH DIFFERENT CLASSES OF SPLICE SITES?

Every splice site was annotated with sequence, sequence-related, functional and structural features, which were extracted from four types of genomic/RNA elements.

RANDOM FOREST BASED DATA MINING

A small set of features is sufficient to distinguish between two classes of splice sites in Kasumi-1 cells.
Iterative removal of misclassified splice sites leads to high classification accuracy.
About half of the misclassified splice sites can be explained in several different ways.

MANY THANKS TO THE MEMBERS OF OUR TEAM:

Ilia M. Ilyushonak
Dr. Petr V. Nazarov
Dr. Laurent Vallar
Northern Institute for Cancer Research
Prof. Olaf Heidenreich

THANK YOU FOR YOUR ATTENTION!
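
As a supplement to the DISCRETE POWER-LAW MODEL slide, the parameter estimation it describes could be sketched in Python roughly as follows. This is a minimal illustration and not the code used for the analysis in the talk. All function names (`hurwitz_zeta`, `alpha_mle`, `ks_distance`, `fit_power_law`, `aic`, `bic`) are hypothetical. The Hurwitz zeta function is approximated by a finite sum plus an Euler-Maclaurin tail estimate, the likelihood is maximized with a golden-section search, and the demo data are synthetic; each of these is an assumption of the sketch.

```python
import bisect
import math
import random


def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate zeta(alpha, q) = sum_{n>=0} (n + q)**(-alpha) for alpha > 1.

    The first `terms` terms are summed directly; the remaining tail is
    estimated with an Euler-Maclaurin correction (a choice of this sketch).
    """
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    tail = a ** (1 - alpha) / (alpha - 1) + 0.5 * a ** -alpha + alpha * a ** (-alpha - 1) / 12
    return head + tail


def log_likelihood(alpha, xs, x_min):
    """L(alpha) = -n ln zeta(alpha, x_min) - alpha * sum(ln x_i) over x_i >= x_min."""
    tail = [x for x in xs if x >= x_min]
    return -len(tail) * math.log(hurwitz_zeta(alpha, x_min)) - alpha * sum(map(math.log, tail))


def alpha_approx(xs, x_min):
    """Closed-form approximation alpha ~ 1 + n / sum(ln(x_i / (x_min - 1/2)))."""
    tail = [x for x in xs if x >= x_min]
    return 1 + len(tail) / sum(math.log(x / (x_min - 0.5)) for x in tail)


def alpha_mle(xs, x_min, lo=1.01, hi=6.0, tol=1e-6):
    """Maximize L(alpha) on [lo, hi] by golden-section search (L is concave in alpha)."""
    def f(alpha):
        return log_likelihood(alpha, xs, x_min)

    g = (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:  # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:  # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2


def ks_distance(xs, x_min, alpha):
    """D = sup_x |P(x) - P_emp(x)|, evaluated on the complementary CDFs.

    The absolute difference is the same for the CDFs and their complements.
    """
    tail = sorted(x for x in xs if x >= x_min)
    n, z0 = len(tail), hurwitz_zeta(alpha, x_min)
    dist = 0.0
    for x in range(x_min, tail[-1] + 2):
        model = hurwitz_zeta(alpha, x) / z0                     # model P(X >= x)
        empirical = (n - bisect.bisect_left(tail, x)) / n       # observed fraction >= x
        dist = max(dist, abs(model - empirical))
    return dist


def fit_power_law(xs, x_min_candidates):
    """Fit alpha for each candidate x_min and keep the fit with the smallest D."""
    best = None
    for x_min in x_min_candidates:
        alpha = alpha_mle(xs, x_min)
        dist = ks_distance(xs, x_min, alpha)
        if best is None or dist < best[2]:
            best = (x_min, alpha, dist)
    return best


def aic(log_l, k):
    """AIC = 2k - 2 ln L."""
    return 2 * k - 2 * log_l


def bic(log_l, k, n):
    """BIC = -2 ln L + k ln(n)."""
    return -2 * log_l + k * math.log(n)


# Synthetic demo data (not from the talk): a discrete power law with
# alpha = 2.5 and x_min = 1, truncated at x = 1000 for sampling only.
rng = random.Random(0)
support = range(1, 1001)
sample = rng.choices(support, weights=[x ** -2.5 for x in support], k=5000)
x_min_hat, alpha_hat, d_hat = fit_power_law(sample, x_min_candidates=[1, 2, 3])
print(x_min_hat, round(alpha_hat, 3), round(d_hat, 4))
print(aic(log_likelihood(alpha_hat, sample, x_min_hat), k=1))
```

The closed-form `alpha_approx` is included only for comparison: as the slide notes, it is intended for larger lower bounds, while direct maximization of the likelihood is used for small $x_{\min}$. The same `aic` and `bic` helpers could be applied to the log-likelihoods of the other candidate models once their own fitting routines are available.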