Affymetrix Expression Data
Comics Group
12-05-2003 Nijmegen
Tim Hulsen
General Information (1)
• Affymetrix oligo microarrays: HG-U133 A and B
(human) and MG-U74v2 A, B and C (mouse)
• Updated every two months; releases used here:
November 2002 and January 2003
• UniGene-based
• Probes: 25mer oligos complementary to the
sequences of interest
• Probe pairs: perfect match (PM) probe and
mismatch (MM) probe, MM is different from PM
in the 13th position
General Information (2)
• Human chips: 3269 samples, 44792 fragments,
115 tissue categories (114 in the November 2002
release), 15 SNOMED tissue categories
• Mouse chips: 859 samples, 36701 fragments, 25
tissue categories, 12 SNOMED tissue categories
• Results from all samples within a tissue category
are combined by generating electronic
Northerns
• For each fragment and each tissue
category, two values are determined:
– Median expression value
– Present call (percentage)
Median expression value
• Expression value: intensity
• All expression values that have a ‘present
call’ are used to determine the median
expression value
• Varies from 0 to ~65,000 in human and
from 0 to ~97,000 in mouse
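A minimal sketch of the median calculation described on this slide, assuming each sample for a fragment is available as an (intensity, call) pair. The data layout, the function name and the "A" label for non-present calls are illustrative assumptions, not the database's actual schema.

```python
from statistics import median

def median_present_expression(samples):
    """Median intensity over the samples of one fragment in one tissue
    category, using only samples with a present ('P') call.

    `samples` is a hypothetical list of (intensity, call) tuples.
    Returns None when no sample has a present call.
    """
    present = [intensity for intensity, call in samples if call == "P"]
    return median(present) if present else None

# Illustrative values only (not taken from the data set).
samples = [(120.0, "P"), (80.0, "A"), (300.0, "P"), (210.0, "P")]
print(median_present_expression(samples))  # 210.0
```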
Present Call (Percentage)
• Normalization/scaling procedures (MAS 5.0) are
used to assign each fragment an expression intensity
value with an associated confidence level
• When the confidence level p for expression is
smaller than 0.05, the expression intensity for
this specific fragment in this particular sample is
called present (P)
• Call values are used to calculate a present call
percentage (P calls / total calls)
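The present call percentage can be sketched as follows, assuming the per-sample confidence values p for one fragment in one tissue category are given as a plain list. The function name and the single 0.05 cut-off are taken from the slide's description or invented for illustration; this is not the MAS 5.0 software itself.

```python
def present_call_percentage(p_values, alpha=0.05):
    """Percentage of samples whose expression is called present.

    A sample is called present when its confidence value p is below
    `alpha` (0.05 on the slide). `p_values` is a hypothetical list
    holding one p per sample.
    """
    if not p_values:
        raise ValueError("at least one sample is required")
    present = sum(1 for p in p_values if p < alpha)
    return 100.0 * present / len(p_values)

# Two of four samples fall below 0.05.
print(present_call_percentage([0.01, 0.20, 0.03, 0.50]))  # 50.0
```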
SNOMED category definitions (1)
• SNOMED: Systematised Nomenclature of
Medicine
• Combines specific categories into more global
categories, i.e. organ systems
• In human far more useful than in mouse (115 -> 15 versus 25 -> 12 categories)
• Categories like: cardiovascular system, digestive
organs, endocrine gland, female genital system,
male genital system, musculoskeletal system,
nervous system, respiratory system, etc.
SNOMED category definitions (2)
• Example: category 7, ‘hematopoietic system’:
– Human: 36. Bone marrow, 37. Dendritic reticulum cell, 38. Lymph node,
39. Lymphocyte, 40. Monocyte, 41. Segmented neutrophil, 42. Spleen,
43. Thymus, 44. Tonsil, 45. White blood cell
– Mouse: 10. Blood, 11. Bone marrow, 12. Mesenteric lymph node,
13. Spleen, 14. Thymus
Annotation provided
For each fragment, if available:
• title
• geneSymbol
• geneAlias
• exemplarAcc
• omimId
• snpId
• refseqId
• refseqprotId
• ncbiNuclId
• ncbiProtId
• unigeneAcc
• unigeneId
• interproId
• pfamId
• swissprotId
• goId
• goFunction
• goProcess
• goComponent
• comment
Goals & Problems
• Goal: use data set to see if co-expression
between orthologous/paralogous gene
pairs is higher than between ‘unrelated’
gene pairs, in human & mouse
• Problem 1: limited annotation
• Problem 2: empty expression profiles
• Problem 3: size of data set
Limited annotation (1)
For example, for three of the most used protein ids:
ncbiProtId (in red), refseqProtId (in green), swissprotId (in blue)
[Figures: annotation coverage for human and for mouse; not reproduced in this transcript]
Limited annotation (2)
Solutions:
• Smith-Waterman (SWX) of all Affymetrix
sequences to the human & mouse IPI
sets, for which orthologs and paralogs
were already defined -> IPI id added to
database
• Smith-Waterman (SWN) of all Affymetrix
sequences to each other for better
orthology/paralogy prediction
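For readers unfamiliar with Smith-Waterman, the sketch below computes a local alignment score with a simple match/mismatch/linear-gap scheme. The scoring values and function name are illustrative assumptions. It does not model the SWX or SWN searches themselves, which would normally use substitution matrices and a dedicated tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences `a` and `b`.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory use is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACGTYY", "ACGT"))  # 8
```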
Empty expression profiles
• Lots of genes have no expression at all in any
tissue category
• Such profiles are useless for correlation calculation;
two genes with no expression will have a top correlation!
• For human: 4114 out of 44792 fragments
have no expression in any tissue category
-> 40678 left
• For mouse: 6791 out of 36701 fragments
have no expression in any tissue category
-> 29910 left
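The removal step could look like the sketch below, assuming each fragment's profile is a list of per-tissue values and that "no expression" means every value is zero. Both the dictionary layout and that definition of an empty profile are assumptions made for illustration.

```python
def drop_empty_profiles(profiles):
    """Keep only fragments with a non-zero value in at least one tissue category.

    `profiles` is a hypothetical mapping: fragment id -> list of per-tissue
    expression values (for example present call percentages).
    """
    return {
        fragment: values
        for fragment, values in profiles.items()
        if any(v > 0 for v in values)
    }

# Fragment ids below are made up for the example.
profiles = {
    "frag_a": [0.0, 0.0, 0.0],
    "frag_b": [0.0, 12.5, 0.0],
    "frag_c": [40.0, 0.0, 3.0],
}
print(sorted(drop_empty_profiles(profiles)))  # ['frag_b', 'frag_c']
```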
Size of data set
• Correlation between gene pairs is calculated:
the number of pairs is (x^2 - x)/2 for x genes ->
millions of data points
• Number of gene pairs is already brought down
by the ‘no expression gene removal’: in human
from 1,003,139,236 to 827,329,503, in mouse
from 673,463,350 to 447,289,095
• For some quick analyses, sets of e.g. 1000
randomly selected genes were used -> 499,500
gene pairs
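The pair counts on this slide follow directly from the formula for unordered pairs of x genes. The short check below reproduces them from the fragment counts given earlier; the function name is only illustrative.

```python
def n_pairs(x):
    """Number of unordered gene pairs among x genes: (x^2 - x) / 2."""
    return x * (x - 1) // 2

for label, n in [
    ("human, all fragments", 44792),
    ("human, after removal", 40678),
    ("mouse, all fragments", 36701),
    ("mouse, after removal", 29910),
    ("random subset", 1000),
]:
    print(f"{label}: {n_pairs(n):,} pairs")
```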
Uncentered Correlation
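• ‘Uncentered’: ranges from 0 to 1 for non-negative expression values
• For two profiles X and Y with N tissue categories each:
$$\mathrm{UC}(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} x_j^2}}\right)\left(\frac{y_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} y_j^2}}\right)$$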
• Calculated correlations between gene pairs were used to
see if the co-expression for orthologous pairs and/or
paralogous pairs is higher than for ‘unrelated’ pairs
• This was measured by using the KEGG Pathway map
(release 25)
• The best result, although not completely convincing, was
obtained using the present call percentage (PCP) rather than the median expression (ME).
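As an illustration of the uncentered correlation defined on this slide, a minimal sketch for two expression profiles of equal length follows. Returning None for an all-zero profile is a choice made here, not something stated in the slides.

```python
import math

def uncentered_correlation(x, y):
    """Uncentered correlation of two equally long profiles.

    Each profile is scaled by sqrt(sum(v^2) / N) without subtracting the
    mean; the result is the average product of the scaled values.
    Returns None if either profile is all zeros (the scale would be 0).
    """
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    scale_x = math.sqrt(sum(v * v for v in x) / n)
    scale_y = math.sqrt(sum(v * v for v in y) / n)
    if scale_x == 0 or scale_y == 0:
        return None
    return sum((a / scale_x) * (b / scale_y) for a, b in zip(x, y)) / n

# Proportional profiles give a value close to 1.0; profiles expressed in
# disjoint tissues give 0.0.
print(uncentered_correlation([1, 2, 3], [2, 4, 6]))
print(uncentered_correlation([1, 0], [0, 1]))
```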
Correlation KEGG Pathway Check
• Data points at correlation
thresholds of 0.9 and 1.0 were left out
because the very low numbers of pairs
make them unreliable
• Only orthologous conserved gene
pairs have a higher accuracy when
increasing the correlation threshold
• Maybe a combination of PCP and
ME should be used
• Another measure could be used:
same GO category instead of KEGG,
GO is already annotated by Affymetrix
• Lots of genes have an expression
value in only one tissue; this
correlation method is not really
suitable for them -> mutual information analysis
Mutual Information
• For each tissue category: 0 or 1 (ME/PCP value
below/above a specified threshold)
• x0 = fraction of 0s, x1 = fraction of 1s
• x00 = fraction of 0-0 pairs, x01 = fraction of 0-1 pairs, x10 = fraction
of 1-0 pairs, x11 = fraction of 1-1 pairs
• Entropy per gene: -(x0*ln(x0)+x1*ln(x1))
• Entropy per gene pair: -(x00*ln(x00)+x01*ln(x01)+x10*ln(x10)+x11*ln(x11))
• Terms with a fraction of 0 contribute 0
• MI = Entropy(1) + Entropy(2) – Entropy(1,2)
• 0 <= MI <= ln(2) = 0.693147
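The calculation on this slide can be sketched as below: binarize each profile against a threshold, then combine the two single-gene entropies with the joint entropy. Function names and the example threshold are assumptions, and terms with zero frequency are skipped, following the convention that 0*ln(0) counts as 0.

```python
import math
from collections import Counter

def binarize(values, threshold):
    """1 where the ME/PCP value is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def entropy(counts, total):
    """Entropy in nats from observed counts; zero counts are skipped."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def mutual_information(a, b):
    """MI = Entropy(1) + Entropy(2) - Entropy(1,2) for two binary profiles."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    return h_a + h_b - h_ab

# Identical balanced profiles reach the upper bound ln(2).
g1 = binarize([10.0, 90.0, 5.0, 80.0], threshold=50.0)
print(mutual_information(g1, g1), math.log(2))
```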
MI GO Category Check
• Mutual information check using GO Biological Process 3rd level of specification
• Horizontal axis shows log(MI)
• Different lines: different thresholds for defining as a ‘0’ or a ‘1’
• Accuracy indeed seems to be higher for pairs with high mutual information, but
there is also a peak at -9<=log(MI)<-8
• Orthologous/paralogous pairs not checked yet
Future plans
• Complete mutual information analysis,
using both KEGG Pathway and GO
databases; look at orthologous and
paralogous gene pairs too
• Check alternative splicing
• Speed – license ends at the end of June