Cent. Eur. J. Biol. • 7(4) • 2012 • 620-625
DOI: 10.2478/s11535-012-0060-1
Central European Journal of Biology
Natural rules for Arabidopsis thaliana
pre-mRNA splicing site selection
Research Article
Ning Wu1,*, Kanyand Matand1,2,#, Huijuan Wu3, Baoming Li3, Kayla Love4,
Brittany Stoutermire3, Yanfeng Wu2
1
Center for Biotechnology Research and Education,
Langston University, Langston, Oklahoma 73050, USA
2
Department of Biology, School of Arts and Science,
Langston University, Langston, Oklahoma 73050, USA
Department of Biology, Beijing Center for Physical
and Chemical Analysis, 100089 Beijing, China
3
Department of Chemistry, School of Arts and Science,
Langston University, Langston, Oklahoma 73050, USA
4
Received 13 February 2012; Accepted 08 May 2012
Abstract: The accurate prediction of plant pre-mRNA splicing sites has been studied extensively, yet the rules governing plant pre-mRNA
splicing remain unknown. This study, based on confirmed sequence data, systematically analyzed all expressed genes on
Arabidopsis thaliana chromosome IV to quantitatively explore the natural splicing rules. The results indicated that defining
Arabidopsis thaliana pre-mRNA splicing sites requires a combination of multiple factors: (1) a relatively conserved
consensus sequence at the splicing site; (2) the individual nucleotide distribution pattern in the 50-nucleotide regions
upstream and downstream of the splicing site; and (3) quantitative analysis of the individual nucleotide distribution using
the formulations derived in this study. Combining these factors brings the accuracy of Arabidopsis thaliana splicing
site recognition above 99%. The results provide additional information for future plant pre-mRNA splicing research.
Keywords: Arabidopsis thaliana • Pre-mRNA splicing
© Versita Sp. z o.o.
1. Introduction
The accurate removal of introns from precursor
messenger RNA (pre-mRNA) is an essential step
in gene expression. Previous studies have suggested
several key factors involved in intron removal:
the relative consensus sequence at the splicing sites,
the individual nucleotide distribution pattern in the
flank regions of splicing sites, and a conserved
branch point sequence [1,2]. Some short conserved
sequences within introns have been found to function
in intron splicing across all species. These include the
dinucleotide guanine (G) and uridine (U) at the 5’
splicing site, the dinucleotide adenine (A) and guanine
at the 3’ splicing site, and a short conserved
“branch point” sequence located within 50 nucleotides
upstream of the 3’ splicing site [3]. However, particular
structural and sequence features distinguish plant
introns from those of other species. Although plant introns
share a high level of sequence similarity with introns of
other species, they lack a conserved branch point
sequence comparable to that of vertebrate and yeast introns
[4,5]. Earlier studies demonstrated that alteration of
the sequences around splicing sites might change the
intron/exon recognition mechanism [6] and
lead to different transcripts and functions [7], which
indicated that splicing of a particular intron depends
on a fine balance of multiple splicing signals of
varying strengths in the sequence context of an intron/
exon organization [5,8]. Based on currently available
information, several plant splicing site prediction
methods have been developed [9-11]. However,
because of the lack of an in vitro system capable of
efficiently splicing plant introns experimentally and
the complexity of plant species, many uncertainties
remain about plant pre-mRNA splicing mechanisms
[12]. The signals that specifically define the borders of
splicing sites in plants are still not fully understood
because a complete and accurate sequence-based
description of gene structure is lacking [13]. Recently,
as genome research has moved into the post-genome era,
the comprehensive collection of genomic and functional
genomic data for many major species has given
scientists the opportunity to review and analyze
confirmed sequence data by different bioinformatic
means and to illustrate the potential mechanism of
pre-mRNA splicing. Some researchers have applied
large samples of confirmed human genes in their studies
to improve the accuracy and sensitivity of their results [14].
However, to date, there have been no reports of such a
study in plant genomics research. As a representative
plant, Arabidopsis thaliana has been well studied
for years. Both the complete genome and most of the
expressed gene sequences are available for analysis.
This study employed currently available genomic data
to explore all confirmed, naturally expressed genes on
Arabidopsis thaliana chromosome IV to illustrate the
potential intrinsic splicing site strength and the natural
rules for splicing site selection.
2. Experimental Procedures
2.1 Data source
The Arabidopsis chromosome IV genome sequence was selected for
this study through the “National Center for Biotechnology
Information” (NCBI) Plant Genomes Central
(http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=3702&chr=4).
The study covered all 18.6 million bp of nucleotide
sequence, containing 5,122 genes, in the region.
The complete sense sequence structure of each individual
gene was retrieved from the Arabidopsis thaliana database
(http://mips.gsf.de/proj/thal/db/index.html) located at
the “Munich Information Center for Protein Sequences”
(MIPS, Germany) through the NCBI on-site link. All the
sequences were downloaded to local computers.
2.2 Sequence data manipulation
For each downloaded gene sequence, all exon and
intron nucleotide sequences were marked separately
according to the currently available gene structure
information. Additionally, the sequences of 50
nucleotides upstream and downstream of each splicing
site were extracted for analysis (Figure 1). Where the
sequence on one side of a splicing site was shorter than
50 bp, the sequence on the other side was manually
trimmed so that the exon and intron flank regions had
equal length. The data were divided into four groups:
(1) 3’ end of exon sequence, (2) 5’ end of intron
sequence, (3) 3’ end of intron sequence, and (4) 5’ end
of exon sequence. All the data were integrated into a
Microsoft Excel data sheet.
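The flank extraction and trimming step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure (they worked manually in Microsoft Excel): the function name `extract_flanks`, the 0-based index convention, and the toy sequence are all hypothetical.

```python
def extract_flanks(seq, site, seg_start, seg_end, window=50):
    """Return (upstream, downstream) flanks of equal length around a splicing site.

    seq       : gene sequence as a string (RNA alphabet assumed)
    site      : 0-based index of the first nucleotide downstream of the boundary
    seg_start : start index of the segment upstream of the boundary
    seg_end   : end index (exclusive) of the segment downstream of the boundary
    window    : maximum flank length (50 nucleotides in the study)
    """
    up_len = min(window, site - seg_start)
    down_len = min(window, seg_end - site)
    # Trim the longer side so that both flanks have the same length.
    n = min(up_len, down_len)
    return seq[site - n:site], seq[site:site + n]


# Toy example (hypothetical sequence): a 3-nt exon followed by a 10-nt intron.
toy = "AUG" + "GUAUUUUUUU"
exon_flank, intron_flank = extract_flanks(toy, site=3, seg_start=0, seg_end=len(toy))
```

In this toy case the exon side has only three nucleotides, so the intron side is trimmed to three as well, which makes it a “shorter sequence” site in the terminology used here.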
2.3 Sequence data analysis
Microsoft Excel was used for the analysis.
The frequency of each individual nucleotide
at each position across the range of 100 nucleotides
centered on the splicing site was calculated. The relatively
conserved nucleotide sequences across the 5’ and 3’
splicing sites were analyzed (Table 1). The splicing
sites with complete 50 bp sequences both upstream and
downstream were defined as “full sequence” sites
and were used as the experimental data for statistical
analysis. The occurrence of each nucleotide in the 50 bp
regions upstream and downstream of each individual splicing
site was calculated. A series of paired t-tests was
performed to examine the difference in distribution of each
type of nucleotide between the upstream and downstream
regions of the 5’ and 3’ splicing sites. The exact differences
in the numbers of A, G, C (cytosine), and U nucleotides
across each splicing site were also calculated and analyzed.
The splicing sites with shorter sequences (less than 50 bp)
in either the exon or the intron flank region were defined as
“shorter sequence” sites and were used as the test
group for examining the results.
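The paired comparison of nucleotide counts between exon and intron flank regions can be illustrated as follows. This is a minimal sketch using only the Python standard library; the helper names and the four toy site pairs are hypothetical and are not data from the study. It computes only the paired t statistic and its degrees of freedom; a P value would require a t-distribution routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics


def nucleotide_count_pairs(sites, nucleotide):
    """For each (exon_flank, intron_flank) pair, count one nucleotide on each side."""
    return [(exon.count(nucleotide), intron.count(nucleotide)) for exon, intron in sites]


def paired_t_statistic(pairs):
    """Paired t statistic for the per-site differences (exon count - intron count)."""
    diffs = [e - i for e, i in pairs]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical toy flanks (5 nt each), for illustration only.
toy_sites = [("AGGCA", "GUUUU"), ("CAGAG", "GUAUU"), ("GGCUA", "GUUAU"), ("ACGAU", "GUUUA")]
t_stat, dof = paired_t_statistic(nucleotide_count_pairs(toy_sites, "U"))
```

A negative statistic for U means that, on average, the exon flank holds fewer U residues than the intron flank, which is the direction reported for the real data.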
Figure 1. Individual gene exon/intron sequence data mining.
A: 5’ splicing site
Position               E-2     E-1     I-1     I-2     I-3
A                     8986    1512      17      26    9352
G                     1248   10744   14281      24    1843
C                     1529     547       8      99     701
U                     2557    1517      14   14171    2424
Consensus sequence       A       G       G       U       A

B: 3’ splicing site
Position               I-5     I-4     I-3     I-2     I-1     E-1     E-2
A                     2388    3801    1070   14397      20    3679    3513
G                     1558    5321     138      17   14408    7465    2567
C                     1609    1255    9078      10       7    1604    2097
U                     8892    4070    4161      23      12    1699    6270
Consensus sequence       U       N       C       A       G       G       U

Table 1. (A) Nucleotide distribution around 14,320 5’ splicing sites. (B) Nucleotide distribution around 14,447 3’ splicing sites. E: exon region;
I: intron region. N: any nucleotide.
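As a worked illustration of how a consensus row can be read off the counts in Table 1, the sketch below picks the most frequent nucleotide at each position. The counts are copied from Table 1; the 40% cutoff for calling a position “N” is an assumption chosen for this example (the article does not state the criterion it used), and the function names are hypothetical.

```python
# Counts copied from Table 1A (5' splicing site, 14,320 sites).
COUNTS_5P = {
    "E-2": {"A": 8986, "G": 1248, "C": 1529, "U": 2557},
    "E-1": {"A": 1512, "G": 10744, "C": 547, "U": 1517},
    "I-1": {"A": 17, "G": 14281, "C": 8, "U": 14},
    "I-2": {"A": 26, "G": 24, "C": 99, "U": 14171},
    "I-3": {"A": 9352, "G": 1843, "C": 701, "U": 2424},
}

# Counts copied from Table 1B (3' splicing site, 14,447 sites).
COUNTS_3P = {
    "I-5": {"A": 2388, "G": 1558, "C": 1609, "U": 8892},
    "I-4": {"A": 3801, "G": 5321, "C": 1255, "U": 4070},
    "I-3": {"A": 1070, "G": 138, "C": 9078, "U": 4161},
    "I-2": {"A": 14397, "G": 17, "C": 10, "U": 23},
    "I-1": {"A": 20, "G": 14408, "C": 7, "U": 12},
    "E-1": {"A": 3679, "G": 7465, "C": 1604, "U": 1699},
    "E-2": {"A": 3513, "G": 2567, "C": 2097, "U": 6270},
}


def frequencies(counts):
    """Convert raw nucleotide counts at one position into relative frequencies."""
    total = sum(counts.values())
    return {nt: c / total for nt, c in counts.items()}


def consensus(table, threshold=0.40):
    """Most frequent nucleotide per position, or 'N' below the assumed threshold."""
    letters = []
    for counts in table.values():
        freq = frequencies(counts)
        best = max(freq, key=freq.get)
        letters.append(best if freq[best] >= threshold else "N")
    return "".join(letters)
```

With the assumed cutoff, position I-4 of the 3’ site (G is the most frequent nucleotide at about 37%) is reported as N, in line with the consensus row of the table.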
3. Results and Discussion
A total of 18.6 million nucleotides containing 5,122 genes on
Arabidopsis thaliana chromosome IV were downloaded
from the Munich Information Center for Protein
Sequences (MIPS, Germany)
(http://mips.gsf.de/proj/thal/db/index.html). The flank sequences with 50
nucleotides upstream and downstream of each splicing site
were defined as “full sequence” and grouped by their
locations for analysis. For sequences shorter than 50
nucleotides on one side of the splicing site (“shorter
sequence”), the other side was manually trimmed so that
the exon and intron flank regions had equal length.
In total, 30,870 splicing sites (15,410 5’ sites and
15,460 3’ sites) were included in this study, comprising
28,767 “full sequence” sites and 2,103 “shorter
sequence” sites and covering 3,038,104 nucleotides.
The consensus sequences at both 5’ and 3’ splicing
sites were determined by calculating the distribution
probability of each nucleotide (Table 1). The relative
consensus sequences at the 5’ and 3’ splicing sites were
observed to be AG|GUA and UNCAG|GU (N can be any
nucleotide), respectively, which is consistent with previous
studies [3,12,15]. The results indicated that consensus
sequences at splicing sites are important for site
recognition but are not sufficient to define the sites alone.
The distributions of A, G, C, and U among all 28,767
“full sequence” sites were calculated. Paired
t-tests were performed to examine the difference in
occurrence of each type of nucleotide between the
upstream and downstream flank regions. The
analysis gave the same result at both 5’ and 3’
splicing sites: the number of U in the intron flank
region was larger than in the exon flank region, whereas
the reverse held for A, G, and C (Figure 2). Statistical
analysis showed significant differences for all
four types of nucleotides, with all P values equal
to or approximating zero. The results provided strong
evidence for the theory that the distribution
patterns of nucleotides in the flank regions of splicing
sites are the splicing signals that define the 5’ and 3’
splicing sites [10,16]. In addition to the previously reported
U, G, and C, this study showed that A also plays an
important role in splicing site determination.
Based on the statistical results, the differences in the number
of each nucleotide between the upstream and downstream regions
were calculated. Among the 14,320 5’ splicing sites, 13,061
sites showed a number of U in the exon flank region
(U) less than or equal to that in the intron flank region (u),
expressed as U-u≤0 (formulation-1). Of the remaining 1,259
sites (U-u>0), about 91.58% showed a number of A in the
exon flank region (A) less than or equal to that in the intron
flank region (a), which was defined as “if U-u>0, then A-a≤0”
(formulation-2). Together, formulations 1 and 2 covered
14,214 of the 14,320 5’ splicing sites (Table 2). At the 3’
splicing site, 13,496 of the 14,447 splicing sites showed
a number of U in the intron flank region (u) larger than or
equal to that in the exon flank region (U), expressed as
u-U≥0 (formulation-3). Most of the remaining 951 sites
(u-U<0) showed the numbers
of A in the intron flank region (a) larger than or equal to
those in the exon flank region (A) (850 of 951 sites; Table 3),
which was expressed as “if u-U<0, then a-A≥0”
(formulation-4). Together, formulations 3 and 4 covered
14,346 of the 14,447 3’ splicing sites (Table 3). The results
indicated that, in addition to the consensus sequence,
intronic U-rich sequences alone (formulation-1 or 3)
accounted for 91.21% of 5’ splicing sites and 93.42% of 3’
splicing sites. The remaining splicing sites had to be
defined by additional features.
A previous study showed that the relative contrast in
U and G+C content between introns and their flanking
exons correlates with splicing efficiency [17]. However,
our results show that, after applying formulation-1 or
3, neither G nor C defined the remaining splicing sites
as well as A did (Tables 2 and 3). By combining the
formulations (1 and 2, 3 and 4), the accuracies for
defining 5’ and 3’ splicing sites reached 99.26% and
99.30%, respectively. Interestingly, when the
general rule of an intronic U-rich sequence was broken, A
became the next most appropriate nucleotide for defining
the splicing site, with a distribution pattern opposite to
its general pattern.

Figure 2. Nucleotide distribution in both 5’ and 3’ splicing site flank regions. I>E: the number of the nucleotide in the intron is larger
than in the exon. E≥I: the number of the nucleotide in the exon is larger than or equal to that in the intron.
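The percentages quoted in this paragraph follow directly from the site counts in Tables 2 and 3. The short check below reproduces the arithmetic; the function name `coverage` is hypothetical, and the counts are those reported in the tables.

```python
def coverage(first_rule_hits, second_rule_hits, total):
    """Percent of sites covered by the U rule alone and by the U rule plus the A rule."""
    rule_one = 100 * first_rule_hits / total
    combined = 100 * (first_rule_hits + second_rule_hits) / total
    return round(rule_one, 2), round(combined, 2)


# 5' sites: 13,061 satisfy U-u<=0; 1,153 of the remaining 1,259 satisfy A-a<=0.
five_prime = coverage(13061, 1153, 14320)

# 3' sites: 13,496 satisfy u-U>=0; 850 of the remaining 951 satisfy a-A>=0.
three_prime = coverage(13496, 850, 14447)

# Share of the 1,259 remaining 5' sites that satisfy formulation-2.
second_step_5p = round(100 * 1153 / 1259, 2)
```

The first value of each pair is the share of sites explained by the U rule alone, and the second is the share explained once the A rule is added.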
To test the efficiency and accuracy of formulations
1 to 4, we applied them to the 2,103 splicing
sites with “shorter sequence”. In total, 1,090 5’ splicing
sites and 1,013 3’ splicing sites were tested. The
shortest exon sequences consisted of the start codon (AUG)
only and were always matched by a sequence of the same length at
the beginning of the intron sequence, GUA. For the
other “shorter sequence” sites, the accuracies were
98.44% at 5’ splicing sites (formulations 1
and 2) and 98.42% at 3’ splicing sites (formulations 3
and 4). In the rare cases in which neither the U nor the
A distribution could define the splicing site, the
distribution of C (fewer in exons than in introns)
was the next factor determining the
splicing site, which brought the total accuracies for
“shorter sequence” sites to 99.63% at 5’ splicing sites and
99.90% at 3’ splicing sites (data not shown).

Formulation-1: U-u≤0
              A-a     G-g     C-c     U-u
≤0           6686    3420    4722   13061
>0           7634   10900    9598    1259
Subtotal    14320   14320   14320   14320

Formulation-2: If U-u>0, then A-a≤0
              A-a     G-g     C-c     U-u
≤0           1153     486     559       0
>0            106     773     700    1259
Subtotal     1259    1259    1259    1259

Table 2. The number differences of each nucleotide around the 5’ splicing site flank regions. (Uppercase: exon region; Lowercase: intron region).

Formulation-3: u-U≥0
              a-A     g-G     c-C     u-U
<0           8761   11553    8734     951
≥0           5686    2894    5713   13496
Subtotal    14447   14447   14447   14447

Formulation-4: If u-U<0, then a-A≥0
              a-A     g-G     c-C     u-U
<0            101     571     427     951
≥0            850     380     524       0
Subtotal      951     951     951     951

Table 3. The number differences of each nucleotide around the 3’ splicing site flank regions. (Uppercase: exon region; Lowercase: intron region).
The accurate prediction of plant pre-mRNA splicing
sites has been extensively studied. Several
computer-aided splicing site selection models, believed
to mimic the in vivo splicing process
mathematically, have been developed based on several
key factors. These factors include the consensus
splicing site sequence, the intronic U-rich sequence,
and the branch point confirmed in other
species (vertebrate, yeast) [9,10,15,18]. However, the
rules and mechanisms for plant pre-mRNA splicing
remain unknown. Our study, based on confirmed
sequence data, systematically analyzed all expressed
gene structures on Arabidopsis thaliana chromosome
IV to quantitatively explore the natural splicing rules.
In conclusion, defining Arabidopsis thaliana pre-mRNA
splicing sites requires a combination of
multiple factors: (1) a relatively conserved
consensus sequence at the splicing site; (2) the individual
nucleotide distribution pattern in the 50-nucleotide regions
upstream and downstream of the splicing site, with intronic
U-rich sequence and exonic G-, C-, and A-rich sequences;
and (3) quantitative analysis of the individual nucleotide
distribution pattern. The order of calculation is as
follows (upper case represents exon sequence, lower
case represents intron sequence). At the 5’ splicing site:
(i) U-u≤0; (ii) if U-u>0, then A-a≤0; (iii) if U-u>0 and
A-a>0, then C-c≤0. At the 3’ splicing site: (i) u-U≥0; (ii) if
u-U<0, then a-A≥0; (iii) if u-U<0 and a-A<0, then c-C≥0.
According to the statistical analysis, combining
all these factors brings the accuracy of splicing
site recognition above 99%. The results of this study provide
new data and extend previous research findings, which
will aid future plant pre-mRNA splicing research.
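The order of calculation stated in the conclusion can be written as a small cascade. The sketch below is an illustration only: the function name `flank_rule` and the toy flanks are hypothetical, sequences are assumed to use the RNA alphabet (U rather than T), and the flanks are assumed to be already trimmed to equal length. Because U-u≤0 at a 5’ site and u-U≥0 at a 3’ site both state that the exon count does not exceed the intron count, one function serves both site types.

```python
def flank_rule(exon_flank, intron_flank):
    """Return the step (1, 2 or 3) of the cascade that the site satisfies, or None.

    Step 1 tests U, step 2 tests A, step 3 tests C. Each step is reached only
    when every earlier step failed.
    """
    for step, nucleotide in enumerate("UAC", start=1):
        # The rule holds when the exon flank has no more of this nucleotide
        # than the intron flank (upper case minus lower case <= 0).
        if exon_flank.count(nucleotide) <= intron_flank.count(nucleotide):
            return step
    return None


# Hypothetical 5-nt flanks, for illustration only.
step_u = flank_rule("AGGCA", "GUUUU")     # intron is U-rich
step_a = flank_rule("UUUGC", "GAAAG")     # U rule fails, A rule holds
step_c = flank_rule("UUAAG", "GGGCC")     # U and A rules fail, C rule holds
no_rule = flank_rule("UUAAC", "GGGGG")    # none of the three rules holds
```

A return value of `None` marks a site that none of the three flank rules explains; according to the figures reported above, such sites are rare.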
Acknowledgements
Dr. Ning Wu and Dr. Kanyand Matand contributed
equally to this study and both are listed as
corresponding authors. All listed authors have agreed
to this submission and declare no conflict of interest.
References
[1] Green M.R., Biochemical mechanisms of
constitutive and regulated pre-mRNA splicing,
Annu Rev Cell Biol., 1991, 7, 559-599
[2] Wang Z., Burge C., Splicing regulation: from a
parts list of regulatory elements to an integrated
splicing code, RNA, 2008, 14, 802-813
[3] Lal S., Choi J.H., Shaw J.R., Hannah L.C., A splice
site mutant of maize activates cryptic splice sites,
elicits intron inclusion and exon exclusion, and
permits branch point elucidation, Plant Physiol.,
1999, 121, 411-418
[4] Goodall G.J., Filipowicz W., Different effects of
intron nucleotide composition and secondary
structure on pre-mRNA splicing in monocot and
dicot plants, EMBO J., 1991, 10, 2635-2644
[5] Brown J.W., Simpson C.G., Splice site selection in
plant pre-mRNA splicing, Annu Rev Plant Physiol
Plant Mol Biol., 1998, 49, 77-95
[6] Simpson C.G., McQuade C., Lyon J., Brown J.W.,
Characterization of exon skipping mutants of the
COP1 gene from Arabidopsis, Plant J., 1998, 15,
125-131
[7] Lazar G., Goodman H.M., The Arabidopsis splicing
factor SR1 is regulated by alternative splicing,
Plant Mol Biol., 2000, 42, 571-581
[8] Sarmah B., Chakraborty N., Chakraborty S.,
Datta A., Plant pre-mRNA splicing in fission yeast,
Schizosaccharomyces pombe, Biochem Biophys
Res Commun., 2002, 293, 1209-1216
[9] Lou H., McCullough A.J., Schuler M.A., 3’ splice
site selection in dicot plant nuclei is position
dependent, Mol Cell Biol, 1993, 13, 4485-4493
[10] Brendel V., Kleffe J., Prediction of locally optimal
splice sites in plant pre-mRNA with applications
to gene identification in Arabidopsis thaliana
genomic DNA, Nucleic Acids Res, 1998, 26,
4748-4757
[11] Luehrsen K.R., Taha S., Walbot V., Nuclear pre-mRNA
processing in higher plants, Prog Nucleic
Acid Res Mol Biol, 1994, 47, 149-193
[12] Hebsgaard S.M., Korning P.G., Tolstrup N.,
Engelbrecht J., Rouze P., Brunak S., Splice site
prediction in Arabidopsis thaliana pre-mRNA by
combining local and global sequence information,
Nucl. Ac. Res., 1996, 24, 3439-3452
[13] Guigó R., Flicek P., Abril J.F., Reymond A., Lagarde
J., Denoeud F., et al., EGASP: the human ENCODE
Genome Annotation Assessment Project, Genome
Biol, 2006, 7, 1-31
[14] Dogan R.I., Getoor L., Wilbur J., Mount S.,
Features generated for computational splice-site
prediction correspond to functional elements, BMC
Bioinformatics, 2007, 8, 410-424
[15] White O., Soderlund C., Shanmugan P., Fields C.,
Information contents and dinucleotide compositions
of plant intron sequences vary with evolutionary
origin, Plant Mol Biol, 1992, 19, 1057-1064
[16] Kleffe J., Hermann K., Vahrson W., Wittig B.,
Brendel V., Logitlinear models for the prediction of
splice sites in plant pre-mRNA sequences, Nucl.
Ac. Res., 1996, 24, 4709-4718
[17] Carle-Urioste J.C., Brendel V., Walbot V., A
combinatorial role for exon, intron and splice site
sequences in splicing in maize, Plant J, 1997, 11,
1253-1263
[18] Brendel V., Kleffe J., Carle-Urioste J.C., Walbot V.,
Prediction of splice sites in plant pre-mRNA from
sequence properties, J Mol Biol, 1998, 276, 85-104