Using Long Read Transcriptome Sequencing for LncRNA Prediction in Non-model Organisms
Richard Kuo
1/19/2016

What are lncRNAs?
• Arbitrarily defined as RNAs >200 bp that do not code for protein
• Excludes pseudogenes and nonsense-mediated decay products
• Usually categorized by genomic relationship to protein coding genes
• Really unknown territory (classes within this category may be more different from each other than from protein coding genes)
[Figure: lncRNA classes by position relative to a protein coding gene: sense overlapping, intronic, intergenic, anti-sense]

lncRNA in silico discovery
• Motif Method
  – Approaches and issues
    • Look for structural evidence
      – Structural possibility does not mean non-coding
    • Look for sequence binding motif
      – Could still be protein coding
    • Look for sequence similarity to known lncRNA
      – Not enough known about lncRNA and low evolutionary conservation
• Subtractive Method
  – Approaches and issues
    • Find protein coding evidence, if none then label lncRNA
      – Requires accurate full length transcript sequence
      – Assumes we know all proteins
      – In phylogenomic approaches need enough interspecies multiple sequence alignments

Short Read VS Long Read Data
• Short Read Data
  – Usually not stranded
  – Issues with constructing correctly spliced model
  – Issues with transcription start and end
• Long Read Data
  – Full length transcript sequence with accurate splice junctions

Long Read Technologies
• Nanopore Sequencing
  – Maximum read length of about 100 kb
  – Poorly characterized error rates
  – More expensive per base than PacBio
  – Still not widely available and issues with quality control
• Pacific Biosciences SMRT Sequencing (Iso-Seq)
  – 10 kb average with 30 kb possible
  – 15% error rate, mostly comprised of insertions/deletions
  – Circular sequencing
  – Size selection is advised

Using PacBio Iso-Seq
• Library preparation
  – Poly-A tail selected
  – Optional 5’ cap selection
  – Size selection
• Analysis
  – Create Read of Insert (ROI) from circular sequences
  – Remove non-full length and chimeric sequences
  – Iterative Clustering for Error correction (ICE)
  – Map sequences to genome using GMAP
  – Resolve redundancies
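
The slide does not say how redundancies are resolved after mapping. As one possible illustration only, the sketch below groups mapped transcripts that share chromosome, strand, and the same chain of introns, so that transcripts differing only at their outer ends collapse together. The dictionary keys (`id`, `chrom`, `strand`, `exons`) and the grouping criterion are assumptions made for this example, not the pipeline's actual method.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns between sorted exons as (start, end) pairs."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def collapse_redundant(transcripts):
    """Group transcript ids that look redundant after genome mapping.

    Assumed criterion (hypothetical): same chromosome, strand and intron
    chain. Single-exon transcripts have no introns, so they only collapse
    when their exon coordinates are identical.
    """
    groups = defaultdict(list)
    for t in transcripts:
        chain = intron_chain(t["exons"])
        if chain:
            key = (t["chrom"], t["strand"], "multi", chain)
        else:
            key = (t["chrom"], t["strand"], "single", tuple(sorted(t["exons"])))
        groups[key].append(t["id"])
    return list(groups.values())


# Toy input: t1 and t2 share splice junctions and differ only at their ends.
example = [
    {"id": "t1", "chrom": "1", "strand": "+", "exons": [(100, 200), (300, 400)]},
    {"id": "t2", "chrom": "1", "strand": "+", "exons": [(90, 200), (300, 420)]},
    {"id": "t3", "chrom": "1", "strand": "+", "exons": [(100, 200), (350, 400)]},
]
collapsed = collapse_redundant(example)
```

Grouping on the intron chain relies on the accurate splice junctions that full-length reads provide; a real tool would also need a tolerance for small junction shifts caused by residual errors.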

Our lncRNA pipeline
• Three methods for finding protein coding evidence
  – Protein sequence similarity
  – Coding Potential Calculator (CPC)
  – Coding-Potential Assessment Tool (CPAT)
• If a transcript does not have evidence from any of those methods, label it as lncRNA
• Can continue to add different tools to the criteria
• Conservative pipeline (low sensitivity but high specificity)

Protein Sequence Similarity
• Make list of ORFs
• Rank list by length
• Convert ORFs to amino acid sequence
• Blastp against UniRef90
  – For non-model organisms some species-specific proteins may not be represented, but evidence may come from CPC or CPAT
• Any Blastp hits counted as protein coding evidence
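
The ORF steps (list ORFs, rank by length, translate) and the final labelling rule can be sketched roughly as below. This is a minimal illustration, not the pipeline's code: it assumes ORFs run from ATG to the first in-frame stop on the given strand only, uses the standard genetic code, and stands in for the Blastp, CPC and CPAT results with plain boolean flags. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """List ATG-to-stop ORFs as (start, end) pairs, longest first.

    Assumption: only the given strand is scanned, because full-length
    long reads are taken to be correctly oriented.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3))
                start = None
    return sorted(orfs, key=lambda o: o[1] - o[0], reverse=True)


def translate(orf_seq):
    """Translate an ORF to amino acids, dropping the trailing stop."""
    orf_seq = orf_seq.upper()
    codons = (orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons).rstrip("*")


def label_transcript(blastp_hit, cpc_coding, cpat_coding):
    """Subtractive rule: lncRNA only if no method reports coding evidence."""
    if blastp_hit or cpc_coding or cpat_coding:
        return "protein_coding_evidence"
    return "lncRNA"


# Toy transcript with two short ORFs in different frames.
seq = "CCATGGCTTAAGGATGAAATTTGGGTGACC"
ranked = find_orfs(seq)
proteins = [translate(seq[s:e]) for s, e in ranked]
```

The translated ORFs would be the Blastp queries; adding another tool to the criteria amounts to adding one more flag to `label_transcript`.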

Coding Potential Calculator
• Developed in 2007 by Lei Kong
• Uses 6 metrics for prediction
  – 3 ORF based
    • Log-odds score
    • Coverage of predicted ORF
    • Integrity of predicted ORF
  – 3 protein similarity based
    • Number of protein hits
    • Hit score
    • Frame score

Coding-Potential Assessment Tool
• Developed in 2013 by Liguo Wang
• Only uses sequence analysis
• 4 metrics
  – ORF (Open Reading Frame) size
  – ORF coverage
  – Fickett TESTCODE statistic
  – Hexamer usage bias
• Requires training data set with annotated protein coding and non-coding RNA
• Requires selection of arbitrary threshold

Image from: http://s3-production.bobvila.com/blogs/wp-content/uploads/2013/05/Wrench.jpg
Image from: https://upload.wikimedia.org/wikipedia/commons/3/3d/Casio-fx115ES-5564.jpg
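
To make two of the sequence-only metrics concrete, here is a simplified sketch of ORF coverage and a hexamer usage bias score. It is not CPAT's implementation: the pseudocount smoothing, the function names, and the toy training sequences are assumptions for illustration. It does show why a training set of annotated coding and non-coding RNA is needed, since the score is a ratio of hexamer frequencies learned from the two classes.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count in-frame hexamers (window 6, step 3) over training sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts


def hexamer_score(seq, coding, noncoding, pseudo=1.0):
    """Mean log-ratio of coding vs non-coding hexamer frequencies.

    Positive values lean coding, negative values lean non-coding.
    The pseudocount (an assumption here) avoids division by zero for
    hexamers missing from one training set; 4096 = 4**6 possible hexamers.
    """
    seq = seq.upper()
    c_total = sum(coding.values()) + pseudo * 4096
    n_total = sum(noncoding.values()) + pseudo * 4096
    logs = []
    for i in range(0, len(seq) - 5, 3):
        h = seq[i:i + 6]
        freq_coding = (coding[h] + pseudo) / c_total
        freq_noncoding = (noncoding[h] + pseudo) / n_total
        logs.append(math.log(freq_coding / freq_noncoding))
    return sum(logs) / len(logs) if logs else 0.0


def orf_coverage(orf_length, transcript_length):
    """Fraction of the transcript covered by its ORF."""
    return orf_length / transcript_length


# Toy training sets (hypothetical); real tables come from annotated transcripts.
coding_table = hexamer_counts(["ATGGCTGCTGCTGCT"])
noncoding_table = hexamer_counts(["ATATATATATATATA"])
score_coding_like = hexamer_score("GCTGCTGCT", coding_table, noncoding_table)
score_noncoding_like = hexamer_score("ATATATATA", coding_table, noncoding_table)
```

Turning such scores into a coding or non-coding call still requires choosing a cutoff, which is the arbitrary threshold the slide refers to.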

Method Comparison
[Figure: Venn diagram of protein coding evidence from Uniprot, CPAT and CPC; counts shown: 697, 1953, 427, 3610, 17069, 78, and 20539]

lncRNA Summary
• From 2 sample types (Brain and Embryo)
• 20,539 lncRNA predicted in chickens
• 1,822 anti-sense to Ensembl gene

Ensembl 75
Species    Transcripts    lncRNA
Chicken    17954          0
Human      215170         12101
Mouse      94929          2538

lncRNA Lengths
[Figure: histogram; x-axis: Transcript Length (kb), 0 to 6.5; y-axis: # of Transcripts, 0 to 8000]

lncRNA Number of Exons
[Figure: pie chart of exons per lncRNA: 1 exon 72%, 2 exons 16%, 3 exons 6%, 4 exons 3%, 5 exons 1%, 6+ exons 2%]

LncRNA per Chromosome
[Figure: bar chart; x-axis: Chromosome (1 to 28, W, Z); y-axis: # lncRNA, 0 to 3500]
[Figure: LncRNA per Chromosome Normalized; bar chart; x-axis: Chromosome (1 to 28, W, Z); y-axis: # lncRNA/Chromosome Size, 0 to 0.00006]

lncRNA Functional SNP

Antisense lncRNA

Alternative Splice Variant lncRNA

Immune lncRNA
– Male Neg vs Male Inf
– Neg has 0 expression, Inf has significant expression
– Ensembl gene: tumor necrosis factor receptor superfamily member 21 precursor

Discussion
• Long read transcriptome sequencing is a powerful tool for predicting lncRNA using ORF methods
• Focus on high confidence lncRNA predictions first
• Compare lncRNA annotations with RNAseq sets to predict function
• More known lncRNAs will make it easier to find more (better training sets)

Acknowledgement
Professor Dave Burt
Professor Alan Archibald
Bob Paton
Dr. Jacqueline Smith
Dr. Lel Eory
Choon-Kiat Khoo
Pip Beard - TC Chair
Tom Freeman - Expert Advisor

Thank you!