Playing with real data
Reference-guided assembly
(SNP calling)
Evolutionary & Ecological Genomics
● With dozens (or perhaps hundreds) of genotypes from a range of habitats, we should be able to associate ecology with genotype
● Build phylogenies / haplotype networks to understand relationships
● Infer evolutionary history
● Study the role of selection
The Sequence Read Archive (SRA; formerly the Short Read Archive)
● http://www.ncbi.nlm.nih.gov/sra
  – Genomic data on many tens of thousands of samples
  – Searchable
  – Downloadable
  – Not assembled
Genome resources at NCBI
● http://www.ncbi.nlm.nih.gov/genome/
● Many hundreds of assembled genomes and other resources
● Searchable
● Downloadable
● A bit buggy / sometimes hard to find what you want
Plastid genomes on NCBI
● http://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=2759&opt=plastid
● 599 plastid genomes
● 427 'Viridiplantae' chloroplasts / plastids
● Viridiplantae
  – Green algae
  – Nonvascular plants
  – Ferns, gymnosperms
  – Flowering plants
● 279 eudicot plastid genomes
Expand on this dataset of plastid genomes
● 2 phases
● Variation within a species (Arabidopsis thaliana)
  – SNP calling
● Adding new species
  – De novo assembly
Chloroplast
● http://en.wikipedia.org/wiki/Chloroplast
● Why the chloroplast / plastid?
  – Small, simple, well-studied genome
  – Important (photosynthesis)
  – High copy number
  – Little repetitive 'junk'
Connecting climate to genotypes
1001 Arabidopsis genomes
● Actually 1049 in progress, but only a portion completed
● Can order seeds from these inbred lines
● WGS data
● Excellent reference genome
● Not well genotyped at 'repetitive' sequences, including the plastid
Chosen subset
● McKay et al. 2003
  – Photosynthetic efficiency, latitude, longitude, elevation
● Great genetic and phenotypic data; we just have to put them together to see if anything interesting is going on
● Part of a larger dataset (automated SNP calls) with interesting patterns
Ethics of SRA use
● Anyone can download the data
● Courteous to allow the authors to write a paper first
● Toronto statement
http://signal.salk.edu/atg1001/Data_Sharing_Policy.pdf
Homework
● By Thursday
  – Read up on chloroplasts: Wikipedia page http://en.wikipedia.org/wiki/Chloroplast
  – D2L something interesting about chloroplasts
● By next Monday
  – D2L a bash script for SNP genotyping your cp
● By next Tuesday
  – D2L a 'draft' SNP table for your ecotype
Bash scripting
● which bash
  /bin/bash
● http://linuxconfig.org/bash-scripting-tutorial
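As a minimal sketch of what a bash script can look like beyond the shebang line, the example below defines a small function, reads its inputs from the command line, and stops on the first error. The function name describe_run and the placeholder file names ref.fa and reads.fastq are assumptions made for illustration; they are not part of the course material.

```shell
#!/bin/bash
# Stop on the first failing command and on any unset variable.
set -eu

# Hypothetical helper: report which inputs a run would use.
# $1 = reference FASTA, $2 = FASTQ reads (both names are placeholders).
describe_run() {
    local ref="$1"
    local reads="$2"
    local prefix
    # Derive an output prefix from the reads file name, e.g. reads.fastq -> reads
    prefix="$(basename "$reads" .fastq)"
    echo "reference=$ref reads=$reads output=$prefix.sam"
}

# Use the command-line arguments if given, otherwise fall back to the placeholders.
describe_run "${1:-ref.fa}" "${2:-reads.fastq}"
```

Saved as, say, describe.sh and made executable with chmod +x, it could be run as ./describe.sh mt.fa sra_data.fastq. Keeping file names in variables rather than hard-coding them makes it easier to reuse one script for several ecotypes.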
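Bash scripting
#!/bin/bash
# Commands use samtools/bcftools 0.1.x syntax; newer releases changed
# 'samtools sort' and moved variant calling to 'bcftools call'.
bwa index mt.fa                                      # index the reference
bwa mem mt.fa sra_data.fastq > ler.sam               # align reads to the reference
samtools view -b -o ler.bam -S ler.sam               # convert SAM to BAM
samtools sort ler.bam ler.sorted                     # sort by position; writes ler.sorted.bam
samtools index ler.sorted.bam                        # index the sorted BAM
samtools faidx mt.fa                                 # index the reference for mpileup
samtools mpileup -gf mt.fa ler.sorted.bam > mt.bcf   # genotype likelihoods as BCF
bcftools view -vc mt.bcf > snps_indels.vcf           # call variants; keep variant sites only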
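The VCF written by the last command is plain text, so a draft SNP table can be pulled out of it with standard tools. The sketch below assumes the usual VCF column order (CHROM, POS, ID, REF, ALT, QUAL, ...) and treats a record as a SNP only when REF and ALT are both single bases. It runs on a small made-up file, demo.vcf, with invented positions and qualities, standing in for snps_indels.vcf. The function name vcf_to_snp_table is hypothetical.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: print a tab-separated SNP table (chrom, pos, ref, alt, qual)
# from a VCF file, skipping header lines, indels, and multi-allelic sites.
vcf_to_snp_table() {
    awk -F'\t' '
        BEGIN { OFS = "\t"; print "chrom", "pos", "ref", "alt", "qual" }
        /^#/ { next }                      # skip VCF header lines
        length($4) == 1 && length($5) == 1 && $5 != "." {
            print $1, $2, $4, $5, $6       # single-base REF and ALT only
        }
    ' "$1"
}

# Made-up example records (not real data): two SNPs and one insertion.
demo_dir="$(mktemp -d)"
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'cp\t120\t.\tA\tG\t222\t.\tDP=50\n'
    printf 'cp\t450\t.\tT\tTA\t180\t.\tINDEL;DP=48\n'
    printf 'cp\t900\t.\tC\tT\t99\t.\tDP=51\n'
} > "$demo_dir/demo.vcf"

vcf_to_snp_table "$demo_dir/demo.vcf"
```

On real data the call would be something like vcf_to_snp_table snps_indels.vcf > snp_table.tsv. Filtering on the QUAL column before trusting any row is probably worth considering, but the threshold to use is a judgment call left out of this sketch.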
Your chloroplast genomes
● Arabidopsis reference plastid sequence: http://www.ncbi.nlm.nih.gov/nuccore/7525012?report=fasta
● Your FASTQ sequences are in the genomics2014 home folder; I will email you a link to yours
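Before running the pipeline on your own files, it can help to check that they look as expected. The sketch below counts the reads in a FASTQ file, assuming the common four-lines-per-read layout, and sums the sequence length of a FASTA file. The function names and the tiny made-up input files are illustrative assumptions only.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: number of reads in a FASTQ file,
# assuming exactly four lines per read (no wrapped sequences).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: total number of bases in a FASTA file,
# ignoring header lines that start with ">".
fasta_length() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Tiny made-up inputs (not real data) so the sketch runs on its own.
demo_dir="$(mktemp -d)"
printf '>demo_ref\nACGTACGT\nACGT\n' > "$demo_dir/ref.fa"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_dir/reads.fastq"

echo "reads: $(count_fastq_reads "$demo_dir/reads.fastq")"
echo "reference length: $(fasta_length "$demo_dir/ref.fa")"
```

A read count that is not a whole number, or a reference length far from what the NCBI record reports, would suggest a truncated download or the wrong file.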