Playing with real data
Reference-guided assembly (SNP calling)

Evolutionary & Ecological Genomics
● With dozens (or perhaps hundreds) of genotypes from a range of habitats, we should be able to associate ecology with genotype
● Build phylogenies / haplotype networks to understand relationships
● Infer evolutionary history
● Study the role of selection

The Sequence Read Archive (SRA; formerly the Short Read Archive)
● http://www.ncbi.nlm.nih.gov/sra
  – Genomic data on many tens of thousands of samples
  – Searchable
  – Downloadable
  – Not assembled

Genome resources at NCBI
● http://www.ncbi.nlm.nih.gov/genome/
● Many hundreds of assembled genomes and other resources
● Searchable
● Downloadable
● A bit buggy; sometimes hard to find what you want

Plastid genomes on NCBI
● http://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=2759&opt=plastid
● 599 plastid genomes
● 427 'Viridiplantae' chloroplasts / plastids
● Viridiplantae
  – Green algae
  – Nonvascular plants
  – Ferns, gymnosperms
  – Flowering plants
● 279 eudicot plastid genomes

Expand on this dataset of plastid genomes
● 2 phases
● Variation within a species (Arabidopsis thaliana)
  – SNP calling
● Adding new species
  – De novo assembly

Chloroplast
● http://en.wikipedia.org/wiki/Chloroplast
● Why the chloroplast / plastid?
  – Small, simple, well-studied genome
  – Important (photosynthesis)
  – High copy number
  – Little repetitive 'junk'

Connecting climate to genotypes

1001 Arabidopsis genomes
● Actually 1049 in progress, but only a portion completed
● Can order seeds from these inbred lines
● WGS (whole-genome shotgun) data
● Excellent reference genome
● Not well genotyped at 'repetitive' sequences, including the plastid

Chosen subset
● McKay et al. 2003
  – Photosynthetic efficiency, latitude, longitude, elevation
● Great genetic and phenotypic data; we just have to put them together to see if anything interesting is going on
● Part of a larger dataset (automated SNP calls) with interesting patterns

Ethics of SRA use
● Anyone can download the data
● It is courteous to allow the authors to write a paper first
● Toronto statement
  – http://signal.salk.edu/atg1001/Data_Sharing_Policy.pdf

Homework
● By Thursday
  – Read up on chloroplasts (Wikipedia page: http://en.wikipedia.org/wiki/Chloroplast)
  – Submit to D2L: something interesting about chloroplasts
● By next Monday
  – Submit to D2L: a bash script for SNP genotyping your chloroplast (cp)
● By next Tuesday
  – Submit to D2L: a 'draft' SNP table for your ecotype

Bash scripting
● which bash
  – /bin/bash
● http://linuxconfig.org/bash-scripting-tutorial

Bash scripting (the full script, built up one line at a time)
    #!/bin/bash
    # Option syntax below is for samtools/bcftools 0.1.x; 1.x releases changed sort, mpileup and bcftools view
    # Index the reference, then align the reads to it
    bwa index mt.fa
    bwa mem mt.fa sra_data.fastq > ler.sam
    # Convert SAM to BAM, sort by coordinate, and index the sorted BAM
    samtools view -b -o ler.bam -S ler.sam
    samtools sort ler.bam ler.sorted
    samtools index ler.sorted.bam
    # Index the reference FASTA, pile up the reads, and call variants
    samtools faidx mt.fa
    samtools mpileup -gf mt.fa ler.sorted.bam > mt.bcf
    bcftools view -vc mt.bcf > snps_indels.vcf

Your chloroplast genomes
● Arabidopsis reference plastid sequence: http://www.ncbi.nlm.nih.gov/nuccore/7525012?report=fasta
● Your fastq sequences
  – In the genomics2014 home folder; I will email you a link to yours
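
For the 'draft' SNP table in the homework, one possible starting point is to filter the VCF written by the last step of the script. The sketch below is only an illustration: the function names `make_example_vcf` and `vcf_to_snp_table`, the four example records, and the QUAL cutoff of 20 are all assumptions made here, not part of the course material. It keeps single-base substitutions and drops indels, multi-allelic sites, and low-quality calls.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for snps_indels.vcf so the sketch runs without real data.
# The records are invented for illustration only.
make_example_vcf() {
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'cp\t1021\t.\tA\tG\t222\t.\tDP=150\n'
  printf 'cp\t4876\t.\tCT\tC\t180\t.\tINDEL;DP=140\n'
  printf 'cp\t9003\t.\tG\tT\t12\t.\tDP=9\n'
  printf 'cp\t15310\t.\tT\tC\t199\t.\tDP=160\n'
}

# Print a tab-separated SNP table from a VCF.
#   $1 = path to the VCF
#   $2 = minimum QUAL to keep (default 20, an arbitrary assumption)
vcf_to_snp_table() {
  local vcf=$1 min_qual=${2:-20}
  awk -F'\t' -v q="$min_qual" '
    BEGIN { OFS = "\t"; print "CHROM", "POS", "REF", "ALT", "QUAL" }
    /^#/ { next }                                # skip header lines
    length($4) != 1 || length($5) != 1 { next }  # skip indels and multi-allelic sites
    $6 + 0 < q { next }                          # skip low-quality calls
    { print $1, $2, $4, $5, $6 }
  ' "$vcf"
}

# Demonstration on the invented records
example=$(mktemp)
make_example_vcf > "$example"
vcf_to_snp_table "$example" 20
rm -f "$example"
```

On the invented records, only positions 1021 and 15310 pass: 4876 is an indel and 9003 falls below the cutoff. On real data the call would be something like `vcf_to_snp_table snps_indels.vcf 20 > snp_table.tsv`, with the cutoff chosen after looking at the QUAL distribution.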
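
The pipeline on the slides uses the option syntax of samtools/bcftools 0.1.x. With 1.x releases, `samtools sort` takes `-o` instead of an output prefix, and variant calling moved to `bcftools mpileup` and `bcftools call`. A rough equivalent under those assumptions is sketched below. The function name `call_snps` and the output file naming are hypothetical, and `-c` is used only to stay close to the consensus caller in the original script.

```shell
#!/bin/bash
set -euo pipefail

# Sketch of the slide pipeline for samtools/bcftools 1.x (assumed installed, along with bwa).
#   $1 = reference FASTA, $2 = reads FASTQ, $3 = output prefix (hypothetical naming)
call_snps() {
  local ref=$1 reads=$2 prefix=$3

  # Index the reference and align the reads
  bwa index "$ref"
  bwa mem "$ref" "$reads" > "$prefix.sam"

  # samtools 1.x reads SAM directly, so conversion and sorting are one step
  samtools sort -o "$prefix.sorted.bam" "$prefix.sam"
  samtools index "$prefix.sorted.bam"
  samtools faidx "$ref"

  # Pile up the reads and call variant sites only (-v) with the consensus caller (-c)
  bcftools mpileup -f "$ref" "$prefix.sorted.bam" \
    | bcftools call -cv -o "$prefix.snps_indels.vcf"
}

# Run only when all three arguments are given, e.g.:
#   ./call_snps.sh mt.fa sra_data.fastq ler
if [ "$#" -eq 3 ]; then
  call_snps "$@"
fi
```

Wrapping the steps in a function with parameters makes the same script reusable for each student's ecotype, and `set -euo pipefail` stops the run at the first failing step rather than carrying on with an empty or truncated file.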