Detection of genome duplication and alternative replication
using high-throughput DNA sequencing
New high-throughput sequencing (HTS) technologies such as 454 or SOLiD
generate copious amounts of data. A single SOLiD run can yield up to
30 Gb, which is enough to resequence 100 bacterial strains at an average
coverage (the number of times each base is read) of up to 100X.
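The arithmetic behind these figures can be checked with a minimal sketch. The 3 Mb genome size is an assumption (the text does not state one); it is simply the size at which 30 Gb shared evenly by 100 strains gives 100X.

```python
def mean_coverage(total_bases, n_genomes, genome_size):
    """Average number of times each base is read when the run's
    output is shared evenly among n_genomes genomes."""
    return total_bases / (n_genomes * genome_size)


# 30 Gb from one run, 100 strains, assumed genome size of 3 Mb
print(mean_coverage(30e9, 100, 3e6))  # 100.0
```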
An interesting and as yet unexploited by-product of HTS is that the number of reads
(the raw product of sequencing) is very tightly correlated with the quantity of DNA in the
sample. By analyzing the coverage, it is therefore easy to detect genome duplications,
deletions or alternate replication mechanisms (see figure).
A database containing the raw output of HTS runs is available on the web. In a
preliminary analysis, we downloaded all runs that include bacteria and plotted the
coverage of each sequenced strain against a reference strain. Some peaks appear
and suggest either duplications or alternate replication mechanisms, but they need to
be better characterized. The goals of this project are:
• Filter the interesting plots (removing short sequences, plasmids, etc.)
• Find alternate ways of smoothing plots (optional)
• Detect significant variation in the coverage along the genome (peak detection)
  and report interesting plots
• Try to estimate the likelihood of a duplication or alternate replication (slope
  analysis, listing the gene content of the region, presence of phage genes,
  repeats, etc.)
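As an illustration of the smoothing and peak-detection goals, here is a minimal sketch, not the project's method. It assumes coverage is available as a list of per-position (or per-bin) read depths; the function names, window size and ratio threshold are all hypothetical choices.

```python
from statistics import median


def smooth(coverage, window):
    """Sliding-window mean; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(coverage)):
        lo, hi = max(0, i - half), min(len(coverage), i + half + 1)
        out.append(sum(coverage[lo:hi]) / (hi - lo))
    return out


def flag_regions(coverage, window=5, ratio=1.5, min_len=3):
    """Return (start, end, mean_ratio) for runs whose smoothed coverage is
    at least `ratio` times the genome-wide median. `end` is exclusive."""
    sm = smooth(coverage, window)
    base = median(sm)
    regions, start = [], None
    for i, value in enumerate(sm + [0.0]):  # sentinel closes a trailing run
        high = base > 0 and value >= ratio * base
        if high and start is None:
            start = i
        elif not high and start is not None:
            if i - start >= min_len:
                segment = sm[start:i]
                regions.append((start, i, sum(segment) / len(segment) / base))
            start = None
    return regions


# Toy profile: a block at twice the background depth
profile = [100] * 20 + [200] * 10 + [100] * 20
print(flag_regions(profile))
```

A mean ratio close to 2 would be compatible with a duplication, whereas a gradual slope in coverage would call for the slope analysis mentioned above, which this sketch does not attempt.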
Supervisor: Lionel Guy ([email protected])
Wolbachia in the genome of the fruit fly Drosophila ananassae:
where, when and how?
Wolbachia is a bacterium that lives inside many different insects and affects their
reproduction. A few years ago, sequences from this bacterium were found when
sequences from the genome projects of several Drosophila species were examined.
Initially it was assumed that the Wolbachia sequences were present because these
Drosophila species were infected with Wolbachia. Later it was realized that in at least
one of these species, D. ananassae, Wolbachia DNA had been integrated into one of the
fly's chromosomes. All sequences from the Drosophila ananassae genome were searched
with the already published genome of wMel, a Wolbachia strain that lives in
Drosophila melanogaster, and an attempt was made to assemble them in order to find
where in the D. ananassae genome the Wolbachia DNA had been integrated, but without
success.
We have recently sequenced the genome of another Wolbachia strain, wRi, which lives
in D. simulans and is very similar to the sequences found in D. ananassae.
The project is to redo the analysis previously performed with wMel, using wRi
instead. The aim is to find breakpoints between Wolbachia and Drosophila sequences, in
order to find out where the integration took place, whether it happened on a single
occasion, and whether traces of changes can be seen in the integrated DNA.
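One possible way to look for breakpoints is to search for reads whose two halves match different genomes. The sketch below is an assumption about how this could be done, not the method used in the earlier wMel analysis. It takes hypothetical similarity-search hits as (read_id, query_start, query_end, source) tuples with 1-based inclusive coordinates; the tolerance parameters are arbitrary.

```python
def junction_reads(hits, max_overlap=10, max_gap=10):
    """Find reads with adjacent hits to two different sources.

    hits: iterable of (read_id, q_start, q_end, source).
    Returns {read_id: (left_source, right_source, approx_breakpoint)}.
    """
    by_read = {}
    for read_id, q_start, q_end, source in hits:
        by_read.setdefault(read_id, []).append((q_start, q_end, source))

    result = {}
    for read_id, read_hits in by_read.items():
        read_hits.sort()
        for (s1, e1, src1), (s2, e2, src2) in zip(read_hits, read_hits[1:]):
            gap = s2 - e1 - 1  # negative values mean the hits overlap
            if src1 != src2 and -max_overlap <= gap <= max_gap:
                result[read_id] = (src1, src2, (e1 + s2) // 2)
                break
    return result


example_hits = [
    ("r1", 1, 300, "drosophila"),
    ("r1", 298, 600, "wolbachia"),   # adjacent to the hit above: candidate junction
    ("r2", 1, 500, "wolbachia"),     # Wolbachia only
    ("r3", 1, 200, "wolbachia"),
    ("r3", 400, 600, "drosophila"),  # hits too far apart to call a junction
]
print(junction_reads(example_hits))
```

Candidate junctions pointing to the same genomic position in several reads would support a single integration site; many distinct positions would suggest otherwise.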
Supervisor: Lisa Klasson ([email protected])
faVIZ – a gene family visualization program
Genes can be grouped into gene families based on their evolutionary history. This
grouping is often central in evolutionary biology and is used as the basis for a large
number of other analyses. faMCL is a system recently developed at Molevol for
defining such gene families among relatively closely related bacterial species.
The aim of this project is to construct a small program to visualize, filter and calculate
statistics on the faMCL gene families. Your faVIZ program should help the user to
filter the clusters based on questions like
Which gene families contain
...exactly one gene from each species?
...at most one gene from each species?
...at most one gene from each species, except from species X?
...the maximum number of genes?
...no genes from species X?
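The filters listed above can be sketched as follows. The faMCL output format is not described here, so the representation (a dict mapping a family identifier to a list of (species, gene_id) pairs) and all function names are assumptions made for illustration.

```python
from collections import Counter


def species_counts(genes):
    """Count genes per species in one family; genes is a list of (species, gene_id)."""
    return Counter(species for species, _ in genes)


def exactly_one_each(families, all_species):
    return {fam for fam, genes in families.items()
            if all(species_counts(genes)[s] == 1 for s in all_species)}


def at_most_one_each(families, all_species, except_species=()):
    return {fam for fam, genes in families.items()
            if all(species_counts(genes)[s] <= 1
                   for s in all_species if s not in except_species)}


def lacking(families, species):
    return {fam for fam, genes in families.items()
            if species_counts(genes)[species] == 0}


def largest(families):
    top = max(len(genes) for genes in families.values())
    return {fam for fam, genes in families.items() if len(genes) == top}


# Hypothetical toy input
families = {
    "fam1": [("A", "a1"), ("B", "b1"), ("C", "c1")],
    "fam2": [("A", "a2"), ("A", "a3"), ("B", "b2")],
    "fam3": [("B", "b3"), ("C", "c2")],
}
species = ["A", "B", "C"]
print(sorted(exactly_one_each(families, species)))
print(sorted(at_most_one_each(families, species, except_species=("A",))))
```

Returning sets of family identifiers keeps the filters composable: two filters can be combined with ordinary set intersection or difference before the result is saved or plotted.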
For each set of (filtered) clusters, the program should also be able to visualize the
results as
...histograms (e.g. number of genes per cluster, or number of clusters per species)
...Venn diagrams (for any 2 or 3 species)
...maps of where the genes are located on the genome
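Before anything is drawn, the numbers behind a histogram or a Venn diagram have to be computed. The sketch below shows one way to do that, again assuming the hypothetical representation of families as lists of (species, gene_id) pairs; the plotting itself is left to whichever graphics library is chosen.

```python
from collections import Counter
from itertools import combinations


def genes_per_cluster_histogram(families):
    """Map cluster size -> number of clusters of that size."""
    return Counter(len(genes) for genes in families.values())


def venn_counts(families, species):
    """Map each non-empty subset of `species` to the number of families
    found in exactly that subset (ignoring species outside the list)."""
    wanted = frozenset(species)
    counts = {frozenset(combo): 0
              for size in range(1, len(species) + 1)
              for combo in combinations(species, size)}
    for genes in families.values():
        present = frozenset(sp for sp, _ in genes) & wanted
        if present:
            counts[present] += 1
    return counts


# Hypothetical toy input
toy_families = {
    "fam1": [("A", "a1"), ("B", "b1"), ("C", "c1")],
    "fam2": [("A", "a2"), ("A", "a3"), ("B", "b2")],
    "fam3": [("B", "b3"), ("C", "c2")],
}
for subset, n in venn_counts(toy_families, ["A", "B", "C"]).items():
    print(sorted(subset), n)
```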
The faMCL and faVIZ programs are intended for use in several genome projects
at Molevol. Therefore, it should be possible to save all filtered cluster sets in raw text
format, and all visualizations as high-resolution (or vector graphics) figure files
ready for publication.
Supervisor: Björn Nystedt ([email protected])
Creating genomic databases of interest for local blasts
To obtain information about unknown sequences, one has to compare them with a
database containing genes from known organisms. The most common way to do that
is to use BLAST and a database of choice. The goal of the project is to prepare a
tool for creating a database of interest for local BLAST from available microbial
genomes. The database, which will simply be a FASTA file, should be easy to update
(by downloading the newest release of microbial genomic information), either by
creating it again or by adding new genomes. The most important feature of the
database should be easy parsing of BLAST results; therefore each entry should store
the required information in the header.
The program should offer the following choices:
1. taxa used:
all available bacterial and archaeal genomes; only bacteria; only selected
phyla, species, genomes;
2. type of sequence information:
full genomes, only genes or only proteins;
3. type of information about the sequence kept in the header of each entry:
species, phylum, gene ID, protein function, ...
In practice, the program will have to do the following: download sequenced genomes,
check whether there is a new release (new genomes) and update the database, and prepare
the necessary database using the information provided for each genome (for example
the GenBank file). To complete the project you will have to figure out what kinds of
files are available from the NCBI FTP site (or another site if preferred) and how they
are structured, and then use BioPerl (or Python, or your own code if preferred) to extract
the information from the files and create the desired headers for the FASTA file (the
database).
To test your database you will have to format it (using formatdb) so that it will be
compatible with the BLAST program and use it with a set of environmental
sequences.
Supervisor: Katarzyna Zaremba (Kasia) ([email protected])
Comparative genomics of diplomonads: a pilot study of
Trepomonas sequences
We are currently performing a large comparative genomics project on members of
diplomonads, a group of microbial eukaryotes. The most studied member is Giardia
lamblia, a frequent cause of diarrhea in humans, but there are also free-living and
commensal members of the group. We have just started a sub-project on Trepomonas, a
free-living diplomonad found in oxygen-poor environments. Your task is to perform
pilot bioinformatic analyses on the sequence data we have obtained to find out where
our project is heading.
Available sequence datasets from various diplomonads:
• ~150 EST sequences from Trepomonas sp. (not analyzed)
• >200,000 genomic shotgun sequences and >20,000 expressed sequence
  tags (ESTs) from Spironucleus vortens (unassembled)
• 4-5x coverage of 454 sequences from Spironucleus salmonicida (draft
  assembly, not annotated)
• 590 clusters of EST sequences from Spironucleus barkhanus (assembled and
  annotated)
• 2 completely sequenced and annotated genomes from Giardia lamblia
Suggestions for questions to address:
• How are these organisms related? Do all genes tell the same story? If not,
  why?
• How many Trepomonas genes are shared with other diplomonads?
• How many have been acquired from other sources? From which lineages?
• What is the polyadenylation signal in Trepomonas?
• Is there a characteristic codon usage for Trepomonas?
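For the codon usage question, a minimal sketch of the counting step is given below. It assumes that in-frame coding sequences have already been extracted from the ESTs, which is itself part of the analysis; the function names and the example sequence are hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in one in-frame coding sequence.
    Incomplete trailing codons and codons with ambiguous bases are skipped."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if set(codon) <= set("ACGT"):
            counts[codon] += 1
    return counts


def codon_frequencies(sequences):
    """Relative frequency of each codon over a collection of coding sequences."""
    total = Counter()
    for sequence in sequences:
        total += codon_counts(sequence)
    n = sum(total.values())
    return {codon: k / n for codon, k in total.items()} if n else {}


# Made-up sequence, for illustration only
print(codon_frequencies(["ATGGCCGCCTAA"]))
```

Computing the same frequencies for the other diplomonad datasets would show whether the Trepomonas profile is actually distinctive.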
Supervisor: Jan Andersson ([email protected])