
Detection of genome duplication and alternative replication using high-throughput DNA sequencing

New high-throughput sequencing (HTS) technologies such as 454 or SOLiD generate copious amounts of data. With one SOLiD run, it is possible to obtain up to 30 Gb, or, in other words, to resequence 100 bacterial strains with an average coverage (number of times each base is read) of up to 100X. An interesting and as yet unexploited by-product of HTS is that the number of reads (the raw product of sequencing) is very tightly correlated with the quantity of DNA in the sample. By analyzing the coverage, it is therefore possible to detect genome duplications, deletions or alternate replication mechanisms (see figure).

A database including the raw output of HTS runs is available on the web. In a preliminary analysis, we downloaded all runs including bacteria and plotted the coverage of the sequenced strain against a reference strain. Some peaks appear and suggest either duplications or alternate replication mechanisms, but they need to be better characterized.

The goals of this project are:
• Filter the interesting plots (removing short sequences, plasmids, etc.)
• Find alternate ways of smoothing plots (optional)
• Detect significant variation in the coverage along the genome (peak detection) and report interesting plots
• Try to estimate the likelihood of a duplication or alternate replication (slope analysis, listing the gene content of the region, presence of phage genes, repeats, etc.)

Supervisor: Lionel Guy ([email protected])

Wolbachia in the genome of the fruit fly Drosophila ananassae: where, when and how?

Wolbachia is a bacterium that lives inside many different insects and affects their reproduction. A few years ago, sequences from this bacterium were found when sequences from the genome projects of several Drosophila species were examined. Initially, it was assumed that the Wolbachia sequences were present because these Drosophila species were infected with Wolbachia.
Later it was realized that in at least one of these Drosophila species, D. ananassae, Wolbachia DNA had been integrated into one of the fly's chromosomes. All sequences from the Drosophila ananassae genome were searched using the already published genome of wMel, a Wolbachia strain that lives in Drosophila melanogaster, and an attempt was made to assemble them in order to find where in the D. ananassae genome the Wolbachia DNA had been integrated, but without success. We have recently sequenced the genome of another Wolbachia strain, wRi, which lives in D. simulans and is very similar to the sequences found in D. ananassae. The project consists of repeating the analysis previously done with wMel, using wRi instead. The aim is to find breakpoints between Wolbachia and Drosophila sequences, in order to determine where the integration took place, whether it occurred on a single occasion, and whether traces of changes can be seen in the integrated DNA.

Supervisor: Lisa Klasson ([email protected])

faVIZ – a gene family visualization program

Genes can be grouped into gene families based on their evolutionary history. This grouping is often central in evolutionary biology and is used as the basis for a large number of other analyses. faMCL is a system recently developed at Molevol for defining such gene families among relatively closely related bacterial species. The aim of this project is to construct a small program to visualize, filter and calculate statistics on the faMCL gene families.

Your faVIZ program should help the user to filter the clusters based on questions like: Which gene families contain
...exactly one gene from each species?
...at most one gene from each species?
...at most one gene from each species, except from species X?
...the maximum number of genes?
...no genes from species X?

For each set of (filtered) clusters, the program should also be able to visualize the results as
...histograms (e.g.
number of genes per cluster, or number of clusters per species)
...Venn diagrams (for any 2 or 3 species)
...maps of where the genes are located on the genome

The faMCL and faVIZ programs are intended to be used in several genome projects at Molevol. Therefore, it should be possible to save all filtered cluster sets in raw text format, and all visualizations as high-resolution (or vector graphics) figure files ready for publication.

Supervisor: Björn Nystedt ([email protected])

Creating genomic databases of interest for local BLAST searches

To obtain information about unknown sequences, one has to compare them with a database containing genes from known organisms. The most common way to do that is by using BLAST and a database of choice. The goal of the project is to prepare a tool for creating a database of interest for local BLAST from available microbial genomes. The database, which will simply be a FASTA file, should be easy to update (by downloading the newest release of microbial genomic information), either by creating it again or by adding new genomes. The most important feature of the database should be easy parsing of BLAST results; therefore each entry should store the required information in the header.

The program should give the following choices:
1. taxa used: all available bacterial and archaeal genomes; only bacteria; only selected phyla, species, genomes;
2. type of sequence information: full genomes, only genes or only proteins;
3. type of information about the sequence kept in the header of each entry: species, phylum, gene ID, protein function, ...

In practice, the program will have to do the following: download sequenced genomes, check whether there is a new release (new genomes) and update the database, and prepare the necessary database using the information provided for each genome (for example the GenBank file).
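
The header requirement above could be sketched as follows. This is only an illustration under stated assumptions: the field order, the ";" separator and the function names (make_header, parse_header, format_entry) are hypothetical and not prescribed by the project. Fields are kept free of whitespace so that tools which cut identifiers at the first space keep the whole header.

```python
# Hypothetical header layout (an assumption, not a format given by the project):
#   >gene_id;species;phylum;function
FIELDS = ("gene_id", "species", "phylum", "function")
SEP = ";"


def make_header(record):
    """Build one FASTA header line from a dict holding the chosen fields."""
    values = []
    for field in FIELDS:
        value = str(record.get(field, "NA"))
        # Keep the separator unambiguous and the header free of whitespace.
        value = value.replace(SEP, "/")
        value = "_".join(value.split())
        values.append(value if value else "NA")
    return ">" + SEP.join(values)


def parse_header(header):
    """Recover the fields from a header line or from a reported hit identifier."""
    return dict(zip(FIELDS, header.lstrip(">").split(SEP)))


def format_entry(record, sequence, width=60):
    """Return a complete FASTA entry with the sequence wrapped at `width`."""
    lines = [make_header(record)]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

With such a layout, the species or phylum of every hit can be read back from the hit identifier with parse_header, without a second lookup in the original genome files.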
To complete the project you will have to figure out what kinds of files are available from the NCBI FTP site (or another site if preferred) and how they are structured. Then use BioPerl (or Python, or your own code if preferred) to extract the information from the files and create the desired headers for the FASTA file (the database). To test your database you will have to format it (using formatdb) so that it is compatible with the BLAST program, and use it with a set of environmental sequences.

Supervisor: Katarzyna Zaremba (Kasia) ([email protected])

Comparative genomics of diplomonads: a pilot study of Trepomonas sequences

We are currently performing a large comparative genomics project on members of the diplomonads, a group of microbial eukaryotes. The most studied member is Giardia lamblia, a frequent cause of diarrhea in humans, but there are also free-living and commensal members of the group. We have just started a sub-project on Trepomonas, a free-living diplomonad found in oxygen-poor environments. Your task is to perform pilot bioinformatic analyses on the sequence data we have obtained to find out where our project is heading.

Available sequence datasets from various diplomonads:
• ~150 EST sequences from Trepomonas sp. (not analyzed)
• >200,000 genomic shotgun sequences and >20,000 expressed sequence tags (ESTs) from Spironucleus vortens (unassembled)
• 4-5x coverage of 454 sequences from Spironucleus salmonicida (draft assembly, not annotated)
• 590 clusters of EST sequences from Spironucleus barkhanus (assembled and annotated)
• 2 completely sequenced and annotated genomes from Giardia lamblia

Suggestions for questions to address:
• How are these organisms related?
• Do all genes tell the same story? If not, why?
• How many Trepomonas genes are shared with other diplomonads? How many have been acquired from other sources? From which lineages?
• What is the polyadenylation signal in Trepomonas?
• Is there a characteristic codon usage for Trepomonas?
Supervisor: Jan Andersson ([email protected])
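
As a starting point for the codon usage question in the diplomonad project, a minimal sketch is given below. It assumes that the input sequences are coding sequences already in the correct reading frame, which raw EST reads are not, so a frame or ORF detection step would be needed first. The function names are hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in one coding sequence, read in frame from its first base."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    # Ignore trailing bases that do not form a complete codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
            counts[codon] += 1
    return counts


def codon_frequencies(sequences):
    """Pool codon counts over many sequences and return relative frequencies."""
    total = Counter()
    for seq in sequences:
        total.update(codon_counts(seq))
    n = sum(total.values())
    return {codon: count / n for codon, count in total.items()} if n else {}
```

Comparing the resulting frequency tables between Trepomonas and the other diplomonads would be one simple way to look for a characteristic usage pattern; a proper analysis would compare synonymous codons per amino acid.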
