Manual of aliquotG
December 5, 2011

1 Installation and Usage

The program is designed to solve the Genome Aliquoting Problem (see our article in Citation), that is, to reconstruct the genome (Gdup) as it was just after a WGD (whole genome duplication) from an extant rearranged duplicated genome. It was developed on Linux (we used Ubuntu 10.04). To install it, open a terminal, extract all files into a folder, and change directory to that folder by typing:

    cd the folder

then type the following command to build it:

    make

The executable file aliquotG will then be in "the folder/bin/", and you can run it from that directory.

Usage:

    aliquotG -i [infile] -o [outfile] <option>

Option:

    –nd N       set the duplicate size to N
    –d Depth    set the search depth; larger values increase the run time (recommended values: 1 to 5)

Infile format: the file contains FASTA-like sequences. Each sequence name occupies a single line beginning with '>'. The name is split into two parts by '|': the first part is the species name and the second is the chromosome name (or scaffold name). The lines following each '>' line give the sequence of the corresponding chromosome, represented as a sequence of signed natural numbers. Examples are provided in the program files.

2 Method Summary

The program implements a heuristic algorithm. The process consists of three steps: (1) infer the strong adjacencies of the labeled perfectly duplicated genome Gdup; (2) infer the weak adjacencies; (3) remove circular chromosomes and calculate the DCJ (double cut and join) distance.

We denote the partial graph of a genome G as PG(G). In step 1, we set the weight of each edge to its multiplicity in the partial graph and ignore all edges whose weight is only 1. We then compute a maximum weighted matching and, for each pair of matched vertices, add an edge connecting them to a graph PG(H), i.e. the partial graph of a genome H (H is empty initially and, at the end, is the result Gdup). We assign the weight r to that edge, where r is the duplicate size, i.e. the number of genes in each gene family. Finally, we label and contract all matched pairs (see the article in Citation).

In step 2, we assign a new weight pair (Np, Lp) to each pair of unmatched vertices (the '–d' option restricts this to pairs whose shortest path between the two vertices is ≤ Depth). We then use the same labeling and contracting process as in step 1.

In step 3, we transform all circular chromosomes in H into linear ones and calculate the DCJ distance. An example is shown in Figure 1.

Figure 1. An example of the algorithm. Black edge: edge in Gobs or G. Gray dashed edge: edge in Gdup or H.
Top. Inferring strong adjacencies: each regular (non-subscript) natural number (gene family ID) represents a gene family, while the subscript (copy ID) distinguishes the genes within the same gene family. Gray shaded ellipses with the same gray level indicate the adjacencies and the corresponding edges in the partial graph (top right). Cyan shading highlights a strong adjacency (or edge). Red vertices are matched. A thick black line is a contracted edge. The black number on each edge indicates the corresponding copy ID in genome Gobs.
Middle. Inferring weak adjacencies: blue or light blue numbers are the weights Np, Lp for the pair of vertices linked by a gray dotted edge. Here Depth is set to 1. Other symbols are the same as in Top.
Bottom. Reconstructing genome Gdup: the resulting genome is on the right.

3 Citation

This paper describes the program:

Zelin Chen, Shengfeng Huang, Yuxin Li and Anlong Xu. 2011. An improved heuristic algorithm for genome aliquoting.
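
The infile format described in Section 1 can be illustrated with a short Python sketch. This is not part of aliquotG; it is a minimal reader written only from the description above, and it assumes that the signed numbers are separated by whitespace and that a chromosome's sequence may span several lines. The function name parse_genome and the species name "demo" are hypothetical and chosen for illustration only.

```python
def parse_genome(text):
    """Read FASTA-like text into {(species, chromosome): [signed ints]}.

    Assumptions (not confirmed by the manual): genes are separated by
    whitespace, and a sequence may continue over several lines.
    """
    genome = {}
    current = None  # key of the chromosome being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Name line: '>species|chromosome' on a single line.
            species, sep, chrom = line[1:].partition("|")
            if not sep:
                raise ValueError("name line lacks '|': " + line)
            current = (species.strip(), chrom.strip())
            if current in genome:
                raise ValueError("duplicate sequence name: " + line)
            genome[current] = []
        else:
            if current is None:
                raise ValueError("sequence data before any '>' line")
            # Each token is a signed gene-family number, e.g. 3 or -3.
            genome[current].extend(int(tok) for tok in line.split())
    return genome


# Hypothetical input: one species, two chromosomes.
SAMPLE = """>demo|chr1
1 -2 3
1
>demo|chr2
-3 2
"""

if __name__ == "__main__":
    for (species, chrom), genes in parse_genome(SAMPLE).items():
        print(species, chrom, genes)
```

A reader like this is convenient for checking an input file before running aliquotG, for example to confirm that every gene family occurs the same number of times as the duplicate size passed on the command line.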