Download Protein Family Analysis: Protein Family Sorter

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

G protein–coupled receptor wikipedia , lookup

Gene expression profiling wikipedia , lookup

Gene expression wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Non-coding DNA wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Protein wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Magnesium transporter wikipedia , lookup

QPNC-PAGE wikipedia , lookup

Protein moonlighting wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Protein (nutrient) wikipedia , lookup

Interactome wikipedia , lookup

Western blot wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein adsorption wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Proteolysis wikipedia , lookup

Molecular evolution wikipedia , lookup

Genome evolution wikipedia , lookup

Transcript
 Protein Family Analysis: Protein Family Sorter The Protein Family Sorter tool at PATRIC allows users to select a set of genomes of interest (maximum up to 500 genomes) and examine distribution of protein families across the genomes, commonly referred to as the ‘pan genome’, which in this case refers to the superset of proteins found in all selected genomes. This tool provides various filtering options to quickly locate protein families that are conserved across all the genomes (‘core genome’), conserved only in a subset of the selected genomes (‘accessory genome’) or that match a specified function. A tabular view shows protein families matching filtering criteria and an interactive heatmap viewer provides a bird’s-­‐eye (‘pan genome’) view of the distribution of the protein families across multiple genomes, with clustering and anchoring functions to show relative conservation of synteny and identify lateral transfers. I. Finding the tool. Go to the Tools tab across the top of any PATRIC page and click on it. This will open up a list of tools (A). Click on the Protein Family Sorter (Red arrow in A). This opens up a landing page for the tool, and users who are logged in will see the groups they have created in the box under “1. Select organism(s)” in Panel B. A"
B"
2. Select one or more groups for comparison by first clicking on the check box in front of the group name (Blue Arrow 1) followed by clicking on the Search button (Blue arrow 2). 2"
1"
II. The Pan Genome. This opens a page that contains a filter for genomes on the left, and a table of protein families on the right. This page loads with the pan genome for the selection, showing all the protein families across all the genomes. 1 III. The Core Genome. To find the core genome (the protein families that all genomes have at least one member in), click on the circle under the column called “Present in all families” in the row called “Genome Name” (Blue arrow 1 below). This will auto select “Present in all families” for all the genomes in your selection. It will also resort the table to show the protein families that meet this condition (highlighted by the red box). *"
1"
IV. Individual Accessory Genome. Users can find the accessory genome for individual strains by first checking the circle under the column called “Absent in all families” in the row called “Genome Name” (Blue arrow 1 below). This auto selects “Absent in all families” for all the genomes for the selection. It will also resort the table to show the protein families that meet this condition (highlighted by the red box below, where the number of families found is 0). Next, select to the circle under the column called “Present in all families” for a single genome (Blue arrow 2 below). This will show the protein families that are unique to just that genome. It will also resort the table to show the protein families that meet this condition (highlighted by the red box in the lowest panel). 2 1"
2"
V. Searching for specific protein family names. The text box near the bottom of the filter can be used to find specific protein families. This will filter on the name of the family, not necessarily on the product description of the individual genes, so the results from the Feature Table and the Protein Family table will not necessarily match. To find specific protein families, enter a name (NOT a locus tag) in the filter box on the left (Blue arrow 1) and then click the filter button (Blue arrow 2). This will filter the results to show the protein families that match the search. 1"
2"
VI. Visualizing protein families. 1. A visual representation of the presence or absence of protein families across particular genomes is available in a Heatmap view. Click on the Heatmap tab (Blue Arrow 1) that opens the heatmap view page. Protein families are listed on the x-­‐axis across the top of this view, and genomes along the y-­‐axis. 3 1"
2. To see the names of the individual protein families and/or genomes move the sliders to expand the view as shown below (Red arrows indicate sliding direction). This will open the heatmap so that the names of the genomes and the protein families can be read. 3. Clicking on the legend (Red asterisk) shows what the different colors of the cells mean. Yellow cells in the heatmap indicate a single protein that belongs to that protein family annotated in that genome. Mustard colored cells indicate that there are two proteins are belong to that family in that genome. Orange cells indicate 3 or more proteins, and when there is a black cell it means that the genome does not have a protein annotated in that family. *"
4 VII. Clustering protein families. Click on the word cluster (Blue arrow 1 below) to resort to heatmap to show similar patterns in the presence or absence of protein families across all the genomes in the selection. The default settings for the clustering button are based on the Pearson Correlation Algorithm and a Pairwise Average-­‐linkage Clustering Type. The heatmap below shows similar patterns in two closely related members of the Brucella ceti clade among the selected genomes. 1"
ce%"M13"
ce%"M644"
VIII. Anchoring protein families. 1. PATRIC also provides a function where the presence and absence of protein families across all the genomes can be compared in the order that the genes appear in a single genome. “Anchoring” is a good way to look for genomic islands. Click on the down arrow that is part of the text box following “Advanced Clustering” (Blue arrow 1) and this opens a dropdown box that shows all the genomes in the group. Select one genome from that box by clicking on its name (Blue arrow 2). 1"
2"
2. This resorts the protein families to show the order that they occur in the genes occur on the reference genome. Using the scroll bar at the bottom of the heatmap (Red arrow in Panel A) allows users follow along the genome (Panel B). 5 A"
B"
3. Mouse over individual cells (Red arrow 1 below) shows the names of both the genome and the protein family in the blue header box above the heatmap (Red arrow 2). 2"
1"
IX. Downloading data from Heatmap. 1. To get the data and names on protein families, use the mouse to draw a box around the area of interest in the heatmap (Red arrow 1 in Panel A). A pop-­‐up window will appear that allows users to download the heatmap data in the selection, download the proteins from your selection, show the proteins from the selection, add the proteins to a group, or to cancel (Panel B). To see the proteins, click on “Show Proteins” (Red Arrow 2 below). A"
B"
1"
2"
6 2. A new window will load that shows the data behind the selection, including the name of the genome, accession number, PATRIC and RefSeq locus tags, the size of the protein, and the product description. Once in this view, users have a number of options that they can perform with these genes. These include saving them to a group (A), get the nucleotide or amino acid sequences (B), download the information (C), see if they were important in any metabolic pathways (D), generate a multiple sequence alignment for them (E), or find other identifiers that map to them from other resources (F). D"
A"
B"
C"
E"
F"
X. Pathway Summary of genes in the selection. 1. To find functionality that genes from the selection might share, click on the box in front of “Genome Name” (Arrow 1 below). This will select all the genes in the table. Then click on the Pathway Summary button at the top of the table (Arrow 2). 2"
1"
17. This will open a table that shows all the pathways that the genes selected are present in (Panel A). To find the pathway that has the most of the selected genes present in it, double-­‐
click on the “# of Genes Selected” column header (Panel B). This will reorder the pathways to show the one that has the most of the selected genes in it at the top (Panel C). Click on that pathway (Blue arrow 1). 7 A"
B"
C"
1"
18. This will open up a page that contains a summary of the Enzyme Commission (EC) numbers present in this genome on the left, and the KEGG map for that pathway on the right. 19. Clicking on the legend opens it. Green boxes indicate that this genome has at least one gene annotated with that EC number. Blue boxes indicate those genes that were part of your selection from the heat map, and white boxes indicate that these genes are not annotated in this genome. 8