Chap. 11 Protein Structures

Amino Acid
General structure of amino acids:
• an amino group
• a carboxyl group
• an α-carbon bonded to a hydrogen and a side-chain group, R
• The side chain R determines the identity of a particular amino acid
Color key for the model: R: large white and gray; C: black; nitrogen: blue; oxygen: red; hydrogen: white

Protein
• Protein: a polymer consisting of amino acids (AAs) linked by peptide bonds
• An AA in a polymer is called a residue
• Folded into 3D structures
• The structure of a protein determines its function
• Primary structure: the linear arrangement of AAs
• The AA sequence (primary structure) determines the 3D structure of a protein, which in turn determines its properties
• N- and C-terminal
• Secondary structure: short stretches of AAs
• Tertiary structure: overall 3D structure

Protein Structures: secondary structure
• Secondary structures have repetitive interactions resulting from hydrogen bonding between the N-H and carbonyl (C=O) groups of the peptide backbone
• Conformations of the AA side chains are not part of the secondary structure
• α-helix
• β-pleated sheet: parallel or antiparallel
[Figure: 3D form of an antiparallel β-pleated sheet]

Secondary structure: domain
• Part of a chain can fold independently of the folding of other parts
• Such an independently folded portion of a protein is called a domain (super-secondary structure)
[Figure: (a) βαβ unit, (b) αα unit (helix-turn-helix), (c) β-meander, (d) Greek key]

Domain
• Larger proteins are modular
• Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins
• Domains are not only structurally but also functionally discrete units: domain family members are structurally and functionally conserved and have been recombined in complex ways during evolution
• Domains can be seen as the units of evolution
• Novelty in protein function often arises from the gain or loss of domains, or from re-shuffling existing domains along the sequence
• For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)

DNA binding domains
http://en.wikipedia.org/wiki/DNA-binding_domain

Motif
• A short, conserved region (frequently the most conserved region of a domain)
• Critical for the domain to function

Domain vs. Motif
• Motifs are structural characteristics
• Domains are functional regions, usually consisting of a few motifs

Motif Representation
• In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks
• They tend to correspond to the core structural and functional elements of the proteins

Motif
• The Greek key motif is often found in β-barrel tertiary structure
[Figure: (a) complement control protein module, (b) immunoglobulin module, (c) fibronectin type I module, (d) growth factor module, (e) kringle module]
[Figure: (a) linked series of β-meanders, (b) Greek key pattern, (c) alternative α units, (d) top and side views (the α-helical section is outside)]

Secondary structure: conformation
Two types of protein conformations:
• Fibrous
• Globular: folds back onto itself to create a spherical shape
[Figure: (a) schematic diagrams of fibrous and globular proteins, (b) computer-generated model of a globular protein]

Secondary Structure Prediction
• Ab initio prediction (from the AA sequence) is still an open problem
• 1974, Peter Chou and Gerald Fasman: use known structures to determine which AAs contribute to each secondary structure
• Propensity values: the likelihood that an AA appears in a particular structure: P(a), P(b) and P(turn). A value above the average indicates a greater than average chance (log-odds ratios); the values are scaled by 100, so 100 is average
• Frequency values: the frequency of an AA being found at each of the four positions of a hairpin beta-turn: f(i), f(i+1), f(i+2), f(i+3)
• Accuracy is around 50-60%, but the method remains popular as the foundation for later prediction programs

AA             P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142   83    66       0.060  0.076   0.035   0.058
Arginine       98    93    95       0.070  0.106   0.099   0.085
Asparagine     67    89    95       0.161  0.083   0.191   0.091
Aspartic acid  101   54    146      0.147  0.110   0.179   0.081
Cysteine       70    119   119      0.149  0.050   0.117   0.128
Glutamic acid  151   37    74       0.056  0.060   0.077   0.064
Glutamine      111   110   98       0.074  0.098   0.037   0.098
Glycine        57    75    156      0.102  0.085   0.190   0.152
Histidine      100   87    95       0.140  0.047   0.093   0.054
Isoleucine     108   160   47       0.043  0.034   0.013   0.054
Leucine        121   130   59       0.061  0.025   0.036   0.070
Lysine         114   74    101      0.055  0.115   0.072   0.095
Methionine     145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  113   138   60       0.059  0.041   0.065   0.065
Proline        57    55    152      0.102  0.301   0.034   0.068
Serine         77    75    143      0.120  0.139   0.125   0.106
Threonine      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     108   137   96       0.077  0.013   0.064   0.167
Tyrosine       69    147   114      0.082  0.065   0.114   0.125
Valine         104   170   50       0.062  0.048   0.028   0.053

Chou-Fasman Algorithm
Step 1: identify alpha-helices
• Find a region of six contiguous residues where at least four have P(a) > 103
• Extend the region in both directions until a set of four contiguous residues with P(a) < 100 is found
• If the region's average P(a) > 103, its length is > 5, and ∑P(a) > ∑P(b), call it alpha
Step 2: identify beta strands
• Find a region of five contiguous residues where at least three have P(b) > 105
• Extend the region until a set of four contiguous residues with P(b) < 100 is found
• If the region's average P(b) > 105 and ∑P(b) > ∑P(a), call it beta
Step 3: beta turns
• For each residue j, determine the turn propensity P(t) as $P(t)_j = f(i)_j \cdot f(i+1)_{j+1} \cdot f(i+2)_{j+2} \cdot f(i+3)_{j+3}$
• A turn is predicted at position j if P(t) > 0.000075, the average P(turn) from j to j+3 is > 100, and ∑P(a) < ∑P(turn) > ∑P(b)
Step 4: overlaps
• If an alpha region overlaps with a beta region, the ∑P(a) and ∑P(b) of the overlapping region determine the most likely structure there
• If ∑P(a) > ∑P(b) for the overlapping region, alpha
• If ∑P(a) < ∑P(b) for the overlapping region, beta
• If ∑P(a) = ∑P(b), no valid call

Secondary structure prediction
Chou and Fasman (1974): based on the frequencies of amino acids found in α-helices, β-sheets, and turns. Proline occurs at turns, but not in α-helices.
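Step 1 of the Chou-Fasman procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published implementation: the P(a) and P(b) values are copied from the propensity table in this transcript, the name `find_helices` is hypothetical, and the extension rule is read here as "stop when the four-residue window containing the next residue has a mean P(a) below 100", which is one common reading of the slide.

```python
# P(a) and P(b) values copied from the propensity table in the transcript,
# keyed by one-letter amino acid code.
PA = dict(A=142, R=98, N=67, D=101, C=70, E=151, Q=111, G=57, H=100, I=108,
          L=121, K=114, M=145, F=113, P=57, S=77, T=83, W=108, Y=69, V=104)
PB = dict(A=83, R=93, N=89, D=54, C=119, E=37, Q=110, G=75, H=87, I=160,
          L=130, K=74, M=105, F=138, P=55, S=75, T=119, W=137, Y=147, V=170)


def mean(values):
    return sum(values) / len(values)


def find_helices(seq):
    """Return half-open (start, end) index pairs of predicted alpha-helices."""
    pa = [PA[r] for r in seq]
    pb = [PB[r] for r in seq]
    n = len(seq)
    helices = []
    i = 0
    while i + 6 <= n:
        # Nucleation: six contiguous residues, at least four with P(a) > 103.
        if sum(p > 103 for p in pa[i:i + 6]) < 4:
            i += 1
            continue
        start, end = i, i + 6
        # Assumed stop rule: keep extending by one residue while the
        # four-residue window that includes it has a mean P(a) >= 100.
        while end < n and mean(pa[end - 3:end + 1]) >= 100:
            end += 1
        floor = helices[-1][1] if helices else 0  # do not run into the previous helix
        while start > floor and mean(pa[start - 1:start + 3]) >= 100:
            start -= 1
        region_pa, region_pb = pa[start:end], pb[start:end]
        # Acceptance test from Step 1 of the slide.
        if (mean(region_pa) > 103 and end - start > 5
                and sum(region_pa) > sum(region_pb)):
            helices.append((start, end))
            i = end
        else:
            i += 1
    return helices


print(find_helices("AEELLKKLAEEL"))
```

A peptide built from strong helix formers (A, E, L, K) is called as one helix over its whole length, while a Gly/Pro repeat never passes the nucleation test. Steps 2 to 4 would follow the same pattern with the P(b), P(turn) and f columns.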
GOR (Garnier, Osguthorpe, Robson) is a related algorithm.
Modern algorithms use multiple sequence alignments and achieve a higher success rate (about 70-75%). (Page 427)

Secondary structure prediction
Web servers (Table 11-3, page 429): GOR4, Jpred, NNPREDICT, PHD, Predator, PredictProtein, PSIPRED, SAM-T99sec

Secondary Structure Prediction by PSIPRED
• Predicts the regions of the protein that form alpha-helix, beta-sheet, or random coil
• http://bioinf.cs.ucl.ac.uk/psipred/
• Based on neural networks
• Uses a Chou-Fasman-like algorithm, but first does a PSI-BLAST search to get a collection of sequences related to the input (searching for orthologous sequences)
• Univ. College London, 1999

PSI-BLAST is performed in five steps (page 146)
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
   Inspect the blastp output to identify empirical “rules” regarding the amino acids tolerated at each position (for example R,I,K / C / D,E,T / K,R,T / N,L,Y,G)

Example PSSM: one row per position of the query, from position 1 to the end of your PSI-BLAST query protein, and one column for each of the 20 amino acids

Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
 1  M   -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
 2  K   -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
 3  W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 4  V    0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
 5  W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 6  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 7  L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 8  L   -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
 9  L   -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
10  L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
11  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
12  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
13  W   -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
14  A    3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
15  A    2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
16  A    4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
(the matrix continues to the end of the query: 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A)

• Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on its position in the protein: 5 at positions 6, 11 and 12, but 3, 2 and 4 at positions 14, 15 and 16
• The same holds for tryptophan: matching tryptophan scores 12 at positions 3 and 5, but 7 at position 13

3. The PSSM is used as a query against the database
4. PSI-BLAST estimates statistical significance (E values)
5. Repeat steps [3] and [4] iteratively, typically 5 times.
At each new search, a new profile is used as the query. (Page 146)

SRC protein
• Tyrosine kinase: an enzyme that puts a phosphate group on a tyrosine residue (phosphorylation)
• Activates an inactive protein, and eventually activates cell-division proteins
• NP_005408

>gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS
DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV
APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC
FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL
DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT
ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC
PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Examining Crystal Structure
• Cn3D: NCBI structure viewer and modeling tool
• DeepView: SWISSPROT
• Jmol

NCBI Structure database
• Links to the NCBI MMDB (Molecular Modeling Database)
• MMDB contains experimentally verified protein structures
• SRC: MMDB ID 56157, PDB ID 1FMK
• "View Structure" in the NCBI Structure database opens a Cn3D window
• Click to rotate; Ctrl-click to zoom; Shift-click to move
• Rendering and coloring menus

Tertiary structure
• The 3D arrangement of all atoms in the molecule
• Considers the arrangement of helical and sheet sections, the conformations of side chains, the arrangement of the atoms of side chains, etc.
Experimentally determined by:
• X-ray crystallography: measures the diffraction patterns of atoms
• NMR (Nuclear Magnetic Resonance) spectroscopy: uses protein samples in aqueous solution
[Figure: tertiary structure of α-lactalbumin and myoglobin]

Protein families
• Groups of genes of identical or similar sequence are common
• Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product, e.g., a genome contains multiple copies of ribosomal RNA genes
• Human chromosome 1 has 2000 genes for 5S rRNA (S: sedimentation coefficient), and chromosomes 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S
• Amplification of rRNA genes evolved because of the heavy demand for rRNA synthesis during cell division
• These rRNA genes are examples of gene families having identical or near-identical sequences
• Sequence similarities indicate a common evolutionary origin
• The α- and β-globin families have distinct sequence similarities and evolved from a single ancestral globin gene

Protein families and superfamilies
• Dayhoff classification, 1978
• Protein families: at least 50% AA sequence similarity (based on physico-chemical AA features)
• Related proteins with less similarity (35%) belong to a superfamily and may have quite diverse functions
• α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily

Protein family database
• Pattern or secondary database derived from sequences
• A pattern may be the most conserved aspect of a sequence family
• The most conserved part may vary between species
• Use a scoring system to account for some variability: position-specific scoring matrix (PSSM) or profile
• In contrast, a pairwise alignment uses the same weights regardless of position
• Protein family databases are derived by different analytical techniques
• But all try to find motifs: conserved regions considered to reflect shared structural or functional characteristics
• Three groups: single motifs, multiple motifs, or full domain alignments

Protein family databases
Pattern or secondary databases derived from sequences

Database  Data source            Stored info
PROSITE   Swiss-Prot             Regular expressions (patterns) of the single most conserved motif
Profiles  Swiss-Prot             Weighted matrices (profiles) of position-sensitive weights
PRINTS    Swiss-Prot and TrEMBL  Aligned motifs (fingerprints)
Pfam      Swiss-Prot and TrEMBL  Multiple sequence alignment of a protein domain or conserved region
Blocks    InterPro/PRINTS        Aligned motifs (blocks)
eMOTIF    Blocks/PRINTS          Permissive regular expressions

Single Motif Method
• Regular expression (PROSITE)
• Example, PDB 1ivy, Carboxypept_Ser_His (PS00560):
  [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
• [ ]: any of the enclosed symbols; x: any residue; (3): number of repeats
• Fuzzy regular expression: build regular expressions with information on the shared biochemical properties of AAs, which provides flexibility according to AA group clustering

Multiple motif methods
PRINTS
• Encodes multiple motifs (called fingerprints) as ungapped, unweighted local alignments
BLOCKS
• Derived from PROSITE and PRINTS
• Uses the most highly conserved regions in the protein families in PROSITE
• Uses a motif-finding algorithm to generate a large number of candidate blocks
• Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors
• Blocks are iteratively extended and ultimately encoded as ungapped local alignments
• Graph theory is used to assemble a best set of blocks for a given family
• Uses a position-specific scoring matrix (PSSM), similar to a profile

Full domain alignment
Profiles
• Use a family-based scoring matrix via dynamic programming
• Hold position-specific information on insertions and deletions in the sequence family
Hidden Markov Model (HMM)
• PFAM, SMART and TIGRFAM represent full domain alignments as HMMs
PFAM
• Represents each family as a seed alignment, a full alignment, and an HMM
• The seed alignment contains representative members of the family
• The full alignment contains all members of the family as detected with the HMM constructed from the seed alignment

Structure-based Sequence Alignment
• It is well known that alignment by sequence similarity alone is not always correct, and that proteins can have similar structure with no sequence similarity
• Sequence alignment is therefore augmented by structural alignments
• COMPASS, HOMSTRAD, PALI, ..

Protein Structure Comparison/Classification

Protein structures
Domain
• The polypeptide chain in a protein folds into a ‘tertiary’ structure
• One or more compact globular regions are called domains
• The tertiary structure associated with a domain region is also described as a protein fold
Multi-domain
• Proteins whose polypeptide chains fold into several domains
• Nearly half the known globular structures are multidomain, more than half of them with two domains
• Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in PDB

Structure comparison algorithms
Two main components of structure comparison algorithms:
• Scoring similarities in structural features
• An optimization strategy maximizing the similarities measured
Most are based on geometric properties derived from 3D coordinates
• Intermolecular (inter) methods: superpose the structures by minimizing the distance between superposed positions
• Intramolecular (intra) methods: compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions
• Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
[Figure: inter vs. intra comparison; RMSD]

Distant homolog
• Structure is more conserved than sequence during evolution
• Structural similarity between distant homologs can be found
[Figure: distant homologs, annotated with pairwise sequence similarity and, in parentheses, the SSAP structural similarity score (0-100)]
[Figure: structural variations in protein families]

Structure comparison algorithms
• SSAP, 1989: residue level, intra, dynamic programming
• DALI, 1993: residue fragment level, intra, Monte Carlo optimization
• COMPARER, 1990: multiple element level, both, dynamic programming

Structure classification hierarchy
• Class level: proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations): mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β regions are segregated)
• Architecture: the manner in which secondary structure elements are packed together (arrangement of secondary structures in 3D space)
• Fold group (topology): the orientation of the secondary structures and the connectivity between them
• Superfamily
• Family
[Figure: hierarchy example]

Protein Structure databases
PDB
• Over 20,000 entries deduced from X-ray diffraction, NMR or modeling
• Massively redundant
• 1FMK, 1BK5, 2F9C, ..
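Returning to the RMSD measure defined under the structure comparison algorithms: a minimal sketch of the calculation, assuming the two structures have already been superposed and that the lists hold the (x, y, z) coordinates of equivalent atoms in the same order. The function name `rmsd` is hypothetical, and no superposition step is included.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    # Square root of the average squared distance.
    return math.sqrt(total / len(coords_a))


print(rmsd([(0, 0, 0), (0, 0, 0)], [(1, 0, 0), (0, 1, 0)]))
```

An intermolecular method would search for the rotation and translation that minimize this value; an intramolecular method would instead compare internal distance sets and needs no superposition.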
Protein Structure databases
SCOP (Structural Classification of Proteins)
• A multi-domain protein is split into its constituent domains
• Known structures are classified according to evolutionary and structural relationships
• Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes
• Family: groups together domains with clear sequence similarities
• Superfamily: a group of domains with structural and functional evidence for their descent from a common evolutionary ancestor
• Fold: a group of domains with the same major secondary structures and the same chain topology
• Domains are identified manually by visually inspecting structures
• Proteins in the same superfamily often have the same function

Protein Structure databases
CATH (Class, Architecture, Topology, Homology)
• Homology: clustered domains with 35% sequence identity and shared common ancestry
• 800 fold families, 10 of which are super-folds
• 2009, www.cs.uml.edu/~kim/580/08_cath.pdf

Structure classification
• Most structure classifications are established at the domain level
• The domain is thought to be an important evolutionary unit, and domain boundaries are easier to determine from structural data than from sequence data
Criteria for assessing domain regions within a structure:
• The domain possesses a compact globular structure
• Residues within a domain make more internal contacts than contacts to residues in the rest of the polypeptide
• Secondary structure elements are usually not shared with other regions of the polypeptide
• There is evidence for the existence of this region as an evolutionary unit
[Figure: CATH classifications; multi-domain structures]

Protein Function/Structure Prediction

Protein Function Prediction
• In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function
• The more similar the sequence, the more similar the function is likely to be
• This is not always true
• Can clues to function be derived directly from the 3D structure?

Definition of function
• Function can be described at many levels: biochemical, biological processes, pathways, organ level
• Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..
• GO (Gene Ontology) scheme

Protein Function Prediction
Sequence-based: largely unreliable
Profile-based
• Profiles are constructed from the sequences of whole protein families, with families grouped by 3D structure or function (as in Pfam)
• Start with the sequences matched by an initial search, then iteratively pull in more remote homologues
• More sensitive than simple sequence comparison, because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable
Structure-based
• Fold-based: proteins sharing similar functions often share similar folds, resulting from descent from a common ancestral protein
• Sometimes the function of a protein changes during evolution while the fold remains unchanged, so a fold match is not always reliable
• Surface clefts and binding pockets

Chap. 12 RNA Structures

RNA structure
• Stem-loop structure
• A loop structure: there is a loop between i and j when the base at i pairs with the base at j
• Or the base at i+1 pairs with the base at j
• Or the base at i pairs with the base at j-1
• Or a multiple loop

RNA secondary structure
• Search for the minimum free energy
• Gibbs free energy at 37 °C
• Free energy increments of base pairs are counted as stacks of adjacent pairs
• Successive CGs: -3.3 kcal/mol
• There is an unfavorable loop initiation energy to constrain the bases in a loop

RNA structure prediction
Ad-hoc approach
• Simply look at a strand and find areas where base pairing can occur
• It is possible to find many locations where folds can occur
• A prediction should be able to determine the most likely one
• What should be the criteria?
1980, Nussinov-Jacobson Algorithm
• The more stable structure is the most likely structure
• Find the fold that forms the greatest number of base pairs (base pairing lowers the overall energy of the strand, making it more stable)
• Checking all possible folds is infeasible, so dynamic programming is used

Nussinov-Jacobson Algorithm
• Create an n×n matrix for a sequence with n bases
• Initialize the diagonal to 0
• Fill the matrix with the largest number of base pairs, S:

$$S(i,j) = \max \begin{cases} S(i+1,\, j-1) + w(i,j) \\ S(i+1,\, j) \\ S(i,\, j-1) \\ \max_{i<k<j} \left[ S(i,k) + S(k+1,\, j) \right] \end{cases}$$

• w(i,j) = 1 if base i can be paired with base j, and 0 otherwise
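The matrix fill can be sketched directly from the recurrence. This is a minimal illustration, assuming only Watson-Crick pairs (A-U and G-C) count as pairable and, as in the slide, no minimum loop length is enforced; the names `can_pair` and `nussinov` are hypothetical, and the sketch returns only the pair count, without the traceback that recovers the structure.

```python
def can_pair(a, b):
    """w(i, j): assumed here to allow only A-U and G-C pairs."""
    return {a, b} in ({"A", "U"}, {"G", "C"})


def nussinov(seq):
    """Return the maximum number of base pairs for an RNA sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] holds the best pair count for the subsequence i..j;
    # the diagonal (and everything below it) stays 0.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):              # fill by increasing j - i
        for i in range(n - span):
            j = i + span
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = max(
                inner + (1 if can_pair(seq[i], seq[j]) else 0),  # i pairs with j
                S[i + 1][j],                                     # i unpaired
                S[i][j - 1],                                     # j unpaired
            )
            for k in range(i + 1, j):                            # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


print(nussinov("GGGAAACCC"))
```

For GGGAAACCC the three G-C pairs of the stem give a count of 3, with the AAA loop left unpaired. The triple loop makes the fill O(n^3) in time and O(n^2) in space.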