Introduction to Bioinformatics
2002
Universidad Nacional San Cristobal de Huamanga, Ayacucho
Mirko Zimic

Topics of interest in bioinformatics
• Sequence analysis
• Phylogeny and molecular evolution
• Molecular modeling
• Protein folding
• Genomics and proteomics
• Statistical genetics
• Microarrays
• Scientific programming

Let's take an example…
Cysteine protease of Fasciola hepatica: in search of an immunogenic peptide

Alignment: mammalian cysteine proteases vs. the cysteine protease of Fasciola hepatica
Legend: identical AA, divergent AA
VPKSVDWREKGYVTPVKNQGQCGSCWAFSATGALEGQMFRKTGR ISLSEQNLVDCSRPQGN
AVPDKIDWRESGYVTEVKDQGNCGSCWAFSTTGTMEGQYM KNERTSISFSEQQLVDCSRPWGN
_____RED_________
QGCNGGLMDNAFQYIKENGGLDSEESYPYEATDTSCNY KPEYSVANDTGFVDIPQREKA LMK
NGCGGGLMENAYQYLKQF GLETESSYPYTAVGGQCRYNKQLG VAKVTGYYTV QSGSEVEL KN
_VIOLET____ _YELLOW_______
AVATVGPISVAIDAGHSFQFYKSGIYYEPDCSSKDLDHGVLVVGYGFEG TDSNNNKYW IVKNSW
LIGSEGPSAVAVDVESDFMMYRSGIYQSQTCSPLRVNHAVLAVGYGTQGGTD YW IVKNSW
_____ _GREEN_____
GPEWGM-GYVKMAKDRNNH CGIATAASYPTV
GLSWGERGYIRMV RNRGNMCGIASLASLPMVARFP

Discontinuous epitope: formed by distant portions of the sequence.
Denaturation: the epitope is lost on denaturation.

Continuous epitope: formed by a single portion of the sequence.
Denaturation: the epitope is preserved as such.

Three-dimensional modeling by homology
56% sequence identity with chymopapain (1YAL)

Surface analysis: atoms shown by van der Waals radius
Legend: identical AA, divergent AA

Selection of sequences that are (1) divergent, (2) solvent-accessible and (3) continuous
TMEGQYMKNERTSISFS
YYTVQSGSEVELKNLIGSE
QSQTCSPLRVN
RYNKQLGVAKV

Evaluation of the conformational stability of the peptides by energy minimization
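
The first and third selection criteria for the candidate peptides (divergent and continuous) can be read straight off a pairwise alignment. The following is only a minimal sketch of that idea, not the procedure used for the slides: the function names `percent_identity` and `divergent_segments` and the `min_len` parameter are hypothetical, and the two input strings are short fragments copied from the two alignment rows, assumed here to line up column by column without gaps. Solvent accessibility (criterion 2) needs the 3-D model and is not covered.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns that hold the same residue.

    Other definitions of identity exist (e.g. dividing by the shorter
    sequence length); this one is an assumption made for the sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def divergent_segments(aligned_ref: str, aligned_target: str, min_len: int = 5):
    """Find continuous runs where the target differs from the reference.

    Returns a list of (start, end, target_segment) tuples with 0-based,
    end-exclusive alignment columns. A gap in the target ends a run,
    because a peptide must be a continuous stretch of the target itself.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    segments = []
    run_start = None
    for i, (ref, tgt) in enumerate(zip(aligned_ref, aligned_target)):
        differs = tgt != "-" and ref != tgt
        if differs and run_start is None:
            run_start = i                      # a divergent run begins
        elif not differs and run_start is not None:
            if i - run_start >= min_len:       # keep only long enough runs
                segments.append((run_start, i, aligned_target[run_start:i]))
            run_start = None
    # Close a run that reaches the end of the alignment.
    if run_start is not None and len(aligned_target) - run_start >= min_len:
        segments.append((run_start, len(aligned_target),
                         aligned_target[run_start:]))
    return segments


# Fragments taken from the two alignment rows (assumed gap-free here).
reference = "GSCWAFSATGALEGQ"
target = "GSCWAFSTTGTMEGQ"

print(percent_identity(reference, target))
print(divergent_segments(reference, target, min_len=2))
```

In practice `min_len` would be set to the shortest peptide worth synthesizing, and the surviving segments would then be filtered by solvent accessibility on the homology model.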
[Figure: peptides TMEGQYMKNERTSISFS and YYTVQSGSEVELKNLIGSE; labels: H2O, "backbone"]

Let's take another example…
Sensitivity of the HIV-1 aspartyl protease to the most common inhibitors

[Figure: "cartoon" representation of the HIV-1 protease enzyme; labels: monomer, HIV protease]
[Figure: HIV-1 protease showing the secondary-structure elements, flaps and active site]
[Figure: HIV-1 protease indicating the consensus residues of inhibitor-enzyme binding; labels: indinavir, ritonavir]
[Figure: binding of indinavir to the HIV-1 protease]
[Figure: mutant HIV-1 protease modeled in complex with ritonavir]
Comparison between a ritonavir-sensitive enzyme and a ritonavir-resistant one

One more example…
Phylogenetic ordering and GC content in trypanosomatids

[Figure: Reported %GC variation for each codon position in Trypanosomatids (Alonso et al., 1992); x-axis: %GC of total DNA; y-axis: %GC at codon position; series: 1st, 2nd, 3rd; species: Crithidia, Leishmania, T. cruzi, T. brucei]
[Figure: Codon usage in Trypanosomatids, leucine; codons: CTG, CTC, CTT, TTG, CTA, TTA; species: T. brucei, T. cruzi, Leishmania, Crithidia]
[Figure: Codon usage in Trypanosomatids, serine; codons: TCG, AGC, TCC, TCT, TCA, AGT; species: T. brucei, T. cruzi, Leishmania, Crithidia]

Phylogeny of the Trypanosomatid lineage (Maslov & Simpson)
"Hole" formation by DNA replication
GC content variation in time
Restriction: AA family conservation and AA conservation

[Figure: %GC variation in the Trypanosomatid lineage (nuclear coding DNA); y-axis: %GC; series: P1, P2, P3, P3*; species: Crithidia, Leishmania, T. cruzi, T. brucei]
GC variation in the Trypanosomatidae lineage, nuclear DNA

I. Human Genome Project
The genome sequence is almost complete! – approximately 3.5 billion base pairs.

All the Genes
• Any human gene can now be found in the genome by similarity searching with over 90% certainty.
• However, the sequence still has many gaps
– one is unlikely to find a complete and uninterrupted genomic segment for any gene
– pseudogenes still cannot be identified with certainty
• This will improve as more sequence data accumulate

Raw Genome Data
The next step is to locate all of the genes and describe their functions.
This will probably take another 15-20 years!

…A few years ago…
– Celera maintained that there would be only 30,000 genes
– so why are there 60,000 human genes on Affymetrix GeneChips?
– Why does GenBank have 49,000 gene coding sequences and UniGene have 89,000 clusters of unique ESTs?
• Clearly we are in desperate need of a theoretical framework to go with all of this data

Implications for Biomedicine
• Physicians will use genetic information to diagnose and treat disease.
– Virtually all medical conditions (other than trauma) have a genetic component.
• Faster drug development research
– Individualized drugs
– Gene therapy
• All biologists will use gene sequence information in their daily work

II. Bioinformatics Challenges
The huge dataset
Lots of new sequences being added
- automated sequencers
- Human Genome Project
- EST sequencing
GenBank has over 10 billion bases and is doubling every year! (the problem of exponential growth...)
How can computers keep up?

New Types of Biological Data
• Microarrays - gene expression
• Multi-level maps: genetic, physical, sequence, annotation
• Networks of protein-protein interactions
• Cross-species relationships
– Homologous genes
– Chromosome organization

Similarity Searching the Databanks
What is similar to my sequence?
Searching gets harder as the databases get bigger - and quality degrades
Tools: BLAST and FASTA = time-saving heuristics (approximate)
Statistics + informed judgement of the biologist

Alignment
Alignment is the basis for finding similarity
Pairwise alignment = dynamic programming
Multiple alignment: protein families and functional domains
Multiple alignment is "impossible" for lots of sequences
Another heuristic - progressive pairwise alignment

Sample Multiple Alignment

Structure-Function Relationships
Can we predict the function of protein molecules from their sequence?
sequence > structure > function
Conserved functional domains = motifs
Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)

Protein domains

DNA Sequencing
Automated sequencers > 40 kb per day
500 bp reads must be assembled into complete genes
- errors, especially insertions and deletions
- error rate is highest at the ends, where we want to overlap the reads
- vector sequences must be removed from the ends
Faster sequencing relies on better software
overlapping deletions vs. shotgun approaches: TIGR

Finding Genes in Genome Sequence is Not Easy
• About 2% of human DNA encodes functional genes.
• Genes are interspersed among long stretches of non-coding DNA.
• Repeats, pseudogenes, and introns confound matters

Pattern Finding Tools
• It is possible to use DNA sequence patterns to predict genes:
– promoters
– translational start and stop codons (ORFs)
– intron splice sites
– codon bias
• Can also use similarity to known genes/ESTs

Phylogenetics
Evolution = mutation of DNA (and protein) sequences
Can we define evolutionary relationships between organisms by comparing DNA sequences?
- is there one molecular clock?
- phenetic vs. cladistic approaches
- lots of methods and software; what is the "correct" analysis?

II. The Role of the Biologist in the Information Age
The Internet provides abundant biological information
It can be overwhelming…
- e-mail
- Web
Need for new skills = locating the necessary information efficiently

Computing in the lab - everyday tasks (vs. computational biology)
- ordering supplies
- reference books
- lab notes
- literature searching

Training "computer" scientists
Know the right tool for the job
Get the job done with the tools available
Network connection is the lifeline of the scientist
Jobs change, computers change, projects change; scientists need to be adaptable

The job of the biologist is changing
• As more biological information becomes available …
– The biologist will spend more time using computers
– The biologist will spend more time on data analysis (and less doing lab biochemistry)
– Biology will become a more quantitative science (think how the periodic table and atomic theory affected chemistry)

III. Molecular Biology Software Tools
GCG (Wisconsin Package)
The most popular and most comprehensive set of tools for the molecular biologist.
- Runs on mainframe computers (UNIX)
- Web, X-Windows (SeqLab) interfaces
- Inexpensive for large numbers of users
- Requires local databases (on the mainframe computer)
- Allows for custom databases and programming

The Web
Many of the best tools are free over the Web
- BLAST
- ENTREZ/PUBMED
- Protein motif databases
Bioinformatics "service providers"
- DoubleTwist™, Celera, BioNavigator™
Hodgepodge collection of other tools
- PCR primer design
- Pairwise and multiple alignment

Personal Computer Programs
Macintosh and Windows applications
- Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™
- Freeware: Phylip, Fasta, Clustal, etc.
Better graphics, easier to use
Can't access very large databases or perform demanding calculations
Integration with web databases and computing services

Putting it all together
The current state of the art requires the biologist to jump around from Web to mainframe to personal computer
The trend is for integration
– Will Web + personal computer replace the text interface to the mainframe?
– Will the Web become the ultimate interface for all computing?

IV. Genomics
Genomics Technologies
• Automated DNA sequencing
• Automated annotation of sequences
• DNA microarrays
– gene expression (measure RNA levels)
– single nucleotide polymorphisms (SNPs)
• Protein chips (SELDI, etc.)
• Protein-protein interactions

cDNA spotted microarrays

Affymetrix Gene Chips

Impact on Bioinformatics
• Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
• It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

Pharmacogenomics
• The use of DNA sequence information to measure and predict the reaction of individuals to drugs.
• Personalized drugs
• Faster clinical trials
– Selected trial populations
• Fewer drug side effects
– toxicogenomics