What is Life?
• life for living beings is their own existence,
• life – the phenomenon relying on the existence of dynamic, self-organizing structures able to multiply and evolve,
• life – a computational process whose algorithm is encoded in DNA memory and performed by the protein engine.

Elementary instructions of protein subroutines
During protein construction, Nature uses 20 standard amino acids:
Ala, A – alanine          Arg, R – arginine
Asn, N – asparagine       Asp, D – aspartic acid
Cys, C – cysteine         Gln, Q – glutamine
Glu, E – glutamic acid    Gly, G – glycine
His, H – histidine        Ile, I – isoleucine
Leu, L – leucine          Lys, K – lysine
Met, M – methionine       Phe, F – phenylalanine
Pro, P – proline          Ser, S – serine
Thr, T – threonine        Trp, W – tryptophan
Tyr, Y – tyrosine         Val, V – valine

Program memory – DNA double helix
Both strands run in opposite directions to each other and are held together by complementary bases:
• A – T
• C – G
Each living cell possesses the whole genetic information about the organism.

DNA – information coding
Every protein has its gene – a fragment of the DNA strand in which the order of the symbols A, C, T, G describes the sequence of amino acids.
Genetic code – one amino acid is determined by three successive symbols on the DNA strand (i.e. a codon).
[Figure: Transcription → Translation → Folding]

DNA – gene regulatory network
[Figure: a gene with a regulatory region (IN: activation, IN: inhibition) and a coding region (OUT: transcription)]
nodes: genes/protein products
arcs: regulatory relations
A module of the gene expression regulatory network controlling one stage of sea urchin embryogenesis.

Replication of DNA
• occurs before every cell division,
• specialized enzymes unwind the double helix and build new strands, complementary to the already existing ones,
• strand copying is performed with error correction in the 3'→5' direction,
• two identical (almost …) double helices are made.
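The base-pairing rule (A – T, C – G), the opposite direction of the two strands, and the reading of a strand in three-letter codons can be illustrated with a short Python sketch. The function names and the sample strand are hypothetical and chosen only for illustration; the sketch models the symbols, not the enzymes that do the work in a cell.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own direction.

    The two strands run in opposite directions, so the partner strand
    is the complement read backwards.
    """
    return complement(strand)[::-1]


def codons(strand: str) -> list[str]:
    """Split a strand into successive three-letter codons.

    An incomplete triplet at the end is dropped.
    """
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    sample = "ATGGCCT"  # hypothetical strand, not taken from the slides
    print(complement(sample))          # TACCGGA
    print(reverse_complement(sample))  # AGGCCAT
    print(codons(sample))              # ['ATG', 'GCC']
```

Complementing a strand twice returns the original, which is the property replication relies on when each old strand serves as a template for a new one.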
Bioinformatic databases
Continually developed collections of information on genetic maps, DNA sequences, proteins, their interactions, spatial structures, functions in the organism, etc.
Entrez: www.ncbi.nlm.nih.gov/Entrez
Ensembl: www.ensembl.org
GenBank – nucleotide sequence database; its current size is 1.1·10^11 letters in 1.2·10^8 reported sequences, and it doubles every 18 months.
Data available to the public on the Internet.

Sequences comparison
We have learned the sequence of a gene:
    AGAGTCAATCCATAG
Question: what is its function?
Clue: check what is already known about the counterparts (homologues) of this gene in other evolutionarily related species.
How to find them? We need a program that searches other known genomes for fragments very similar to the given input (similar enough that evolution could have transformed one into the other).

Sequences comparison
We have two DNA substrings with a common origin. How did evolution transform one into the other?
    AGAGTCAATCCATAG
    CAGAGGTCCATCATG
Possible histories of the process may be shown by sequence alignments:
    -AGAG-TCAATCCATAG
    CAGAGGTCCATC-ATG-
or the other one:
    ------AGAGTCAATCCATAG
    CAGAGG----TCCATCATG--
Which one is more probable?

Sequences comparison
We introduce costs for editing operations, e.g.:
• character unchanged: 0
• replacement: 2
• indel (insertion or deletion): 3
We search for the "cheapest" alignment as the most plausible one.
    -AGAG-TCAATCCATAG
    CAGAGGTCCATC-ATG-
Cost: 4×3 + 2×2 = 16
    ------AGAGTCAATCCATAG
    CAGAGG----TCCATCATG--
Cost: 12×3 + 4×2 = 44

Sequences comparison
Problem: how to find the "cheapest" alignment of two sequences?
Example: sequences CTG and CCCG. Alignments correspond to all possible paths Start → Stop.
The problem is reduced to finding the cheapest path on the map.
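One standard way to find the cheapest Start → Stop path is dynamic programming over the grid of prefixes (global alignment in the style of Needleman–Wunsch). The sketch below uses the costs given above (unchanged 0, replacement 2, indel 3); the function names are hypothetical, and this is an illustration of the idea rather than the method used by any particular tool.

```python
def alignment_cost(a: str, b: str, replace_cost: int = 2, indel_cost: int = 3) -> int:
    """Return the cost of the cheapest alignment of sequences a and b."""
    # cost[i][j] holds the cheapest cost of aligning a[:i] with b[:j],
    # i.e. the cheapest path from Start to grid point (i, j).
    cost = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        cost[i][0] = i * indel_cost  # a[:i] aligned against gaps only
    for j in range(1, len(b) + 1):
        cost[0][j] = j * indel_cost  # b[:j] aligned against gaps only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = 0 if a[i - 1] == b[j - 1] else replace_cost
            cost[i][j] = min(
                cost[i - 1][j - 1] + step,    # diagonal: keep or replace
                cost[i - 1][j] + indel_cost,  # gap in b
                cost[i][j - 1] + indel_cost,  # gap in a
            )
    return cost[len(a)][len(b)]


def score_alignment(row1: str, row2: str, replace_cost: int = 2, indel_cost: int = 3) -> int:
    """Return the cost of an alignment already written out with '-' for gaps."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += indel_cost
        elif x != y:
            total += replace_cost
    return total


if __name__ == "__main__":
    print(alignment_cost("CTG", "CCCG"))     # 5
    print(score_alignment("-CTG", "CCCG"))   # 5
```

The table has (len(a)+1)·(len(b)+1) entries and each is filled in constant time, so the cheapest alignment is found without enumerating every path.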
[Figure: grid map with the letters of CTG along one axis and the letters of CCCG along the other; paths run from Start to Stop]
Solution:
    -CTG
    CCCG
Cost: 5

Sequences comparison
BLAST (Basic Local Alignment Search Tool) – homologous sequence finder.

Biological sequence analysis
The way to understanding …

Gene prediction
Gene finding in DNA – a sophisticated computational problem.
• some (quite varied) sequences mark the beginning and the end of the gene transcription area ... they are difficult to find,
• a eukaryotic gene is divided into alternating fragments, coding and noncoding (called introns), and the latter are cut out of the RNA transcript (splicing) before protein synthesis,
• in the genome there are many pseudogenes, i.e. "broken" copies of old genes that have lost the ability to be transcribed – a useless relic of evolution,
• only ~1.5% of the human genome encodes proteins and ~80% is not related to genes or their regulation.
Open Reading Frame (ORF) in DNA – a sequence of nucleotide triplets beginning with a Start codon and ending with a Stop codon, longer than would be expected by chance. A potentially coding sequence.
Similar issue: finding regulatory sequences and other functional motifs.

Genomic trashcan
• Transposons – "jumping genes" that encode enzymes able to cut them out and move them to another place on a chromosome. They often produce "large" mutations.
• Reverse transcriptase – an enzyme performing "transcription", but from RNA into new double-stranded DNA (which can then be integrated into the genome). Used by retroviruses, but some retrotransposons are also capable of self-replication in cell nuclei (a parasite gene "integrated" with the genome).

Phylogenetic analysis
What is the similarity between an archaeologist and a geneticist?
Both like to dig in garbage ☺
– the trash left by extinct ancestors is a valuable source of information,
– for example, a parasitic sequence (e.g. a transposon) duplicated in the genome has left many copies of itself in the DNA of descendant species – evidence of common origin.
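Returning to the Open Reading Frame definition under Gene prediction above, a naive ORF scan can be sketched as follows. It assumes the standard genetic code (start codon ATG; stop codons TAA, TAG, TGA), scans only the given strand in its three reading frames, and uses a hypothetical `min_codons` threshold as a stand-in for "longer than would be expected by chance". Real gene finders also have to deal with introns, the opposite strand and pseudogenes, which this sketch ignores.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of open reading frames in dna.

    An ORF starts at ATG, continues in whole triplets and ends at the
    first in-frame stop codon. `end` is exclusive and includes the stop
    codon. ORFs shorter than min_codons triplets (stop included) are skipped.
    """
    orfs = []
    for frame in range(3):  # the three reading frames of this strand
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i  # first start codon opens the frame
            elif codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    sample = "CCATGAAATTTTAGGG"  # hypothetical sequence for illustration
    for begin, end in find_orfs(sample):
        print(begin, end, sample[begin:end])  # 2 14 ATGAAATTTTAG
```

Raising `min_codons` filters out short frames that arise by chance; the right threshold depends on the genome and is not given in the slides.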
Phylogenetic analysis
How do we know that whales and dolphins are … even-toed ungulates? ☺
Examining the presence or absence of 20 different repetitive sequences (arrows) in the DNA of a group of present-day species, we see that this information fits only one phylogenetic tree:
[Figure: the phylogenetic tree]

Phylogenetic analysis
Retroviruses that have been inactive for millions of years, now integrated with the genomes of modern … monkeys, document human evolution:
[Figure]

Phylogenetic analysis
Phylogenetics based on the comparison of homologous gene sequences: among the many possible hypothetical phylogenetic trees, we search for the history under which the probability of the appearance of the compared genes is the greatest.
Number of phylogenetic trees for n species: 3·5·...·(2n–3)
n=5: 105
n=10: ~3.4·10^7
n=15: ~2.1·10^14
n=20: ~8.2·10^21
… hard computer analysis, methods of artificial intelligence …

Phylogenetic analysis
Tree of Life Project
www.tolweb.org/tree/

What is Life?
Professor Donald Knuth, author of "The Art of Computer Programming":
"I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on."
… impossible without relying on computer analysis.