BIOLOGY 3020 Fall 2008
Gene Hunting (DNA database searching)

DNA and p53 Transcription Factor

How many transcription factors (TFs) in corn?

Lecture Outline
1. What is a gene?
2. DNA Databases today – GenBank
3. How to find a new gene in GenBank
4. How to know that you have a full-length (complete) gene
5. Storing your work

1: What is a gene?
A gene is a unit of genetic information.
Genes are made of DNA (found in the cell nucleus).
One gene encodes one protein (polypeptide), which is made in the cell cytoplasm.
A messenger RNA (mRNA) mediates the expression of a gene (via the ribosome).
An organism is encoded by numerous genes (about 26,000 for humans).

Central Dogma of Molecular Genetics
DNA: all genes are present in every cell.
Only some genes are expressed in a given cell.
The mRNA population represents those genes expressed in a given cell (tissue-specific gene expression).

How a gene is expressed
[Figure: DNA (Exon1, Intron I, Exon2, Intron II, Exon3) is transcribed into mRNA; intron splicing and polyA tailing give the mature mRNA (Start, Open Reading Frame (ORF), Stop, polyA tail AAAAAAA); translation on the ribosome gives the protein.]

Where can you find a gene?
Book collections can be stored in a library.
Collections of genes can be made and stored in gene libraries!

There are 2 main kinds of gene libraries
Genomic libraries are made from DNA and contain entire genes (exons and introns).
cDNA libraries are made from mRNAs that are converted into DNA (only exons).

cDNA libraries are very useful
A library of genes expressed in a given tissue type is a cDNA library.
To study a tissue (e.g. liver or brain), use a cDNA library, which contains the genes used to make that tissue.
cDNA libraries are made from mRNA which is converted into DNA.
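The difference between a genomic clone (exons and introns) and a cDNA clone (exons only) can be illustrated with a small sketch. The exon and intron sequences below are invented for illustration, and the function names are hypothetical; real splicing and reverse transcription are of course more involved than string operations.

```python
# Toy gene: three exons (upper case) separated by two introns (lower case).
# All sequences are invented for illustration only.
EXONS = ["ATGGCT", "GAAACC", "TTCTAA"]
INTRONS = ["gtaagtttcag", "gtgagtcctag"]


def genomic_clone(exons, introns):
    """Interleave exons and introns, as found in a genomic library clone."""
    parts = []
    for i, exon in enumerate(exons):
        parts.append(exon)
        if i < len(introns):
            parts.append(introns[i])
    return "".join(parts)


def mature_mrna(exons, poly_a=7):
    """Join the exons (introns spliced out), write T as U, add a polyA tail."""
    return "".join(exons).replace("T", "U") + "A" * poly_a


def cdna_sense_strand(mrna):
    """Write the mRNA back in DNA letters: the sense strand of a cDNA clone."""
    return mrna.replace("U", "T")


genomic = genomic_clone(EXONS, INTRONS)
mrna = mature_mrna(EXONS)
cdna = cdna_sense_strand(mrna)

print(genomic)  # exons and introns together
print(mrna)     # exons only, with polyA tail
print(cdna)     # same information as the mRNA, stored as DNA
```

The cDNA is shorter than the genomic clone because the introns are gone, which is why a cDNA clone gives the coding information directly.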
One cDNA clone from a cDNA library contains the coding information for that gene (with introns removed).

cDNA is made from mRNA
[Figure: a mature mRNA (Start, Stop, polyA tail AAAAAAA) is given a polyT primer (TTTTTTT), nucleotides, and Reverse Transcriptase to form a DNA/RNA hybrid; the RNA is removed (by NaOH) and the second strand is synthesized to give complementary DNA (cDNA).]

A full-length cDNA is hard to find
[Figure: an mRNA with Start, Open Reading Frame (ORF), Stop and polyA tail (AAAAAAA), shown above progressively shorter copies.]
mRNA is degraded from the 5’ end.
Most cDNAs are not full length (flcDNA) and the ORF is incomplete (partial).

cDNA (EST) libraries have few flcDNAs
cDNA libraries are made and individual clones are sequenced at random.
A sequenced cDNA is called an Expressed Sequence Tag (EST).
Millions of ESTs from different tissues of different organisms are stored in GenBank, but only a few are flcDNAs!
How to find the longest ones? Where?

2. DNA Databases today – GenBank
GenBank is housed at NCBI: www.ncbi.nlm.nih.gov
The Entrez Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases in these databases continues to grow at an exponential rate. As of April 2006, there are over 130 billion bases in GenBank and RefSeq alone!
The main information access point is Entrez (click on All databases).
A virtual “jungle” of information...

3. How to find a new gene in this jungle?
Class project: clone novel transcription factor (TF) genes from corn.
A good starting point is the set of predicted TFs from rice (whose genome has been completed).
Visit the GRASSIUS website.

GRASSIUS Website
New NSF-supported database: www.grassius.org

GRASSIUS Outreach section

GRASSIUS Helpful Links (on the Links menu)

Maize
MAGI: Maize Assembled Genomic Island [MAGI]
MaizeGDB: MaizeGDB is the community database for information on Zea mays.
The Maize Full Length Project: This project uses genomics tools to understand a fundamental biological process through identification of genes expressed during maize reproduction and in somatic tissues responding to abiotic perturbations such as heat, cold, salt, UV-B, drought, and lack of light.

Rice
TIGR Rice Genome: The TIGR Rice Genome Annotation Database and Resource is a National Science Foundation project and provides sequence and annotation data for the rice genome.
RiceTFDB: RiceTFDB (2.1) is a public database arising from efforts to identify and catalogue all Oryza sativa genes involved in transcriptional control.

Comparative Genomic Resources
CGGC: Comparative Grass Genomics Center (CGGC)
AGRIS: The Arabidopsis Gene Regulatory Information Server (AGRIS) is an information resource for Arabidopsis promoter sequences, transcription factors and their target genes.

Grass Transcription Factor Database (GRASSTFDB)
GRASSTFDB provides a comprehensive collection of transcription factors from maize, sugarcane, sorghum and rice. Transcription factors, defined here specifically as proteins containing domains that suggest sequence-specific DNA-binding activities, are classified based on the presence of 50+ conserved domains. Links to resources that provide information on mutants available, map positions or putative functions for these transcription factors are provided.
The genes that you clone and study will be added to this database.

Use a known rice TF gene to find a related TF in the maize EST database
2516 TFs in 66 families.
Example: the G2-like TF family.
These TFs are known to be important in the growth of plants and are found in several other species, but have not yet been studied in corn.
Each TF gene has a unique locus number (like a bar code).
Clicking on a locus gives more information on that particular gene.
You want to retrieve the sequence (at the bottom of the page; see next slide).
Domain architecture is information on the protein product.
These links give you the actual ORF (coding sequence, CDS), the entire gene, or the protein sequence.

The actual ORF of a gene (coding sequence, CDS)
The first codon is always the start codon ATG and the last is one of the three stop codons TAA, TAG or TGA.
The length of the ORF is a multiple of 3.

The ORF is translated into the protein sequence
The start codon ATG always encodes the amino acid methionine (M).
The * indicates the stop codon (no amino acid in the protein).
Copy and paste this sequence into a new protein molecule in VectorNTI.

In the Protein Molecules Local Database, make a new subbase for your protein files
Click on “New Protein Molecule” and type in the locus name of the rice TF locus, e.g. Os01g08160.1
Click on the sequence and Features menu.
Click on “Edit sequence”.
Paste in the sequence and click “OK” twice.

Using the rice TF as a starting point, let the hunt for the corn TF begin!
Highlight the protein sequence and click on Tools...
Do a BLAST search (like a Google search) to search GenBank.
We will use the NCBI BLAST server.
There are 5 different BLAST programs to choose from.
BLAST stands for Basic Local Alignment Search Tool.
Select the tblastn program and the est others database, and then submit.
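The ORF rules stated above (ATG as the first codon, one of three stop codons as the last, length a multiple of 3) can be checked mechanically before pasting a sequence into VectorNTI. The following is a minimal sketch under stated assumptions: the sequence is a toy example invented for illustration (not the rice CDS), the function names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(dna):
    """Split a DNA string into consecutive triplets, dropping leftover bases."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def is_complete_orf(dna):
    """True if dna starts with ATG, ends with a stop codon, has no internal
    stop codon, and has a length that is a multiple of 3."""
    triplets = codons(dna)
    return (
        len(dna) % 3 == 0
        and len(triplets) >= 2
        and triplets[0] == "ATG"
        and triplets[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in triplets[:-1])
    )


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(c, "X") for c in codons(dna))


TOY_ORF = "ATGGCTGAAACCTTCTAA"  # invented example, not a real gene
print(is_complete_orf(TOY_ORF))  # True
print(translate(TOY_ORF))        # MAETF*
```

As on the slide, the translation starts with M (methionine) and ends with *, which marks the stop codon rather than an amino acid.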
When “Finished”, click on the file.

The BLAST report has graphic and list windowpanes (A, B, C, D, E)
Windowpane A has information on each “hit” against the database (here there are 500 hits).
The first hit is with a corn (Zea mays) mRNA (EST).
In windowpane C the arrows show how the query sequence (Q:1) lines up with the highlighted hit (H:1) (top blue line in windowpane D).
The actual alignment of the sequence Q1 with the 1st hit is shown in windowpane E.
Note that amino acid 91 of Q1 aligns with nucleotide 64 of H1 (= amino acid 21), so hit 1 is a partial cDNA (NOT full length). However...
Scrolling down, we find another blue line that does overlap with the beginning of the query.
Now amino acid 22 overlaps with bp 347 of the corn EST with the GenBank accession EE188556.
This one looks like it is a flcDNA.
Click on this and the GenBank file will open...
Now the new gene is in your sights!

GenBank file EE188556 seems to be a flcDNA
By highlighting the sequence and translating it in different frames, and then comparing with the BLAST result, it can be seen that the correct ORF is in frame 2.
Extending back from the shared region by about 45 amino acids, we find a Met (ATG start codon).

Record the GenBank number EE188556
In the comments, make sure that the clone is available from the Arizona Genomics Institute. The plate location is needed to request the clone.
Save the GenBank file into VectorNTI; you will use this in the second part of the course.
Export the file as an archive and email it to [email protected] with your group number and GenBank file in the subject line, e.g. Group5 EST-EE188556

In your lab report, include the following in the Results section for this lab:
1: The rice locus number that you started with
2: The protein sequence of the rice gene and a brief description of the TF family to which it belongs
3: The GenBank accession number of the maize EST that appears to be a flcDNA similar to the rice TF
4: The Arizona Genomics Institute plate number

Congratulations!
Now you have hunted down a new gene, and you will clone it in the 2nd and 3rd parts of the course.
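The frame check described for EE188556 (translate the EST in different frames, then look for a Met followed by an uninterrupted ORF) can be sketched as follows. This is an illustration under stated assumptions, not the VectorNTI procedure: the EST is a short invented sequence, the function names are hypothetical, only the three forward frames are examined, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codon by codon, ignoring leftover bases; '*' marks a stop."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def three_frame_translations(dna):
    """Translate the forward strand in frames 1, 2 and 3."""
    return {frame: translate(dna[frame - 1:]) for frame in (1, 2, 3)}


def longest_orf(dna):
    """Return (frame, protein) for the longest Met-to-stop stretch found."""
    best = (0, "")
    for frame, protein in three_frame_translations(dna).items():
        segments = protein.split("*")
        for segment in segments[:-1]:  # the last piece is not closed by a stop
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best


TOY_EST = "CATGGCTGAAACCTTCTAAGG"  # invented; its ORF sits in frame 2
for frame, protein in three_frame_translations(TOY_EST).items():
    print(frame, protein)
print(longest_orf(TOY_EST))  # (2, 'MAETF')
```

In the toy sequence, frame 1 runs into an early stop and frame 3 has no Met, so only frame 2 gives a complete Met-to-stop ORF, which mirrors the reasoning used to pick frame 2 for the corn EST.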