Download Impact of bioinformatics - CS Intranet

HKU CS Bioinformatics Research Siu Ming Yiu Department of Computer Science The University of Hong Kong Other faculty members: Prof. Francis Chin Prof. TW Lam Dr HF Ting 1 Impact of bioinformatics Medical research Biological research Huge volume of data e.g. finding a cancercausing gene? Environmental study e.g. how to remove harmful bacteria e.g. can we make rice grow faster? e.g. human genome: 3G long; Medical study: 100 persons e.g. human gut contains 1000+ bacteria (data: 500G)  obesity Biofuel e.g. how bacteria digest food to produce energy? 2 The de novo assembly problem (single genome) Given an unknown genome, Genome X NO existing technology is able to read out the DNA sequence (ACCG…..) of it as the sequence is too long (e.g. human = 3 billions long; even bacteria are about 10k – several millions). What we can do? High-throughput sequencing technology (next generation sequencing (NGS)): …………………. DNA sequencing machine [Inside the machine, the genomes are randomly cut into short fragments (reads), the machine can read out the DNA sequence of the reads.] Multiple copies of Genome X GTCG ACCG CTAG CAAG CTTG AACG CTCG GTCG GTTG GGAG 3 Bad news  Multiple copies of Genome X (1) The reads are really short: 100-150 bp (c.f. genome of a bacterium – 10K to several millions). (2) They are mixing together (no idea where from the genome each read is from!!). (3) There are errors in the read. [AACCGTTC => AACGGTCC] The (de novo) assembly problem: Can we reconstruct the original genome from the reads? 4 Data volume: HUGE!! Take human genome as an example. The genome is of 3x109 (3 billion) long. Recall: multiple copies are cut (fragmented). At any position of the genome, multiple copies of reads may be obtained. The average number of copies of reads from each position of a genome is referred as ………………. the depth of the sequencing. Note that they are mixed together, no ordering information For depth = 30, # of reads: (3x109x30)/100 ≈ 109 5 Good news  There are some clues inside the reads: The reads are overlapping! Unknown genome: AACCGGTTGCACGTTCCACTTGGCC……… CCGGTTGTC TGTCACGTT ACCGGTTGT AACCGGTTG GGTTGTCAC CGGTTGTCA TTGTCACGT GTTGTCACG Ideal case: every position has at least one read, no errors in the read, then…. [But the reality…. is a lot worse] 6 Unknown genome: AACCGGTTGCACGTTCCACTTGGCC……… CCGGTTGTC AACCGCTTG GGTTGTCAC TGTCACGTT CGGTTGTCA ACCGGTTTT TTGTCACGT GTTGTCACG The reality: (a) There are errors in the reads; not easy to locate the next read! (b) At some positions, we may have no reads. 7 Publications Bioinformatics (impact factor: 5.323) BMC genomics (impact factor: 4.4) PloS One (impact factor: 3.73) BMC bioinformatics (impact factor: 3.02) Journal of Computational Biology (impact factor: 1.56) IEEE/ACM TCBB (impact factor: 1.54) …… Top conferences: RECOMB, ISMB, ECCB Nature papers with our collaborators HKU-BGI research center: BGI (Shenzhen) is the largest genomic center in the world Other international collaborators: JGI, dept. of energy, US (biofuel); Sidekid hospital, Canada (diabetes); CAS-MPG PICB, Shanghai (C4 Rice project); UC San Francisco (Optical mapping data analysis); NUS, Singapore (RNA study); …. 8 How to solve the problem? A few general approaches String graph, de Bruijn graph, … Genome Idea: we still make use of the …. A C G T G T A C C T C……. overlapping parts in reads to connect them together. We do not need reads of every Read G T G T A C C T C (k = 4) position. -------------------------Graph: Vertex: k-mer (k consecutive GTGT TGTA GTAC TACC ACCT CCTC nucleotides in a read) Edge: two k-mers appear consecutively in a read 9 Genome: A A C G A C G T G T A C C T C A G T AACGACGTG ACGACGTGT CGACGTGTA GACGTGTAC ACGTGTACC CGTGTACCT GTGTACCTC TGTACCTCA GTACCTCAG TACCTCAGT Reads (len = 9) AACG ACGA CGAC Ideal case - No errors - Reads at every position - The graph can read out one single path, that will be the genome! GTGT ACGT GACG CAGT GTAC CGTG TGTA CCTC TCAG CTCA ACCT TACC 10 Genome: A A C G A C G T G T A C C T C A G T AACGACGTG ACGACGTGT CGACGTGTA GACGTGTAC ACGTGTACC CGTGTACCT GTGTACCTC TGTACCTCA GTACCTCAG TACCTCAGT Reads (len = 9) AACG ACGA CGAC Note: even a few reads are missing, we are still ok! Can anyone see that how many reads can be missed depends on the value of k (when constructing the graph!)? Q: to allow more missing reads, larger or smaller k is better? GTGT ACGT GACG CAGT GTAC CGTG TGTA CCTC TCAG CTCA ACCT TACC 11 Genome: A A C G A C G T G T A C G CTCAGT Reads (len = 9) AACGACGTG ACGACGTGT CGACGTGTA GACGTGTAC ACGTGTACC CGTGTACCT GTGTACCTC TGTACC GT C A GTACCTCAG TACCTCAGT ACGT CGTG CGTC Contigs: Maximal path without branches/paths contig CGAC GACG ACGT CGACGT Real case is more complicated: Even no error, in a genome, some patterns may repeat! In reality, we seldom can construct the whole genome in one piece, but stop at junctions, resulting with a set of contigs 12 A part of the de Bruijn graph for Ecoli (~4M long); you can imagine how complicated for human genome (3G long) 13 Conclusions  Our team:  Core Faculty members: Prof. Francis Chin, Prof. TW Lam, me  1 Research Assistant Professor (Henry Leung)  1 Postdoc (Jianyu Shi)  about 8 PhD/master students + a team in HKU-BGI Lab  Some collaborators:  Beijing Genome Institute at Shenzhen (BGI) - HKU-BGI Laboratory  HKU medical schools; life science departments  Sickkids hospital, Canada  JGI, DoE, US  CAS-MPG PICB, Shanghai (C4 Rice project)  UC San Francisco (Pui’s group)  GIS (Genome Institute at Singapore)  Universities: NUS, CUHK, U of Liverpool etc. <Thank you> 14

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Impact of bioinformatics - CS Intranet