Data Mining for BioInformatics at Ewha CSE
Dec. 14, 2001
Hwan-Seung Yong (Gene: ACTGAAAGGGCTCTCAAA)
Dept. of Computer Science & Engineering
Ewha Womans Univ.

BioInformatics and Computer Science
• Computer: binary system (0/1), designed by humans
• Living things: quaternary system (A/G/C/T), designed by nature
• Advances in computer technology
  – Data analysis + databases = data mining (at present)
  – High-performance parallel computing technology
  – Distributed processing and Web/XML technology
  – Emergence of knowledge management technology
  For BioInformatics
• Why humans built computers
  – To find the secrets of life encoded in base 4
  – To challenge the domain of God

BioInformatics and Computer Science
• BioInformatics
  – Development of DNA code readers (biotechnology) and alignment techniques
    • The complete gene sequence has only just been produced
  – Finding meaning (genes, etc.) in that sequence
  – A technique for recovering source code from a binary object
    • Requires experts in disassemblers and reverse engineering
  – Data mining is an important applicable technology.
Diagram: Computer System (Binary Code, Assembly Code, Source Code) compared with Living Things: Nature (DNA Sequence, Gene, Protein)

Why Ewha CSE is appropriate for BioInformatics
• Recent focus of CSE's research area
  – As a BK Project Plan: Knowledge Engineering Framework
  – Data Warehousing and OLAP
  – Data Mining
  – XML Technology
  – Knowledge Engineering Enabling Technology
  – Knowledge Engineering Application
    • Electronic Commerce
    • BioInformatics
• Related research institutes at Ewha
  – Graduate School of Molecular Life Sciences (BK)
  – Korea Science and Engineering Foundation SRC (Cell Signal Transduction Center)
  – Ministry of Information and Communication Computer Graphics / Virtual Reality Research Center
• Previous related research (direct)
  – Development of a gene search and automated analysis program for the Prosecutors' Office
  – Development of a genetic information management system for the National Institute of Scientific Investigation

Automated gene analysis program
Screenshots: genetic band recognition and code registration program; DNA Locus Registration Interface

Data Warehousing, OLAP and Data Mining
• Data Warehousing and OLAP
  – ETL Methodology (Extraction, Transformation and Loading)
  – Data Warehouse Architecture
  – OLAP Server Development
  – Multidimensional Data Processing
  – Metadata Handling
  – Data Quality Control
• Data Mining
  – Classification and Analysis of Data Mining Techniques
  – Clustering Algorithm
  – Association Algorithm
  – Classification Algorithm
  – CRM Application based on Web Log Mining
  – Text Mining for XML Data

XML and Supporting Technology
• XML Related Area
  – XML Server Development
    • Query Processing and Storage System
  – XML Document Mining
• Knowledge Enabling Technology
  – Multimedia High-Speed Network
  – Component-based Software Engineering
  – Security
  – Multimedia DBMS
  – Natural Language Processing
  – Computer Graphics and Virtual Reality

Research Requirement for BioInformatics
• Large volume of data, including multimedia data
• High Performance Computing System
  – Massively Parallel Processing Hardware and Software
• XML related work is important
  – For exchange of bio data
  – Gene Annotation
• Web based collaborative system
  – Requires web based interoperable applications and standards
  – Distributed processing technique
    • CORBA, SOAP, Microsoft .NET framework
• Data Mining
  – For Gene Prediction, Functional Genomics

Bio Data Mining Research
• XML Standard for Bio Data
• Graphical User Interface for XML Data
• Data Converter to XML
  – Convert existing bio data to an XML standard
  – Convert between XML standards
• Integration Methodology with Existing DB
  – SOAP (Simple Object Access Protocol)
  – WSDL (Web Services Description Language)

XML Standard for Bio Data
• Before
  – FASTA format, GenBank format, GFF (General Feature Format)
• XML Format
  – AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
    • Developed by DoubleTwist, Inc.
    • Released in June 2000
    • Open source licence in August 2001
    • AGAVE version 3.2 with Prophecy 3.0 in Sept. 2001
    • See http://www.agavexml.org
• Genome XML Viewer by Labbook
  – BSML

XML Standard for Bio Data
• BioXML Standard and GAME
  – An open-source/free software organization dedicated to providing a set of standard XML formats for the exchange of biological data
• GAME (Genomic Annotation Markup Language)
  – Created at BDGP (Berkeley Drosophila Genome Project)
  – Current version 1.1 released in March 2000
  – http://www.bioxml.org
  – Follows the WikiWeb scheme
    • Collaborative web site that can be edited by anyone
    • Community documentation system
    • Everyone can edit the shared web pages

Computer Theory and Security Laboratory
Diagram labels, in extraction order:
Whole genome sequence annotation
Known gene
Unknown gene
• Sequence similarity
• Neural networks
• Hidden Markov models
Unknown gene prediction
Microarray data analysis
Phylogenetic prediction
Phylogeny inference
Phylogenetic analysis
Comparative genomics
Data mining tools
Two samples comparison
Phylogenetic Tree Visualization
• Tree drawing algorithms
• Graph drawing algorithms
Clustering classification tools
Multiple samples comparison
New algorithm design
• Simulated annealing
• Other optimization techniques

Open Source Project
• Open BioInformatics Foundation
  – http://www.open-bio.org
  – Umbrella group for various bio*.org groups
    • bioxml.org, bioperl.org, biopython.org, biojava.org, biocorba.org
    • biopathways.org
    • bio-ensembl.org (annotation for the human genome)
  – The First Bioinformatics Open Source Conference (BOSC'2001) was held in August 2001 in San Diego.
  – Many Open System Activities

Vision and Future Prediction
• Ewha will
  – Contribute to the bio data mining area
  – Have a bioinformatics institute or research center
  – Have a strong bio-industry relationship
• Closing Comment
  ATGCCGTCGGGCCCCGGGGC => "Thank You" expressed in base 4
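
The closing comment's idea of writing text in base 4 with the letters A/G/C/T can be illustrated with a minimal Python sketch. The slides do not state which digit each base stands for or how characters are split into digits, so the mapping below (A=0, C=1, G=2, T=3, four bases per UTF-8 byte) and the function names `text_to_dna` and `dna_to_text` are assumptions made only for illustration. Under this scheme the output is not expected to match the 20-letter string on the slide.

```python
# Hypothetical mapping: the slides do not say which digit each base stands for.
BASES = "ACGT"  # assumption: A=0, C=1, G=2, T=3


def text_to_dna(text: str) -> str:
    """Encode text as DNA letters, four bases (2 bits each) per UTF-8 byte."""
    out = []
    for byte in text.encode("utf-8"):
        # Emit the byte's four base-4 digits, most significant first.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna: str) -> str:
    """Decode a string produced by text_to_dna back into text."""
    if len(dna) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            # BASES.index raises ValueError for letters outside A/C/G/T.
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


encoded = text_to_dna("Thank You")
print(encoded)
print(dna_to_text(encoded))
```

Each byte holds 8 bits and each base carries 2 bits, so every character of ASCII text becomes four bases, and the nine characters of "Thank You" become 36 bases in this sketch.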