Download Topics in Informatics - University of Saint Joseph

Topics in
Informatics
Spring 2005, SJC
About the Instructor
• Instructor: Dr. Hong Zhou
• Office: McDonough 317
• Office Hours: MWF 10:00 – 11:00am
• Email: [email protected], Phone: 231-5826
• Syllabus
• You can call me either:
– Hong
– Dr. Hong
– Dr. Zhou
What is Informatics?
• Searching for ‘What is informatics’ at
http://www.google.com returns
different definitions.
• Basically, the study and application
of the knowledge and skills of
data/information flow and
manipulation (including storage,
retrieval, analysis, and
construction/deriving, etc).
Informatics
• Data obtaining
• Data flow and control
• Data representation (records) and
storage
• Data retrieval/mining
• Data analysis
• Data derivation (generating new data
from existing data via analysis)
What Will You Learn
• Obtaining reliable data.
• Data Management (Data Storage and
Representation, Retrieval) → database.
• Introduction to Bioinformatics.
• Introduction to Health Informatics.
Part I Obtaining Reliable
Data
• Complex and precise communication
is something that distinguishes humans
from non-humans.
• The development of the world is, in a sense,
the development of our
understanding, i.e. our information about the
universe, including our social
systems.
• Information and its uses are the
center of such development.
Information vs Data
• What is in your mind when we talk
about “INFORMATION”?
• Is information touchable, visible?
• To my understanding, data is the
description of information, and
information is the interpretation of
data.
• So, let’s deal with the description →
data, in this class.
What is Data?
• When we talk about data, the first
image in our mind might be numbers
such as 5, 87, 98.34, etc.
• However, are the numbers 5, 87,
98.34 meaningful/informative?
Data with Context
• Pure numbers are meaningless for
us.
• Numbers with context are
meaningful, however.
• For example, 5 pounds of sugar.
• So, in this class, we are talking about
meaningful data and ignore all
meaningless data. (Are we
meaningful persons?)
Quick Questions
Are the following ‘data’ meaningful?
• 20
• 20 years
• A 20-year-old girl
• A 20-year-old girl named Amie
• A 20-year-old girl named Amie who
is an SJC student.
Data Target
• Data is used to describe a subject.
• For example, age, height, weight,
gender, and profession are descriptions of
a person.
• A medical record is a description of a
patient.
Quick Question
What are the targets of the following two rows of data?
Ford | V6 | White | 4 door Sedan
Amie | Female | 9/1986 | CSC | SJC
What is RELIABLE?
• When we talk about reliable data,
what does that mean?
• Let’s discuss this issue at two levels:
– Individual level
– Group/population level (statistics)
Individual Level
• Reliable data means that the data is
‘closely’ related to the individual (or
event) and ‘precisely’ describes the
individual (or event).
• For example: a computer with a 3.2 GHz CPU, 512 MB
RAM, 512 KB cache, etc.
Group/Population Level
• ‘Reliable’ is more meaningful at the
group level.
• Can a specific medical diagnosis of one
patient be representative of all
patients with the same symptom?
• Probably not.
Statistical Thinking
• One powerful approach to analyzing data is
statistics.
• We measure the reliability (significance) of
data in the statistical sense.
• Statistical thinking means using data to build
our understanding, gain insights, and
draw conclusions or make inferences.
• It does not draw conclusions from a single incident.
Principles in Statistical
Thinking
• Count on data instead of an incident
• Where the data is from matters.
• Lurking variables
• Variation is everywhere
• Conclusions are not absolutely
certain
Count on a large amount of data
instead of a few incidents
• Famous fortune teller
• The thumb of a monk
Where data is from
• Group data can be collected from
surveys or observations, or obtained
from experiments.
• When collecting data, where the data
comes from is important. For
example, a question was once asked: “If
you had it to do over again, would
you have children?” 70% of the
written responses were NO. Is this
piece of information reliable?
Lurking Variables
• Does music practice improve test
scores?
• What is behind it?
The Importance of RANDOM
• The key factor in data collection is
the RANDOM concept, i.e. the data
has to be randomly collected with no
bias.
• Suppose that you are surveying
10000 people in the USA to predict the
2004 election. How are you
going to pick the 10000 people?
Only in schools? Only in New York?
Only women? Avoid as much bias
as you can.
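The idea of picking respondents at random can be sketched in Python. This is only a minimal illustration: the list of voter IDs is a made-up stand-in for the whole population, and a real survey would also need a complete and current list of that population to draw from.

```python
import random

def draw_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct members chosen uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: one ID per eligible voter (an assumption for illustration).
population = list(range(1, 1_000_001))
sample = draw_random_sample(population, 10_000, seed=2004)
print(len(sample), len(set(sample)))
```

Every ID has the same chance of being drawn, so no school, city, or gender is favored by the selection step itself.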
Experiments
• Some reliable data can only be produced
by experiments, especially in science.
• For example, in biology, to pin down the
function of a gene, you have to knock out
the gene or suppress it and check the
phenotype changes. After that, you have
to restore the gene and verify whether the
phenotype also recovers. Such
experiments are very convincing, but
expensive.
Another Experiment
• It was once believed that women who take
hormones after menopause reduce their
risk of heart attack. The belief
resulted from studies that simply
compared women who were taking
hormones with others who were not. Are
such study results reliable?
• Such experiments lack proper Controls,
which are essential in all experiments.
• How are you going to design an
experiment for this study?
Reliable Data cont’d
• It is not a simple task to obtain
reliable data. It requires extensive
consideration and design.
• Some experimental results may look
convincing at some time, but may
lose their reliability over time or
when the environment changes. For
example, the third brake light on cars.
Discussion
• Is absence of evidence the evidence
of absence?
Project 1
• Write a paragraph to discuss the claim “Absence
of evidence is evidence of absence”. Please make
your own judgment as the grading is based on
your argument.
• Design a simple survey to collect opinions about
abolishing the death penalty. Be aware of the
importance of “RANDOM”. Write a short
paragraph to argue that the data collected by
your survey is reliable.
• Points: 100.
• Due Date: Feb 1st, 2005.
• Submit your work in the digital drop box in
Blackboard.
Part II Data Storage
• Can all information be recorded as
data? Let’s start the discussion.
– Feeling
– Knowledge
– Intelligence
Personal Ideas
• My understanding: Yes, just some of
them are too complicated or too
difficult to manifest precisely.
• And that is why we have IQ tests,
MQ (motivational quotient), EQ, etc.
Where to store
• Data is stored somewhere.
– Minds
– Books (paper documents)
– Computers
– Etc
• Let’s compare the three storage
methods. Which one do you think is more
lasting or appropriate?
Passing Words
• In ancient times, knowledge was passed on by word of mouth,
generation by generation.
• Here is a story about passing things on by word of mouth:
– The general called the captain and said: “Tonight at 7:00pm,
Halley’s comet will pass your camp in the sky.
Organize your soldiers to watch.”
– The captain informed his lieutenant: “Tonight at 7:00pm,
Halley’s comet will pass our camp in the sky and the
general is coming to watch with our soldiers.”
– The lieutenant informed the sergeant: “Tonight at
7:00pm, the general will accompany Halley’s comet
passing over our camp. Organize the soldiers.”
– The sergeant told the soldiers: “Tonight at 7:00pm, General
Halley will pass over our camp in the sky and we are going
to watch that.”
Data Storage
• Paper storage:
– Size and cost
– Transportation
• Computer:
– Signature → legal effect
– Hacking
– What if computers are
down?
• However, if data is not
organized, it is difficult to
make use of. So, data
storage strategy is
important.
• In this class, we talk about
data storage by using
computer technology.
Ways to store
• Data storage is a big, and probably
the largest issue related to computer
data manipulation.
• Different database structures,
different database managements,
online storage, etc.
Chapter 1. File structure
• Hierarchical
structure
• Easy to deal with the
hierarchical
relationships.
• For example, the
administration is a
hierarchical structure.
• Let me use the
DOD/NIMA VPF
structure as an
example
[Diagram: root → Folder 1, Folder 2 → subfolders → files]
VPF Structure
• DOD (Department of Defense) and
NIMA (National Imagery and Mapping
Agency) sponsored the development
of VPF (Vector Product
Format) → Nickname: very poor
format
• It is used to store the earth ground
information and provide a digital
map.
VPF structure
[Diagram: Database → Libraries → Coverages → Files]
Navigation in Hierarchical
Structure
What is the purpose
of Index?
Project 2
• Create a hierarchical file structure to
store some of your work at SJC.
• This is the way I prefer: organize
your works based on the classes you
take.
• If you have other ways, that is ok as
long as they are organized well.
• Show me in class what you have
done.
• Points: 100
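One way to build such a structure is sketched below in Python. The course names and subfolder names are hypothetical placeholders, and the tree is created in a temporary directory so the sketch does not touch any real folders.

```python
import tempfile
from pathlib import Path

def build_course_tree(root, courses):
    """Create root/<course>/<category> folders and return the created paths."""
    created = []
    for course, categories in courses.items():
        for category in categories:
            folder = Path(root) / course / category
            folder.mkdir(parents=True, exist_ok=True)  # also creates missing parents
            created.append(folder)
    return created

# Hypothetical course names and categories, used only for illustration.
courses = {
    "Informatics": ["homework", "projects"],
    "English100": ["essays"],
}
root = Path(tempfile.mkdtemp()) / "SJC"
folders = build_course_tree(root, courses)
for folder in sorted(folders):
    print(folder.relative_to(root))
```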
Chapter 2 XML
• Extensible Markup Language
• Purpose:
– Data transportation
– Data representation
– Data storage
• Why should we talk about it here?
Because the data inside an XML file is
hierarchical.
What Does XML Promise?
• Data portability
• The programming language Java promises the
portability of programs.
• However, programs work on data.
Before XML, data was not portable, and
communication among systems and agencies
was extremely difficult.
• XML allows systems to communicate using
a standard means of data representation.
HTML?
• HTML is the portable language for
browsers.
• It is a standard.
• However, it governs how information
is displayed in a browser with
defined formats and defined tags.
The Difficulties XML faces
• XML has some defined formats
• But doesn’t have defined tags.
• User defined tags
• Unlimited types of data.
Solution (Partially)
• Make the information self-explained.
• You have to invent your own tags!
A Simple Example
<person>
  <lastname>Fonship</lastname>
  <firstname>Michele</firstname>
  <gender>female</gender>
  <education type="elementary">
    <start-date>9/1980</start-date>
    <stop-date>5/1985</stop-date>
    <school>Badley school</school>
  </education>
</person>
Tips about XML format
• A tag is case sensitive
• A starting tag must have a closing
tag to match
• All XML elements must be properly
nested.
• All XML documents must have a root
element.
• Attribute values must always be
quoted.
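The rules above can be checked mechanically with an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the small test documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = '<person><lastname>Fonship</lastname></person>'
bad_case = '<person><lastname>Fonship</Lastname></person>'  # tag case differs
bad_nesting = '<person><a><b></a></b></person>'              # improperly nested
bad_quote = '<education type=elementary></education>'        # unquoted attribute value
two_roots = '<a></a><b></b>'                                  # no single root element

for text in (good, bad_case, bad_nesting, bad_quote, two_roots):
    print(is_well_formed(text), text)
```

Only the first document passes. Each of the others breaks exactly one of the tips listed above.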
Comments in XML
• The syntax for writing comments in
XML is similar to that of HTML.
• <!-- This is a comment -->
• A sample XML file.
XML Element Naming
• Names can contain letters, numbers,
and other characters
• Names must not start with a number
or punctuation character
• Names must not start with the
letters xml (or XML or Xml ..)
• Names cannot contain spaces.
Is it valid or not?
<students>
<one student>
<first name>Rose</first name>
<last name>Washington</last name>
</one student>
</students>
Element Content
• An XML element is everything from
(including) the element's start tag to
(including) the element's end tag.
• An element can have element
content, mixed content, simple
content, or empty content. An
element can also have attributes.
Is this valid?
<food>
<vegetable></vegetable>
<fruit>apple</fruit>
</food>
Child Elements vs. Attributes
<person sex="female">
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>
<person>
<sex>female</sex>
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>
Disadvantages of Attributes
• attributes cannot contain multiple values (child
elements can)
• attributes are not easily expandable (for future
changes)
• attributes cannot describe structures (child
elements can)
• attributes are more difficult to manipulate by
program code
• attribute values are not easy to test against a
Document Type Definition (DTD) - which is used
to define the legal elements of an XML document
Using Child Elements?
• So, it is a good idea to use Child
Elements rather than Attributes.
• Check this out. Tell which way you
prefer.
• Can this file work? What is wrong?
A case for Attribute
• What is metadata? Data about data.
For example, your SJC student ID is
metadata about you, since it identifies
you but does not describe you.
<publisher id="p1">
<name>O’Reilly</name>
<address>somewhere</address>
</publisher>
Is this valid?
<teacher>
<course>Eng100</course>
<course id=5>Math100</course>
<office Hour>2:00-3:00pm<office Hour>
<office>McDonough Hall 211 </Office>
</teacher>
What are the errors?
More about XML
• Now we have the so-called “XML database”, whose
basic element is the XML document. It is not very
successful yet.
• Remember that XML does not really do anything
except describe data.
• We have to interpret whatever it is describing. In
software terms, the user has to
develop software to interpret it.
• What are DTD and XML schema?
• What are the disadvantages of XML? Please
discuss.
Analyze the XML file
• Example XML file
• Let’s discuss the weakness of this
file.
• What do you suggest?
• What do you think of my solution?
In class exercise
• Given the data shown in Access
database, can we store the same
data in XML format? Please try it in
class. Thanks.
Useful Sites about XML
• http://www.w3schools.com/xml/
• http://www.xml.org
XML in Use
• BBC news topics are also available
online via XML. Example.
• XML at work.
• XML in commerce?
• What is GML and SGML?
Project 3
• Here are the requirements, which are
also available in Blackboard.
• Discussion: will XML really be the
standard of data transportation or
data storage?
Part 3 Database
• Instead of listing it as Chapter 3, it is
listed as Part 3, which shows that
this is a big issue.
Chapter 1 Database History
• Hierarchical database
• Network database
• Relational database
• Object-oriented database
• Object-oriented relational database
• XML database
• etc
Relational Database
• The major database in use.
• Based on the relations between data
items.
• Key element: tables.
• Available relational databases: Oracle,
DB2, Sybase, MS SQL Server, Access,
MySQL, etc.
• A site about evaluation.
• The instructor’s database work.
Records and Attributes
• A table has multiple records, each of
which has multiple values.
• For example.
• The attributes define the data types.
All data in that column must conform
to the given data types.
Primary Key
• The primary key of a relational table uniquely
identifies each record in the table. It can either
be a normal attribute that is guaranteed to be
unique (such as Social Security Number in a table
with no more than one record per person) or it
can be generated by the DBMS (such as a
globally unique identifier, or GUID, in Microsoft
SQL Server). Primary keys may consist of a
single attribute or multiple attributes in
combination.
• For example, in the table example, the primary
key is “Student#”.
• Every table must have a primary key defined.
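The uniqueness guarantee of a primary key can be seen in a small sketch. It uses SQLite through Python's sqlite3 module rather than Access, and the table and column names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    " student_id INTEGER PRIMARY KEY,"
    " lastname TEXT, firstname TEXT)"
)
conn.execute("INSERT INTO students VALUES (1, 'Smith', 'Jack')")
# Same name but a different key: allowed, names are not unique.
conn.execute("INSERT INTO students VALUES (2, 'Smith', 'Jack')")

try:
    # Re-using key 1 violates the primary key constraint.
    conn.execute("INSERT INTO students VALUES (1, 'Marsa', 'Rose')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(duplicate_rejected, row_count)
```

Two students may share a name, but the database refuses a second row with the same key, which is why a name makes a poor primary key.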
Primary Key (2)
• Guess what would be the Primary
Key in the SJC database for
students?
• Will it be ok to use your name (last
name and first name) as the primary
key?
Create a table for …
• Smith, Jack, male, 8/15/1989, 421865241, Forrest,
Shoplifting, Linda Luke, (860)321-9086, 105.
• Marsa, Rose, female, 7/1/1988, 3245691877, Jones, Dog
fighting, Nancy Charles, (860)321-9088, 106.
• Lese, Sam, male, 3/21/1986, 425423785, Hartford, Dwell
breaking, Linda Luke, (860)321-9086, 105.
• Haly, Rachel, female, 3/25/1989, 423671841, Hartford,
misconduct, Linda Luke, (860)321-9086, 105.
• Horse, James, male, 11/2/1987, 765213456, Lama,
misconduct, Nancy Charles, (860)321-9088, 106.
• Lincoln, George, male, 10/5/1988, 324342342, Jones,
fighting, Linda Luke, (860)321-9086, 105.
• Doom, Jade, female, 9/9/1988, 423213495, Hartford,
misconduct, Nancy Charles, (860)321-9088, 106.
Tables
• Any complete dataset will almost certainly
involve multiple database
tables.
• When dealing with complicated
datasets, the first thing is to categorize
the data into groups, with each group
represented by a table.
• The second thing is to find and build
the relationships between the tables.
Analyze the data
• How many categories do we have?
• Let’s use UML to clarify the data
relationships!
• UML is the Unified Modeling Language,
which arose in the 1990s. It was derived
from the work of three of the greatest minds of
system modeling.
• It is the standard language used to
analyze system design.
Practice
• The UML diagram
• What tables would you construct for
the data in the XML file?
• Do this exercise in class.
Relationship
• Now let’s talk about the relationship
types
One-to-One
One-to-many vs many-to-one
Many-to-many
One-One
• SSN – Person
SSN (1) —— (1) Person
One-Many
• Bank accounts ↔ person (one
person can have multiple accounts,
but one account belongs to one
person/family).
Bank account (*) —— (1) person
Many-Many
• Course-Student. A student may take
multiple courses, and a course may
be taken by multiple students.
Foreign Key
• A foreign key is a relationship or
link between two tables which
ensures that the data stored in a
database is consistent.
• The foreign key link is set up by
matching columns in one table (the
child) to the primary key columns in
another table (the parent).
• Referential Integrity
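Referential integrity can be demonstrated with a short sketch. It uses SQLite through Python's sqlite3 module instead of Access, and the instructors and courses tables are hypothetical. Note that SQLite only enforces foreign keys after the PRAGMA shown below is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE instructors (id INTEGER PRIMARY KEY, lastname TEXT)")  # parent
conn.execute(
    "CREATE TABLE courses ("  # child: each course points at one instructor
    " course_name TEXT PRIMARY KEY,"
    " instructor_id INTEGER NOT NULL REFERENCES instructors(id))"
)
conn.execute("INSERT INTO instructors VALUES (1, 'Zhou')")
conn.execute("INSERT INTO courses VALUES ('Informatics', 1)")  # parent row exists

try:
    conn.execute("INSERT INTO courses VALUES ('Ghost course', 99)")  # no instructor 99
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

The child row that points at a non-existent parent is refused, which keeps the two tables consistent.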
Foreign Key Example 1
Table Students (parent): studentID (PK), First name, Last name, Major, SSN, Gender, DOB
Table Basketball players (child): playerID (PK), First name, Last name, Position, number
Relationship: One-to-one
Foreign Key Example 2
• Given a table about instructors
whose columns are ID, first name
and last name.
• Suppose the basic information of an
offered course is the instructor and
the course name.
Cont’d
Course (*): Course name, Instructor
Instructor (1): ID, First name, Last name
Relationship: One-Many
Cont’d
• Look at this example.
Exercise in Depth
• UML diagram of the exercise.
• Now, how to define the tables that
can properly represent the UML
diagram?
Common Rules
• One object (entity) one table
• One attribute one column
• Additional PK – optional in some
cases.
How to define Relations
between Tables?
• First of all, we have to know that
Parents come before children. Tables
that can be built without referencing
other tables/data could be used as
parent table.
• For example, student table vs
basket ball player table.
Relations cont’d
• In case of One-One relation, the
parent table is the table that can be
built without referencing any data in
the child table. The child table must
be the table that references data in
the parent table.
Example
Students: studentID, First name, Last name, Major, SSN, Gender, DOB
Basketball players: playerID, First name, Last name, Position, number
Relation cont’d
• In case of One-Many, the One must
be the parent table, and the Many
must be the child table
Example
Book (child): BookID, Title, Author, Publish year, Publisher ID
Publisher (parent): PublisherID, Name, Address
Relationship: Many-One
Many-Many
• It is pretty hard to express the
Many-Many relations between two
tables.
• For example, the students ↔ Courses
relationship.
• How are we going to do it?
Solution
• Make use of another table! In this
case, we have three tables. One for
students only, one for courses only,
and one to link students with
courses.
Solution Example
Students: studentID, lastname, firstname, DOB, gender
Courses: Course#, Title, Location
Link table: StudentID, Course#
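The three-table solution can be sketched as follows. SQLite (via Python's sqlite3) stands in for Access here, and the sample rows and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, lastname TEXT);
CREATE TABLE courses  (course_num TEXT PRIMARY KEY, title TEXT);
-- Link table: one row per (student, course) pair.
CREATE TABLE studentscourses (
    student_id INTEGER REFERENCES students(student_id),
    course_num TEXT    REFERENCES courses(course_num),
    PRIMARY KEY (student_id, course_num)
);
INSERT INTO students VALUES (1, 'Smith'), (2, 'Marsa');
INSERT INTO courses  VALUES ('Comp200', 'Informatics'), ('Math100', 'Algebra');
INSERT INTO studentscourses VALUES (1, 'Comp200'), (1, 'Math100'), (2, 'Comp200');
""")

# One student, many courses.
courses_of_smith = [r[0] for r in conn.execute(
    "SELECT course_num FROM studentscourses WHERE student_id = 1 ORDER BY course_num")]
# One course, many students.
students_in_comp200 = [r[0] for r in conn.execute(
    "SELECT student_id FROM studentscourses WHERE course_num = 'Comp200' ORDER BY student_id")]
print(courses_of_smith, students_in_comp200)
```

The link table turns one many-to-many relationship into two one-to-many relationships, each of which a foreign key can express.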
The full table construction
• Let’s work on this data to build the
whole tables!
• Now, let’s do this project 4!
Sword, a real application
• Public information about Sword.
• A success story of data
representation, storage and
management in Mississippi.
• Please form 2 or 3 groups for the
coming projects since they are kind
of complicated. Inform me of the
group members in the next class.
Thanks.
Discussion of Sword in Class
• The sword data scenario
Chapter 2 Access Basics I
• Please form 2 or 3 groups for the
coming projects since they are kind
of complicated. Inform me of the
group members in the next class.
Thanks.
• Every student is supposed to collect
at least 2 restaurant menus of the
Hartford area. Keep them for later
use.
Basics (1)
• Open and save an Access database.
• Create a table in Design View.
• To create good tables, we need to
understand our data first. Let’s have
a look at the existing data on the next
slide.
Try
• Create a table to hold the information below?
• Smith, Jack, male, 8/15/1989, 421865241, Forrest,
Shoplifting, Linda Luke, (860)321-9086, 105.
• Marsa, Rose, female, 7/1/1988, 3245691877, Jones, Dog
fighting, Nancy Charles, (860)321-9088, 106.
• Lese, Sam, male, 3/21/1986, 425423785, Hartford, Dwell
breaking, Linda Luke, (860)321-9086, 105.
• Haly, Rachel, female, 3/25/1989, 423671841, Hartford,
misconduct, Linda Luke, (860)321-9086, 105.
• Horse, James, male, 11/2/1987, 765213456, Lama,
misconduct, Nancy Charles, (860)321-9088, 106.
• Lincoln, George, male, 10/5/1988, 324342342, Jones,
fighting, Linda Luke, (860)321-9086, 105.
• Doom, Jade, female, 9/9/1988, 423213495, Hartford,
misconduct, Nancy Charles, (860)321-9088, 106.
Try cont’d
• First, primary key!
• Continue the building of one table for
all the data.
• After done, save the work and give
the table a sensible name.
Create Table: wizard
• Let’s explore the table creation
function of Access: we can create
table by Wizard, i.e. with templates.
Create Multiple Tables
• Based on the UML diagram of the
data, let’s create multiple tables.
Student: Name, DOB, Gender, Major
Course: Course #, Location
Instructor: Name, Office, Gender
Student (N) —— (M) Course
Course (N) —— (1) Instructor
Normalization
• Normalization in database means to
remove the redundant data to
improve data storage efficiency, data
integrity and scalability.
• It is essential
• Good online explanation
3 Level Normalization
• The first level of normalization removes
redundant data horizontally, i.e. no
repeated columns.
• The second level of normalization removes
redundant data vertically, i.e. no repeated
data in rows.
• The third level of normalization moves
data that does not depend on the primary
key into another table.
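The effect of removing repeated row data can be sketched in plain Python. The rows below are shortened from the sample records used earlier in this part. Treating the last three values as a contact person's name, phone, and room is an assumption made only for this illustration.

```python
# Flat rows: (lastname, firstname, incident, contact_name, contact_phone, contact_room).
# The column meanings are assumptions; the slides do not name the columns.
flat_rows = [
    ("Smith", "Jack", "Shoplifting", "Linda Luke", "(860)321-9086", 105),
    ("Marsa", "Rose", "Dog fighting", "Nancy Charles", "(860)321-9088", 106),
    ("Lese", "Sam", "Dwell breaking", "Linda Luke", "(860)321-9086", 105),
    ("Haly", "Rachel", "misconduct", "Linda Luke", "(860)321-9086", 105),
]

def split_contacts(rows):
    """Move the repeated contact details into their own table, keyed by a generated ID."""
    contact_ids = {}  # (name, phone, room) -> contact_id
    people = []       # (lastname, firstname, incident, contact_id)
    for lastname, firstname, incident, name, phone, room in rows:
        key = (name, phone, room)
        if key not in contact_ids:
            contact_ids[key] = len(contact_ids) + 1
        people.append((lastname, firstname, incident, contact_ids[key]))
    contacts = [(cid, *key) for key, cid in contact_ids.items()]
    return people, contacts

people, contacts = split_contacts(flat_rows)
print(contacts)
print(people)
```

After the split, each contact's phone and room are stored once, so a changed phone number needs to be updated in only one row.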
Normalization
• In total, there are 5 levels of
normalization.
• It is absolutely necessary to apply
the 1st and 2nd levels of
normalization.
• The 3rd level is applied sometimes.
• Don’t bother with the 4th or 5th levels
of normalization
Exercise
• What is the normalization level of the
database constructed?
Student: Name, DOB, Gender, Major
Course: Course #, Location
Instructor: Name, Office, Gender
Student (N) —— (M) Course
Course (N) —— (1) Instructor
Basics (2): Simple Query
• Based on the constructed table, let’s have
some fun with Query.
• Queries are written in a language called
SQL (Structured Query Language).
• SQL is a standard interactive and
programming language for getting
information from and updating a
database.
• Click here to learn more?
Create Query
• Create Query in design view
• Create Query by using wizard
• View the result sheet.
Query Syntax
• Though we now know how to create
simple queries graphically, we still
need to understand the syntax.
• SELECT something FROM somewhere.
Select * from classes;
Select ID from classes;
Select classes.ID, lastname from classes;
Set Conditions
• SELECT something FROM somewhere
WHERE conditions-are-met
– Select * from students where
gender=0;
– Select * from students where
lastname='Smith';
– Select * from students where DOB
between #1/1/1988# and #1/1/1990#;
Set Conditions
• Select * from students where
lastname like 'Smi*';
• Select * from students where
lastname like "*smi*";
• SELECT * FROM students WHERE
gender=0 AND lastname like
"*smi*";
• Be aware that standard SQL uses % instead
of * as the wildcard: LIKE '%smi%';
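The condition syntax can be tried outside Access as well. The sketch below uses SQLite through Python's sqlite3 module, so it follows the standard-SQL % wildcard rather than the Access * wildcard. The sample rows and the 0/1 gender coding are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (ID INTEGER PRIMARY KEY, lastname TEXT, gender INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Smith", 0), (2, "Marsa", 1), (3, "Goldsmith", 0), (4, "Smithers", 1)],
)

def lastnames(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

exact = lastnames("SELECT lastname FROM students WHERE lastname = 'Smith'")
prefix = lastnames("SELECT lastname FROM students WHERE lastname LIKE 'Smi%' ORDER BY ID")
anywhere = lastnames(
    "SELECT lastname FROM students WHERE gender = 0 AND lastname LIKE '%smi%' ORDER BY ID"
)
print(exact, prefix, anywhere)
```

In SQLite, LIKE ignores case for ASCII letters by default, which is why '%smi%' also matches "Smith".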
JOIN
• In many cases, we need to fetch data
from multiple tables. Thus, we need to
bind together the data from the tables.
The binding is based on some keys,
usually the primary key or some other
unique data items.
• Good online material (but be aware that
this is for standard SQL, not for Access!)
• FOR Microsoft inquiry, please go to:
http://msdn.microsoft.com/
Join in Access
• Select sth1, sth2 from table1 INNER JOIN
table2 ON table1.key1 = table2.key2;
• For example:
Select students.* from (students INNER
JOIN studentscourses ON students.ID =
studentscourses.studentID) where
studentscourses.courseNum='Comp200';
Other JOINS
• There are two different types of
JOIN:
– INNER JOIN
– OUTER JOIN
• LEFT JOIN
• RIGHT JOIN
Let’s not deal with OUTER JOIN in
this class to make it simple.
INNER JOIN
• INNER JOIN only joins the records
for which both tables have the
corresponding key.
• See the MSDN explanation
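A runnable version of the INNER JOIN idea is sketched below, using SQLite via Python's sqlite3 instead of Access. The table contents are invented for illustration. Student 3 has no row in studentscourses and therefore never appears in the joined result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (ID INTEGER PRIMARY KEY, lastname TEXT);
CREATE TABLE studentscourses (studentID INTEGER, courseNum TEXT);
INSERT INTO students VALUES (1, 'Smith'), (2, 'Marsa'), (3, 'Lese');
INSERT INTO studentscourses VALUES (1, 'Comp200'), (2, 'Math100'), (2, 'Comp200');
""")

rows = conn.execute(
    "SELECT students.ID, students.lastname "
    "FROM students INNER JOIN studentscourses "
    "ON students.ID = studentscourses.studentID "
    "WHERE studentscourses.courseNum = 'Comp200' "
    "ORDER BY students.ID"
).fetchall()
print(rows)
```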
Sort the Results
• You can order the results in
ascending or descending order.
1. Select * from students order by
studentID desc;
2. Select * from students order by
lastname; (if it is ascending order,
you don’t need to specify it)
Subquery
• Inside a query, we can have another
query to provide some information
for a condition, i.e. we have
subquery(s) inside a query.
Select * from students where
studentID in (select studentID from
studentscourses where
coursenumber="Comp200");
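The way the inner query feeds the outer one can be seen by running both. This sketch uses SQLite via Python's sqlite3 rather than Access, with invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (studentID INTEGER PRIMARY KEY, lastname TEXT);
CREATE TABLE studentscourses (studentID INTEGER, coursenumber TEXT);
INSERT INTO students VALUES (1, 'Smith'), (2, 'Marsa'), (3, 'Lese');
INSERT INTO studentscourses VALUES (1, 'Comp200'), (3, 'Comp200'), (2, 'Math100');
""")

# The subquery on its own: which student IDs take Comp200?
inner = [r[0] for r in conn.execute(
    "SELECT studentID FROM studentscourses "
    "WHERE coursenumber = 'Comp200' ORDER BY studentID")]

# The full query: the subquery result becomes the list tested by IN.
outer = [r[0] for r in conn.execute(
    "SELECT lastname FROM students WHERE studentID IN "
    "(SELECT studentID FROM studentscourses WHERE coursenumber = 'Comp200') "
    "ORDER BY studentID")]
print(inner, outer)
```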
Functions
• Access queries can use built-in
functions. For example, MAX, MIN,
COUNT, etc. Let’s try
COUNT.
Question: how to find the number of
students who are taking courses
currently in the school?
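One possible answer is sketched below with SQLite via Python's sqlite3. It assumes a studentscourses link table with one row per enrollment. COUNT(DISTINCT ...) works in SQLite, but Access may need a different formulation, such as counting over a subquery of distinct IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studentscourses (studentID INTEGER, courseNum TEXT);
INSERT INTO studentscourses VALUES
    (1, 'Comp200'), (1, 'Math100'), (2, 'Comp200'), (4, 'Eng100');
""")

# COUNT(*) counts enrollments, so a student taking two courses is counted twice.
enrollments = conn.execute("SELECT COUNT(*) FROM studentscourses").fetchone()[0]
# COUNT(DISTINCT ...) counts each student once.
students_enrolled = conn.execute(
    "SELECT COUNT(DISTINCT studentID) FROM studentscourses").fetchone()[0]
print(enrollments, students_enrolled)
```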
Others
• So far, we have been dealing with SELECT
queries. There are other types:
• CREATE – create tables
• INSERT – insert rows
• DROP – drop tables
• DELETE – delete rows
• ALTER - change the table structures
• Etc.
Sample Database
• Here is a sample database with some
queries constructed. Might be useful
as references.
• Remember that this class is not only
for database, so we cannot go very
deep into database issues. If you
have more interests in database, I
may be able to offer a class
specifically on database.
Project 5
• Project 5 asks you to construct a
database for a group of restaurants.
Please use a UML diagram to analyze
the data first, then construct your
database. Also, please provide some
queries. Imagine that you are
providing a hotline service for
customer inquiries about food
services in the Hartford area.
Part 4
Bioinformatics
• What is bioinformatics?
The study of the application of computer
and statistical techniques to the
management of biological information
The science of creating and managing
biological databases to keep track of, and
eventually simulate, the complexity of
living organisms.
There exist different definitions, though.
The Possible Role of
Bioinformatics?
• Looking over the history of biology, different
approaches have been used over time.
• Initially “Guessing” → Observation →
Dissection.
• Mendel started genetic experiments.
• Biochemists used organic chemistry to work out the
metabolic pathways.
• Molecular biology is another approach now
used to decode the secrets of life.
• Is it time for bioinformatics as another
approach?
Several Foundations of
Bioinformatics
• All life comes from the same ancestors,
‘either evolved or created’. That means that
knowledge obtained on one form of
life may be applied to other forms.
In fact, molecular biology started
from bacteria, then yeast, then
mammals. → database
• Publicly available data resources.
• Human Genome Project
Public Resources
• I am not sure how many biological
research laboratories there are in the
world, but there must be a great many.
• No other science has an equal or even
close number of research
laboratories.
• It receives the largest amount of research funds
from governments, states, private
corporations, etc.
Most famous Agencies
• NIH (National Institutes of Health)
• WHO (World Health Organization)
• Others …
Huge Amount of
Information
• Scientists around the world have generated
a large amount of scientific information, and
it is likely that much of it is repeated.
• Communication among scientists becomes
extremely important.
• That is why there are so many publicly
available biological resources.
• Internet plays a critical role in the
information sharing.
Internet’s Information
1. Access to information for anyone with an
Internet browser.
2. The data stored in centralized databases
is redundant by a factor of about 2.5,
which provides a form of quality control.
3. Information from yeast (for example)
could be helpful in finding/understanding
homologous genes/pathways in humans
(comparative genomics).
Human Genome Project
• HGP.
• Without HGP, there is no real
Bioinformatics.
• Bioinformatics took off after large
amounts of the human genome were
decoded → how to use this DNA
information? → Computer
technologies!
Bioinformatics and Evolution
Ancestor
  Child A
    Child Aa
      ChildAa-1
    Child Ab
  Child B
    Child Ba
      ChildBa-1
    Child Bb
  Child C
(Diagram label: Mutations)
Mutations
• Mutations that occur in germ cells
will be passed on to the next
generation, like any other DNA
sequences.
• So, as time and generations go by, a
DNA sequence will acquire more and
more mutations and resemble less
and less the original DNA sequence.
Need to know where from
• From an evolutionary perspective, we
cannot know where we are going unless
we know where we have been. Previously,
the study of human evolution was largely
the province of paleoanthropologists who
studied the fossil record.
• However, gene comparisons have now become
the major and more accurate technique
→ using computer
technologies/bioinformatics
Do you know …
• We all started from Africa?
• Mitochondrial DNA analysis
of women from different
nations found that African
people have larger variations in DNA
sequence → the oldest group has the
greatest genetic diversity → Africans
are the oldest population → the
ancestors.
Bioinformatics with AIDS
• Analysis of the human genome
guides AIDS research. Some
people long infected with HIV have
not shown any symptoms of the
disease. Studies found that these
people possess a variant of the
receptor CCR5 → the variant is rare in Asians and
Africans → it is guessed that it may have come to
Europeans in the 14th century.
Tools of Bioinformatics
• Gene Prediction Software
• Sequence Alignment Software
• Molecular Phylogenetics
• Molecular Modeling and 3-D
Visualization.
NCBI
• National Center for Biotechnology
Information.
– PubMed (Medline)
– Entrez
– BLAST
– OMIM
– Books
– TaxBrowser
– Structure
PubMed
• Access to the Medline database → largest
biomedical literature source.
• Medline database contains citations and
abstracts from more than 4600 biomedical
journals published in USA and other
countries.
• Searches are commonly conducted using a
keyword(s), author names, publication
date, and/or journal titles.
Entrez
• A search and retrieval system that
integrates all of the databases available at
NCBI. These databases include nucleotide
sequences, protein sequences, genomes,
molecular structure and PubMed.
• GenBank, the DNA DataBank of Japan, and the
European Molecular Biology Laboratory
make up the International Nucleotide
Sequence Database Collaboration. These
organizations exchange data every day.
• Search for Bcl2 as an example.
BLAST
• Basic Local Alignment Search Tool.
Sequence 1
…AGTTCGATAGCTAAGGTCGG…
Sequence 2
…AGTTCGATAGCTATGGTCGG…
BLAST
Sequence 3
…AGTTCGATAGCTAAGGTCGG…
Sequence 4
…AGTTCGATAGCTAGGTCGGG…
BLAST – Another Look
Sequence 3
…AGTTCGATAGCTAAGGTCGG…
Sequence 4
…AGTTCGATAGCTA–GGTCGG…
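The difference between the two views of Sequence 4 can be counted with a few lines of Python. This is not how BLAST itself works. It is only a position-by-position comparison of sequences that are already aligned, and the function name mismatches is made up for this sketch.

```python
def mismatches(seq_a, seq_b, gap="-"):
    """Return (substitution positions, gap positions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    substitutions, gaps = [], []
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == gap or b == gap:
            gaps.append(i)           # one sequence has a gap here
        elif a != b:
            substitutions.append(i)  # both have a base, but they differ
    return substitutions, gaps

seq1 = "AGTTCGATAGCTAAGGTCGG"
seq2 = "AGTTCGATAGCTATGGTCGG"
seq3 = "AGTTCGATAGCTAAGGTCGG"
seq4_ungapped = "AGTTCGATAGCTAGGTCGGG"
seq4_gapped = "AGTTCGATAGCTA-GGTCGG"

print(mismatches(seq1, seq2))
print(mismatches(seq3, seq4_ungapped))
print(mismatches(seq3, seq4_gapped))
```

Sequences 1 and 2 differ by a single substitution. Sequences 3 and 4 differ at four positions when compared directly, but by a single gap once a gap is allowed, which is why alignment tools consider gaps.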
Use BLAST
• Click here.
• Let’s choose blastn.
• Now, let’s practice its uses.
OMIM
• Online Mendelian Inheritance in Man
• It is a database containing
information about human genes and
genetic diseases. This resource is
often used by physicians and
researchers interested in genetic
diseases.
Books
• NCBI collaborates with authors and
publishers to create a virtual
bookshelf.
TaxBrowser
• The taxonomy site contains a
classification of all the organisms
that are represented by sequences in
the public databases, including
model organisms commonly used in
molecular biology.
Structure
• The structure site features the
Molecular Modeling Database
(MMDB), which contains
macromolecular 3-D structures as
well as tools to analyze them.
Included in the MMDB are
experimentally determined
structures obtained from the protein
data bank.
Cn3D4.1
• You can download it.
• It reads MMDB instead of PDB files.
This is because MMDB ensures
the correctness of the PDB file that was read.
• The Link
Applications of
Bioinformatics
• Forensic Science
• Agriculture
• Medicine
• Pharma/Biotechnology
• Environmental Science
• Ethical, Legal, and Social Issues
Forensic Science
• Minisatellites consist of short DNA
sequences that repeat in tandem. The
number of repeats & the sequence within
each repeat can exhibit wide variation in a
population. Techniques based on this
were developed to identify individuals.
E.g. FBI established Combined DNA Index
System (CODIS) that contains profiles of
convicted offenders.
Forensic Science
• DNA testing is now the standard
technique for confirming paternity.
• It is also a technique to identify
criminals and victims.
• Computer technology is essential to
search through the database for the
identification.
Agriculture
• Genome projects for major crop
plants are well underway:
Pest control
Seed quality
Plant micronutrients (golden rice)
Etc.
Medicine
• The ability to correlate genetic data
with medical records promises to
improve our understanding of
disease and improve treatments.
• Microarray → cancer classification
• Associating SNPs with disease helps
scientists to identify genes that play
roles in disease progression.
Pharma/Biotechnology
• Bioinformatics is providing a complete list
of candidate genes for drug discovery.
The tools of functional genomics are being
used to establish the metabolic roles
played by the candidate gene products.
• Pharmaceutical companies are using
bioinformatics to search for new
antibiotics.
Cont’d
• Advances in genomics are expanding
the range of drug targets and are
shifting the discovery effort from
direct screening programs to rational
target-based drug designs.
Environmental Sciences
• Global biodiversity.
• Global Biodiversity Information
Facility (GBIF)
• How to analyze this diversity and
make use of it.
• Computer software to monitor
environmental changes, via birds and
other animals’ behaviors.
Ethical, Legal and Social
Issues
• Anonymous databases → include
nonidentifiable genetic data.
• Non-anonymous databases → their
data could be linked to individuals.
• An ethical concern most relevant to
non-anonymous databases is
Informed Consent.
Informed Consent
• Informed consent is the ethical
practice of respecting individual
autonomy and protecting an
individual from harm. It refers to a
process whereby an individual freely
and knowingly weighs the risks and
benefits of donating a tissue or DNA
sample for research purposes.
Privacy & Confidentiality
• Personal privacy is an important
aspect of informed consent. Privacy
is the right to control access to
information about oneself.
• Confidentiality is the obligation for
those who obtain information about
individuals to protect the privacy of
that information.
More
• If society is to gain the most from
genomic biology, then the public
must be able to rationally consider
scientific issues. They should not
place a blind trust in scientists, nor
should they dismiss new technologies
out of hand.
In-Class Exercise
• The human genome was sequenced via the
“shotgun” approach, in which human
chromosomes were randomly cut
into pieces.
• Each DNA piece is sequenced
separately.
• Computer technology is then used to
find the overlap and construct the
contiguous sequence.
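The overlap-and-merge step can be sketched with a simple greedy strategy. This is a toy illustration, not the software used by the Human Genome Project. The function names are hypothetical, and the fragments are cut by hand from the 20-base sequence shown on the BLAST slides.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best = (0, None, None)
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; the pieces stay as separate contigs
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces

# Hypothetical fragments: three overlapping 10-base pieces of one strand.
fragments = ["AGTTCGATAG", "GATAGCTAAG", "CTAAGGTCGG"]
print(greedy_assemble(fragments))
```

Real genomes contain repeated regions that can mislead a greedy merge like this one, so actual assemblers use more careful methods.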
In-Class Exercise
• Group 1
• Group 2
• Group 3
• Each group will constitute two
fragments and all groups work
together for the final sequences.
• For simplicity, we are dealing with
only one strand.