1/17/2017
BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

Python Programming Language
- Script Language
  - General-purpose script language
  - Broad applications (web, bioinformatics, network programming, graphics, software engineering)
- Features
  - Object-oriented
  - Extension with modules
  - Database integration
  - Embeddable
  - Web frameworks / Web modules

Getting Started
- Download & Installation
  - http://www.python.org/download/ (the most recent version: Python 3.3)
- Edit & Run
  - Create a file named test.py
  - Edit the code
      # This is a test.
      dna = 'ATCGATGA'
      print(dna, '\n')
  - Run the code
      > python test.py

Primitives
- Primitive Data Types
  - Numbers or Strings
      num = 1234
      st = '1234'
      num_1 = num + int(st)       # 2468 (integer addition)
      st_1 = str(num) + st        # '12341234' (string concatenation)
- Substring
      dna1 = 'ACGTGAACT'
      dna2 = dna1[0:4]            # 'ACGT'
      length = len(dna2)          # 4
- Reversing
      dna1 = 'ACGTGAACT'
      dna2 = dna1[::-1]           # 'TCAAGTGCA'
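
The slice notation used above has the general form `sequence[start:stop:step]`. As a small sketch of what that allows (the helper name `split_codons` is hypothetical and not part of the lecture), slices can cut a sequence into consecutive three-base chunks:

```python
def split_codons(seq):
    """Return consecutive 3-character slices of seq (the last one may be shorter)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

dna1 = 'ACGTGAACT'
codons = split_codons(dna1)      # ['ACG', 'TGA', 'ACT']
every_other = dna1[::2]          # step of 2: 'AGGAT'
last_three = dna1[-3:]           # negative index counts from the end: 'ACT'
```

Omitting `start` or `stop` means "from the beginning" or "to the end", which is why `dna1[::-1]` walks the whole string backwards.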
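
Lists
- List Variables
  - A list of comma-separated values
      lst1 = ['A', 'C', 'G']
      lst2 = ['T']
      lst1 = lst1 + lst2          # ['A', 'C', 'G', 'T']
  - Variable-length list
- Insert, Delete, Append, Reverse, and Sort
  - Using list methods
      lst = ['A', 'T', 'G']
      lst.insert(1, 'C')          # ['A', 'C', 'T', 'G']
      del lst[2]                  # ['A', 'C', 'G']
      lst.append('T')             # ['A', 'C', 'G', 'T']
      lst.extend(['A', 'C'])      # ['A', 'C', 'G', 'T', 'A', 'C']
      lst.reverse()               # ['C', 'A', 'T', 'G', 'C', 'A']
      lst.sort()                  # ['A', 'A', 'C', 'C', 'G', 'T']
  - Using slice assignment
      lst = ['A', 'T', 'G']
      lst[1:2] = 'C'                          # replace: ['A', 'C', 'G']
      lst[1:1] = 'T'                          # insert: ['A', 'T', 'C', 'G']
      lst[2:3] = ''                           # delete: ['A', 'T', 'G']
      lst[len(lst):len(lst)] = 'T'            # append: ['A', 'T', 'G', 'T']
      lst[len(lst):len(lst)] = ['A', 'C']     # extend: ['A', 'T', 'G', 'T', 'A', 'C']
      lst[::-1]                               # reversed copy; lst itself is unchanged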

Sets
- Set Variables
      DNAbases = {'A', 'C', 'G', 'T'}
      RNAbases = {'A', 'C', 'G', 'U'}
      DNAbases | RNAbases         # union: {'A', 'C', 'G', 'T', 'U'}
      DNAbases & RNAbases         # intersection: {'A', 'C', 'G'}
      DNAbases - RNAbases         # difference: {'T'}
- Add and Remove
      bases = {'A', 'D', 'G'}
      bases.add('T')
      bases.remove('D')
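
Set operations also give a compact way to check which symbols a sequence actually uses. The following is only a sketch, assuming the input is an upper-case string; the name `unexpected_symbols` is hypothetical.

```python
DNA_BASES = {'A', 'C', 'G', 'T'}

def unexpected_symbols(seq):
    """Return the set of characters in seq that are not DNA bases."""
    return set(seq) - DNA_BASES

clean = unexpected_symbols('ATCGATGA')     # set(): every symbol is a DNA base
mixed = unexpected_symbols('AUCGNA')       # {'U', 'N'}
is_subset = set('ATCG') <= DNA_BASES       # True: <= is the subset test
```

Because a set keeps each element only once, `set(seq)` collapses a long sequence to its alphabet before the difference is taken.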

Dictionaries
- Initialization
  - With a dictionary literal
      d = {
          'key1': 'value1',
          'key2': 'value2',
          'key3': 'value3'
      }
  - Starting from an empty dictionary
      d = dict()
      d['key1'] = 'value1'
      k2, v2 = 'key2', 'value2'
      d[k2] = v2
- Mapping
      d['key1']
      d.get('key1')
      d.keys()
      d.values()
- Delete
      del d['key1']
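
A typical use of a dictionary in sequence analysis is counting symbols. The sketch below is illustrative (the function name `count_bases` is hypothetical) and relies on `get` with a default value so that a missing key does not raise an error:

```python
def count_bases(seq):
    """Count how often each character occurs in seq."""
    counts = {}
    for base in seq:
        # get() returns the default 0 when the key is not yet present
        counts[base] = counts.get(base, 0) + 1
    return counts

counts = count_bases('ATCGATGA')
# counts == {'A': 3, 'T': 2, 'C': 1, 'G': 2}
missing = counts.get('N')        # None, whereas counts['N'] would raise KeyError
```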

Input / Output
- Standard Input
      import sys
      data = sys.stdin.readline().replace('\n', ' ')
- Reading Files
  - File name given in the code
      name = 'myfilename.txt'
      with open(name) as file:
          data = file.read()
  - File name read from standard input
      name = sys.stdin.readline().strip()     # strip the trailing newline
      with open(name) as file:
          data = file.read()
  - File name given as a command-line argument
      name = sys.argv[1]
      with open(name) as file:
          data = file.read()
- Writing Files
      name = 'output.txt'
      with open(name, 'w') as file:
          file.write('ATCGATG')
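
To see writing and reading together, here is a minimal round-trip sketch. It assumes nothing beyond the standard library; the temporary directory is used only so that the example does not overwrite an existing file.

```python
import os
import tempfile

# Work in a temporary directory so no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    name = os.path.join(tmp_dir, 'output.txt')

    with open(name, 'w') as file:
        file.write('ATCG\nATG\n')

    with open(name) as file:
        data = file.read()                             # whole file as one string

    with open(name) as file:
        lines = [line.rstrip('\n') for line in file]   # one entry per line

sequence = data.replace('\n', '')                      # 'ATCGATG'
```

The `with` statement closes the file automatically when its block ends, even if an error occurs inside it.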

Functions
- Types
  - Built-in system functions
  - User-defined functions
- Defining Function
      def function_name(parameter_list):
          statement
          statement
          return value
- Function Call

Iteration
- Iterative Process
      def find_max(lst):
          max_so_far = lst[0]
          for item in lst[1:]:
              if item > max_so_far:
                  max_so_far = item
          return max_so_far

      lst1 = [3, 5, 10, 4, 6]
      maximum = find_max(lst1)    # 10

Recursion
- Recursive Call
      def print_tree(tree, level):
          # indent each node by four spaces per level
          print(' ' * 4 * level, tree[0])
          for subtree in tree[1:]:
              print_tree(subtree, level+1)

      t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
      print_tree(t1, 0)

Modules
- Module
  - A collection of functions
  - Modules are Python (.py) files in a library directory
- Module Call
  - Importing the module and qualifying the function name
      import random
      seq = 'ATCGATAGCTA'
      random_base = seq[random.randint(0, len(seq)-1)]
  - Importing the names directly
      from random import *
      seq = 'ATCGATAGCTA'
      random_base = seq[randint(0, len(seq)-1)]
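
As a side note to the two import styles, the `random` module also provides `choice`, which picks an element without computing an index. The sketch below compares both; the fixed seed is an assumption made only so that repeated runs behave the same.

```python
import random

seq = 'ATCGATAGCTA'

random.seed(0)                                   # fixed seed, only to make runs repeatable
by_index = seq[random.randint(0, len(seq) - 1)]  # randint includes both end points
by_choice = random.choice(seq)                   # picks one element directly
```

Importing a single name, as in `from random import randint`, is a common alternative to `from random import *` because it keeps other names in the script from being overwritten.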

Regular Expressions (1)
- Special Languages
  - Metacharacters (characters having special meanings): . (any character except a newline), \n (newline), \t (tab), \s (whitespace), \w (any alphabetic or numeric character, or underscore), \W (any character not matched by \w), \d (decimal digit), \D (any non-digit)
  - Quantifiers, e.g., 'ct.*g', 'ct.+g', 'ct.?g', 'ct{2}g', 'ct{2,5}g'
  - Grouping and back-reference, e.g., '(.)(.)aa\1\2'
  - Alternatives, e.g., '(ct|ca)'
  - Character set, e.g., '[acgt]', '[a-zA-Z]'
  - Anchors: ^ (the start of the string), $ (the end of the string), e.g., '^tata', 'aa$'
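
The pattern elements listed above can be tried out with Python's `re` module. This is a small sketch on made-up strings, not data from the lecture. Note that a pattern containing a backslash, such as the back-reference example, should be written as a raw string (`r'...'`) in Python source, because `'\1'` in an ordinary string literal is interpreted as a control character.

```python
import re

seq = 'ctttgacaag'

# Quantifiers: 't{2,5}' means between two and five consecutive t's.
q_range = re.search('ct{2,5}g', seq).group()     # 'ctttg'
q_exact = re.search('ct{2}g', seq)               # None: seq has three t's, not exactly two

# Alternatives and character sets.
alternatives = re.findall('(ct|ca)', seq)        # ['ct', 'ca']
a_or_g = re.findall('[ag]', seq)                 # ['g', 'a', 'a', 'a', 'g']

# Anchors.
starts_ct = re.search('^ct', seq) is not None    # True
ends_aa = re.search('aa$', seq) is not None      # False: seq ends with 'ag'

# Grouping and back-reference: any two characters, 'aa', then the same two again.
repeat = re.search(r'(.)(.)aa\1\2', 'gtcaatcg').group()   # 'tcaatc'
```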

Regular Expressions (2)
- Usage
  - search: searches for the first match of the pattern in a string, and returns the match as a MatchObject instance (None if there is no match)
      import re
      pos = re.search('TATA.*AA', seq)
      print(pos.start())
  - findall: searches for all matches of the pattern in a string, and returns a list of the matches
      import re
      matches = re.findall('TATA.*AA', seq)
      print(matches)
  - finditer: searches for all matches of the pattern in a string, and returns an iterator that yields MatchObject instances
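
The slide gives no code for `finditer`, so here is a minimal sketch on a made-up sequence. It also shows one way to guard against `search` returning `None`; the value `-1` for "not found" is an arbitrary choice for this example.

```python
import re

seq = 'GGTATACCTATAGG'

# Each item produced by finditer is a match object.
positions = [(m.start(), m.end(), m.group()) for m in re.finditer('TATA', seq)]
# positions == [(2, 6, 'TATA'), (8, 12, 'TATA')]

# search returns None when nothing matches, so test the result before using it.
match = re.search('CCCC', seq)
start = match.start() if match else -1           # -1 here
```

Compared with `findall`, which returns only the matched text, `finditer` keeps the position of every match available through `start()` and `end()`.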

Biological Applications
- Parsing Sequences
- Sequence Validation
- Motif Search
- Sequence Transformation
  - DNA Replication
  - Transcription from DNA to RNA
  - Translating RNA into Protein
  - DNA Sequence Mutation

Parsing Sequences (1)
- Single Sequence in FASTA Format
>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
YTIIGQMASILYFSIILAFLPIAGXIENY
- Parsing
  - Make a function to return the sequence from the FASTA format
      def read_FASTA_seq(filename):
          with open(filename) as f:
              # drop the header line, then join the remaining lines
              return f.read().partition('\n')[2].replace('\n', '')

Parsing Sequences (2)
- Multiple Sequences in FASTA Format
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
- Parsing?
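
One possible answer to the question above, offered as a sketch rather than as the lecture's own solution: walk through the lines, start a new record at every line beginning with '>', and append all other lines to the current record. The function names `parse_FASTA` and `read_FASTA_records`, and the choice of returning a dictionary keyed by header text, are assumptions. The example input uses shortened versions of the two sequences shown above.

```python
def parse_FASTA(text):
    """Map each header line (without '>') to its sequence with line breaks removed."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            header = line[1:]
            records[header] = ''
        elif header is not None:
            records[header] += line       # sequence lines are concatenated
    return records

def read_FASTA_records(filename):
    """Read a multi-sequence FASTA file into a dictionary."""
    with open(filename) as f:
        return parse_FASTA(f.read())

example = '>SEQUENCE_1\nMTEITAAMVK\nELRESTGAGM\n>SEQUENCE_2\nSATVSEINSE\n'
records = parse_FASTA(example)
# records == {'SEQUENCE_1': 'MTEITAAMVKELRESTGAGM', 'SEQUENCE_2': 'SATVSEINSE'}
```

If two records share the same header, the later one overwrites the earlier one in this sketch; returning a list of (header, sequence) pairs would avoid that.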

Sequence Validation (1)
- DNA Sequence Validation
  - Make a function to check that the sequence consists of 'A', 'T', 'C', and 'G' only
      def validate_dna(base_sequence):
          seq = base_sequence.upper()
          for base in seq:
              if base not in 'ACGT':
                  return False
          return True
  - Alternative using count
      def validate_dna(base_sequence):
          seq = base_sequence.upper()
          return len(seq) == (seq.count('T') + seq.count('C') +
                              seq.count('A') + seq.count('G'))

Sequence Validation (2)
- Counting Base Frequency
  - Make a function to calculate the percent of 'C' and 'G' in a DNA sequence
      def percent_of_GC(base_sequence):
          seq = base_sequence.upper()
          count = 0
          for base in seq:
              if base in 'CG':
                  count += 1
          # a fraction between 0 and 1; multiply by 100 for a percentage
          return float(count) / len(seq)
  - Alternative using count
      def percent_of_GC(base_sequence):
          seq = base_sequence.upper()
          return float(seq.count('G') + seq.count('C')) / len(seq)

Motif Search
- Searching Substring
  - Make a function to take a sequence and a motif and return the position(s) of the matches in the sequence
      def motif_search(seq, motif):
          # position of the first match, or -1 if there is none
          return seq.find(motif)

      def all_motif_search(seq, motif):
          # positions of all non-overlapping matches; empty list if there is none
          pos = []
          idx = seq.find(motif)
          while idx >= 0:
              pos.append(idx)
              # continue searching after the end of the current match
              idx = seq.find(motif, idx + len(motif))
          return pos
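
`str.find` accepts an optional start index, which lets a search resume from any position. Building on that, the sketch below can also report overlapping occurrences, which matters for repetitive motifs. The function name `all_motif_positions` and its `overlapping` parameter are hypothetical.

```python
def all_motif_positions(seq, motif, overlapping=True):
    """Return the start positions of every occurrence of motif in seq."""
    if not motif:
        return []                           # an empty motif has no meaningful positions
    positions = []
    step = 1 if overlapping else len(motif)
    idx = seq.find(motif)
    while idx >= 0:
        positions.append(idx)
        idx = seq.find(motif, idx + step)   # resume the search after this hit
    return positions

with_overlap = all_motif_positions('ATATATGC', 'ATAT')             # [0, 2]
without_overlap = all_motif_positions('ATATATGC', 'ATAT', False)   # [0]
not_found = all_motif_positions('ATATATGC', 'GG')                  # []
```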

Transcription
- Simulating Transcription
  - Make a function to transcribe a DNA into an RNA
      def transcription(dna):
          return dna.replace('T', 'U')

Translation (1)
- Making Genetic Code
  - Make a function to translate a codon to an amino acid
      def codon2aa(codon):
          genetic_code = { 'UUU': 'F', 'UUC': 'F',
                           'UUA': 'L', …… }
          if codon in genetic_code:
              return genetic_code[codon]
          else:
              return 'Error'

Translation (2)
- Simulating Translation
  - Make a function to translate an RNA into a protein sequence
      def translation(rna):
          protein = ''
          for n in range(0, len(rna), 3):
              protein += codon2aa(rna[n:n+3])
          return protein

Translation (3)
- Simulating Translation (cont'd)
  - Make a generator function which returns values from a series it computes
      def aa_generator(rna):
          return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

      def translation(rna):
          gen = aa_generator(rna)
          protein = ''
          # next(gen, None) returns None instead of raising StopIteration
          # once the generator is exhausted
          aa = next(gen, None)
          while aa:
              protein += aa
              aa = next(gen, None)
          return protein
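
To make the generator's behavior visible, here is a self-contained sketch. The codon table `PARTIAL_CODE` is deliberately partial and exists only for this illustration; it is not the genetic code table from the lecture. Yielding '?' for unknown codons and ignoring an incomplete trailing codon are likewise assumptions of this sketch.

```python
# Deliberately partial codon table, for illustration only.
PARTIAL_CODE = {'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'GGC': 'G'}

def aa_generator(rna, table=PARTIAL_CODE):
    """Yield one amino acid per complete codon; unknown codons yield '?'."""
    for n in range(0, len(rna) - 2, 3):
        yield table.get(rna[n:n + 3], '?')

gen = aa_generator('AUGUUUGGC')
first = next(gen)                 # 'M'; later codons have not been looked up yet
rest = ''.join(gen)               # 'FG': join consumes the remaining values
done = next(gen, None)            # None: the generator is exhausted

protein = ''.join(aa_generator('AUGUUCUUAGG'))   # 'MFL'; the trailing 'GG' is ignored
```

A generator computes each value only when it is requested, so a long sequence can be translated piece by piece without building an intermediate list.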
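
Mutation
- Simulating Mutation
  - Make a function to simulate single point mutations in a DNA sequence
      import random

      def mutation(dna):
          position = random.randint(0, len(dna)-1)
          bases = 'ACGT'
          new_base = bases[random.randint(0, 3)]
          # strings are immutable, so build a new string instead of assigning to a slice
          return dna[:position] + new_base + dna[position+1:]
  - To guarantee that the new base differs from the original one, choose it from the three other bases instead
          bases = 'ACGT'.replace(dna[position], '')
          new_base = bases[random.randint(0, 2)]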

Questions?
- Lecture slides are found on the course website, web.ecs.baylor.edu/faculty/cho/3360