
BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python
Young-Rae Cho, Associate Professor
Department of Computer Science, Baylor University
1/17/2017

Python Programming Language
- Script Language
  - General-purpose script language
  - Broad applications (web, bioinformatics, network programming, graphics, software engineering)
- Features
  - Object-oriented
  - Extension with modules
  - Database integration
  - Embeddable
  - Web frameworks / Web modules

Getting Started
- Download & Installation
  - http://www.python.org/download/ (the most recent version: Python 3.3)
- Edit & Run
  - Create a file named test.py
  - Edit the code
      # This is a test.
      dna = 'ATCGATGA'
      print(dna, '\n')
  - Run the code
      > python test.py

Primitives
- Primitive Data Types
  - Numbers or Strings
      num = 1234
      st = '1234'
      num_1 = num + int(st)      # integer addition
      st_1 = str(num) + st       # string concatenation
- Substring
      dna1 = 'ACGTGAACT'
      dna2 = dna1[0:4]
      length = len(dna2)
- Reversing
      dna1 = 'ACGTGAACT'
      dna2 = dna1[::-1]

Lists
- List Variables
  - A list of comma-separated values
      lst1 = ['A', 'C', 'G']
      lst2 = ['T']
      lst1 = lst1 + lst2
  - Variable-length list
- Insert, Delete, Append, Reverse, and Sort
  - Using list methods
      lst = ['A', 'T', 'G']
      lst.insert(1, 'C')
      del lst[2]
      lst.append('T')
      lst.extend(['A', 'C'])
      lst.reverse()
      lst.sort()
  - Using slices
      lst = ['A', 'T', 'G']
      lst[1:2] = 'C'
      lst[1:1] = 'T'
      lst[2:3] = ''
      lst[len(lst):len(lst)] = 'T'
      lst[len(lst):len(lst)] = ['A', 'C']
      lst[::-1]

Sets
- Set Variables
      DNAbases = {'A', 'C', 'G', 'T'}
      RNAbases = {'A', 'C', 'G', 'U'}
      DNAbases | RNAbases        # union
      DNAbases & RNAbases        # intersection
      DNAbases - RNAbases        # difference
- Add and Remove
      bases = {'A', 'D', 'G'}
      bases.add('T')
      bases.remove('D')

Dictionaries
- Initialization
  - With a literal
      d = {
          'key1': 'value1',
          'key2': 'value2',
          'key3': 'value3'
      }
  - Key by key
      d = dict()
      d['key1'] = 'value1'
      k2, v2 = 'key2', 'value2'
      d[k2] = v2
- Mapping
      d['key1']
      d.get('key1')
      d.keys()
      d.values()
- Delete
      del d['key1']

Input / Output
- Standard Input
      import sys
      data = sys.stdin.readline().replace('\n', ' ')
- Reading Files
  - File name fixed in the code
      name = 'myfilename.txt'
      with open(name) as file:
          data = file.read()
  - File name read from standard input
      name = sys.stdin.readline().strip()   # strip the trailing newline before opening
      with open(name) as file:
          data = file.read()
  - File name given as a command-line argument
      name = sys.argv[1]
      with open(name) as file:
          data = file.read()
- Writing Files
      name = 'output.txt'
      with open(name, 'w') as file:
          file.write('ATCGATG')

Functions
- Types
  - Built-in system functions
  - User-defined functions
- Defining Function
      def function_name(parameter_list):
          statement
          statement
          return value
- Function Call

Iteration
- Iterative Process
      def find_max(lst):
          max_so_far = lst[0]
          for item in lst[1:]:
              if item > max_so_far:
                  max_so_far = item
          return max_so_far

      lst1 = [3, 5, 10, 4, 6]
      maximum = find_max(lst1)

Recursion
- Recursive Call
      def print_tree(tree, level):
          print(' ' * 4 * level, tree[0])
          for subtree in tree[1:]:
              print_tree(subtree, level+1)

      t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
      print_tree(t1, 0)

Modules
- Module
  - A collection of functions
  - Modules are Python (.py) files in a library directory
- Module Call
  - Importing the module
      import random
      seq = 'ATCGATAGCTA'
      random_base = seq[random.randint(0, len(seq)-1)]
  - Importing the names from the module
      from random import *
      seq = 'ATCGATAGCTA'
      random_base = seq[randint(0, len(seq)-1)]

Regular Expressions (1)
- Special Languages
  - Metacharacters (characters having special meanings): . (any character), \n, \t, \s (whitespace), \w (any alphabetic or numeric character, or underscore), \W, \d (decimal digit), \D
  - Quantifiers, e.g., 'ct.*g', 'ct.+g', 'ct.?g', 'ct{2}g', 'ct{2,5}g'
  - Grouping and back-reference, e.g., '(.)(.)aa\1\2'
  - Alternatives, e.g., '(ct|ca)'
  - Character set, e.g., '[acgt]', '[a-zA-Z]'
  - Anchors: ^ (the start of the string), $ (the end of the string), e.g., '^tata', 'aa$'

Regular Expressions (2)
- Usage
  - search: finds the first match of the pattern in a string and returns it as a MatchObject instance, from which the position can be read
      import re
      pos = re.search('TATA.*AA', seq)
      print(pos.start())
  - findall: finds all matches of the pattern in a string and returns a list of the matches
      import re
      matches = re.findall('TATA.*AA', seq)
      print(matches)
  - finditer: finds all matches of the pattern in a string and returns an iterator that yields a MatchObject instance for each match

Biological Applications
- Parsing Sequences
- Sequence Validation
- Motif Search
- Sequence Transformation
- DNA Replication
- Transcription from DNA to RNA
- Translating RNA into Protein
- DNA Sequence Mutation

Parsing Sequences (1)
- Single Sequence in FASTA Format
      >gi|5524211|gb|AAD44166.1| cytochrome b
      LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
      YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
      IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
      VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
      YTIIGQMASILYFSIILAFLPIAGXIENY
- Parsing
  - Make a function to return the sequence from the FASTA format
      def read_FASTA_seq(filename):
          with open(filename) as f:
              # drop the header line, then join the remaining lines
              return f.read().partition('\n')[2].replace('\n', '')

Parsing Sequences (2)
- Multiple Sequences in FASTA Format
      >SEQUENCE_1
      MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
      LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
      QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
      VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
      >SEQUENCE_2
      SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
      NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
- Parsing ?

Sequence Validation (1)
- DNA Sequence Validation
  - Make a function to check that the sequence consists of 'A', 'T', 'C', and 'G' only
      def validate_dna(base_sequence):
          seq = base_sequence.upper()
          for base in seq:
              if base not in 'ACGT':
                  return False
          return True
  - Alternative using count
      def validate_dna(base_sequence):
          seq = base_sequence.upper()
          return len(seq) == (seq.count('T') + seq.count('C') +
                              seq.count('A') + seq.count('G'))

Sequence Validation (2)
- Counting Base Frequency
  - Make a function to calculate the percent of 'C' and 'G' in a DNA sequence
      def percent_of_GC(base_sequence):
          seq = base_sequence.upper()
          count = 0
          for base in seq:
              if base in 'CG':
                  count += 1
          # returns a fraction between 0 and 1; multiply by 100 for a percentage
          return float(count) / len(seq)
  - Alternative using count
      def percent_of_GC(base_sequence):
          seq = base_sequence.upper()
          return float(seq.count('G') + seq.count('C')) / len(seq)

Motif Search
- Searching Substring
  - Make a function to take a sequence and a motif and return the position(s) of matching in the sequence
      def motif_search(seq, motif):
          # position of the first match, or -1 if there is none
          return seq.find(motif)

      def all_motif_search(seq, motif):
          # positions of all non-overlapping matches; empty list if there is none
          pos = []
          idx = seq.find(motif)
          while idx != -1:
              pos.append(idx)
              idx = seq.find(motif, idx + len(motif))
          return pos

Transcription
- Simulating Transcription
  - Make a function to transcribe a DNA into an RNA
      def transcription(dna):
          return dna.replace('T', 'U')

Translation (1)
- Making Genetic Code
  - Make a function to translate a codon to an amino acid
      def codon2aa(codon):
          genetic_code = {
              'UUU': 'F', 'UUC': 'F', 'UUA': 'L', ……
          }
          if codon in genetic_code:
              return genetic_code[codon]
          else:
              return 'Error'

Translation (2)
- Simulating Translation
  - Make a function to translate an RNA into a protein sequence
      def translation(rna):
          protein = ''
          for n in range(0, len(rna), 3):
              protein += codon2aa(rna[n:n+3])
          return protein

Translation (3)
- Simulating Translation (cont'd)
  - Make a generator function which returns values from a series it computes
      def aa_generator(rna):
          return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

      def translation(rna):
          gen = aa_generator(rna)
          protein = ''
          aa = next(gen, None)       # None once the generator is exhausted
          while aa:
              protein += aa
              aa = next(gen, None)
          return protein

Mutation
- Simulating Mutation
  - Make a function to simulate single point mutations in a DNA sequence
      import random

      def mutation(dna):
          position = random.randint(0, len(dna)-1)
          bases = 'ACGT'
          new_base = bases[random.randint(0, 3)]
          # strings are immutable, so build a new string around the new base
          dna = dna[:position] + new_base + dna[position+1:]
          return dna
  - To make sure the new base differs from the original one, remove the original base from the candidates before choosing
          bases = bases.replace(dna[position], '')
          new_base = bases[random.randint(0, 2)]

Questions?
- Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/3360
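
The "Parsing Sequences (2)" slide leaves the parser for multiple FASTA sequences as an open question. One possible answer is sketched below; it is not taken from the lecture. The function names parse_FASTA and read_FASTA_sequences are hypothetical, and the sketch assumes that every record starts with a '>' header line and that the text after '>' is unique enough to serve as a dictionary key.

```python
def parse_FASTA(text):
    """Map each FASTA header (without the leading '>') to its sequence.

    Hypothetical helper: assumes each record starts with a '>' line.
    """
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            header = line[1:]             # start a new record
            sequences[header] = ''
        elif header is not None:
            sequences[header] += line     # sequence data may span several lines
    return sequences


def read_FASTA_sequences(filename):
    """Read a multi-sequence FASTA file into a dictionary (hypothetical name)."""
    with open(filename) as f:
        return parse_FASTA(f.read())
```

Calling read_FASTA_sequences on a file holding the two records from the slide would give a dictionary with the keys 'SEQUENCE_1' and 'SEQUENCE_2', each mapped to its sequence joined into one string.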
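
The "Regular Expressions (2)" slide shows code for search and findall but not for finditer. A minimal sketch of how finditer might be used follows; the helper name find_motif_matches and the example sequence are made up for illustration. Unlike findall, each item produced by finditer is a match object, so the position of every match is available as well as its text.

```python
import re


def find_motif_matches(seq, pattern):
    """Return (start, end, matched_text) for each non-overlapping match.

    Hypothetical helper built on re.finditer.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, seq)]


seq = 'GGTATACCAATTTATAGAA'    # made-up example sequence
for start, end, text in find_motif_matches(seq, 'TATA'):
    print(start, end, text)
```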
