1/17/2017

BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python

Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University


Python Programming Language

Script Language
    General-purpose script language
    Broad applications (web, bioinformatics, network programming, graphics, software engineering)

Features
    Object-oriented
    Extension with modules
    Database integration
    Embeddable
    Web frameworks / Web modules


Getting Started

Download & Installation
    http://www.python.org/download/ (the most recent version: Python 3.3)

Edit & Run
    Create a file named test.py
    Edit the code

        # This is a test.
        dna = 'ATCGATGA'
        print(dna, '\n')

    Run the code

        > python test.py


Primitives

Primitive Data Types
    Numbers or Strings

        num = 1234
        st = '1234'
        num_1 = num + int(st)     # numeric addition
        st_1 = str(num) + st      # string concatenation

Substring

        dna1 = 'ACGTGAACT'
        dna2 = dna1[0:4]
        length = len(dna2)

Reversing

        dna1 = 'ACGTGAACT'
        dna2 = dna1[::-1]


Lists

List Variables
    A list of comma-separated values

        lst1 = ['A', 'C', 'G']
        lst2 = ['T']
        lst1 = lst1 + lst2

    Variable-length list

Insert, Delete, Append, Reverse, and Sort
    Using list methods

        lst = ['A', 'T', 'G']
        lst.insert(1, 'C')
        del lst[2]
        lst.append('T')
        lst.extend(['A', 'C'])
        lst.reverse()
        lst.sort()

    Using slices

        lst = ['A', 'T', 'G']
        lst[1:2] = 'C'                       # replace the element at index 1
        lst[1:1] = 'T'                       # insert before index 1
        lst[2:3] = ''                        # delete the element at index 2
        lst[len(lst):len(lst)] = 'T'         # append
        lst[len(lst):len(lst)] = ['A', 'C']  # extend
        lst[::-1]                            # reversed copy (lst itself is unchanged)


Sets

Set Variables

        DNAbases = {'A', 'C', 'G', 'T'}
        RNAbases = {'A', 'C', 'G', 'U'}
        DNAbases | RNAbases     # union
        DNAbases & RNAbases     # intersection
        DNAbases - RNAbases     # difference

Add and Remove

        bases = {'A', 'D', 'G'}
        bases.add('T')
        bases.remove('D')


Dictionaries

Initialization
    With a literal

        d = {
            'key1': 'value1',
            'key2': 'value2',
            'key3': 'value3'
        }

    With dict() and assignments

        d = dict()
        d['key1'] = 'value1'
        k2, v2 = 'key2', 'value2'
        d[k2] = v2

Mapping

        d['key1']
        d.get('key1')
        d.keys()
        d.values()

Delete

        del d['key1']


Input / Output

Standard Input

        import sys
        data = sys.stdin.readline().replace('\n', ' ')

Reading Files
    The file name can be given in the code, read from standard input, or taken from the command line

        name = 'myfilename.txt'
        with open(name) as file:
            data = file.read()

        name = sys.stdin.readline().strip()   # strip the trailing newline before opening
        with open(name) as file:
            data = file.read()

        name = sys.argv[1]
        with open(name) as file:
            data = file.read()

Writing Files

        name = 'output.txt'
        with open(name, 'w') as file:
            file.write('ATCGATG')


Functions

Types
    Built-in system functions
    User-defined functions

Defining Function

        def function_name(parameter_list):
            statement
            statement
            return value

Function Call


Iteration

Iterative Process

        def find_max(lst):
            max_so_far = lst[0]
            for item in lst[1:]:
                if item > max_so_far:
                    max_so_far = item
            return max_so_far

        lst1 = [3, 5, 10, 4, 6]
        maximum = find_max(lst1)


Recursion

Recursive Call

        def print_tree(tree, level):
            print(' ' * 4 * level, tree[0])
            for subtree in tree[1:]:
                print_tree(subtree, level + 1)

        t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
        print_tree(t1, 0)


Modules

Module
    A collection of functions
    Modules are Python (.py) files in a library directory

Module Call

        import random
        seq = 'ATCGATAGCTA'
        random_base = seq[random.randint(0, len(seq) - 1)]

        from random import *
        seq = 'ATCGATAGCTA'
        random_base = seq[randint(0, len(seq) - 1)]


Regular Expressions (1)

Special Languages
    Metacharacters (characters having special meanings):
        . (any character), \n, \t, \s (whitespace), \w (any alphanumeric character or underscore), \W (any character not matched by \w), \d (decimal digit), \D (any non-digit)
    Quantifiers
        e.g., 'ct.*g', 'ct.+g', 'ct.?g', 'ct{2}g', 'ct{2,5}g'
    Grouping and back-reference
        e.g., '(.)(.)aa\1\2'
    Alternatives
        e.g., '(ct|ca)'
    Character set
        e.g., '[acgt]', '[a-zA-Z]'
    Anchors: ^ (the start of the string), $ (the end of the string)
        e.g., '^tata', 'aa$'


Regular Expressions (2)

Usage
    search: searches for the first match of the pattern in a string, and returns it as a match object (None if there is no match)

        import re
        pos = re.search('TATA.*AA', seq)
        print(pos.start())

    findall: searches for all matches of the pattern in a string, and returns a list of the matches

        import re
        matches = re.findall('TATA.*AA', seq)
        print(matches)

    finditer: searches for all matches of the pattern in a string, and returns an iterator that yields match objects


Biological Applications

Parsing Sequences
Sequence Validation
Motif Search
Sequence Transformation
    DNA Replication
    Transcription from DNA to RNA
    Translating RNA into Protein
    DNA Sequence Mutation


Parsing Sequences (1)

Single Sequence in FASTA Format

    >gi|5524211|gb|AAD44166.1| cytochrome b
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
    YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
    IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
    VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
    YTIIGQMASILYFSIILAFLPIAGXIENY

Parsing
    Make a function to return the sequence from the FASTA format

        def read_FASTA_seq(filename):
            with open(filename) as f:
                # drop the header line, then join the remaining lines
                return f.read().partition('\n')[2].replace('\n', '')


Parsing Sequences (2)

Multiple Sequences in FASTA Format

    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
    QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
    VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
    NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Parsing
    ?


Sequence Validation (1)

DNA Sequence Validation
    Make a function to check that the sequence consists of 'A', 'T', 'C', and 'G' only

        def validate_dna(base_sequence):
            seq = base_sequence.upper()
            for base in seq:
                if base not in 'ACGT':
                    return False
            return True

        def validate_dna(base_sequence):
            seq = base_sequence.upper()
            return len(seq) == (seq.count('T') + seq.count('C') +
                                seq.count('A') + seq.count('G'))


Sequence Validation (2)

Counting Base Frequency
    Make a function to calculate the percent of 'C' and 'G' in a DNA sequence

        def percent_of_GC(base_sequence):
            seq = base_sequence.upper()
            count = 0
            for base in seq:
                if base in 'CG':
                    count += 1
            # returns a fraction between 0 and 1; multiply by 100 for a percentage
            return float(count) / len(seq)

        def percent_of_GC(base_sequence):
            seq = base_sequence.upper()
            return float(seq.count('G') + seq.count('C')) / len(seq)


Motif Search

Searching Substring
    Make a function to take a sequence and a motif and return the position(s) of matching in the sequence

        def motif_search(seq, motif):
            return seq.find(motif)      # -1 if the motif does not occur

        def all_motif_search(seq, motif):
            # positions of all non-overlapping matches; assumes a non-empty motif
            pos = []
            idx = seq.find(motif)
            while idx >= 0:
                pos.append(idx)
                idx = seq.find(motif, idx + len(motif))
            return pos


Transcription

Simulating Transcription
    Make a function to transcribe a DNA into an RNA

        def transcription(dna):
            return dna.replace('T', 'U')


Translation (1)

Making Genetic Code
    Make a function to translate a codon to an amino acid

        def codon2aa(codon):
            genetic_code = {
                'UUU': 'F', 'UUC': 'F', 'UUA': 'L',  # ...... (remaining codons elided)
            }
            if codon in genetic_code:
                return genetic_code[codon]
            else:
                return 'Error'


Translation (2)

Simulating Translation
    Make a function to translate an RNA into a protein sequence

        def translation(rna):
            protein = ''
            for n in range(0, len(rna), 3):
                protein += codon2aa(rna[n:n+3])
            return protein

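The transcription and translation steps above can be chained into one small worked example. The sketch below is illustrative only: the names PARTIAL_GENETIC_CODE, transcribe, and translate are hypothetical, the codon table is a deliberately partial subset of the standard genetic code (not the full table from the lecture), and marking unknown codons with 'X' and stopping at a stop codon are assumptions, not behavior specified on the slides.

```python
# Partial codon table for illustration only (assumption): the full standard
# genetic code has 64 entries. '*' marks a stop codon.
PARTIAL_GENETIC_CODE = {
    'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
    'GGC': 'G', 'UAA': '*', 'UAG': '*', 'UGA': '*',
}


def transcribe(dna):
    """Transcribe a DNA string into RNA by replacing T with U."""
    return dna.upper().replace('T', 'U')


def translate(rna, table=PARTIAL_GENETIC_CODE):
    """Translate RNA codon by codon until a stop codon or the end of the string."""
    protein = []
    usable_length = len(rna) - len(rna) % 3   # ignore a trailing incomplete codon
    for n in range(0, usable_length, 3):
        aa = table.get(rna[n:n + 3], 'X')     # 'X' = codon missing from the table
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)


rna = transcribe('ATGTTTGGCTAA')
print(rna)             # AUGUUUGGCUAA
print(translate(rna))  # MFG
```

Collecting amino acids in a list and joining once at the end avoids repeated string concatenation, which matters for long sequences.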
Translation (3)

Simulating Translation (cont'd)
    Make a generator function which returns values from a series it computes

        def aa_generator(rna):
            return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

        def translation(rna):
            gen = aa_generator(rna)
            protein = ''
            aa = next(gen, None)    # None signals that the generator is exhausted
            while aa:
                protein += aa
                aa = next(gen, None)
            return protein


Mutation

Simulating Mutation
    Make a function to simulate single point mutations in a DNA sequence

        import random

        def mutation(dna):
            position = random.randint(0, len(dna) - 1)
            bases = 'ACGT'
            new_base = bases[random.randint(0, 3)]
            # strings are immutable, so build a new string instead of assigning to a slice
            dna = dna[:position] + new_base + dna[position+1:]
            return dna

    To guarantee that the new base differs from the original one, remove the original base before choosing:

            bases = bases.replace(dna[position], '')
            new_base = bases[random.randint(0, 2)]


Questions?

Lecture slides are found on the course website, web.ecs.baylor.edu/faculty/cho/3360
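As a supplement to the Mutation slide, here is a minimal sketch of a single point mutation that builds a new string by slicing (Python strings cannot be modified in place) and always picks a base different from the original one. The names point_mutation and hamming_distance, the rng parameter, and the fixed seed are assumptions introduced for illustration; they do not come from the lecture.

```python
import random


def point_mutation(dna, rng):
    """Return a copy of dna with exactly one base changed to a different base."""
    position = rng.randrange(len(dna))
    # Candidate bases exclude the one currently at the chosen position
    choices = [base for base in 'ACGT' if base != dna[position]]
    return dna[:position] + rng.choice(choices) + dna[position + 1:]


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)   # fixed seed so repeated runs give the same mutant
original = 'ATCGATGA'
mutant = point_mutation(original, rng)
print(original, mutant, hamming_distance(original, mutant))
```

Passing the random generator as an argument keeps the function reproducible in tests, and the Hamming distance check confirms that exactly one position changed.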