Wednesday #4

5m Announcements and questions

20m Tab- and comma-delimited text files
    Tab-delimited text files are ubiquitous in bioinformatics (and science generally)
        Referred to variously as TSV (tab-separated value) or CSV (comma-separated value) files
        Can be created or read by Excel et al.; simple tab-delimited text is easy to handle using split( "\t" )
        But what about quotes, escape characters, and the like?
    import csv
    csv.reader( open( "filename" ), csv.excel_tab )
        Used like an extra layer around open( "filename" ), but automatically strips, splits, and unquotes lines for you
        The second (optional) argument specifies the "dialect" of input, typically csv.excel_tab
            This means tab-delimited, optionally double-quoted values
        Note that you cannot call csv.reader( "filename" )
            It will produce strange results and try to read the string as tab-delimited data, not the file
            But in typical Python fashion, it won't throw an error; programmer beware
    TSV input
        for astrLine in csv.reader( open( "filename" ), csv.excel_tab ):
            do stuff
    TSV output
        ostm = open( "filename", "w" )
        csvw = csv.writer( ostm, csv.excel_tab )
        csvw.writerow( astrRow )
        ostm.close( )    # close the file itself; csv.writer objects have no close( ) method

15m Processing gff files
    The Generic Feature Format (GFF) is a type of structured tab-delimited text file describing sequence annotations
        These might be binding sites, open reading frames, exons, or other annotations within a genome
    Each line describes one feature and consists of nine important columns:
        The name of the sequence in which the feature occurs, typically a chromosome
        The source of the feature annotation (who created it, e.g. a program or organization)
        The type of feature being described
        The start and stop locations of the feature (two columns)
        The score (confidence/weight/etc.)
        The strand (+ or -)
        The frame (for coding sequences)
        The "group" of a feature, for associating multiple related annotations (e.g. exons in a gene)
    for astrLine in csv.reader( open( "filename.gff" ), csv.excel_tab ):
        # skip blank lines, "#" comment lines, and rows with fewer than nine columns
        if ( len( astrLine ) < 9 ) or astrLine[0].startswith( "#" ):
            continue
        strSeq, strSource, strFeature, strStart, strEnd, strScore, strStrand, \
            strFrame, strGroup = astrLine[:9]
        iStart, iEnd = int(strStart), int(strEnd)
        # the score is a decimal number, or "." when no score is given
        dScore = None if ( strScore == "." ) else float(strScore)
        fStrand = ( strStrand == "+" )
        if strFeature == "CDS":
            # signed length: positive on the + strand, negative otherwise
            print( ( iEnd - iStart ) if fStrand else ( iStart - iEnd ) )

10m Processing fasta files
    import re
    import sys

    # start with an empty record so the first header line has nothing to flush
    strID = strSeq = ""
    for strLine in sys.stdin:
        mtch = re.search( r'^>\s*(.*)$', strLine )
        if mtch:
            # a new header: print the previous record's ID and first ten residues
            if strSeq:
                print( "\t".join( [strID, strSeq[:10]] ) )
            strID = mtch.group( 1 )
            strSeq = ""
        else:
            strSeq += strLine.strip( )
    # print the final record, which has no following header to trigger it
    if strSeq:
        print( "\t".join( [strID, strSeq[:10]] ) )

15m sql
    A typical database can be thought of as a group of tables
        Each table describes one type of entity (e.g. a sequence feature in a GFF table)
        Entities in different tables are related using unique identifiers called keys
            These are very much like keys in a dictionary
            They identify the "value" of the entire table row, i.e. one entity
    An example: microarrays
        A microarray experiment typically consists of several assays
        Each assay is described by some metadata detailing the experimental conditions
        Each assay results in a set of expression values, one per probe
        Each assay is performed using a specific microarray platform
        Each platform contains a fixed set of probes, most of them associated with one or more genes
        "Experiment", "Metadata", "Data", and "Platform" might each be a table
            Each experiment's unique ID associates with multiple assay IDs
            Each assay ID occurs once in metadata and once in data
            Each assay ID is paired with one platform ID
            Each platform ID has one list of probes (and genes)
    Python has tools for accessing MySQL or SQLite databases (sqlite3 is in the standard library; MySQL needs a third-party module)
        The former is useful for "big", real databases, kept in a server
        The latter is useful for "small", customized data, kept in a special file on disk
        Both rely on a special database description and query language called SQL
            Not going to teach you SQL!
            Again, just a quick overview to get a flavor for interfacing in Python
    An example: SQLite databases in Python
        import sqlite3
        pDB = sqlite3.connect( "filename" )
        pDB.execute( "CREATE TABLE genes (id INT, name TEXT)" )
        pDB.execute( "CREATE TABLE assays (id INT, name TEXT)" )
        pDB.execute( "CREATE TABLE ppis (id1 INT, id2 INT, assay INT)" )
        for aValues in [[1, "TP53"], [2, "RAD51"], [3, "BRCA1"], [4, "BRCA2"]]:
            pDB.execute( "INSERT INTO genes VALUES (?, ?)", aValues )
        for aValues in [[1, "y2h"], [2, "immuno"]]:
            pDB.execute( "INSERT INTO assays VALUES (?, ?)", aValues )
        for aiValues in [[1, 2, 1], [1, 3, 2], [2, 3, 1]]:
            pDB.execute( "INSERT INTO ppis VALUES (?, ?, ?)", aiValues )
        pDB.commit( )
        pDB.close( )
        ...
        pDB = sqlite3.connect( "filename" )
        for pRow in pDB.execute( "SELECT * FROM ppis JOIN genes ON " +
            "(ppis.id1 = genes.id OR ppis.id2 = genes.id) WHERE genes.name = 'RAD51'" ):
            print( pRow )

15m Downloading files
    Python provides two interfaces for downloading web and FTP files, urllib and urllib2
        Both have some advantages/disadvantages, and they change a bit between Python 2.6/2.7/3.0
        In Python 3 the two are merged into urllib.request, which provides both urlretrieve and urlopen
    import urllib
    urllib.urlretrieve( "http://path/to/file", "output_file" )
        Downloads the requested file to a local target
        Can also handle ftp:// URLs
    import urllib2
    urllib2.urlopen( "http://path/to/file" )
        Exactly like open( "filename" ), but resource can be either local or (normally) remote
        Usage pattern #1:
            for strLine in urllib2.urlopen( "http://path/to/file" ):
                do stuff
        Usage pattern #2:
            istm = urllib2.urlopen( "http://path/to/file" )
            for strLine in istm:
                do stuff
            istm.close( )

15m Executing system commands
    How can you make Python run another program?
    How can you capture the output of another program in Python?
    This has changed a bit from Python 2.6 to 2.7 to 3.0; we'll cover 2.7 for safety, and the later versions make it easier
    import subprocess
    subprocess.call( ["program", "arg1", "arg2"] )
        Runs program, waits for it to finish, and displays the output to the screen
        Note that the expected format is an array of the command and all arguments, not a single string!
        You can (usually) run a single string command instead as:
            subprocess.call( "program arg1 arg2", shell = True )
    strOutput = subprocess.check_output( ["program", "arg1", "arg2"] )
        Runs program, waits for it to finish, returns output as a string (a bytes object in Python 3)
        Can be combined with the shell = True trick to pass a single command string
        Only exists in Python 2.7 and 3.1 or later; the workaround in 2.6 is gross:
            pProc = subprocess.Popen( ["program", "arg1", "arg2"], stdout = subprocess.PIPE )
            strOutput = pProc.communicate( )[0]
    Can be used to run any command line process - we'll see some that are particularly useful next week

Reading
    Python XML: Model, Chapter 8 p300-309
    Python downloads: Model, Chapter 9 p325-337
    Python databases: Model, Chapter 10 p359-398

15m xml
    Structured markup language (like HTML) used for various bio data stores
    Hierarchical <tag attr="value">text</tag> structure
    Not going to go into detail here, but as an example:
        <file>
            <entry id="one">
                <type name="foo"/>
                <text>text one</text>
            </entry>
            <entry id="two">
                <type name="bar"/>
                <text>text two</text>
            </entry>
        </file>
    Each tag is a node with zero or more key/value attributes, zero or more children, and zero or one text values
    import xml.dom.minidom    # a bare "import xml" does not load the minidom submodule
    pDoc = xml.dom.minidom.parse( "file.xml" )
    for pNode in pDoc.childNodes:
        if pNode.nodeType == pNode.TEXT_NODE:
            print( pNode.data )
        else:
            print( pNode.nodeName )
    for pNode in pDoc.getElementsByTagName( "entry" ):
        for strKey, strValue in pNode.attributes.items( ):
            print( [strKey, strValue] )
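
A minimal sketch of pulling all three kinds of values (attributes, children, text) out of the example document above. It assumes the XML is held in a string rather than in file.xml, and that every entry has exactly one type and one text child, as in the example. The names XML_TEXT and summarize_entries are made up for illustration; parseString, getElementsByTagName, and getAttribute are standard minidom calls.

```python
import xml.dom.minidom

# The example document from the notes, inlined so the sketch is self-contained.
XML_TEXT = """<file>
  <entry id="one"><type name="foo"/><text>text one</text></entry>
  <entry id="two"><type name="bar"/><text>text two</text></entry>
</file>"""


def summarize_entries(xml_text):
    """Return one (id, type name, text) tuple per <entry> element.

    Hypothetical helper: assumes each entry has one <type> and one <text> child.
    """
    doc = xml.dom.minidom.parseString(xml_text)
    rows = []
    for entry in doc.getElementsByTagName("entry"):
        # Attributes are read directly from the element node.
        entry_id = entry.getAttribute("id")
        type_name = entry.getElementsByTagName("type")[0].getAttribute("name")
        # Text is not stored on the <text> element itself but in its child text nodes.
        text_element = entry.getElementsByTagName("text")[0]
        text = "".join(
            child.data
            for child in text_element.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        rows.append((entry_id, type_name, text))
    return rows


if __name__ == "__main__":
    for row in summarize_entries(XML_TEXT):
        print("\t".join(row))
```

The point to notice is that an element's text is a separate child node, which is why the first loop in the notes checks nodeType before reading data.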