W04 Notes: Network resources and data stores

Wednesday #4
5m Announcements and questions
20m
Tab- and comma-delimited text files
Tab-delimited text files are ubiquitous in bioinformatics (and science generally)
Referred to variously as TSV (tab-separated value) or CSV (comma-separated value) files
Can be created or read by Excel and similar programs; simple tab-delimited text is easy to handle using split( "\t" )
But what about quotes, escape characters, and the like?
import csv
csv.reader( open( "filename" ), csv.excel_tab )
Used just like iterating over open( "filename" ), but each line arrives with its newline stripped, split into a list of values, and unquoted for you
The second (optional) argument specifies the "dialect" of input, typically csv.excel_tab
This means tab-delimited, optionally double-quoted values
Note that you cannot call csv.reader( "filename" )
It will produce strange results, treating the string itself as the data (one character per line) rather than opening the file
But in typical Python fashion, it won't throw an error; programmer beware
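The pitfall above is easy to see on a small case. A minimal sketch, assuming Python 3 (the variable name aastrRows is only illustrative): handing csv.reader the file name itself makes it iterate over the characters of the string.

```python
import csv

# Passing a file name (a string) instead of an open file: csv.reader
# iterates over the string, so each character becomes its own "line"
aastrRows = list( csv.reader( "filename", csv.excel_tab ) )
print( aastrRows )
```

No error is raised; the result is simply one single-column row per character.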
TSV input
for astrLine in csv.reader( open( "filename" ), csv.excel_tab ):
    pass # do stuff with astrLine, a list of column values
TSV output
ostm = open( "filename", "w" )
csvw = csv.writer( ostm, csv.excel_tab )
csvw.writerow( astrRow )
ostm.close( ) # the writer has no close( ); close the underlying file instead
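As a rough illustration of why the csv module is worth using over split( "\t" ), the sketch below writes rows containing a tab and double quotes and reads them back unchanged. It assumes Python 3; the scratch file name example.tsv and the sample rows are made up for the example.

```python
import csv
import os
import tempfile

# Rows containing a tab and double quotes, which a plain split( "\t" ) would mangle
aastrRows = [["gene", "note"], ["TP53", "has\ta tab"], ["BRCA1", 'a "quoted" word']]

# Hypothetical scratch file; any writable path would do
strFile = os.path.join( tempfile.mkdtemp( ), "example.tsv" )

# newline = "" lets the csv module control line endings itself (Python 3)
with open( strFile, "w", newline = "" ) as ostm:
    csvw = csv.writer( ostm, csv.excel_tab )
    for astrRow in aastrRows:
        csvw.writerow( astrRow )

with open( strFile, newline = "" ) as istm:
    aastrRead = list( csv.reader( istm, csv.excel_tab ) )

# The rows read back should match the rows written
print( aastrRead == aastrRows )
```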
15m
Processing GFF files
The Generic Feature Format (GFF) is a type of structured tab-delimited text file describing sequence annotations
These might be binding sites, open reading frames, exons, or other annotations within a genome
Each line describes one feature and consists of nine important columns:
The name of the sequence in which the feature occurs, typically a chromosome
The source of the feature annotation (who created it, e.g. a program or organization)
The type of feature being described
The start and stop locations of the feature
The score (confidence/weight/etc.)
The strand (+ or -)
The frame (for coding sequences)
The "group" of a feature, for associating multiple related annotations (e.g. exons in a gene)
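import csv
for astrLine in csv.reader( open( "filename.gff" ), csv.excel_tab ):
    if ( len( astrLine ) < 9 ) or astrLine[0].startswith( "#" ):
        continue # skip comment and malformed lines
    strSeq, strSource, strFeature, strStart, strEnd, strScore, strStrand, \
        strFrame, strGroup = astrLine[:9]
    iStart, iEnd = int(strStart), int(strEnd)
    # The score is a floating point value, or "." when absent
    dScore = None if ( strScore == "." ) else float(strScore)
    fStrand = ( strStrand == "+" )
    if strFeature == "CDS":
        # Signed span: positive on the + strand, negative on the - strand
        print( ( iEnd - iStart ) if fStrand else ( iStart - iEnd ) )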
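A related sketch: totaling feature lengths by type from GFF text held in memory. It assumes the usual GFF convention that coordinates are 1-based and inclusive with start no greater than end, so a feature's length is end - start + 1. The sample lines and the name hashLengths are invented for illustration.

```python
import csv
import io

# Hypothetical GFF excerpt: one header comment and three features
strGFF = "\n".join( [
    "##gff-version 3",
    "chrI\texample\tCDS\t101\t200\t.\t+\t0\tID=cds1",
    "chrI\texample\tCDS\t301\t350\t.\t-\t0\tID=cds2",
    "chrI\texample\texon\t101\t400\t.\t+\t.\tID=exon1",
] ) + "\n"

# Map each feature type to the total number of bases it covers
hashLengths = {}
for astrLine in csv.reader( io.StringIO( strGFF ), csv.excel_tab ):
    if ( len( astrLine ) < 9 ) or astrLine[0].startswith( "#" ):
        continue # skip comment and malformed lines
    strFeature, strStart, strEnd = astrLine[2:5]
    iLength = int( strEnd ) - int( strStart ) + 1
    hashLengths[strFeature] = hashLengths.get( strFeature, 0 ) + iLength
print( hashLengths )
```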
10m
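Processing FASTA files
import re
import sys

# Start with no record so the first header line does not fail
strID = strSeq = ""
for strLine in sys.stdin:
    mtch = re.search( r'^>\s*(.*)$', strLine )
    if mtch:
        # A new header: print the previous record's ID and first 10 bases
        if strSeq:
            print( "\t".join( [strID, strSeq[:10]] ) )
        strID = mtch.group( 1 )
        strSeq = ""
    else:
        strSeq += strLine.strip( )
# Print the final record, which has no header after it
if strSeq:
    print( "\t".join( [strID, strSeq[:10]] ) )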
15m
SQL
A typical database can be thought of as a group of tables
Each table describes one type of entity (e.g. a sequence feature in a GFF table)
Entities in different tables are related using unique identifiers called keys
These are very much like keys in a dictionary
They identify the "value" of the entire table row, i.e. one entity
An example: microarrays
A microarray experiment typically consists of several assays
Each assay is described by some metadata detailing the experimental conditions
Each assay results in a set of expression values, one per probe
Each assay is performed using a specific microarray platform
Each platform contains a fixed set of probes, most of them associated with one or more genes
"Experiment", "Metadata", "Data", and "Platform" might each be a table
Each experiment's unique ID associates with multiple assay IDs
Each assay ID occurs once in metadata and once in data
Each assay ID is paired with one platform ID
Each platform ID has one list of probes (and genes)
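The four tables above might be laid out as in the sketch below, using an in-memory SQLite database. All table names, column names, and values here are hypothetical, chosen only to show how the assay and platform IDs act as keys linking the tables.

```python
import sqlite3

pDB = sqlite3.connect( ":memory:" )
pDB.executescript( """
    CREATE TABLE experiment (id INT, assay INT);
    CREATE TABLE metadata (assay INT, platform INT, description TEXT);
    CREATE TABLE data (assay INT, probe TEXT, level REAL);
    CREATE TABLE platform (id INT, probe TEXT, gene TEXT);
""" )
# One experiment (1) with two assays (10, 11), both on platform 100
pDB.executemany( "INSERT INTO experiment VALUES (?, ?)", [[1, 10], [1, 11]] )
pDB.executemany( "INSERT INTO metadata VALUES (?, ?, ?)",
    [[10, 100, "control"], [11, 100, "treated"]] )
pDB.executemany( "INSERT INTO platform VALUES (?, ?, ?)",
    [[100, "p1", "TP53"], [100, "p2", "BRCA1"]] )
pDB.executemany( "INSERT INTO data VALUES (?, ?, ?)",
    [[10, "p1", 1.5], [10, "p2", 0.5], [11, "p1", 2.5], [11, "p2", 0.25]] )

# Follow the keys: experiment -> assays -> data, with probes mapped to genes
aRows = pDB.execute( """
    SELECT metadata.description, data.level
    FROM experiment
    JOIN metadata ON metadata.assay = experiment.assay
    JOIN data ON data.assay = metadata.assay
    JOIN platform ON platform.id = metadata.platform AND platform.probe = data.probe
    WHERE experiment.id = ? AND platform.gene = ?
    ORDER BY metadata.assay""", [1, "TP53"] ).fetchall( )
print( aRows )
pDB.close( )
```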
Python has a built-in module for SQLite databases (sqlite3); MySQL needs a separately installed module
MySQL is useful for "big", real databases, kept in a server
SQLite is useful for "small", customized data, kept in a special file on disk
Relies on a special database description and query language called SQL
Not going to teach you SQL!
Again, quick overview to get a flavor for interfacing in Python
An example: SQLite databases in Python
import sqlite3
pDB = sqlite3.connect( "filename" )
pDB.execute( "CREATE TABLE genes (id INT, name TEXT)" )
pDB.execute( "CREATE TABLE assays (id INT, name TEXT)" )
pDB.execute( "CREATE TABLE ppis (id1 INT, id2 INT, assay INT)" )
for aValues in [[1, "TP53"], [2, "RAD51"], [3, "BRCA1"], [4, "BRCA2"]]:
    pDB.execute( "INSERT INTO genes VALUES (?, ?)", aValues )
for aValues in [[1, "y2h"], [2, "immuno"]]:
    pDB.execute( "INSERT INTO assays VALUES (?, ?)", aValues )
for aiValues in [[1, 2, 1], [1, 3, 2], [2, 3, 1]]:
    pDB.execute( "INSERT INTO ppis VALUES (?, ?, ?)", aiValues )
pDB.commit( )
pDB.close( )
...
pDB = sqlite3.connect( "filename" )
for pRow in pDB.execute( "SELECT * FROM ppis JOIN genes ON " +
    "(ppis.id1 = genes.id OR ppis.id2 = genes.id) WHERE genes.name = 'RAD51'" ):
    print( pRow )
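A possible variation on the query above, sketched with an in-memory database so it can be run on its own: joining the genes table twice (under the aliases g1 and g2, which are illustrative names) turns both interactor IDs and the assay ID into readable names.

```python
import sqlite3

# Same tables and rows as the example above, held in memory instead of a file
pDB = sqlite3.connect( ":memory:" )
pDB.execute( "CREATE TABLE genes (id INT, name TEXT)" )
pDB.execute( "CREATE TABLE assays (id INT, name TEXT)" )
pDB.execute( "CREATE TABLE ppis (id1 INT, id2 INT, assay INT)" )
pDB.executemany( "INSERT INTO genes VALUES (?, ?)",
    [[1, "TP53"], [2, "RAD51"], [3, "BRCA1"], [4, "BRCA2"]] )
pDB.executemany( "INSERT INTO assays VALUES (?, ?)", [[1, "y2h"], [2, "immuno"]] )
pDB.executemany( "INSERT INTO ppis VALUES (?, ?, ?)",
    [[1, 2, 1], [1, 3, 2], [2, 3, 1]] )

# Each key in ppis is looked up in the table that defines it
astrPPIs = pDB.execute( """
    SELECT g1.name, g2.name, assays.name FROM ppis
    JOIN genes AS g1 ON ppis.id1 = g1.id
    JOIN genes AS g2 ON ppis.id2 = g2.id
    JOIN assays ON ppis.assay = assays.id
    WHERE g1.name = ? OR g2.name = ?
    ORDER BY ppis.id1, ppis.id2""", ["RAD51", "RAD51"] ).fetchall( )
print( astrPPIs )
pDB.close( )
```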
15m
Downloading files
Python 2 provides two interfaces for downloading web and FTP files, urllib and urllib2
Both have some advantages/disadvantages, and they change between versions: in Python 3 they are merged into urllib.request (urllib.request.urlretrieve, urllib.request.urlopen)
import urllib
urllib.urlretrieve( "http://path/to/file", "output_file" )
Downloads the requested file to a local target
Can also handle ftp:// URLs
import urllib2
urllib2.urlopen( "http://path/to/file" )
Exactly like open( "filename" ), but resource can be either local or (normally) remote
Usage pattern #1:
for strLine in urllib2.urlopen( "http://path/to/file" ):
    pass # do stuff with strLine
Usage pattern #2:
istm = urllib2.urlopen( "http://path/to/file" )
for strLine in istm:
    pass # do stuff with strLine
istm.close( )
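For readers on Python 3, the same usage pattern might look like the sketch below, which uses urllib.request.urlopen. To keep it runnable without a network connection, a scratch file addressed by a file:// URL stands in for the remote resource; the helper name read_lines and the file contents are made up for the example.

```python
import os
import pathlib
import tempfile
import urllib.request

def read_lines( strURL ):
    """Return the contents of a URL as a list of lines without line endings."""
    with urllib.request.urlopen( strURL ) as istm:
        # Lines arrive as bytes in Python 3 and must be decoded
        return [strLine.decode( "utf-8" ).rstrip( "\r\n" ) for strLine in istm]

# Stand-in for a remote resource: a scratch file addressed by a file:// URL
strFile = os.path.join( tempfile.mkdtemp( ), "example.tsv" )
with open( strFile, "w" ) as ostm:
    ostm.write( "TP53\t1\nRAD51\t2\n" )

astrLines = read_lines( pathlib.Path( strFile ).as_uri( ) )
print( astrLines )
```

An http:// or ftp:// URL can be passed to read_lines in the same way.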
15m
Executing system commands
How can you make Python run another program?
How can you capture the output of another program in Python?
This has changed a bit from Python 2.6 to 2.7 to 3.0; we'll cover 2.7 for safety, though later versions make it easier
import subprocess
subprocess.call( ["program", "arg1", "arg2"] )
Runs program, waits for it to finish, and displays the output to the screen
Note that the expected format is an array of the command and all arguments, not a single string!
You can (usually) run a single string command instead as:
subprocess.call( "program arg1 arg2", shell = True )
strOutput = subprocess.check_output( ["program", "arg1", "arg2"] )
Runs program, waits for it to finish, returns output as a string (a bytes object in Python 3)
Can be combined with the shell = True trick to pass a single command string
Only exists in Python 2.7 and 3.1+; the workaround in 2.6 is gross:
pProc = subprocess.Popen( ["program", "arg1", "arg2"],
    stdout = subprocess.PIPE )
strOutput = pProc.communicate( )[0]
Can be used to run any command line process - we'll see some that are particularly useful next week
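A small runnable sketch of both calls, assuming Python 3. Since "program" above is a placeholder, the running Python interpreter itself (sys.executable) is used as the external program here; the one-line scripts passed to it are invented for the example.

```python
import subprocess
import sys

# subprocess.call returns the program's exit status once it finishes
iReturn = subprocess.call( [sys.executable, "-c", "import sys; sys.exit( 3 )"] )

# universal_newlines = True asks for a decoded string rather than bytes
strOutput = subprocess.check_output(
    [sys.executable, "-c", "print( 'hello' )"], universal_newlines = True )

print( [iReturn, strOutput.strip( )] )
```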
Reading
Python XML:
Model, Chapter 8 p300-309
Python downloads:
Model, Chapter 9 p325-337
Python databases:
Model, Chapter 10 p359-398
15m
XML
Structured markup language (like HTML) used for various bio data stores
Hierarchical <tag attr="value">text</tag> structure
Not going to go into detail here, but as an example:
<file>
    <entry id="one">
        <type name="foo"/>
        <text>text one</text>
    </entry>
    <entry id="two">
        <type name="bar"/>
        <text>text two</text>
    </entry>
</file>
Each tag is a node with zero or more key/value attributes, zero or more children, zero or one text
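import xml.dom.minidom # a bare "import xml" does not load the minidom submodule
pDoc = xml.dom.minidom.parse( "file.xml" )
# Walk the document's top-level nodes
for pNode in pDoc.childNodes:
    if pNode.nodeType == pNode.TEXT_NODE:
        print( pNode.data )
    else:
        print( pNode.nodeName )
# Print every attribute of every <entry> tag
for pNode in pDoc.getElementsByTagName( "entry" ):
    for strKey, strValue in pNode.attributes.items( ):
        print( [strKey, strValue] )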
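To show attributes, children, and text together, the sketch below parses the example document from a string with xml.dom.minidom.parseString and collects one tuple per entry. The variable names are illustrative, and it assumes every entry has exactly one type and one non-empty text child, as in the example.

```python
import xml.dom.minidom

strXML = """<file>
  <entry id="one"><type name="foo"/><text>text one</text></entry>
  <entry id="two"><type name="bar"/><text>text two</text></entry>
</file>"""

pDoc = xml.dom.minidom.parseString( strXML )
aEntries = []
for pEntry in pDoc.getElementsByTagName( "entry" ):
    # Attributes are read directly from the node that carries them
    strID = pEntry.getAttribute( "id" )
    strType = pEntry.getElementsByTagName( "type" )[0].getAttribute( "name" )
    # The text inside <text> is held in a child text node
    strText = pEntry.getElementsByTagName( "text" )[0].firstChild.data
    aEntries.append( ( strID, strType, strText ) )
print( aEntries )
```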