Basic Computing Concepts for Bioinformatics
Dr Richard White

Basic computing concepts
• “Basic Computing concepts” sounds a bit scary. I hope you’ll find it isn’t really.
• Actually some of the “Basic Computing Concepts” you’ll be familiar with already.

What does the average biologist use computers for?
• Browsing the Web, searching with Google, etc.
• Email
• Word-processing for reports, etc. (e.g. MS Word)
• Data handling and simple statistics (e.g. Excel)
• Playing CDs, games, etc. (no, of course not, only joking…)
• So you’re probably quite experienced in computer use already
• Databases? The use of bioinformatics databases figures prominently in this course.

What should biologists use computers for?
• Access to biological databases, especially those containing bioinformatics information
• Visualisation: ways to understand data better by visual exploration
• Analysing data, especially to test hypotheses (to understand biology better)

Computer use during this course
• Mostly we’ll be concerned with access to biological databases
• Also some visualisation sessions and maybe some data analysis and hypothesis testing

Using predefined tools
• You’ll be doing a lot of this work using tools available on the web.
  – This makes life easy, because the hard work of setting these tools up for use has already been done by someone else.
• However, sometimes it’s useful to get your hands dirty and mess about with the data and ways to process it yourself,
  – especially if you want to do something that zillions of other people haven’t already thought of.

Use of databases
• I’ll be running a session on the use of databases in week 4, but at the moment I want to think about this in order to discover some Basic Computing Concepts.
• First, let’s consider the characteristics of databases for a moment.

Simple database concepts
Computers allow the analysis of large data sets.
These are frequently arranged as two-dimensional data tables, based on the convention that
– each row holds information on a separate object (or abstract entity such as a species),
– each column holds information on a particular property or characteristic of the objects,
– in general there will be a single value in each cell of the table, representing the value of a specific characteristic for one particular object.

Spreadsheets
• Data in the form of two-dimensional tables is frequently analysed using computer spreadsheet programs such as Microsoft Excel, especially where the purpose is
  – relatively simple data reorganisation,
  – summarisation,
  – statistical testing,
  – report generation.

Databases
• It is becoming harder to distinguish between spreadsheet and database programs.
• Most databases require more than one table: for example, one table may store data about proteins and another table may store data about the species these proteins are found in.
• For more about database systems, see the PowerPoint presentation (DatabaseIntroduction.ppt) on my web site (see handout for details).

Methods for using databases
• What methods exist to use databases?
• Basically there are several approaches to the use of databases:

Database use 1: direct access to database tables
• Run your own database on your own computer (e.g. MS Access)
• Use a program on your PC which gives you direct access to the tables in the remote database (client-server database access)
In both cases, you need to know what the tables are and what they contain, and you need a query language such as SQL to work with them.

SQL statements
• SQL (“Structured Query Language”) is a language for specifying the creation of databases and the updating and retrieval of information in them. It is general and “portable”, so that it can be used with a variety of different database systems without having to learn a new language for each one.
• The language goes far beyond the scope of this course.
Briefly, it can be used to:
  – Specify the tables in the database and the fields (columns) they contain
  – Make additions and updates to the data in those tables
  – Retrieve information from one or more of the tables

SQL for data retrieval
• A typical SQL statement for data retrieval would look something like this:
    SELECT <some fields> FROM <table> WHERE <condition>;
• The condition effectively selects certain rows from the table.
• Thus the result is often a smaller table than the one being queried.
• Tables can be “joined” together to combine information from more than one table, for example when extracting a molecular sequence from one table and the bibliographic details of the reference where it was published from another table.

Database use 2: predefined operations
Alternatively, you might have forms and queries already set up for you, which you can just run in order to perform predefined kinds of searches. These predefined operations can be made directly available to you by:
• Browsing a web page, typically containing a form, which gives you access to a database somewhere else. You’ve done this if you’ve ever bought anything on the Internet.
• Using or even writing a small program (sometimes called a “script” to make it seem less scary) to fetch the data for you. This allows you to process the data in useful ways:
  – to search for features you’re interested in,
  – to summarise the data in the way you want, or
  – to extract data for statistical analysis to test hypotheses.

Database use 3: using predefined operations
The predefined operations may be packaged as CGI programs or Web Services or in a variety of other ways, but basically you just send a request to the service, optionally with some ‘parameters’ to specify what you want, and wait for the reply. The reply usually comes back
• in HTML (as a web page containing the data requested) or
• as some other sort of file to be downloaded (i.e.
stored on your PC), either
  – in one of a number of formats invented by the data providers, or
  – in XML, a standard but flexible (and verbose) way to structure a data file, so that other programs (rather than humans) can process it easily.

Overview of NCBI Entrez
In a later session, you’ll be introduced to a number of bioinformatics databases, but it’s worth spending a moment looking at a popular way to make use of some of them, because you will explore this in Practical 2 in week 4 of this course.
• NCBI web site
• Entrez utilities

Brief introduction to Perl programming
(What? In ten minutes??)
This will help you prepare for Practical 2 (the practical part of the 4th week of the course), in which we shall use simple Perl programs to request data from a bioinformatics information provider such as NCBI, by connecting with their Entrez utilities. (Additional Perl tutorial material may be made available.)
• What is a Perl program? (or “script”)
• How to run one
• How to write one
• What do you need?
  – See the handout

A computer program
A program is a set of instructions to the computer, such as
• Get input from user
• Perform calculation
• Display window
• React to mouse click
These are instructions at a very high level. They need to be broken down into smaller details. A program consists of combinations of:
• Sequences of instructions (statements)
• Repetitions (to execute statements repeatedly)
• Selections (to choose which statements to execute)
• Functions (subroutines or methods: groups of instructions)

A simple program
• Here is a simple Perl program.
    #!/usr/local/bin/perl
    # Program to do the obvious
    print 'Hello world.';
• The first line: Perl programs conventionally start with this as their very first line, although the path may vary from system to system, and on some systems the line is not used at all. It tells the machine what to do with the file when it is executed (it tells it to run the file through the Perl software).
• Everything which is not a comment is a Perl statement, which must end with a semicolon, like the last line above.
• So the next thing to do is to run it.

Running the program
• Type in the example program using a text editor, and save it in a file called something.pl.
• Now to run the program just type the following at the Command Prompt.
    perl something.pl
• If something goes wrong then you may get error messages, or you may get nothing at all.

Perl programming concepts: variables
Variables can hold both strings and numbers. For example, the statement
    $priority = 9;
sets the scalar variable $priority to 9, but you can also assign a string to exactly the same variable:
    $priority = 'high';
• In general variable names consist of numbers, letters and underscores, but they should not start with a number. Perl is case sensitive, so $a and $A are different variables.

Operations and Assignment
Perl uses all the usual arithmetic operators:
    $a = 1 + 2;   # Add 1 and 2 and store in $a
    $a = 3 - 4;   # Subtract 4 from 3 and store in $a
    $a = 5 * 6;   # Multiply 5 and 6
    $a = 7 / 8;   # Divide 7 by 8 to give 0.875
etc., and for strings Perl has the following among others:
    $a = $b . $c; # Concatenate $b and $c

Array variables
A slightly more interesting kind of variable is the array variable, which is a list of scalars (single values, i.e. numbers and strings). Array variables have the same format as scalar variables except that they are prefixed by an @ symbol. The statement
    @food = ("apples", "pears", "eels");
assigns a three-element list to the array variable @food. The array is accessed by using indices starting from 0, and square brackets are used to specify the index. The expression
    $food[2]
returns eels. Notice that the @ has changed to a $ because $food[2] and eels are scalars, not arrays.

File handling
Here is a basic Perl program which does the same as the UNIX cat or DOS/Windows type command on a certain file.
    #!/usr/local/bin/perl
    # Program to open the password file, read it in,
    # print it, and close it again.
    $file = '/etc/passwd';   # Name the file
    open(INFO, $file);       # Open the file
    @lines = <INFO>;         # Read it into an array
    close(INFO);             # Close the file
    print @lines;            # Print the array

Control structures
Perl supports lots of different kinds of control structures. Have a look at the Perl resources listed on the handout. Most Perl programs use these features.
• Programs can choose between alternative branches
• Programs can repeat statements until something happens
• Frequently used statements to carry out some common task can be made into a “subroutine” or “function” and called from other parts of the program
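
The three ideas listed under “Control structures” (choosing between branches, repeating statements, and packaging statements as a subroutine) can be sketched in one short Perl program. This is only an illustrative sketch, not part of the course material: the subroutine name gc_fraction and the example sequences are made up for the purpose, and the lines “use strict”, “use warnings” and the keyword “my” are common Perl practice that the slides above do not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Subroutine (hypothetical example): returns the fraction of
# G and C letters in a DNA string, or 0 for an empty string.
sub gc_fraction {
    my ($seq) = @_;
    my $gc = 0;
    # Repetition: look at each letter of the sequence in turn
    foreach my $base (split //, uc($seq)) {
        # Selection: count the letter only if it is G or C
        if ($base eq 'G' || $base eq 'C') {
            $gc++;
        }
    }
    return length($seq) ? $gc / length($seq) : 0;
}

# Made-up sequences, stored in an array variable
my @sequences = ('ATGC', 'GGCC', 'ATAT');

# Repeat the same statements for every element of the array
foreach my $seq (@sequences) {
    # Choose between two alternative branches
    if (gc_fraction($seq) > 0.5) {
        print "$seq is GC-rich\n";
    } else {
        print "$seq is not GC-rich\n";
    }
}
```

Saved in a file and run with the perl command as described under “Running the program”, this should report only GGCC as GC-rich, since ATGC has a fraction of exactly 0.5 and the test is “greater than 0.5”.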