CSC 4630
Meeting 9
February 14, 2007
Valentine’s Day; Snow Day

Last of awk
• Quick review of scripting languages, and more generally, programming languages
  – Built-in variables
  – Variable typing
  – Implicit control structure of program
  – Assignment statements and operations
  – Control structures

Next Week and Next Next Week
• Exam 1: Monday, February 26
• Project 2: Wednesday, February 28

Last of awk (2)
• Control structures
• Arrays
• Formatted printing
• Subtleties and intricacies

Control Structures
• if (<expression>) <s1> else <s2>
  <expression> can be any expression; true means a non-zero number or a non-null (non-empty) string
  <s1> and <s2> can each be a single statement or a group of statements enclosed in braces
  Note the critical parentheses that separate the conditional expression from <s1>

Control Structures (2)
• while (<expression>) <s1>
  Same rules as for if-then-else

Control Structures (3)
• for (<e1>;<e2>;<e3>) <s1>
  is equivalent to
  <e1>; while (<e2>) {<s1>;<e3>}
• <e1> initializes the loop variable
• <e2> checks the loop variable for termination
• <e3> changes the value of the loop variable
• for (k in <array>) <s1>
  loops over the subscripts of an array, but the order in which the subscripts are visited is unspecified (it depends on the implementation).
  Careful: awk allows general subscripting. Strings can be used as subscripts.
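The control structures above can be tried out with a small sketch like the one below. It is only an illustration: the function names word_counts and sum_to and the sample input are made up for this example and are not part of the course material. The output of the for (k in <array>) loop is piped through sort because awk does not promise any particular subscript order.

```shell
# Hypothetical helper: count how often each word occurs on standard input.
word_counts() {
  awk '
  {
    # for (<e1>;<e2>;<e3>) <s1>: step through the fields of the current record
    for (i = 1; i <= NF; i++)
      count[$i]++            # strings serve as array subscripts
  }
  END {
    # for (k in <array>) <s1>: the visiting order is not guaranteed
    for (k in count) {
      # if (<expression>) <s1> else <s2>: the parentheses are required
      if (count[k] > 1)
        print k, "repeated", count[k]
      else
        print k, "once"
    }
  }' | sort                  # sort makes the output order predictable
}

# Hypothetical helper: a counting loop written in the while form,
# <e1>; while (<e2>) {<s1>;<e3>}
sum_to() {
  awk -v n="$1" 'BEGIN {
    k = 1; total = 0
    while (k <= n) { total += k; k++ }
    print total
  }'
}

printf 'a b a\nc b a\n' | word_counts
sum_to 4
```

With the sample input shown, word_counts prints "a repeated 3", "b repeated 2", and "c once" on separate lines, and sum_to 4 prints 10.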
Control Structures (4)
“Go to” structures
• break
  when executed within a for or while statement, causes an immediate exit
• continue
  when executed within a for or while statement, causes immediate execution of the next iteration
• next
  causes the next line (record) of the input file to be read and the sequence of pattern {action} statements executed on it
• exit
  causes the program to jump to the END pattern, execute it, and stop

Practice Time
• We’ll use pair programming
  – Pair up by twos
  – One person is in control of the keyboard
  – Sketch the features of the program
  – Test as you go

awk Practice: Example 1
Input: A file containing syntactically correct North American telephone numbers in the form XXX-XXX-XXXX
Output: A file containing the numbers from the input file formatted as international numbers, namely +1.XXX.XXX.XXXX
Test file: Create your own

awk Practice: Example 2
Input: A file, each line of which supposedly contains a North American style telephone number
Output: The input file cleaned of bad numbers, inappropriate lines, and empty lines. Each correct number formatted as XXX-XXX-XXXX
Test Input: /mnt/a/beck/samples/phonenumbers
Notes:
  – Program must handle arbitrary input files
  – Start simple, add features as you investigate

awk Practice: Example 3
Input: A file in the same form as for Example 2.
Output: The input file cleaned and correct numbers formatted in international format, +1.xxx.xxx.xxxx

awk Practice: Example 4
The website flightaware.com gives the departure and arrival history of commercial airline flights, among other things. You can easily extract the history to a text file by cutting and pasting. But then the file needs to be cleaned and reformatted to be useful.
Input: A flight history file from flightaware.com, e.g. /mnt/a/beck/samples/flight1931

Example 4 (2)
Output: Data from the input file involving one leg of the flight (use PHL to ATL), one line per day, fields separated by :: .
Fields are date, departure time, arrival time, elapsed time. Include a header line that contains the flight number (1931 for the sample), origin (PHL), and destination (ATL). Include a second header line that labels the data columns.

awk Practice: Example 5
Computations involving flight data.
Input: Cleaned flight data file (the output file from Example 4)
Output: Earliest and latest departure, earliest and latest arrival, shortest elapsed time, longest elapsed time, average elapsed time.
Notes: Programs from Examples 4 and 5 should work with any set of flight data.

awk Practice: Example 6
DNA to protein translation
  – In the computational biology world it is well known that each triple of bases along a DNA segment translates to one of the 20 amino acids, which are the building blocks for proteins.
Input: A DNA sequence
Output: The corresponding amino acid sequence

Project 2
• Due Wednesday, February 28
• Part 1
  – Implement an improved version of mobylex entirely in awk. The program should take a file containing a chapter of the text and return the lexicon with frequency counts sorted in decreasing order of frequency.

Project 2 (2)
  – Notes on Part 1
    • Include one title line giving chapter number and title
    • All trailing punctuation should be removed
    • All initial capitalization should be removed
    • No numbers in lexicon
    • Compound words should be retained
  – Desirable features
    • Remove contractions and spell them out
    • Remove possessive constructions. The ’s should not be counted as a different word.
    • Retain capitalized proper names

Project 2 (3)
• Part 2
  – Add summary statistics to the mobylex program that give
    • Total number of words in chapter
    • Number of different words in chapter
    • Average word length (number of characters) (taken over distinct words)
    • Maximum word length
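The four summary statistics listed for Part 2 can be computed along the following lines. This is a rough sketch under simplifying assumptions, not the expected solution: the function name chapter_stats is hypothetical, every word is lowercased (so proper names are not retained), only trailing punctuation and digits are stripped, and contractions, possessives, and the title line are not handled at all.

```shell
# Hypothetical helper: print summary statistics for text on standard input.
chapter_stats() {
  awk '
  {
    for (i = 1; i <= NF; i++) {
      w = tolower($i)               # simplification: lowercase everything
      gsub(/[^a-z]+$/, "", w)       # strip trailing punctuation and digits
      if (w == "") continue         # skip fields that were only numbers or punctuation
      total++                       # every remaining word counts toward the total
      seen[w]++                     # string subscripts record the distinct words
    }
  }
  END {
    # Average and maximum length are taken over distinct words
    for (w in seen) {
      distinct++
      chars += length(w)
      if (length(w) > maxlen) maxlen = length(w)
    }
    printf "total words: %d\n", total
    printf "distinct words: %d\n", distinct
    printf "average length: %.2f\n", (distinct ? chars / distinct : 0)
    printf "maximum length: %d\n", maxlen
  }'
}

printf 'The quick fox saw the lazy dog. The dog slept!\n' | chapter_stats
```

For the sample sentence this reports 10 total words, 7 distinct words, an average length of 3.71, and a maximum length of 5. The statistics do not depend on the order in which for (w in seen) visits the subscripts, so no sorting is needed here.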