File I/O and Regular Expressions
Sandy Brownlee
[email protected]

Outline
• Basic reading / writing of text files in Python
  – Use a library for more complex formats!
  – E.g. openpyxl, python-docx, pypdf2
• Regular Expressions (Regex)
  – Appear in Python, but also in many other contexts
  – Introduction to basic operators and the Python implementation

Text files
• Open file, get handle
• Step through the file
  – Line by line (pointer moves as we read)
  – (bytewise for binary files)
• Close file
  – Releases locks and resources
• Be careful about:
  – Windows / Unix format newlines
  – Character encoding (ASCII is *so* 1980s)

Reading files in Python
• f = open("data.txt", "r")
  – Open file for reading ("w" = writing, "a" = append)
• s = f.readline()
  – Read next line from the file, store in string "s"
• f.write(s + "\n")
  – Write "s" to file, followed by newline character (the file must have been opened with "w" or "a")
  – print(s, file=f) achieves the same in Python 3 (print >> f, s in Python 2)
• f.close()

Reading files (1)
    f = open("data.txt", "r")
    print(f)
    line1 = f.readline()
    line2 = f.readline()
    line3 = f.readline()
    print(line1)
    print(line2 + " - " + line3)
    f.close()
    print("done.")
data.txt
    Name,Room,Phone
    Bob,C11,4445
    Alice,C12,4443
    Jeff,B14,4456
    Jonathan,B16,4452
    Susan,B19,4476
    Betty,AA1,4599
    Sean,AX2,4598
    Wilma,AX3,4578
    Jim,AX5,4590
    Mary,C44,4140
Output (readline() keeps each line's trailing newline, hence the extra line breaks):
    <_io.TextIOWrapper name='data.txt' mode='r' encoding='cp1252'>
    Name,Room,Phone

    Bob,C11,4445
     - Alice,C12,4443

    done.

Reading files (2)
• Pretty ugly, right?
• Use with ... instead of open() and f.close():
  – with open("data.txt") as f:
  – This automatically closes the file after the block
• Use a loop to iterate over the file:
  – for line in f:
• Strip those nasty newlines:
  – line.rstrip()

Reading files (3)
    with open('data.txt') as f:
        print(f)
        for line in f:
            print(line.rstrip())
    print("done.")
data.txt
    Name,Room,Phone
    Bob,C11,4445
    Alice,C12,4443
    Jeff,B14,4456
    Jonathan,B16,4452
    Susan,B19,4476
    Betty,AA1,4599
    Sean,AX2,4598
    Wilma,AX3,4578
    Jim,AX5,4590
    Mary,C44,4140
Output:
    <_io.TextIOWrapper name='data.txt' mode='r' encoding='cp1252'>
    Name,Room,Phone
    Bob,C11,4445
    Alice,C12,4443
    ...
    Jim,AX5,4590
    Mary,C44,4140
    done.

Writing files
    with open('output.txt', "w") as f:
        for i in range(1, 10):
            print("Line " + str(i), file=f)     # Python 3
            # print >> f, ("Line " + str(i))    # Python 2
• What do you expect to be in the file?
output.txt
    Line 1
    Line 2
    Line 3
    Line 4
    Line 5
    Line 6
    Line 7
    Line 8
    Line 9

CSVs
• Comma separated values
  – Text file with rows and columns, data separated by commas
    Name,Room,Phone
    "Lock,Alice",C12,4443
    "Hanson,Jeff",B14,4456
    "Holmes,Jonathan",B16,4452
• Could read each line and use split(",") to break it into a list, but this is quite easy to break!
  – e.g. commas within quotes (like the names above)
• Better to use the Python csv library:
  – csv.reader(file)
  – csv.DictReader(file)
  – csv.writer(file, dialect='excel')
  – csv.DictWriter(file, fieldnames, dialect='excel')

Regular Expressions
• A regular expression (regex) provides a syntax for matching patterns of characters in a string
• You have probably seen a simple version ("wildcards") for file names: *.txt, or when searching in SQL: LIKE 'a%b'
• Regexes are FAR more powerful, as we shall see

Why Do we Need Them?
• Searching:
  – Find all the email addresses in a file
  – Find all the words that have a suffix "ing"
• Verification
  – Check an email address matches the required format
• Manipulation
  – Remove certain characters
  – Change a=1,b=2,c=3 to {"a":1,"b":2,"c":3}

Regex Example
    Naive_MOEAD_unseeded_Dup5_att_1.txt
    Naive_NSGAII_unseeded_Dup5_att_1.txt
    Naive_MOEAD_Dup5_att_1.txt
    Naive_NSGAII_Dup5_att_1.txt
    Bilevel_MOEAD_unseeded_Dup5_att_15.txt
    Bilevel_MOEAD_Dup5_att_15.txt
    Bilevel_NSGAII_unseeded_Dup5_att_15.txt
    …
• Find ^([^.]+).txt, replace with: \1 = read.table("\1.txt")
    Naive_MOEAD_unseeded_Dup5_att_1 = read.table("Naive_MOEAD_unseeded_Dup5_att_1.txt")
    Naive_NSGAII_unseeded_Dup5_att_1 = read.table("Naive_NSGAII_unseeded_Dup5_att_1.txt")
    Naive_MOEAD_Dup5_att_1 = read.table("Naive_MOEAD_Dup5_att_1.txt")
    Naive_NSGAII_Dup5_att_1 = read.table("Naive_NSGAII_Dup5_att_1.txt")
    Bilevel_MOEAD_unseeded_Dup5_att_15 = read.table("Bilevel_MOEAD_unseeded_Dup5_att_15.txt")
    Bilevel_MOEAD_Dup5_att_15 = read.table("Bilevel_MOEAD_Dup5_att_15.txt")
    Bilevel_NSGAII_unseeded_Dup5_att_15 = read.table("Bilevel_NSGAII_unseeded_Dup5_att_15.txt")

Where are They Used?
• Unix has a search tool called grep, which allows you to search files from the command line
• Most programming languages have regex commands or libraries, notably:
  – JavaScript (good for validating form entry)
  – Python (for data wrangling)
  – Java, C#, Perl, Ruby, PHP …
• Many databases support regex search, including MongoDB, MySQL …
• Common in text editors / IDEs (e.g. Eclipse)

I'm Sold - How Do I Use Them?
• We will use a simple text editing program called EditPad (http://www.editpadlite.com)
• It has a regex search facility, so it is good to practice on
• A regular expression is a string of characters that defines what patterns should be matched

Regex Characters
• Want to search for the word "cat"? The regular expression is cat
• But if you want to do more, you need to use a combination of the regex characters:
  \ ^ $ . | ? * + ( ) [ ] { }

Examples
• Here are a few lines of text in EditPad
• Patterns tried on them:
  – Cat matches the literal text Cat
  – c.t matches c, then any one character, then t
  – Dog\d matches Dog followed by a digit
  – \D matches any non-digit character

Anchors
• ^ matches the start of a line
• $ matches the end of a line

Counts
• {} brackets specify a count, e.g. \d{3} matches exactly three digits

Character Sets
• Use [] to signify a set of single characters
• [abc] finds all occurrences of a OR b OR c
• [0-5] finds all occurrences in a range
• [a-fA-F] finds all occurrences in multiple ranges
• Built-in sets include:
  – \d finds digits [0-9]
  – \D finds non-digits [^0-9]
  – \s finds whitespace
  – \S finds non-whitespace

Alternation (OR)
• If you want to search for any member of a list of strings, use |
• cat|dog searches for cat or dog
• Use word boundary \b to search for full words: \b(Cat|Dog)\b
• (a word boundary is the position between a word character and a non-word character, or the start or end of the text)
• Parentheses group the "or" part to mean: wordstart(cat or dog)wordend

Parentheses for Groups
• Use ( ... ) parentheses to group part of a regular expression
• Same logic as with mathematical expressions: (a+b)/c ≠ a+(b/c)
• c(\d{3}) ≠ (c\d){3}
  – c(\d{3}) matches c123
  – (c\d){3} matches c1c2c3

Repetition
• We have already seen {} for counting
• More general counters are:
  – * means zero or more
  – + means one or more
  – ? means zero or one

So how does it work?
• The parser starts on the left of the regex, and on the left of the text, and works along towards the right, eating characters as it goes
    The quick brown fox jumps over the lazy dog
    q.*o
• What do you expect to match in the text?

Greedy / non-greedy
• Repetitions like * and + are greedy
• The regex engine tries to match them as many times as possible
• If later portions of the pattern don't match, the engine will back up and try again
• Non-greedy operators match as little as possible: *? +?
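
The greedy and non-greedy behaviour described above can be tried directly in Python. What follows is only a minimal sketch: it reuses the q.*o pattern and the sample sentence from the slides, and the variable names are illustrative rather than part of the original material. It also checks the c(\d{3}) versus (c\d){3} grouping example.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Greedy: .* runs to the end of the text, then backs up to the LAST "o".
greedy = re.search(r"q.*o", text).group()

# Non-greedy: .*? stops at the FIRST "o" that lets the pattern succeed.
lazy = re.search(r"q.*?o", text).group()

print(greedy)  # quick brown fox jumps over the lazy do
print(lazy)    # quick bro

# Grouping changes what the {3} count applies to.
one_c_three_digits = bool(re.fullmatch(r"c(\d{3})", "c123"))    # True
three_c_digit_pairs = bool(re.fullmatch(r"(c\d){3}", "c1c2c3"))  # True
mismatch = bool(re.fullmatch(r"c(\d{3})", "c1c2c3"))            # False
print(one_c_three_digits, three_c_digit_pairs, mismatch)
```

re.fullmatch is used for the grouping checks so that the whole string has to fit the pattern, which makes the difference between the two groupings visible.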

Negation
• Regular expressions are not naturally good at "not equal" type matches
• [^abc] means "any one character that is not a, b or c", but the same trick does not work for excluding whole words
• Negative look-ahead is one way to achieve negation

Look-ahead
• Exists in many regex implementations
• Look-ahead lets you match something only if it is (or is not) followed by something else
• Positive look-ahead uses (?=...)
• Negative look-ahead uses (?!...) and finds things NOT followed by something else: (?!dog) means "not followed by dog"
• "^((?!Dog).)*$" matches strings that do not contain "Dog": from the start to the end, every character must sit at a position where "Dog" does not begin
• Look-aheads don't "eat" characters in the way the other patterns do

Identifying Groups
• Replacements (see the earlier "Regex Example" slide)
• Fancy searches
  – e.g. find matching pairs of tags in HTML

RegEx in Python
• import re
  – re.search(): find a match anywhere in the string
  – re.match(): only try to match at the start of the string
  – re.findall(): find all matches and return them as a list
  – re.split(): like str.split(), but the separator is a regex pattern
• Backslashes also mean something in Python strings, so use raw strings for tidiness: r"\d+"
• Use flags to modify pattern matching:
  – re.DOTALL: make . match any character, including newlines
  – re.IGNORECASE: do case-insensitive matches
  – re.MULTILINE: multi-line matching, affecting ^ and $
  – re.VERBOSE: enable verbose REs, which can be formatted more neatly

Python RegEx Examples (1)
    import re

    quickfox = "The quick brown fox jumps over the 2 lazy dogs"

    # One-off search: the pattern is compiled behind the scenes
    result = re.search("b.", quickfox)
    print("Result: " + str(result))
    print("Group: " + result.group())

    # Compile once, then reuse the pattern object
    expr = re.compile("b.")
    result = expr.search(quickfox)
    print("Result2: " + str(result))

    # A match object is truthy; no match returns None
    if expr.search(quickfox):
        print("Matched!")
Output (Python 3.7 and later print re.Match instead of _sre.SRE_Match):
    Result: <_sre.SRE_Match object; span=(10, 12), match='br'>
    Group: br
    Result2: <_sre.SRE_Match object; span=(10, 12), match='br'>
    Matched!
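
The look-ahead and group-replacement ideas from the earlier slides can also be tried from Python. The sketch below is illustrative only: the sample strings and variable names are made up for the example, and the negative look-ahead is written with (?!Dog) placed before the dot so that a string starting with "Dog" is rejected too. The last part uses re.sub with numbered groups to do the a=1,b=2,c=3 conversion mentioned under "Why Do we Need Them?".

```python
import re

# Negative look-ahead: accept only strings that never contain "Dog".
# (?!Dog) is checked at each position before a character is consumed.
no_dog = re.compile(r"^((?!Dog).)*$")
samples = ["Cat and Mouse", "Cat and Dog", "Dog days"]  # made-up test strings
kept = [s for s in samples if no_dog.search(s)]
print(kept)  # ['Cat and Mouse']

# Positive look-ahead: capture word stems that are followed by "ing".
# The "ing" itself is checked but not consumed, so it is not in the result.
stems = re.findall(r"\w+(?=ing\b)", "reading and writing files")
print(stems)  # ['read', 'writ']

# Groups in a replacement: \1 and \2 refer to the first and second (...) group.
pairs = "a=1,b=2,c=3"
converted = "{" + re.sub(r"(\w+)=(\d+)", r'"\1":\2', pairs) + "}"
print(converted)  # {"a":1,"b":2,"c":3}
```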

Python RegEx Examples (2)
    import re

    quickfox = "The quick brown fox jumps over\nthe 2 lazy dogs"

    print("A: " + str(re.search("\\d....", quickfox)))
    print("B: " + str(re.search(r"\d....", quickfox)))
    print("C: " + str(re.search("T", quickfox)))
    print("D: " + str(re.search("T", quickfox, re.IGNORECASE)))
    print("E: " + str(re.search(".T", quickfox, re.IGNORECASE)))
    print("F: " + \
          str(re.search(".T", quickfox, re.IGNORECASE | re.DOTALL)))
Output:
    A: <_sre.SRE_Match object; span=(35, 40), match='2 laz'>
    B: <_sre.SRE_Match object; span=(35, 40), match='2 laz'>
    C: <_sre.SRE_Match object; span=(0, 1), match='T'>
    D: <_sre.SRE_Match object; span=(0, 1), match='T'>
    E: None
    F: <_sre.SRE_Match object; span=(30, 32), match='\nt'>

This week's lab
• Open a CSV file
• Print content of CSV to screen
• Do some quick checks on the data using regex
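
One possible shape for the lab steps is sketched below. It is not the lab solution, just a starting point under stated assumptions: the data is held in an io.StringIO stand-in rather than a real file, one phone number is deliberately broken for illustration, and the ROOM and PHONE patterns are guesses at what a valid room and phone number look like, based only on the sample rows in the slides.

```python
import csv
import io
import re

# Stand-in for a real file; in the lab this would be something like
# open("data.txt", newline=""). The "45x9" value is deliberately invalid.
sample = io.StringIO(
    'Name,Room,Phone\n'
    '"Lock,Alice",C12,4443\n'
    'Bob,C11,4445\n'
    'Betty,AA1,45x9\n'
)

# Assumed formats (not given in the slides): one or two capital letters
# followed by one or two digits for a room, four digits for a phone.
ROOM = re.compile(r"[A-Z]{1,2}\d{1,2}")
PHONE = re.compile(r"\d{4}")

problems = []
for row in csv.DictReader(sample):
    # csv handles the quoted comma in "Lock,Alice" correctly.
    print(row["Name"], row["Room"], row["Phone"])
    if not ROOM.fullmatch(row["Room"]):
        problems.append((row["Name"], "Room"))
    if not PHONE.fullmatch(row["Phone"]):
        problems.append((row["Name"], "Phone"))

print(problems)  # [('Betty', 'Phone')]
```

csv.DictReader is used here so that columns can be checked by header name; csv.reader would work equally well with positional indexes.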