Python & Pattern Matching with Regular Expressions (REs)
OPIM 101
File: PythonREs.ppt

Foresight
• Pattern matching
  – Literal
  – With metacharacters
• Regular expressions (REs)
• Using REs in Python

Consider: dir by Itself
D:\athomepc\day\idt>dir
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

.            <DIR>          01-01-02   8:16a .
..           <DIR>          01-01-02   8:16a ..
SPRING~1 PDF       180,072  01-01-02   8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02   8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02   8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02   8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02   8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02   9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02   9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02   9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02  11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02  11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

Now: dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

Now: dir with “*”
D:\athomepc\day\idt>dir *.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>

Literal vs. Pattern Searches
• dir myfile.doc
  – Searches literally, for an exact match with “myfile.doc”
• dir my*.doc
  – Does a pattern search. Matches any file beginning with “my”, followed by 0 or more characters of any kind, followed by “.doc”

MetaCharacters
• dir treats “*” as a metacharacter: a character not taken literally, but as an instruction to match a certain kind of pattern (here: anything)
• The dir metacharacter scheme is very useful

On Beyond *
• ...and also very primitive and limited
• A step up: grep in Unix & Linux; support for RE searches in some text editors, e.g., TextPad (www.textpad.com)
• Regular expressions (REs) use a richer language and a larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text

Python’s RE Metacharacters
• Here’s the complete list: . ^ $ * + ? { } [ ] \ | ( )
• No use memorizing. We’ll learn by examples.
• A natural question: But what if I want to search for a pattern that contains what Python’s RE counts as metacharacters?
  – Be just a little patient

Load Python’s re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie',teststring)
>>> match == None
0
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'

Now a Nonliteral Match
>>> match = re.search('Television',teststring)
>>> match == None
0
>>> match = re.search('television',teststring)
>>> match == None
1
>>> match = re.search('[tT]elevision',teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'

Square Bracket Notation: [...]
• “[tT]” means “any one of the characters ‘t’ or ‘T’.”
• [...] is called a character class
• Examples:
  – [abc], [a-z], [A-Z]
  – [^tT]: any one character that is not t and not T (^ negates the class only when it comes first; in [^t^T] the second ^ is a literal caret that is excluded as well)

Not Example ^
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^tT][a-z]+',teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
Note:
  + means “one or more of the previous”
  * means “zero or more”
  ? means “zero or one”

'\s\w+\.' and '\s(\w+)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('\s\w+\.',teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search('\s(\w+)\.',teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'

[.] == \.
• Inside [...] most metacharacters are taken literally
  – So, [.] == \. (both match a literal period)
• Note (again): [...] is called a character class
>>> match = re.search('\s(\w+)[.]',teststring)
>>> match.span()
(34, 37)

Avoiding Greed ?
>>> newstring = '<div align="center">'
>>> newstring = newstring+'<i class="smaller">'
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+'</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01)</i></div><br>'
>>> match = re.search('<.+>',newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.+?>',newstring)
>>> match.group()
'<div align="center">'
Note: + is greedy and matches as much as it can; +? matches as little as it can

More on Not Being Greedy
>>> match = re.search(r'<(\w).+?>(.+)</(\1)',newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).+?>([^<]+)</(\1)',newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
\1 is called a backreference. It refers to group 1

Concluding
• REs are a very powerful tool, very often very useful
• The language notation is compact and a bit hard to read
• Practice, study the examples, don’t worry about memorization.
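
The interpreter sessions in these slides come from an older Python, where match == None printed 0 or 1. As a recap, here is a minimal sketch of the same searches as a standalone Python 3 script, where such comparisons print False or True instead. The helper name show, the raw-string patterns (r"...") and the "is None" test are choices made for this sketch and are not part of the original slides.

```python
import re

teststring = "Television is public anomie number 1."


def show(pattern, text):
    """Hypothetical helper: return (span, matched text), or None if no match."""
    match = re.search(pattern, text)
    if match is None:  # the usual Python 3 spelling of "match == None"
        return None
    return match.span(), match.group()


# Literal search, then a character class that accepts either case.
literal = show(r"anomie", teststring)
either_case = show(r"[tT]elevision", teststring)

# Negated class: the first character may be anything except t or T.
negated = show(r"[^tT][a-z]+", teststring)

# Parentheses capture a group; span(1) and group(1) refer to it.
number = re.search(r"\s(\w+)\.", teststring)

newstring = (
    '<div align="center">'
    '<i class="smaller">'
    "(As of 10:55 AM on 12/20/01)"
    "</i></div><br>"
)

# Greedy .+ runs to the last ">"; lazy .+? stops at the first one.
greedy = re.search(r"<.+>", newstring).group()
lazy = re.search(r"<.+?>", newstring).group()

# \1 is a backreference: the closing tag must start with the letter
# captured by group 1.
tagged = re.search(r"<(\w).+?>([^<]+)</(\1)", newstring).groups()

print(literal)         # ((21, 27), 'anomie')
print(either_case)     # ((0, 10), 'Television')
print(negated)         # ((1, 10), 'elevision')
print(number.span(1))  # (35, 36)
print(lazy)            # <div align="center">
print(tagged)          # ('i', '(As of 10:55 AM on 12/20/01)', 'i')
```

Raw strings keep backslashes such as \s and \1 from being read as string escapes before the re module sees them, which is why the sketch uses them for every pattern.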

Advice on Scripting
• Scripting, and programming in general, is a process
• Successful scripts don’t spring into existence whole
  – Scripts are built in small increments
• Attend to:
  – Decomposition
  – Stories
  – Testing

Advice on Scripting
• Decomposition
  – Solve big problems by decomposing them into small problems and solving those
• Stories
  – Scripting/programming as a form of literature
  – Use comments with code to tell a clear story about what the code is or should be doing
• Testing
  – Test everything, whole and part, often, with varying inputs

Readings
• IDT book, chapter 8, “Text and Pattern Processing”
• Further information (but beyond the scope of 101)
  – The Python online documentation on the re module
  – “Regular Expression HOWTO” by A.M. Kuchling at http://py-howto.sourceforge.net/ and also at http://pyhowto.sourceforge.net/regex/regex.html