CSC 594 Topics in AI –
Natural Language Processing
Spring 2016/17
3. Regular Expressions
(Some slides adapted from Jurafsky & Martin)
1
Document Search
• ‘Information Retrieval (IR)’ implies a query (e.g. search terms)
– For a given query, relevant or similar documents are returned.
• But the most basic document retrieval technique is keyword/search
term matching.
– Retrieve all (or selected) documents that contain the search terms, by string matching
– Python example:
>>> s1 = 'public'
>>> s2 = 'public'
>>> s2 == s1
True
myword = "month python"
# Print every line of the file that contains the search string.
with open("textfile.txt") as openfile:
    for line in openfile:
        if myword in line:
            print(line)
2
Regular Expressions and Text
Searching
• Regular expressions are a compact textual
representation of a set of strings that constitute a
language
– In the simplest case, regular expressions describe regular
languages
• Here, a language means a set of strings given some alphabet.
• Extremely versatile and widely used technology
– Emacs, vi, perl, grep, etc.
3
Example
• Find all the instances of the word “the” in a text.
– /the/
– /[tT]he/
– /\b[tT]he\b/
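The three patterns above differ in what they let through. A minimal sketch in Python, assuming a made-up sample sentence (the text is not from the slides), shows how each refinement changes the result:

```python
import re

# Hypothetical sample text, chosen to expose the differences between the patterns.
text = "The other day the weather was fine, they said."

# /the/ : matches the letters anywhere, including inside "other" and "weather".
plain = re.findall(r"the", text)

# /[tT]he/ : also accepts a capital T, but still matches inside longer words.
either_case = re.findall(r"[tT]he", text)

# /\b[tT]he\b/ : word boundaries restrict the match to the standalone word.
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), len(whole_word))
print(whole_word)
```

Only the last pattern avoids false hits such as "other", "weather", and "they".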
4
String Matching Using Patterns
• Often, we wish to find a substring which matches a pattern
• e.g. E-mail addresses:
1. Any number of alphanumeric characters and/or dots (not a dot at
beginning or end)
2. @
3. Any number of alphanumeric characters and/or dots (not a dot at
beginning or end); must be at least one dot
• Examples:
– valid: [email protected], [email protected]
– Invalid: [email protected], tomuro@depaul
• To specify search words by pattern rather than by literal text, regular
expressions are commonly used.
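One possible way to turn the three e-mail rules above into a pattern is sketched below. The names LOCAL, DOMAIN, EMAIL, and is_valid_email are hypothetical, and every test address except tomuro@depaul is made up. Note one assumption beyond the rules: this sketch also rejects consecutive dots.

```python
import re

# Rule 1: alphanumerics, optionally followed by dot-separated groups,
# so a dot can never be first or last.
# Assumption: this also rejects consecutive dots, which the rules do not mention.
LOCAL = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"

# Rule 3: same shape, but "+" on the group forces at least one dot.
DOMAIN = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"

# Rule 2: the two parts are joined by a literal "@".
EMAIL = re.compile(LOCAL + "@" + DOMAIN)

def is_valid_email(address):
    """Return True if the whole string fits the three rules."""
    return EMAIL.fullmatch(address) is not None

# All addresses except "tomuro@depaul" are made-up illustrations.
for candidate in ["jane.doe@example.com", "tomuro@depaul",
                  ".jane@example.com", "jane@example.com."]:
    print(candidate, is_valid_email(candidate))
```

This is far looser than real e-mail address syntax; it only mirrors the simplified rules given here.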
5
Regular Expressions (1)
Regular expressions are an algebra for defining patterns. For example, the
regular expression “a*b” matches the string “aaaab”.
Without going through the formal definitions, here is a (partial)
summary.
1. Simple Patterns
– Characters match themselves. Note that characters are case-sensitive.
– Metacharacters – not to be used literally _as is_:
  . ^ $ * + ? { } [ ] \ | ( )
– To use a metacharacter literally, put a backslash before it:
  \. \^ \+ etc.
– Other special characters:
  \t, \n, \r, \f etc.
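The effect of escaping can be checked directly. The following is a small sketch, assuming a made-up price string, that compares an unescaped pattern, a hand-escaped one, and the standard library helper re.escape():

```python
import re

# Hypothetical sample text.
price_text = "Total: $3.99 (was $4.50)"

# Unescaped, "$" is an end-of-string anchor, so nothing can follow it
# and the pattern finds no match.
unescaped = re.findall(r"$3.99", price_text)

# Escaped by hand, "$" and "." are taken literally.
by_hand = re.findall(r"\$3\.99", price_text)

# re.escape() inserts the backslashes automatically.
auto_pattern = re.escape("$3.99")
automatic = re.findall(auto_pattern, price_text)

print(unescaped, by_hand, automatic)
```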
6
Regular Expressions (2)
2. Character classes
– [abc] – a, b, or c
– [^abc] – any character except a, b, or c
– [a-zA-Z] – a through z, or A through Z, inclusive (range)
3. Predefined character classes
– . (dot) – any character
– \d – a digit ([0-9])
– \D – a non-digit ([^0-9])
– \s – a whitespace character (e.g. space, \t, \n, \r)
– \S – a non-whitespace character
– \w – a word character ([a-zA-Z_0-9])
– \W – a non-word character ([^\w])
4. Boundary matchers
– ^ – the beginning of a line
– $ – the end of a line
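A short sketch of these classes in Python's re module follows; the sample strings are made up. One detail specific to Python: without the re.M flag, ^ and $ anchor to the whole string rather than to each line.

```python
import re

# Hypothetical two-line sample text.
sample = "Room 42\tcosts $3.99\nroom 7 is free"

digits = re.findall(r"\d", sample)        # every single digit
words = re.findall(r"\w+", sample)        # runs of word characters
consonants = re.findall(r"[^aeiou\s]", "a big cat")  # negated class

# Default behavior in Python: ^ is the start and $ the end of the string.
first_word = re.findall(r"^\w+", sample)
last_word = re.findall(r"\w+$", sample)

print(digits)
print(words)
print(consonants, first_word, last_word)
```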
7
Regular Expressions (3)
5. Greedy quantifiers
– X? – X, once or not at all
– X* – X, zero or more times
– X+ – X, one or more times
– X{n} – X, exactly n times
– X{n,m} – X, at least n but no more than m times
6. Logical operators
– XY – X followed by Y
– X|Y – either X or Y
– (X) – X, as a capturing group
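The quantifiers and operators above can be tried out with a small helper. This is a sketch only: the helper name fits and all test strings are hypothetical.

```python
import re

def fits(pattern, text):
    """True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

optional = (fits(r"colou?r", "color"), fits(r"colou?r", "colour"))  # X?
star = (fits(r"a*b", "b"), fits(r"a*b", "aaaab"))                   # X*
plus = fits(r"a+b", "b")                                            # X+ needs at least one a
exact = (fits(r"\d{3}", "123"), fits(r"\d{3}", "12"))               # X{n}
bounded = fits(r"\d{2,4}", "12345")                                 # X{n,m}: five digits is too many

# Greedy: a+ takes as many a's as it can.
greedy = re.search(r"a+", "caaat").group()

either = fits(r"cat|dog", "dog")                                    # X|Y
pair = re.search(r"(\d+)-(\d+)", "pages 10-25").groups()            # (X) captures

print(optional, star, plus, exact, bounded, greedy, either, pair)
```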
8
Regular Expression in Python (1)
• Regular expressions are in the ‘re’ module.
• Notation for patterns is slightly different from other languages:
a raw string can be used as an alternative to a regular string, so that
backslashes do not have to be doubled.
Regular string      Raw string
"ab*"               r"ab*"
"\\\\section"       r"\\section"
"\\w+\\s+\\1"       r"\w+\s+\1"
• First compile an expression (into an re object). Then match it
against a string.
– >>> import re
  >>> p = re.compile('ab*')
9
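That each regular string and its raw-string counterpart denote the same pattern can be verified directly. A minimal sketch follows; the sample text being searched is made up.

```python
import re

# Each pair is the same sequence of characters, written two ways.
same_section = "\\\\section" == r"\\section"
same_backref = "\\w+\\s+\\1" == r"\w+\s+\1"

# The pattern \\section matches one literal backslash followed by "section".
p = re.compile(r"\\section")
m = p.search(r"see \section{Intro}")
found = m.group()

print(same_section, same_backref, found)
```

The raw form is usually easier to read because each backslash in the source is exactly one backslash in the pattern.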
Regular Expression in Python (2)
• Matching an re object against a string is done in several ways.
Method/Attribute    Purpose
match()             Determine if the RE matches at the beginning of the string.
search()            Scan through a string, looking for any location where this RE matches.
findall()           Find all substrings where the RE matches, and return them as a list.
finditer()          Find all substrings where the RE matches, and return them as an iterator.
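The four methods can be compared side by side on one compiled pattern. A brief sketch, using a made-up sample string:

```python
import re

p = re.compile(r"\d+")
text = "Order 66 shipped 3 boxes"   # hypothetical sample text

# match() only tries the start of the string, where there is no digit.
at_start = p.match(text)

# search() scans forward and returns the first match object.
first = p.search(text).group()

# findall() returns every matching substring as a list of strings.
all_numbers = p.findall(text)

# finditer() yields match objects, which also carry positions.
spans = [m.span() for m in p.finditer(text)]

print(at_start, first, all_numbers, spans)
```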
10
>>> import re
>>> sent = "This book on tennis cost $3.99 at Walmart."
>>> p1 = re.compile("ten")
>>> m1 = p1.match(sent)
>>> m1
>>> p2 = re.compile(".*ten.*")
>>> m2 = p2.match(sent)
>>> m2
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> m3 = re.search(p1,sent)
>>> m3
<_sre.SRE_Match object; span=(13, 16), match='ten'>
>>> m4 = re.search(p2,sent)
>>> m4
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> pp1 = re.compile("is")
>>> m5 = re.findall(pp1, sent)
>>> m5
['is', 'is']
>>> pp2 = re.compile("\\d")
>>> m6 = re.search(pp2, sent)
>>> m6
<_sre.SRE_Match object; span=(26, 27), match='3'>
>>> pp3 = re.compile("\\d+")
>>> m7 = re.search(pp3, sent)
>>> m7
<_sre.SRE_Match object; span=(26, 27), match='3'>
11
>>> pp3 = re.compile("\\$\\d+\\.\\d\\d")
>>> m8 = re.search(pp3, sent)
>>> m8
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
>>> pp4 = re.compile(r"\$\d+\.\d\d")
>>> m9 = re.search(pp4, sent)
>>> m9
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
12
Regular Expression in Python (3)
• Grouping – You can retrieve the matched substrings using
parentheses.
• Capturing groups are numbered by counting their opening
parentheses from left to right. In the expression ((A)(B(C))), for
example, there are four such groups:
– 1: ((A)(B(C)))
– 2: (A)
– 3: (B(C))
– 4: (C)
• Group zero always stands for the entire expression.
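The numbering rule can be confirmed in Python by matching the expression ((A)(B(C))) against the string "ABC", as in this small sketch:

```python
import re

m = re.match(r"((A)(B(C)))", "ABC")

# Group 0 is the whole match; groups 1 to 4 follow the order
# of their opening parentheses, left to right.
numbered = [m.group(n) for n in range(5)]
captured = m.groups()   # groups 1..4 only

for n, value in enumerate(numbered):
    print(n, value)
```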
13
>>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)")
>>> mm1 = re.search(ppp1, sent)
>>> mm1
<_sre.SRE_Match object; span=(13, 30), match='tennis cost $3.99'>
>>> mm1.group(0)
'tennis cost $3.99'
>>> mm1.group(1)
'tennis'
>>> mm1.group(2)
'$3.99'
14
Python ‘search()’ Example
#!/usr/bin/python
import re

line = "Cats are smarter than dogs"
# (.*) is greedy; (.*?) is non-greedy and stops at the first following space.
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)
if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")
When the above code is executed, it produces the following result:
searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm
15
Regular Expression Modifiers: Option Flags
Regular expression functions accept an optional flags argument that controls various
aspects of matching. The modifiers are specified as an optional flag. You can combine
multiple modifiers using bitwise OR (|), as shown previously. Each modifier is
one of these:
Modifier    Description
re.I        Performs case-insensitive matching.
re.L        Interprets words according to the current locale. This interpretation
            affects the alphabetic group (\w and \W), as well as word boundary
            behavior (\b and \B).
re.M        Makes $ match the end of a line (not just the end of the string) and
            makes ^ match the start of any line (not just the start of the string).
re.S        Makes a period (dot) match any character, including a newline.
re.U        Interprets letters according to the Unicode character set. This flag
            affects the behavior of \w, \W, \b, \B.
re.X        Permits "cuter" regular expression syntax. It ignores whitespace
            (except inside a set [] or when escaped by a backslash) and treats
            unescaped # as a comment marker.
TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm
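The effect of the most common flags can be seen in a short sketch. The two-line sample text and the variable names are hypothetical, and the final pattern reuses the price example from earlier slides, rewritten with re.X.

```python
import re

text = "First line\nsecond LINE"   # hypothetical sample text

# re.I: case-insensitive matching.
ignore_case = re.findall(r"line", text, re.I)

# re.M: ^ matches at the start of every line, not just the string.
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.M)

# re.S: the dot also matches the newline between the two lines.
without_s = re.search(r"First.*LINE", text)
with_s = re.search(r"First.*LINE", text, re.S)

# Flags are combined with bitwise OR.
combined = re.findall(r"^second line$", text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored.
price = re.compile(r"""
    \$      # literal dollar sign
    \d+     # dollars
    \.      # literal decimal point
    \d\d    # cents
""", re.X)
amount = price.search("This book cost $3.99").group()

print(ignore_case, without_m, with_m, combined, amount)
```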
16