CE306/CE706 – Information Retrieval Crash Course in Regular Expressions
CE306/CE706– Spring 2017
The basic pre-processing tasks (as you know)
• TOKENIZATION: identify tokens in text
• WORD COUNTING: count words and their frequencies (more soon!)
• SEARCHING FOR WORDS
• NORMALIZATION:
  • Information Retrieval, information retrieval, INFORMATION RETRIEVAL → information retrieval
  • Feb 2, 2nd of February, … → 02/02/2017
• STEMMING
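A minimal Java sketch of three of these tasks (tokenization, case normalization, word counting). The class name `PreprocessSketch`, the method `wordCounts`, and the decision to split on runs of non-alphanumeric characters are illustrative assumptions, not the course's reference implementation; a real tokenizer has to handle many more cases.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class PreprocessSketch {

    // Lowercase the text, split it on runs of non-alphanumeric characters,
    // and count how often each resulting token occurs.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        for (String token : normalized.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {            // split() may yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {information=3, retrieval=3}
        System.out.println(wordCounts(
            "Information Retrieval, information retrieval, INFORMATION RETRIEVAL"));
    }
}
```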
Searching text for patterns
• Most common case: searching using Google or similar
• Simpler case: just looking for web pages containing a word (‘accommodation’)
• More complex cases:
  • Different spellings:
    • ‘accomodation’ OR ‘accommodation’
    • Centre OR Center “Cognitive Science”
  • Patterns only occurring in certain contexts
• But also: to validate string entered by the user
  • E.g., checking whether the string entered is a phone number
    • (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
    • But not: (44+)020-12341234, 12341234(+020)
  • A regular email address
    • [email protected], [email protected], [email protected]
    • But not: asmith, @ mactech.com, a@a
  • A post code:
    • G1 1AA, EH10 2QQ, SW1 1ZZ
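As a rough illustration of validation, the Java sketch below checks a string against a deliberately simplified post code shape. The pattern, the class name `PostcodeSketch` and the method `looksLikePostcode` are assumptions made for this example: the pattern covers the three examples above, but it is not a complete description of valid UK post codes.

```java
import java.util.regex.Pattern;

// Hypothetical validator, for illustration only.
class PostcodeSketch {

    // Simplified shape: 1-2 letters, 1-2 digits, a space, 1 digit, 2 letters.
    // Assumption: enough for the three slide examples, not for every real post code.
    private static final Pattern POSTCODE =
        Pattern.compile("[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}");

    static boolean looksLikePostcode(String s) {
        // matches() requires the WHOLE string to fit the pattern.
        return POSTCODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"G1 1AA", "EH10 2QQ", "SW1 1ZZ", "G1-1AA", "12345"}) {
            System.out.println(s + " -> " + looksLikePostcode(s));
        }
    }
}
```

The first three strings are accepted and the last two are rejected.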
Regular Expressions: a formalism for expressing search patterns
• Because matching is a very common problem, over the years computer scientists have identified a set of patterns that
  1. Are very common
  2. Can be searched for efficiently
• The language of REGULAR EXPRESSIONS has been developed to characterize these patterns
• Many programming languages (Perl, Java 1.4, TCL, Python, …) / web search tools / software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching – these REs are then compiled into efficient code
• You do not need to write the code yourselves!
Regular Expressions: the basic case
• The simplest form of regular expression: a SEQUENCE OF SYMBOLS
  • /can/
  • Matches any string which contains ‘can’: can, canterbury, scanning
• Whitespace can be included: /top ten/
  • Also matches “how to stop tension”
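The "contains" behaviour described above can be tried out with a short Java sketch. The class `BasicMatchSketch` and its helper `contains` are hypothetical names; the only library calls used are `Pattern.compile`, `Pattern.matcher` and `Matcher.find` from `java.util.regex`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class BasicMatchSketch {

    // True if the regular expression matches anywhere inside the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("can", "canterbury"));               // true
        System.out.println(contains("can", "scanning"));                 // true
        System.out.println(contains("top ten", "how to stop tension"));  // true
        System.out.println(contains("can", "CAN"));                      // false: matching is case-sensitive
    }
}
```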
More complex types of regular expressions
• Disjunction:
  • /centre|center/
  • /accomodation|accommodation/
  • Also:
    • /[Cc]entre/
    • /acco(m|mm)odation/
    • Note: square brackets match exactly one character, so alternatives of different lengths need parentheses
• Repetitions:
  • +: one or more occurrences
    • /YES+!/
    • Matches YES!, YESS!, YESSS!
    • E.g., any binary number: [01]+
  • *: 0 or more
    • /ab*/
    • Matches a, ab, abb, abbb
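A small Java sketch for experimenting with disjunction and repetition. The class `DisjunctionRepetitionSketch` and helper `fullMatch` are hypothetical; `Pattern.matches` is the standard call that tests whether the whole string fits the expression. The spelling alternative is written with parentheses, `(m|mm)`, because a character class matches only a single character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DisjunctionRepetitionSketch {

    // True if the WHOLE text fits the regular expression.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("centre|center", "center"));             // true
        System.out.println(fullMatch("[Cc]entre", "Centre"));                 // true
        System.out.println(fullMatch("acco(m|mm)odation", "accommodation"));  // true
        System.out.println(fullMatch("YES+!", "YESSS!"));                     // true
        System.out.println(fullMatch("YES+!", "YE!"));                        // false: + needs at least one S
        System.out.println(fullMatch("[01]+", "10110"));                      // true
        System.out.println(fullMatch("ab*", "a"));                            // true: * allows zero b's
    }
}
```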
Regular expressions in Java (from 1.4)
• Standard library: java.util.regex
• Tutorial (very good): http://download.oracle.com/javase/tutorial/essential/regex/
• Main classes:
  • PATTERN (= compiled form of a RE)
    • Pattern rePattern = Pattern.compile("ab*");
  • MATCHER (= analyze a string using a pattern)
    • Matcher pm = rePattern.matcher(string);
    • pm.find(): find the next substring that matches
    • pm.group(): the substring found by find()
• Next week’s labscript …
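Put together, the Pattern and Matcher calls listed above might be used as in the following sketch. The class `FindAllSketch`, the method `findAll` and the sample input are illustrative assumptions; the `compile`, `matcher`, `find` and `group` calls are the ones named on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class FindAllSketch {

    // Collect every substring of the input that matches the regular expression.
    static List<String> findAll(String regex, String input) {
        Pattern rePattern = Pattern.compile(regex);   // compile once, reuse for matching
        Matcher pm = rePattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (pm.find()) {            // advance to the next match, if there is one
            found.add(pm.group());     // text of the match just found
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [a, ab, abbb, ab]
        System.out.println(findAll("ab*", "a ab abbb cab"));
    }
}
```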
Regular expressions in Perl
• Example: print lines containing the string ‘can’ (a simple version of the UNIX/Linux ‘grep’ program)
while (<STDIN>) {
    if (/can/) { print $_; }
}
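For comparison, a rough Java counterpart of the same filter. It works on a list of lines rather than on standard input so that it is easy to test; the class `GrepSketch` and the method `grep` are hypothetical names chosen for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, for illustration only.
class GrepSketch {

    // Keep only the lines in which the regular expression matches somewhere.
    static List<String> grep(String regex, List<String> lines) {
        Pattern p = Pattern.compile(regex);
        return lines.stream()
                    .filter(line -> p.matcher(line).find())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("I can see", "nothing here", "a scanner");
        // prints "I can see" and "a scanner"
        grep("can", lines).forEach(System.out::println);
    }
}
```

To behave like the Perl loop, the lines could instead be read from a `BufferedReader` wrapped around `System.in`.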
Even more complex cases and more metacharacters (PERL- and Java-specific)
• Other forms of disjunction:
  • Range: /textfile0[2-4]/
  • Will match “textfile02” “textfile03” “textfile04”
• Metacharacters (in Perl / Java):
  • \d (any digit): /a\d+z/ matches a0z, a123z, a456z
  • \w (letter, digit, or underscore _)
  • \s (any whitespace)
• Any character: . (period)
  • /cyclo.*ane/ matches
    • “cyclodecane”, “cyclohexane”, “cyclones drive me insane”
• Zero or one times: ?
  • /accomm?odation/ matches “accomodation” and “accommodation”
• Negation: [^abc]
  • /textfile[^0268]/ matches “textfile1”, “textfile3”, …
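The metacharacters above can be checked with a few lines of Java. Note that inside a Java string literal each backslash has to be doubled, so `\d` is written `"\\d"`. The class `MetacharSketch` and helper `fullMatch` are hypothetical names; the test strings come from the examples above, plus a few negative cases added for contrast.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MetacharSketch {

    // True if the WHOLE text fits the regular expression.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("textfile0[2-4]", "textfile03"));              // true
        System.out.println(fullMatch("textfile0[2-4]", "textfile05"));              // false: 5 is outside 2-4
        System.out.println(fullMatch("a\\d+z", "a123z"));                           // true
        System.out.println(fullMatch("\\w+\\s\\w+", "top ten"));                    // true
        System.out.println(fullMatch("cyclo.*ane", "cyclones drive me insane"));    // true
        System.out.println(fullMatch("accomm?odation", "accomodation"));            // true
        System.out.println(fullMatch("textfile[^0268]", "textfile1"));              // true
        System.out.println(fullMatch("textfile[^0268]", "textfile2"));              // false: 2 is excluded
    }
}
```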
Applications of more complex REs
• Web pages about Centres and Centers:
• /[Cc]entre|[Cc]enter/
• Regular expression to validate phone numbers:
  • (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
  • But not: (44+)020-12341234, 12341234(+020)
  • ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
• Validating email addresses:
  • [email protected], [email protected], [email protected]
  • But not: asmith, @mactech.com, a@a
  • ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
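The two validation patterns can be wrapped in Java as sketched below, with backslashes doubled for the string literals. The class `ValidationSketch` and the methods `isPhone` and `isEmail` are hypothetical names. The positive email example `asmith@mactech.com` is an assumption built from fragments of the negative examples, since the original addresses are not legible in this transcript.

```java
import java.util.regex.Pattern;

// Hypothetical validator class, for illustration only.
class ValidationSketch {

    // Phone pattern from the slide: optional prefix such as "(+44)", then
    // digits, underscores, hyphens, spaces and parentheses.
    private static final Pattern PHONE =
        Pattern.compile("^(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*$");

    // Email pattern from the slide: local part, "@", then either a bracketed
    // IP-style prefix or dot-separated labels, then a short final part.
    private static final Pattern EMAIL = Pattern.compile(
        "^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)"
        + "|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$");

    static boolean isPhone(String s) { return PHONE.matcher(s).matches(); }

    static boolean isEmail(String s) { return EMAIL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isPhone("(+44)(0)20-12341234"));   // true
        System.out.println(isPhone("+44 (0) 1234-1234"));     // true
        System.out.println(isPhone("(44+)020-12341234"));     // false
        System.out.println(isEmail("asmith@mactech.com"));    // true (assumed example address)
        System.out.println(isEmail("a@a"));                   // false: no dot after the @
    }
}
```

Note that the phone pattern is permissive: every part of it is optional, so it also accepts the empty string.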
REs in action (some of my own work)
• Use of regular expressions as part of a simple processing pipeline to process query logs of a digital library (The European Library)
• Why? We want to automatically learn what query modification suggestions to propose to (future) users of the library search engine, e.g.:
… example continued
• …
• 1889115 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6
xxxx en xxxx ("mozart") xxxx search_url xxxx xxxx 0 xxxx - xxxx xxxx
xxxx 2008-06-24 22:02:52
• 1889118 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6
xxxx en xxxx ("mozart") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx
xxxx 2008-06-24 22:03:03
• 1889120 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6
xxxx en xxxx klavierkonzerte xxxx search_res_rec_all xxxx xxxx 0 xxxx
- xxxx xxxx xxxx 2008-06-24 22:03:55
• 1889121 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6
xxxx en xxxx ("klavierkonzerte") xxxx view_full xxxx xxxx 1 xxxx
xxxx xxxx xxxx 2008-06-24 22:04:10
• …
… example continued
Here comes (part of) the processing pipeline:
more $CleanedFile |
gawk 'BEGIN {FS = " xxxx "}
  {if (($5 == "en") && ($4 != "null") &&
       ($6 !~ /^[ "(]*(test|toto|a)[ ")]*$/) &&
       ($7 ~ /^search_/) && ($7 !~ /^search_adv/))
     {print $4 " xxxx " $1 " xxxx " $6 " xxxx " $13}}' |
sort -k 1,1 -k 3,3n |
gawk 'BEGIN {FS = " xxxx "}
  {if ((OLD_ID == $1 && (OLD_QUERY != $3))) {print OLD_LINE; PRINT = "on"};
   if ((OLD_ID != $1) && (PRINT == "on")) {print OLD_LINE; PRINT = "off"};
   OLD_ID = $1; split($0, a, " xxxx "); OLD_QUERY = a[3]; OLD_LINE = $0}' > $SessionQueries
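To make the regular expressions in the first gawk stage easier to read, here is a rough Java sketch of just that filter condition. It is not a port of the whole pipeline. The class `LogFilterSketch`, the method `keep` and its parameter names are assumptions; the comments map each parameter to the gawk field it stands for.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class LogFilterSketch {

    // Queries that consist only of "test", "toto" or "a", optionally wrapped
    // in spaces, double quotes or parentheses (same regex as in the gawk stage).
    private static final Pattern JUNK_QUERY =
        Pattern.compile("^[ \"(]*(test|toto|a)[ \")]*$");
    private static final Pattern SEARCH_ACTION = Pattern.compile("^search_");
    private static final Pattern ADVANCED_SEARCH = Pattern.compile("^search_adv");

    // Mirrors the first gawk condition:
    // $5 = language, $4 = session id, $6 = query, $7 = action.
    static boolean keep(String language, String sessionId, String query, String action) {
        return language.equals("en")
            && !sessionId.equals("null")
            && !JUNK_QUERY.matcher(query).find()
            && SEARCH_ACTION.matcher(action).find()
            && !ADVANCED_SEARCH.matcher(action).find();
    }

    public static void main(String[] args) {
        String session = "8eb3bdv3odg9jncd71u0s2aff6";
        System.out.println(keep("en", session, "(\"mozart\")", "search_url"));  // true
        System.out.println(keep("en", session, "(\"mozart\")", "view_full"));   // false: not a search action
        System.out.println(keep("en", session, "(\"test\")", "search_url"));    // false: junk query
    }
}
```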
… example continued
This is what I get:
• …
8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889115 xxxx ("mozart") xxxx 2008-06-24 22:02:52
8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889120 xxxx klavierkonzerte xxxx 2008-06-24 22:03:55
…
REs: Notational Variants
• Different programming languages tend to use different notations for expressing REs.
• In FSA,
  • Sequence: [d,o,g]
  • Disjunction: {[c,a,t],[d,o,g]} (instead of cat|dog)
  • Range: ‘a’..‘z’ (instead of a-z)
  • Any symbol whatsoever: ? (instead of ‘.’)
  • Optional character: E^ (instead of E?)
Notational variants: advanced search in Google
• CAPITALIZATION, etc:
  • Google search is not case-sensitive
• “OR” search:
  • vacation london OR paris
• NUMRANGE search:
  • DVD player $250..350
• WILDCARD search:
  • "Sony Vaio * laptop"
• For more tips: http://www.google.com/help/refinesearch.html
Readings
• RegExr online tool to practice regular expressions:
http://www.regexr.com/
• The Java tutorial at Sun, section on regular expressions:
http://download.oracle.com/javase/tutorial/essential/regex/