CE306/CE706 – Information Retrieval
Crash Course in Regular Expressions
CE306/CE706 – Spring 2017

The basic pre-processing tasks (as you know)
• TOKENIZATION: identify tokens in text
• WORD COUNTING: count words and their frequencies (more soon!)
• SEARCHING FOR WORDS
• NORMALIZATION:
  • Information Retrieval, information retrieval, INFORMATION RETRIEVAL → information retrieval
  • Feb 2, 2nd of February, … → 02/02/2017
  • STEMMING

Searching text for patterns
• Most common case: searching using Google or similar
  • Simpler case: just looking for web pages containing a word ('accommodation')
  • More complex cases:
    • Different spellings:
      • 'accomodation' OR 'accommodation'
      • Centre OR Center "Cognitive Science"
    • Patterns only occurring in certain contexts
• But also: to validate a string entered by the user
  • E.g., checking whether the string entered is a phone number
    • (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
    • But not: (44+)020-12341234, 12341234(+020)
  • A regular email address
    • [email protected], [email protected], [email protected]
    • But not: asmith, @ mactech.com, a@a
  • A post code:
    • G1 1AA, EH10 2QQ, SW1 1ZZ

Regular Expressions: a formalism for expressing search patterns
• Because matching is a very common problem, over the years computer scientists have identified a set of patterns that
  1. Are very common
  2. Can be searched for efficiently
• The language of REGULAR EXPRESSIONS has been developed to characterize these patterns
• Many programming languages (Perl, Java 1.4, TCL, Python, …), web search tools, and software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching for; these REs are then compiled into efficient code
• You do not need to write the code yourselves!
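The pre-processing tasks listed at the start (tokenization, case normalization, word counting) can be sketched with a compiled regular expression. This is only an illustration, assuming Python's standard re module (Python is one of the languages listed above, not necessarily the one used in the labs). The token pattern and the function names tokenize and count_words are hypothetical choices made for this sketch.

```python
import re
from collections import Counter

# Hypothetical token pattern for this sketch: a run of letters, digits,
# or underscores. Compiling it once lets the library reuse the matcher.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the tokens found in text, normalized to lower case."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def count_words(text):
    """Count how often each normalized token occurs in text."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    sample = "Information Retrieval, information retrieval, INFORMATION RETRIEVAL"
    for word, frequency in count_words(sample).items():
        print(word, frequency)
```

Lower-casing every token is the simplest form of the normalization step; date normalization and stemming would need additional rules that are not attempted here.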

Regular Expressions: the basic case
• The simplest form of regular expression: a SEQUENCE OF SYMBOLS
  • /can/
  • Matches any string which contains 'can': can, canterbury, scanning
• Whitespace can be included: /top ten/
  • Also matches "how to stop tension"

More complex types of regular expressions
• Disjunction:
  • /centre|center/
  • /accomodation|accommodation/
• Also:
  • /[Cc]entre/
  • /acco(m|mm)odation/ (alternatives longer than one character need parentheses; square brackets match a single character only)
• Repetitions:
  • +: one or more
    • /YES+!/
    • Matches YES!, YESS!, YESSS!
    • E.g., any binary number: [01]+
  • *: 0 or more
    • /ab*/
    • Matches a, ab, abb, abbb

Regular expressions in Java (from 1.4)
• Standard library: java.util.regex
• Tutorial (very good): http://download.oracle.com/javase/tutorial/essential/regex/
• Main classes:
  • PATTERN (= compiled form of a RE)
    • Pattern rePattern = Pattern.compile("ab*");
  • MATCHER (= analyzes a string using a pattern)
    • Matcher pm = rePattern.matcher(string);
    • pm.find(): find the next substring that matches
    • pm.group(): the substring found by find()
• Next week's lab script …

Regular expressions in Perl
• Example: print lines containing the string 'can' (a simple version of the UNIX/Linux 'grep' program)

    while (<STDIN>) {
        if (/can/) {
            print $_;
        }
    }

Even more complex cases and more metacharacters (Perl- and Java-specific)
• Other forms of disjunction:
  • Range: /textfile0[2-4]/
  • Will match "textfile02", "textfile03", "textfile04"
• Metacharacters (in Perl / Java):
  • \d (any digit): /a\d+z/ matches a0z, a123z, a456z
  • \w (letter, digit, or underscore _)
  • \s (any whitespace)
• Any character: . (period)
  • /cyclo.*ane/ matches "cyclodecane", "cyclohexane", "cyclones drive me insane"
• Zero or one times: ?
  • /accomm?odation/ matches "accomodation" and "accommodation"
• Negation: [^abc]
  • /textfile[^0268]/ matches "textfile1", "textfile3", …

Applications of more complex REs
• Web pages about Centres and Centers:
  • /[Cc]entre|[Cc]enter/
• Regular expression to validate phone numbers:
  • (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
  • But not: (44+)020-12341234, 12341234(+020)
  • ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
• Validating email addresses:
  • [email protected], [email protected], [email protected]
  • But not: asmith, @mactech.com, a@a
  • ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

REs in action (some of my own work)
• Use of regular expressions as part of a simple processing pipeline to process query logs of a digital library (The European Library)
• Why? We want to automatically learn what query modification suggestions to propose to (future) users of the library search engine.

… example continued
• …
• 1889115 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx search_url xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:02:52
• 1889118 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:03:03
• 1889120 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx klavierkonzerte xxxx search_res_rec_all xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:03:55
• 1889121 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("klavierkonzerte") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:04:10
• …

… example continued
Here comes (part of) the processing pipeline:

    more $CleanedFile |
    gawk 'BEGIN {FS = " xxxx "}
          {if (($5=="en") && ($4 != "null") &&
               ($6 !~ /^[ "(]*(test|toto|a)[ ")]*$/) &&
               ($7~/^search_/) && ($7 !~ /^search_adv/))
             {print $4 " xxxx " $1 " xxxx " $6 " xxxx " $13}}' |
    sort -k 1,1 -k 3,3n |
    gawk 'BEGIN {FS = " xxxx "}
          {if ((OLD_ID == $1 && (OLD_QUERY != $3))) {print OLD_LINE; PRINT = "on"};
           if ((OLD_ID != $1) && (PRINT == "on")) {print OLD_LINE; PRINT = "off"};
           OLD_ID = $1; split($0, a, " xxxx "); OLD_QUERY = a[3]; OLD_LINE = $0}' > $SessionQueries

… example continued
This is what I get:
    …
    8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889115 xxxx ("mozart") xxxx 2008-06-24 22:02:52
    8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889120 xxxx klavierkonzerte xxxx 2008-06-24 22:03:55
    …

REs: Notational Variants
• Different programming languages tend to use different notations for expressing REs.
• In FSA:
  • Sequence: [d,o,g]
  • Disjunction: {[c,a,t],[d,o,g]} (instead of cat|dog)
  • Range: 'a'..'z' (instead of a-z)
  • Any symbol whatsoever: ? (instead of '.')
  • Optional character: E^ (instead of E?)

Notational variants: advanced search in Google
• CAPITALIZATION, etc.:
  • Google search is not case-sensitive
• "OR" search:
  • vacation london OR paris
• NUMRANGE search:
  • DVD player $250..350
• WILDCARD search:
  • "Sony Vaio * laptop"
• For more tips: http://www.google.com/help/refinesearch.html

Readings
• RegExr online tool to practice regular expressions: http://www.regexr.com/
• The Java tutorial (Oracle, formerly Sun), section on regular expressions: http://download.oracle.com/javase/tutorial/essential/regex/
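
Alongside the online tool, the patterns from these slides can be tried directly in code. The following is a minimal sketch, assuming Python's standard re module, which accepts the same Perl-style notation for the constructs used here. The helper names contains and is_phone_number are hypothetical, and the phone-number pattern is copied from the slide on validation.

```python
import re

# Phone-number pattern copied from the slides (anchored with ^ and $).
PHONE_RE = re.compile(r"^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$")


def contains(pattern, text):
    """Return True if pattern matches anywhere in text, like /pattern/."""
    return re.search(pattern, text) is not None


def is_phone_number(text):
    """Return True if the whole string fits the slide's phone pattern."""
    return PHONE_RE.match(text) is not None


if __name__ == "__main__":
    # Sequence of symbols, including whitespace
    print(contains(r"top ten", "how to stop tension"))
    # Repetition: one or more S
    print(contains(r"YES+!", "YESSS!"))
    # Optional character
    print(contains(r"accomm?odation", "accomodation"))
    # Negated character class
    print(contains(r"textfile[^0268]", "textfile2"))
    # Validation examples from the slides
    print(is_phone_number("+44 (0) 1234-1234"))
    print(is_phone_number("(44+)020-12341234"))
```

Note that contains searches for a substring, whereas the validation pattern is anchored and so must account for the entire input. The pattern is permissive by design: every part of it is optional, so it is best treated as a first filter and not as a complete phone-number check.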