PCD Lab Manual

This document provides an overview of regular expressions (regex). It discusses basic regex syntax, character classes, repetition operators, anchors, grouping, backreferences, and matching modes. Examples are provided for text matching, full text matching, character classes, repetition, greedy/nongreedy matching, alternation, and replacing text using regex. The document is intended to introduce the key concepts and components of regex.

Uploaded by

Sumit Sharma
Copyright © All Rights Reserved

$address =~ m/(\d* .*)\n(.*?), ([A-Z]{2}) (\d{5})-?(\d{0,5})/
Introduction to Regular Expressions

• It’s all about patterns


• Character Classes match any text of a certain type
• Repetition operators specify a recurring pattern
• Search flags change how the RegEx operates
• In this presentation…
• green denotes a character class
• yellow denotes a repetition quantifier
• orange denotes a search flag or other symbol

• My examples use Perl syntax


Introduction to Regular Expressions

• Basic syntax
• In Perl, a RegEx statement normally begins and ends with /
• /something/
• Escaping reserved characters is crucial
• /(i.e. / is invalid because ( must be closed
• However, /\(i\.e\. / is valid for finding ‘(i.e. ’
• Reserved characters include:
• .*?+()[]{}/\|
• Some characters also have special meanings based on their position in the statement
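As a minimal sketch of the escaping rule above, the snippet below matches the literal text ‘(i.e. ’ twice: once escaped by hand, and once with Perl's built-in quotemeta. The sample sentence and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'Use a short form (i.e. an abbreviation) here.';

# Escaped by hand: \( and \. are literal characters, not metacharacters.
my $by_hand = $text =~ /\(i\.e\. / ? 1 : 0;

# quotemeta() backslash-escapes every non-word character, so the
# literal can be interpolated into a pattern safely.
my $literal      = '(i.e. ';
my $escaped      = quotemeta $literal;
my $by_quotemeta = $text =~ /$escaped/ ? 1 : 0;

print "by hand: $by_hand, quotemeta: $by_quotemeta\n";
```

Letting quotemeta do the escaping is usually safer than escaping by hand when the literal text comes from user input.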
Regular Expression Matching

• Text Matching
• A RegEx can match plain text
• ex. if ($name =~ /Dan/) { print "match"; }
• But this will match Dan, Danny, Daniel, etc…
• Full Text Matching with Anchors
• Might want to match a whole line (or string)
• ex. if ($name =~ /^Dan$/) { print "match"; }
• This will only match Dan
• ^ anchors to the front of the line
• $ anchors to the end of the line
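A small sketch of the difference between the two matches above, using a made-up list of names:

```perl
use strict;
use warnings;

my @names = ('Dan', 'Danny', 'Daniel', 'Bob');

# Unanchored: matches any name that contains "Dan" anywhere.
my @contains = grep { /Dan/ } @names;

# Anchored at both ends: matches only the exact string "Dan".
my @exact = grep { /^Dan$/ } @names;

print "contains: @contains\n";
print "exact:    @exact\n";
```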
Regular Expression Matching

• Order of results
• The search will begin at the start of the string
• This can be altered, don’t ask yet
• Every character is important
• Any plain text in the expression is treated literally
• Nothing is neglected (close doesn’t count)
• Whitespace counts too: / s/ (with a leading space) is not the same as /s/
• Far easier to write than to debug!
Regular Expression Char Classes

• Allows specification of only certain allowable chars


• [dofZ] matches only the letters d, o, f, and Z
• If you have a string ‘dog’ then /[dofZ]/ would match ‘d’ only, even though ‘o’ is also in the class
• So this expression can be stated “match one of either d, o, f, or Z.”
• [A-Za-z] matches any ASCII letter
• [a-fA-F0-9] matches any hexadecimal character
• [^*$/\\] matches anything BUT *, $, /, or \
• The ^ in the front of the char class specifies ‘not’
• In a char class, you only need to escape \ ] - ^
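The character classes above can be tried out with a sketch like this one. The test strings are invented, and the negated class is written with m{...} delimiters and an escaped $ so that Perl does not treat / as the end of the pattern or $/ as a variable.

```perl
use strict;
use warnings;

# A class matches exactly one character; the leftmost match wins.
my ($first) = 'dog' =~ /([dofZ])/;

# Anchors plus + require every character to be a hex digit.
my $is_hex  = 'c0ffee' =~ /^[a-fA-F0-9]+$/ ? 1 : 0;
my $not_hex = 'coffee' =~ /^[a-fA-F0-9]+$/ ? 1 : 0;

# Negated class: the first character that is not *, $, / or \.
my ($plain) = '*$/\\x' =~ m{([^*\$/\\])};

print "first=$first is_hex=$is_hex not_hex=$not_hex plain=$plain\n";
```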
Regular Expression Char Classes

• Special character classes match specific characters


• \d matches a single digit
• \w matches a word character (A-Z, a-z, 0-9, _)
• \b matches a word boundary /\bword\b/
• \s matches a whitespace character (spc, tab, newln)
• . wildcard matches everything except newlines
• Use very carefully, you could get anything!

• To match “anything but…” capitalize the char class


• i.e. \D matches anything that isn’t a digit
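Here is a rough sketch of the special character classes in action. The sample line is made up, and the /g flag (covered later in the deck) is used only to collect every match into a list.

```perl
use strict;
use warnings;

my $line = "Order 66:\tship 12 crates";

my @digits = $line =~ /\d/g;     # one element per digit
my @words  = $line =~ /\w+/g;    # runs of word characters
my @gaps   = $line =~ /\s/g;     # the spaces and the tab

# \b only matches at the edge of a word.
my $whole_word = $line =~ /\bship\b/ ? 1 : 0;   # 'ship' stands alone
my $inside     = $line =~ /\brate\b/ ? 1 : 0;   # 'rate' is buried in 'crates'

# The capitalized class is the negation: \D is "not a digit".
my ($first_non_digit) = '42a7' =~ /(\D)/;

print join(',', @words), "\n";
```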
Regular Expression Char Classes

• Character Class Examples


• $bodyPart =~ /e\w\w/;
• Matches ear, eye, etc
• $thing = '1, 2, 3 strikes!'; $thing =~ /\s\d/;
• Matches ‘ 2’
• $thing = '1, 2, 3 strikes!'; $thing =~ /[\s\d]/;
• Matches ‘1’
• Not always useful to match single characters
• $phone =~ /\d\d\d-\d\d\d-\d\d\d\d/;
• There’s a better way…
Regular Expression Repetition

• Repetition allows for flexibility


• Range of occurrences
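• $weight =~ /\d{2,3}/;
• Matches 2 or 3 digits in a row, e.g. any weight from 10 to 999 (without ^ and $ it also matches inside longer numbers)
• $name =~ /\w{5,}/;
• Matches any name with at least 5 word characters
• if ($SSN !~ /^\d{9}$/) { print "Invalid SSN!"; }
• /^\d{9}$/ matches exactly 9 digits, so !~ flags everything else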
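A hedged sketch of range quantifiers used for validation. The helper names (is_weight, is_ssn, is_phone) are hypothetical, and the patterns are anchored so the counts apply to the whole string rather than to a piece of it.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match and 0 otherwise.
sub is_weight { return $_[0] =~ /^\d{2,3}$/          ? 1 : 0 }  # 2 or 3 digits
sub is_ssn    { return $_[0] =~ /^\d{9}$/            ? 1 : 0 }  # exactly 9 digits
sub is_phone  { return $_[0] =~ /^\d{3}-\d{3}-\d{4}$/ ? 1 : 0 }  # shorter than \d\d\d-...

# Without anchors the same quantifier happily matches part of a longer number.
my ($partial) = '1500' =~ /(\d{2,3})/;

print "partial match inside 1500: $partial\n";
```

Note that is_phone is the "better way" to write the \d\d\d-\d\d\d-\d\d\d\d pattern from the previous slide.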
Regular Expression Repetition

• General Quantifiers
• Some more special characters
• $favoriteNumber =~ /\d*/;
• Matches any size number or no number at all
• $firstName =~ /\w+/;
• Matches one or more word characters
• $middleInitial =~ /\w?/;
• Matches one or zero word characters
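The three general quantifiers can be compared with a short sketch like the following (the input strings are invented). The point to notice is that * and ? succeed even when they match nothing at all.

```perl
use strict;
use warnings;

# * allows zero occurrences, so it succeeds with an empty match at position 0.
my ($star) = 'abc' =~ /(\d*)/;

# + needs at least one occurrence.
my ($plus)    = 'abc123' =~ /(\d+)/;
my $plus_fail = 'abc' =~ /\d+/ ? 1 : 0;

# ? makes the preceding item optional.
my @spellings = 'color colour' =~ /(colou?r)/g;

print "star='$star' plus='$plus' plus_fail=$plus_fail spellings=@spellings\n";
```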
Regular Expression Repetition

• Greedy vs Nongreedy matching


• Greedy matching gets the longest results possible
• Nongreedy matching gets the shortest possible
• Let’s say $robot = 'The12thRobotIs2ndInLine'
• $robot =~ /\w*\d+/; (greedy)
• Matches The12thRobotIs2
• Maximizes the length of \w*
• $robot =~ /\w*?\d+/; (nongreedy)
• Matches The12
• Minimizes the length of \w*?
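To see the robot example run, a sketch like this captures what each pattern actually matched. The extra parentheses are added here only so the matched text can be printed.

```perl
use strict;
use warnings;

my $robot = 'The12thRobotIs2ndInLine';

# Greedy: \w* grabs as much as it can, then backs off until \d+ can match.
my ($greedy) = $robot =~ /(\w*\d+)/;

# Nongreedy: \w*? grabs as little as it can before \d+ matches.
my ($lazy) = $robot =~ /(\w*?\d+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```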
Regular Expression Repetition

• Greedy vs Nongreedy matching


• Suppose $txt = 'something is so cool';
• $txt =~ /something/;
• Matches ‘something’
• $txt =~ /so(mething)?/;
• Matches ‘something’ first and, if the search is repeated, the second ‘so’
• $txt =~ /so(mething)??/;
• Matches only the ‘so’ in ‘something’ and, if the search is repeated, the second ‘so’
• Doesn’t really make sense to do this
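A quick sketch of the optional-group example. It assumes the /g flag (introduced later in the deck) to repeat the search, and swaps the inner group for a non-capturing one so that only the whole match lands in the list.

```perl
use strict;
use warnings;

my $txt = 'something is so cool';

# Greedy ?: take 'mething' whenever it is there.
my @greedy = $txt =~ /(so(?:mething)?)/g;

# Nongreedy ??: skip 'mething' whenever the match can succeed without it.
my @lazy = $txt =~ /(so(?:mething)??)/g;

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```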
Regular Expression Real Life Examples

• Using what you’ve learned so far, you can…


• Validate a standard 8.3 file name
• $path =~ /^\w{1,8}\.[A-Za-z0-9]{2,3}$/
• Account for poorly spelled user input
• $answer =~ /^ban{1,2}an{1,2}a$/
• $iansLastName =~ /^P[ae]t{1,2}ers[oe]n$/
• $iansFirstName =~ /^E?[Ii]?[aeo]?n$/
• Matches Ian, Ean, Eian, Eon, Ien, Ein
• At least everyone gets the n right…
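The patterns on this slide can be checked against a few made-up inputs with a sketch like this:

```perl
use strict;
use warnings;

# Tolerate a doubled n in either position of 'banana'.
my @guesses = ('banana', 'bannana', 'bananna', 'bannanna', 'banananana');
my @fruit   = grep { /^ban{1,2}an{1,2}a$/ } @guesses;

# Every spelling listed on the slide, plus one that should be rejected.
my @spellings = qw(Ian Ean Eian Eon Ien Ein Ann);
my @ians      = grep { /^E?[Ii]?[aeo]?n$/ } @spellings;

# 8.3 file name check using the slide's pattern.
my @files = ('README.txt', 'averylongname.txt', 'notes.md');
my @valid = grep { /^\w{1,8}\.[A-Za-z0-9]{2,3}$/ } @files;

print "fruit: @fruit\nians: @ians\nvalid: @valid\n";
```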
Alternation

• Alternation allows multiple possibilities


• Let $story = 'He went to get his mother';
• $story =~ /^(He|She)\b.*?\b(his|her)\b.*?(mother|father|brother|sister|dog)/;
• Also matches ‘She punched her fat brother’
• Make sure the grouping is correct!
• $ans =~ /^(true|false)$/
• Matches only ‘true’ or ‘false’
• $ans =~ /^true|false$/ (same as /(^true|false$)/)
• Matches ‘true never’ or ‘not really false’
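The grouping pitfall above is easy to reproduce. This sketch runs both patterns over the four answers mentioned on the slide:

```perl
use strict;
use warnings;

my @answers = ('true', 'false', 'true never', 'not really false');

# Grouped: the anchors apply to both alternatives.
my @strict = grep { /^(true|false)$/ } @answers;

# Ungrouped: ^ belongs only to 'true' and $ only to 'false'.
my @loose = grep { /^true|false$/ } @answers;

print 'strict: ', join(' | ', @strict), "\n";
print 'loose:  ', join(' | ', @loose),  "\n";
```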
Grouping for Backreferences

• Backreferences
• With all these wildcards and possible matches, we usually need to know what the expression finally ended up matching.
• Backreferences let you see what was matched
• Can be used after the expression has evaluated or even inside the expression itself
• Handled very differently in different languages
• Numbered from left to right, starting at 1
Grouping for Backreferences

• Perl backreferences
• Used inside the expression
• $txt =~ /\b(\w+)\s+\1\b/
• Finds any duplicated word, must use \1 here
• Used after the expression
• $class =~ /(.+?)-(\d+)/
• The text before the first hyphen is stored in the Perl variable $1 (not \1) and the number after it goes in $2
• print "I am in class $1, section $2";
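Both uses of Perl backreferences can be sketched as follows. The sample strings ('This is is a test' and 'CS101-3') are assumptions made for the example.

```perl
use strict;
use warnings;

# Inside the expression: \1 must repeat whatever group 1 captured.
my $txt   = 'This is is a test';
my ($dup) = $txt =~ /\b(\w+)\s+\1\b/;

# After the expression: $1 and $2 hold the captured groups.
my $class   = 'CS101-3';
my $message = '';
if ($class =~ /(.+?)-(\d+)/) {
    $message = "I am in class $1, section $2";
}

print "duplicated word: $dup\n";
print "$message\n";
```

Checking the result of the match before reading $1 and $2 matters, because a failed match leaves them holding values from an earlier successful match.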
Grouping for Backreferences

• Java backreferences
• Annoying but still useful
• Pattern p = Pattern.compile("(.+?)-(\\d+)");
Matcher m = p.matcher(mySchedule);
if (m.find()) {
    System.out.println("I am in class " + m.group(1) +
        ", section " + m.group(2));
}
• Ugly, but usually better than the alternative
• m.group() returns the entire string matched
Grouping for Backreferences

• Javascript backreferences
• Used inside the expression
• Supported with \1, as in Perl
• Used after the expression
• /(.+?)-(\d+)/.test(course); (class is a reserved word in Javascript, so the variable needs another name)
• alert(RegExp.$1);
• str = str.replace(/(\S+)\s+(\S+)/, "$2 $1");
• RegExp mirrors many of Perl’s special backreference variables as legacy static properties (wait a few slides)
Grouping for Backreferences

• PHP/Python backreferences
• Allows the use of specifically named backreferences
• Groups also maintain their numbers
• .NET backreferences
• Allows named backreferences
• Named groups are numbered after all the unnamed groups, so accessing them by number can give surprising results
• Check the web for info on how to use backreferences in these and other languages.
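For comparison, Perl (version 5.10 and later) also has named groups. This is only a sketch with made-up group names (course, section); it shows that a named group keeps its number as well.

```perl
use 5.010;
use strict;
use warnings;

my $class = 'CS101-3';
my %found;

if ($class =~ /(?<course>.+?)-(?<section>\d+)/) {
    # Named groups are read from the %+ hash...
    $found{course}  = $+{course};
    $found{section} = $+{section};
    # ...and are still available by number.
    $found{first} = $1;
}

print "course=$found{course} section=$found{section} first=$found{first}\n";
```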
Grouping without Backreferences

• Sometimes you just need to make a group


• If important groups must be backreferenced, disable
backreferencing for any unimportant groups
• $sentence =~ /(?:He|She) likes (\w+)\./;
• I don’t care if it’s a he or she
• All I want to know is what he/she likes
• Therefore I use (?:) to forgo the backreference
• $1 will contain that thing that he/she likes
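A short sketch of the difference (?:) makes, using an invented sentence:

```perl
use strict;
use warnings;

my $sentence = 'She likes regex.';

# Capturing group: the pronoun takes up $1, the liked thing is pushed to $2.
my @with_capture = $sentence =~ /(He|She) likes (\w+)\./;

# Non-capturing group: only the liked thing is captured, so it is $1.
my ($thing) = $sentence =~ /(?:He|She) likes (\w+)\./;

print "with capture: @with_capture\n";
print "without:      $thing\n";
```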
Matching Modes

• Matching has different functional modes


• Modes can be set by flags outside the expression (only in some languages & implementations)
• $name =~ /[a-z]+/i;
• i turns off case sensitivity
• $xml =~ /title="([\w ]*)".*keywords="([\w ]*)"/s;
• s enables . to match newlines
• $report =~ /^\s*Name:[\s\S]*?The End\.\s*$/m;
• m lets ^ and $ match at the start and end of every line, not only of the whole string
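The three flags can be compared side by side with a sketch like this, using a made-up two-line string:

```perl
use strict;
use warnings;

# i: ignore case.
my $insensitive = 'DAN' =~ /^[a-z]+$/i ? 1 : 0;

my $text = "first\nsecond";

# s: without it . stops at the newline, with it . matches the newline too.
my $dot_plain = $text =~ /first.second/  ? 1 : 0;
my $dot_s     = $text =~ /first.second/s ? 1 : 0;

# m: without it ^ matches only at the start of the string,
# with it ^ matches at the start of every line.
my @starts_plain = $text =~ /^(\w+)/g;
my @starts_m     = $text =~ /^(\w+)/mg;

print "plain: @starts_plain\n";
print "multi: @starts_m\n";
```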
Matching Modes
• Matching has different functional modes
• Modes can be set by flags inside the expression (support varies by language; Javascript historically lacked it)
• $password =~ /^[a-z](?i)[a-jp-xz0-9]{4,11}$/;
• For an insane web site that specifies that your password must begin with a lowercase letter followed by 4 to 11 upper/lower alphanumeric characters excluding k through o and y.
• $element =~ /^(?i)[A-Z](?-i)[a-z]?$/;
• (?i) makes the first letter case insensitive (if they type o, but meant O, we still know they mean oxygen). (?-i) makes sure the second letter is lowercase, otherwise it’s 2 elements
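As a rough check of the two inline-flag patterns above, the sketch below wraps them in hypothetical helpers (is_element, is_password) and feeds them a few invented inputs.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the slide's patterns; each returns 1 or 0.
sub is_element  { return $_[0] =~ /^(?i)[A-Z](?-i)[a-z]?$/       ? 1 : 0 }
sub is_password { return $_[0] =~ /^[a-z](?i)[a-jp-xz0-9]{4,11}$/ ? 1 : 0 }

# The first letter may be either case, the second must be lowercase.
my @elements = grep { is_element($_) } qw(o O He hE HE);

# The first letter must be lowercase, the rest may be either case.
my @passwords = grep { is_password($_) } qw(aBCDE Abcde abcky abc);

print "elements:  @elements\n";
print "passwords: @passwords\n";
```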
Regular Expression Replacing
• Replacements simplify complex data modification
• Generally the first part of a replace command is the
regular expression and the second part is what to
replace the matched text with
• Usually a backreference variable can be used in the
replacement text to refer to a group matched in the
expression
• The RegEx engine continues searching at the point in
the string following the replacement
• Replacements use all the same syntax, but have
several unique features and are implemented very
differently in various languages.
Regular Expression Replacing
• Perl replacement syntax
• $phone =~ s/\D//;
• Removes the first non-digit character in a phone #
• Note that leaving the replacement blank deletes the matched text
• $html =~ s/^(\s*)/$1\t/;
• Adds a tab to a line of HTML using backreferences
• $sample =~ s/[abc]/[ABC]/;
• Might not do what is expected
• The second part is NOT a regular expression, it’s a
string
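The three replacements above can be run as a sketch like this. The input strings are made up, and the tr/// line at the end is an added guess at what the [abc] example was probably meant to do, not something from the slide.

```perl
use strict;
use warnings;

# Remove only the first non-digit.
(my $phone = '(555) 867-5309') =~ s/\D//;

# Keep the leading whitespace ($1) and add a tab after it.
(my $html = '  <p>') =~ s/^(\s*)/$1\t/;

# The replacement is a plain string, so the brackets are inserted literally.
(my $sample = 'cab') =~ s/[abc]/[ABC]/;

# Assumption: the likely intent was a character-for-character translation.
(my $upper = 'cab') =~ tr/abc/ABC/;

print "$phone\n$sample\n$upper\n";
```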
Regular Expression Replacing
• Java replacement syntax (sucks)
• Pattern p = Pattern.compile("\\\\\\\\server(\\d)");
• netPath = p.matcher(netPath).replaceAll("\\\\\\\\workstation$1");
• Yes, you actually have to use 8 \’s to make \\
• Any \ in the expression needs to be doubled, and doubled again inside a Java string literal
• The Matcher parses the replacement for $1 (and treats \ as an escape there too)
• This has the same effect but avoids recompiling the pattern on every call, unlike
• netPath = netPath.replaceAll("\\\\\\\\server(\\d)", "\\\\\\\\workstation$1");
• No, you can’t use .replace() for this: it takes literal text, not a regular expression
Replacement Modes
• Replacements can be performed singly or globally
• The examples I have been using replace only single
occurrences of patterns
• Use the g flag to force the expression to scan the
entire string
• $phone =~ s/\D//g;
• Removes all non-digits in the phone number
• $myGarage =~ s/Jeep|Cougar/Boeing/g;
• Gives me jets in exchange for cars
• Don’t use it if it’s not necessary
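A minimal sketch of the g flag, using invented inputs. It also shows that s///g returns the number of replacements it made.

```perl
use strict;
use warnings;

# g: remove every non-digit, not just the first one.
my $phone   = '(555) 867-5309';
my $removed = ($phone =~ s/\D//g);

# Alternation and g together replace every car with a jet.
(my $garage = 'Jeep, Cougar, Jeep') =~ s/Jeep|Cougar/Boeing/g;

print "$phone ($removed characters removed)\n";
print "$garage\n";
```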
Combining Replace and Match Modes
• Combining modes is easy
• To combine modes, just append the flags
• $alphabet =~ s/Q//gi;
• Get rid of the pesky letter Q (and q too)
• $response =~ /(?im)"([aeiou].*?)"(?-m)(.*)/;
• This example sucks. Point is you can combine
modes inside the statement, too.
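Combined flags can be sketched as follows. The second example is made up for illustration and simply stacks g, i and m on one match.

```perl
use strict;
use warnings;

# g and i together: delete every Q regardless of case.
(my $alphabet = 'OPQRSTopqrst') =~ s/Q//gi;

# g, i and m together: every line that starts with a vowel, in any case.
my $list   = "Apple\nbanana\norange";
my @vowels = $list =~ /^[aeiou]\w*/gim;

print "$alphabet\n";
print "@vowels\n";
```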
References for Learning More
• Tutorials for other programming languages
• http://www.regular-expressions.info/

• In-depth syntax
• http://kobesearch.cpan.org/htdocs/perl/perlreref.html

• Code Search (ex: ‘ip address regex’)
• http://www.google.com/codesearch
