---
title: Regular expressions — a matching game
author: CSC Training
date: 2019-12
lang: en
---

Matching text

  • A number of Unix text-processing utilities let you search for, and in some cases change, text strings.
    • These utilities include the editing programs ed, ex, vi and sed, the awk programming language, and the commands grep and egrep.
  • Regular expressions, or regexes for short, are patterns that describe which strings of text to match.
  • They are a widely supported notation for searching, parsing, and replacing text.

The simplest regex

  • In its simplest form, a regular expression is a string of symbols to match "as is".

    | Regex | Matches |
    |-------|---------|
    | `abc` | abcdef  |
    | `234` | 12345   |
$ grep '234'
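
As a minimal sketch of the literal match above, the following feeds a few made-up lines to grep through a pipe. The sample lines are illustrative, and the `-o` option (print only the matched part) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Three sample lines; only the line containing the literal string "234" is printed.
printf '%s\n' 'abcdef' '12345' 'xyz' | grep '234'

# With -o, grep prints just the matched text instead of the whole line.
printf '%s\n' 'abcdef' '12345' 'xyz' | grep -o 'abc'
```

The first command prints the whole line 12345, the second prints only abc.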

Basic regexes vs. extended regexes

  • The POSIX Basic Regular Expressions or BRE flavor standardizes a syntax similar to the one used by the traditional UNIX grep command.
    • Only a few characters are special on their own: . (dot), ^ (caret), $ (dollar), [ (opening bracket), and the single quantifier * (star). To match these characters literally, escape them with a \ (backslash).
    • Some implementations support \? and \+, but they are not part of the POSIX standard.
  • The Extended Regular Expressions or ERE flavor, used by egrep and grep -E, adds ?, +, and | and drops the backslashes from the grouping and range syntax. Most modern regex flavors extend it further; by today's standards, POSIX ERE is rather bare bones.
  • We will be using extended regexes, so:
$ alias grep='grep --color=auto -E'
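
The difference between the two flavors can be sketched with a single pattern. This assumes the commands run where the alias above is not active (for example in a script or a fresh shell), so that plain grep really uses basic regexes; the two input lines are made up for illustration.

```shell
# Basic regex (plain grep): + is an ordinary character, so only the line "a+" matches.
printf '%s\n' 'aaa' 'a+' | grep 'a+'

# Extended regex (grep -E): + means "one or more", so both lines match.
printf '%s\n' 'aaa' 'a+' | grep -E 'a+'
```

The remaining examples spell out `grep -E` explicitly instead of relying on the alias, so that they behave the same in scripts, where aliases are normally not expanded.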

Quantifiers

  • To match several characters you need to use a quantifier:

    • * matches any number of what's before it, from zero to infinity.
    • ? matches zero or one of what's before it.
    • + matches one or more of what's before it.
    | Regex  | Matches             |
    |--------|---------------------|
    | `23*4` | 1245, 12345, 123345 |
    | `23?4` | 1245, 12345         |
    | `23+4` | 12345, 123345       |
$ grep '23*4'
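
The three quantifiers can be compared side by side. This is a small sketch that reuses the numbers from the table as input lines and passes `-E` explicitly.

```shell
# The same three input lines are tested against each quantifier.
printf '%s\n' 1245 12345 123345 | grep -E '23*4'   # zero or more 3s: all three lines
printf '%s\n' 1245 12345 123345 | grep -E '23?4'   # zero or one 3: 1245 and 12345
printf '%s\n' 1245 12345 123345 | grep -E '23+4'   # one or more 3s: 12345 and 123345
```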

Regexes are hoggish

  • By default, regexes are greedy. They match as many characters as possible.

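    | Regex | Matches                                |
    |-------|----------------------------------------|
    | `2+`  | 2222 in 122223 (all four 2s, not just one) |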
  • You can define how many instances of a match you want by using ranges:

    • {m} matches exactly m occurrences of what's before it.
    • {m,n} matches from m to n occurrences of what's before it ({0,1} = ?).
    • {m,} matches m or more occurrences of what's before it ({1,} = +).
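
Greediness and ranges are easiest to see when grep prints only the matched text. The sketch below assumes a grep that supports `-o` (GNU and BSD grep do) and uses 122223 as a sample input.

```shell
echo 122223 | grep -oE '2+'       # greedy: takes all four 2s at once and prints 2222
echo 122223 | grep -oE '2{2}'     # exactly two: prints 22 twice
echo 122223 | grep -oE '2{3,}'    # three or more: prints 2222
echo 122223 | grep -oE '2{1,3}'   # one to three: prints 222, then the leftover 2
```

With `-o`, each non-overlapping match is printed on its own line, which is why the last command produces two lines of output.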

Special characters

  • Many special characters are available for building regexes. Here are some of the more common ones:
    • . matches any single character.
    • ^ matches the beginning of the input string (for grep, the beginning of a line).
    • $ matches the end of the input string (for grep, the end of a line).
    • \w matches a word character (a letter, a digit, or an underscore), \W anything else. These two are extensions supported by GNU grep, not part of the POSIX standard.
    • \ escapes special characters, e.g. \. matches a dot, and \\ matches a backslash.

Special character examples

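| Regex     | Matches              | Does not match       |
|-----------|----------------------|----------------------|
| `1.3`     | 1234, 1z3, 0133      | 13                   |
| `1.*3`    | 13, 123, 1zdfkj3     |                      |
| `\w+@\w+` | a@a, email@oy.ab     | ,.-!"#€%&/           |
| `^1.*3$`  | 13, 123, 1zdfkj3     | x13, 123x, x1zdfkj3x |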
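
A few rows of the table can be checked directly with grep. This is only a sketch: the input lines are taken from the table, except for the last command, whose a.b and axb inputs are made up to show escaping.

```shell
# The dot needs exactly one character between 1 and 3, so the line 13 is left out.
printf '%s\n' 1234 1z3 0133 13 | grep -E '1.3'

# Anchors force the whole line to start with 1 and end with 3.
printf '%s\n' 13 123 x13 123x | grep -E '^1.*3$'

# An escaped dot matches only a literal dot.
printf '%s\n' 'a.b' 'axb' | grep -E 'a\.b'
```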

Character classes

  • You can define a character class by listing characters between square brackets. The class then matches any one of the listed characters at that position in the input.
    • [abc] matches any of a, b, and c.
    • [a-z] matches any character between a and z.
      • Note: if you want to include - in the matching characters, add it as the first or last entry in the class, otherwise it will be interpreted as a range definition!
    • [^abc] matches anything other than a, b, or c.
      • Note that here the caret ^ at the beginning indicates "not" instead of beginning of line.
    • [+*?.] matches any of +, *, ? or the dot.
      • Most special characters have no meaning inside the square brackets.

Character class examples

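| Regex               | Matches                                 | Does not match    |
|---------------------|-----------------------------------------|-------------------|
| `[^ab]`             | c, d, abc, sadvbcv                      | a, b, ab          |
| `^[1-9][0-9]*$`     | 1, 45, 101                              | 0123, -1, a1, 2.0 |
| `[0-9]*[,.]?[0-9]+` | `1`, `.1`, `0.1`, `1,000`, `0,0,0.0`    |                   |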
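
The character classes above can be tried out as follows. The inputs come mostly from the table; the a-b, a+b and axb lines in the last command are made up to show a literal - placed last in the class.

```shell
# A negated class matches any line containing at least one character other than a or b.
printf '%s\n' c abc a b ab | grep -E '[^ab]'

# Whole numbers without a leading zero: first digit 1-9, then any number of digits.
printf '%s\n' 1 45 101 0123 -1 a1 2.0 | grep -E '^[1-9][0-9]*$'

# The - is the last entry in the class, so it is a literal character, not a range.
printf '%s\n' 'a-b' 'a+b' 'axb' | grep -E 'a[+-]b'
```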

Grouping and alternatives

  • It might be necessary to group things together, which is done with parentheses ( and ).

    | Regex  | Matches        | Does not match |
    |--------|----------------|----------------|
    | `(ab)` | ab, abab, aabb | aa, bb         |
    • Grouping itself usually does not do much, but combined with other features turns out to be very useful.
  • The OR operator | may be used for alternatives.

    | Regex      | Matches        | Does not match |
    |------------|----------------|----------------|
    | `(aa\|bb)` | aa, bbaa, aabb |                |
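
Groups become useful once a quantifier or an alternative applies to the whole group. The sketch below reuses the strings from the tables; the anchored variants and the extra cc and ab inputs are additions for illustration.

```shell
# The + applies to the whole group: one or more repetitions of "ab" and nothing else.
printf '%s\n' ab abab aabb | grep -E '^(ab)+$'

# Alternation: any line containing either aa or bb.
printf '%s\n' aa bbaa aabb cc | grep -E '(aa|bb)'

# Anchors outside the group apply to both alternatives.
printf '%s\n' aa bb ab | grep -E '^(aa|bb)$'
```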

Subexpressions

  • With parentheses, you can also define subexpressions: the text matched by a group is stored, and you can refer to it later in the regex with \1 for the first group, \2 for the second, and so on.

    | Regex        | Matches              | Does not match |
    |--------------|----------------------|----------------|
    | `(ab)\1`     | ababcdcd             | ab, abcabc     |
    | `(ab)c.*\1`  | abcabc, abcdefabcdef | abc, ababc     |
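
The back-references from the table can be sketched with grep as follows. One assumption to note: back-references in extended regexes are an extension that GNU grep supports, not something POSIX ERE guarantees, so other grep implementations may behave differently.

```shell
# \1 must repeat whatever the first group matched, so "ab" has to appear twice in a row.
printf '%s\n' ababcdcd ab abcabc | grep -E '(ab)\1'

# "abc", then anything, then the text of group 1 ("ab") again.
printf '%s\n' abcabc abcdefabcdef abc ababc | grep -E '(ab)c.*\1'
```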

Some practical (?) examples

  • Check that an email address has a valid format:

    $ grep '\w[A-Za-z0-9._+-]+[^.]@\w[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
    • \w[A-Za-z0-9._+-]+[^.] matches the part before the @ sign: it starts with a word character, continues with acceptable characters, and does not end with a dot.
    • @ matches the @ sign.
    • \w[A-Za-z0-9.-]+ matches any domain name, incl. dots.
    • \.[A-Za-z]{2,}$ matches a literal dot followed by two or more letters at the end of the line.
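  • Check that a Finnish social security number (personal identity code) has a valid format:

    • Format is ddmmyyCnnnV, where C is the century sign and V is the check character, which can be a digit or a letter.
    $ grep '[0-9]{2}[01][0-9]{3}[-+A][0-9]{3}[0-9ABCDEFHJKLMNPRSTUVWXY]'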
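
Both checks can be tried on a few sample strings. This is a sketch under stated assumptions: the variable names `email` and `hetu` and all input strings are made up for illustration, both patterns are anchored with ^ and $ so that the whole line has to match, and the final character class includes digits because the check character of such a code can be a digit. Only the format is tested; neither pattern verifies that the address exists or that the check character is correct for the preceding digits.

```shell
# Patterns stored in variables for readability (the variable names are illustrative).
email='^\w[A-Za-z0-9._+-]+[^.]@\w[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
hetu='^[0-9]{2}[01][0-9]{3}[-+A][0-9]{3}[0-9ABCDEFHJKLMNPRSTUVWXY]$'

# Only the first address passes: the second ends its first part with a dot,
# and the third has no @ sign at all.
printf '%s\n' 'first.last@example.com' 'user.@example.com' 'no-at-sign.example.com' | grep -E "$email"

# Only the first string has the ddmmyyCnnnV shape: the second uses an invalid
# century sign and the third an invalid check character.
printf '%s\n' '131052-308T' '131052/308T' '131052-308G' | grep -E "$hetu"
```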