title | author | date | lang |
---|---|---|---|
Regular expressions — a matching game | CSC Training | 2019-12 | en |
- A number of Unix text-processing utilities let you search for, and in some cases change, text strings.
  - These utilities include the editing programs `ed`, `ex`, `vi` and `sed`, the `awk` programming language, and the commands `grep` and `egrep`.
- Regular expressions — or regexes for short — are a way to match text with patterns.
- Regular expressions are a pattern matching standard for string parsing and replacement.
- In its simplest form, a regular expression is a string of symbols to match "as is".

Regex | Matches |
---|---|
`abc` | abcdef |
`234` | 12345 |

$ grep '234'
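
The `grep` call above reads standard input, so it can be tried by piping a few lines into it. A minimal sketch, assuming the two sample strings from the table as input (the variable name `matched` is only illustrative):

```bash
# Feed two sample lines to grep; only lines containing the literal text 234 are kept.
matched=$(printf '%s\n' 'abcdef' '12345' | grep '234')
echo "$matched"   # prints 12345
```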
- The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX `grep` command.
  - Besides bracket expressions, the only metacharacters that work without a backslash are `.` (dot), `^` (caret), `$` (dollar), and `*` (star); of these, only `*` is a quantifier. To match these characters literally, escape them with a `\` (backslash).
  - Some implementations support `\?` and `\+`, but they are not part of the POSIX standard.
- The Extended Regular Expressions or ERE flavor builds on BRE, and most modern regex flavors are in turn extensions of ERE. By today's standards, the POSIX ERE flavor is rather bare bones.
- We will be using extended regexes, so:
$ alias grep='grep --color=auto -E'
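
The practical difference between the two flavors shows up with interval braces. The following is a small sketch, assuming GNU `grep`; `command grep` is used so that an alias like the one above is bypassed and the first two calls really run in BRE mode. The sample strings and variable names are made up for illustration.

```bash
# Three made-up sample lines.
input=$(printf '%s\n' 'ab' 'aab' 'a{2}b')

# BRE: unescaped braces are ordinary characters, so this finds the literal text a{2}b.
bre_literal=$(printf '%s\n' "$input" | command grep 'a{2}b')

# BRE: escaped braces form an interval, "exactly two a".
bre_interval=$(printf '%s\n' "$input" | command grep 'a\{2\}b')

# ERE (-E): braces form an interval without any backslashes.
ere_interval=$(printf '%s\n' "$input" | command grep -E 'a{2}b')

echo "$bre_literal"    # a{2}b
echo "$bre_interval"   # aab
echo "$ere_interval"   # aab
```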
- To match several characters you need to use a quantifier:
  - `*` matches any number of what's before it, from zero to infinity.
  - `?` matches zero or one of what's before it.
  - `+` matches one or more of what's before it.

Regex | Matches |
---|---|
`23*4` | 1245, 12345, 123345 |
`23?4` | 1245, 12345 |
`23+4` | 12345, 123345 |

$ grep '23*4'
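
The three quantifiers can be compared on the same input. This is a sketch only: the helper function `matching` is hypothetical, and `grep -E` is spelled out so the snippet also works where no alias is defined.

```bash
# The sample inputs from the table, one per line.
samples=$(printf '%s\n' 1245 12345 123345)

# Hypothetical helper: print the samples matched by the ERE in $1 on one line.
matching() {
    printf '%s\n' "$samples" | grep -E "$1" | paste -s -d ' ' -
}

star=$(matching '23*4')   # zero or more 3s
opt=$(matching '23?4')    # zero or one 3
plus=$(matching '23+4')   # one or more 3s

echo "$star"   # 1245 12345 123345
echo "$opt"    # 1245 12345
echo "$plus"   # 12345 123345
```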
- By default, regexes are greedy: a quantified pattern matches as many characters as possible. For example, `2+` matches the whole run 2222 in 122223, not just the first 2.

Regex | Matches |
---|---|
`2` | 122223 |

- You can define how many instances of a match you want by using ranges:
  - `{m}` matches exactly m of what's before it.
  - `{m,n}` matches m to n of what's before it (`{0,1}` = `?`).
  - `{m,}` matches m or more of what's before it (`{1,}` = `+`).
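
Greediness and ranges are easiest to see when only the matched text is printed. A minimal sketch, assuming a `grep` that supports the `-o` option (GNU and BSD `grep` both do); the variable names are illustrative.

```bash
# -o prints each matched part on its own line instead of the whole input line.

# Greedy: 2+ takes the whole run of 2s, not just the first one.
greedy=$(echo '122223' | grep -oE '2+')

# Exactly two: the run of four 2s is consumed as two separate matches.
pairs=$(echo '122223' | grep -oE '2{2}' | paste -s -d ' ' -)

# Three or more: again greedy, so all four 2s are taken.
three_plus=$(echo '122223' | grep -oE '2{3,}')

echo "$greedy"       # 2222
echo "$pairs"        # 22 22
echo "$three_plus"   # 2222
```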
- A lot of special characters are available for regex building. Here are some of the more common ones:
  - `.` matches any single character.
  - `^` matches the beginning of the input string.
  - `$` matches the end of the input string.
  - `\w` matches an alphanumeric character (or underscore), `\W` a non-alphanumeric.
  - `\` escapes special characters, e.g. `\.` matches a dot, and `\\` matches a backslash.

Regex | Matches | Does not match |
---|---|---|
`1.3` | 1234, 1z3, 0133 | 13 |
`1.*3` | 13, 123, 1zdfkj3 | |
`\w+@\w+` | a@a, email@oy.ab | ,.-!"#€%&/ |
`^1.*3$` | 13, 123, 1zdfkj3 | x13, 123x, x1zdfkj3x |
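
The effect of the anchors and of escaping the dot can be checked directly. A short sketch using the sample strings from the table plus two made-up ones (`2.0` and `2x0`); variable names are illustrative.

```bash
samples=$(printf '%s\n' 13 123 1zdfkj3 x13 123x)

# Unanchored: the pattern may match anywhere in a line, so all five lines count.
anywhere=$(printf '%s\n' "$samples" | grep -cE '1.*3')

# Anchored: the whole line must start with 1 and end with 3.
whole_line=$(printf '%s\n' "$samples" | grep -E '^1.*3$' | paste -s -d ' ' -)

# Escaped dot: only a literal dot matches, so 2x0 is rejected.
literal_dot=$(printf '%s\n' '2.0' '2x0' | grep -E '2\.0')

echo "$anywhere"      # 5
echo "$whole_line"    # 13 123 1zdfkj3
echo "$literal_dot"   # 2.0
```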
- You can group characters by putting them between square brackets. This way, any character in the class will match any one character in the input.
  - `[abc]` matches any of a, b, and c.
  - `[a-z]` matches any character between a and z.
    - Note: if you want to include `-` in the matching characters, add it as the first or last entry in the class, otherwise it will be interpreted as a range definition!
  - `[^abc]` matches anything other than a, b, or c.
    - Note that here the caret `^` at the beginning indicates "not" instead of beginning of line.
  - `[+*?.]` matches any of +, *, ? or the dot.
    - Most special characters have no meaning inside the square brackets.

Regex | Matches | Does not match |
---|---|---|
`[^ab]` | c, d, abc, sadvbcv | a, b, ab |
`^[1-9][0-9]*$` | 1, 45, 101 | 0123, -1, a1, 2.0 |
`[0-9]*[,.]?[0-9]+` | 1, .1, 0.1, 1,000, 0,0,0.0 | |
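
Character classes can be tried out the same way. The sketch below reuses the sample strings from the table; the strings `x-y`, `xby` and `xyz` and all variable names are made up for illustration.

```bash
# Positive integers without leading zeros.
integers=$(printf '%s\n' 1 45 101 0123 -1 a1 2.0 \
    | grep -E '^[1-9][0-9]*$' | paste -s -d ' ' -)

# Negated class: the line must contain at least one character other than a or b.
not_ab=$(printf '%s\n' a b ab abc | grep -E '[^ab]')

# A hyphen placed last in the class is a literal hyphen, not a range.
with_dash=$(printf '%s\n' 'x-y' 'xby' 'xyz' | grep -E 'x[b-]y' | paste -s -d ' ' -)

echo "$integers"    # 1 45 101
echo "$not_ab"      # abc
echo "$with_dash"   # x-y xby
```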
- It might be necessary to group things together, which is done with parentheses `(` and `)`.

Regex | Matches | Does not match |
---|---|---|
`(ab)` | ab, abab, aabb | aa, bb |

- Grouping by itself usually does not do much, but combined with other features it turns out to be very useful.
- The OR operator `|` may be used for alternatives.

Regex | Matches | Does not match |
---|---|---|
`(aa\|bb)` | aa, bbaa, aabb | |
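
Grouping starts to matter once a quantifier or an alternative is applied to the whole group. A brief sketch with made-up sample strings and illustrative variable names:

```bash
# A quantifier after a group repeats the whole group ...
repeat_group=$(printf '%s\n' ababab abbb | grep -E '^(ab)+$')

# ... while without parentheses it repeats only the preceding character.
repeat_char=$(printf '%s\n' ababab abbb | grep -E '^ab+$')

# Alternation: lines containing either aa or bb.
either=$(printf '%s\n' aa bbaa ab cc | grep -E '(aa|bb)' | paste -s -d ' ' -)

echo "$repeat_group"   # ababab
echo "$repeat_char"    # abbb
echo "$either"         # aa bbaa
```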
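- With parentheses, you can also define subexpressions to store the match after it has happened and then refer to it later on: `\1` refers to whatever the first group matched, `\2` to the second, and so on.

Regex | Matches | Does not match |
---|---|---|
`(ab)\1` | ababcdcd | ab, abcabc |
`(ab)c.*\1` | abcabc, abcdefabcdef | abc, ababc |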
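
A back-reference matches the same text the group matched, not the same pattern. The sketch below assumes a `grep` that accepts back-references together with `-E`, as GNU `grep` does; the words `letter` and `later` are made-up samples.

```bash
# (ab)\1 needs "ab" twice in a row.
samples=$(printf '%s\n' ababcdcd ab abcabc)
twice=$(printf '%s\n' "$samples" | grep -E '(ab)\1')

# ([a-z])\1 finds any lowercase letter that is immediately repeated.
doubled=$(printf '%s\n' letter later | grep -E '([a-z])\1')

echo "$twice"     # ababcdcd
echo "$doubled"   # letter
```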
- Check for a valid format for an email address:
$ grep '\w[A-Za-z0-9._+-]+[^.]@\w[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  - `\w[A-Za-z0-9._+-]+[^.]` matches the part before the @ sign: acceptable characters, not starting or ending with a dot.
  - `@` matches the @ sign.
  - `\w[A-Za-z0-9.-]+` matches any domain name, incl. dots.
  - `\.[A-Za-z]{2,}` matches a literal dot followed by two or more letters; append `$` to the pattern to require this at the end of the line.
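
The email pattern can be exercised on a few candidates. This is a sketch with made-up addresses under the reserved `example.org` domain, not a complete validator. Note that the pattern requires at least three characters before the @ sign, so a short address such as `ab@example.org` is rejected.

```bash
# The pattern from the grep call above, stored in a variable for reuse.
pattern='\w[A-Za-z0-9._+-]+[^.]@\w[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# Made-up candidates: one well-formed, one ending in a dot before @,
# one with a two-character local part, and one without any @ sign.
samples=$(printf '%s\n' 'first.last@example.org' 'user.@example.org' \
    'ab@example.org' 'plain text')

accepted=$(printf '%s\n' "$samples" | grep -E "$pattern")
echo "$accepted"   # first.last@example.org
```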
- Check for a valid format for a Finnish social security number:
  - Format is `ddmmyyCnnnV`, where `C` = century and `V` = check (verification) character, which is a digit or one of the letters listed in the pattern.

$ grep '[0-9]{2}[01][0-9]{3}[-+A][0-9]{3}[0-9ABCDEFHJKLMNPRSTUVWXY]'
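
For validating whole lines, the pattern can be anchored with `^` and `$`. The sketch below is a variant under stated assumptions: it allows a digit as the final check character in addition to the letters, it checks the format only (not the date or whether the check character is arithmetically correct), and all sample strings are made up.

```bash
# Anchored variant: the line must consist of the identifier and nothing else.
pattern='^[0-9]{2}[01][0-9]{3}[-+A][0-9]{3}[0-9ABCDEFHJKLMNPRSTUVWXY]$'

# Made-up candidates: two well-formed, one with an unused check letter (G),
# one with a wrong century sign, and one with extra text before the identifier.
samples=$(printf '%s\n' '131052-308T' '311299A1234' '131052-308G' \
    '131052x308T' 'id: 131052-308T')

well_formed=$(printf '%s\n' "$samples" | grep -E "$pattern" | paste -s -d ' ' -)
echo "$well_formed"   # 131052-308T 311299A1234
```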