10.12. Regex Recap

10.12.1. Literals

Also known as "literal characters"
A literal matches an occurrence of that exact character in the string

Syntax:

a - exact
a|b - alternative

Example:

1 - digit 1 anywhere in text
1|2|3 - digit 1, 2 or 3 anywhere in text
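
The two forms above can be tried out with a minimal sketch like the following. It assumes Python's standard re module; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped 3 items on day 12"

# A literal matches that exact character wherever it occurs.
ones = re.findall(r"1", text)

# Alternation matches any one of the listed literals.
one_two_three = re.findall(r"1|2|3", text)

print(ones)           # ['1']
print(one_two_three)  # ['3', '1', '2']
```

Matches come back in the order they appear in the text, not in the order the alternatives are written.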

10.12.2. Classes

Also known as "Character Classes"
One out of several characters

Syntax:

[abc] - enumeration
[a-z] - range

Examples:

[12345] - digit 1, 2, 3, 4 or 5 anywhere in text
[0-9] - digit from 0 to 9 anywhere in text
[a-z] - lowercase letter from a to z anywhere in text
[A-Z] - uppercase letter from A to Z anywhere in text
[a-zA-Z0-9] - uppercase or lowercase letter (from a to z) or digit (from 0 to 9) anywhere in text
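
A short sketch of the classes listed above, assuming Python's re module. The sample string is invented for the example.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Room B12, floor 3"

digits = re.findall(r"[0-9]", text)       # each digit separately
upper = re.findall(r"[A-Z]", text)        # uppercase ASCII letters
lower = re.findall(r"[a-z]", text)        # lowercase ASCII letters
alnum = re.findall(r"[a-zA-Z0-9]", text)  # letters and digits

print(digits)          # ['1', '2', '3']
print(upper)           # ['R', 'B']
print("".join(lower))  # oomfloor
print("".join(alnum))  # RoomB12floor3
```

A class always matches exactly one character, which is why every result above is a single-character string.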

10.12.3. Metacharacters

Special characters
\ - backslash
^ - caret
$ - dollar sign
. - period or dot
| - vertical bar or pipe symbol
? - question mark
* - asterisk or star
+ - plus sign
( - opening parenthesis
) - closing parenthesis
[ - opening square bracket
] - closing square bracket
{ - opening curly brace
} - closing curly brace

Example:

. - Any character anywhere in text, by default does not match a newline (this changes with re.DOTALL)
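
The sketch below shows, under the assumption of Python's re module, how an unescaped dot differs from an escaped one and what re.DOTALL changes. The sample strings are illustrative only.

```python
import re

# Two lines: "a.b" and "axb" (hypothetical sample text).
text = "a.b\naxb"

# Unescaped dot: any character except a newline.
any_char = re.findall(r"a.b", text)

# Escaped dot: only a literal period.
literal_dot = re.findall(r"a\.b", text)

# By default the dot does not cross the line break...
across_default = re.findall(r"b.a", text)

# ...but with re.DOTALL it does.
across_dotall = re.findall(r"b.a", text, flags=re.DOTALL)

# re.escape() backslash-escapes metacharacters in arbitrary text.
escaped = re.escape("1+1=2")

print(any_char)        # ['a.b', 'axb']
print(literal_dot)     # ['a.b']
print(across_default)  # []
print(across_dotall)   # ['b\na']
```

To match any metacharacter from the list literally, precede it with a backslash or pass the text through re.escape().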

10.12.4. Anchors

Match a position before, after, or between characters

Syntax:

^ - start of text; with re.MULTILINE also the start of each line
$ - end of text (or just before a trailing newline); with re.MULTILINE also the end of each line
\A - start of text (doesn't change meaning with re.MULTILINE)
\Z - end of text (doesn't change meaning with re.MULTILINE)

Examples:

^[0-9] - digit at the line start
[0-9]$ - digit at the line end
\A[0-9] - digit at the text start
[0-9]\Z - digit at the text end
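
The effect of re.MULTILINE on each anchor can be checked with a small sketch such as this one. It assumes Python's re module; the three-line sample text is made up.

```python
import re

# Hypothetical three-line sample text.
text = "1 apple 4\n2 pears 5\nthree plums 6"

# Without re.MULTILINE, ^ and $ refer to the whole text.
start_default = re.findall(r"^[0-9]", text)
end_default = re.findall(r"[0-9]$", text)

# With re.MULTILINE, ^ and $ also match at every line boundary.
start_multi = re.findall(r"^[0-9]", text, flags=re.MULTILINE)
end_multi = re.findall(r"[0-9]$", text, flags=re.MULTILINE)

# \A and \Z ignore the flag and always refer to the whole text.
start_text = re.findall(r"\A[0-9]", text, flags=re.MULTILINE)
end_text = re.findall(r"[0-9]\Z", text, flags=re.MULTILINE)

print(start_default, end_default)  # ['1'] ['6']
print(start_multi, end_multi)      # ['1', '2'] ['4', '5', '6']
print(start_text, end_text)        # ['1'] ['6']
```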

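10.12.5. Negation

Negation inverts a character class: it matches any single character that is not listed

Syntax:

[^...] - negation (caret as the first character inside the brackets)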

Examples:

[0-9] - digit anywhere in text
[^0-9] - anything but a digit anywhere in text
^[0-9] - digit at the beginning of a line
^[^0-9] - not-a-digit at the beginning of a line
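
The caret has two different roles in the examples above: inside brackets it negates, outside it anchors. A minimal sketch, assuming Python's re module and invented sample strings:

```python
import re

# Inside brackets the caret negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Hypothetical two-line sample text.
text = "42 is the answer\nx = 7"

# Outside brackets the caret is an anchor (per line with re.MULTILINE).
digit_at_start = re.findall(r"^[0-9]", text, flags=re.MULTILINE)

# Both roles combined: a non-digit at the start of a line.
non_digit_at_start = re.findall(r"^[^0-9]", text, flags=re.MULTILINE)

print(non_digits)          # ['a', 'b']
print(digit_at_start)      # ['4']
print(non_digit_at_start)  # ['x']
```

Note that a negated class such as [^0-9] also matches newline characters, unlike the dot.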

10.12.6. Shorthands

Shorthand Character Classes

Syntax:

\d - digit anywhere in text; same as [0-9] with re.ASCII, otherwise any Unicode decimal digit
\D - anything but a digit anywhere in text; same as [^0-9] with re.ASCII
\s - whitespace character (space, tab, newline and so on); same as [ \t\n\r\f\v] with re.ASCII, otherwise also other Unicode whitespace such as the non-breaking space
\S - anything but whitespace
\btodo\b - word boundary (strictly an anchor, not a character class), string "todo" being a separate word, but non-word characters can precede or follow: 'todo:', 'todo()'
\Btodo\B - anything but a word boundary, string "todo" being a part of another word, such as: 'mastodont' or 'autodoc'
\w - any Unicode word character: letters (lower or upper, also with diacritics, e.g. ąćęłńóśżź...), digits and underscore
\W - anything but a word character (e.g. whitespace, dots, commas, dashes, brackets)
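
The shorthands above can be exercised with a sketch like the one below. It assumes Python's re module with str patterns (Unicode matching by default); all sample strings are invented.

```python
import re

# Hypothetical sample text.
text = "todo: fix autodoc, call todo() by 2024-01-05"

digits = re.findall(r"\d", text)
whole_words = re.findall(r"\btodo\b", text)   # 'todo:' and 'todo()'
inside_words = re.findall(r"\Btodo\B", text)  # the 'todo' in 'autodoc'

# \s+ as a separator: any run of whitespace.
fields = re.split(r"\s+", "a b\tc\nd")

# \w is Unicode-aware by default and ASCII-only with re.ASCII.
unicode_word = re.findall(r"\w+", "żółw_1")
ascii_word = re.findall(r"\w+", "żółw_1", flags=re.ASCII)

print("".join(digits))  # 20240105
print(whole_words)      # ['todo', 'todo']
print(inside_words)     # ['todo']
print(fields)           # ['a', 'b', 'c', 'd']
print(unicode_word)     # ['żółw_1']
print(ascii_word)       # ['w_1']
```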

10.12.7. Quantifiers

Repetition
How many occurrences of the preceding token to match
Exact - exactly the given number of times
Greedy - prefer the longest match (default), often suits numbers
Lazy - prefer the shortest match, often suits text between delimiters

Exact:

{n} - exactly n repetitions

Greedy:

{,n} - maximum n repetitions, prefer longer (greedy)
{n,} - minimum n repetitions, prefer longer (greedy)
{n,m} - minimum n repetitions, maximum m times, prefer longer (greedy)
* - minimum 0 repetitions, no maximum, prefer longer (alias to {0,}) (greedy)
+ - minimum 1 repetition, no maximum, prefer longer (alias to {1,}) (greedy)
? - minimum 0 repetitions, maximum 1 repetition, prefer longer (alias to {0,1}) (greedy)

Lazy:

{,n}? - maximum n repetitions, prefer shorter
{n,}? - minimum n repetitions, prefer shorter
{n,m}? - minimum n repetitions, maximum m times, prefer shorter
*? - minimum 0 repetitions, no maximum, prefer shorter (alias to {0,}?)
+? - minimum 1 repetition, no maximum, prefer shorter (alias to {1,}?)
?? - minimum 0 repetitions, maximum 1 repetition, prefer shorter (alias to {0,1}?)

Examples:

\d{4} - digit exactly 4 times (exact)
\d{2,4} - digit from 2 to 4 times (greedy, prefer longest)
\d{2,} - digit from 2 to infinity times (greedy, prefer longest)
\d{,4} - digit from 0 to 4 times (greedy, prefer longest)
\d{1,} - at least one digit (greedy, prefer longest)
\d+ - at least one digit, alias to \d{1,} (greedy, prefer longest)
\d{0,} - zero or more digits (greedy, prefer longest)
\d* - zero or more digits, alias to \d{0,} (greedy, prefer longest)
\d{0,1} - optional digit (greedy, prefer longest)
\d? - optional digit, alias to \d{0,1} (greedy, prefer longest)
\d{2,4}? - digit from 2 to 4 times (lazy, prefer shortest)
\d{2,}? - digit from 2 to infinity times (lazy, prefer shortest)
\d{,4}? - digit from 0 to 4 times (lazy, prefer shortest)
\d{1,}? - at least one digit (lazy, prefer shortest)
\d+? - at least one digit, alias to \d{1,}? (lazy, prefer shortest)
\d{0,}? - zero or more digits (lazy, prefer shortest)
\d*? - zero or more digits, alias to \d{0,}? (lazy, prefer shortest)
\d{0,1}? - optional digit (lazy, prefer shortest)
\d?? - optional digit, alias to \d{0,1}? (lazy, prefer shortest)
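
The difference between greedy and lazy quantifiers is easiest to see on concrete input. The following is a minimal sketch, assuming Python's re module; the date and HTML strings are made up for the example.

```python
import re

# Hypothetical sample date.
date = "2024-11-05"

exact = re.findall(r"\d{4}", date)          # exactly four digits
greedy_range = re.findall(r"\d{2,4}", date)  # as many as possible, up to 4
lazy_range = re.findall(r"\d{2,4}?", date)   # as few as possible, at least 2
greedy_plus = re.findall(r"\d+", date)
lazy_plus = re.findall(r"\d+?", date)

# Hypothetical sample markup.
html = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r"<.+>", html)  # runs to the last '>'
lazy_tags = re.findall(r"<.+?>", html)   # stops at the first '>'

print(exact)         # ['2024']
print(greedy_range)  # ['2024', '11', '05']
print(lazy_range)    # ['20', '24', '11', '05']
print(greedy_plus)   # ['2024', '11', '05']
print(lazy_plus)     # ['2', '0', '2', '4', '1', '1', '0', '5']
print(greedy_tags)   # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)     # ['<b>', '</b>', '<i>', '</i>']
```

The tag example illustrates why lazy quantifiers tend to suit text between delimiters: the greedy version swallows everything up to the last closing bracket.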

10.12.8. Groups

Capture parts of a match for later use
Can be named or positional

Syntax:

(...) - unnamed group (positional)
(?P<mygroup>...) - named group (with name: mygroup)
(?:...) - non-capturing group
(?#...) - comment

Examples:

(\d{1,2}) - group with 1 or 2 digits (unnamed group)
(?P<year>\d{4}) - 4 digits in a group named "year" (named group)
(?P<month>\w+) - one or more word characters in a group named "month" (named group)
(?P<day>\d{1,2}) - 1 or 2 digits in a group named "day" (named group)
Nov (\d{1,2}) - text "Nov" followed by 1 or 2 digits (unnamed group)
Nov \d{2}(st|nd|th|rd) - text "Nov" followed by 2 digits and one of: "st", "nd", "th" or "rd" - capture the ordinal
Nov \d{2}(?:st|nd|th|rd) - text "Nov" followed by 2 digits and one of: "st", "nd", "th" or "rd" - do not capture the ordinal
Nov \d{2}st(?#ordinal) - text "Nov" followed by 2 digits and "st"; the comment "ordinal" is ignored by the matcher
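
A sketch of how named, positional and non-capturing groups behave, assuming Python's re module. The date string and the combined pattern are illustrative, not taken from a real data set.

```python
import re

# Hypothetical sample text.
text = "Nov 21st, 1969"

# Named groups for month, day and year; the ordinal suffix is matched
# by a non-capturing group, so it does not appear among the groups.
pattern = r"(?P<month>\w+) (?P<day>\d{1,2})(?:st|nd|rd|th), (?P<year>\d{4})"
m = re.search(pattern, text)

by_name = m.group("day")     # access by name
by_position = m.group(2)     # named groups are also numbered
all_groups = m.groups()
as_dict = m.groupdict()

# With one capturing group, findall returns only that group...
captured = re.findall(r"Nov \d{2}(st|nd|th|rd)", text)
# ...and with a non-capturing group it returns the whole match.
not_captured = re.findall(r"Nov \d{2}(?:st|nd|th|rd)", text)

print(by_name, by_position)  # 21 21
print(all_groups)            # ('Nov', '21', '1969')
print(as_dict)               # {'month': 'Nov', 'day': '21', 'year': '1969'}
print(captured)              # ['st']
print(not_captured)          # ['Nov 21st']
```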

10.12.9. Backreference

Match the same text as previously matched by a capturing group

Syntax:

\g<number> - backreferencing by group number (in the replacement string of re.sub())
\g<name> - backreferencing by group name (in the replacement string of re.sub())
(?P=name) - backreferencing by group name (inside the pattern)

Examples:

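\g<2> \g<1> \g<3> - replacement string: groups 2, 1 and 3, in that order
\g<day> \g<month> \g<year> - replacement string: groups "day", "month" and "year", in that order
<(?P<tagname>[a-z]+)>(.*)</(?P=tagname)> - opening tag, content, and a closing tag with the same name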
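
The backreference forms above can be sketched as follows, assuming Python's re module. The \g<...> forms are used in the replacement string of re.sub(), while (?P=name) is used inside the pattern; the sample strings are invented.

```python
import re

# Hypothetical sample date and a pattern with three named groups.
date = "Nov 21 1969"
pattern = r"(?P<month>\w+) (?P<day>\d{1,2}) (?P<year>\d{4})"

# Reorder the groups by number...
by_number = re.sub(pattern, r"\g<2> \g<1> \g<3>", date)
# ...or by name.
by_name = re.sub(pattern, r"\g<year>-\g<month>-\g<day>", date)

# (?P=tagname) must repeat whatever the "tagname" group matched.
tag = r"<(?P<tagname>[a-z]+)>(.*)</(?P=tagname)>"
balanced = re.search(tag, "<p>hello</p>")
unbalanced = re.search(tag, "<p>hello</b>")

print(by_number)          # 21 Nov 1969
print(by_name)            # 1969-Nov-21
print(balanced.group(2))  # hello
print(unbalanced)         # None
```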

10.12.10. Flags

re.ASCII - perform ASCII-only matching instead of full Unicode matching
re.IGNORECASE - case-insensitive search
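re.LOCALE - makes \w, \b and case-insensitive matching depend on the current locale (bytes patterns only, use is discouraged)
re.MULTILINE - ^ and $ also match at the start and end of each line, not only of the whole text
re.DOTALL - dot (.) also matches newline characters
re.UNICODE - Unicode matching for \w, \d, \s and \b (already the default for str patterns in Python 3, so the flag is redundant)
re.VERBOSE - ignores whitespace in the pattern (except in a character class or when escaped) and allows comments after #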
re.DEBUG - display debugging information during pattern compilation
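
Flags can be combined with the | operator. The sketch below, assuming Python's re module and invented sample text, shows re.IGNORECASE and re.MULTILINE separately and together, plus a commented pattern written with re.VERBOSE.

```python
import re

# Hypothetical three-line sample text.
text = "Alpha\nbeta\nGAMMA"

no_flags = re.findall(r"^[a-z]+$", text)
ignorecase = re.findall(r"^[a-z]+$", text, flags=re.IGNORECASE)
multiline = re.findall(r"^[a-z]+$", text, flags=re.MULTILINE)
both = re.findall(r"^[a-z]+$", text, flags=re.IGNORECASE | re.MULTILINE)

# re.VERBOSE: whitespace in the pattern is ignored, '#' starts a comment.
date = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", flags=re.VERBOSE)
parts = date.search("released 2024-11-05").groupdict()

print(no_flags)    # []
print(ignorecase)  # []
print(multiline)   # ['beta']
print(both)        # ['Alpha', 'beta', 'GAMMA']
print(parts)       # {'year': '2024', 'month': '11', 'day': '05'}
```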

10.12.11. Python

re.findall() - all matches at once, returns list[str] (or a list of tuples when the pattern has more than one group)
re.finditer() - all matches one at a time, returns Iterator[re.Match]
re.search() - whether text contains the pattern (stops after the first match), returns re.Match | None
re.match() - whether text matches the pattern at its beginning (validation, e.g. email, ssn, tax id, phone; use re.fullmatch() to require the whole text to match), returns re.Match | None
re.split() - splits text by pattern, returns list[str]
re.sub() - replaces pattern matches in text (the replacement can reference groups, works best with named groups), returns str
re.compile() - prepares pattern for further use (match against it), returns re.Pattern
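
The functions listed above can be compared side by side on one input. This is a minimal sketch, assuming Python's re module; the addresses are made up and the email pattern is deliberately simplified, not a real validator.

```python
import re

# Hypothetical sample text.
text = "alice@example.com, bob@example.org"

# Simplified email pattern, for illustration only.
email = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")

found = email.findall(text)                       # list of strings
starts = [m.start() for m in email.finditer(text)]  # Match objects, lazily

contains_bob = re.search(r"bob", text)   # anywhere in the text
starts_with_bob = re.match(r"bob", text)  # only at the beginning

pieces = re.split(r",\s*", text)
masked = re.sub(r"@[a-z.]+", "@hidden", text)

print(found)            # ['alice@example.com', 'bob@example.org']
print(starts)           # [0, 19]
print(contains_bob is not None)  # True
print(starts_with_bob)  # None
print(pieces)           # ['alice@example.com', 'bob@example.org']
print(masked)           # alice@hidden, bob@hidden
```

Compiling the pattern once with re.compile() is mostly a readability choice here: the compiled object exposes the same methods as the module-level functions.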