Anchors
Anchors aren't characters - they're positions. They match where you are in the string, not what is there.
- ^ - start of string (or line, with the m flag)
- $ - end of string (or line, with the m flag)
- \b - word boundary (between a word char and a non-word char)
- \B - anywhere except a word boundary
^cat$ matches the exact string "cat" but not "cats" or " cat". \bcat\b matches "cat" anywhere, including inside "the cat sat" but not inside "category".
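As a quick check of the two examples above, here is a minimal JavaScript sketch. The sample strings and variable names are illustrative only.

```javascript
// Anchors match positions, so they constrain where a match may sit.
const exactCat = /^cat$/;    // the whole string must be "cat"
const wordCat = /\bcat\b/;   // "cat" as a standalone word, anywhere

const anchorSamples = ["cat", "cats", " cat", "the cat sat", "category"];

const anchorResults = anchorSamples.map((input) => ({
  input,
  exact: exactCat.test(input),
  word: wordCat.test(input),
}));

for (const row of anchorResults) {
  console.log(JSON.stringify(row));
}
// Only "cat" passes the exact test.
// "cat", " cat" and "the cat sat" pass the word test; "cats" and "category" do not.
```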
Character classes
- . - any character (except newline by default; the s flag includes newline)
- \d - digit (0-9)
- \D - non-digit
- \w - word char (a-z A-Z 0-9 _)
- \W - non-word char
- \s - whitespace (space, tab, newline)
- \S - non-whitespace
- [abc] - any one of a, b, c
- [^abc] - anything except a, b, c
- [a-z] - range. Inside [], the dash is a range. Outside, it's a literal dash.
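A short JavaScript sketch of a few of these classes in use. The input strings are made up for illustration.

```javascript
// \d+ picks out runs of digits.
const digitRuns = "Order #4521, qty 12".match(/\d+/g);

// \w includes the underscore but not the hyphen.
const wordRuns = "snake_case and kebab-case".match(/\w+/g);

// A positive class: remove any one of the listed characters.
const withoutVowels = "regular".replace(/[aeiou]/g, "");

// A negated class with ranges: is there any char that is NOT a lowercase hex digit?
const hasNonHex = /[^0-9a-f]/.test("c0ffee");

console.log(digitRuns);      // [ '4521', '12' ]
console.log(wordRuns);       // [ 'snake_case', 'and', 'kebab', 'case' ]
console.log(withoutVowels);  // rglr
console.log(hasNonHex);      // false
```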
Quantifiers
- x? - 0 or 1 of x
- x* - 0 or more of x
- x+ - 1 or more of x
- x{n} - exactly n
- x{n,m} - between n and m
- x{n,} - at least n
By default, quantifiers are greedy - they match as much as possible. Adding a ? after them makes them lazy - match as little as possible. .* matches as much as it can; .*? matches as little as it can. The lazy form is what you want when extracting between delimiters: <.*> on <a>X</a> matches the entire string; <.*?> matches each tag separately.
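The greedy and lazy behavior described above can be checked with a small JavaScript sketch. The bounded-quantifier check at the end is an extra illustration, not taken from the text.

```javascript
const tagged = "<a>X</a>";

// Greedy: .* runs to the last ">" it can reach.
const greedyTags = tagged.match(/<.*>/g);

// Lazy: .*? stops at the first ">" that lets the match succeed.
const lazyTags = tagged.match(/<.*?>/g);

console.log(greedyTags); // [ '<a>X</a>' ]
console.log(lazyTags);   // [ '<a>', '</a>' ]

// {n,m} bounds the repetition count: here, two to four digits.
const twoToFourDigits = /^\d{2,4}$/;
const boundedResults = ["1", "12", "1234", "12345"].map((s) => twoToFourDigits.test(s));
console.log(boundedResults); // [ false, true, true, false ]
```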
Groups
- (abc) - capturing group. Available as $1, $2, etc., in replacements.
- (?:abc) - non-capturing group. Same matching behavior, no capture slot, so later groups keep their numbers. Sometimes marginally faster.
- (?<name>abc) - named capture. Reference as $<name> in replacements.
- a|b - alternation. Matches a or b.
Groups also let you apply quantifiers to multi-character sequences: (ab)+ matches one or more "ab"s.
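A minimal JavaScript sketch of numbered and named captures in a replacement, plus a quantified group. The date string and the group names year, month and day are illustrative choices.

```javascript
const isoDay = "2024-03-15";

// Numbered groups: $1, $2, $3 refer to the captures in order.
const swappedNumbered = isoDay.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");

// Named groups: $<name> in the replacement, .groups on the match object.
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const swappedNamed = isoDay.replace(datePattern, "$<day>/$<month>/$<year>");
const capturedYear = isoDay.match(datePattern).groups.year;

// A non-capturing group lets the quantifier apply to the whole "ab" sequence.
const repeatedAb = /^(?:ab)+$/.test("ababab");

// Alternation inside a group keeps the choice local.
const petPlural = /^(?:cat|dog)s?$/.test("dogs");

console.log(swappedNumbered); // 15/03/2024
console.log(swappedNamed);    // 15/03/2024
console.log(capturedYear);    // 2024
console.log(repeatedAb, petPlural); // true true
```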
Lookarounds (zero-width assertions)
Sometimes you need to match X but only when followed/preceded by Y - without including Y in the match.
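- X(?=Y) - positive lookahead. Match X only when followed by Y.
- X(?!Y) - negative lookahead. Match X only when not followed by Y.
- (?<=Y)X - positive lookbehind. Match X only when preceded by Y.
- (?<!Y)X - negative lookbehind. Match X only when not preceded by Y.
Most modern engines (JS post-ES2018, Python, PCRE, and the third-party regexp2 package for Go) support all four, though some (Python's re module, for example) restrict lookbehind to fixed-width patterns. Go's standard regexp package supports none of them, and JavaScript engines older than ES2018 support lookahead only. Check before you ship.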
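Here is a small JavaScript sketch of all four lookarounds, assuming an ES2018-capable engine. The sample text and the helper name isUnminifiedJs are hypothetical.

```javascript
const receipt = "Price: $42, discount 15%, total $36";

// Positive lookbehind: digits preceded by "$" (the "$" is not part of the match).
const dollarAmounts = receipt.match(/(?<=\$)\d+/g);

// Positive lookahead: digits followed by "%" (the "%" is not part of the match).
const percentValues = receipt.match(/\d+(?=%)/g);

// Negative lookbehind: a number that does not directly follow "$".
const plainCounts = "5 apples cost $3".match(/(?<!\$)\b\d+/g);

// Negative lookahead at the start: accept *.js but reject *.min.js.
const isUnminifiedJs = (name) => /^(?!.*\.min\.js$).*\.js$/.test(name);

console.log(dollarAmounts); // [ '42', '36' ]
console.log(percentValues); // [ '15' ]
console.log(plainCounts);   // [ '5' ]
console.log(isUnminifiedJs("app.js"), isUnminifiedJs("app.min.js")); // true false
```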
The 12 patterns you'll actually use
- Email (good enough):
^[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,}$ - pragmatic, not RFC 5322 perfect (perfect is ~6,000 characters long).
- URL (HTTP/HTTPS):
^https?://\S+$ - coarse but works.
- IPv4:
^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$
- UUID v4:
^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$ (case-insensitive).
- Hex color:
^#(?:[0-9a-fA-F]{3}){1,2}$ - matches #fff and #ffffff.
- Slug:
^[a-z0-9]+(?:-[a-z0-9]+)*$ - lowercase, hyphens between, no leading/trailing/double hyphens.
- ISO date (YYYY-MM-DD):
^\d{4}-\d{2}-\d{2}$ - coarse; doesn't validate that 02-30 is impossible.
- Whitespace runs:
\s+ - for collapsing multiple spaces. Replace with a single space.
- Trailing whitespace:
[ \t]+$ with the m flag - strip per line. Avoid \s+$ here: \s also matches newlines, so it swallows blank lines.
- HTML/XML tag:
<([a-z][a-z0-9]*)\b[^>]*>.*?</\1> - naive but fine for snippets. Don't try to fully parse HTML with regex.
- Phone (loose international):
^\+?[0-9\s\-()]{7,}$ - varies massively by country; loose is best.
- Markdown header:
^(#{1,6})\s+(.+)$ with the m flag - captures level and text.
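A sketch of how a handful of these patterns might be wired up in JavaScript. The names patterns, isValid and extractHeaders are hypothetical helpers, and the trailing-whitespace strip uses [ \t]+$ rather than \s+$ so that line breaks and blank lines survive.

```javascript
// A few validators from the list above, copied verbatim into regex literals.
const patterns = {
  ipv4: /^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$/,
  uuidV4: /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i,
  hexColor: /^#(?:[0-9a-fA-F]{3}){1,2}$/,
  slug: /^[a-z0-9]+(?:-[a-z0-9]+)*$/,
  isoDate: /^\d{4}-\d{2}-\d{2}$/,
};

// Hypothetical helper: look up a pattern by name and test a value against it.
function isValid(kind, value) {
  const re = patterns[kind];
  if (!re) throw new Error(`unknown pattern: ${kind}`);
  return re.test(value);
}

// Hypothetical helper: pull Markdown headers out of a document (needs g and m).
function extractHeaders(markdown) {
  return [...markdown.matchAll(/^(#{1,6})\s+(.+)$/gm)].map(
    ([, hashes, text]) => ({ level: hashes.length, text })
  );
}

// Cleanup patterns: collapse whitespace runs, strip trailing spaces per line.
const collapsed = "a   b \t c".replace(/\s+/g, " ");
const stripped = "x  \n\ny\t\n".replace(/[ \t]+$/gm, "");

console.log(isValid("ipv4", "192.168.1.1"), isValid("ipv4", "256.1.1.1")); // true false
console.log(isValid("isoDate", "2024-02-30")); // true: the pattern is coarse
console.log(extractHeaders("# Title\nbody\n## Section 2"));
console.log(JSON.stringify(collapsed), JSON.stringify(stripped));
```

Keeping the patterns in one object makes it easy to paste each into a tester and run it against real data before relying on it.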
Flags
- i - case insensitive
- g - global (find all matches; in JS, matchAll requires it)
- m - multiline (^ and $ match per line)
- s - dotall (. matches newlines)
- u - unicode (in JS, matches by Unicode code point instead of UTF-16 code unit, and enables \p{} properties)
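Each flag's effect can be seen in isolation with a short JavaScript sketch. The sample strings are illustrative.

```javascript
// i: case-insensitive matching.
const caseInsensitive = /hello/i.test("HeLLo");

// g: without it, match() returns only the first match.
const firstNumber = "a1b22c333".match(/\d+/)[0];
const allNumbers = "a1b22c333".match(/\d+/g);

// m: ^ matches at the start of every line, not just the start of the string.
const threeLines = "one\ntwo\nthree";
const firstCharsWithM = threeLines.match(/^\w/gm);
const firstCharsWithoutM = threeLines.match(/^\w/g);

// s: the dot also matches a newline.
const dotWithoutS = /a.b/.test("a\nb");
const dotWithS = /a.b/s.test("a\nb");

// u: one astral code point (U+1F600) counts as a single character for the dot.
const astral = "\u{1F600}";
const singleCharWithoutU = /^.$/.test(astral);
const singleCharWithU = /^.$/u.test(astral);
const isGreek = /\p{Script=Greek}/u.test("\u03BB"); // Greek small letter lambda

console.log(caseInsensitive);                       // true
console.log(firstNumber, allNumbers);               // 1 [ '1', '22', '333' ]
console.log(firstCharsWithM, firstCharsWithoutM);   // [ 'o', 't', 't' ] [ 'o' ]
console.log(dotWithoutS, dotWithS);                 // false true
console.log(singleCharWithoutU, singleCharWithU);   // false true
console.log(isGreek);                               // true
```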
Common mistakes
- Mixing up escaping inside and outside character classes:
[.] matches a literal dot. A bare . outside [] matches any char, so write \. when you mean a dot. Inside [] the dot is already literal.
- Greedy when you wanted lazy:
".*" on "foo" "bar" matches the entire "foo" "bar". Use ".*?".
- Not anchoring:
\d{4} matches inside "abc1234567def" because it isn't anchored. Add ^ and $ for a full-string match.
- Trying to parse HTML with regex:
nested tags, attribute quoting, comments, CDATA - regex can't handle the recursion. Use a real parser.
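The first three mistakes are easy to reproduce. A minimal JavaScript sketch, using the same inputs as the list above plus a made-up "abc" / "a.c" pair for the dot:

```javascript
// Dot escaping: bare dot vs. escaped dot vs. dot inside a class.
const bareDot = /a.c/.test("abc");        // any char matches
const escapedDot = /a\.c/.test("abc");    // needs a literal "."
const classDot = /a[.]c/.test("a.c");     // "." is literal inside []

// Greedy vs. lazy between quotes.
const quoted = '"foo" "bar"';
const greedyQuotes = quoted.match(/".*"/g);
const lazyQuotes = quoted.match(/".*?"/g);

// Anchoring: the unanchored pattern finds four digits anywhere in the input.
const messy = "abc1234567def";
const unanchored = /\d{4}/.test(messy);
const anchored = /^\d{4}$/.test(messy);
const anchoredOk = /^\d{4}$/.test("2024");

console.log(bareDot, escapedDot, classDot);   // true false true
console.log(greedyQuotes);                    // [ '"foo" "bar"' ]
console.log(lazyQuotes);                      // [ '"foo"', '"bar"' ]
console.log(unanchored, anchored, anchoredOk); // true false true
```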
JustKit's role
The JustKit regex tester shows live matches as you type, with each match highlighted in the input. Use it to iterate on a pattern, paste in real-world test data, and verify edge cases before shipping.