Complete Regular Expressions Tutorial

Master regex patterns, syntax, and practical applications

15 min read • Updated January 2025

What Is a Regular Expression?

A Regular Expression (regex or regexp) is a sequence of characters that defines a search pattern. It's a powerful tool used for pattern matching, text searching, validation, and text manipulation across virtually every programming language and text editor.

Originally formalized by mathematician Stephen Cole Kleene in the 1950s, regular expressions have become an essential skill for developers, data scientists, and system administrators. They enable you to describe complex patterns with concise syntax, making tasks like email validation, log parsing, and data extraction significantly easier.

Key Characteristics:

  • Concise pattern-matching language supported across all major programming languages
  • Used for validation, searching, replacing, and extracting text
  • Ranges from simple patterns like "abc" to complex expressions with lookarounds and backreferences
  • Common flavors include POSIX (a formal standard) and PCRE (Perl Compatible Regular Expressions, a widely used library); syntax details vary between engines
  • Powerful but can be difficult to read, so proper commenting and testing are essential

Common Use Cases:

Email validation, phone number formatting, URL parsing, log file analysis, data extraction, input sanitization, syntax highlighting, find-and-replace operations.

Basic Regex Syntax

Regular expressions are composed of two types of characters: literals and metacharacters.

Literal Characters

Match themselves exactly. Most letters and numbers are literals.

Pattern: cat
Matches: "cat" in "The cat sat on the mat"
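Does not match: "Cat" (matching is case-sensitive by default)
Note: the pattern still finds the "cat" inside "cats", because a regex matches substrings unless you anchor it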

Metacharacters (Special Characters)

Characters with special meaning. Escape them with a backslash to match them literally.

. ^ $ * + ? { } [ ] \ | ( )

Examples:
\. matches a literal period
\? matches a literal question mark
\\ matches a literal backslash
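As a rough illustration of escaping, the sketch below uses Python's re module. The sample strings are made up for the example; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical literal text that happens to contain metacharacters.
literal = "3.14 (approx)?"
text = "pi is 3.14 (approx)? yes"

# Escaped by hand: every metacharacter gets a backslash.
manual_match = re.search(r"3\.14 \(approx\)\?", text)

# re.escape() builds an escaped pattern from the literal string.
escaped_match = re.search(re.escape(literal), text)

# Unescaped, the dot matches any character, so "3x14" is accepted.
loose_hit = re.search(r"3.14", "3x14")
strict_hit = re.search(r"3\.14", "3x14")

print(manual_match.group())   # 3.14 (approx)?
print(escaped_match.group())  # 3.14 (approx)?
print(loose_hit is not None, strict_hit is None)  # True True
```

Escaping by helper is usually the safer choice when the literal text comes from user input.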

The Dot Metacharacter (.)

Matches any single character except newline.

Pattern: c.t
Matches: "cat", "cot", "c3t", "c@t"
Does not match: "ct" or "cart"

Escape Sequences

Special sequences starting with backslash.

\d  matches any digit (0-9)
\D  matches any non-digit
\w  matches any word character (a-z, A-Z, 0-9, _)
\W  matches any non-word character
\s  matches any whitespace (space, tab, newline)
\S  matches any non-whitespace

Examples with Escape Sequences

Pattern: \d\d\d-\d\d\d\d
Matches: "555-1234" (phone number format)

Pattern: \w+@\w+\.\w+
Matches: "user@example.com" (simple email pattern)
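The shorthand classes can be tried out with a few lines of Python. This is only a sketch with invented sample text; it also shows why the simple email pattern is too loose for real validation.

```python
import re

text = "Order 42 shipped to user_7 on 2025-01-27"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
chunks = re.findall(r"\S+", text)   # runs of non-whitespace

print(digits)  # ['42', '7', '2025', '01', '27']
print(chunks)  # ['Order', '42', 'shipped', 'to', 'user_7', 'on', '2025-01-27']

# The simple email pattern finds a plain address...
simple_email = r"\w+@\w+\.\w+"
plain = re.search(simple_email, "mail user@example.com now").group()

# ...but \w does not include ".", so a dotted local part is cut short.
dotted = re.search(simple_email, "john.doe@example.com").group()

print(plain)   # user@example.com
print(dotted)  # doe@example.com
```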

Character Classes and Ranges

Character classes let you match any one character from a set of characters.

Basic Character Class [...]

Matches any single character within the brackets.

Pattern: [aeiou]
Matches: any single vowel
Example: "a" in "cat", "e" in "bed"

Pattern: [0-9]
Matches: any single digit (equivalent to \d)

Pattern: [a-z]
Matches: any lowercase letter

Pattern: [A-Za-z0-9]
Matches: any alphanumeric character

Negated Character Class [^...]

Matches any character NOT in the brackets.

Pattern: [^0-9]
Matches: any non-digit character

Pattern: [^aeiou]
Matches: any character that is not a lowercase vowel (consonants, uppercase letters, digits, punctuation)

Pattern: [^\s]
Matches: any non-whitespace character

Ranges

Use hyphen to specify a range of characters.

[a-z]     lowercase letters a through z
[A-Z]     uppercase letters A through Z
[0-9]     digits 0 through 9
[a-zA-Z]  any letter
[0-9a-f]  hexadecimal digits

Special Characters in Character Classes

Most metacharacters lose their special meaning inside brackets. The exceptions are \, ], ^ (when it comes first), and - (between two characters).

Pattern: [.$+]
Matches: literal period, dollar sign, or plus
(No need to escape . $ + inside brackets)

Pattern: [\\[\]]
Matches: backslash, opening bracket, or closing bracket
(Still need to escape \ [ ] even inside brackets)

Practical Example

Pattern: [A-Z][a-z]*
Matches: Words starting with an uppercase letter
Example: "Hello" in "Hello world", "The" and "Cat" in "The Cat"

Pattern: #[0-9A-Fa-f]{6}
Matches: Hex color codes like "#FF5733" or "#a1b2c3"
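A small Python sketch of character classes in action. The CSS fragment and phone string are invented sample data.

```python
import re

css = "color: #FF5733; background: #a1b2c3; border: #12345G;"

# Six hex digits after "#"; "#12345G" fails because G is outside the class.
hex_colors = re.findall(r"#[0-9A-Fa-f]{6}", css)
print(hex_colors)  # ['#FF5733', '#a1b2c3']

# A simple class picks out individual characters.
vowels = re.findall(r"[aeiou]", "regular expression")
print(vowels)  # ['e', 'u', 'a', 'e', 'e', 'i', 'o']

# A negated class is handy for stripping everything except digits.
digits_only = re.sub(r"[^0-9]", "", "(555) 123-4567")
print(digits_only)  # 5551234567
```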

Quantifiers

Quantifiers specify how many times a pattern should match.

Asterisk (*) - Zero or More

Matches zero or more occurrences of the preceding element.

Pattern: ab*c
Matches: "ac", "abc", "abbc", "abbbc"
(zero or more b's between a and c)

Pattern: \d*
Matches: "", "5", "123", "999999"
(including empty string)

Plus (+) - One or More

Matches one or more occurrences of the preceding element.

Pattern: ab+c
Matches: "abc", "abbc", "abbbc"
Does not match: "ac" (requires at least one b)

Pattern: \d+
Matches: "5", "123", "999999"
Does not match: "" (requires at least one digit)

Question Mark (?) - Zero or One

Matches zero or one occurrence (makes preceding element optional).

Pattern: colou?r
Matches: "color" or "colour"

Pattern: https?://
Matches: "http://" or "https://"

Pattern: -?\d+
Matches: "5" or "-5" (optional minus sign)

Curly Braces {n,m} - Specific Counts

{n}     exactly n times
{n,}    n or more times
{n,m}   between n and m times

Examples:
\d{3}       exactly 3 digits: "123"
\d{3,}      3 or more digits: "123", "12345"
\d{3,5}     3 to 5 digits: "123", "1234", "12345"
[a-z]{2,4}   2 to 4 lowercase letters: "ab", "cat", "test"

Greedy vs. Lazy Quantifiers

By default, quantifiers are greedy (match as much as possible). Add ? to make them lazy.

Greedy: <.+>
Matches: the whole string "<b>Hello</b> <i>World</i>"
(matches everything from first < to last >)

Lazy: <.+?>
Matches: "<b>" and "</b>" separately in "<b>Hello</b>"
(matches shortest possible string)

Examples:
.*?   lazy version of .*
.+?   lazy version of .+
.{2,5}?  lazy version of .{2,5}
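The difference between greedy and lazy matching is easy to see by running both forms over the same input. A minimal Python sketch:

```python
import re

html = "<b>Hello</b> <i>World</i>"

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>Hello</b> <i>World</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```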

Practical Examples

Pattern: \d{3}-\d{3}-\d{4}
Matches: US phone numbers like "555-123-4567"

Pattern: \b[A-Z][a-z]{2,}\b
Matches: Capitalized words of 3+ letters

Pattern: \$\d+(\.\d{2})?
Matches: Dollar amounts like "$5" or "$19.99"
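The quantifier patterns above can be checked against sample text like this. The strings are invented; note that Python's re.findall returns group contents when a pattern has groups, so the sketch uses finditer to get whole matches.

```python
import re

# Exact counts: only the full 3-3-4 number qualifies.
phones = re.findall(r"\d{3}-\d{3}-\d{4}", "Call 555-123-4567 or 555-1234")
print(phones)  # ['555-123-4567']

# Optional group: cents are matched only when exactly two digits follow the dot.
prices_text = "Coffee $5, lunch $19.99, tip $2.5"
amounts = [m.group() for m in re.finditer(r"\$\d+(\.\d{2})?", prices_text)]
print(amounts)  # ['$5', '$19.99', '$2']
```

The last result shows a limitation: "$2.5" is reported as "$2" because ".5" does not satisfy \.\d{2}.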

Anchors and Boundaries

Anchors match positions in the string, not actual characters.

Caret (^) - Start of String

Matches the position at the beginning of the string.

Pattern: ^Hello
Matches: "Hello world"
Does not match: "Say Hello" (Hello not at start)

Pattern: ^\d+
Matches: "123 Main St" (digits at start)
Does not match: "Main St 123"

Dollar ($) - End of String

Matches the position at the end of the string.

Pattern: world$
Matches: "Hello world"
Does not match: "world is big" (world not at end)

Pattern: \d+$
Matches: "Main St 123" (digits at end)
Does not match: "123 Main St"

Combining ^ and $

Match the entire string (exact match).

Pattern: ^\d{3}$
Matches: "123" (exactly 3 digits, nothing else)
Does not match: "1234" or "12" or "123 "

Pattern: ^[A-Za-z]+$
Matches: "Hello" (only letters, no spaces/numbers)
Does not match: "Hello World" or "Hello123"
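A short Python sketch of how anchoring changes the result. One engine-specific detail worth knowing: in Python, $ also matches just before a trailing newline, so re.fullmatch (or \Z) is the stricter choice for validation.

```python
import re

# Unanchored: finds three digits anywhere in the input.
unanchored = re.search(r"\d{3}", "1234")

# Anchored: the whole input must be exactly three digits.
anchored = re.search(r"^\d{3}$", "1234")

# In Python, $ tolerates one trailing newline; fullmatch does not.
dollar_newline = re.search(r"^\d{3}$", "123\n")
full_newline = re.fullmatch(r"\d{3}", "123\n")

print(unanchored.group())          # 123
print(anchored)                    # None
print(dollar_newline is not None)  # True
print(full_newline)                # None
```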

Word Boundary (\b)

Matches the position between a word character and a non-word character (or the start or end of the string).

Pattern: \bcat\b
Matches: "cat" in "The cat sat"
Does not match: "cat" in "catch" or "scat"

Pattern: \b\d{3}\b
Matches: "123" in "code 123 here"
Does not match: "123" in "code1234"

Non-Word Boundary (\B)

Matches where \b does not match.

Pattern: \Bcat\B
Matches: "cat" in "concatenate"
Does not match: "cat" in "The cat" or "catch"

Pattern: \B\d+\B
Matches: "2345" in "123456" (the first and last digits sit next to a boundary)
Does not match: the full "123" in "code 123" (only the inner "2" qualifies)
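The boundary assertions can be explored with a quick Python experiment. The sample sentence is made up for illustration.

```python
import re

sample = "cat catch concatenate scat"

# \b...\b replaces only the standalone word.
replaced = re.sub(r"\bcat\b", "dog", sample)
print(replaced)  # dog catch concatenate scat

# \B...\B finds "cat" only where it has word characters on both sides.
inner_starts = [m.start() for m in re.finditer(r"\Bcat\B", sample)]
print(inner_starts)  # [13] -> the "cat" inside "concatenate"

# \B\d+\B cannot touch the first or last digit of a number.
inner_digits = re.findall(r"\B\d+\B", "123456")
print(inner_digits)  # ['2345']
```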

Practical Examples

Pattern: ^\s*$
Matches: Empty lines or whitespace-only lines

Pattern: \b[A-Z]{2,}\b
Matches: Acronyms like "USA", "NASA", "HTTP"

Pattern: ^[A-Z].*\.$
Matches: Sentences starting with capital and ending with period

Groups and Capturing

Groups allow you to treat multiple characters as a single unit and capture matched text for later use.

Capturing Groups (...)

Parentheses create a capturing group that remembers the matched text.

Pattern: (\d{3})-(\d{3})-(\d{4})
Matches: "555-123-4567"
Captures: Group 1: "555", Group 2: "123", Group 3: "4567"

Pattern: (https?)://([^/]+)
Matches: "https://example.com"
Captures: Group 1: "https", Group 2: "example.com"

Non-Capturing Groups (?:...)

Groups characters without capturing them (better performance).

Pattern: (?:Mr|Mrs|Ms)\. ([A-Z][a-z]+)
Matches: "Mr. Smith", "Mrs. Johnson"
Captures: Only the name (not the title)

Pattern: (?:https?|ftp)://[^\s]+
Matches: URLs with http, https, or ftp
Does not capture the protocol separately

Backreferences (\1, \2, etc.)

Reference previously captured groups within the same pattern.

Pattern: (\w+)\s+\1
Matches: Repeated words like "the the" or "is is"

Pattern: <(\w+)>.*?</\1>
Matches: HTML tags like "<b>text</b>" or "<div>content</div>"
(ensures closing tag matches opening tag)
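Here is a hedged Python sketch of capturing groups and backreferences working together. The inputs are invented, and the \b anchors around the repeated-word pattern are an addition for the example: without them the pattern can match across the end of one word and the next.

```python
import re

# Capturing groups expose each part of the match.
m = re.search(r"(\d{3})-(\d{3})-(\d{4})", "Call 555-123-4567 today")
print(m.groups())  # ('555', '123', '4567')

# Without word boundaries, "This is" yields a false positive ("is is").
loose = re.search(r"(\w+)\s+\1", "This is a test")
print(loose.group())  # is is

# With \b on both sides, only genuinely repeated words match.
strict = re.search(r"\b(\w+)\s+\1\b", "This is a test")
print(strict)  # None

# The backreference is also available in the replacement string.
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", "the the cat sat on on the mat")
print(deduped)  # the cat sat on the mat
```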

Pattern: (['"])(.*?)\1
Matches: Quoted strings with matching quotes
"hello" or 'world' (not mixing quotes)

Named Capturing Groups (?<name>...)

Give groups meaningful names for better readability.

Pattern: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
Matches: "2025-01-27"
Captures: year: "2025", month: "01", day: "27"

Pattern: (?<protocol>https?)://(?<domain>[^/]+)
Matches: "https://example.com"
Captures: protocol: "https", domain: "example.com"
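Named groups look slightly different in Python, whose re module spells them (?P<name>...) rather than (?<name>...). A minimal sketch with an invented sample sentence:

```python
import re

# Python syntax for named groups: (?P<name>...)
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = date_re.search("Released on 2025-01-27.")

print(m.group("year"))  # 2025
print(m.groupdict())    # {'year': '2025', 'month': '01', 'day': '27'}

# Named groups are still numbered, so positional access keeps working.
print(m.group(2))       # 01
```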

Alternation (|) - OR Operator

Match one pattern or another.

Pattern: cat|dog
Matches: "cat" or "dog"

Pattern: (jpg|jpeg|png|gif)$
Matches: Image file extensions

Pattern: \b(yes|no|maybe)\b
Matches: Complete words "yes", "no", or "maybe"

Practical Examples

Pattern: (\d{1,3}\.){3}\d{1,3}
Matches: IP addresses like "192.168.1.1"

Pattern: ([A-Z][a-z]+)\s+([A-Z][a-z]+)
Matches: Full names like "John Smith"
Captures: First name and last name separately

Pattern: (?:TODO|FIXME|HACK):\s*(.+)
Matches: Code comments like "TODO: Fix this"
Captures: Only the comment text
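A sketch of alternation in Python, using a made-up source snippet. It also illustrates a precedence detail: | has the lowest precedence, so anchors bind to a single alternative unless the alternatives are grouped.

```python
import re

source = """x = 1  # TODO: rename this variable
y = 2  # FIXME:   handle negative input
z = 3  # NOTE: not captured"""

# The non-capturing group selects the tag; the capturing group keeps the text.
notes = re.findall(r"(?:TODO|FIXME|HACK):\s*(.+)", source)
print(notes)  # ['rename this variable', 'handle negative input']

# Ungrouped: read as "^cat" OR "dog$", so "catalog" matches.
ungrouped = re.search(r"^cat|dog$", "catalog")

# Grouped: the whole string must be "cat" or "dog".
grouped = re.search(r"^(?:cat|dog)$", "catalog")

print(ungrouped.group())  # cat
print(grouped)            # None
```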

Lookaheads and Lookbehinds

Lookarounds are zero-width assertions that match a position based on what comes before or after, without consuming characters.

Positive Lookahead (?=...)

Matches if followed by the pattern, but doesn't consume it.

Pattern: \d+(?= dollars)
Matches: "100" in "100 dollars"
Does not match: "100" in "100 euros"
(matches number only if followed by " dollars")

Pattern: [A-Za-z]+(?=\d)
Matches: "test" in "test123"
Does not match: "test" in "test"

Negative Lookahead (?!...)

Matches if NOT followed by the pattern.

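Pattern: \d+\b(?! dollars)
Matches: "100" in "100 euros"
Does not match: "100" in "100 dollars"
(the \b matters: without it, \d+ backtracks and matches "10" in "100 dollars")

Pattern: \b(?!\w*ing\b)\w+\b
Matches: Words NOT ending in "ing", such as "test" and "code"
Does not match: "running", "testing"
(the lookahead sits at the start of the word; placed after \w+, it would be checked only after the whole word had been consumed)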
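A lookahead placed after a greedy quantifier is easy to get wrong, because the engine can backtrack to a shorter match or consume the very text the lookahead was meant to inspect. The Python sketch below compares both placements; the word list is invented sample data.

```python
import re

# After \d+, the engine backtracks to "10", which is not followed by " dollars".
backtracked = re.search(r"\d+(?! dollars)", "100 dollars")
print(backtracked.group())  # 10

# A word boundary after \d+ rules out the shorter match.
bounded = re.search(r"\d+\b(?! dollars)", "100 dollars")
print(bounded)  # None

words = ["test", "code", "running", "testing"]

# Lookahead after \w+: the word is already consumed, so nothing is rejected.
after_word = re.compile(r"\b\w+(?!ing\b)")
# Lookahead at the start of the word: inspects the word before consuming it.
before_word = re.compile(r"\b(?!\w*ing\b)\w+\b")

print([w for w in words if after_word.fullmatch(w)])
# ['test', 'code', 'running', 'testing']
print([w for w in words if before_word.fullmatch(w)])
# ['test', 'code']
```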

Positive Lookbehind (?<=...)

Matches if preceded by the pattern.

Pattern: (?<=\$)\d+
Matches: "100" in "$100"
Does not match: "100" in "100"
(matches number only if preceded by $)

Pattern: (?<=@)\w+
Matches: "username" in "email@username"
Extracts text after @ symbol

Negative Lookbehind (?<!...)

Matches if NOT preceded by the pattern.

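Pattern: (?<![\$\d])\d+
Matches: "100" in "100"
Does not match: "100" in "$100"
(the \d in the lookbehind stops the engine from matching "00" inside "$100")

Pattern: \b\w+(?<!un)able\b
Matches: "readable", "breakable"
Does not match: "unable"
(only rejects "un" directly before "able"; to reject every word starting with "un", such as "unreachable", use a lookahead at the word start: \b(?!un)\w+able\b)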
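Lookbehinds can be tested the same way. This Python sketch uses an invented price string and shows how a negative lookbehind that only excludes "$" still lets part of a number through.

```python
import re

prices = "Total: $100, quantity 250"

# Positive lookbehind: digits directly after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
print(after_dollar)  # ['100']

# Excluding only "$" still matches "00", which is preceded by "1".
partial = re.findall(r"(?<!\$)\d+", prices)
print(partial)  # ['00', '250']

# Also excluding a preceding digit keeps matches to whole numbers.
whole = re.findall(r"(?<![\$\d])\d+", prices)
print(whole)  # ['250']
```

Python's re module requires lookbehind patterns to have a fixed width, which all three patterns here satisfy.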

Combining Lookarounds

Use multiple lookarounds for complex conditions.

Pattern: ^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$
Matches: Strong passwords with:
- At least one uppercase letter
- At least one lowercase letter
- At least one digit
- Minimum 8 characters

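Pattern: (?<=\s|^)\d+(?=\s|$)
Matches: Numbers surrounded by spaces or at string boundaries
(engines that require fixed-width lookbehind, such as Python's re module, reject this form; (?<!\S)\d+(?!\S) is an equivalent that works there)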
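As a sketch of combined lookaheads, the password rule can be wrapped in a small Python helper. The function name is_strong and the sample passwords are illustrative only.

```python
import re

# Each lookahead scans from the start of the string for one requirement;
# .{8,} then enforces the minimum length.
STRONG = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$")

def is_strong(password: str) -> bool:
    """Return True if the password meets all four rules."""
    return STRONG.search(password) is not None

for candidate in ["Password123", "password", "Pass1", "PASSWORD123"]:
    print(candidate, is_strong(candidate))
# Password123 True
# password False
# Pass1 False
# PASSWORD123 False
```

The same rules could be written as separate checks (any(c.isupper() ...) and so on), which some teams find easier to read and to report errors from.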

Practical Examples

Pattern: (?<=<title>).*?(?=</title>)
Extracts: Content between <title> tags

Pattern: \w+(?=\.jpg|\.png|\.gif)
Matches: Filenames without extensions (for image files)

Pattern: (?<!\d)-\d+(?!\d)
Matches: Negative numbers not part of larger numbers

Performance Note:

Lookarounds can be computationally expensive. Use them judiciously in performance-critical applications. Support also varies: JavaScript gained lookbehinds only in ES2018, and Go's regexp package (RE2 syntax) supports neither lookarounds nor backreferences.

Common Regex Patterns

Here are battle-tested regex patterns for common validation and extraction tasks.

Email Validation

Basic:
^[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Also allowing "+" in the local part (e.g. user+tag@example.com):
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Matches: user@example.com, john.doe@company.co.uk
Note: Perfect email validation is extremely complex.
Use built-in validators when possible.

URL Validation

Basic:
^https?://[^\s/$.?#].[^\s]*$

More comprehensive:
^https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)$

Matches:
https://example.com
http://www.example.com/path?query=1

Phone Numbers

US Format:
^\d{3}-\d{3}-\d{4}$
Matches: 555-123-4567

Flexible US:
^(\+1\s?)?(\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}$
Matches:
555-123-4567
(555) 123-4567
+1 555-123-4567
5551234567

IP Address (IPv4)

Basic:
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$

Strict (validates ranges 0-255):
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$

Matches: 192.168.1.1, 10.0.0.1
Does not match: 256.1.1.1 (invalid)
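The difference between the basic and strict IPv4 patterns shows up quickly in a test. The sketch below builds the strict pattern from a reusable octet fragment (the variable names are illustrative) and compares it with Python's standard ipaddress module, which is usually the better tool when it is available.

```python
import ipaddress
import re

BASIC = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# One octet in the range 0-255, reused four times in the strict pattern.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def parses_as_ipv4(value: str) -> bool:
    """Check with the standard library instead of a regex."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True

for candidate in ["192.168.1.1", "256.1.1.1", "1.2.3"]:
    print(
        candidate,
        BASIC.fullmatch(candidate) is not None,
        STRICT.fullmatch(candidate) is not None,
        parses_as_ipv4(candidate),
    )
# 192.168.1.1 True True True
# 256.1.1.1 True False False
# 1.2.3 False False False
```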

Date Formats

YYYY-MM-DD:
^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$
Matches: 2025-01-27, 2024-12-31

MM/DD/YYYY:
^(?:0[1-9]|1[0-2])/(?:0[1-9]|[12]\d|3[01])/\d{4}$
Matches: 01/27/2025, 12/31/2024

DD-MM-YYYY:
^(?:0[1-9]|[12]\d|3[01])-(?:0[1-9]|1[0-2])-\d{4}$
Matches: 27-01-2025, 31-12-2024
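These date patterns check the shape of the input, not whether the date exists: "2025-02-31" passes the YYYY-MM-DD pattern. One possible approach, sketched here in Python with a hypothetical helper named is_real_date, is to use the regex as a format filter and leave calendar validation to the standard library.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])")

def is_real_date(value: str) -> bool:
    """Format check with the regex, calendar check with datetime."""
    if ISO_DATE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(ISO_DATE.fullmatch("2025-02-31") is not None)  # True (shape is fine)
print(is_real_date("2025-02-31"))                    # False (no such day)
print(is_real_date("2025-01-27"))                    # True
print(is_real_date("2025-13-01"))                    # False (regex rejects month 13)
```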

Credit Card Numbers

Basic (with optional spaces/dashes):
^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Visa (starts with 4):
^4\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

MasterCard (starts with 51-55):
^5[1-5]\d{2}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Hex Color Codes

^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$

Matches: #FFF, #FFFFFF, #a1b2c3
Does not match: #GGGGGG, #12345

Username (alphanumeric with underscore)

^[a-zA-Z0-9_]{3,16}$

Matches: user123, john_doe, Test_User
Length: 3-16 characters
Allows: letters, numbers, underscore

Password Strength

Strong password (8+ chars, uppercase, lowercase, digit, special):
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Matches: Password123!, Secure@Pass1
Does not match: password (no uppercase/digit/special)
                 Pass123 (no special character)

HTML Tags

Match opening and closing tags:
<(\w+)[^>]*>.*?</\1>

Extract tag content:
<[^>]+>(.*?)</[^>]+>

Remove all HTML tags:
<[^>]*>

Regex in Different Languages

While regex syntax is mostly consistent, each language has its own API for working with regular expressions.

JavaScript

// Creating regex
const regex1 = /\d{3}-\d{3}-\d{4}/;
const regex2 = new RegExp('\\d{3}-\\d{3}-\\d{4}'); // backslashes are doubled inside a string

// Flags: g (global), i (case-insensitive), m (multiline)
const regex = /hello/gi;

// Testing (with the g flag, test() advances regex.lastIndex between calls)
regex.test('Hello World'); // true

// Matching
'555-123-4567'.match(/\d{3}-\d{3}-\d{4}/); // ['555-123-4567']

// Replace
'Hello World'.replace(/world/i, 'JavaScript'); // 'Hello JavaScript'

// Split
'a,b,c'.split(/,/); // ['a', 'b', 'c']

// Match all with groups
const text = 'John: 555-1234, Jane: 555-5678';
const matches = text.matchAll(/(\w+):\s*(\d{3}-\d{4})/g);
for (const match of matches) {
  console.log(match[1], match[2]); // name, phone
}

Python

import re

# Compile regex (recommended for reuse)
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# Search (find first match)
match = re.search(r'\d+', 'Phone: 555-1234')
if match:
    print(match.group())  # '555'

# Match (match from start)
match = re.match(r'\d+', '123-456')
print(match.group())  # '123'

# Find all matches
numbers = re.findall(r'\d+', 'Call 555-1234 or 555-5678')
# ['555', '1234', '555', '5678']

# Replace
text = re.sub(r'\d', 'X', 'Phone: 555-1234')
# 'Phone: XXX-XXXX'

# Split
parts = re.split(r'[,\s]+', 'a, b,  c')
# ['a', 'b', 'c']

# Groups
match = re.search(r'(\w+)@(\w+\.\w+)', 'user@example.com')
match.group(1)  # 'user'
match.group(2)  # 'example.com'

# Flags
re.search(r'hello', 'HELLO', re.IGNORECASE)  # case-insensitive
re.search(r'^line', text, re.MULTILINE)      # multiline mode

PHP

// preg_match - find first match
if (preg_match('/\d{3}-\d{3}-\d{4}/', '555-123-4567', $matches)) {
    echo $matches[0]; // '555-123-4567'
}

// preg_match_all - find all matches
preg_match_all('/\d+/', 'Call 555-1234 or 555-5678', $matches);
print_r($matches[0]); // ['555', '1234', '555', '5678']

// preg_replace - replace
$text = preg_replace('/\d/', 'X', 'Phone: 555-1234');
// 'Phone: XXX-XXXX'

// preg_split - split
$parts = preg_split('/[,\s]+/', 'a, b,  c');
// ['a', 'b', 'c']

// With groups
if (preg_match('/(\w+)@(\w+\.\w+)/', 'user@example.com', $matches)) {
    echo $matches[1]; // 'user'
    echo $matches[2]; // 'example.com'
}

// Flags (modifiers)
// i - case insensitive
// m - multiline
// s - dot matches newline
preg_match('/hello/i', 'HELLO'); // matches

Java

import java.util.regex.*;

// Compile pattern
// (backslashes are doubled inside Java string literals)
Pattern pattern = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
Matcher matcher = pattern.matcher("555-123-4567");

// Find
if (matcher.find()) {
    System.out.println(matcher.group()); // "555-123-4567"
}

// Matches (entire string)
boolean matches = Pattern.matches("\\d+", "12345"); // true

// Replace
String result = "555-123-4567".replaceAll("\\d", "X");
// "XXX-XXX-XXXX"

// Split
String[] parts = "a,b,c".split(",");

// Groups
Pattern p = Pattern.compile("(\\w+)@(\\w+\\.\\w+)");
Matcher m = p.matcher("user@example.com");
if (m.find()) {
    System.out.println(m.group(1)); // "user"
    System.out.println(m.group(2)); // "example.com"
}

// Flags
Pattern caseInsensitive = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);

C# (.NET)

using System.Text.RegularExpressions;

// Match
Match match = Regex.Match("555-123-4567", @"\d{3}-\d{3}-\d{4}");
if (match.Success) {
    Console.WriteLine(match.Value); // "555-123-4567"
}

// IsMatch (boolean test)
bool isValid = Regex.IsMatch("555-123-4567", @"\d{3}-\d{3}-\d{4}");

// Find all matches
MatchCollection matches = Regex.Matches("Call 555-1234 or 555-5678", @"\d+");
foreach (Match m in matches) {
    Console.WriteLine(m.Value);
}

// Replace
string result = Regex.Replace("555-123-4567", @"\d", "X");
// "XXX-XXX-XXXX"

// Split
string[] parts = Regex.Split("a,b,c", ",");

// Groups
Match email = Regex.Match("user@example.com", @"(\w+)@(\w+\.\w+)");
Console.WriteLine(email.Groups[1].Value); // "user"
Console.WriteLine(email.Groups[2].Value); // "example.com"

// Options
Regex regex = new Regex("hello", RegexOptions.IgnoreCase);

Go

import "regexp"

// Compile
re := regexp.MustCompile(`\d{3}-\d{3}-\d{4}`)

// Match (boolean test)
matched := re.MatchString("555-123-4567") // true

// Find
result := re.FindString("Call 555-123-4567") // "555-123-4567"

// Find all
results := re.FindAllString("555-123-4567 or 555-987-6543", -1)
// []string{"555-123-4567", "555-987-6543"}

// Replace
replaced := re.ReplaceAllString("555-123-4567", "XXX-XXX-XXXX")

// Groups
emailRe := regexp.MustCompile(`(\w+)@(\w+\.\w+)`)
matches := emailRe.FindStringSubmatch("user@example.com")
// matches[0] = full match
// matches[1] = first group
// matches[2] = second group

// Split
sepRe := regexp.MustCompile(`,\s*`)
parts := sepRe.Split("a, b, c", -1) // []string{"a", "b", "c"}

Best Practices

  • Test extensively - Use tools like regex101.com or our Regex Tester to validate patterns with real data
  • Keep it simple - Complex patterns are hard to maintain. Break them into smaller parts or use multiple simpler patterns
  • Use non-capturing groups - Use (?:...) instead of (...) when you don't need to capture text (better performance)
  • Be specific - Use [0-9] instead of . when you mean digits. Overly broad patterns can match unintended text
  • Use anchors - Add ^ and $ to ensure complete string matches for validation
  • Compile and reuse - Compile regex patterns once and reuse them for better performance
  • Comment complex patterns - Use comments (in code, not in pattern) to explain what each part does
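  • Handle errors gracefully - Compiling an invalid pattern raises an error in most languages, and some engines support match timeouts. Catch these failures, especially when patterns or input come from users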
  • Beware of catastrophic backtracking - Patterns like (a+)+ can cause exponential time complexity. Test with long inputs
  • Consider alternatives - For simple tasks, string methods might be clearer and faster than regex
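Two of these practices, compiling once and documenting complex patterns, can be combined where the engine offers a verbose (extended) mode that allows whitespace and comments inside the pattern. The sketch below uses Python's re.VERBOSE flag on the flexible US phone pattern; the constant and function names are illustrative.

```python
import re

# Compiled once at import time and reused by every call.
PHONE_RE = re.compile(
    r"""
    (?:\+1\s?)?            # optional country code
    (?:\(\d{3}\)|\d{3})    # area code, with or without parentheses
    [\s.-]?                # optional separator
    \d{3}                  # exchange
    [\s.-]?                # optional separator
    \d{4}                  # line number
    """,
    re.VERBOSE,
)

def is_us_phone(value: str) -> bool:
    """Return True if the whole string is a US phone number."""
    return PHONE_RE.fullmatch(value) is not None

samples = ["555-123-4567", "(555) 123-4567", "+1 555-123-4567", "5551234567", "555-12-4567"]
print([is_us_phone(s) for s in samples])
# [True, True, True, True, False]
```

In verbose mode, literal spaces in the pattern are ignored, so whitespace has to be written as \s or placed inside a character class.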

Common Pitfalls to Avoid:

  • Using regex to parse HTML/XML (use proper parsers instead)
  • Not escaping special characters when matching literals
  • Forgetting that . doesn't match newlines by default
  • Overusing greedy quantifiers in large texts
  • Not testing edge cases (empty strings, very long strings, special characters)

Practice Exercises

Exercise 1: Extract All Email Addresses

Write a pattern to find all email addresses in a text.

Input: "Contact us at support@example.com or sales@company.co"
Expected: ["support@example.com", "sales@company.co"]

Exercise 2: Validate Password

Create a pattern for passwords: 8-16 characters, at least one uppercase, one lowercase, one digit.

Valid: "Password123", "SecurePass1"
Invalid: "password" (no uppercase/digit), "Pass1" (too short)

Exercise 3: Extract Hashtags

Find all hashtags (# followed by alphanumeric characters) in text.

Input: "Love #coding and #webdev! #JavaScript2025"
Expected: ["#coding", "#webdev", "#JavaScript2025"]

Exercise 4: Format Phone Numbers

Convert 10-digit phone numbers to (XXX) XXX-XXXX format using replace.

Input: "5551234567"
Expected: "(555) 123-4567"
Hint: Use capturing groups and replacement patterns

Exercise 5: Validate Hex Color

Write a pattern that matches valid CSS hex colors (3 or 6 digits).

Valid: "#FFF", "#FFFFFF", "#a1b2c3"
Invalid: "#GGG", "#12345", "FFF" (no #)

Test your solutions using our Regex Tester
