Regular Expressions

Regular expressions (regex) are patterns for matching and manipulating text. They are widely used for validation, data extraction, text processing, and simple parsing. The syntax can look cryptic at first, but a small set of building blocks covers most everyday text tasks.

In this lesson, you'll learn regex syntax, common patterns, how to use Python's re module, and practical applications like email validation and text extraction. These skills are valuable for data processing, web scraping, and building robust applications.

What You'll Learn

Introduction to regex patterns
Common regex patterns and metacharacters
Using the re module
Matching, searching, and replacing
Compiling regex patterns
Practical text processing examples

Introduction to Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. Python's re module provides regex functionality:

import re

# Simple pattern matching
text = "Hello, World!"
pattern = "Hello"
match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # Found: Hello

Basic Patterns

Here are fundamental regex patterns:

import re

# Literal characters
text = "cat"
match = re.search("cat", text)  # Matches "cat"

# Character classes
text = "The cat sat on the mat"
# [abc] matches a, b, or c
match = re.search("[cm]at", text)  # First match is "cat" ("mat" would also match)

# Ranges
# [a-z] matches any lowercase letter
# [0-9] matches any digit
# [A-Za-z] matches any letter
text = "Room 101"
match = re.search("[0-9]", text)  # Matches "1"

# Negated character class
# [^abc] matches anything except a, b, or c
text = "hello"
match = re.search("[^aeiou]", text)  # Matches "h"
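
Because search() returns only the first match, it can help to compare it with findall(), which is covered later in this lesson. The following is a minimal sketch using the same sample strings as above; the variable names are chosen only for illustration.

```python
import re

text = "The cat sat on the mat"

# search() stops at the first match; findall() collects every match.
first = re.search(r"[cm]at", text)
all_matches = re.findall(r"[cm]at", text)

print(first.group())  # cat
print(all_matches)    # ['cat', 'mat']  ("sat" is skipped: s is not in [cm])

# A negated class matches the first character that is NOT listed.
first_consonant = re.search(r"[^aeiou]", "hello").group()
print(first_consonant)  # h
```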

Common Metacharacters

Metacharacters have special meaning in regex:

import re

# . (dot) - matches any character except newline
text = "cat bat rat"
match = re.search("c.t", text)  # Matches "cat"

# ^ - start of string
text = "Hello World"
match = re.search("^Hello", text)  # Matches at start

# $ - end of string (also matches just before a trailing newline)
text = "Hello World"
match = re.search("World$", text)  # Matches at end

# * - zero or more of preceding
text = "caaat"
match = re.search("ca*t", text)  # Matches "caaat" (a* = zero or more a's)

# + - one or more of preceding
text = "caaat"
match = re.search("ca+t", text)  # Matches "caaat" (a+ = one or more a's)

# ? - zero or one of preceding
text = "color or colour"
match = re.search("colou?r", text)  # First match is "color"; the pattern also matches "colour"

# {n} - exactly n occurrences
text = "caaat"
match = re.search("ca{3}t", text)  # Matches "caaat"

# {n,m} - between n and m occurrences
text = "caat"
match = re.search("ca{1,3}t", text)  # Matches "caat" (between 1 and 3 a's)
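
To see how the quantifiers differ, it can be useful to run each one against the same set of words. The sketch below assumes a small hypothetical word list and a helper named matching_words(); it uses re.fullmatch() so that a word counts only if the whole word fits the pattern.

```python
import re

# Hypothetical sample words with 0, 1, 2, and 4 a's.
words = ["ct", "cat", "caat", "caaaat"]

def matching_words(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching_words(r"ca*t"))      # ['ct', 'cat', 'caat', 'caaaat']
print(matching_words(r"ca+t"))      # ['cat', 'caat', 'caaaat']
print(matching_words(r"ca?t"))      # ['ct', 'cat']
print(matching_words(r"ca{2}t"))    # ['caat']
print(matching_words(r"ca{1,3}t"))  # ['cat', 'caat']
```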

Character Classes and Shorthand

Python provides shorthand character classes:

import re

# \d - digit ([0-9]; for str patterns it also matches other Unicode digits unless re.ASCII is set)
text = "Room 101"
match = re.search(r"\d", text)  # Matches "1"

# \w - word character ([a-zA-Z0-9_]; for str patterns it also matches Unicode letters and digits unless re.ASCII is set)
text = "Hello_World123"
match = re.search(r"\w+", text)  # Matches "Hello_World123"

# \s - whitespace
text = "Hello World"
match = re.search(r"\s", text)  # Matches space

# \D - non-digit
# \W - non-word
# \S - non-whitespace

# Note: Use raw strings (r"...") so that backslashes reach the regex engine unchanged
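
The raw-string advice is easiest to appreciate with an escape that Python itself interprets. The sketch below uses \b, the regex word-boundary escape, which this lesson does not otherwise cover and is used here only as an illustration: in a regular string literal, "\b" becomes a backspace character before the re module ever sees it.

```python
import re

# In a regular string literal, "\b" is the backspace character,
# so the regex engine never sees the word-boundary escape.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5  (backspace, c, a, t, backspace)
print(len(raw))    # 7  (backslash and b are kept as two characters)

plain_result = re.search(plain, "a cat here")
raw_result = re.search(raw, "a cat here")
print(plain_result)        # None
print(raw_result.group())  # cat

# Without a raw string, every backslash has to be doubled.
print("\\d" == r"\d")  # True
```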

The re Module Functions

Python's re module provides several functions:

import re

text = "Contact: alice@example.com or bob@test.org"

# search() - find first match
match = re.search(r"\w+@\w+\.\w+", text)
if match:
    print(match.group())  # alice@example.com

# findall() - find all matches
emails = re.findall(r"\w+@\w+\.\w+", text)
print(emails)  # ['alice@example.com', 'bob@test.org']

# match() - match at start of string
text = "Hello World"
match = re.match(r"Hello", text)  # Matches
match = re.match(r"World", text)  # Doesn't match (not at start)

# sub() - replace matches
text = "Hello World"
new_text = re.sub(r"World", "Python", text)
print(new_text)  # Hello Python

# split() - split by pattern
text = "apple,banana,orange"
items = re.split(r",", text)
print(items)  # ['apple', 'banana', 'orange']
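
A few variations on these functions come up often in practice. The following sketch, with made-up sample strings, shows split() with several delimiters, the count argument of sub(), and the difference between match() and fullmatch().

```python
import re

# split() with a character class handles several delimiters at once.
text = "apple, banana;orange  ,  kiwi"
items = re.split(r"\s*[,;]\s*", text)
print(items)  # ['apple', 'banana', 'orange', 'kiwi']

# sub() accepts a count argument that limits the number of replacements.
first_only = re.sub(r"o", "0", "foo boo", count=1)
print(first_only)  # f0o boo

# match() anchors only at the start; fullmatch() must consume the whole string.
starts_with_digits = re.match(r"\d+", "123abc")
all_digits = re.fullmatch(r"\d+", "123abc")
print(starts_with_digits.group())  # 123
print(all_digits)                  # None
```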

Groups and Capturing

Parentheses create groups that you can extract:

import re

# Simple group
text = "Contact: alice@example.com"
match = re.search(r"(\w+)@(\w+\.\w+)", text)
if match:
    print(match.group(0))  # Full match: alice@example.com
    print(match.group(1))  # First group: alice
    print(match.group(2))  # Second group: example.com

# Named groups
text = "Date: 2025-11-26"
match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
if match:
    print(match.group("year"))   # 2025
    print(match.group("month"))  # 11
    print(match.group("day"))    # 26
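
Groups also change what findall() returns and can be referenced in sub() replacements. The sketch below reuses the named-group date pattern from above on a sample sentence; the variable names are illustrative.

```python
import re

date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Events on 2025-11-26 and 2025-12-25"

# groupdict() returns all named groups of one match as a dictionary.
match = re.search(date_pattern, text)
parts = match.groupdict()
print(parts)  # {'year': '2025', 'month': '11', 'day': '26'}

# When the pattern contains groups, findall() returns tuples of the groups.
tuples = re.findall(date_pattern, text)
print(tuples)  # [('2025', '11', '26'), ('2025', '12', '25')]

# In a replacement string, \g<name> refers back to a named group.
reordered = re.sub(date_pattern, r"\g<day>/\g<month>/\g<year>", text)
print(reordered)  # Events on 26/11/2025 and 25/12/2025
```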

Compiling Patterns

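When a pattern is reused, compile it once with re.compile(). The re module already caches recently used patterns, so the speed gain is usually modest, but a compiled pattern object gives the pattern a name and exposes the same methods (search(), findall(), sub(), and so on):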

import re

# Compile pattern once
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")  # Phone number pattern

# Use compiled pattern multiple times
text1 = "Call 555-123-4567"
text2 = "Phone: 555-987-6543"

match1 = pattern.search(text1)
match2 = pattern.search(text2)

if match1:
    print(match1.group())  # 555-123-4567
if match2:
    print(match2.group())  # 555-987-6543
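
A compiled pattern is most convenient when it is applied to many inputs, for example the lines of a file. The sketch below assumes a small hypothetical list of lines and also shows that flags such as re.IGNORECASE can be fixed at compile time.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical input lines, standing in for lines read from a file.
lines = [
    "Call 555-123-4567",
    "No number here",
    "Phone: 555-987-6543 or 555-000-1111",
]

found = []
for line in lines:
    found.extend(PHONE.findall(line))
print(found)  # ['555-123-4567', '555-987-6543', '555-000-1111']

# Flags are passed once, when the pattern is compiled.
greeting = re.compile(r"hello", re.IGNORECASE)
shout = greeting.search("Say HELLO").group()
print(shout)  # HELLO
```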

Practical Examples

Here are real-world regex applications:

# Example 1: Email validation (simple)
import re

def is_valid_email(email):
    """Simple email validation."""
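    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    # fullmatch() requires the whole string to match, so ^ and $ are not needed
    # (with $, a string ending in a newline would still pass)
    return bool(re.fullmatch(pattern, email))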

print(is_valid_email("user@example.com"))    # True
print(is_valid_email("invalid.email"))       # False

# Example 2: Extract phone numbers
text = "Call 555-123-4567 or 555-987-6543"
pattern = r"\d{3}-\d{3}-\d{4}"
phones = re.findall(pattern, text)
print(phones)  # ['555-123-4567', '555-987-6543']

# Example 3: Extract dates
text = "Events on 2025-11-26 and 2025-12-25"
pattern = r"\d{4}-\d{2}-\d{2}"
dates = re.findall(pattern, text)
print(dates)  # ['2025-11-26', '2025-12-25']

# Example 4: Clean text
text = "Hello!!!   World???   Python..."
# Remove runs of punctuation, then collapse repeated whitespace into one space
cleaned = re.sub(r"[!?.]+", "", text)
cleaned = re.sub(r"\s+", " ", cleaned)
print(cleaned)  # Hello World Python

# Example 5: Extract URLs
text = "Visit https://example.com or http://test.org"
pattern = r"https?://\S+"  # \S+ is equivalent to [^\s]+
urls = re.findall(pattern, text)
print(urls)  # ['https://example.com', 'http://test.org']

# Example 6: Password validation
def is_strong_password(password):
    """Check if password meets requirements."""
    # At least 8 chars, one uppercase, one lowercase, one digit
    if len(password) < 8:
        return False
    if not re.search(r"[A-Z]", password):
        return False
    if not re.search(r"[a-z]", password):
        return False
    if not re.search(r"\d", password):
        return False
    return True

print(is_strong_password("Password123"))  # True
print(is_strong_password("weak"))        # False
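
One subtlety is worth knowing when validating whole strings: $ also matches just before a trailing newline, so a pattern anchored with ^ and $ can accept input that ends in "\n". The sketch below compares that approach with re.fullmatch() on the simple email pattern from Example 1; the constant name EMAIL is chosen only for illustration.

```python
import re

EMAIL = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

candidate = "user@example.com\n"  # note the trailing newline

# $ matches before the final newline, so the anchored pattern accepts it.
with_dollar = bool(re.match("^" + EMAIL + "$", candidate))

# fullmatch() must consume every character, including the newline.
with_fullmatch = bool(re.fullmatch(EMAIL, candidate))

print(with_dollar)     # True
print(with_fullmatch)  # False
```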

Try It Yourself

Practice with regex:

Phone Number Formatter: Create a function that extracts and formats phone numbers from text in various formats.
Email Extractor: Write a function that extracts all email addresses from a block of text.
HTML Tag Remover: Create a function that removes HTML tags from text using regex.
Credit Card Validator: Write a function that validates credit card numbers using the Luhn algorithm and regex.
Log Parser: Create a function that parses log entries and extracts timestamps, log levels, and messages.

Summary

Regular expressions are powerful tools for text pattern matching and manipulation. The re module provides functions like search(), findall(), sub(), and split(). Understanding metacharacters, character classes, and groups enables you to create complex patterns for validation, extraction, and text processing.

Regex is essential for data processing, validation, web scraping, and many other tasks. While it can be complex, mastering the basics will make you much more effective at working with text data. Practice with common patterns and gradually build more complex expressions.

What's Next?

In the next lesson, we'll explore working with JSON and APIs. You'll learn how to parse and create JSON data, make HTTP requests, and interact with REST APIs. These skills are essential for modern Python applications that communicate with web services.