Regular expressions (regex) are powerful patterns for matching and manipulating text. They're essential for validation, data extraction, text processing, and parsing. While they can seem complex at first, understanding regex will make you much more effective at working with text data.
In this lesson, you'll learn regex syntax, common patterns, how to use Python's re module, and practical applications like email validation and text extraction. These skills are valuable for data processing, web scraping, and building robust applications.
What You'll Learn
- Introduction to regex patterns
- Common regex patterns and metacharacters
- Using the re module
- Matching, searching, and replacing
- Compiling regex patterns
- Practical text processing examples
Introduction to Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. Python's re module provides regex functionality:
import re
# Simple pattern matching
text = "Hello, World!"
pattern = "Hello"
match = re.search(pattern, text)
if match:
    print("Found:", match.group()) # Found: Hello
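A match object carries more than the matched text. The short sketch below reuses the same sample string to show where the match sits in the text, and what `re.search` returns when nothing matches. The variable names `found` and `missing` are illustrative only.

```python
import re

text = "Hello, World!"

# A successful search returns a match object with position information.
found = re.search("World", text)
if found:
    print(found.group())               # World
    print(found.start(), found.end())  # 7 12
    print(found.span())                # (7, 12)

# An unsuccessful search returns None, so test the result before using it.
missing = re.search("Python", text)
print(missing is None)  # True
```

Checking the result with `if match:` before calling `.group()` avoids an `AttributeError` on `None`.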
Basic Patterns
Here are fundamental regex patterns:
import re
# Literal characters
text = "cat"
match = re.search("cat", text) # Matches "cat"
# Character classes
text = "The cat sat on the mat"
# [abc] matches a, b, or c
match = re.search("[cm]at", text) # Pattern fits "cat" and "mat"; search() returns the first, "cat"
# Ranges
# [a-z] matches any lowercase letter
# [0-9] matches any digit
# [A-Za-z] matches any letter
text = "Room 101"
match = re.search("[0-9]", text) # Matches "1"
# Negated character class
# [^abc] matches anything except a, b, or c
text = "hello"
match = re.search("[^aeiou]", text) # Matches "h"
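Because `re.search` stops at the first match, a character class that could match in several places reports only the earliest one. As a small illustration, `re.findall` (covered later in this lesson) lists every non-overlapping match for the same kinds of classes. The sample strings and variable names are chosen just for this sketch.

```python
import re

text = "The cat sat on the mat"

# [cm]at matches "cat" and "mat" but not "sat".
cm_matches = re.findall("[cm]at", text)
print(cm_matches)  # ['cat', 'mat']

# [a-z]at accepts any lowercase letter before "at".
range_matches = re.findall("[a-z]at", text)
print(range_matches)  # ['cat', 'sat', 'mat']

# [^aeiou ] excludes the vowels and the space character.
consonants = re.findall("[^aeiou ]", "hello world")
print(consonants)  # ['h', 'l', 'l', 'w', 'r', 'l', 'd']
```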
Common Metacharacters
Metacharacters have special meaning in regex:
import re
# . (dot) - matches any character except newline
text = "cat bat rat"
match = re.search("c.t", text) # Matches "cat"
# ^ - start of string
text = "Hello World"
match = re.search("^Hello", text) # Matches at start
# $ - end of string
text = "Hello World"
match = re.search("World$", text) # Matches at end
# * - zero or more of preceding
text = "caaat"
match = re.search("ca*t", text) # Matches "caaat" (a* = zero or more a's)
# + - one or more of preceding
text = "caaat"
match = re.search("ca+t", text) # Matches "caaat" (a+ = one or more a's)
# ? - zero or one of preceding
text = "color or colour"
match = re.search("colou?r", text) # Pattern fits both spellings; search() returns the first, "color"
# {n} - exactly n occurrences
text = "caaat"
match = re.search("ca{3}t", text) # Matches "caaat"
# {n,m} - between n and m occurrences
text = "caat"
match = re.search("ca{1,3}t", text) # Matches 1-3 a's
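Since metacharacters have special meaning, matching one literally requires escaping it. The sketch below shows two ways to do that: a backslash in front of the character, or `re.escape()` to escape a whole string. The sample text is made up for illustration, and the patterns use raw strings (`r"..."`), which are explained in the next section.

```python
import re

text = "Total: 3.50 (approx.)"

# Unescaped, the dot matches any character, so "3x50" also matches.
loose = re.search("3.50", "3x50")
print(bool(loose))  # True

# Escaped with a backslash, the dot matches only a literal ".".
strict_miss = re.search(r"3\.50", "3x50")
strict_hit = re.search(r"3\.50", text)
print(bool(strict_miss))  # False
print(bool(strict_hit))   # True

# re.escape() escapes every metacharacter in a string for you.
literal = re.escape("(approx.)")
escaped_hit = re.search(literal, text)
print(escaped_hit.group())  # (approx.)
```

`re.escape()` is most useful when the text to match comes from user input or another variable rather than being typed into the pattern by hand.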
Character Classes and Shorthand
Python provides shorthand character classes:
import re
# \d - digit ([0-9]; for str patterns it also matches other Unicode digits unless re.ASCII is set)
text = "Room 101"
match = re.search(r"\d", text) # Matches "1"
# \w - word character ([a-zA-Z0-9_]; for str patterns it also matches other Unicode letters and digits unless re.ASCII is set)
text = "Hello_World123"
match = re.search(r"\w+", text) # Matches "Hello_World123"
# \s - whitespace
text = "Hello World"
match = re.search(r"\s", text) # Matches space
# \D - non-digit
# \W - non-word
# \S - non-whitespace
# Note: Use raw strings (r"...") to avoid escaping issues
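To see why raw strings matter, compare how Python itself reads `"\b"` and `r"\b"` before the regex engine ever sees them. The sketch below also shows how the `re.ASCII` flag narrows `\w` to ASCII characters only. The sample strings are illustrative.

```python
import re

# In a normal string, "\b" is the backspace character (one character).
print(len("\b"))   # 1
# In a raw string, r"\b" stays as backslash + b, which regex reads as a word boundary.
print(len(r"\b"))  # 2

text = "a cat sat"
raw_hit = re.search(r"\bcat\b", text)
plain_hit = re.search("\bcat\b", text)
print(raw_hit is not None)    # True
print(plain_hit is not None)  # False (looks for literal backspace characters)

# With str patterns, \w matches Unicode word characters by default.
word = "caf\u00e9"  # "café"
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, re.ASCII)
print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```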
The re Module Functions
Python's re module provides several functions:
import re
text = "Contact: alice@example.com or bob@test.org"
# search() - find first match
match = re.search(r"\w+@\w+\.\w+", text)
if match:
    print(match.group()) # alice@example.com
# findall() - find all matches
emails = re.findall(r"\w+@\w+\.\w+", text)
print(emails) # ['alice@example.com', 'bob@test.org']
# match() - match at start of string
text = "Hello World"
match = re.match(r"Hello", text) # Matches
match = re.match(r"World", text) # Doesn't match (not at start)
# sub() - replace matches
text = "Hello World"
new_text = re.sub(r"World", "Python", text)
print(new_text) # Hello Python
# split() - split by pattern
text = "apple,banana,orange"
items = re.split(r",", text)
print(items) # ['apple', 'banana', 'orange']
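The same functions take a few extra options that come up often in practice. As a rough sketch, the example below splits on more than one delimiter, limits the number of replacements with `count=`, and passes a function to `re.sub()` to compute each replacement. The helper name `double` and the sample strings are hypothetical.

```python
import re

# split() with a character class handles several delimiters at once.
text = "apple, banana;orange  ,  kiwi"
items = re.split(r"\s*[,;]\s*", text)
print(items)  # ['apple', 'banana', 'orange', 'kiwi']

# sub() accepts count= to limit the number of replacements.
masked = re.sub(r"\d", "#", "PIN 1234", count=2)
print(masked)  # PIN ##34

# sub() also accepts a function that receives each match object
# and returns the replacement string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```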
Groups and Capturing
Parentheses create groups that you can extract:
import re
# Simple group
text = "Contact: alice@example.com"
match = re.search(r"(\w+)@(\w+\.\w+)", text)
if match:
    print(match.group(0)) # Full match: alice@example.com
    print(match.group(1)) # First group: alice
    print(match.group(2)) # Second group: example.com
# Named groups
text = "Date: 2025-11-26"
match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
if match:
    print(match.group("year")) # 2025
    print(match.group("month")) # 11
    print(match.group("day")) # 26
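Groups are also useful when replacing text, not only when extracting it. The sketch below reuses the date pattern from above, reads all groups at once with `groups()` and `groupdict()`, and then reorders the date with `\g<name>` references in `re.sub()`. The name `date_pattern` and the output format are chosen for illustration.

```python
import re

date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Date: 2025-11-26"

match = re.search(date_pattern, text)
if match:
    print(match.groups())     # ('2025', '11', '26')
    print(match.groupdict())  # {'year': '2025', 'month': '11', 'day': '26'}

# \g<name> refers back to a named group inside the replacement string.
reordered = re.sub(date_pattern, r"\g<day>/\g<month>/\g<year>", text)
print(reordered)  # Date: 26/11/2025
```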
Compiling Patterns
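If you reuse a pattern many times, compile it once with re.compile(). The re module already caches recently used patterns, so the speedup is usually modest; the main benefit is a named, reusable pattern object: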
import re
# Compile pattern once
pattern = re.compile(r"\d{3}-\d{3}-\d{4}") # Phone number pattern
# Use compiled pattern multiple times
text1 = "Call 555-123-4567"
text2 = "Phone: 555-987-6543"
match1 = pattern.search(text1)
match2 = pattern.search(text2)
if match1:
    print(match1.group()) # 555-123-4567
if match2:
    print(match2.group()) # 555-987-6543
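A compiled pattern can also carry flags, and it offers the module-level functions as methods. The sketch below is one possible use: a case-insensitive pattern applied to a list of lines. The pattern name `level_pattern` and the sample lines are hypothetical.

```python
import re

# Flags are passed once at compile time; re.IGNORECASE makes matching case-insensitive.
level_pattern = re.compile(r"\b(error|warning)\b", re.IGNORECASE)

lines = [
    "ERROR: disk full",
    "info: started",
    "Warning: low memory",
]

# A compiled pattern exposes search(), findall(), sub(), and split() as methods.
flagged = [line for line in lines if level_pattern.search(line)]
print(flagged)  # ['ERROR: disk full', 'Warning: low memory']

# With one capturing group, findall() returns the text of that group.
levels = level_pattern.findall("error, Error, WARNING")
print(levels)  # ['error', 'Error', 'WARNING']
```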
Practical Examples
Here are real-world regex applications:
# Example 1: Email validation (simple)
import re
def is_valid_email(email):
    """Simple email validation."""
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return bool(re.match(pattern, email))
print(is_valid_email("user@example.com")) # True
print(is_valid_email("invalid.email")) # False
# Example 2: Extract phone numbers
text = "Call 555-123-4567 or 555-987-6543"
pattern = r"\d{3}-\d{3}-\d{4}"
phones = re.findall(pattern, text)
print(phones) # ['555-123-4567', '555-987-6543']
# Example 3: Extract dates
text = "Events on 2025-11-26 and 2025-12-25"
pattern = r"\d{4}-\d{2}-\d{2}"
dates = re.findall(pattern, text)
print(dates) # ['2025-11-26', '2025-12-25']
# Example 4: Clean text
text = "Hello!!! World??? Python..."
# Remove every run of !, ?, or . and then collapse repeated whitespace
cleaned = re.sub(r"[!?.]+", "", text)
cleaned = re.sub(r"\s+", " ", cleaned)
print(cleaned) # Hello World Python
# Example 5: Extract URLs
text = "Visit https://example.com or http://test.org"
pattern = r"https?://[^\s]+"
urls = re.findall(pattern, text)
print(urls) # ['https://example.com', 'http://test.org']
# Example 6: Password validation
def is_strong_password(password):
    """Check if password meets requirements."""
    # At least 8 chars, one uppercase, one lowercase, one digit
    if len(password) < 8:
        return False
    if not re.search(r"[A-Z]", password):
        return False
    if not re.search(r"[a-z]", password):
        return False
    if not re.search(r"\d", password):
        return False
    return True
print(is_strong_password("Password123")) # True
print(is_strong_password("weak")) # False
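One subtlety in validators like `is_valid_email` above: `$` also matches just before a newline at the very end of the string, so input with a trailing newline can slip through. A possible alternative, sketched here, is `re.fullmatch()`, which requires the whole string to match. The names `EMAIL_PATTERN` and `is_valid_email_strict` are hypothetical and not part of the lesson's original code.

```python
import re

EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def is_valid_email_strict(email):
    """Hypothetical variant of is_valid_email that rejects a trailing newline."""
    return re.fullmatch(EMAIL_PATTERN, email) is not None

# "$" also matches just before a newline at the end of the string.
anchored = re.match(r"^" + EMAIL_PATTERN + r"$", "user@example.com\n")
print(bool(anchored))  # True

# fullmatch() requires the entire string to match the pattern.
print(is_valid_email_strict("user@example.com\n"))  # False
print(is_valid_email_strict("user@example.com"))    # True
print(is_valid_email_strict("invalid.email"))       # False
```

Like the original, this is still a simple format check; it does not confirm that an address actually exists.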
Try It Yourself
Practice with regex:
- Phone Number Formatter: Create a function that extracts and formats phone numbers from text in various formats.
- Email Extractor: Write a function that extracts all email addresses from a block of text.
- HTML Tag Remover: Create a function that removes HTML tags from text using regex.
- Credit Card Validator: Write a function that validates credit card numbers using the Luhn algorithm and regex.
- Log Parser: Create a function that parses log entries and extracts timestamps, log levels, and messages.
Summary
Regular expressions are powerful tools for text pattern matching and manipulation. The re module provides functions like search(), findall(), sub(), and split(). Understanding metacharacters, character classes, and groups enables you to create complex patterns for validation, extraction, and text processing.
Regex is essential for data processing, validation, web scraping, and many other tasks. While it can be complex, mastering the basics will make you much more effective at working with text data. Practice with common patterns and gradually build more complex expressions.
What's Next?
In the next lesson, we'll explore working with JSON and APIs. You'll learn how to parse and create JSON data, make HTTP requests, and interact with REST APIs. These skills are essential for modern Python applications that communicate with web services.