Data Cleaning in SQL
Clean and standardize data using SQL techniques.
Introduction to Data Cleaning in SQL
Real-world data is messy. Customer names have inconsistent capitalization. Phone numbers come in different formats. NULLs appear where you don't expect them. Before you can analyze data, you need to clean it.
Data cleaning transforms raw, inconsistent data into a clean, standardized format ready for analysis. SQL provides powerful functions for:
- Handling NULL values
- Standardizing text formats
- Parsing and transforming strings
- Converting data types
- Removing duplicates
- Fixing inconsistencies
Real-world scenario: You receive customer data from multiple sources:
- Source A: Names in "LastName, FirstName" format
- Source B: Names in "FirstName LastName" format
- Some phone numbers include dashes, others don't
- Dates come as strings like "2024-01-15" or "01/15/2024"
- Many fields contain NULL or empty strings
SQL can clean all of this!
The process of detecting and correcting (or removing) corrupt, inaccurate, or inconsistent records from a dataset. Essential before analysis or reporting.
Issue | Example | SQL Solution |
|---|---|---|
| NULL values | Missing customer emails | COALESCE, NULLIF, IFNULL |
| Inconsistent formatting | "John Doe" vs "JOHN DOE" | UPPER, LOWER, TRIM |
| Extra whitespace | " Alice " | TRIM, LTRIM, RTRIM |
| Mixed data types | "100" stored as text | CAST, CONVERT |
| Duplicates | Same customer entered twice | DISTINCT, ROW_NUMBER |
| Invalid values | -1 for age | CASE, NULLIF |
Handling NULL Values
NULL means "unknown" or "missing." You need to handle NULLs carefully in calculations and comparisons.
customer_id | name | email | phone | city |
|---|---|---|---|---|
| 1 | Alice Smith | alice@mail.com | 555-1234 | NYC |
| 2 | Bob Jones | NULL | 555-5678 | NULL |
| 3 | Charlie Brown | charlie@mail.com | NULL | Boston |
| 4 | Diana Prince | NULL | 555-9012 | LA |
COALESCE - Return first non-NULL value:
SELECT
customer_id,
name,
COALESCE(email, 'no-email@example.com') AS email,
COALESCE(phone, 'No phone') AS phone,
COALESCE(city, 'Unknown') AS city
FROM customers_raw;NULLIF - Convert specific values to NULL:
product_id | name | price | discount |
|---|---|---|---|
| 1 | Laptop | 999.99 | 50.00 |
| 2 | Mouse | 29.99 | 0 |
| 3 | Keyboard | 79.99 | -1 |
-- Convert invalid discount values (-1, 0) to NULL
SELECT
product_id,
name,
price,
NULLIF(discount, -1) AS discount_clean1,
NULLIF(NULLIF(discount, -1), 0) AS discount_clean2
FROM products_raw;COALESCE vs NULLIF:
- COALESCE: Replace NULL with a default value
- NULLIF: Convert a specific value to NULL
Text Cleaning and Standardization
Standardize text data for consistency.
TRIM, LTRIM, RTRIM - Remove whitespace:
id | name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |
SELECT
id,
CONCAT('[', name, ']') AS original,
CONCAT('[', TRIM(name), ']') AS trimmed,
CONCAT('[', LTRIM(name), ']') AS left_trimmed,
CONCAT('[', RTRIM(name), ']') AS right_trimmed
FROM names_raw;UPPER, LOWER, INITCAP - Case conversion:
SELECT
'John DOE' AS original,
UPPER('John DOE') AS uppercase,
LOWER('John DOE') AS lowercase,
CONCAT(
UPPER(SUBSTRING('john doe', 1, 1)),
LOWER(SUBSTRING('john doe', 2))
) AS title_case;REPLACE - Find and replace text:
customer_id | phone |
|---|---|
| 1 | (555) 123-4567 |
| 2 | 555.987.6543 |
| 3 | 555-111-2222 |
-- Standardize phone numbers to digits only
SELECT
customer_id,
phone AS original,
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
) AS digits_only
FROM phones_raw;Chaining REPLACE: You can nest multiple REPLACE calls to remove different characters. Modern databases also support REGEXP_REPLACE for pattern-based replacement.
String Parsing and Extraction
SUBSTRING - Extract part of a string:
SELECT
'Alice Smith' AS full_name,
SUBSTRING('Alice Smith', 1, 5) AS first_name,
SUBSTRING('Alice Smith', 7) AS last_name,
LEFT('Alice Smith', 5) AS left_5,
RIGHT('Alice Smith', 5) AS right_5;SPLIT_PART / SUBSTRING_INDEX - Split by delimiter:
user_id | email |
|---|---|
| 1 | alice@company.com |
| 2 | bob@startup.io |
| 3 | charlie@business.org |
-- Extract username and domain
SELECT
user_id,
email,
SUBSTRING_INDEX(email, '@', 1) AS username,
SUBSTRING_INDEX(email, '@', -1) AS domain
FROM emails;LENGTH - Check string length:
-- Find invalid phone numbers (not 10 digits)
SELECT
customer_id,
phone,
LENGTH(phone) AS phone_length,
CASE
WHEN LENGTH(phone) = 10 THEN 'Valid'
ELSE 'Invalid'
END AS validation
FROM phones_cleaned;Type Conversion and Casting
CAST - Convert between data types:
order_id | customer_id | total | order_date |
|---|---|---|---|
| 1 | 101 | 150.50 | 2024-01-15 |
| 2 | 102 | 200 | 2024-01-16 |
| 3 | 103 | invalid | 2024-01-17 |
-- Convert string to number, handle errors
SELECT
order_id,
total AS original,
CASE
WHEN total REGEXP '^[0-9]+(\.[0-9]+)?$'
THEN CAST(total AS DECIMAL(10,2))
ELSE NULL
END AS total_numeric
FROM orders_raw;Date parsing:
id | date_string | format |
|---|---|---|
| 1 | 2024-01-15 | ISO |
| 2 | 01/15/2024 | US |
| 3 | 15-01-2024 | EU |
SELECT
id,
date_string,
STR_TO_DATE(date_string, '%Y-%m-%d') AS iso_date,
STR_TO_DATE(date_string, '%m/%d/%Y') AS us_date,
STR_TO_DATE(date_string, '%d-%m-%Y') AS eu_date
FROM dates_raw;Removing Duplicates
DISTINCT - Simple deduplication:
customer_id | email | name |
|---|---|---|
| 1 | alice@mail.com | Alice Smith |
| 2 | bob@mail.com | Bob Jones |
| 3 | alice@mail.com | Alice Smith |
| 4 | charlie@mail.com | Charlie Brown |
-- Remove exact duplicates
SELECT DISTINCT email, name
FROM customers_dupes
ORDER BY name;ROW_NUMBER - Keep first occurrence:
-- Keep earliest customer_id per email
SELECT customer_id, email, name
FROM (
SELECT
customer_id,
email,
name,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id) AS rn
FROM customers_dupes
) ranked
WHERE rn = 1;CASE Statements for Conditional Cleaning
CASE is incredibly versatile for data cleaning:
product_id | category | price | stock |
|---|---|---|---|
| 1 | ELECTRONICS | 999.99 | 10 |
| 2 | electronics | 29.99 | -5 |
| 3 | Electronics | 79.99 | 0 |
| 4 | Furniture | -10.00 | 20 |
SELECT
product_id,
-- Standardize category
UPPER(TRIM(category)) AS category_clean,
-- Fix negative prices
CASE
WHEN price < 0 THEN NULL
ELSE price
END AS price_clean,
-- Fix negative stock
CASE
WHEN stock < 0 THEN 0
ELSE stock
END AS stock_clean,
-- Add status flag
CASE
WHEN price < 0 OR stock < 0 THEN 'NEEDS REVIEW'
WHEN stock = 0 THEN 'OUT OF STOCK'
ELSE 'OK'
END AS data_quality
FROM products_messy;Complete Data Cleaning Example
Scenario: Clean customer data from multiple sources
id | name | email | phone | city | state |
|---|---|---|---|---|---|
| 1 | ALICE SMITH | Alice@Email.com | (555) 123-4567 | new york | NY |
| 2 | bob jones | NULL | 555.987.6543 | NULL | ca |
| 3 | Charlie Brown | charlie@email.com | Boston | ma | |
| 4 | NULL | diana@email.com | 555-111-2222 | Los Angeles | CA |
SELECT
id,
-- Clean name: trim, title case
CASE
WHEN TRIM(name) = '' OR name IS NULL THEN 'Unknown'
ELSE CONCAT(
UPPER(SUBSTRING(TRIM(LOWER(name)), 1, 1)),
LOWER(SUBSTRING(TRIM(name), 2))
)
END AS name_clean,
-- Clean email: lowercase, handle NULL
COALESCE(LOWER(TRIM(email)), 'no-email@example.com') AS email_clean,
-- Clean phone: remove formatting
CASE
WHEN TRIM(phone) = '' OR phone IS NULL THEN NULL
ELSE REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
)
END AS phone_clean,
-- Clean city: title case
CASE
WHEN city IS NULL OR TRIM(city) = '' THEN 'Unknown'
ELSE CONCAT(
UPPER(SUBSTRING(TRIM(LOWER(city)), 1, 1)),
LOWER(SUBSTRING(TRIM(city), 2))
)
END AS city_clean,
-- Clean state: uppercase, 2 letters
UPPER(TRIM(COALESCE(state, 'XX'))) AS state_clean
FROM customers_import;This query handles:
- NULL values
- Inconsistent capitalization
- Extra whitespace
- Phone number formatting
- Missing data
Create a cleaned table:
CREATE TABLE customers_clean AS
SELECT
id AS customer_id,
-- [all cleaning logic from above]
FROM customers_import
WHERE email IS NOT NULL; -- Exclude rows without emailsData Quality Checks
Identify data quality issues:
-- Find records with issues
SELECT
'Missing email' AS issue,
COUNT(*) AS count
FROM customers_import
WHERE email IS NULL OR TRIM(email) = ''
UNION ALL
SELECT
'Missing name',
COUNT(*)
FROM customers_import
WHERE name IS NULL OR TRIM(name) = ''
UNION ALL
SELECT
'Invalid phone length',
COUNT(*)
FROM customers_import
WHERE LENGTH(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
)) != 10;Common Mistakes to Avoid
Don't make these mistakes:
-
Modifying original data directly: ❌
- Always clean into a new table or view
- Keep raw data intact for auditing
-
Not documenting cleaning rules: ❌
- Document what you changed and why
- Future you will thank you!
-
Over-cleaning: Sometimes
- Don't remove variations that might be legitimate
- "Bob" vs "Robert" might be different people
-
Ignoring NULLs in calculations: ❌
NULL + 100 = NULL- Use COALESCE before calculations
-
Case-sensitive comparisons: ⚠️
- "alice@email.com" != "Alice@Email.com" in some databases
- Always use UPPER() or LOWER() for comparisons
-
Not handling edge cases: ❌
- Empty strings vs NULL
- Zero vs NULL
- Whitespace-only strings
-
Running UPDATE without WHERE: ❌
- Test in SELECT first!
- Then convert to UPDATE
-
Not validating after cleaning: ❌
- Always check that cleaning worked
- Run quality checks on cleaned data
Key Takeaways
What you learned:
✅ COALESCE replaces NULL with default values ✅ NULLIF converts specific values to NULL ✅ TRIM removes whitespace ✅ UPPER/LOWER standardizes case ✅ REPLACE removes or replaces characters ✅ SUBSTRING extracts portions of text ✅ CAST converts data types ✅ DISTINCT and ROW_NUMBER remove duplicates ✅ CASE handles conditional logic ✅ Always clean into new tables, keep originals ✅ Document your cleaning rules ✅ Validate results after cleaning
Data cleaning workflow:
- Assess data quality (run checks)
- Document issues found
- Write cleaning queries (test in SELECT first!)
- Create cleaned table or view
- Validate cleaned data
- Document what was changed
Remember: Good data analysis starts with clean data. Garbage in, garbage out!
Practice Exercise: Clean this messy data:
- Standardize all emails to lowercase
- Remove non-digit characters from phone numbers
- Replace NULL cities with 'Unknown'
- Convert all names to Title Case
- Remove rows where email is invalid (no @ sign)
- Flag records with missing required fields