Data Cleaning in SQL

Clean and standardize data using SQL techniques.

22 min read
Intermediate

Introduction to Data Cleaning in SQL

Real-world data is messy. Customer names have inconsistent capitalization. Phone numbers come in different formats. NULLs appear where you don't expect them. Before you can analyze data, you need to clean it.

Data cleaning transforms raw, inconsistent data into a clean, standardized format ready for analysis. SQL provides powerful functions for:

  • Handling NULL values
  • Standardizing text formats
  • Parsing and transforming strings
  • Converting data types
  • Removing duplicates
  • Fixing inconsistencies

Real-world scenario: You receive customer data from multiple sources:

  • Source A: Names in "LastName, FirstName" format
  • Source B: Names in "FirstName LastName" format
  • Some phone numbers include dashes, others don't
  • Dates come as strings like "2024-01-15" or "01/15/2024"
  • Many fields contain NULL or empty strings

SQL can clean all of this!

The process of detecting and correcting (or removing) corrupt, inaccurate, or inconsistent records from a dataset. Essential before analysis or reporting.

Common Data Quality Issues
Issue
Example
SQL Solution
NULL valuesMissing customer emailsCOALESCE, NULLIF, IFNULL
Inconsistent formatting"John Doe" vs "JOHN DOE"UPPER, LOWER, TRIM
Extra whitespace" Alice "TRIM, LTRIM, RTRIM
Mixed data types"100" stored as textCAST, CONVERT
DuplicatesSame customer entered twiceDISTINCT, ROW_NUMBER
Invalid values-1 for ageCASE, NULLIF

Handling NULL Values

NULL means "unknown" or "missing." You need to handle NULLs carefully in calculations and comparisons.

customers_raw
customer_id
name
email
phone
city
1Alice Smithalice@mail.com555-1234NYC
2Bob JonesNULL555-5678NULL
3Charlie Browncharlie@mail.comNULLBoston
4Diana PrinceNULL555-9012LA

COALESCE - Return first non-NULL value:

sql
SELECT
  customer_id,
  name,
  COALESCE(email, 'no-email@example.com') AS email,
  COALESCE(phone, 'No phone') AS phone,
  COALESCE(city, 'Unknown') AS city
FROM customers_raw;

NULLIF - Convert specific values to NULL:

products_raw
product_id
name
price
discount
1Laptop999.9950.00
2Mouse29.990
3Keyboard79.99-1
sql
-- Convert invalid discount values (-1, 0) to NULL
SELECT
  product_id,
  name,
  price,
  NULLIF(discount, -1) AS discount_clean1,
  NULLIF(NULLIF(discount, -1), 0) AS discount_clean2
FROM products_raw;

COALESCE vs NULLIF:

  • COALESCE: Replace NULL with a default value
  • NULLIF: Convert a specific value to NULL

Text Cleaning and Standardization

Standardize text data for consistency.

TRIM, LTRIM, RTRIM - Remove whitespace:

names_raw
id
name
1Alice
2Bob
3Charlie
sql
SELECT
  id,
  CONCAT('[', name, ']') AS original,
  CONCAT('[', TRIM(name), ']') AS trimmed,
  CONCAT('[', LTRIM(name), ']') AS left_trimmed,
  CONCAT('[', RTRIM(name), ']') AS right_trimmed
FROM names_raw;

UPPER, LOWER, INITCAP - Case conversion:

sql
SELECT
  'John DOE' AS original,
  UPPER('John DOE') AS uppercase,
  LOWER('John DOE') AS lowercase,
  CONCAT(
    UPPER(SUBSTRING('john doe', 1, 1)),
    LOWER(SUBSTRING('john doe', 2))
  ) AS title_case;

REPLACE - Find and replace text:

phones_raw
customer_id
phone
1(555) 123-4567
2555.987.6543
3555-111-2222
sql
-- Standardize phone numbers to digits only
SELECT
  customer_id,
  phone AS original,
  REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
  ) AS digits_only
FROM phones_raw;

Chaining REPLACE: You can nest multiple REPLACE calls to remove different characters. Modern databases also support REGEXP_REPLACE for pattern-based replacement.

String Parsing and Extraction

SUBSTRING - Extract part of a string:

sql
SELECT
  'Alice Smith' AS full_name,
  SUBSTRING('Alice Smith', 1, 5) AS first_name,
  SUBSTRING('Alice Smith', 7) AS last_name,
  LEFT('Alice Smith', 5) AS left_5,
  RIGHT('Alice Smith', 5) AS right_5;

SPLIT_PART / SUBSTRING_INDEX - Split by delimiter:

emails
user_id
email
1alice@company.com
2bob@startup.io
3charlie@business.org
sql
-- Extract username and domain
SELECT
  user_id,
  email,
  SUBSTRING_INDEX(email, '@', 1) AS username,
  SUBSTRING_INDEX(email, '@', -1) AS domain
FROM emails;

LENGTH - Check string length:

sql
-- Find invalid phone numbers (not 10 digits)
SELECT
  customer_id,
  phone,
  LENGTH(phone) AS phone_length,
  CASE
    WHEN LENGTH(phone) = 10 THEN 'Valid'
    ELSE 'Invalid'
  END AS validation
FROM phones_cleaned;

Type Conversion and Casting

CAST - Convert between data types:

orders_raw
order_id
customer_id
total
order_date
1101150.502024-01-15
21022002024-01-16
3103invalid2024-01-17
sql
-- Convert string to number, handle errors
SELECT
  order_id,
  total AS original,
  CASE
    WHEN total REGEXP '^[0-9]+(\.[0-9]+)?$'
    THEN CAST(total AS DECIMAL(10,2))
    ELSE NULL
  END AS total_numeric
FROM orders_raw;

Date parsing:

dates_raw
id
date_string
format
12024-01-15ISO
201/15/2024US
315-01-2024EU
sql
SELECT
  id,
  date_string,
  STR_TO_DATE(date_string, '%Y-%m-%d') AS iso_date,
  STR_TO_DATE(date_string, '%m/%d/%Y') AS us_date,
  STR_TO_DATE(date_string, '%d-%m-%Y') AS eu_date
FROM dates_raw;

Removing Duplicates

DISTINCT - Simple deduplication:

customers_dupes
customer_id
email
name
1alice@mail.comAlice Smith
2bob@mail.comBob Jones
3alice@mail.comAlice Smith
4charlie@mail.comCharlie Brown
sql
-- Remove exact duplicates
SELECT DISTINCT email, name
FROM customers_dupes
ORDER BY name;

ROW_NUMBER - Keep first occurrence:

sql
-- Keep earliest customer_id per email
SELECT customer_id, email, name
FROM (
  SELECT
    customer_id,
    email,
    name,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id) AS rn
  FROM customers_dupes
) ranked
WHERE rn = 1;

CASE Statements for Conditional Cleaning

CASE is incredibly versatile for data cleaning:

products_messy
product_id
category
price
stock
1ELECTRONICS999.9910
2electronics29.99-5
3Electronics79.990
4Furniture-10.0020
sql
SELECT
  product_id,
  -- Standardize category
  UPPER(TRIM(category)) AS category_clean,
  -- Fix negative prices
  CASE
    WHEN price < 0 THEN NULL
    ELSE price
  END AS price_clean,
  -- Fix negative stock
  CASE
    WHEN stock < 0 THEN 0
    ELSE stock
  END AS stock_clean,
  -- Add status flag
  CASE
    WHEN price < 0 OR stock < 0 THEN 'NEEDS REVIEW'
    WHEN stock = 0 THEN 'OUT OF STOCK'
    ELSE 'OK'
  END AS data_quality
FROM products_messy;

Complete Data Cleaning Example

Scenario: Clean customer data from multiple sources

customers_import
id
name
email
phone
city
state
1ALICE SMITHAlice@Email.com(555) 123-4567new yorkNY
2bob jonesNULL555.987.6543NULLca
3Charlie Browncharlie@email.comBostonma
4NULLdiana@email.com555-111-2222Los AngelesCA
sql
SELECT
  id,
  -- Clean name: trim, title case
  CASE
    WHEN TRIM(name) = '' OR name IS NULL THEN 'Unknown'
    ELSE CONCAT(
      UPPER(SUBSTRING(TRIM(LOWER(name)), 1, 1)),
      LOWER(SUBSTRING(TRIM(name), 2))
    )
  END AS name_clean,
  -- Clean email: lowercase, handle NULL
  COALESCE(LOWER(TRIM(email)), 'no-email@example.com') AS email_clean,
  -- Clean phone: remove formatting
  CASE
    WHEN TRIM(phone) = '' OR phone IS NULL THEN NULL
    ELSE REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
      phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
    )
  END AS phone_clean,
  -- Clean city: title case
  CASE
    WHEN city IS NULL OR TRIM(city) = '' THEN 'Unknown'
    ELSE CONCAT(
      UPPER(SUBSTRING(TRIM(LOWER(city)), 1, 1)),
      LOWER(SUBSTRING(TRIM(city), 2))
    )
  END AS city_clean,
  -- Clean state: uppercase, 2 letters
  UPPER(TRIM(COALESCE(state, 'XX'))) AS state_clean
FROM customers_import;

This query handles:

  • NULL values
  • Inconsistent capitalization
  • Extra whitespace
  • Phone number formatting
  • Missing data

Create a cleaned table:

sql
CREATE TABLE customers_clean AS
SELECT
  id AS customer_id,
  -- [all cleaning logic from above]
FROM customers_import
WHERE email IS NOT NULL;  -- Exclude rows without emails

Data Quality Checks

Identify data quality issues:

sql
-- Find records with issues
SELECT
  'Missing email' AS issue,
  COUNT(*) AS count
FROM customers_import
WHERE email IS NULL OR TRIM(email) = ''

UNION ALL

SELECT
  'Missing name',
  COUNT(*)
FROM customers_import
WHERE name IS NULL OR TRIM(name) = ''

UNION ALL

SELECT
  'Invalid phone length',
  COUNT(*)
FROM customers_import
WHERE LENGTH(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
  phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
)) != 10;

Common Mistakes to Avoid

Don't make these mistakes:

  1. Modifying original data directly: ❌

    • Always clean into a new table or view
    • Keep raw data intact for auditing
  2. Not documenting cleaning rules: ❌

    • Document what you changed and why
    • Future you will thank you!
  3. Over-cleaning: Sometimes

    • Don't remove variations that might be legitimate
    • "Bob" vs "Robert" might be different people
  4. Ignoring NULLs in calculations: ❌

    • NULL + 100 = NULL
    • Use COALESCE before calculations
  5. Case-sensitive comparisons: ⚠️

    • "alice@email.com" != "Alice@Email.com" in some databases
    • Always use UPPER() or LOWER() for comparisons
  6. Not handling edge cases: ❌

    • Empty strings vs NULL
    • Zero vs NULL
    • Whitespace-only strings
  7. Running UPDATE without WHERE: ❌

    • Test in SELECT first!
    • Then convert to UPDATE
  8. Not validating after cleaning: ❌

    • Always check that cleaning worked
    • Run quality checks on cleaned data

Key Takeaways

What you learned:

COALESCE replaces NULL with default values ✅ NULLIF converts specific values to NULL ✅ TRIM removes whitespace ✅ UPPER/LOWER standardizes case ✅ REPLACE removes or replaces characters ✅ SUBSTRING extracts portions of text ✅ CAST converts data types ✅ DISTINCT and ROW_NUMBER remove duplicates ✅ CASE handles conditional logic ✅ Always clean into new tables, keep originals ✅ Document your cleaning rules ✅ Validate results after cleaning

Data cleaning workflow:

  1. Assess data quality (run checks)
  2. Document issues found
  3. Write cleaning queries (test in SELECT first!)
  4. Create cleaned table or view
  5. Validate cleaned data
  6. Document what was changed

Remember: Good data analysis starts with clean data. Garbage in, garbage out!

Practice Exercise: Clean this messy data:

  1. Standardize all emails to lowercase
  2. Remove non-digit characters from phone numbers
  3. Replace NULL cities with 'Unknown'
  4. Convert all names to Title Case
  5. Remove rows where email is invalid (no @ sign)
  6. Flag records with missing required fields