Kekkei - Advancing Education For Everyone

Introduction to Data Cleaning in SQL

Real-world data is messy. Customer names have inconsistent capitalization. Phone numbers come in different formats. NULLs appear where you don't expect them. Before you can analyze data, you need to clean it.

Data cleaning transforms raw, inconsistent data into a clean, standardized format ready for analysis. SQL provides powerful functions for:

Handling NULL values
Standardizing text formats
Parsing and transforming strings
Converting data types
Removing duplicates
Fixing inconsistencies

Real-world scenario: You receive customer data from multiple sources:

Source A: Names in "LastName, FirstName" format
Source B: Names in "FirstName LastName" format
Some phone numbers include dashes, others don't
Dates come as strings like "2024-01-15" or "01/15/2024"
Many fields contain NULL or empty strings

SQL can clean all of this!

The process of detecting and correcting (or removing) corrupt, inaccurate, or inconsistent records from a dataset. Essential before analysis or reporting.

Common Data Quality Issues

Issue	Example	SQL Solution
NULL values	Missing customer emails	COALESCE, NULLIF, IFNULL
Inconsistent formatting	"John Doe" vs "JOHN DOE"	UPPER, LOWER, TRIM
Extra whitespace	" Alice "	TRIM, LTRIM, RTRIM
Mixed data types	"100" stored as text	CAST, CONVERT
Duplicates	Same customer entered twice	DISTINCT, ROW_NUMBER
Invalid values	-1 for age	CASE, NULLIF

Handling NULL Values

NULL means "unknown" or "missing." You need to handle NULLs carefully in calculations and comparisons.

customers_raw

customer_id	name	email	phone	city
1	Alice Smith	alice@mail.com	555-1234	NYC
2	Bob Jones	NULL	555-5678	NULL
3	Charlie Brown	charlie@mail.com	NULL	Boston
4	Diana Prince	NULL	555-9012	LA

COALESCE - Return first non-NULL value:

sql

SELECT
  customer_id,
  name,
  COALESCE(email, 'no-email@example.com') AS email,
  COALESCE(phone, 'No phone') AS phone,
  COALESCE(city, 'Unknown') AS city
FROM customers_raw;

NULLIF - Convert specific values to NULL:

products_raw

product_id	name	price	discount
1	Laptop	999.99	50.00
2	Mouse	29.99	0
3	Keyboard	79.99	-1

sql

-- Convert invalid discount values (-1, 0) to NULL
SELECT
  product_id,
  name,
  price,
  NULLIF(discount, -1) AS discount_clean1,
  NULLIF(NULLIF(discount, -1), 0) AS discount_clean2
FROM products_raw;

COALESCE vs NULLIF:

COALESCE: Replace NULL with a default value
NULLIF: Convert a specific value to NULL

Text Cleaning and Standardization

Standardize text data for consistency.

TRIM, LTRIM, RTRIM - Remove whitespace:

names_raw

id	name
1	Alice
2	Bob
3	Charlie

sql

SELECT
  id,
  CONCAT('[', name, ']') AS original,
  CONCAT('[', TRIM(name), ']') AS trimmed,
  CONCAT('[', LTRIM(name), ']') AS left_trimmed,
  CONCAT('[', RTRIM(name), ']') AS right_trimmed
FROM names_raw;

UPPER, LOWER, INITCAP - Case conversion:

sql

SELECT
  'John DOE' AS original,
  UPPER('John DOE') AS uppercase,
  LOWER('John DOE') AS lowercase,
  CONCAT(
    UPPER(SUBSTRING('john doe', 1, 1)),
    LOWER(SUBSTRING('john doe', 2))
  ) AS title_case;

REPLACE - Find and replace text:

phones_raw

customer_id	phone
1	(555) 123-4567
2	555.987.6543
3	555-111-2222

sql

-- Standardize phone numbers to digits only
SELECT
  customer_id,
  phone AS original,
  REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
  ) AS digits_only
FROM phones_raw;

Chaining REPLACE: You can nest multiple REPLACE calls to remove different characters. Modern databases also support REGEXP_REPLACE for pattern-based replacement.

String Parsing and Extraction

SUBSTRING - Extract part of a string:

sql

SELECT
  'Alice Smith' AS full_name,
  SUBSTRING('Alice Smith', 1, 5) AS first_name,
  SUBSTRING('Alice Smith', 7) AS last_name,
  LEFT('Alice Smith', 5) AS left_5,
  RIGHT('Alice Smith', 5) AS right_5;

SPLIT_PART / SUBSTRING_INDEX - Split by delimiter:

emails

user_id	email
1	alice@company.com
2	bob@startup.io
3	charlie@business.org

sql

-- Extract username and domain
SELECT
  user_id,
  email,
  SUBSTRING_INDEX(email, '@', 1) AS username,
  SUBSTRING_INDEX(email, '@', -1) AS domain
FROM emails;

LENGTH - Check string length:

sql

-- Find invalid phone numbers (not 10 digits)
SELECT
  customer_id,
  phone,
  LENGTH(phone) AS phone_length,
  CASE
    WHEN LENGTH(phone) = 10 THEN 'Valid'
    ELSE 'Invalid'
  END AS validation
FROM phones_cleaned;

Type Conversion and Casting

CAST - Convert between data types:

orders_raw

order_id	customer_id	total	order_date
1	101	150.50	2024-01-15
2	102	200	2024-01-16
3	103	invalid	2024-01-17

sql

-- Convert string to number, handle errors
SELECT
  order_id,
  total AS original,
  CASE
    WHEN total REGEXP '^[0-9]+(\.[0-9]+)?$'
    THEN CAST(total AS DECIMAL(10,2))
    ELSE NULL
  END AS total_numeric
FROM orders_raw;

Date parsing:

dates_raw

id	date_string	format
1	2024-01-15	ISO
2	01/15/2024	US
3	15-01-2024	EU

sql

SELECT
  id,
  date_string,
  STR_TO_DATE(date_string, '%Y-%m-%d') AS iso_date,
  STR_TO_DATE(date_string, '%m/%d/%Y') AS us_date,
  STR_TO_DATE(date_string, '%d-%m-%Y') AS eu_date
FROM dates_raw;

Removing Duplicates

DISTINCT - Simple deduplication:

customers_dupes

customer_id	email	name
1	alice@mail.com	Alice Smith
2	bob@mail.com	Bob Jones
3	alice@mail.com	Alice Smith
4	charlie@mail.com	Charlie Brown

sql

-- Remove exact duplicates
SELECT DISTINCT email, name
FROM customers_dupes
ORDER BY name;

ROW_NUMBER - Keep first occurrence:

sql

-- Keep earliest customer_id per email
SELECT customer_id, email, name
FROM (
  SELECT
    customer_id,
    email,
    name,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id) AS rn
  FROM customers_dupes
) ranked
WHERE rn = 1;

CASE Statements for Conditional Cleaning

CASE is incredibly versatile for data cleaning:

products_messy

product_id	category	price	stock
1	ELECTRONICS	999.99	10
2	electronics	29.99	-5
3	Electronics	79.99	0
4	Furniture	-10.00	20

sql

SELECT
  product_id,
  -- Standardize category
  UPPER(TRIM(category)) AS category_clean,
  -- Fix negative prices
  CASE
    WHEN price < 0 THEN NULL
    ELSE price
  END AS price_clean,
  -- Fix negative stock
  CASE
    WHEN stock < 0 THEN 0
    ELSE stock
  END AS stock_clean,
  -- Add status flag
  CASE
    WHEN price < 0 OR stock < 0 THEN 'NEEDS REVIEW'
    WHEN stock = 0 THEN 'OUT OF STOCK'
    ELSE 'OK'
  END AS data_quality
FROM products_messy;

Complete Data Cleaning Example

Scenario: Clean customer data from multiple sources

customers_import

id	name	email	phone	city	state
1	ALICE SMITH	Alice@Email.com	(555) 123-4567	new york	NY
2	bob jones	NULL	555.987.6543	NULL	ca
3	Charlie Brown	charlie@email.com	Boston	ma
4	NULL	diana@email.com	555-111-2222	Los Angeles	CA

sql

SELECT
  id,
  -- Clean name: trim, title case
  CASE
    WHEN TRIM(name) = '' OR name IS NULL THEN 'Unknown'
    ELSE CONCAT(
      UPPER(SUBSTRING(TRIM(LOWER(name)), 1, 1)),
      LOWER(SUBSTRING(TRIM(name), 2))
    )
  END AS name_clean,
  -- Clean email: lowercase, handle NULL
  COALESCE(LOWER(TRIM(email)), 'no-email@example.com') AS email_clean,
  -- Clean phone: remove formatting
  CASE
    WHEN TRIM(phone) = '' OR phone IS NULL THEN NULL
    ELSE REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
      phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
    )
  END AS phone_clean,
  -- Clean city: title case
  CASE
    WHEN city IS NULL OR TRIM(city) = '' THEN 'Unknown'
    ELSE CONCAT(
      UPPER(SUBSTRING(TRIM(LOWER(city)), 1, 1)),
      LOWER(SUBSTRING(TRIM(city), 2))
    )
  END AS city_clean,
  -- Clean state: uppercase, 2 letters
  UPPER(TRIM(COALESCE(state, 'XX'))) AS state_clean
FROM customers_import;

This query handles:

NULL values
Inconsistent capitalization
Extra whitespace
Phone number formatting
Missing data

Create a cleaned table:

sql

CREATE TABLE customers_clean AS
SELECT
  id AS customer_id,
  -- [all cleaning logic from above]
FROM customers_import
WHERE email IS NOT NULL;  -- Exclude rows without emails

Data Quality Checks

Identify data quality issues:

sql

-- Find records with issues
SELECT
  'Missing email' AS issue,
  COUNT(*) AS count
FROM customers_import
WHERE email IS NULL OR TRIM(email) = ''

UNION ALL

SELECT
  'Missing name',
  COUNT(*)
FROM customers_import
WHERE name IS NULL OR TRIM(name) = ''

UNION ALL

SELECT
  'Invalid phone length',
  COUNT(*)
FROM customers_import
WHERE LENGTH(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
  phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''
)) != 10;

Common Mistakes to Avoid

Don't make these mistakes:

Modifying original data directly: ❌
- Always clean into a new table or view
- Keep raw data intact for auditing
Not documenting cleaning rules: ❌
- Document what you changed and why
- Future you will thank you!
Over-cleaning: Sometimes
- Don't remove variations that might be legitimate
- "Bob" vs "Robert" might be different people
Ignoring NULLs in calculations: ❌
- NULL + 100 = NULL
- Use COALESCE before calculations
Case-sensitive comparisons: ⚠️
- "alice@email.com" != "Alice@Email.com" in some databases
- Always use UPPER() or LOWER() for comparisons
Not handling edge cases: ❌
- Empty strings vs NULL
- Zero vs NULL
- Whitespace-only strings
Running UPDATE without WHERE: ❌
- Test in SELECT first!
- Then convert to UPDATE
Not validating after cleaning: ❌
- Always check that cleaning worked
- Run quality checks on cleaned data

Key Takeaways

What you learned:

✅ COALESCE replaces NULL with default values ✅ NULLIF converts specific values to NULL ✅ TRIM removes whitespace ✅ UPPER/LOWER standardizes case ✅ REPLACE removes or replaces characters ✅ SUBSTRING extracts portions of text ✅ CAST converts data types ✅ DISTINCT and ROW_NUMBER remove duplicates ✅ CASE handles conditional logic ✅ Always clean into new tables, keep originals ✅ Document your cleaning rules ✅ Validate results after cleaning

Data cleaning workflow:

Assess data quality (run checks)
Document issues found
Write cleaning queries (test in SELECT first!)
Create cleaned table or view
Validate cleaned data
Document what was changed

Remember: Good data analysis starts with clean data. Garbage in, garbage out!

Practice Exercise: Clean this messy data:

Standardize all emails to lowercase
Remove non-digit characters from phone numbers
Replace NULL cities with 'Unknown'
Convert all names to Title Case
Remove rows where email is invalid (no @ sign)
Flag records with missing required fields

Introduction to Data Cleaning in SQLFocusStart Focus Mode

Handling NULL ValuesFocusStart Focus Mode

Text Cleaning and StandardizationFocusStart Focus Mode

String Parsing and ExtractionFocusStart Focus Mode

Type Conversion and CastingFocusStart Focus Mode

Removing DuplicatesFocusStart Focus Mode

CASE Statements for Conditional CleaningFocusStart Focus Mode

Complete Data Cleaning ExampleFocusStart Focus Mode

Data Quality ChecksFocusStart Focus Mode

Common Mistakes to AvoidFocusStart Focus Mode

Key TakeawaysFocusStart Focus Mode