Clean Data XLS Skill
What is Data Cleaning?
Data cleaning (also called data wrangling or data preparation) is the process of detecting and correcting errors, inconsistencies, and formatting problems in raw data before it can be used for analysis. In finance, this step is often overlooked or rushed, but it is critical: a financial model built on dirty data will produce unreliable outputs regardless of how sophisticated the model is. Financial data gets messy for many reasons:- Multiple sources — Pulling data from Bloomberg, SEC filings, and internal databases produces different formats for the same fields (revenue as “485.0 million”).
- Manual entry — Analysts typing data into spreadsheets introduce typos, inconsistent casing (“Apple Inc.” vs “APPLE INC” vs “Apple Inc”), and formatting variations.
- Copy-paste artifacts — Copying data from PDFs, web pages, or other spreadsheets brings invisible characters, weird spacing, and numbers stored as text.
- Date format chaos — “3/8/26” might mean March 8, 2026 in the US or August 3, 2026 in Europe. Mixed date formats in the same column make time-series analysis impossible.
- Encoding issues — Characters like accented letters (e.g., in “Societe Generale”) can get corrupted into mojibake (é instead of e).
Why It Matters
Data quality issues compound as they flow through analysis:- Numbers stored as text cannot be summed, averaged, or used in formulas. Excel will silently skip them in SUM ranges, producing subtotals that are wrong but look reasonable.
- Inconsistent company names cause comps to fail. If “J.P. Morgan” appears in one dataset and “JPMorgan” in another, a VLOOKUP will not match them.
- Mixed date formats break time-series calculations. If some dates parse as March and others as August, growth rates and trend lines will be nonsensical.
- Leading/trailing spaces cause exact-match lookups to fail. “Apple Inc.” and “Apple Inc. ” (with a trailing space) are different strings to Excel.
Key Concepts
| Term | Definition | Why It Matters |
|---|---|---|
| Numbers as Text | Numeric values stored as text strings. Excel shows a green triangle in the corner. | Cannot be used in calculations. SUM and AVERAGE will silently skip these cells. |
| Whitespace | Leading spaces, trailing spaces, or double spaces in cell values. | Causes lookup failures. “Apple Inc.” will not match ” Apple Inc.” in VLOOKUP. |
| Mixed Types | A column that is mostly numbers but contains a few text entries. | Breaks formulas that expect consistent types. The text entries are usually data entry errors. |
| Near-Duplicates | Rows that represent the same entity but with slight differences (“Apple Inc.” vs “Apple Inc” vs “APPLE INC”). | Inflates counts and distorts analysis if not merged. |
| Mojibake | Garbled characters from encoding errors (é instead of e). | Makes data unreadable and breaks text matching. |
| Helper Column | A new column with a formula that cleans the adjacent data (e.g., =TRIM(A2)). | Preserves the original data while creating a clean version — auditable and reversible. |
How It Works
Triggers when: “clean this data”, “normalize this data”, “fix formatting”, “dedupe”, “standardize this column”, or “this data is messy.”Core Principle: Formulas Over Overwrites
| Cleaning Task | Formula Approach |
|---|---|
| Trim whitespace | =TRIM(A2) |
| Convert to uppercase | =UPPER(B2) |
| Convert to proper case | =PROPER(C2) |
| Strip currency symbols | =VALUE(SUBSTITUTE(D2,"$","")) |
| Remove commas | =VALUE(SUBSTITUTE(E2,",","")) |
| Parse dates | =DATEVALUE(F2) |
| Remove % sign | =VALUE(SUBSTITUTE(G2,"%",""))/100 |
Workflow
Determine Scope
Identify the range to clean. Profile each column: detect its dominant type (text, number, date) and identify outliers.
Detect Issues
Scan for all issue types before proposing any fixes:
| Issue | What to Look For |
|---|---|
| Whitespace | Leading/trailing spaces, double spaces |
| Casing | Inconsistent casing in categorical columns |
| Number-as-text | Numeric values stored as text; stray $, commas, % |
| Dates | Mixed formats in the same column |
| Duplicates | Exact-duplicate rows and near-duplicates |
| Blanks | Empty cells in otherwise-populated columns |
| Mixed types | A column that is 98% numbers but has 3 text entries |
| Encoding | Mojibake, non-printing characters |
| Errors | #REF!, #N/A, #VALUE!, #DIV/0! |
Propose Fixes
Show a summary table before changing anything:
| Column | Issue | Count | Proposed Fix |
|---|---|---|---|
| Company Name | Whitespace | 47 rows | =TRIM(A2) in helper column |
| Revenue | Numbers as text | 23 rows | =VALUE(SUBSTITUTE(B2,”$”,"")) |
| Country | Inconsistent casing | 15 rows | =UPPER(C2) |
Apply Fixes
Apply each category in sequence, confirming with the user after each:
- Whitespace — show before/after sample — confirm
- Casing — show before/after sample — confirm
- Number conversion — show before/after sample — confirm
- Dates — show before/after sample — confirm
- Deduplication — show rows to be removed — confirm
Handling Special Cases
Near-Duplicates:How to Add to Your Local Context
Best Practices
- Profile the data before cleaning. Understanding the dominant type and common issues in each column prevents applying the wrong fix.
- Always preserve the original data. Use helper columns rather than overwriting. If something goes wrong, you can always go back.
- Clean in order. Fix whitespace first (it affects everything else), then casing, then type conversions, then deduplication.
- Verify totals after cleaning. Compare SUM of the original column to SUM of the cleaned column. If they differ, some values were converted that should not have been (or vice versa).
- Document what you did. Add a “Data Cleaning Notes” tab or section documenting which columns were cleaned, how, and why.