Skip to main content

Clean Data XLS Skill

What is Data Cleaning?

Data cleaning (also called data wrangling or data preparation) is the process of detecting and correcting errors, inconsistencies, and formatting problems in raw data before it can be used for analysis. In finance, this step is often overlooked or rushed, but it is critical: a financial model built on dirty data will produce unreliable outputs regardless of how sophisticated the model is. Financial data gets messy for many reasons:
  • Multiple sources — Pulling data from Bloomberg, SEC filings, and internal databases produces different formats for the same fields (revenue as “485M","485,000",or"485M", "485,000", or "485.0 million”).
  • Manual entry — Analysts typing data into spreadsheets introduce typos, inconsistent casing (“Apple Inc.” vs “APPLE INC” vs “Apple Inc”), and formatting variations.
  • Copy-paste artifacts — Copying data from PDFs, web pages, or other spreadsheets brings invisible characters, weird spacing, and numbers stored as text.
  • Date format chaos — “3/8/26” might mean March 8, 2026 in the US or August 3, 2026 in Europe. Mixed date formats in the same column make time-series analysis impossible.
  • Encoding issues — Characters like accented letters (e.g., in “Societe Generale”) can get corrupted into mojibake (é instead of e).

Why It Matters

Data quality issues compound as they flow through analysis:
  • Numbers stored as text cannot be summed, averaged, or used in formulas. Excel will silently skip them in SUM ranges, producing subtotals that are wrong but look reasonable.
  • Inconsistent company names cause comps to fail. If “J.P. Morgan” appears in one dataset and “JPMorgan” in another, a VLOOKUP will not match them.
  • Mixed date formats break time-series calculations. If some dates parse as March and others as August, growth rates and trend lines will be nonsensical.
  • Leading/trailing spaces cause exact-match lookups to fail. “Apple Inc.” and “Apple Inc. ” (with a trailing space) are different strings to Excel.
In institutional finance, data quality issues are not just annoyances — they can lead to incorrect valuations, flawed peer comparisons, and models that produce wrong answers. Catching these problems early is far cheaper than debugging a model that produces suspicious outputs.

Key Concepts

TermDefinitionWhy It Matters
Numbers as TextNumeric values stored as text strings. Excel shows a green triangle in the corner.Cannot be used in calculations. SUM and AVERAGE will silently skip these cells.
WhitespaceLeading spaces, trailing spaces, or double spaces in cell values.Causes lookup failures. “Apple Inc.” will not match ” Apple Inc.” in VLOOKUP.
Mixed TypesA column that is mostly numbers but contains a few text entries.Breaks formulas that expect consistent types. The text entries are usually data entry errors.
Near-DuplicatesRows that represent the same entity but with slight differences (“Apple Inc.” vs “Apple Inc” vs “APPLE INC”).Inflates counts and distorts analysis if not merged.
MojibakeGarbled characters from encoding errors (é instead of e).Makes data unreadable and breaks text matching.
Helper ColumnA new column with a formula that cleans the adjacent data (e.g., =TRIM(A2)).Preserves the original data while creating a clean version — auditable and reversible.

How It Works

Triggers when: “clean this data”, “normalize this data”, “fix formatting”, “dedupe”, “standardize this column”, or “this data is messy.”

Core Principle: Formulas Over Overwrites

Prefer formulas over hardcoded cleaned values. Write cleaning formulas in helper columns rather than overwriting the original data. This keeps the transformation transparent and auditable — you can always see what changed and why.
Cleaning TaskFormula Approach
Trim whitespace=TRIM(A2)
Convert to uppercase=UPPER(B2)
Convert to proper case=PROPER(C2)
Strip currency symbols=VALUE(SUBSTITUTE(D2,"$",""))
Remove commas=VALUE(SUBSTITUTE(E2,",",""))
Parse dates=DATEVALUE(F2)
Remove % sign=VALUE(SUBSTITUTE(G2,"%",""))/100

Workflow

1

Determine Scope

Identify the range to clean. Profile each column: detect its dominant type (text, number, date) and identify outliers.
2

Detect Issues

Scan for all issue types before proposing any fixes:
IssueWhat to Look For
WhitespaceLeading/trailing spaces, double spaces
CasingInconsistent casing in categorical columns
Number-as-textNumeric values stored as text; stray $, commas, %
DatesMixed formats in the same column
DuplicatesExact-duplicate rows and near-duplicates
BlanksEmpty cells in otherwise-populated columns
Mixed typesA column that is 98% numbers but has 3 text entries
EncodingMojibake, non-printing characters
Errors#REF!, #N/A, #VALUE!, #DIV/0!
3

Propose Fixes

Show a summary table before changing anything:
ColumnIssueCountProposed Fix
Company NameWhitespace47 rows=TRIM(A2) in helper column
RevenueNumbers as text23 rows=VALUE(SUBSTITUTE(B2,”$”,""))
CountryInconsistent casing15 rows=UPPER(C2)
4

Apply Fixes

Apply each category in sequence, confirming with the user after each:
  1. Whitespace — show before/after sample — confirm
  2. Casing — show before/after sample — confirm
  3. Number conversion — show before/after sample — confirm
  4. Dates — show before/after sample — confirm
  5. Deduplication — show rows to be removed — confirm

Handling Special Cases

Near-Duplicates:
"Apple Inc." vs "Apple Inc" vs "APPLE INC" -- same company?
"J.P. Morgan" vs "JP Morgan" vs "JPMorgan" -- same company?
Show these cases to the user; do not auto-merge. Mixed-Type Columns: Flag non-conforming cells, show the user which rows have text in a numeric column, and ask whether they are data entry errors or intentional (e.g., “N/A”, “TBD”).
Never remove duplicates or overwrite source data without explicit user approval. A “dedup” that removes the wrong rows can destroy data that cannot be recovered.

How to Add to Your Local Context

1

Install the Plugin

claude plugin install financial-analysis@financial-services-plugins
2

Set Data Standards

Edit skills/clean-data-xls/SKILL.md:
## Firm Data Standards
- Company names: always use official legal name from Capital IQ
- Dates: always standardize to YYYY-MM-DD format
- Currency: always strip symbols and store as numbers
- Percentages: always store as decimals (0.15, not 15%)
- Blanks: always fill with "N/A" for text columns, leave empty for numeric
3

Add Common Mappings

## Standard Name Mappings
- "JPM" / "JP Morgan" / "J.P. Morgan" -> "JPMorgan Chase & Co."
- "GS" / "Goldman" -> "The Goldman Sachs Group, Inc."
- "MS" / "Morgan Stanley" -> "Morgan Stanley"

Best Practices

  • Profile the data before cleaning. Understanding the dominant type and common issues in each column prevents applying the wrong fix.
  • Always preserve the original data. Use helper columns rather than overwriting. If something goes wrong, you can always go back.
  • Clean in order. Fix whitespace first (it affects everything else), then casing, then type conversions, then deduplication.
  • Verify totals after cleaning. Compare SUM of the original column to SUM of the cleaned column. If they differ, some values were converted that should not have been (or vice versa).
  • Document what you did. Add a “Data Cleaning Notes” tab or section documenting which columns were cleaned, how, and why.