Clean Data XLS Skill

What is Data Cleaning?

Data cleaning (also called data wrangling or data preparation) is the process of detecting and correcting errors, inconsistencies, and formatting problems in raw data before it can be used for analysis. In finance, this step is often overlooked or rushed, but it is critical: a financial model built on dirty data will produce unreliable outputs regardless of how sophisticated the model is. Financial data gets messy for many reasons:

Multiple sources — Pulling data from Bloomberg, SEC filings, and internal databases produces different formats for the same fields (revenue as “ $485M", "485,000", or "$ 485.0 million”).
Manual entry — Analysts typing data into spreadsheets introduce typos, inconsistent casing (“Apple Inc.” vs “APPLE INC” vs “Apple Inc”), and formatting variations.
Copy-paste artifacts — Copying data from PDFs, web pages, or other spreadsheets brings invisible characters, weird spacing, and numbers stored as text.
Date format chaos — “3/8/26” might mean March 8, 2026 in the US or August 3, 2026 in Europe. Mixed date formats in the same column make time-series analysis impossible.
Encoding issues — Characters like accented letters (e.g., in “Societe Generale”) can get corrupted into mojibake (Ã© instead of e).

Why It Matters

Data quality issues compound as they flow through analysis:

Numbers stored as text cannot be summed, averaged, or used in formulas. Excel will silently skip them in SUM ranges, producing subtotals that are wrong but look reasonable.
Inconsistent company names cause comps to fail. If “J.P. Morgan” appears in one dataset and “JPMorgan” in another, a VLOOKUP will not match them.
Mixed date formats break time-series calculations. If some dates parse as March and others as August, growth rates and trend lines will be nonsensical.
Leading/trailing spaces cause exact-match lookups to fail. “Apple Inc.” and “Apple Inc. ” (with a trailing space) are different strings to Excel.

In institutional finance, data quality issues are not just annoyances — they can lead to incorrect valuations, flawed peer comparisons, and models that produce wrong answers. Catching these problems early is far cheaper than debugging a model that produces suspicious outputs.

Key Concepts

Term	Definition	Why It Matters
Numbers as Text	Numeric values stored as text strings. Excel shows a green triangle in the corner.	Cannot be used in calculations. SUM and AVERAGE will silently skip these cells.
Whitespace	Leading spaces, trailing spaces, or double spaces in cell values.	Causes lookup failures. “Apple Inc.” will not match ” Apple Inc.” in VLOOKUP.
Mixed Types	A column that is mostly numbers but contains a few text entries.	Breaks formulas that expect consistent types. The text entries are usually data entry errors.
Near-Duplicates	Rows that represent the same entity but with slight differences (“Apple Inc.” vs “Apple Inc” vs “APPLE INC”).	Inflates counts and distorts analysis if not merged.
Mojibake	Garbled characters from encoding errors (Ã© instead of e).	Makes data unreadable and breaks text matching.
Helper Column	A new column with a formula that cleans the adjacent data (e.g., =TRIM(A2)).	Preserves the original data while creating a clean version — auditable and reversible.

How It Works

Triggers when: “clean this data”, “normalize this data”, “fix formatting”, “dedupe”, “standardize this column”, or “this data is messy.”

Core Principle: Formulas Over Overwrites

Prefer formulas over hardcoded cleaned values. Write cleaning formulas in helper columns rather than overwriting the original data. This keeps the transformation transparent and auditable — you can always see what changed and why.

Cleaning Task	Formula Approach
Trim whitespace	`=TRIM(A2)`
Convert to uppercase	`=UPPER(B2)`
Convert to proper case	`=PROPER(C2)`
Strip currency symbols	`=VALUE(SUBSTITUTE(D2,"$",""))`
Remove commas	`=VALUE(SUBSTITUTE(E2,",",""))`
Parse dates	`=DATEVALUE(F2)`
Remove % sign	`=VALUE(SUBSTITUTE(G2,"%",""))/100`

Workflow

Determine Scope

Identify the range to clean. Profile each column: detect its dominant type (text, number, date) and identify outliers.

Detect Issues

Scan for all issue types before proposing any fixes:

Issue	What to Look For
Whitespace	Leading/trailing spaces, double spaces
Casing	Inconsistent casing in categorical columns
Number-as-text	Numeric values stored as text; stray $, commas, %
Dates	Mixed formats in the same column
Duplicates	Exact-duplicate rows and near-duplicates
Blanks	Empty cells in otherwise-populated columns
Mixed types	A column that is 98% numbers but has 3 text entries
Encoding	Mojibake, non-printing characters
Errors	#REF!, #N/A, #VALUE!, #DIV/0!

Propose Fixes

Show a summary table before changing anything:

Column	Issue	Count	Proposed Fix
Company Name	Whitespace	47 rows	=TRIM(A2) in helper column
Revenue	Numbers as text	23 rows	=VALUE(SUBSTITUTE(B2,”$”,""))
Country	Inconsistent casing	15 rows	=UPPER(C2)

Apply Fixes

Apply each category in sequence, confirming with the user after each:

Whitespace — show before/after sample — confirm
Casing — show before/after sample — confirm
Number conversion — show before/after sample — confirm
Dates — show before/after sample — confirm
Deduplication — show rows to be removed — confirm

Handling Special Cases

Near-Duplicates:

"Apple Inc." vs "Apple Inc" vs "APPLE INC" -- same company?
"J.P. Morgan" vs "JP Morgan" vs "JPMorgan" -- same company?

Show these cases to the user; do not auto-merge. Mixed-Type Columns: Flag non-conforming cells, show the user which rows have text in a numeric column, and ask whether they are data entry errors or intentional (e.g., “N/A”, “TBD”).

Never remove duplicates or overwrite source data without explicit user approval. A “dedup” that removes the wrong rows can destroy data that cannot be recovered.

How to Add to Your Local Context

Install the Plugin

claude plugin install financial-analysis@financial-services-plugins

Set Data Standards

Edit skills/clean-data-xls/SKILL.md:

## Firm Data Standards
- Company names: always use official legal name from Capital IQ
- Dates: always standardize to YYYY-MM-DD format
- Currency: always strip symbols and store as numbers
- Percentages: always store as decimals (0.15, not 15%)
- Blanks: always fill with "N/A" for text columns, leave empty for numeric

Add Common Mappings

## Standard Name Mappings
- "JPM" / "JP Morgan" / "J.P. Morgan" -> "JPMorgan Chase & Co."
- "GS" / "Goldman" -> "The Goldman Sachs Group, Inc."
- "MS" / "Morgan Stanley" -> "Morgan Stanley"

Best Practices

Profile the data before cleaning. Understanding the dominant type and common issues in each column prevents applying the wrong fix.
Always preserve the original data. Use helper columns rather than overwriting. If something goes wrong, you can always go back.
Clean in order. Fix whitespace first (it affects everything else), then casing, then type conversions, then deduplication.
Verify totals after cleaning. Compare SUM of the original column to SUM of the cleaned column. If they differ, some values were converted that should not have been (or vice versa).
Document what you did. Add a “Data Cleaning Notes” tab or section documenting which columns were cleaned, how, and why.

Overview

Financial Analysis

Equity Research

Investment Banking

Private Equity

Wealth Management

LSEG

S&P Global

Clean Data XLS Skill

Clean Data XLS Skill

What is Data Cleaning?

Why It Matters

Key Concepts

How It Works

Core Principle: Formulas Over Overwrites

Workflow

Handling Special Cases

How to Add to Your Local Context

Best Practices

​Clean Data XLS Skill

​What is Data Cleaning?

​Why It Matters

​Key Concepts

​How It Works

​Core Principle: Formulas Over Overwrites

​Workflow

​Handling Special Cases

​How to Add to Your Local Context

​Best Practices

Clean Data XLS Skill

What is Data Cleaning?

Why It Matters

Key Concepts

How It Works

Core Principle: Formulas Over Overwrites

Workflow

Handling Special Cases

How to Add to Your Local Context

Best Practices