
How to remove duplicates from a CSV file — without writing code

May 19, 2026 · 3 min read · By Sam A.

Most deduplication tutorials assume you're comfortable with pandas or awk. You shouldn't need to be. If you have a CSV with duplicate rows and a browser, you're already equipped.

Duplicate rows are one of the most common problems in data work. They appear in CRM exports, form submissions, API responses, and any situation where two sources get merged without a dedup pass. Left uncleaned, they skew aggregates, inflate counts, and cause bugs downstream.

The problem with duplicate rows

A "duplicate" isn't always an exact copy. In a contacts CSV, you might have two rows for "Alice Johnson" — one from a 2023 export and one from a 2024 sync — with the same email but a different job title. Which one is the duplicate depends entirely on which columns you care about.

This is why a simple sort | uniq often fails silently: it only removes rows that are byte-for-byte identical, and real data almost never is.
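To make the difference concrete, here is a minimal Python sketch using only the standard library. The sample rows and the choice of `email` as the key are made up for illustration; they mirror the Alice Johnson example above.

```python
import csv
import io

# Hypothetical sample data: the same person appears in two exports
# with a different job title.
SAMPLE = """name,email,title
Alice Johnson,alice@acme.com,Engineer
Alice Johnson,alice@acme.com,Senior Engineer
Bob Lee,bob@acme.com,Designer
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Exact-match comparison (the sort | uniq behaviour): whole rows must be identical.
exact_unique = {tuple(row.values()) for row in rows}

# Key-based comparison: only the email column decides identity.
key_unique = {row["email"] for row in rows}

print(len(rows), len(exact_unique), len(key_unique))  # prints: 3 3 2
```

Exact matching finds nothing to remove, because the two Alice rows differ in one field. Comparing on `email` alone finds the duplicate.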

Using dedup.ing in Advanced mode

The Advanced tab on dedup.ing turns the plain textarea into a spreadsheet-style grid. Upload a CSV or paste one in, and the tool parses it into columns you can inspect.

Before you run anything, the grid already highlights rows it considers duplicates — based on whichever columns are currently checked. Red strikethrough rows are what would be removed. This is the key interaction: you're configuring the dedupe key visually, not writing it.

HOW IT WORKS

The duplicate preview updates live as you tick and untick column checkboxes. You don't need to run the tool to see which rows would be removed — the grid shows you before you commit.

Choosing your key columns

The columns you tick define the "key" — the combination of fields that must be unique. The right choice depends on your data:

  • Contacts list: tick email only. Two contacts with the same email are the same person, even if the name or phone differs.
  • Order records: tick order_id. A duplicate order ID is always an error.
  • Survey responses: tick user_id + survey_date. One response per user per day.
  • Product catalog: tick sku + variant. SKU alone might be too broad.

Leave a column unticked and it's ignored for the purpose of detecting duplicates — but it's still included in the output. Untick all columns and the Dedupe button disables.
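The idea of a multi-column key can be sketched in a few lines of Python. This is an illustration of the concept, not dedup.ing's actual implementation; the function name `dedupe_rows` and the sample survey data are assumptions made for the example.

```python
import csv
import io

def dedupe_rows(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    if not key_columns:
        # Mirrors the idea that an empty key makes no sense to dedupe on.
        raise ValueError("select at least one key column")
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # the whole row is kept, not just the key columns
    return kept

# Hypothetical survey export: u1 answered twice on the same day.
SAMPLE = """user_id,survey_date,score
u1,2024-03-01,7
u1,2024-03-01,9
u1,2024-03-02,8
u2,2024-03-01,5
"""

responses = list(csv.DictReader(io.StringIO(SAMPLE)))
by_user = dedupe_rows(responses, ["user_id"])
by_user_and_day = dedupe_rows(responses, ["user_id", "survey_date"])

print(len(by_user), len(by_user_and_day))  # prints: 2 3
```

Keying on `user_id` alone would throw away u1's legitimate second-day response. Adding `survey_date` to the key removes only the same-day repeat, and the `score` column still appears in every surviving row.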

Handling case sensitivity

The "Case" option in the toolbar controls whether ALICE@ACME.COM and alice@acme.com are treated as the same value. For email addresses, use case-insensitive matching, since both spellings almost always reach the same mailbox. For codes, SKUs, and IDs that are case-sensitive by design, leave it on Sensitive.

"Keep First" vs "Keep Last" controls which row survives when duplicates are found. For time-series data whose rows are in chronological order, keeping the last occurrence means keeping the most recent record, which is typically what you want. If the file isn't sorted by date, sort it first.
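How these two options interact can be sketched in plain Python. This is a rough model under stated assumptions, not the tool's real code: the `dedupe` function, its `case_sensitive` and `keep` parameters, and the sample rows are all hypothetical.

```python
def dedupe(rows, key_columns, case_sensitive=True, keep="first"):
    """Dedupe a list of dicts on key_columns, keeping the first or last match."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    def make_key(row):
        values = (row[col] for col in key_columns)
        # casefold() is a more aggressive lower() intended for caseless matching.
        return tuple(v if case_sensitive else v.casefold() for v in values)

    survivors = {}
    for row in rows:
        key = make_key(row)
        # "first": only store a key the first time it is seen.
        # "last": overwrite, so the final occurrence wins.
        if keep == "last" or key not in survivors:
            survivors[key] = row
    # Output order follows the first appearance of each key.
    return list(survivors.values())

# Hypothetical rows: the same address in two different cases.
rows = [
    {"email": "ALICE@ACME.COM", "title": "Engineer"},
    {"email": "alice@acme.com", "title": "Senior Engineer"},
    {"email": "bob@acme.com", "title": "Designer"},
]

sensitive = dedupe(rows, ["email"])
first = dedupe(rows, ["email"], case_sensitive=False, keep="first")
last = dedupe(rows, ["email"], case_sensitive=False, keep="last")

print(len(sensitive), first[0]["title"], "/", last[0]["title"])
# prints: 3 Engineer / Senior Engineer
```

With case-sensitive matching nothing is removed. With case-insensitive matching the two Alice rows collapse into one, and `keep` decides which job title survives.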

The Python alternative

If you're processing files programmatically, pandas handles this in four lines:

python
import pandas as pd

df = pd.read_csv('contacts.csv')
df.drop_duplicates(subset=['email'], keep='first', inplace=True)
df.to_csv('contacts_clean.csv', index=False)

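Pass ignore_index=True to drop_duplicates to renumber the DataFrame's index from 0 after rows are dropped. This only matters if you keep working with the DataFrame: the file written with index=False never contains the index. For case-insensitive deduplication, normalise the key column first: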

python
df['email_key'] = df['email'].str.lower().str.strip()
df.drop_duplicates(subset=['email_key'], keep='first', inplace=True)
df.drop(columns=['email_key'], inplace=True)

The command-line approach

For plain text lists — one item per line, no columns — the shell is still the fastest option:

sh
# Sort and remove exact duplicates
sort contacts.txt | uniq > contacts_clean.txt

# Count how many duplicate lines were removed
echo "$(( $(wc -l < contacts.txt) - $(sort -u contacts.txt | wc -l) )) duplicates removed"

# Case-insensitive deduplication
sort -f contacts.txt | uniq -i > contacts_clean.txt

This approach only works for single-column data. For anything with structure — CSV, TSV, JSON — you need a parser that understands the format. The browser tool or pandas is the right choice there.
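A short Python sketch shows why line-based tools break on structured data. The sample CSV is invented for the example; the point is that a quoted field may itself contain a comma, which only a real CSV parser handles.

```python
import csv
import io

# Hypothetical CSV where the name field contains a comma inside quotes.
SAMPLE = 'name,email\n"Johnson, Alice",alice@acme.com\nBob Lee,bob@acme.com\n'

# Naive approach: treat every comma as a column separator.
naive = [line.split(",") for line in SAMPLE.splitlines()]

# Parser approach: the csv module respects the quoting rules.
parsed = list(csv.reader(io.StringIO(SAMPLE)))

print(naive[1])   # ['"Johnson', ' Alice"', 'alice@acme.com']
print(parsed[1])  # ['Johnson, Alice', 'alice@acme.com']
```

The naive split yields three fields for a two-column row, so "column 2" is no longer the email address and any key built from it is wrong.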

The right tool depends on the job. For a one-off cleanup in the browser: dedup.ing. For a repeatable pipeline: pandas or a shell script. The important thing is that you're explicit about what "duplicate" means in your data — which column or combination of columns defines identity — rather than relying on byte-for-byte comparison and hoping for the best.
