Last updated on September 19th, 2024 at 12:46 pm
Keeping data accurate is essential for any database. Duplicate rows cause inconsistencies and waste storage space. In this article, we will learn how to delete duplicate rows in SQL, starting with simple approaches and working up to more advanced ones.
We'll explore a range of techniques, from the fundamental DISTINCT keyword to advanced Common Table Expressions (CTEs) combined with the ROW_NUMBER() function. This will make you adept at using SQL and help you keep your data clean and efficient.
Delete duplicate rows in SQL
In SQL, deleting duplicate rows means removing entries from a table that contain identical information according to specific criteria. Duplicate rows can occur for various reasons, including data entry mistakes, integrations from different sources, or incomplete deduplication processes.
Deleting duplicates helps you:
- Improve data integrity: By eliminating redundant data, you ensure that tables hold accurate and consistent data.
- Save storage space: Duplicate rows occupy unnecessary storage space, and removing them can improve database performance.
- Enhance data analysis: Duplicate rows can skew analysis results. Removing them leads to more accurate and dependable insights.
How to delete duplicate rows in SQL using sample data
Working through sample data makes it easier to see how duplicate row removal works in SQL:
Sample data
Let's consider a table named Customers with the following columns:
CustomerID | Name | Email |
1 | John Doe | john.doe@email.com |
2 | Jane Smith | jane.smith@email.com |
3 | Mike Jones | mike.jones@email.com |
4 | John Doe | john.doe@email.com (duplicate) |
In this table, John Doe appears twice with the same Name and Email (CustomerID 1 and 4). We can use an example like this to show how various SQL methods find and delete duplicate rows.
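As a minimal sketch, the sample table can be loaded into an in-memory SQLite database through Python's built-in sqlite3 module (SQLite and the column types are assumptions made for illustration; the article does not name a database). Comparing the total row count with the number of distinct Name and Email pairs shows how many extra copies exist.

```python
import sqlite3

# In-memory database holding the article's sample Customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),  # same Name and Email as CustomerID 1
    ],
)

# Total rows versus distinct (Name, Email) combinations.
total_rows = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT Name, Email FROM Customers)"
).fetchone()[0]

extra_copies = total_rows - distinct_pairs
print(total_rows, distinct_pairs, extra_copies)
```

A non-zero difference means at least one Name and Email pair occurs more than once.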
Delete duplicate rows in SQL using GROUP BY and HAVING
Using GROUP BY and HAVING clauses is a strong method to remove repeated rows in SQL. You select columns to group the data and then use the HAVING clause to filter the groups. It helps find rows with the same values in specific columns.
Here's how it works:
- GROUP BY: You choose which columns to group the data by. Rows with the same values in those columns are placed in the same group.
- HAVING Clause: This filters the groups made by GROUP BY. You can use COUNT(*) inside HAVING to find groups with more than one row (duplicates).
How to delete duplicate rows in SQL with Group By and Having
To do this, follow these steps.
Step 1. Define the duplicate criteria: Decide which columns identify a duplicate in your data. For example, in a list of customers, duplicates can be found by matching Name and Email together.
Step 2. Build the DELETE Query: This is the basic format:
DELETE FROM your_table_name
WHERE column_name_1 IN (
    SELECT column_name_1
    FROM your_table_name
    GROUP BY column_name_1, column_name_2  -- columns for duplicate check
    HAVING COUNT(*) > 1
);
Example
Consider a table named Products with columns ProductCode, ProductName, and Price. We want to delete duplicate products based on ProductCode and Price.
DELETE FROM Products
WHERE Products.ProductCode IN (
SELECT ProductCode
FROM Products
GROUP BY ProductCode, Price
HAVING COUNT(*) > 1
);
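Result: This query groups rows by ProductCode and Price. The HAVING COUNT(*) > 1 condition keeps only the groups that contain more than one row. The DELETE statement then removes every row whose ProductCode appears in one of those groups. Note that this deletes all copies, including the first occurrence, along with any row that shares the ProductCode but has a different Price. To keep one row per group, the table needs a unique key that the DELETE can exclude.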
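If the goal is to keep one row from each duplicate group, a common variant is to delete everything except the row with the smallest unique key per group. The sketch below assumes SQLite (run through Python's sqlite3 module) and uses SQLite's implicit rowid as that key; the sample product rows are invented for illustration, and in another database you would substitute the table's own primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductCode TEXT, ProductName TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [
        ("P1", "Widget", 9.99),
        ("P1", "Widget", 9.99),      # duplicate on (ProductCode, Price)
        ("P1", "Widget XL", 14.99),  # same code, different price: not a duplicate
        ("P2", "Gadget", 19.99),
    ],
)

# Keep the lowest rowid in each (ProductCode, Price) group and delete the rest.
conn.execute(
    """
    DELETE FROM Products
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM Products
        GROUP BY ProductCode, Price
    )
    """
)

remaining = conn.execute(
    "SELECT ProductCode, ProductName, Price FROM Products ORDER BY rowid"
).fetchall()
print(remaining)
```

Only the second "Widget" row is removed; the row with the same code but a different price stays in place.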
Fetching and Identifying the duplicate rows
Before removing duplicates, it's crucial to identify them accurately. Data science professionals often use SQL querying and filtering to pinpoint duplicate entries. Here are some methods to fetch and identify duplicate rows:
1. Using GROUP BY and COUNT(*)
This is a common approach that uses both grouping and aggregate functions. The idea is to group rows based on the columns that define duplicates.
Use COUNT(*) to determine the number of rows in each group. Groups with a count greater than 1 indicate duplicates.
Syntax
SELECT column_name_1, column_name_2, ..., COUNT(*) AS row_count
FROM your_table_name
GROUP BY column_name_1, column_name_2, ...;
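A small runnable sketch of this approach, assuming the sample Customers table lives in an in-memory SQLite database accessed through Python's sqlite3 module. A HAVING clause is added here so that only the groups with more than one row come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),
    ],
)

# Count the rows in each (Name, Email) group and keep only duplicated groups.
duplicate_groups = conn.execute(
    """
    SELECT Name, Email, COUNT(*) AS row_count
    FROM Customers
    GROUP BY Name, Email
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicate_groups)
```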
2. Using DISTINCT and Self-Join
This method uses DISTINCT to fetch unique combinations and a self-join to compare rows. First, use SELECT DISTINCT on the columns that define duplicates to get the unique combinations. Then join that result back to the original table on the same columns.
Syntax
SELECT t1.*
FROM (SELECT DISTINCT column_name_1, column_name_2, ... FROM your_table_name) AS unique_data
INNER JOIN your_table_name AS t1
    ON (unique_data.column_name_1 = t1.column_name_1 AND ...);
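One related way to surface only the extra copies is to join the table directly to itself and require a larger key on one side. This is a variation on the DISTINCT join above, not the same query; it assumes the sample Customers table in an in-memory SQLite database (via Python's sqlite3 module) and a unique CustomerID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),
    ],
)

# Pair each row with any earlier row that has the same Name and Email.
# Rows without an earlier match (the first occurrences) drop out of the join.
later_copies = conn.execute(
    """
    SELECT t1.CustomerID, t1.Name, t2.CustomerID AS earlier_id
    FROM Customers AS t1
    INNER JOIN Customers AS t2
        ON t1.Name = t2.Name
       AND t1.Email = t2.Email
       AND t1.CustomerID > t2.CustomerID
    ORDER BY t1.CustomerID
    """
).fetchall()
print(later_copies)
```

When a group holds three or more copies, a later row is paired with every earlier one, so the same CustomerID can appear more than once in the output.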
3. Using ROW_NUMBER()
This method assigns a row number within groups defined by duplicate criteria, allowing you to identify duplicates based on their order.
Syntax
SELECT *, ROW_NUMBER() OVER (PARTITION BY column_name_1, column_name_2, ... ORDER BY column_name_3) AS row_num
FROM your_table_name;
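To make the numbering concrete, here is a sketch against the sample Customers table, assuming an in-memory SQLite database accessed through Python's sqlite3 module (window functions need SQLite 3.25 or later). Any row with row_num greater than 1 is an extra copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),
    ],
)

# Number the rows inside each (Name, Email) group, lowest CustomerID first.
numbered = conn.execute(
    """
    SELECT CustomerID, Name,
           ROW_NUMBER() OVER (PARTITION BY Name, Email ORDER BY CustomerID) AS row_num
    FROM Customers
    ORDER BY CustomerID
    """
).fetchall()
print(numbered)
```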
How do you choose the right method?
The right method depends on your needs and table size. GROUP BY with COUNT(*) works well in most cases.
If duplicates are complicated or you need to choose which copy to keep based on order, try ROW_NUMBER(). If you want to see every copy of a duplicated row, a self-join can help.
Delete duplicate rows in SQL with an intermediate table
The intermediate table method removes duplicates by copying the distinct rows into a second table and then swapping it in for the original. For example, take a table called Customers with the columns CustomerID, Name, and Email that contains duplicate rows. Note that SELECT DISTINCT only collapses rows that match in every selected column.
Steps
- Create Intermediate Table: CREATE TABLE Customers_Temp LIKE Customers;
- Insert Distinct Rows: INSERT INTO Customers_Temp SELECT DISTINCT CustomerID, Name, Email FROM Customers;
- Drop Original Table (or rename it if you want a backup): DROP TABLE Customers;
- Rename Intermediate Table: ALTER TABLE Customers_Temp RENAME TO Customers;
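The steps above can be sketched end to end in an in-memory SQLite database through Python's sqlite3 module. Two details here are assumptions that differ from the steps as written: SQLite has no CREATE TABLE ... LIKE, so the intermediate table is declared explicitly, and because the sample duplicates differ in CustomerID, the copy uses MIN(CustomerID) with GROUP BY instead of SELECT DISTINCT over all three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),
    ],
)

# Step 1: create the intermediate table with the same columns.
conn.execute(
    "CREATE TABLE Customers_Temp (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)"
)

# Step 2: copy one row per (Name, Email) group, keeping the lowest CustomerID.
conn.execute(
    """
    INSERT INTO Customers_Temp (CustomerID, Name, Email)
    SELECT MIN(CustomerID), Name, Email
    FROM Customers
    GROUP BY Name, Email
    """
)

# Steps 3 and 4: drop the original and rename the intermediate table into place.
conn.execute("DROP TABLE Customers")
conn.execute("ALTER TABLE Customers_Temp RENAME TO Customers")

deduplicated = conn.execute(
    "SELECT CustomerID, Name, Email FROM Customers ORDER BY CustomerID"
).fetchall()
print(deduplicated)
```

On a production table, indexes, constraints, and permissions on the original would also need to be recreated on the new table.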
Deleting duplicate rows in SQL using ROW_NUMBER() function
The ROW_NUMBER() function is a handy tool for deleting duplicate rows from a database table.
It assigns a unique sequential number to each row within a partition of the result set, based on a specified ordering. It uses the following syntax:
ROW_NUMBER() OVER (PARTITION BY <column_list> ORDER BY <column_list>) AS row_num
where
- PARTITION BY <column_list>: This clause groups rows together based on the specified columns. Rows within each group will be assigned unique row numbers.
- ORDER BY <column_list>: This clause defines the order in which the rows within each partition will be numbered.
Example
Suppose you have a table named Customers with columns customer_id, name, and email. You want to delete duplicate customer entries based on name and email. Here's the query:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY customer_id) AS row_num
FROM Customers
)
DELETE FROM cte
WHERE row_num > 1;
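Result: This query first creates a CTE named cte that assigns a row number (row_num) to each row in the Customers table. Rows are partitioned by name and email and ordered by customer_id, so the row with the lowest customer_id in each group gets row_num 1. The DELETE statement then removes the rows where row_num is greater than 1, which deletes the matching rows from Customers and keeps the first occurrence. Deleting through a CTE in this way is SQL Server syntax; other databases generally need the DELETE to target the table itself.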
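For databases that do not allow deleting through a CTE, one alternative is to compute the row numbers in a subquery and delete by key. The sketch below assumes SQLite 3.25 or later, run through Python's sqlite3 module, and a unique customer_id; the subquery alias numbered is a name chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "Mike Jones", "mike.jones@email.com"),
        (4, "John Doe", "john.doe@email.com"),
    ],
)

# Number rows per (name, email) group, then delete every key numbered above 1.
conn.execute(
    """
    DELETE FROM Customers
    WHERE customer_id IN (
        SELECT customer_id
        FROM (
            SELECT customer_id,
                   ROW_NUMBER() OVER (PARTITION BY name, email
                                      ORDER BY customer_id) AS row_num
            FROM Customers
        ) AS numbered
        WHERE row_num > 1
    )
    """
)

remaining_ids = [
    row[0]
    for row in conn.execute("SELECT customer_id FROM Customers ORDER BY customer_id")
]
print(remaining_ids)
```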
Delete duplicate rows in SQL using Common Table Expressions (CTE)
Common Table Expressions (CTEs) offer a powerful way to delete duplicate rows from your database tables. Here's how you can use CTEs with the ROW_NUMBER() function for this task:
Step 1. Define the CTE
- The CTE identifies the duplicate rows. Here, you'll use the ROW_NUMBER() function to assign a sequential number to each row within its group.
- The PARTITION BY clause groups rows together based on specific columns. Numbering restarts at 1 for each group.
- The ORDER BY clause defines the order in which rows within each group are numbered.
Step 2. Filter and delete
After creating the CTE, you can use the DELETE statement to target the CTE alias.
Within the DELETE statement, you'll filter for rows where the ROW_NUMBER() (often aliased as row_num) is greater than 1. This effectively removes duplicates while keeping the first occurrence of each unique combination.
How to Delete duplicate rows in SQL using CTE
While procedures are a great way to encapsulate logic, removing duplicates with CTEs is typically done within a single SQL statement. However, here's how you could potentially create a procedure using CTEs as an example:
Step 1. Procedure creation
CREATE PROCEDURE RemoveDuplicates (
@tableName VARCHAR(50), -- Name of the table to process
@columnList VARCHAR(200) -- Comma-separated list of columns for duplicate check
)
AS
BEGIN
-- Implement the logic here
END;
Step 2. Logic within the procedure (using CTE)
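-- This logic goes inside the BEGIN ... END block from Step 1.
-- Table and column names cannot be passed as variables in T-SQL,
-- so the statement is built as a string and run with sp_executesql.
-- @columnList is concatenated as-is; validate it before use.
-- ORDER BY (SELECT NULL) numbers rows in arbitrary order; substitute
-- a real column to control which row in each group is kept.
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ' + @columnList + N'
                              ORDER BY (SELECT NULL)) AS row_num
    FROM ' + QUOTENAME(@tableName) + N'
)
DELETE FROM cte
WHERE row_num > 1;';
EXEC sp_executesql @sql;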
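The same idea of a reusable routine that takes a table name and a column list can be sketched outside the database. The helper below is hypothetical (remove_duplicates and its parameters are names invented here) and assumes SQLite through Python's sqlite3 module, using rowid as the key to keep. Because identifiers cannot be bound as query parameters, it only accepts plain names before building the statement.

```python
import re
import sqlite3

# Plain identifiers only: letters, digits, underscore, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def remove_duplicates(conn, table_name, column_list):
    """Delete all but the lowest-rowid row in each group; return rows deleted."""
    names = [table_name, *column_list]
    if not column_list or not all(_IDENTIFIER.fullmatch(name) for name in names):
        raise ValueError("table and column names must be plain identifiers")
    columns = ", ".join(column_list)
    cursor = conn.execute(
        f"DELETE FROM {table_name} "
        f"WHERE rowid NOT IN (SELECT MIN(rowid) FROM {table_name} GROUP BY {columns})"
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "john.doe@email.com"),
        (2, "Jane Smith", "jane.smith@email.com"),
        (3, "John Doe", "john.doe@email.com"),
    ],
)

deleted = remove_duplicates(conn, "Customers", ["name", "email"])
row_count = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]
print(deleted, row_count)
```

Rejecting anything that is not a plain identifier is a deliberately strict choice; it keeps untrusted input out of the generated SQL at the cost of not supporting quoted or schema-qualified names.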
Delete duplicate rows in SQL using the RANK() function
The RANK() function in SQL can be a great tool for deleting duplicate rows from a table. The function assigns a ranking number to each row within a result set, considering a specified ordering. Similar to ROW_NUMBER(), it uses the following syntax:
RANK() OVER (PARTITION BY <column_list> ORDER BY <column_list>) AS rank_num
where
- PARTITION BY <column_list>: This clause groups rows together based on the specified columns. Rows within each group will receive ranks.
- ORDER BY <column_list>: This one defines the order in which the rows within each partition will be ranked.
Steps for Deleting duplicate rows in SQL with RANK
The steps are explained here:
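Step 1. Identify duplicates: Within each PARTITION BY group, the RANK() function ranks rows by the ORDER BY columns. Rows that tie on the ORDER BY values share the same rank, so ordering by a unique column gives each duplicate its own rank.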
Step 2. Delete ranked duplicates: We can leverage a CTE to isolate the duplicates and then delete them based on the rank.
Example for RANK function
Suppose you have a table named Products with columns for product_id, name, and color. You want to remove duplicate product entries based on name and color. Here's the query:
WITH cte AS (
SELECT *, RANK() OVER (PARTITION BY name, color ORDER BY product_id) AS rank_num
FROM Products
)
DELETE FROM cte
WHERE rank_num > 1;
Result: This query first creates a CTE named cte. It assigns a rank_num to each row in the Products table. The partitioning is done by name and color, and the ordering is based on product_id. Rows with the same name and color fall into the same partition and, as long as product_id is unique, receive ranks 1, 2, 3, and so on.
Then, the DELETE statement removes rows from the CTE where rank_num is greater than 1, eliminating duplicate entries.
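The choice of ORDER BY column matters more with RANK() than with ROW_NUMBER(), because tied rows share a rank. The sketch below, which assumes SQLite 3.25 or later through Python's sqlite3 module and a few invented product rows, ranks the same data twice: once by the unique product_id and once by name, which is identical within each partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (product_id INTEGER PRIMARY KEY, name TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [
        (1, "Lamp", "red"),
        (2, "Lamp", "red"),  # duplicate of product_id 1 on (name, color)
        (3, "Desk", "oak"),
    ],
)

# rank_by_id orders by a unique column; rank_by_name orders by a column
# that is the same for every row in the partition, so all rows tie.
ranks = conn.execute(
    """
    SELECT product_id,
           RANK() OVER (PARTITION BY name, color ORDER BY product_id) AS rank_by_id,
           RANK() OVER (PARTITION BY name, color ORDER BY name) AS rank_by_name
    FROM Products
    ORDER BY product_id
    """
).fetchall()
print(ranks)
```

With rank_by_name every row has rank 1, so a filter on rank greater than 1 would delete nothing. Ordering by a unique column, or using ROW_NUMBER(), avoids that.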
Final Thoughts
Duplicate rows in your database can cause wasted space and skewed analysis. This article enables you to delete duplicate rows in SQL effectively. We explored methods like GROUP BY with HAVING for basic tasks, and advanced techniques with ROW_NUMBER() and CTEs.
Choosing the right method depends on your table size and needs. For a data-driven approach to managing your databases, consider Imarticus's Postgraduate Program in Data Science Analytics. This data science course equips you with the skills to wrangle, analyze, and visualize data, making you an expert in data management. Register instantly!