How to Remove Duplicates and Move Data in Excel Using Python

Introduction

Cleaning up Excel files can be a hassle, especially when you have to manually find and remove duplicates. In this guide, we’ll build a simple Python tool with a graphical user interface (GUI) to make this process easy.

Using pandas, tkinter, and Excel, you’ll learn how to load an Excel file, remove duplicates, and copy the cleaned data—all with just a few clicks.

End Result

Simple GUI Excel data cleaner and copier using python

Step 1: Setting Up Your Environment

Before we start writing our program, we need to set up the development environment. Follow these steps to get everything ready:

Install Python

Ensure you have Python installed on your computer. You can download it from the official Python website.

Set Up Visual Studio Code (VS Code)

  1. Download and Install VS Code: Get Visual Studio Code from its official site.
  2. Open VS Code: Launch the application.
  3. Install Python Extension: Go to the Extensions view (Ctrl+Shift+X), search for “Python”, and install the extension provided by Microsoft.

Create a Project Folder

  1. Create a New Folder: Choose a location on your computer and create a new folder for your project, e.g., ExcelDataCleaner.
  2. Open the Folder in VS Code: Use File > Open Folder to open your new project folder in VS Code.

Install Necessary Libraries

  1. Open Terminal in VS Code: Use Terminal > New Terminal from the menu.
  2. Install Pandas: Run the following command to install the pandas library, which is necessary for handling Excel files:
pip install pandas

Install Tkinter

Tkinter is included with Python by default, but if it’s not available, you may need to install it separately. On most systems, you can install it using your package manager. For example, on Debian-based Linux distributions:

sudo apt-get install python3-tk

Step 2: Program Breakdown

Now that your environment is set up, let’s break down the code to understand how it works. We’ll go through each part step by step.

Importing Libraries

At the beginning of the script, we import the necessary libraries:

pythonCopy codeimport tkinter as tk
from tkinter import filedialog, messagebox, ttk
import pandas as pd
import shutil
  • tkinter: Used for creating the graphical user interface (GUI).
  • filedialog and messagebox: Used for file selection dialogs and showing messages.
  • ttk: Provides advanced widgets like Treeview for displaying tables.
  • pandas: Manages Excel file reading and writing.
  • shutil: Helps with copying files.

Initialize Global Variables

We define global variables to manage the Excel file and its data:

pythonCopy codedf = None
file_path = ""
  • df: Holds the Excel data.
  • file_path: Stores the path of the loaded file.

Load Excel File

The load_file function allows users to open and load an Excel file:

pythonCopy codedef load_file():
    global file_path, df
    file_path = filedialog.askopenfilename(filetypes=[("Excel files", "*.xlsx;*.xls")])
    if file_path:
        try:
            df = pd.read_excel(file_path)
            messagebox.showinfo("File Loaded", "Excel file loaded successfully!")
            preview_data()
        except Exception as e:
            messagebox.showerror("Error", f"Failed to load file: {e}")
  • filedialog.askopenfilename: Opens a file dialog to select an Excel file.
  • pd.read_excel: Reads the selected file into a DataFrame.
  • If loading is successful, it previews the data; otherwise, it shows an error.

Preview Data

The preview_data function displays the data in a new window:

pythonCopy codedef preview_data():
    if df is not None:
        preview_window = tk.Toplevel(root)
        preview_window.title("Data Preview")
        preview_window.geometry("600x400")
        
        tree = ttk.Treeview(preview_window)
        tree.pack(expand=True, fill='both')
        
        tree["columns"] = list(df.columns)
        tree["show"] = "headings"
        
        for col in df.columns:
            tree.heading(col, text=col)
            tree.column(col, anchor="w")
        
        for _, row in df.iterrows():
            tree.insert("", "end", values=list(row))
    else:
        messagebox.showwarning("No File", "Please load an Excel file first.")
  • tk.Toplevel: Creates a new window for data preview.
  • ttk.Treeview: Displays the DataFrame as a table.

Remove Duplicates

The remove_duplicates function lets users remove duplicate rows based on selected columns:

pythonCopy codedef remove_duplicates():
    global df
    if df is not None:
        cols = df.columns.tolist()
        selected_cols = []
        
        def select_columns():
            nonlocal selected_cols
            selected_cols = [cols[i] for i in col_list.curselection()]
            remove_duplicates_action(selected_cols)
            select_window.destroy()

        select_window = tk.Toplevel(root)
        select_window.title("Select Columns to Remove Duplicates")
        
        col_list = tk.Listbox(select_window, selectmode='multiple', selectbackground="#f0f0f0")
        col_list.pack(pady=10)
        for col in cols:
            col_list.insert(tk.END, col)

        tk.Button(select_window, text="Select", command=select_columns).pack(pady=5)
    else:
        messagebox.showwarning("No File", "Please load an Excel file first.")
  • A new window allows users to select columns for duplicate checking.
  • remove_duplicates_action processes the selected columns to remove duplicates.

Remove Duplicates Action

The remove_duplicates_action function performs the actual removal of duplicates:

pythonCopy codedef remove_duplicates_action(selected_cols):
    global df
    if selected_cols:
        df.drop_duplicates(subset=selected_cols, inplace=True)
        messagebox.showinfo("Duplicates Removed", f"Duplicates removed based on columns: {', '.join(selected_cols)}")
        preview_data()
  • df.drop_duplicates: Removes duplicates based on the selected columns.
  • A confirmation message is shown, and the data is re-previewed.

Save Cleaned File

The save_file function saves the cleaned data to a new file:

pythonCopy codedef save_file():
    if df is not None:
        save_path = filedialog.asksaveasfilename(defaultextension=".xlsx", filetypes=[("Excel files", "*.xlsx")])
        if save_path:
            df.to_excel(save_path, index=False)
            messagebox.showinfo("File Saved", "Cleaned data saved successfully!")
    else:
        messagebox.showwarning("No File", "Please load an Excel file first.")
  • filedialog.asksaveasfilename: Opens a dialog to select where to save the cleaned file.
  • df.to_excel: Saves the DataFrame to the chosen location.

Copy Original File

The copy_file function copies the original file to a new location:

pythonCopy codedef copy_file():
    if file_path:
        destination = filedialog.asksaveasfilename(defaultextension=".xlsx", filetypes=[("Excel files", "*.xlsx")])
        if destination:
            try:
                shutil.copy(file_path, destination)
                messagebox.showinfo("File Copied", "Excel file copied successfully!")
            except Exception as e:
                messagebox.showerror("Error", f"Failed to copy file: {e}")
    else:
        messagebox.showwarning("No File", "Please load an Excel file first.")
  • shutil.copy: Copies the original file to a new location.

Setting Up the GUI

Finally, we set up the main GUI window and add buttons for each function:

pythonCopy coderoot = tk.Tk()
root.title("Excel Data Cleaner and Copier")
root.geometry("400x300")
root.configure(bg="#f0f0f0")

frame = tk.Frame(root, bg="#f0f0f0")
frame.pack(pady=20)

load_button = tk.Button(frame, text="Load Excel File", command=load_file, width=20, bg="#4CAF50", fg="white")
remove_duplicates_button = tk.Button(frame, text="Remove Duplicates", command=remove_duplicates, width=20, bg="#2196F3", fg="white")
save_button = tk.Button(frame, text="Save Cleaned File", command=save_file, width=20, bg="#FF9800", fg="white")
copy_button = tk.Button(frame, text="Copy Excel File", command=copy_file, width=20, bg="#f44336", fg="white")

load_button.pack(pady=5)
remove_duplicates_button.pack(pady=5)
save_button.pack(pady=5)
copy_button.pack(pady=5)

root.mainloop()
  • tk.Tk(): Creates the main window.
  • Buttons are added to perform different tasks, styled for better usability.

Step 3: How to run the script

Save Your File: Make sure your script is saved (Ctrl+S).

Run the Script: Open the terminal in VS Code again (if it’s not already open) and run the script by typing:

python excel_data_cleaner.py

Interact with the Application: A window will appear for your Excel Data Cleaner and Copier. You can now use the buttons to:

  • Load an Excel File: Click the “Load Excel File” button to select and load an Excel file.
Sample excel data that contains dublicate
  • Remove Duplicates: Click “Remove Duplicates” to remove any duplicate rows from your data.
  • Save Cleaned File: Save the cleaned Excel file by clicking “Save Cleaned File.”
  • Copy Excel File: Copy the original file to a new location by clicking “Copy Excel File.”

Troubleshooting

If you run into any issues, here are a few tips:

  • Command Not Found: If you get an error saying python is not recognized, try using python3 instead, or check if Python was added to your system’s PATH.
  • Module Not Found: If you see ModuleNotFoundError: No module named 'pandas', ensure you installed the required library with pip install pandas.
  • Script Won’t Run: Make sure your file is saved and that you are in the correct directory in the terminal (the ExcelDataCleaner folder).

Step 4: Get access to the script

To make it even easier for you to get started, we’ve uploaded the Excel Data Cleaner and Copier script to our GitHub account. Here’s how you can find and use it:

Visit Our GitHub Repository

  1. Go to GitHub: Open your web browser and visit our GitHub repository at https://github.com/automatecoders/excel-data-cleaner.
  2. Explore the Repository: You’ll see a list of files and folders in the repository. Look for the file named excel_data_cleaner.py.

Download the Script

  1. Open the Script: Click on excel_data_cleaner.py to open the file in GitHub.
  2. Download the Script: Click on the “Raw” button to view the raw content of the script. Then, right-click anywhere on the page and select “Save as…” to save the file to your computer.

Conclusion

You have now learned how to set up and run a Python script to automate Excel data management. This tool allows you to load, preview, clean, and save your Excel files more efficiently, streamlining your workflow.

Feel free to explore the script further on our GitHub repository, make modifications to suit your needs, and continue learning as you build your skills in Python and data automation.

Thank you for following along, and best of luck with your future projects.

Leave a Reply

Your email address will not be published. Required fields are marked *