Find Duplicate Files by Computing a Hash in C#

Here we will find duplicate files by their content, by computing a hash of each file. Even if you copy a file several times and rename the copies, the scan still detects them as duplicates, because the hash depends only on the content and not on the file name.

I will use .NET 5 and C# 9 here.


Let me show you how.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

namespace DupTest
{
    class Program
    {
        static void Main()
        {
            //Your path here
            string dir = @"C:\Users\priti\Documents\Microsoft.WindowsSoundRecorder_8wekyb3d8bbwe!App\Sound recordings";
            var fileinfos = Directory.GetFiles(dir)
                .Select(x => new { FileName = x, Hash = GetStringFromBytes(GetHashValue(x)) });

            var group_infos = fileinfos.GroupBy(x => x.Hash).ToList();

            foreach (var g in group_infos)
            {
                Print(g.Key);
                Print("Duplicate count : "+g.Count());
                foreach (var info in g)
                {
                    Print("\t"+info.FileName);
                }
            }
        }
        private static byte[] GetHashValue(string filename)
        {
            // Dispose both the hash algorithm and the file stream when done
            using (SHA256 sha256 = SHA256.Create())
            using (var fileStream = File.OpenRead(filename))
            {
                return sha256.ComputeHash(fileStream);
            }
        }
        private static string GetStringFromBytes(byte[] bytes)
        {
            StringBuilder builder = new StringBuilder();
            foreach (var b in bytes)
            {
                builder.Append(b.ToString("X2"));
            }
            return builder.ToString();
        }
        private static void Print(object obj)
        {
            Console.WriteLine(obj.ToString());
        }
    }
}


Let me explain what's going on here.
I defined three helper methods. The first is GetHashValue(). It opens the file and computes its hash value with the help of the SHA256 class from System.Security.Cryptography.

The second is GetStringFromBytes(), which converts the hash bytes into a hexadecimal string so that hashes can be compared and printed.

The third is Print(), which does the same thing as Console.WriteLine, a name I am not a big fan of. Print is small and sweet, it works just fine, and the code looks clean. #opinion
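To make the hashing step concrete, here is a minimal sketch that hashes a known input instead of a file. The class name HashSketch and the method HashToHex are made up for this illustration. It also assumes .NET 5 or later, where Convert.ToHexString can stand in for the hand-written GetStringFromBytes loop.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper class, not part of the program above.
public static class HashSketch
{
    // Computes the SHA-256 digest of a stream and returns it as uppercase hex.
    public static string HashToHex(Stream content)
    {
        using var sha256 = SHA256.Create();
        return Convert.ToHexString(sha256.ComputeHash(content));
    }

    public static void Main()
    {
        // "abc" is a standard SHA-256 test input.
        using var stream = new MemoryStream(Encoding.ASCII.GetBytes("abc"));
        Console.WriteLine(HashToHex(stream));
        // prints "BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD"
    }
}
```

The same input always gives the same 64-character string, which is what lets us use the hash as a grouping key.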


We get the list of files from the Directory.GetFiles method, compute the hash for each file, and generate an IEnumerable of an anonymous type with two properties: FileName (the path of the file) and Hash (the hash value as a hex string). LINQ makes this code much easier to write.

Again using LINQ, we group the collection by hash value and print how many files share each hash. Any group with more than one file is a set of duplicates.
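The program above prints every group, including files that have no duplicate. If you only care about the real duplicates, one possible variation is to keep just the groups with more than one file. This is a sketch, not the code from the listing: the class DuplicateFilter, the method DuplicatesOnly, the file names and the shortened hashes are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, not part of the program above.
public static class DuplicateFilter
{
    // Groups (file name, hash) pairs by hash and keeps only hashes
    // that are shared by two or more files.
    public static Dictionary<string, List<string>> DuplicatesOnly(
        IEnumerable<(string FileName, string Hash)> files)
    {
        return files
            .GroupBy(f => f.Hash)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.Select(f => f.FileName).ToList());
    }

    public static void Main()
    {
        // Placeholder data: real hashes would be 64 hex characters long.
        var files = new[]
        {
            (FileName: "a.wav", Hash: "AA11"),
            (FileName: "a - Copy.wav", Hash: "AA11"),
            (FileName: "b.wav", Hash: "BB22"),
        };

        foreach (var pair in DuplicatesOnly(files))
        {
            Console.WriteLine(pair.Key + " : " + string.Join(", ", pair.Value));
        }
        // prints "AA11 : a.wav, a - Copy.wav"
    }
}
```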

If you also want to scan subfolders, you have to use a recursive function or pass SearchOption.AllDirectories to Directory.GetFiles, but for now this gets the job done.
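For subfolders, a minimal sketch could look like the one below. It relies on the built-in SearchOption.AllDirectories option rather than a hand-written recursive function. The class RecursiveScan and the method AllFiles are names invented for this example, and the folder to scan is taken from the first command-line argument as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class, not part of the program above.
public static class RecursiveScan
{
    // Returns the paths of all files under root, including files in subfolders.
    public static List<string> AllFiles(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Scan the folder given on the command line, or the current folder.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
        foreach (var file in AllFiles(root))
        {
            Console.WriteLine(file);
        }
    }
}
```

Keep in mind that this call throws an UnauthorizedAccessException if it reaches a folder you are not allowed to read, so a real scanner would probably need some error handling around it.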

To test it, I took a folder with a few files and copied some of them within the same directory. Even if you copy a file and rename the copy, the program still detects it as a duplicate.
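If you would rather not copy files by hand, the same kind of check can be sketched in code. This is only an illustration under assumptions: the class CopyTest, the method HashFile, the temporary folder and the file names are all made up, and the folder is deleted again at the end.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical test helper, not part of the program above.
public static class CopyTest
{
    // Hashes a file's content with SHA-256 and returns the digest as uppercase hex.
    public static string HashFile(string path)
    {
        using var sha256 = SHA256.Create();
        using var stream = File.OpenRead(path);
        return Convert.ToHexString(sha256.ComputeHash(stream));
    }

    public static void Main()
    {
        // Scratch folder under the system temp directory, used only for this demo.
        string dir = Path.Combine(Path.GetTempPath(), "dup-demo-" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(dir);
        try
        {
            string original = Path.Combine(dir, "original.txt");
            string renamedCopy = Path.Combine(dir, "renamed-copy.txt");
            string different = Path.Combine(dir, "different.txt");

            File.WriteAllText(original, "same content");
            File.Copy(original, renamedCopy);
            File.WriteAllText(different, "other content");

            // A renamed copy has the same hash as the original.
            Console.WriteLine(HashFile(original) == HashFile(renamedCopy)); // True
            // A file with different content has a different hash.
            Console.WriteLine(HashFile(original) == HashFile(different)); // False
        }
        finally
        {
            Directory.Delete(dir, recursive: true);
        }
    }
}
```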

Let's see the test results




Leave a comment and let me know what you think about this.

Thanks for reading. 
Happy Coding



 
