Find Duplicate files by computing Hash in C#

September 10, 2021

Here we will find duplicate files by their content, by computing a hash of each file. Even if you copy a file several times and change the file name, the scan will still detect that it is a duplicate.

I will use .NET 5 and C# 9 here.

Let me show you how.

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

namespace DupTest
{
    class Program
    {
        static void Main()
        {
            // Your path here
            string dir = @"C:\Users\priti\Documents\Microsoft.WindowsSoundRecorder_8wekyb3d8bbwe!App\Sound recordings";

            // Pair every file in the folder with the hex string of its SHA-256 hash
            var fileinfos = Directory.GetFiles(dir)
                .Select(x => new { FileName = x, Hash = GetStringFromBytes(GetHashValue(x)) });

            // Files with identical content share a hash; a group of one is not a duplicate
            var group_infos = fileinfos.GroupBy(x => x.Hash).Where(g => g.Count() > 1).ToList();

            foreach (var g in group_infos)
            {
                Print(g.Key);
                Print("Duplicate count : " + g.Count());
                foreach (var info in g)
                {
                    Print("\t" + info.FileName);
                }
            }
        }
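        // Computes the SHA-256 hash of the file's content
        private static byte[] GetHashValue(string filename)
        {
            // SHA256 is IDisposable, so release it together with the stream
            using (SHA256 sha256 = SHA256.Create())
            using (var fileStream = File.OpenRead(filename))
            {
                return sha256.ComputeHash(fileStream);
            }
        }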
        private static string GetStringFromBytes(byte[] bytes)
        {
            StringBuilder builder = new StringBuilder();
            foreach (var b in bytes)
            {
                builder.Append(b.ToString("X2"));
            }
            return builder.ToString();
        }
        private static void Print(object obj)
        {
            // Console.WriteLine accepts any object (including null), so no explicit ToString() is needed
            Console.WriteLine(obj);
        }
    }
}

Let me explain what's going on here.

I defined three helper methods. The first is GetHashValue(). This method opens the file and computes its hash value with the help of the SHA256 class from System.Security.Cryptography.

The second is GetStringFromBytes(), which turns the hash bytes into a hexadecimal string so it can be printed and used as a grouping key.

The third is Print(), which does the same thing as Console.WriteLine, which I am not a big fan of. Print is small, sweet, and works just fine, and the code looks clean. #opinion.

We get the list of files from the Directory.GetFiles method, compute the hash of each file, and generate an IEnumerable of an anonymous type with two properties: FileName (the path of the file) and Hash (the hash value computed by GetHashValue, converted to a hex string). LINQ makes our code easier here.
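As a side note, the hash-to-string step can be written more compactly on .NET 5. The sketch below is only an illustration: the names HashSketch and GetFileHashHex are made up for this example, and it assumes Convert.ToHexString (added in .NET 5) is an acceptable stand-in for the StringBuilder loop in GetStringFromBytes.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the original listing.
static class HashSketch
{
    // Hashes a file's content and returns the hash as an uppercase hex string.
    public static string GetFileHashHex(string path)
    {
        using SHA256 sha256 = SHA256.Create();
        using FileStream stream = File.OpenRead(path);

        // Convert.ToHexString produces uppercase hex, the same text you get
        // by appending b.ToString("X2") for every byte.
        return Convert.ToHexString(sha256.ComputeHash(stream));
    }
}
```

With this in place, the Select in Main could call HashSketch.GetFileHashHex(x) instead of chaining the two helpers. Whether that reads better is a matter of taste.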

Again using LINQ, we group the collection by hash value and count how many files share each hash, which tells us how many copies of each file are in the directory.
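To make the grouping step easier to see in isolation, here is a small sketch that works on plain (FileName, Hash) pairs instead of real files. GroupingSketch and FindDuplicates are names invented for this example, and returning a dictionary is just one possible shape for the result.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original listing.
static class GroupingSketch
{
    // Returns only the hashes shared by two or more files,
    // each mapped to the list of file names that have that hash.
    public static Dictionary<string, List<string>> FindDuplicates(
        IEnumerable<(string FileName, string Hash)> files)
    {
        return files
            .GroupBy(f => f.Hash, f => f.FileName)
            .Where(g => g.Count() > 1) // a group of one is a unique file
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

For example, passing ("a.txt", "AA"), ("b.txt", "BB") and ("copy.txt", "AA") returns a single entry with key "AA" whose list holds a.txt and copy.txt.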

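If you want to scan subfolders as well, you have to recurse into them, either with a recursive function or with the Directory.GetFiles overload that takes SearchOption.AllDirectories. For now, this gets the job done.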
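If you do want subfolders, one way to sketch it without writing the recursion by hand is to let the framework walk the tree. SubfolderSketch and EnumerateAllFiles are hypothetical names for this example, and the "*" pattern (match every file) is an assumption about what you want to scan.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original listing.
static class SubfolderSketch
{
    // Lazily yields every file under root, including files in nested folders.
    public static IEnumerable<string> EnumerateAllFiles(string root)
    {
        // SearchOption.AllDirectories makes the framework descend into subfolders.
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories);
    }
}
```

In Main, Directory.GetFiles(dir) could then be swapped for SubfolderSketch.EnumerateAllFiles(dir). Keep in mind that a folder you are not allowed to read may make the enumeration throw, so a real scan would probably want some error handling around it.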

For testing, I had a few files in the folder and copied them within the same directory to check whether it works. If you make a copy of a file and rename it, it is still detected as a duplicate.
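The same check can be sketched in code. This is not the test I ran by hand, just a rough automated version of it: CopyTestSketch, CopyHasSameHash, and the file names are all made up for the example, and it works in a throwaway folder under the temp directory.

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of the original listing.
static class CopyTestSketch
{
    // Creates a file, copies it under a different name, and reports
    // whether both files produce the same SHA-256 hash.
    public static bool CopyHasSameHash()
    {
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            string original = Path.Combine(dir, "original.txt");
            string renamedCopy = Path.Combine(dir, "renamed-copy.txt");

            File.WriteAllText(original, "same content");
            File.Copy(original, renamedCopy);

            // Only the content is hashed, so the different name does not matter.
            return Hash(original).SequenceEqual(Hash(renamedCopy));
        }
        finally
        {
            // Clean up the throwaway folder even if something above fails.
            Directory.Delete(dir, recursive: true);
        }
    }

    private static byte[] Hash(string path)
    {
        using SHA256 sha256 = SHA256.Create();
        using FileStream stream = File.OpenRead(path);
        return sha256.ComputeHash(stream);
    }
}
```

Calling CopyTestSketch.CopyHasSameHash() should return true, because the hash depends only on the bytes in the file and not on its name.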

Let's see the test results

Leave a comment and let me know what you think.

Thanks for reading.

Happy Coding

PREETish Code Blogs