Remove duplicates from a collection but keep the duplicates in another collection

#1
Hello,
I am working on a process right now on which I have to merge some collections into a single one (with around 1000 rows ) and after that to remove duplicates based on a single column.
I already did these steps but I would also need to keep track of which rows were removed from the original collection after removing duplicates. Is there any object that can help me ?

Thank you.
 
#3
I would but just now I figured out that I was not doing it ok. Instead of removing duplicates based only on one column, my object was removing duplicates based on all columns. Is there a way to remove duplicates but based only on one column ? Without using loops because the collection is big.
 
#5
Also, must keep the entire rows after removing the duplicates.
You have few choices to do that :
1. Use Excel, paste the collection, apply COUNTIF formula (in some column) to find out duplicates and filter the duplicate values based on the result of formula.
2. Use a code stage to do the job within the collection itself.

(VB CODE)
OutputCollection : Output collection Name(Output for Code Stage)
InputCollection : Input collection Name (Input for Code Stage)
ColumnName : Column name to check for Duplicates (Input for Code Stage)

This code will return UNIQUE ROWS :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Select(Function(b) b.First()).CopyToDatatable()

This code will return FIRST AVAILABLE DUPLICATE ROW :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Where(Function(b) b.Count() > 1).Select(Function(b) b.First()).CopyToDatatable()

*This works for TEXT Datatype Column's only since Field(of string)*
You can edit that for the DataType you wish for


*Add Below Namespace's in Code Options
System.Data
System.Linq
System.Collections.Generic

____________________________________________________________________________________________________________

Hope this will set you in right direction. For more information, read about LINQ queries for DataTable in .NET
 
Last edited:
#6
You have few choices to do that :
1. Use Excel, paste the collection, apply COUNTIF formula (in some column) to find out duplicates and filter the duplicate values based on the result of formula.
2. Use a code stage to do the job within the collection itself.

(VB CODE)
OutputCollection : Output collection Name(Output for Code Stage)
InputCollection : Input collection Name (Input for Code Stage)
ColumnName : Column name to check for Duplicates (Input for Code Stage)

This code will return UNIQUE ROWS :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Select(Function(b) b.First()).CopyToDatatable()

This code will return FIRST AVAILABLE DUPLICATE ROW :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(Of String)(ColumnName).ToString).Where(Function(b) b.Count() > 1).Select(Function(b) b.First()).CopyToDatatable()

*This works for TEXT Datatype Column's only since Field(of string)*
You can edit that for the DataType you wish for


*Add Below Namespace's in Code Options
System.Data
System.Linq
System.Collections.Generic

____________________________________________________________________________________________________________

Hope this will set you in right direction. For more information, read about LINQ queries for DataTable in .NET
Thank you, Sahil. Your code works brilliant and saves a lot of time. Cheers to you. Just a small notice that the following External References are needed : System.Core.dll & System.Data.DataSetExtensions.dll

Now, I'm about to make the problem more challenging, maybe you have also an idea on this :
Let's say that for example in my collection, on column B I have 4 rows with same value.
How can I get all the duplicates row but not the the row with the first occurence of the duplicate value ( Eg. If row 1, 2, 3, 4 have the same value on column B, is it possible to extract only the rows 2,3,4 <which also have to be deleted> ?

Thanks again.
 
Top