Solved Remove duplicates from a collection but keep the duplicates in another collection

stefanamihaesei

New Member
Hello,
I am working on a process right now on which I have to merge some collections into a single one (with around 1000 rows ) and after that to remove duplicates based on a single column.
I already did these steps but I would also need to keep track of which rows were removed from the original collection after removing duplicates. Is there any object that can help me ?

Thank you.
 

Pete_L

Active Member
Please show the code for your duplicates removal, including the configuration of the code stage. If we can see how you removed the duplicates, we may be able to help save the removed items.
 

stefanamihaesei

New Member
I would but just now I figured out that I was not doing it ok. Instead of removing duplicates based only on one column, my object was removing duplicates based on all columns. Is there a way to remove duplicates but based only on one column ? Without using loops because the collection is big.
 

stefanamihaesei

New Member
I would but just now I figured out that I was not doing it ok. Instead of removing duplicates based only on one column, my object was removing duplicates based on all columns. Is there a way to remove duplicates but based only on one column ? Without using loops because the collection is big.
Also, must keep the entire rows after removing the duplicates.
 

sahil_raina_91

Active Member
Also, must keep the entire rows after removing the duplicates.

You have few choices to do that :
1. Use Excel, paste the collection, apply COUNTIF formula (in some column) to find out duplicates and filter the duplicate values based on the result of formula.
2. Use a code stage to do the job within the collection itself.

(VB CODE)
OutputCollection : Output collection Name(Output for Code Stage)
InputCollection : Input collection Name (Input for Code Stage)
ColumnName : Column name to check for Duplicates (Input for Code Stage)

This code will return UNIQUE ROWS :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Select(Function(b) b.First()).CopyToDatatable()

This code will return FIRST AVAILABLE DUPLICATE ROW :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Where(Function(b) b.Count() > 1).Select(Function(b) b.First()).CopyToDatatable()

*This works for TEXT Datatype Column's only since Field(of string)*
You can edit that for the DataType you wish for


*Add Below Namespace's in Code Options
System.Data
System.Linq
System.Collections.Generic

____________________________________________________________________________________________________________

Hope this will set you in right direction. For more information, read about LINQ queries for DataTable in .NET
 
Last edited:

stefanamihaesei

New Member
You have few choices to do that :
1. Use Excel, paste the collection, apply COUNTIF formula (in some column) to find out duplicates and filter the duplicate values based on the result of formula.
2. Use a code stage to do the job within the collection itself.

(VB CODE)
OutputCollection : Output collection Name(Output for Code Stage)
InputCollection : Input collection Name (Input for Code Stage)
ColumnName : Column name to check for Duplicates (Input for Code Stage)

This code will return UNIQUE ROWS :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Select(Function(b) b.First()).CopyToDatatable()

This code will return FIRST AVAILABLE DUPLICATE ROW :
OutputCollection=InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(Of String)(ColumnName).ToString).Where(Function(b) b.Count() > 1).Select(Function(b) b.First()).CopyToDatatable()

*This works for TEXT Datatype Column's only since Field(of string)*
You can edit that for the DataType you wish for


*Add Below Namespace's in Code Options
System.Data
System.Linq
System.Collections.Generic

____________________________________________________________________________________________________________

Hope this will set you in right direction. For more information, read about LINQ queries for DataTable in .NET
Thank you, Sahil. Your code works brilliant and saves a lot of time. Cheers to you. Just a small notice that the following External References are needed : System.Core.dll & System.Data.DataSetExtensions.dll

Now, I'm about to make the problem more challenging, maybe you have also an idea on this :
Let's say that for example in my collection, on column B I have 4 rows with same value.
How can I get all the duplicates row but not the the row with the first occurence of the duplicate value ( Eg. If row 1, 2, 3, 4 have the same value on column B, is it possible to extract only the rows 2,3,4 <which also have to be deleted> ?

Thanks again.
 

stefanamihaesei

New Member
Thank you, Sahil. Your code works brilliant and saves a lot of time. Cheers to you. Just a small notice that the following External References are needed : System.Core.dll & System.Data.DataSetExtensions.dll

Now, I'm about to make the problem more challenging, maybe you have also an idea on this :
Let's say that for example in my collection, on column B I have 4 rows with same value.
How can I get all the duplicates row but not the the row with the first occurence of the duplicate value ( Eg. If row 1, 2, 3, 4 have the same value on column B, is it possible to extract only the rows 2,3,4 <which also have to be deleted> ?

Thanks again.
Any idea ? I tried but can't reach to the solution I want.
 

sahil_raina_91

Active Member
Any idea ? I tried but can't reach to the solution I want.

This code will return ALL AVAILABLE DUPLICATE ROWS :
OutputCollection = InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Where(Function(b) b.Count() > 1).SelectMany(Function(b) b).CopyToDatatable()

Create a Data Item Previous with Initial Value = ""/Blank
Loop OutputCollection
. Compare Previous Value with Current Row's Value
. If Values DONT MATCH
. . . . . Store Current Value in Previous & Delete The Row
. If Value MATCHES
. . . . . Do Nothing
Loop End OutputCollection

*This will ensure 1st Row of Each Group is Deleted*
 
Last edited:

stefanamihaesei

New Member
This code will return ALL AVAILABLE DUPLICATE ROWS :
OutputCollection = InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Where(Function(b) b.Count() > 1).SelectMany(Function(b) b).CopyToDatatable()

Create a Data Item Previous with Initial Value = ""/Blank
Loop OutputCollection
. Compare Previous Value with Current Row's Value
. If Values DONT MATCH
. . . . . Store Current Value in Previous & Delete The Row
. If Value MATCHES
. . . . . Do Nothing
Loop End OutputCollection

*This will ensure 1st Row of Each Group is Deleted*
I tried to run the code, but it's returning all the rows from the original collection. I got the idea of using the loop, looks like working but the code don't seems to. Will check more and see what has to be changed or what I am doing wrong.

Thanks a lot.
 

stefanamihaesei

New Member
I tried to run the code, but it's returning all the rows from the original collection. I got the idea of using the loop, looks like working but the code don't seems to. Will check more and see what has to be changed or what I am doing wrong.

Thanks a lot.
My bad, code works just fine. Thanks a lot !
 

stefanamihaesei

New Member
This code will return ALL AVAILABLE DUPLICATE ROWS :
OutputCollection = InputCollection.AsEnumerable().GroupBy(Function(a) a.Field(of string)(ColumnName).ToString).Where(Function(b) b.Count() > 1).SelectMany(Function(b) b).CopyToDatatable()

Create a Data Item Previous with Initial Value = ""/Blank
Loop OutputCollection
. Compare Previous Value with Current Row's Value
. If Values DONT MATCH
. . . . . Store Current Value in Previous & Delete The Row
. If Value MATCHES
. . . . . Do Nothing
Loop End OutputCollection

*This will ensure 1st Row of Each Group is Deleted*
Do you have an idea if the loop can be integrated in the Code Stage ?
 

sahil_raina_91

Active Member
Do you have an idea if the loop can be integrated in the Code Stage ?

While I advice against this since it has little to no reusability, if it's saving significant time, then go ahead.
CAUTION : This code works for Duplicate and Sorted Rows ONLY (it will Remove any unique row if it exists)

Code :
Dim Previous As String = ""
For i As Integer = Input_Collection.Rows.Count -1 to 0 Step -1
Dim temp As String = Input_Collection.Rows(i)(ColumnName).ToString()
If temp <> Previous
Previous = temp
Input_Collection.Rows.Remove(Input_Collection.Rows(i))
End If
Next
Output_Collection = Input_Collection
 
Last edited:

stefanamihaesei

New Member
While I advice against this since it has little to no reusability, if it's saving significant time, then go ahead.
CAUTION : This code works for Duplicate and Sorted Rows ONLY (it will Remove any unique row if it exists)

Code :
Dim Previous As String = ""
For i As Integer = Input_Collection.Rows.Count -1 to 0 Step -1
Dim temp As String = Input_Collection.Rows(i)(ColumnName).ToString()
If temp <> Previous
Previous = temp
Input_Collection.Rows.Remove(Input_Collection.Rows(i))
End If
Next
Output_Collection = Input_Collection
Thank again for yout help, the code is working but it deletes the last row from every group of duplicate rows, instead of the first row. Tried to change the way of iteration but that just makes it messy. I guess that when you say that the collection has to be sorted, you mean to change the position of the rows wanted to be deleted so that the code stage can delete the last row of every group.
 

sahil_raina_91

Active Member
Thank again for yout help, the code is working but it deletes the last row from every group of duplicate rows, instead of the first row. Tried to change the way of iteration but that just makes it messy. I guess that when you say that the collection has to be sorted, you mean to change the position of the rows wanted to be deleted so that the code stage can delete the last row of every group.

My bad. I forgot you were looking to delete the first duplicate row and NOT just any duplicate row

Code:
Dim Previous As String = ""
For i As Integer = Input_Collection.Rows.Count -1 to 0 Step -1
Dim temp As String = Input_Collection.Rows(i)(ColumnName).ToString()
If temp <> Previous AND Previous <> ""
Input_Collection.Rows.Remove(Input_Collection.Rows(i+1))
End If
Previous = temp
Next
Input_Collection.Rows.Remove(Input_Collection.Rows(0))
Output_Collection = Input_Collection
 

ramithun

New Member
i am having doubt on this

All List Items:

123
B45
789
123
A45
456


1) Create collection with above list items
2) Filter only numeric values
3) Remove duplicate
4) Sort descending
5) Add items in to work queue
6) Defer the item for next day process
 
Top