Remove Duplicates: Revisited

Started by stanl, March 12, 2023, 07:28:11 AM


stanl

It should have been obvious, but the text files I alluded to in the previous post had headers, as they were set up for ETL based on a delimiter [in this case ^].


So I wrote code to sort and remove duplicates, but that left the header row as the last row of the de-duped file. I then added code to preserve the header row and append the unique, de-duped 'body' to it, which put the header at both the beginning and the end of the output file. So I added extra code to remove the last two lines [assuming a CRLF at the end] of the output file. All worked well. But, of course, it turned out the process would only work for that specific file format.


So I went back to the original idea of a map with fileread => filewrite, and that worked for several files with headers that I tested. But, as Tony alluded to, it can take time depending on the file size and the number of duplicates.
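
Roughly, the front end of that pass looks like this (paths are made up here); the point is just that the header row gets copied straight through before the de-dupe loop starts:

inFile  = "C:\Temp\input.txt"            ; placeholder paths
outFile = "C:\Temp\deduped.txt"

hIn  = FileOpen(inFile, "READ")
hOut = FileOpen(outFile, "WRITE")
FileWrite(hOut, FileRead(hIn))           ; first line is the header - copy it through as-is
; ... then FileRead(hIn) in a loop until "*EOF*", writing only lines the map has not seen ...
FileClose(hIn)
FileClose(hOut)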


Maybe a stupid ask, but is there a way WB could import a text file [with headers, but irrespective of the delimiter - i.e. each record treated as a single line rather than parsed into columns] into an array of UNIQUE values, then export that array to a de-duped output file with the header intact?

And would that make any difference in time compared to using a map procedure [as suggested by Tony and others]?




td

ArrayFileGet loads a file into a rank-one array. However, I am not sure it will make your task any faster, and you would likely be limited to files of about 300MB.
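
For reference, the load itself is only a line or two; roughly (the file name is a placeholder):

arrLines = ArrayFileGet("C:\Temp\input.txt")   ; rank one - one element per line of the file
nLines   = ArrInfo(arrLines, 1)                ; number of elements in dimension one
; arrLines[0] would hold the header row, but every duplicate line is still
; in the array, so you would have to walk it and weed them out yourself.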

The map example can be modified to not hash each record and instead just place each record directly into the map. This would cut the processing time almost in half. The problem is that you could only process files up to about 300MB. After that, you would be in danger of running out of string space.
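
A rough sketch of that variant, with each record used directly as its own map key and the header row copied through first (file names are placeholders and error checking is omitted):

inFile  = "C:\Temp\input.txt"
outFile = "C:\Temp\deduped.txt"

hIn  = FileOpen(inFile, "READ")
hOut = FileOpen(outFile, "WRITE")
FileWrite(hOut, FileRead(hIn))         ; pass the header row straight through

seen = MapCreate()                     ; each unique record becomes a key
line = FileRead(hIn)
While line != "*EOF*"
   If ! MapKeyExist(seen, line)
      seen[line] = 1                   ; remember the record
      FileWrite(hOut, line)            ; and write it out exactly once
   EndIf
   line = FileRead(hIn)
EndWhile

FileClose(hIn)
FileClose(hOut)
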
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Thanks. I doubt even a 200MB limit will be hit for this particular de-duping, and for obvious reasons running in WB64 is the ticket.

JTaylor

Not sure if this is of use or interest, but I was sitting around waiting for a job to start and added an snFileDedupe(ifile, ofile) function to my extender. I think it is working right. I tested it on a 278MB file with 6.7 million lines and it took about 2 seconds. I had a lot of duplicates, though, so it will probably take longer if that is not the case.

http://www.jtdata.com/anonymous/wbomnibus.zip
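
Usage is just the input and output file names; something like the following, with AddExtender pointed at whichever DLL is inside the zip (the DLL name below is only a placeholder):

AddExtender("wbomnibus.dll")           ; placeholder name - use the actual DLL from the zip
snFileDedupe("C:\Temp\input.txt", "C:\Temp\deduped.txt")   ; ifile, ofile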


Jim

td

Removing duplicates from files has been on the to-do list for quite some time. In part due to recent forum posts on the subject, we were planning to add a couple of new functions to that end in an upcoming WinBatch release.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor

Cool. Wish I knew which functions people were using in my Extender so I could remove them as you add things :)

In case it is helpful, I also added a snFileFilter() function, which allows you to remove lines based on a text string. I am working on a snFileSplit() function, but it isn't ready yet.

Jim