Using WB Map to remove duplicates

Started by stanl, February 12, 2023, 12:12:26 PM


stanl

Contacted by an old client I wrote a compiled WB script for. It's basically middleware: it downloads data from Server 1 and sends it to Server 2. The data now comes as ^-delimited text [with headers], but file sizes are 200 MB or more. The receiving end now sees duplicate rows in the text file, which messes up their ETL. I was asked if I could come up with a UDF to eliminate the duplicates.


I know that de-duping is not rocket science. I had set this up earlier to use an ADODB Recordset based on a composite key. But now the data can change in any column [the text file has 154 columns, several of which are Notes]. The source claims duplicates are possible.


I started thinking about WB Maps [to act as a hash table], but noticed there is no MapAdd() method and only a .csv output. But:

       
  • just code: output = MapCreate() to define a blank map
  • read through the file, assigning each row to a variable [or is there a faster way?]
  • using the variable as a key, see if it exists in output [the map]; if not, assign it [that would be the MapAdd()]
  • set the value to a constant, like 'null', since only the key is important
  • output the keys only [with no duplicates] to the same input file name, thereby eliminating the duplicates
In theory this should work... but would it be efficient and workable? A rough sketch of the idea follows.
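
Something like the following is what I have in mind, just a rough, untested sketch (file names are placeholders). It writes each unique line out as it is first seen rather than dumping the map keys at the end:

Code (winbatch) Select
; Rough sketch of the map-as-hash-table idea. File names are placeholders.
; The map value is a throwaway constant; only the key (the whole line) matters.
hIn  = FileOpen("input.txt", "Read")
hOut = FileOpen("output_nodups.txt", "Write")
seen = MapCreate()
line = FileRead(hIn)
While line != "*EOF*"
   If !MapKeyExist(seen, line)   ; first time this exact row appears
      seen[line] = "null"        ; the missing MapAdd() is just an assignment
      FileWrite(hOut, line)      ; keep the first occurrence, in original order
   EndIf
   line = FileRead(hIn)
EndWhile
FileClose(hIn)
FileClose(hOut)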

JTaylor

Is the entire row the key now?   That is, the duplication check has to take every field into account?

Jim

stanl

Quote from: JTaylor on February 12, 2023, 12:26:51 PM
Is the entire row the key now?   That is, the duplication check has to take every field into account?

Jim


Yes. So basically treat each line as a text string. I was told a file could have ~800k rows with about 2k duplicates.

bottomleypotts

I wouldn't complicate it with a map. Put it into an array. Sort. Iterate for duplicates.
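
Roughly this, as an untested sketch. It assumes the whole file fits in memory, that re-ordering the rows is acceptable, and the file names are placeholders:

Code (winbatch) Select
; Pass 1: count the lines so the array can be dimensioned.
hIn = FileOpen("input.txt", "Read")
cnt = 0
While FileRead(hIn) != "*EOF*"
   cnt = cnt + 1
EndWhile
FileClose(hIn)

; Pass 2: load the lines, sort them, then write a line only when it
; differs from the one before it.
a = ArrDimension(cnt)
hIn = FileOpen("input.txt", "Read")
For i = 0 To cnt - 1
   a[i] = FileRead(hIn)
Next
FileClose(hIn)

ArraySort(a, @ASCENDING)

hOut = FileOpen("output_nodups.txt", "Write")
prev = ""
For i = 0 To cnt - 1
   If i == 0 || a[i] != prev
      FileWrite(hOut, a[i])
   EndIf
   prev = a[i]
Next
FileClose(hOut)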

JTaylor

Would it work to have a one-field table and make that your unique index?   I know databases have a limit on index size, so I'm not certain in this situation.

Maybe split it up into 3-4 fields, if needed?   My Omnibus extender might handle that using the snStrCutFile() function.

Jim

stanl

Quote from: bottomleypotts on February 13, 2023, 03:58:19 AM
I wouldn't complicate it with map. Put into array. Sort. Iterate for duplicates.


Makes sense, but a map is basically an array/hash table. Building the map or the array might take about the same time, but the map saves the array => sort => iterate-for-duplicates steps. [I may be wrong.]

stanl

Quote from: JTaylor on February 13, 2023, 06:08:48 AM
Would it work to have a one field table and make that your unique index?   


I have a possible ADO solution, but bringing in a table isn't that efficient for this file size and number of rows. I asked about MariaDB a couple of weeks ago - I think the raw data comes from there.


NOTE: the person I'm trying to help is a friend, and I compiled several WB scripts for him to get his consulting business started. A lot went south and he is in a new venture. He offers to pay, but I have refused... thinking I could come up with an easy solution :-\


JTaylor

If you have a sample file that isn't a security/confidentiality issue, send me a link.  I am always looking for ways to test my extender.

Is automating Excel an option?

Jim

JTaylor

Was wondering if a DOS sort and then using WB to dedupe would be feasible?   Not sure how long reading through the file would take with WB.
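
Something along these lines, maybe. Untested; it leans on the stock Windows sort.exe, assumes re-ordering is acceptable, and the file names are placeholders:

Code (winbatch) Select
; Let the command-line sort do the heavy lifting, then one WB pass
; drops the now-adjacent duplicates.
RunWait("sort.exe", '"input.txt" /O "sorted.txt"')

hIn  = FileOpen("sorted.txt", "Read")
hOut = FileOpen("output_nodups.txt", "Write")
prev = ""
first = 1
line = FileRead(hIn)
While line != "*EOF*"
   If first || line != prev
      FileWrite(hOut, line)
   EndIf
   prev = line
   first = 0
   line = FileRead(hIn)
EndWhile
FileClose(hIn)
FileClose(hOut)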

Just a thought...

Jim

ChuckC

Is sorting of the data from the input file permissible to begin with?

Or, to ask it a different way, outside of removing duplicate lines that may not be adjacent to each other, does the order of the lines in the file need to remain the same?  And, if so, is it a requirement that the first occurrence of any given line is the instance to be kept while subsequent duplicates are the ones to be removed?


Input List:

D
A
C
A
B
F
C
B


Sorted, duplicates removed:

A
B
C
D
F


Original order preserved, duplicates removed:

D
A
C
B
F


bottomleypotts

Okay, I take it back. Having done some testing with a 200 MB file with random string lengths (between 25 and 125 characters) and running into string memory issues, the method I proposed will not work.

But you've also added a requirement to the task, which is to preserve the order. So after doing some testing, I changed my process to read each line of the file, check if it is already in my array, and store it if it isn't! I'm sure that I could run out of string memory, but not likely with a 200 MB input file. Anything bigger and I'd move the task to a database.

Edit: even this doesn't feel right, looking at binary solutions now.

Edit 2: Wow, I cannot believe how inefficient arrays are. FileRead into a map is by far the quickest solution. Even ForEach on maps is slow.


bottomleypotts

Code (winbatch) Select
; Read every line into a map keyed on the full line; the value records the
; order in which each unique line was first seen.
fh=FileOpen(DirScript():`remote dups.txt`,`Read`)

m=MapCreate()
c=0

btmod=10000   ; progress-display interval

While @True
   fr=FileRead(fh)
   If fr==`*EOF*` Then Break

   If !MapKeyExist(m,fr)
      m[fr]=ArrInfo(m,1)   ; value = position of first appearance
   EndIf

   If c mod btmod==0 Then BoxText(c:@CR:fr)
   c=c+1
EndWhile

FileClose(fh)

; Copy the map into a two-column array so the unique lines can be sorted
; back into their original order before writing.
a=ArrDimension(ArrInfo(m,1),2)   ; one row per unique line
n=0

btmod=1000

ForEach e in m
   If n mod btmod==0 Then BoxText(n:@CR:e)

   a[n,0]=e
   a[n,1]=m[e]
   n=n+1
Next

ArraySort(a,@ASCENDING,1)   ; sort on column 1, the first-seen position

fh=FileOpen(DirScript():`remote dups_done.txt`,`Write`)

For i=0 to n-1
   FileWrite(fh,a[i,0])

   If i mod btmod==0 Then BoxText(i:@CR:a[i,0])
Next i

FileClose(fh)

ChuckC

There should be no need to keep each line in memory in an array.  Compute a hash/digest of some sort for each line as a "fingerprint" of the line and use the fingerprints to determine uniqueness.

stanl

Quote from: ChuckC on February 17, 2023, 04:44:08 PM
There should be no need to keep each line in memory in an array.  Compute a hash/digest of some sort for each line as a "fingerprint" of the line and use the fingerprints to determine uniqueness.


So, maybe a FileRead => FileWrite with hash map to determine uniqueness

stanl

Got a workaround. The compiled WB code he uses was started in 2011. While I had initially thought PowerShell would do the task, that is not an option here. I dug up the code, and my friend reminded me that I had included a UDF which allowed calling php.exe to run .php scripts.
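
On the WB side the hand-off is just a shell-out, something like this (hypothetical names; assumes php.exe is on the PATH):

Code (winbatch) Select
; Hypothetical example of launching the de-dupe .php script from WinBatch.
RunWait("php.exe", '"':DirScript():'dedupe.php"')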


So

Code (php) Select
$lines = file('file.txt');
$lines = array_unique($lines);
file_put_contents('file.txt', implode($lines));

as a .php file, where file.txt was renamed to whatever... He said it worked, and speed is not an issue. Fingers crossed. And thanks, everyone, for your comments.

ChuckC

Quote from: stanl on February 18, 2023, 04:01:30 AM
So, maybe a FileRead => FileWrite with hash map to determine uniqueness

Yes, that's what I had in mind.  Read each line, compute the hash, and check for the hash as a key in the map; add it if missing and write the line to the output file. If the hash is already present in the map, skip to the next line.

kdmoyers

Quote
Read each line, compute the hash, check for hash as a key in the map,...
This seems like the best strategy to me too.  But how to compute the hash?  Maybe BinaryChecksum?
The mind is everything; What you think, you become.

td

BinaryChecksum should produce the hash. The key would be selecting the hash algorithm most likely to produce the quickest result with minimal risk of two unique records having the same hash.

The other question would relate to satisfactory performance. The speed would likely depend more on the number of records than the overall file size.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

FWIW, and out of curiosity, I wrote a quick script to "dedup" a file of just over 1.6 million records that I happened to have in a test data folder on my system. I used the SHA-1 hash (probably overkill) and set the map hash table size to 1000 (which could be much bigger) using IntControl 100. The script took about 1.9 minutes to run using 64-bit WinBatch. I suspect carefully chosen optimizations could make it considerably faster.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Funny: I was just reminded that Notepad++ has a Remove duplicate lines option.

kdmoyers

Quote from: td on February 20, 2023, 10:27:42 PM
FWIW, and out of curiosity, I wrote a quick script to "dedup" a file of ...
Is it cool to share that? (the script, not the data file)
The mind is everything; What you think, you become.

td

I don't know that this script has any application in the real world, and after trying and failing twice to post the script because of missed obvious bugs, perhaps this try will succeed.

If you're inclined to play with the script, it is best to use 64-bit WinBatch for files of any significant size. One of many likely issues is that, paradoxically, the fewer the duplicates, the longer the script will take, because of the additional FileWrites and a larger hash table.

Code (winbatch) Select
; As usual, bugs are offered at no extra charge.
start = GetTickCount64()
BoxesUp("300,300, 500, 500", @Normal)
BoxTitle("Purging Duplicates From a File")
BoxText("This could take awhile...")
hIN = FileOpen(<Your file to purge here>, "Read")   
hOut = FileOpen(<Your purged file here>,"Write")
Line = ""
Prev = IntControl(100, 1000000, 0, 0, 0) ; Adjust for fine tuning.
; Buffer could be adjusted dynamically or
; simply made bigger. Memory usage should
; not be an issue on a modern 64-bit system.
hBuf = BinaryAlloc(100000)
PurgeMap = MapCreate()
Line = FileRead(hIn)
while Line != "*EOF*"
   BinaryPokeStr(hBuf, 0, Line)
   BinaryEodSet(hBuf, strlen(line))
   ; Using CRC32. MD5 would further reduce the risk of collisions (false
   ; matches) but produces longer keys and uses more string space.
   Hash = BinaryChecksum(hBuf,2) 
   if MapKeyExist(PurgeMap, Hash)
      Line = FileRead(hIn)
      continue ; Skip record write.
   endif
   PurgeMap[Hash] = 0  ; Add record hash to table
   FileWrite(hOut, Line)
   Line = FileRead(hIn)
endwhile
; Clean up for no reason in particular.
BinaryFree(hBuf)
Drop(PurgeMap)
IntControl(100, Prev, 0, 0, 0)
FileClose(hIn)
FileClose(hOut)
secs = (GetTickCount64()-Start)/1000
Message("File Purge Completed", @lf:"Total time: ":secs/60:" minutes ":secs mod 60:" seconds")

exit

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade