Hello everyone,
I have an output.txt file like below. I would like to find all the duplicate server names in the first column and add up all the numbers in the second column (columns are separated by a comma delimiter) onto a single line (using WinBatch). For example, I need to look through the output.txt file and find all the "server1" lines, of which there are 3; then I combine all "server1" lines into 1 line and add up all the numbers from the second column (79+79+79=237), and it should become:
server1,237,VSP,01/27/25,,1262,
Same process for all other servers
Does anyone have a sample, or can someone give me a hint?
Thanks in advance!
KLE2
output.txt
***********************
server1,79,VSP,01/27/25,,1262,
server1,79,VSP,01/27/25,,1263,
server1,79,VSP,01/27/25,,1264,
server2,100,VSP,01/27/25,,1265,
server6,100,VSP,01/27/25,,1266,
server2,100,VSP,01/27/25,,1268,
server5,430,VSP,01/27/25,,1294,
server2,100,VSP,01/27/25,,1269,
server2,100,VSP,01/27/25,,1276,
server2,380,VSP,01/27/25,,1277,
server2,380,VSP,01/27/25,,1278,
server2,380,VSP,01/27/25,,1279,
server3,80,VSP,01/27/25,afps,1291,
server2,430,VSP,01/27/25,,1292,
server2,430,VSP,01/27/25,,1293,
server6,430,VSP,01/27/25,,1294,
server6,430,VSP,01/27/25,,1295,
server2,430,VSP,01/27/25,,1296,
server2,430,VSP,01/27/25,,1297,
It depends on whether you want to preserve the other fields on each line to some extent. Either way, a WIL Map would be a straightforward solution. It would involve the ArrayFileGetCSV and MapKeyExist functions along with creating a map from the WIL array returned by ArrayFileGetCSV.
It may not address your problem directly and the script makes a lot of assumptions, but it may be enough to get you started.
aWithDups = ArrayFileGetCsv('Output.txt', 0)
aSums = MapCreate()
nMax = ArrInfo(aWithDups, 1) - 1
for i = 0 to nMax
   if MapKeyExist(aSums, aWithDups[i,0])     ; server name already seen: add to its total
      aSums[aWithDups[i,0]] += aWithDups[i,1]
   else                                      ; first occurrence: start its total
      aSums[aWithDups[i,0]] = aWithDups[i,1]
   endif
next
MapFilePutCsv('Summed.txt', aSums)
Message('Example Test', FileGet('Summed.txt'))
Thanks for the quick response, TD!
Yes, I only want to combine the duplicated names and add up the second column. The other columns can be anything; I am not worried much about them.
I got an error "Uninitialized variable, undefined function..." for the aSums = MapCreate() line. Is there any extender I need to add?
Thanks,
kle2
Maybe that message means your version does not know the MapCreate function.
What version of WinBatch are you running?
MapCreate came in at version 2019A, I think.
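If you're not sure, I think something like this will tell you (I believe Version() reports the running WinBatch version, but I'm going from memory):

Message("WinBatch version", Version())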
Quote from: kdmoyers on January 28, 2025, 03:38:30 AM
Maybe that message means your version does not know the MapCreate function.
What version of WinBatch are you running?
MapCreate came in at version 2019A, I think.
Oh...that makes sense. I am using WinBatch version 2016A. I will have to find another way :(
Thanks kdmoyers!
Quote from: kle2 on January 28, 2025, 04:46:56 AM
Oh...that makes sense. I am using WinBatch version 2016A. I will have to find another way :(
Probably a file-read / aggregate into fabricated recordset / then output with getstring(). Not as convenient as Tony's suggestion, but very workable, and not complicated to understand. I can cook up a simple script if you wish.
First, I want to clarify exactly what the desired output is. Here is the output of a script I wrote in another language (AWK, not WinBatch). Is this what you want, given the data above:
server1,237,VSP,01/27/25,,1262,
server2,3260,VSP,01/27/25,,1265,
server3,80,VSP,01/27/25,afps,1291,
server5,430,VSP,01/27/25,,1294,
server6,960,VSP,01/27/25,,1266,
Note that the data in the 3rd through fifth fields is essentially ignored, and what I did was to preserve the first line for each server name, just replacing the 2nd field with the total.
Assuming that is correct, re-writing it in WB should be straightforward, but note that, as I've mentioned a few times in the past, WB is not an ideal language for doing normal text/data processing tasks. But it can be done.
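Purely as an illustration (untested, and the file names are just placeholders), here is roughly what that line-by-line rewrite might look like in WB using only the plain Item* list functions, so it shouldn't need Map support at all:

; Untested sketch: sum field 2 per server name, keep the rest of the first line seen.
; Assumes the Item functions preserve the empty fields and that "|" never occurs in the data.
names  = ""    ; tab-delimited list of server names seen so far
firsts = ""    ; parallel list of the first line kept for each server ("|" delimited)
totals = ""    ; parallel list of running totals
hIn = FileOpen("output.txt", "READ")
While @TRUE
   line = FileRead(hIn)
   If line == "*EOF*" Then Break
   svr = ItemExtract(1, line, ",")
   num = ItemExtract(2, line, ",")
   pos = ItemLocate(svr, names, @TAB)
   If pos == 0                        ; first time this server is seen
      names  = ItemInsert(svr,  ItemCount(names, @TAB),  names,  @TAB)
      firsts = ItemInsert(line, ItemCount(firsts, "|"),  firsts, "|")
      totals = ItemInsert(num,  ItemCount(totals, @TAB), totals, @TAB)
   Else                               ; already seen: add field 2 to its running total
      totals = ItemReplace(ItemExtract(pos, totals, @TAB) + num, pos, totals, @TAB)
   EndIf
EndWhile
FileClose(hIn)
hOut = FileOpen("summed.txt", "WRITE")
For i = 1 To ItemCount(names, @TAB)
   ; put the total back into field 2 of the first line kept for this server
   FileWrite(hOut, ItemReplace(ItemExtract(i, totals, @TAB), 2, ItemExtract(i, firsts, "|"), ","))
Next
FileClose(hOut)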
Here is another approach that, while more resource-intensive, might work with the OP's version of WinBatch.
; The tried and true algorithm for deduping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
ArraySort( aWithDups, @ASCENDING, 0)
aSums = ArrDimension(nRows, 2)
j = 0
aSums[j,0] = aWithDups[0,0]
aSums[j,1] = aWithDups[0,1]
nMax = nRows - 1
for i = 1 to nMax
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])   ; nonzero means a new server name
      j += 1
      aSums[j,0] = aWithDups[i,0]
      aSums[j,1] = aWithDups[i,1]
   else                                             ; same server: accumulate column 2
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums, j+1, -1)
ArrayFilePutCsv('Summed.txt', aSums)
Message('Example Test', FileGet('Summed.txt'))
exit
Snow++ made a good point, as there were no duplicate lines in the original post. So, assuming you want all columns from the first instance of each server, with the column-2 value of every subsequent line for that server aggregated, the code below will do that, although it is very tweakable. It assumes your input is a file that can be read. The fabricated recordset is based on the number of items per the delimiter used (in this case a comma). You can be more creative with the field names. I suggested an output file but chose to display GetString() instead, as you might prefer a file output with trimmed fields. Also, there is a bit of variable substitution but, as I reflected in another thread, no harm no foul...
input = DirScript():"servers.txt"
output = DirScript():"serversaggregate.txt"
If !FileExist(input) Then Terminate(@TRUE, "Input File Not Found", input)
; Fabricated ADO recordset: one field per comma-delimited column
rs = CreateObject("ADODB.Recordset")
rs.CursorType = 1                     ; adOpenKeyset
rs.LockType = 3                       ; adLockOptimistic
rs.Fields.Append("f1",129,10,32768)   ; server name (129 = adChar)
rs.Fields.Append("f2",3,10)           ; column to aggregate (3 = adInteger)
rs.Fields.Append("f3",129,20,32)
rs.Fields.Append("f4",129,20,32)
rs.Fields.Append("f5",129,20,32)
rs.Fields.Append("f6",129,20,32)
rs.Fields.Append("f7",129,20,32)
rs.Open()
h = FileOpen(input, "READ")
While @True
   line = FileRead(h)
   If line == "*EOF*" Then Break
   svr = ItemExtract(1,line,",")
   rs.Find("f1='%svr%'")
   if rs.eof()                        ; first time this server is seen: add a new record
      rs.Addnew()
      for i = 1 to 7
         rs.Fields("f%i%").Value = ItemExtract(i,line,",")
      next
   else                               ; existing server: add column 2 to its total
      rs.Fields("f2").Value = rs.Fields("f2").Value + ItemExtract(2,line,",")
   endif
   rs.update()
   rs.MoveFirst()
Endwhile
FileClose(h)
Message("Aggregated Server", rs.Getstring(2,,",",))
rs.close()
rs = 0
exit
In AWK, I used AWK's "associative arrays" - which are arrays with string indexes. The server name is the index of the array(s); this gets you the "de-dupe" functionality, as there can only be one array element for each index (i.e., server name).
If I were doing this (in WB), I'd use the DataSaver library (written a million years ago by Marty), as that implements associative arrays in WB.
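Just to illustrate the idea (using the newer Map syntax from Tony's example rather than DataSaver's own calls, which I don't have in front of me), the index is the server name, so the totals collapse to one entry per server:

totals = MapCreate()                          ; needs 2019A or newer, as noted above
totals["server1"] = 79
totals["server1"] = totals["server1"] + 79    ; still just one "server1" entry, now 158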
Quote from: spl on January 28, 2025, 06:00:06 AM
Quote from: kle2 on January 28, 2025, 04:46:56 AM
Oh...that makes sense. I am using WinBatch version 2016A. I will have to find another way :(
Probably a file-read / aggregate into fabricated recordset / then output with getstring(). Not as convenient as Tony's suggestion, but very workable, and not complicated to understand. I can cook up a simple script if you wish.
Thanks for the offer, spl. Yes, please give me a script.
kle2
Quote from: snowsnowsnow on January 28, 2025, 06:44:12 AM
First, I want to clarify exactly what the desired output is. Here is the output of a script I wrote in another language (AWK, not WinBatch). Is this what you want, given the data above:
server1,237,VSP,01/27/25,,1262,
server2,3260,VSP,01/27/25,,1265,
server3,80,VSP,01/27/25,afps,1291,
server5,430,VSP,01/27/25,,1294,
server6,960,VSP,01/27/25,,1266,
Note that the data in the 3rd through fifth fields is essentially ignored, and what I did was to preserve the first line for each server name, just replacing the 2nd field with the total.
Assuming that is correct, re-writing it in WB should be straightforward, but note that, as I've mentioned a few times in the past, WB is not an ideal language for doing normal text/data processing tasks. But it can be done.
I can try anything as long as it works. Yes, please give me the script. Thanks, kle2
Quote from: td on January 28, 2025, 09:19:04 AM
Here is another approach that, while more resource-intensive, might work with the OP's version of WinBatch.
Wow! This works very well for my purpose. Thanks td!
One more question: what if I want to put the rest of the fields of the first row in the result? For example, the result for server1 would be:
server1,237,VSP,01/27/25,,1262,
How do I do that?
Thanks
kle2
Well, I'll be. WB now supports += (as in C)? Who knew?
Quote from: kle2 on January 28, 2025, 11:38:54 AM
Wow! This works very well for my purpose. Thanks td!
One more question: what if I want to put the rest of the fields of the first row in the result? For example, the result for server1 would be:
server1,237,VSP,01/27/25,,1262,
How do I do that?
Thanks
kle2
Just create the aSums array with the same number of columns as the aWithDups array. You then create a loop to copy the remaining columns from aWithDups to aSums whenever you find a new server name in the first column.
; The tried and true algorithm for deduping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
nCols = ArrInfo(aWithDups, 2)
ArraySort( aWithDups, @ASCENDING, 0)
aSums = ArrDimension(nRows, nCols)
nMaxRow = nRows - 1
nMaxCol = nCols - 1
j = 0
for k = 0 to nMaxCol                  ; copy every column of the very first row
   aSums[j,k] = aWithDups[0,k]
next
for i = 1 to nMaxRow
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])   ; nonzero means a new server name
      j += 1
      for k = 0 to nMaxCol            ; copy all columns of the first row for this server
         aSums[j,k] = aWithDups[i,k]
      next
   else                               ; same server: accumulate column 2
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums, j+1, -1)
ArrayFilePutCsv('Summed.txt', aSums)
Message('Example Test', FileGet('Summed.txt'))
exit
Quote from: td on January 28, 2025, 01:58:35 PM
Perfect! Thank you again, TD!