Remove duplicate lines from a text file

Started by kle2, January 27, 2025, 10:11:37 AM

Previous topic - Next topic

kle2

Hello everyone,

I have an output.txt file like below.  I would like to find all the duplicate server names from the first column and add up all the numbers from the second column (columns are seperated by comma delimiter) onto a single line (using winbatch).  For example I need to look through the output.txt file and find all the "server1" which there are 3 lines that has "server1", then I combine all "server1" into 1 line and add up all the number from the second column (79+79+79=237) and it should become:
server1,237,VSP,01/27/25,,1262,

Same process for all other servers

Does any one have a sample or give me a hint?
Thanks in advance!
KLE2

output.txt
***********************
server1,79,VSP,01/27/25,,1262,
server1,79,VSP,01/27/25,,1263,
server1,79,VSP,01/27/25,,1264,
server2,100,VSP,01/27/25,,1265,
server6,100,VSP,01/27/25,,1266,
server2,100,VSP,01/27/25,,1268,
server5,430,VSP,01/27/25,,1294,
server2,100,VSP,01/27/25,,1269,
server2,100,VSP,01/27/25,,1276,
server2,380,VSP,01/27/25,,1277,
server2,380,VSP,01/27/25,,1278,
server2,380,VSP,01/27/25,,1279,
server3,80,VSP,01/27/25,afps,1291,
server2,430,VSP,01/27/25,,1292,
server2,430,VSP,01/27/25,,1293,
server6,430,VSP,01/27/25,,1294,
server6,430,VSP,01/27/25,,1295,
server2,430,VSP,01/27/25,,1296,
server2,430,VSP,01/27/25,,1297,

td

It depends on whether you want to preserve the other fields on each line to some extent. Either way, a WIL Map would be a straightforward solution. It would involve the ArrayFileGetCSV and MapKeyExist functions along with creating a map from the WIL array returned by ArrayFileGetCSV.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

It may not address your problem directly and the script makes a lot of assumptions, but it may be enough to get you started.

aWithDups = ArrayFileGetCsv('Output.txt', 0)
aSums = MapCreate()

nMax = ArrInfo(aWithDups, 1) -1
for i=0 to nMax
   if MapKeyExist(aSums, aWithDups[i,0])
      aSums[aWithDups[i,0]] += aWithDups[i,1]
   else
      aSums[aWithDups[i,0]] = aWithDups[i,1]
   endif
Next
MapFilePutCsv('Summed.txt', aSums)

Message('Example Test', FileGet('Summed.txt'))
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

kle2

Thanks for quick responding, TD!

Yes, I only want to combine the duplicated name and add up the second column.  The other columns can be anything, I am not worry much about them.

I got an error "Uninitialized variable, undefined function...for the aSums = MapCreate().  Is there any extender I need to add?

Thanks,
kle2

kdmoyers

Maybe that message means your version does not know the MapCreate function.
What version of WinBatch are you running? 
MapCreate came in at version 2019A, I think.
The mind is everything; What you think, you become.

kle2

Quote from: kdmoyers on January 28, 2025, 03:38:30 AMMaybe that message means your version does not know the MapCreate function.
What version of WinBatch are you running? 
MapCreate came in at version 2019A, I think.

Oh...that makes sense.  I am using Winbatch version 2016A.  I will have to find another way :(

Thanks kdmoyers!

spl

Quote from: kle2 on January 28, 2025, 04:46:56 AMOh...that makes sense.  I am using Winbatch version 2016A.  I will have to find another way :(


Probably a file-read / aggregate into fabricated recordset / then output with getstring(). Not as convenient as Tony's suggestion, but very workable, and not complicated to understand. I can cook up a simple script if you wish.
Stan - formerly stanl [ex-Pundit]

snowsnowsnow

First, I want to clarify exactly what the desired output is.  Here is the output of a script I wrote in another language (AWK, not WinBatch).  Is this what you want (given the data given above):

server1,237,VSP,01/27/25,,1262,
server2,3260,VSP,01/27/25,,1265,
server3,80,VSP,01/27/25,afps,1291,
server5,430,VSP,01/27/25,,1294,
server6,960,VSP,01/27/25,,1266,

Note that the data in the 3rd through fifth fields is essentially ignored, and what I did was to preserve the first line for each server name, just replacing the 2nd field with the total.

Assuming that is correct, re-writing it in WB should be straightforward, but note that, as I've mentioned a few times in the past, WB is not an ideal language for doing normal text/data processing tasks.  But it can be done.

td

Here is another approach that while more resource-intensive might work with the OP's version of WinBatch.

; The tried and true algorithm for dedupping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
ArraySort( aWithDups, @ASCENDING , 0)
aSums = ArrDimension(nRows, 2)
j = 0
aSums[j,0] = aWithDups[0,0]
aSums[j,1] = aWithDups[0,1]
nMax = nRows - 1
for i = 1 to nMax         
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])   
      j += 1
      aSums[j,0] = aWithDups[i,0]
      aSums[j,1] = aWithDups[i,1]
   else
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums,j+1, -1)

ArrayFilePutCsv('Summed.txt', aSums)

Message('Example Test', FileGet('Summed.txt'))
exit

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

spl

Snow++ made a good point, as there were no duplicate lines in the original post. So the assumption being you wanted all columns for the first instance of server, but the subsequent value of column 2 for each server aggregated, the code below will do that, although super tweakable. It assumes your input is a file that can be read. The fabricated recordset is based on the number of items per the delimiter used (in this case a comma). You can be more creative with field names. I suggested an output file but chose to display Getstring(), as you might prefer fileout with trimmed fields. Also, there is a bit of variable substitution, but as I reflected on another thread, no harm no foul....
input = Dirscript():"servers.txt"
output = Dirscript():"serversaggregate.txt"
If !FileExist(input) Then Terminate(@TRUE,"Input File Not Foubd",input)
rs=CreateObject("ADODB.Recordset")
rs.CursorType = 1
rs.LockType = 3
 
rs.Fields.Append("f1",129,10,32768)
rs.Fields.Append("f2",3,10)
rs.Fields.Append("f3",129,20,32)
rs.Fields.Append("f4",129,20,32)
rs.Fields.Append("f5",129,20,32)
rs.Fields.Append("f6",129,20,32)
rs.Fields.Append("f7",129,20,32)
rs.Open()

h = FileOpen(input, "READ")
While @True
   line = FileRead(h)
   If line == "*EOF*" Then Break
   svr = ItemExtract(1,line,",")
  
   rs.Find("f1='%svr%'")
   if rs.eof()
      rs.Addnew()
      for i= 1 to 7
         rs.Fields("f%i%").Value =  ItemExtract(i,line,",")
      next
   else
      rs.Fields("f2").Value =  rs.Fields("f2").Value +  ItemExtract(2,line,",")
   endif
   rs.update()
   rs.MoveFirst()

Endwhile
FileClose(h)
Message("Aggregated Server",rs.Getstring(2,,",",))

rs.close()
rs=0
exit

Stan - formerly stanl [ex-Pundit]

snowsnowsnow

In AWK, I used AWK's "associative arrays" - which are arrays with string indexes.  The server name is the index of the array(s); this gets you the "de-dupe" functionality, as there can only be one array element for each index (i.e., server name).

If I was doing this (in WB), I'd use the DataSaver library (written a million years ago by Marty), as that implements associative arrays in WB.

kle2

Quote from: spl on January 28, 2025, 06:00:06 AM
Quote from: kle2 on January 28, 2025, 04:46:56 AMOh...that makes sense.  I am using Winbatch version 2016A.  I will have to find another way :(


Probably a file-read / aggregate into fabricated recordset / then output with getstring(). Not as convenient as Tony's suggestion, but very workable, and not complicated to understand. I can cook up a simple script if you wish.
Thanks for the offering spl.  Yes, please give me a script.
kle2

kle2

Quote from: snowsnowsnow on January 28, 2025, 06:44:12 AMFirst, I want to clarify exactly what the desired output is.  Here is the output of a script I wrote in another language (AWK, not WinBatch).  Is this what you want (given the data given above):

server1,237,VSP,01/27/25,,1262,
server2,3260,VSP,01/27/25,,1265,
server3,80,VSP,01/27/25,afps,1291,
server5,430,VSP,01/27/25,,1294,
server6,960,VSP,01/27/25,,1266,

Note that the data in the 3rd through fifth fields is essentially ignored, and what I did was to preserve the first line for each server name, just replacing the 2nd field with the total.

Assuming that is correct, re-writing it in WB should be straightforward, but note that, as I've mentioned a few times in the past, WB is not an ideal language for doing normal text/data processing tasks.  But it can be done.
I can try anything as long as it works.  Yes please give me the script.  Thanks, kle2

kle2

Quote from: td on January 28, 2025, 09:19:04 AMHere is another approach that while more resource-intensive might work with the OP's version of WinBatch.

; The tried and true algorithm for dedupping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
ArraySort( aWithDups, @ASCENDING , 0)
aSums = ArrDimension(nRows, 2)
j = 0
aSums[j,0] = aWithDups[0,0]
aSums[j,1] = aWithDups[0,1]
nMax = nRows - 1
for i = 1 to nMax         
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])   
      j += 1
      aSums[j,0] = aWithDups[i,0]
      aSums[j,1] = aWithDups[i,1]
   else
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums,j+1, -1)

ArrayFilePutCsv('Summed.txt', aSums)

Message('Example Test', FileGet('Summed.txt'))
exit



Wow!  This works very well for my purpose.  Thanks td!

One more question, if I want to put the rest of the fields of the first row in the result?  for example the result for server1 would be:

server1,237,VSP,01/27/25,,1262,

How do I do that?

Thanks
kle2

snowsnowsnow

Well, I'll be.  WB now supports += (as in C) ?  Who knew?

td

Quote from: kle2 on January 28, 2025, 11:38:54 AM
QuoteWow!  This works very well for my purpose.  Thanks td!

One more question, if I want to put the rest of the fields of the first row in the result?  for example the result for server1 would be:

server1,237,VSP,01/27/25,,1262,

How do I do that?

Thanks
kle2
Just create the aSums array with the same number of columns as the aWithDups array. You then create a loop to copy there remaining columns from aWithDups to aSums when ever you find a new server name in the first column.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

; The tried and true algorithm for dedupping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
nCols = ArrInfo(aWithDups, 2)
ArraySort( aWithDups, @ASCENDING , 0)
aSums = ArrDimension(nRows, nCols)
nMaxRow = nRows - 1
nMaxCol = nCols - 1
j = 0
for k = 0 to nMaxCol
   aSums[j,k] = aWithDups[0,k]
next

for i = 1 to nMaxRow        
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])  
      j += 1
      for k = 0 to nMaxCol
         aSums[j,k] = aWithDups[i,k]
      next
   else
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums,j+1, -1)

ArrayFilePutCsv('Summed.txt', aSums)

Message('Example Test', FileGet('Summed.txt'))
exit
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

kle2

Quote from: td on January 28, 2025, 01:58:35 PM; The tried and true algorithm for dedupping data...
aWithDups = ArrayFileGetCsv('Output.txt', 0)
nRows = ArrInfo(aWithDups, 1)
nCols = ArrInfo(aWithDups, 2)
ArraySort( aWithDups, @ASCENDING , 0)
aSums = ArrDimension(nRows, nCols)
nMaxRow = nRows - 1
nMaxCol = nCols - 1
j = 0
for k = 0 to nMaxCol
   aSums[j,k] = aWithDups[0,k]
next

for i = 1 to nMaxRow         
   if StriCmp(aWithDups[i-1, 0], aWithDups[i, 0])   
      j += 1
      for k = 0 to nMaxCol
         aSums[j,k] = aWithDups[i,k]
      next
   else
      aSums[j,1] += aWithDups[i,1]
   endif
next
ArrayRedim(aSums,j+1, -1)

ArrayFilePutCsv('Summed.txt', aSums)

Message('Example Test', FileGet('Summed.txt'))
exit


Perfect!  Thank you again, TD!