Comparing large files

Started by hdsouza, December 17, 2017, 03:20:42 PM

hdsouza

I have two files, each with over 300,000 lines. I need to find the lines that are different (or were added to the newer file) and write them out to a TXT file. I can read each line in one file and check whether it exists in the other file, but that would take several days.

I looked at FileCompare, but it does not tell me which lines are different.
I also tried BinaryCompare, but it needs a count, and I need to compare the full files.

Any thoughts would be really appreciated.
Thanks



JTaylor

Do you have access to some form of Unix?   Easy Peasy there.

Or, maybe WinDiff   https://support.microsoft.com/en-us/help/159214/how-to-use-the-windiff-exe-utility

Jim

td

There is always the "fc" command shell utility. Just type "fc /?" at a command prompt for details. It can be passed as a parameter to cmd.exe with one of the WinBatch Run* functions.

Also, Windows 10 now has a Unix subsystem and the "bash" shell, if you prefer using a Unix shell "diff" command.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

hdsouza

Thanks all.
Comparing the large files is part of a larger automation script that I have written in WinBatch. That is why I wanted to stay within WinBatch.
Unfortunately I do not have Unix. I have tried FC, and the results are not reliable for large files that contain long lines.

I just tried PowerShell and that may work better.
I will probably have to call PowerShell from WinBatch for the compare.

hdsouza

So if I were to call this Powershell script ....

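Code (powershell)

# Build the paths of the two downloads to compare.
$File_Path = "C:\temp\"
$File_CurrentDwnld = $File_Path + "File_CurrentDwnld.txt"
$File_EarlierDwnld = $File_Path + "File_EarlierDwnld.txt"

# Compare-Object returns one object per line that occurs in only one of the files.
$Compare_Download = Compare-Object (Get-Content $File_CurrentDwnld) (Get-Content $File_EarlierDwnld)
# Keep only the line text (InputObject) and drop the SideIndicator column.
$Compare_Download = $Compare_Download.InputObject

# Write the differing lines to a text file.
$File_Difference = $File_Path + "File_Difference.txt"
$Compare_Download > $File_Difference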


..... from WinBatch, would I need to execute something like this, or is there an easier or cleaner way?
Code (winbatch)

params = "C:\temp\comparefiles.ps1"
dir_Path = "C:\temp"
RunShell(Environment ("COMSPEC"), "/c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe %params%", dir_Path, @NORMAL, @GETPROCID)
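
A note on the PowerShell script in this post: Compare-Object reports lines that are unique to either file, so the output mixes lines that exist only in the newer download with lines that exist only in the older one. If only the lines added to the newer file are wanted, the SideIndicator property can be used as a filter. The following is a minimal sketch, not a tested replacement for the script above. The function name Get-AddedLine and the sample data are hypothetical. It assumes the older lines are passed as the reference set, so '=>' marks lines found only in the newer set (the script above passes the current file first, where the same lines would carry '<=' instead).

```powershell
# Hypothetical helper: returns only the lines that appear in $NewLines
# but not in $OldLines.
function Get-AddedLine {
    param(
        [string[]]$OldLines,
        [string[]]$NewLines
    )
    # With the old lines as -ReferenceObject and the new lines as
    # -DifferenceObject, SideIndicator '=>' marks items that exist
    # only in the difference (new) set.
    Compare-Object -ReferenceObject $OldLines -DifferenceObject $NewLines |
        Where-Object { $_.SideIndicator -eq '=>' } |
        ForEach-Object { $_.InputObject }
}

# Small in-memory demo (sample data is made up for illustration).
$old = 'alpha', 'beta', 'gamma'
$new = 'alpha', 'gamma', 'delta'
Get-AddedLine -OldLines $old -NewLines $new   # emits: delta
```

Compare-Object compares strings case-insensitively unless -CaseSensitive is added, which may or may not be what the download files need.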

kdmoyers

Well, if I *had* to do it in wbt, I'd compute a quick hash of each line in the file and put the hashes in an array, along with offset and length of line.  Then at least your searches can be a fast ArraySearch function call.  You would need to check for hash collisions.

But writing a good "diff" is non trivial, so doing it myself would be a last resort.

$0.02
-Kirby
The mind is everything; What you think, you become.
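
The hash-lookup idea above can be sketched in PowerShell, which the thread already uses for the compare step. This is a swapped-in variant rather than the array-of-hashes design described in the post: it relies on the .NET HashSet[string] class, which hashes each line internally and resolves collisions itself, so no manual collision check is needed. The function name Get-NewLine and the demo file names are hypothetical, and the sketch assumes PowerShell 5 or later (for the ::new() syntax) and that the older file fits in memory.

```powershell
# Hypothetical helper: emits every line of $NewPath that does not occur
# anywhere in $OldPath.
function Get-NewLine {
    param(
        [Parameter(Mandatory)][string]$OldPath,
        [Parameter(Mandatory)][string]$NewPath
    )
    # Load every line of the older file into a hash set
    # (ordinal comparison, so the match is case-sensitive).
    $seen = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::Ordinal)
    foreach ($line in [System.IO.File]::ReadLines($OldPath)) {
        [void]$seen.Add($line)
    }
    # Stream the newer file and emit each line the set does not contain.
    foreach ($line in [System.IO.File]::ReadLines($NewPath)) {
        if (-not $seen.Contains($line)) { $line }
    }
}

# Demo with two small temporary files. Full paths are used because .NET
# resolves relative paths against the process directory, not the current
# PowerShell location.
$tmp     = [System.IO.Path]::GetTempPath()
$oldFile = Join-Path $tmp 'File_EarlierDwnld_demo.txt'
$newFile = Join-Path $tmp 'File_CurrentDwnld_demo.txt'
Set-Content -Path $oldFile -Value 'alpha', 'beta', 'gamma'
Set-Content -Path $newFile -Value 'alpha', 'gamma', 'delta', 'Beta'

$added = @(Get-NewLine -OldPath $oldFile -NewPath $newFile)
$added   # emits: delta, Beta
```

Each lookup is a single hash probe, so the work grows with the total number of lines rather than with the product of the two line counts. Unlike a true diff, this ignores line order and duplicate counts: a line is reported only if it never occurs in the older file. Piping the result to Set-Content would write the difference file.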

td

Against my better judgement, because I am not sure how useful it is, here is another way to use PowerShell in a WinBatch script:

Code (winbatch)
; Get files text.  Could use Powershell commands for this too.
strTextRef = FileGet("C:\temp\dump.txt")
strTextDif = FileGet("C:\temp\dummy.txt")

ObjectClrOption("useany", "System.Management.Automation")
objAutoPs = ObjectClrNew("System.Management.Automation.PowerShell")

objPowShell = objAutoPs.Create()
objPowShell.AddCommand("Compare-Object")
objPowShell.AddParameter("ReferenceObject", strTextRef )
objPowShell.AddParameter("DifferenceObject", strTextDif)
objAsync = objPowShell.BeginInvoke()
objPsCollection = objPowShell.EndInvoke(objasync)
foreach objItem in objPsCollection
   Pause("Difference", objItem.ToString())
next
objPowShell.Dispose()
exit


As an aside: if you have Windows 10 Pro build 1703 or 1709, you have Unix, or at least the command shell. MSFT just makes you go through a few hoops to set it up.

hdsouza

Thanks, td.
On the line ObjectClrOption("useany", "System.Management.Automation") I get the error:
1850: CLR Unknown runtime option

td

You must not be using the latest version of WinBatch. You will have to use the less user-friendly 'use' option, something like the following:

Code (winbatch)
ObjectClrOption("use","System.Management.Automation, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35")

The version value will depend on your version of dot Burgerflipper (a.k.a. .NET).

td

FWIW, Windows 10 version 1709 (OS Build 16299.125) has the following version of the automation assembly:
Code (winbatch)
ObjectClrOption("use","System.Management.Automation, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35")

hdsouza