Author Topic: File Split with Headers  (Read 131 times)

stanl

  • Pundit
  • *****
  • Posts: 1422
File Split with Headers
« on: November 26, 2020, 08:09:44 am »
I have a situation where large ^-delimited text files (185 columns, 600 MB to 1.2 GB in size) need to be split into smaller files, each with the header row, due to limitations of an SFTP load application. I was surprised to see that, in PowerShell and in the Tech DB here, file splits were best addressed with chunking, and that the PS scripts that included headers and split by lines could take minutes or hours.


I cobbled together a PS script to address these issues and, by including a stopwatch, found it could split a 600 MB+ file into 3 sub-files:


myfile.txt => myfile_001.txt, myfile_002.txt, myfile_003.txt  in 4-6 seconds.


This was done by hard-coding 200000 lines as the split length. I then found a quick way to count the total number of lines in the file and, from that, calculate the number of split files needed so that none exceeded 200 MB. That only added 3 seconds, but it seemed redundant since 200000 was hard-coded.
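For readers following along, the fixed-line split described above can be sketched as follows. This is not the PS script from the post (which is not shown); it is a minimal Python illustration, and the function name `split_with_header` and the `_001` naming pattern are assumptions modeled on the file names listed earlier.

```python
import os

def split_with_header(path, lines_per_file=200_000):
    """Split a text file into numbered parts, repeating the header in each part.

    Hypothetical helper for illustration; returns the list of part file names.
    """
    base, ext = os.path.splitext(path)
    parts = []
    out = None
    # Binary mode leaves the CRLF bytes untouched; iteration still ends lines at \n.
    with open(path, "rb") as src:
        header = src.readline()
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # Time to start the next part: close the old one, write the header.
                if out:
                    out.close()
                name = f"{base}_{len(parts) + 1:03d}{ext}"
                out = open(name, "wb")
                out.write(header)
                parts.append(name)
            out.write(line)
    if out:
        out.close()
    return parts
```

Calling `split_with_header("myfile.txt")` would write `myfile_001.txt`, `myfile_002.txt`, and so on next to the source file. The file is read once, line by line, so memory use stays flat regardless of file size.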


I'm pretty sure a native WB FileRead() script could equal or surpass the PS script in speed. Maybe I am overthinking, but if I were to set a seed of 5 for the number of split files to create, could the number of lines per split be determined for any file size in 50000-line increments?
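One possible reading of the question is: divide the total line count by the seed, then round up to the next multiple of 50000. A minimal Python sketch of that arithmetic, where `lines_per_split` and its parameter names are hypothetical:

```python
import math

def lines_per_split(total_lines, n_files=5, step=50_000):
    """Lines per split file, rounded up to the next multiple of `step`.

    `total_lines` is the number of data lines, not counting the header.
    """
    raw = math.ceil(total_lines / n_files)
    return max(step, math.ceil(raw / step) * step)

# 600000 data lines with a seed of 5 gives 120000 raw, rounded up to 150000.
per_file = lines_per_split(600_000, n_files=5)
# Rounding up means fewer files than the seed may actually be produced.
actual_files = math.ceil(600_000 / per_file)
```

Because of the rounding, the seed acts as an upper bound on the number of files rather than an exact count: in the example, 150000 lines per file yields 4 files, not 5.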




td

  • Tech Support
  • *****
  • Posts: 3681
    • WinBatch
Re: File Split with Headers
« Reply #1 on: November 26, 2020, 09:21:15 am »
I am not sure I follow what you are asking. The number of files given the limits is straightforward math (solve for X). Calculating the number of lines or rows in a large file more or less depends on what demarcates a line. If speed is your primary criterion, then you would need 64-bit WinBatch and a couple of binary buffer functions.
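As an illustration of both points, here is a hedged Python sketch rather than WinBatch code: the file count is a ceiling division of file size by the size limit, and the line count can be taken by reading the file in large binary buffers and counting line-feed bytes. The names `files_needed` and `count_lines` are hypothetical, the 200 MB limit is taken from the first post, and LF (the tail of CRLF) is assumed to demarcate a line.

```python
import math

def files_needed(file_size_bytes, max_bytes=200 * 1024 * 1024):
    """Number of parts so that none exceeds max_bytes (header overhead ignored)."""
    return math.ceil(file_size_bytes / max_bytes)

def count_lines(path, buf_size=1024 * 1024):
    """Count lines by scanning fixed-size binary buffers for LF bytes."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            count += buf.count(b"\n")
            last = buf[-1:]
    # A final line without a trailing newline still counts as a line.
    if last and last != b"\n":
        count += 1
    return count
```

For a file of exactly 600 MB, `files_needed` gives 3, which matches the three sub-files mentioned in the first post. The repeated header adds a few bytes to each part, so a file sitting right at a multiple of the limit may need one more part in practice.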
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

  • Pundit
  • *****
  • Posts: 1422
Re: File Split with Headers
« Reply #2 on: November 26, 2020, 05:04:20 pm »
The file is assumed to have CRLF line endings. Yes, I use 64-bit PS to obtain the 4-7 second speeds. Finally, the question was scrambled. If I assume a seed of 200 MB, then no calculation with the total lines in the file is needed, and calculating takes time. It was a pretty messed-up post; you have my permission to delete it.
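The size-seeded approach mentioned here can be sketched without any up-front line count: write lines to the current part until the next line would push it past the byte limit, then start a new part with the header. This is a minimal Python illustration under that assumption, not the poster's script; `split_by_size` is a hypothetical name.

```python
import os

def split_by_size(path, max_bytes=200 * 1024 * 1024):
    """Split a text file into parts of at most max_bytes, each with the header."""
    base, ext = os.path.splitext(path)
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:
        header = src.readline()
        for line in src:
            # Start a new part when this line would exceed the byte budget.
            if out is None or written + len(line) > max_bytes:
                if out:
                    out.close()
                name = f"{base}_{len(parts) + 1:03d}{ext}"
                out = open(name, "wb")
                out.write(header)
                written = len(header)
                parts.append(name)
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return parts
```

The file is read only once, so the separate counting pass disappears. One caveat: a single line longer than the budget still gets written to its own part, which would then exceed `max_bytes`.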