Remove NULL char from entire files

Started by NateT, June 19, 2015, 12:36:42 PM

NateT

I'm trying to find the best code for taking a file and stripping out any NULL characters in the file.  FileGet has an option to do that in the function, so I simply went with:

FilePut(param1, FileGet(param1,""))

I just wanted to see if there were any arguments for using the Binary functions or any other ideas.  My testing accomplished what I wanted and it is very fast even on a 10 MB file.  I'm just questioning if it is really that easy or if there are caveats to doing it this way.

td

Count your blessings.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

kdmoyers

Quote from: td on June 19, 2015, 02:00:36 PM
Count your blessings.

I think (might be wrong) that is Tony's short-winded way of saying:

Nope -- no caveats if it works.  You are fortunate that your files are that small, and that you are doing this in 2015 and not 2000, when the memory situation in most PCs was very different.  (Heck, I don't think FilePut was available in 2000.)  When the files fit handily in memory, that method is safe and fast.

If your files get into the gigabytes, there are other less tidy options.
The mind is everything; What you think, you become.

td

If memory serves, FileGet will run out of interpreter string space at somewhere around the 80-90 MB file size mark.

NateT

Quote from: td on June 22, 2015, 09:50:24 PM
If memory serves, FileGet will run out of interpreter string space at somewhere around the 80-90 MB file size mark.

Thanks.  I'll watch the file sizes.
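
For anyone who wants the "watch the file sizes" idea spelled out, here is a rough sketch in Python (not WinBatch). The function name and the cutoff constant are made up for illustration; the cutoff is only loosely modeled on the 80-90 MB figure recalled above and says nothing about Python's own limits.

```python
import os

# Hypothetical cutoff, loosely based on the 80-90 MB figure mentioned in the thread.
MAX_IN_MEMORY_BYTES = 80 * 1024 * 1024


def strip_nuls_small(path, max_bytes=MAX_IN_MEMORY_BYTES):
    """Remove NUL bytes from a small file in place; return how many were removed."""
    size = os.path.getsize(path)
    if size > max_bytes:
        # Refuse rather than risk exhausting memory; the caller can fall back
        # to a chunked approach for files this large.
        raise ValueError(f"{path} is {size} bytes; use a chunked approach instead")
    with open(path, "rb") as f:
        data = f.read()
    cleaned = data.replace(b"\x00", b"")
    with open(path, "wb") as f:
        f.write(cleaned)
    return len(data) - len(cleaned)
```

The size check runs before anything is read, so an oversized file is left untouched.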

DAG_P6

Bigger files can be dispatched with a BinaryRead loop.
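
As a language-neutral illustration of that kind of loop, here is a minimal sketch in Python (not WinBatch). The function name and chunk size are arbitrary choices for the example. Because the thing being removed is a single byte, nothing special is needed at chunk boundaries; a multi-byte search pattern would need extra care there.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # arbitrary buffer size for the sketch


def strip_nuls_chunked(src_path, chunk_size=CHUNK_SIZE):
    """Stream src_path through a temp file, dropping NUL bytes; return the count removed."""
    removed = 0
    # Put the temp file next to the source so the final rename stays on one volume.
    dir_name = os.path.dirname(os.path.abspath(src_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as dst, open(src_path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break  # end of file
                cleaned = chunk.replace(b"\x00", b"")
                removed += len(chunk) - len(cleaned)
                dst.write(cleaned)
        # Swap the cleaned copy into place only after the whole file was processed.
        os.replace(tmp_path, src_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return removed
```

Only one chunk is held in memory at a time, at the cost of needing spare disk space for the temporary copy.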
David A. Gray
You are more important than any technology.

td

Slightly modified help file example from the 'BinaryReplace' function topic:
Code (winbatch):
; Should be good for files up to a little over 250 MB, depending on execution environment.
str="" ; Search for nuls.
rep="" ; Replace with "nothing".
strFile="C:\Temp\FileWithNuls.txt"
nFs = FileSize( strFile )
hBuf = BinaryAlloc( nFs+100 )
ret = BinaryRead( hBuf, strFile )
nReps = BinaryReplace( hBuf, str, rep ,0)
Message( "Number of '%str%' strings replaced", nReps )
;;strFile="C:\Temp\FileNoNuls.txt"
BinaryWrite( hBuf, strFile )
BinaryFree( hBuf )

td

Here's another version that is much slower but should handle files of almost any size - assuming enough spare disk space is available. 
Code (winbatch):
;; Lightly tested!  Use at your own risk!
AddExtender("WWHUG44I.DLL", 0, "WWHUG64I.DLL")

str        = ""
rep        = ""
Offset     = "0"
OutOffset  = "0"
RepsTotal  = "0"
strFileOut = "C:\Temp\FileNoNuls.txt"
strFileIn  = "C:\Temp\FileWithNuls.txt"

; Could do file management task here like deleting an existing out file

Fs = FileSize( strFileIn, 1 )
if StrSub(huge_Subtract (Fs, "100000000"),1,1) !="-" then BufSize = 100000000 ;  Arbitrary binary buffer size
else BufSize = Fs
hBuf = BinaryAlloc( BufSize+100 )
while StrSub(huge_Subtract (Fs, Offset),1,1) !="-"
   nRead = BinaryReadEx( hBuf, 0, strFileIn, Offset, BufSize)
   nReps = BinaryReplace( hBuf, str, rep ,0)
   nWritten  = BinaryWriteEx(hBuf,0, strFileOut, OutOffset, nRead - nReps) 
   OutOffset = huge_Add(OutOffset, nWritten)
   RepsTotal = huge_Add(RepsTotal,nReps)
   Offset    = huge_Add(Offset,BufSize)
endwhile

; Could perform file delete and rename here.

Message( "Number of '":str:"' strings replaced", RepsTotal )
BinaryFree( hBuf )

DAG_P6

Though it's been a few years since I did so, I once ran a series of benchmarks in which I used buffers of various sizes to perform sequential read and write operations on large files. Somewhat to my surprise, given the then-current sizes of disk sectors and cylinders, I found that there was a sweet spot around 8192 bytes, beyond which I saw very little gain in performance. Though the fact that I was using synchronous file I/O may have had some effect on the outcome, I suspect not: the application was already I/O bound, with nothing else to keep it occupied while it waited for I/O operations to complete.

The reason this result surprised me was that when I wrote code for IBM mainframe computers, the sweet spot was almost always the number of bytes that fit into a cylinder (one track all the way around one side of one platter). Even then, cylinder sizes varied significantly, but the sweet spot was usually much higher than 8192 bytes. Conversely, when the output destination was a 9-track tape, the sweet spot was more like 8 KB, because if your blocks were much bigger than that, you spent a lot of time waiting for the tape drive to skip past bad spots in the tape, since it had to lay down the whole block in one contiguous run of usable tape. The other disadvantage of writing excessively large blocks onto a tape also came from those bad spots: you frequently got significantly less data onto one reel. In the worst cases, this meant that your job went into a holding pattern while a new tape was mounted and spun up.
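
The buffer-size benchmark described above can be approximated with something like the following Python sketch (not WinBatch). The function names, scratch file size, and list of buffer sizes are all arbitrary choices for the example. On a modern OS the file cache will heavily influence the numbers, so treat any results as indicative at best.

```python
import os
import tempfile
import time


def time_copy(src_path, dst_path, buf_size):
    """Copy src_path to dst_path with unbuffered I/O in buf_size blocks; return seconds."""
    start = time.perf_counter()
    # buffering=0 disables Python's own buffering so buf_size is what reaches the OS.
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb", buffering=0) as dst:
        while True:
            block = src.read(buf_size)
            if not block:
                break
            dst.write(block)
    return time.perf_counter() - start


def benchmark(file_size=8 * 1024 * 1024, buf_sizes=(512, 4096, 8192, 65536, 1024 * 1024)):
    """Return {buf_size: seconds} for copying a scratch file of file_size bytes."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(file_size))
        for size in buf_sizes:
            results[size] = time_copy(src, dst, size)
    return results


if __name__ == "__main__":
    for size, seconds in benchmark().items():
        print(f"{size:>8} bytes: {seconds:.4f} s")
```

Running it a few times and comparing the timings per buffer size gives a feel for where the gains flatten out on a given machine.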

kdmoyers

[complicated joke comparing block size of 9-track mag tape and shoe box full of punch cards omitted]
Boy, I feel old.
-K

td

Saw a fellow student trip while walking down the stairs of one of the science buildings with a shoebox under his arm.  It was wet and windy....

kdmoyers

Speaking of the old days, here's an interesting article about how ghosts of the old days persist into the present: skeuomorphs
(it's from a SQL related forum, so there is an SQL tilt to the text)

https://www.simple-talk.com/opinion/opinion-pieces/sql-style-habits-attack-of-the-skeuomorphs/

stanl

Quote from: kdmoyers on July 24, 2015, 04:10:13 AM
Speaking of the old days, here's an interesting article

skeuomorphism  -  love it.

then there is meuomorphism  -  "my way or the highway"

JTaylor