Best way to replace text in csv files except when its contained in double quotes

Started by jbesch, November 18, 2015, 07:01:30 AM

Previous topic - Next topic

jbesch

Hi- I have a need to replace space characters in CSV files, with another delimiter such as a pipe, *except* when the space character is within text delimited by quotes.  These CSV files can be quite large so I'm looking for the fastest way to do so within Winbatch (Binary functions, RegEx, etc).  As an example, consider the following: 

2015-11-09 00:00:00.808 "C:\Temp\My Test" 200 "SVR1" - 0 7594 302 "application/GetTestService-v1.0"

I would like to replace all spaces with a | character except for the space in the quote delimited "C:\Temp\My Test" text.

Any thoughts? 

Thanks in Advance,
-Jeff

td

If by "quite large" you mean something less than maybe 70 or 80 Megabytes then I would guess that using regular expressions might be the fastest approach.  Basically, read the whole file into a string variable and pass it to the dotNet RegEx class and then write the string back to a file.   Of course, this assumes some skill at Regular Expressions writing.   
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Here's a crude attempt.  There are likely much faster approaches using regular expressions.

Code (winbatch) Select
ObjectClrOption("use", "System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089")

; Group 1 = non quoted text and group 2 = quoted text.
strPattern = `(.*?)((?<quote>[""']).*?\k<quote>|$)`

; Create an instance of our regular expression.
objRegEx = objectClrNew("System.Text.RegularExpressions.Regex", strPattern)

; Test input.
strInput = '2015-11-09 00:00:00.808 "C:\Temp\My Test" 200 "SVR1" - 0 7594 302 "application/GetTestService-v1.0"'
strInput := @CRLF:strInput

objMatches = objRegEx.Matches(strInput)
strInput = '' ; Free up some string space.

strResult = ''
foreach objMatch in objMatches
   ; The StrReplace function might be more efficient than the Replace here.
   strResult :=  objRegEx.Replace(objMatch.Groups.Item(1).Value, ' ', '|'):objMatch.Groups.Item(2).Value
next

Message('Result', strResult)
   
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Here is a marginally better version that would preserve the lines in a file.

Code (winbatch) Select
ObjectClrOption("use", "System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089")

; Group 1 = non quoted text and group 2 quoted text.
strPattern = `(.*?)((?<quote>[""']).*?\k<quote>|$)`

; Create an instance of a regular expression class object.
objRegEx = objectClrNew("System.Text.RegularExpressions.Regex") ;, strPattern)

; Test input.
strInput = '2015-11-09 00:00:00.808 "C:\Temp\My Test" 200 "SVR1" - 0 7594 302 "application/GetTestService-v1.0"'
strInput := @CRLF:strInput

; 2 = multiline mode so file line terminators are preserved.
objMatches = objRegEx.Matches(strInput, strPattern, System.Text.RegularExpressions.RegexOptions:2)
strInput = '' ; Free up some string space.

strResult = ''
foreach objMatch in objMatches
   strResult :=  objRegEx.Replace(objMatch.Groups.Item(1).Value, ' ', '|'):objMatch.Groups.Item(2).Value
next

Message('Result', strResult)
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

snowsnowsnow

Here's a GAWK (the most commonly available and best flavor of AWK) solution.  Nice in that it's only 2 lines:

BEGIN { FPAT = "([^ ]+)|(\"[^\"]+\")";OFS = "|" }
{ $1 = $1;print }

The trick here is to define FPAT to match your fields (the character after the ^ is your delimiter - note, BTW, that this format is *not* CSV - because the C stands for "comma" - it is, in fact, BSV) then set OFS to the new field delimiter.

Anyway, note that GAWK can be compiled (I've done it myself, once, long ago) as a DLL that is callable from other programs (e.g., WinBatch), so you could actually do this from within WB, making it about as topical as using .NET from within WB (heh heh).


td

Since the CLR and FCL (dotNet) are built into the operating system as of about 9 years ago, I would not call GAWK 'as topical'.  But I am sure GAWK would be a lot faster performance wise because you are avoiding a WIL interpreted loop.

And also note that dotNet  regular expressions can be compiled. 
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

jbesch

Thanks for the all replies and advice!  I'll be testing them out.  Been tied up at work for a few days.