Dealing with broken quotes in CSV files

Started by KeithW, August 22, 2024, 09:16:20 AM

Previous topic - Next topic

KeithW

Greetings,

I have a client that supplied a 4.6M row CSV file that I have to load into a database.
In the process of loading I found embedded quotes in fields that break the rules for CSV files
For Example the field  might look like:
field1,Robert "Bob" Smith,field3             ...    when it should be
field1,Robert ""Bob"" Smith, field3

I can write a script to fix this, I am wondering if it have already been addressed by anyone that would be willing to share rather than reinventing yet another wheel?

Thanx,
Keith

td

Depending on the size of the file (and not the number of records) properly escaping double quotes could be as simple as a 3-line script using either the StrReplace with FileGet/FilePut or BinaryReplace with BinaryRead/BinaryWrite functions. You can make it slightly more complicated with 5 lines and much slower by stepping through the file's records one at a time using FileOpen/FileRead/FileWrite/FileClose. This method has the advantage of not having any practical file size limit. You can use Excel COM Automation objects if the above approaches are not complicated enough. Of course, the Excel approach requires access to a locally installed copy of Excel.

If your file contains some properly escaped double quotes mixed with incorrectly escaped double quotes, things do get a little more complicated. You would need to perform some text parser using WinBatch string functions or use .Net regular expressions.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

KeithW

LOL,

OK, out of 4.6M rows, I now know 51 rows have misplaced double-quotes.
I figured, if I could automate now, in the future it would be easier to deal with these issues.
Are there properly double-quotes fields, I do not know, but figure out of 4.6M rows there might be a good possibility.  I guess a quick search with an editor for "" might (?) yield an answer.

JTaylor

Would that be considered valid if it was assumed there were no commas in the data and no double-quote processing was done.  Does it fail to load as CSV if you use ArrayFileGetCSV()?

jim

td

ArrayFileGetCSV can properly handle correctly escaped CSV files. For example:

1,e685B8690f9fbce,Erik,Little,"""Blankenship PLC""",Caitlynmouth,Sao Tome and Principe,457-542-6899,055.415.2664x5425,shanehester@campbell.org,2021-12-23,https://wagner.com/

but coughs up a hairball on something like

1,e685B8690f9fbce,Erik,Little,""Blankenship PLC",Caitlynmouth,Sao Tome and Principe,457-542-6899,055.415.2664x5425,shanehester@campbell.org,2021-12-23,https://wagner.com/
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

spl

Just .02, but seems all breaks down to what are the rules. Used to work with huge csv downloads from SalesForce and to use an edit on Tony's data
1,e685B8690f9fbce,Erik,Little,"Blankenship, PLC",Caitlynmouth,Sao Tome and Principe,457-542-6899,055.415.2664x5425,shanehester@campbell.org,2021-12-23,https://wagner.com/

We determined the data was non-quoted comma-separated but specific columns could contain quotes if a comma in the data [I hope this is what Jim was referring to]. Later had issues with | or ^ delimited data, and enacted appropriate rules. If bad data was found either ask for a new download, or upload what did work and ask for corrected bad rows to be resent. Which only hints at rules for Unicode with umlautes or accent grave.
Stan - formerly stanl [ex-Pundit]

JTaylor

Understood, but I don't think that was his problem though.  His was the following.  It appears it doesn't escape the quotes but it also doesn't enclose text in quotes.

field1,Robert "Bob" Smith,field3

Jim

JTaylor

If I understand the problem, and the size of the file isn't an issue, something like the following will produce more of what you expect and something the WinBatch functions will work with as well.

AddExtender("wbOmniCSV.dll")
csvLoadFile("my.csv",0,",",@TRUE,@TRUE)
csvSaveFile("my_new.csv")

spl

Quote from: JTaylor on August 23, 2024, 05:28:47 AMUnderstood, but I don't think that was his problem though.  His was the following.  It appears it doesn't escape the quotes but it also doesn't enclose text in quotes.


I thought I understood as well. The issue, not really the data, but how to INSERT into a database. If it has to be loaded/treated/recopied then bulk inserted then what you suggested makes sense, but if it can be loaded, parsed correctly and inserted by row [without any changes] then rules make sense. Of course I was speaking from a purely corporate perspective where there were groups that monitored IT ETL.
Stan - formerly stanl [ex-Pundit]

JTaylor

I think The INSERT is a problem due to some readers choking on the data since it is not strict CSV formatting.   The question was whether the data was corrupted with the way the quotes were handled.  If the example presented was indicative of the situation then it really isn't corrupt, it just isn't a technically valid CSV format.   Had they done the double-quote enclosures for the text data then it would be corrupt as the quotes current there would have needed to be escaped.

Jim

td

Quote from: JTaylor on August 23, 2024, 05:28:47 AMUnderstood, but I don't think that was his problem though.  His was the following.  It appears it doesn't escape the quotes but it also doesn't enclose text in quotes.

field1,Robert "Bob" Smith,field3

Jim

From the published CSV RFC  (https://www.rfc-editor.org/rfc/rfc4180):

  5.  Each field may or may not be enclosed in double quotes (however
      some programs, such as Microsoft Excel, do not use double quotes
      at all).  If fields are not enclosed with double quotes, then
      double quotes may not appear inside the fields
.  For example:

      "aaa","bbb","ccc" CRLF
      zzz,yyy,xxx

  6.  Fields containing line breaks (CRLF), double quotes, and commas
      should be enclosed in double-quotes.
  For example:

      "aaa","b CRLF
      bb","ccc" CRLF
      zzz,yyy,xxx

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"


The ArrayFileGetCSV functions objects to quotes in non-quoted fields. Unfortunately, not all applications adhere to the RFC.

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

spl

Quote from: JTaylor on August 23, 2024, 06:46:22 AMIf the example presented was indicative of the situation then it really isn't corrupt, it just isn't a technically valid CSV format.

Yes, the rules.
Stan - formerly stanl [ex-Pundit]

JTaylor

I understand.  I wasn't disagreeing that the formatting was wrong for CSV.  I was noting that the data wasn't corrupted, per se.  From the example, the data was consistent in the way it was formatted.  They weren't just throwing quotes in there willy-nilly.  Escaping some, not others, open quotes without closing them, etc.

Jim

spl

Quote from: JTaylor on August 23, 2024, 09:43:17 AMI understand.  I wasn't disagreeing that the formatting was wrong for CSV.  I was noting that the data wasn't corrupted, per se.  From the example, the data was consistent in the way it was formatted.  They weren't just throwing quotes in there willy-nilly.  Escaping some, not others, open quotes without closing them, etc.

Jim

Yes, and not understanding the innards for the engine that produced the data would raise my alarm about rules. No harm, no foul,
Stan - formerly stanl [ex-Pundit]

td

Alarm about rules? This entire topic thread exemplifies why adhering to published rules is important when sharing data across entities.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor

Sorry, was replying to Tony.


Quote from: spl on August 23, 2024, 09:52:38 AMYes, and not understanding the innards for the engine that produced the data would raise my alarm about rules. No harm, no foul,

JTaylor

Agreed.

Quote from: td on August 23, 2024, 10:14:37 AMAlarm about rules? This entire topic thread exemplifies why adhering to published rules is important when sharing data across entities.

spl

Stan - formerly stanl [ex-Pundit]