Comma delimited with commas in text fields

Started by MW4, March 03, 2016, 10:07:23 AM

MW4

Hello friends,
I have a huge comma-delimited text file that I get from a vendor, and unfortunately three of its fields contain company names, which can themselves contain commas.
The guys who put this data together have a fix planned, but it probably won't land until June, and I really can't wait that long.

Short of identifying each instance of an embedded comma and hard-coding a StrReplace, I'm at a loss.

File is like 12,000 records, 33 fields.

Any ideas from the board?

td

12,000 records and 33 fields isn't all that big a file.  The complication would seem to be that you have three fields that may be offenders.  If you could assume that when one offends they all offend, it wouldn't be an issue.  If you can't assume that, then it would be difficult to programmatically identify which fields are offending unless you have some specific knowledge about the surrounding fields or the content of the offending fields.  For example, the company name field is always preceded by a field that contains numbers in a range or contains specific characters; or the erroneous field always contains 'Inc' or 'Co'.

If you can make some assumptions about your data, then you could use the File* and perhaps Item* WIL functions in a relatively simple script to both identify and correct the problem.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

snowsnowsnow

This is an all-too-common problem, and it illustrates well why a comma is just about the worst possible choice of delimiter for a "CSV" file.  The best thing to do would be to get your vendor to re-export the file using some other delimiter (TAB is pretty much the best), but that may not be an option at this point or on your schedule, although re-exporting should be easier for them than coming up with a custom fix, which is what your "wait until June" scenario implies.

Anyway, that said, the usual case is that when there are embedded commas in the fields, there is some kind of "quoting" mechanism in place to signal that a given comma is a literal comma, not a field delimiter.  The most straightforward approach is to enclose either all text fields in quote marks ("like this") or just the ones that have embedded commas.  In either case, it then becomes relatively simple to parse the file and "do the right thing" (although I don't think this particular task can be done natively in WB; I could be wrong about that).  It is pretty straightforward in other scripting languages, such as AWK.

So, question for the OP: Is there any kind of quoting mechanism in use in your file?

MW4

No, they didn't put that much thought into the file.  Adding quotes is what they are going to do by June.

MW4

3 fields are offenders, with 800+ different offending values within those fields, and no rhyme or reason to them, of course.

JTaylor

What I would do is use ItemCount() to count the fields on each line and then, of course, deal with the lines that have too many.  Also, I'm guessing that when a comma is in the middle of the data it is probably followed by a SPACE, while the delimiters are not.  As for figuring out other offenders, as Tony mentioned, you can look for expected types in other fields: if a zip code is supposed to be in column 15 but column 15 holds no 5-digit number, you know the data has been shoved over too far, so checks of that type can narrow down which field is the offender.  Hopefully the comma+space test will fix most of your problems.

Jim
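For readers following along, the checks Jim describes can be sketched in Python; the 33-field count comes from the thread, while the helper names, the marker character, and the ZIP-column example are illustrative, not from any actual script:

```python
import re

EXPECTED_FIELDS = 33  # per the original post

def is_good(line):
    """A clean line splits into exactly the expected number of fields."""
    return line.count(",") == EXPECTED_FIELDS - 1

def repair_comma_space(line):
    """Guess: embedded commas are followed by a space, delimiters are not.
    Protect ', ' with a marker, split on the remaining commas, restore."""
    marker = "\x01"
    parts = line.replace(", ", marker).split(",")
    return [p.replace(marker, ", ") for p in parts]

def zip_shifted(fields, zip_col):
    """Type check: if the expected ZIP column is not 5 digits, an earlier
    field probably swallowed an extra comma."""
    return not re.fullmatch(r"\d{5}", fields[zip_col])
```

If the comma+space guess holds, `repair_comma_space("a,Acme, Inc.,b")` comes back as three fields with the company name intact.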

MW4

I'm going to give it a shot in a bit.

Thanks for the ideas

td

As snow++ suggested, if you can identify the bad fields, you can fix them by placing a double quote just before the beginning of the first field and just after the end of the second.  That makes them a single field instead of two fields.

stanl

I deal with this daily, parsing over 1 gig of exports from Microstrategy.  Basically the data export looks similar to

"id","last,first","address"     etc

The exported text is processed line by line due to file size (ruling out FileGet).  But because the pattern is consistent, for each line I can replace "," with | and the remaining " characters with nothing (and choose to replace any remaining commas with | if I need further separation), then use the bar delimiter to process each section into the appropriate database field.  [Old school, but each file type is processed with rules from an .ini file.]

I agree with snow++ that a comma is a bad delimiter... maybe | is too, but I have used it successfully for years and often receive data in that format.  The point is to look for a pattern you can manipulate, as the WB file I/O functions are excellent.
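stanl's quoted-export trick translated to Python for illustration (the sample line is made up, and his real files are driven by .ini rules rather than a one-off function like this):

```python
def parse_quoted_line(line):
    """Every field boundary appears as the three-character sequence ","
    so replacing that exact sequence with a bar leaves embedded commas
    alone; the stray quotes at the ends of the line can then be dropped."""
    line = line.strip()
    line = line.replace('","', "|")  # field boundaries become bars
    line = line.replace('"', "")     # strip the leading/trailing quotes
    return line.split("|")

print(parse_quoted_line('"id","last,first","address"'))
# -> ['id', 'last,first', 'address']
```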

td

In this case both quoting and delimiter changing likely have their strengths and weaknesses.  Here's an example of how you might join two fields using quoting:

Code (winbatch) Select
; Note: ItemExtractCSV is not a good choice here.
strRecord = '123,Borg Industries, Inc., xyx'
; Prefix field 2 with a double quote...
strRecord = ItemReplace('"':ItemExtract(2,strRecord,','),2,strRecord,',')
; ...then append a double quote to field 3, turning fields 2 and 3 into one quoted CSV field.
strRecord = ItemReplace(ItemExtract(3,strRecord,','):'"',3,strRecord,',')


One admittedly small advantage of the above is that it does not change any item index locations, in case you plan on performing more operations on the record.  One perhaps big disadvantage is that it requires four passes over the data by Item* functions to perform one field join.

Of course, the field-joining method becomes moot if you can't figure out which fields need joining.

MW4

Ok I have found a pattern that I think I can use on the first field

Shortened record 4 field examples...

firstfield,secondfield,Bad data with commas, like here,44,  (should be firstfield,secondfield,"Bad data with commas, like here",44,)
firstfield,secondfield,Bad data with commas, like here,92,  (should be firstfield,secondfield,"Bad data with commas, like here",92,)
firstfield,secondfield,No Bad data,44,  (should be firstfield,secondfield,"No Bad data",44,)

So right after the second comma I need to add a ".
Then, when ,44, or ,92, is encountered, place a " before the comma preceding that field.

I've spent a while trying to come up with a strategy and I'm coming up empty.  Any thoughts?
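One way to code the pattern just described, sketched in Python; the 44/92 sentinel values and the two leading fields come from the shortened examples above, and this assumes ,44, or ,92, never occurs inside the company name itself:

```python
import re

# Everything between the second comma and the comma before 44 or 92
# is the company field; wrap that span in double quotes.
PATTERN = re.compile(r'^([^,]*,[^,]*,)(.*?)(,(?:44|92),)')

def quote_third_field(line):
    return PATTERN.sub(
        lambda m: m.group(1) + '"' + m.group(2) + '"' + m.group(3), line)

print(quote_third_field("firstfield,secondfield,Bad data with commas, like here,44,"))
# -> firstfield,secondfield,"Bad data with commas, like here",44,
```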

JTaylor

First question.  Do you *HAVE* to use commas as the delimiters?  I know you get the file that way and can't change that, but does it matter once you have it?

Also, do all the bad commas have a space after them and all the good ones not?

Jim

MW4

No, don't really need commas after I get it, other than it is the field delimiter.
I've thought about just pulling each piece and creating a data table.

Nothing, such as a space, can be counted on:
sometimes--   ABC, inc    sometimes-- ABC,inc    sometimes-- A,B,C,inc

I hate to get too complex, as it is just a fix until June.


MW4

How can I get to the 2nd or 3rd comma position using StrIndex?  Nest them?

JTaylor

Working under the assumption that you have to look at each line... something like the following.  (I have made no attempt to debug this and it is only intended to prompt ideas.  Also, this might be easier with an array.)
Code (winbatch) Select

ow = FileOpen("NewFile.txt","WRITE")
txt = FileGet("file.txt")
txt = StrReplace(txt,@CRLF,@CR)
lcnt = ItemCount(txt,@CR)
For x = 1 to lcnt

  line = ItemExtract(x,txt,@CR)

  icnt = ItemCount(line,",")
  If icnt == 33  Then
    FileWrite(ow,StrReplace(line,",",@TAB))     ;(Or whatever indicates a good line)
  Else
     nline = ItemExtract(1,line,",")   ;THIS ASSUMES FIRST FIELD IS ALWAYS GOOD...MAY BE BAD ASSUMPTION
     For y = 2 to icnt
        itxt = ItemExtract(y,line,",")
        If *field_is_good* Then        ;NOT SURE HOW TO TELL A GOOD FIELD.
           nline = nline:@TAB:itxt
        Else
            y = y + 1
            itxt = '"':itxt:",":ItemExtract(y,line,","):'"'   ;put the joined-over comma back
            ;IF YOU HAVE MULTIPLE COMMAS IN A FIELD THEN YOU WILL HAVE TO SORT THAT OUT HERE AS I DON'T KNOW WHAT CONSTITUTES A GOOD FIELD.
            nline = nline:@TAB:itxt
        EndIf
    Next
    FileWrite(ow,nline)     ;(Or whatever indicates a good line)
  EndIf

Next
FileClose(ow)

MW4

I'm doing something similar; I just need to figure out how to get the index of the 20th comma in a line.

JTaylor

ItemExtract(20, ....) should do that.  If you know it should stay there, then you can join it with ItemExtract(21, ....)... or am I misunderstanding something?  You will, of course, need to put the comma back into the joined text.


Jim
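For the 20th-comma question itself, the StrIndex idea can be looped rather than nested; here is the equivalent walk in Python (0-based positions, where WIL's StrIndex is 1-based, and the function name is my own):

```python
def nth_comma_pos(line, n):
    """Character position of the nth comma, or -1 if there are fewer."""
    pos = -1
    for _ in range(n):
        pos = line.find(",", pos + 1)  # scan past the previous comma
        if pos == -1:
            break
    return pos

print(nth_comma_pos("a,b,c,d", 3))  # -> 5
```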

MW4

OK, ItemExtract is bringing me back the first part of the field before any extra commas, such as ABC company (of ABC company,LLC).

What I need is the line position of the start of that ItemExtract result -- the position of the "A".

JTaylor

Take a look at StrIndex().

I would still recommend an Array approach.   You can then find the offending parts and easily join them in the appropriate Cell.   Then ArrayFilePutCSV() will take care of the quoting for you.

Jim

MW4

I need the 22nd comma position; StrIndex works great for the first one, so I must be missing something.



I have identified the first bad fields and have been able to double quote successfully.

The second bad field is 22 commas away; at that point it needs a double quote to mark the beginning.
Then move to the end of that record and use @BACKSCAN to count 10 commas back to add the other double quote.

Just two bad fields, but no consistent identifying spaces, words, or anything else.


I'm all ears on the array...how?
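The count-forward/backscan idea can be expressed by splitting on commas and rejoining; a Python sketch, where the 22 and 10 come from the post above and the short sample line is made up:

```python
def quote_by_comma_counts(line, open_after, close_from_end):
    """Open a quote after the open_after-th comma and close it before the
    comma that is close_from_end commas from the end of the line."""
    parts = line.split(",")
    head = parts[:open_after]                   # fields before the bad one
    tail = parts[len(parts) - close_from_end:]  # fields after the bad one
    middle = ",".join(parts[open_after:len(parts) - close_from_end])
    return ",".join(head + ['"' + middle + '"'] + tail)

# With the layout described above this would be open_after=22, close_from_end=10.
print(quote_by_comma_counts("a,b,Acme, Inc,c,d", 2, 2))
# -> a,b,"Acme, Inc",c,d
```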

JTaylor

That is why I am recommending an Array approach.   

Jim

MW4

ArrayFileGetCSV gives me an error... the file contains null characters.

MW4

I'm going brute strength....

ThisLine = StrReplace(ThisLine, ", INC.", " INC")
ThisLine = StrReplace(ThisLine, ",LLC", " LLC")



JTaylor

Got to wondering if different-length columns would create a problem... or maybe you really do have NULLs in the file.  Guess you are back to the other approach, or you could just combine the approaches.

Create an array, but read through your file using the old method, populating the array as you go.  You can then join the needed fields and easily output the results.

Jim
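Jim's combined suggestion, sketched in Python for illustration: read the raw lines, repair each one with whatever rule applies (the fixer below just reuses the comma+space guess and is only a stand-in for the real logic worked out in this thread), and let a CSV writer handle the quoting on output, much as ArrayFilePutCSV() would:

```python
import csv
import io

def repair(line):
    """Stand-in repair: treat ', ' as an embedded comma, plain ',' as a delimiter."""
    marker = "\x01"
    parts = line.replace(", ", marker).split(",")
    return [p.replace(marker, ", ") for p in parts]

raw = "1,Acme, Inc,44\n2,Beta Co,92"
out = io.StringIO()
writer = csv.writer(out)          # quotes any field containing a comma
for line in raw.splitlines():
    writer.writerow(repair(line))

print(out.getvalue())
```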