WinBatch® Technical Support Forum

All Things WinBatch => WinBatch => Topic started by: JTaylor on February 15, 2018, 12:01:12 PM

Title: Parsing HTML
Post by: JTaylor on February 15, 2018, 12:01:12 PM
I am looking for a way to parse HTML without launching IE or creating a Browser Control in a dialog. I know of the DOM Extender, but it doesn't handle Unicode. I tried "htmlfile" but only get an access denied response.
I read that this could be done with MSHTML, but I had no luck on that front.

I have been trying out "HtmlZap", which is both pretty slick and limited (but that may just be my ignorance).

Suggestions? If anyone is looking for ideas for an Extender, duplicating the DOM Extender would be a good one...or see if those folks are around and would give up the source. They said they would but never followed through :-)

Jim
Title: Re: Parsing HTML
Post by: td on February 15, 2018, 12:59:37 PM
Google is your friend.
Title: Re: Parsing HTML
Post by: JTaylor on February 15, 2018, 01:13:58 PM
Yes. We are well acquainted, and that is how I found HtmlZap, but I had no other luck after much time, so I thought I would ask here. I am guessing there are some C# libraries that would probably work, but I was hoping someone knew of something a little more straightforward.

Jim
Title: Re: Parsing HTML
Post by: td on February 15, 2018, 01:57:30 PM
I have seen and even used some elegantly written C++ open source code for XML parsing. There are multiple open source XML parsing projects, and you can even find well-written comparisons of the various approaches. I am not aware of the existence of similar coverage of HTML parsing, as I have no need, but you would think that something exists.

Title: Re: Parsing HTML
Post by: stanl on February 16, 2018, 02:54:45 AM
Quote from: td on February 15, 2018, 01:57:30 PM
I have seen and even used some elegantly written C++ open source code for XML parsing. There are multiple open source XML parsing projects, and you can even find well-written comparisons of the various approaches. I am not aware of the existence of similar coverage of HTML parsing, as I have no need, but you would think that something exists.

Possibly the HTMLAgilityPack? https://www.nuget.org/packages/HtmlAgilityPack

[EDIT]: If I can find it, I will post a short PS script I put together using the AgilityPack DLL and the PS Invoke-WebRequest cmdlet.
Title: Re: Parsing HTML
Post by: stanl on February 18, 2018, 07:02:17 AM
Quote from: stanl on February 16, 2018, 02:54:45 AM

[EDIT]: If I can find it, I will post a short PS script I put together using the AgilityPack DLL and the PS Invoke-WebRequest cmdlet.

Apologies. I found the script, and I had actually commented out the AgilityPack and used PS exclusively. The PS Invoke-WebRequest cmdlet returns everything.
Title: Re: Parsing HTML
Post by: JTaylor on February 18, 2018, 07:39:28 AM
No problem.  Appreciate you looking.   

To help clarify why I pursue some things over others: apart from something doing the job I need done, my next most important criterion is usually ease of distribution. So, the most attractive solutions are ones that will already be on the machine, and the next are ones that require nothing but copying the needed files, hence my interest in Extenders. Obviously these two are not always an option, and registering things is sometimes required. If it is something I am using locally then I am wide open to different options, but I distribute a lot of stuff to systems where I have absolutely no say over things, so my interest begins with the first two options. I don't want anyone thinking I am ungrateful for the suggestions, as they may be exactly what is needed or lead to a perfect solution.

Thanks again.

Jim
Title: Re: Parsing HTML
Post by: stanl on February 19, 2018, 03:04:58 AM
NP. You know, depending on how complicated you expect the HTML to be, parsing an HTTP Request with Binary Buffers may be an option.
Title: Re: Parsing HTML
Post by: JTaylor on February 19, 2018, 07:39:33 AM
Yeah...once they get busy and add the requested functions to the Binary Set :)

Jim
Title: Re: Parsing HTML
Post by: kdmoyers on February 20, 2018, 07:29:55 AM
Depending on the exact problem, a RegEx solution often works, using ObjectOpen("VBScript.RegExp"), and requires no external packages.

Using a RegEx to parse legitimate formats like html, xml, json is technically a no-no, since RegExs don't actually understand what they are doing, and are easily defeated by things like comments and nesting.

But in actual usage, sample files are sometimes very simple, and don't feature any of the constructs that would defeat a RegEx.  So for the simple case of "dig the numbers out of the web page", sometimes a few lines of RegEx code will do the job.
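
To illustrate both halves of that point, here is a minimal sketch. It is written in Python rather than WinBatch, so it uses Python's `re` module in place of the VBScript.RegExp object mentioned above, and the sample pages, the `price` class name, and the `extract_prices` helper are all made up for illustration. On the simple page the pattern pulls out the numbers as intended; on the second page a commented-out row is picked up by mistake and a value nested inside a `<span>` is missed.

```python
import re

# Hypothetical sample page: flat markup, no comments, no nesting.
SIMPLE_PAGE = """
<table>
  <tr><td class="price">19.99</td></tr>
  <tr><td class="price">5.00</td></tr>
</table>
"""

# Hypothetical variant: one row is commented out and one value
# is wrapped in a nested <span>.
TRICKY_PAGE = """
<table>
  <!-- <tr><td class="price">99.99</td></tr> -->
  <tr><td class="price"><span>19.99</span></td></tr>
  <tr><td class="price">5.00</td></tr>
</table>
"""

# Matches a price cell whose content is only digits and dots.
PRICE_RE = re.compile(r'<td class="price">([0-9.]+)</td>')


def extract_prices(html):
    """Return every captured price string, in document order."""
    return PRICE_RE.findall(html)


simple_prices = extract_prices(SIMPLE_PAGE)
tricky_prices = extract_prices(TRICKY_PAGE)

print(simple_prices)  # ['19.99', '5.00']
print(tricky_prices)  # ['99.99', '5.00'] : comment matched, nested value missed
```

The pattern has no notion of comments or element nesting, which is why the second result is wrong in both directions. For pages known to be as flat as the first sample, a few lines like this may well be enough.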

If this sounds interesting, post your sample file and a description of the data you are interested in and I'll code something up.  Might take a few days as I am on the road.
Title: Re: Parsing HTML
Post by: JTaylor on February 20, 2018, 08:20:40 AM
Appreciate the offer, but this is something better suited to a DOM parser. I have some options but was just hoping someone had some ideas that distributed well, didn't require launching a browser of some form, and that I liked better than what I have :) Some of the ideas here have been helpful, and I am trying out some UDFs based on HtmlZap to see if that is a viable option. I am going to see if the local university has any C++ classes and maybe get a better understanding on that front so writing Extenders doesn't take me so long. If so, I might crank out a few that I find useful. The DOM Extender was VERY useful.

Thanks again.

Jim
Title: Re: Parsing HTML
Post by: kdmoyers on February 20, 2018, 08:49:35 AM
Quote from: JTaylor on February 20, 2018, 08:20:40 AM
The DOM Extender was VERY useful.

Yes it was!! I miss it.
Title: Re: Parsing HTML
Post by: stanl on February 20, 2018, 12:44:51 PM
Does anyone remember Chilkat?
Title: Re: Parsing HTML
Post by: JTaylor on February 20, 2018, 01:08:00 PM
Name is familiar.

Jim
Title: Re: Parsing HTML
Post by: Mogens Christensen on February 21, 2018, 06:55:31 AM
"The DOM Extender was VERY useful." !!! Is the DOM Extender unusable? I still use it.
Title: Re: Parsing HTML
Post by: stanl on February 21, 2018, 07:02:30 AM
Quote from: Mogens Christensen on February 21, 2018, 06:55:31 AM
"The DOM Extender was VERY useful." !!! Is the DOM Extender unusable? I still use it.

Not unusable, but limited given when it was written and the advances made to HTML sites since. I think at one time, circa maybe 2011, someone asked for the C source. I am not sure those conversations are in the WB history archives.
Title: Re: Parsing HTML
Post by: JTaylor on February 21, 2018, 07:45:52 AM
Works fine apart from the fact that it doesn't handle Unicode. Many of my needs require that capability, so I had to drop most of my use of it.

Jim