Parsing HTML

Started by JTaylor, February 15, 2018, 12:01:12 PM


JTaylor

I am looking for a way to parse HTML without launching IE or creating a Browser Control in a dialog. I know of the DOM Extender, but it doesn't handle Unicode. I tried "htmlfile" but only get an access-denied response. I read that this could be done with MSHTML but have had no luck on that front.

I have been trying out "HtmlZap", which is both pretty slick and limited (but that may just be my ignorance).

Suggestions? If you are looking for ideas for an Extender, duplicating the DOM Extender would be a good one, or see if those folks are around and would give up the source. They said they would but never followed through :-)

Jim

td

"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

JTaylor

Yes. We are well acquainted, and that is how I found HtmlZap, but I had no other luck after much time, so I thought I would ask here. I am guessing there are some C# libraries that would probably work, but I was hoping someone knew of something a little more straightforward.

Jim

td

I have seen and even used some elegantly written C++ open source code for XML parsing.  There are multiple open source XML parsing projects and you even find well-written comparisons of the various approaches.  I am not aware of the existence of similar coverage of HTML parsing as I have no need but you would think that something exists.   

stanl

Quote from: td on February 15, 2018, 01:57:30 PM
I have seen and even used some elegantly written C++ open source code for XML parsing.  There are multiple open source XML parsing projects and you even find well-written comparisons of the various approaches.  I am not aware of the existence of similar coverage of HTML parsing as I have no need but you would think that something exists.

Possibly the HTMLAgilityPack? https://www.nuget.org/packages/HtmlAgilityPack

[EDIT]:  If I can find it, I put together a short PS script using the AgilityPack DLL and the PowerShell Invoke-WebRequest cmdlet.

stanl

Quote from: stanl on February 16, 2018, 02:54:45 AM

[EDIT]:  If I can find it, I put together a short PS script using the AgilityPack DLL and the PowerShell Invoke-WebRequest cmdlet.

Apologies. I found the script and had actually commented out the AgilityPack and used PS exclusively. The Invoke-WebRequest cmdlet returns everything.

JTaylor

No problem.  Appreciate you looking.   

To help clarify why I pursue some things over others: apart from something doing the job I need done, my next most important criterion is usually ease of distribution. So the most attractive solutions are ones that will already be on the machine, and the next are ones that require nothing but copying the needed files, hence my interest in Extenders. Obviously these two are not always an option, and registering things is sometimes required. If it is something I am using locally then I am wide open to different options, but I distribute a lot of stuff to systems where I have absolutely no say over things, so my interest begins with the first two options. I don't want anyone thinking I am ungrateful for the suggestions, as they may be exactly what is needed or lead to a perfect solution.

Thanks again.

Jim

stanl

NP. You know, depending on how complicated you expect the HTML to be, parsing an HTTP Request with Binary Buffers may be an option.

JTaylor

Yeah...once they get busy and add the requested functions to the Binary Set :)

Jim

kdmoyers

Depending on the exact problem, a RegEx solution often works, using ObjectOpen("VBScript.RegExp"), and requires no external packages.

Using a RegEx to parse legitimate formats like html, xml, json is technically a no-no, since RegExs don't actually understand what they are doing, and are easily defeated by things like comments and nesting.

But in actual usage, sample files are sometimes very simple, and don't feature any of the constructs that would defeat a RegEx.  So for the simple case of "dig the numbers out of the web page", sometimes a few lines of RegEx code will do the job.
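
To make both points concrete, here is a minimal sketch. It uses Python's re module rather than VBScript.RegExp, purely for illustration, and the sample page, the "price" class name and the extract_prices helper are all made up for the example. The pattern digs the numbers out of a very simple page, and then picks up a stale value once a commented-out row is added.

```python
import re

# Hypothetical sample page; the markup and values are invented for illustration.
SIMPLE_PAGE = """
<html><body>
<table>
<tr><td class="price">19.99</td></tr>
<tr><td class="price">5</td></tr>
</table>
</body></html>
"""

# The same page with an old row left behind inside an HTML comment.
PAGE_WITH_COMMENT = SIMPLE_PAGE.replace(
    "<table>",
    '<table>\n<!-- <tr><td class="price">0.00</td></tr> -->',
)

# Capture the numeric text inside <td class="price">...</td>.
PRICE_RE = re.compile(r'<td class="price">\s*(\d+(?:\.\d+)?)\s*</td>')


def extract_prices(html):
    """Return every number the pattern matches, in document order."""
    return [float(m) for m in PRICE_RE.findall(html)]


if __name__ == "__main__":
    # Simple page: only the two real rows are matched.
    print(extract_prices(SIMPLE_PAGE))
    # Commented page: the regex cannot tell that the first row is commented
    # out, so the stale 0.00 value is returned along with the real ones.
    print(extract_prices(PAGE_WITH_COMMENT))
```

The second call is the failure mode described above: the pattern has no notion of comments or nesting, so it is only safe when the input is known to be simple.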

If this sounds interesting, post your sample file and a description of the data you are interested in and I'll code something up.  Might take a few days as I am on the road.
The mind is everything; What you think, you become.

JTaylor

Appreciate the offer, but this is something better suited to a DOM parser. I have some options but was just hoping someone had some ideas that distributed well, didn't require the launching of a browser of some form and that I liked better than what I have :) Some of the ideas here have been helpful, and I am trying out some UDFs based on HtmlZap to see if that is a viable option. I am going to see if the local university has any C++ classes and maybe get a better understanding on that front so writing Extenders doesn't take me so long. If so, I might crank out a few of those that I find useful. The DOM Extender was VERY useful.

Thanks again.

Jim

kdmoyers

The mind is everything; What you think, you become.

stanl

Does anyone remember ChilKat?

JTaylor


Mogens Christensen

"The DOM Extender was VERY useful." !!! Is the DOM Extender unusable? I still use it.

stanl

Quote from: Mogens Christensen on February 21, 2018, 06:55:31 AM
"The DOM Extender was VERY useful." !!! Is the DOM Extender unusable? I still use it.

Not unusable, but limited given when it was written and the advances made to HTML sites since. I think at one time, circa 2011 maybe, someone asked for the C source. I am not sure those conversations are in the WB history archives.

JTaylor

Works fine apart from the fact that it doesn't handle Unicode.  Many of my needs require that capability so had to drop most of my use of it.

Jim