HtmlAgilityPack

Started by JTaylor, April 07, 2020, 08:34:08 PM


JTaylor

For anyone looking for an HTML parser...

I often need to parse HTML outside of IE and have been looking for a replacement for the DOM Extender, since it will not handle Unicode. I have tried many things with little success but ran across HtmlAgilityPack. I thought I would see if I could make it COM accessible, but it turns out it is already set up for that. If you download the project, compile the NET45 project, and then register the DLL using regasm, it will work fairly well. Might be useful to someone, so I thought I would mention it. Here is an example script. It uses XPath syntax.

Code (winbatch)


html = FileGet("test.html")

Message("HTML",html)

hDoc      = ObjectCreate("HtmlAgilityPack.HtmlDocument")

hDoc.LoadHtml(html)

hBody = hDoc.DocumentNode.SelectSingleNode("//body")
Message("Single Node - Body Node",hBody)

htm = hBody.InnerHtml
Message("Body Node - InnerHtml",htm)

divNodes = hDoc.DocumentNode.SelectNodes("//div")
For x = 0 to divNodes.count-1
  div_txt = divNodes.Item(x).InnerHtml
  Message("Multiple Nodes - Node# %x%",div_txt)
Next



divNodes = hDoc.DocumentNode.SelectNodes("//div[@data-asin]")   ;DIVs with data-asin attribute
For x = 0 to divNodes.count-1
  div_txt = divNodes.Item(x).GetAttributeValue("data-asin", 1)
  Message("Attributes - Node# %x%",div_txt)
Next



ObjectClose(hDoc)





Jim

stanl

I used one of the first releases in 2012. You can reference the .dll directly from PowerShell and then wrap the PS code in WB.


add-type -Path 'C:\Agility\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$result = $HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/quotations.html"))
$HTMLDocument.DocumentNode.SelectSingleNode("(//table)[2]").InnerText

JTaylor

Okay. I discovered that many of the Methods were not COM accessible because they used Generic types, so I spent the day making most things work with WinBatch. There are MANY more Methods than what the online documentation shows and more than what you will find in my test script. I can't guarantee they will all work, but I altered most that I thought would be useful, and anything in my test script should work. I did comment out the Messages, so you will need to uncomment any for which you wish to see the results.

I am attaching the altered HtmlNode.cs and htmlNodeCollections.cs files.  Just rename the original one and replace it with mine if you wish to live dangerously.   I am also including the test script and test.html file.   You will need to change DOMEx to HtmlAgilityPack in the test script, unless you change your namespace like I did.

Don't laugh too much at my code.   I am learning but very much a noob when it comes to C#.   No guarantees as I have only "tested" this and not used it in production so may encounter issues under real use.   Let me know of anything you find and I will fix it.  Hopefully these are the only ones I changed that you need for WinBatch.

Jim

stanl

Good stuff.  I can't find regasm and without all of the warnings not about to download. The initial reason I went the powershell route was to be able to access the dll without a lot of hoops. To my chagrin, I confused PS add-type with WB's CLR 'appbase' - so this fails
Code (WINBATCH)


ObjectClrOption('appbase','C:\Scripts\Powershell\HtmlAgilityPack.dll')
oHTML = ObjectClrNew("HtmlAgilityPack.HtmlDocument")



Got plenty of time, so maybe I'll try gacutil.exe and if that works maybe can help with DetectEncoding (at the bottom of your test.wbt file).


EDIT:  Not bad, ran gacutil.exe -i HtmlAgilityPack.dll and this worked
Code (WINBATCH)


ObjectClrOption('useany','HtmlAgilityPack')
oHTML = ObjectClrNew("HtmlAgilityPack.HtmlDocument")
Message("HtmlAgilityPack",oHTML)



but to move forward probably have to include System.Net.WebClient as container.

stanl

Got this to work using WB CLR (after using gacutil to insert HtmlAgilityPack.dll into GAC). If Tony looks at this thread he can probably school me on creating a secure channel so that oWeb.DownloadData("http://www.simple-talk.com/blogbits/philf/quotations.html") doesn't error.


Not sure if using CLR can contribute any more to your use of dll through COM.
Code (WINBATCH)


IntControl(73,1,0,0,0)
gosub udfs
html=""
ObjectClrOption("useany", "System.Data")
oWeb = ObjectClrNew('System.Net.WebClient') ;to obtain html string from URL
ObjectClrOption('useany','HtmlAgilityPack')
oHTML = ObjectClrNew("HtmlAgilityPack.HtmlDocument")


Message("HtmlAgilityPack",oHTML) ;test that object is created


;this will error: could not create SSL/TLS secure channel
;html = oWeb.DownloadData("http://www.simple-talk.com/blogbits/philf/quotations.html")
;Message("Test",html)


test() ;sample html string
oHTML.LoadHtml(html)
htmlBody = oHTML.DocumentNode.SelectSingleNode("//body")


Message("Test",htmlBody.OuterHtml)
oWeb=0
oHTML=0
Exit


:WBERRORHANDLER
oDiag=0
geterror()
Message("Error Encountered",errmsg)
Exit


:udfs
#DefineSubRoutine geterror()
   wberroradditionalinfo = wberrorarray[6]
   lasterr = wberrorarray[0]
   handlerline = wberrorarray[1]
   textstring = wberrorarray[5]
   linenumber = wberrorarray[8]
   errmsg = "Error: ":lasterr:@LF:textstring:@LF:"Line (":linenumber:")":@LF:wberroradditionalinfo
   Return(errmsg)
#EndSubRoutine


#DefineSubRoutine test()
;simple test for HTML string
html = $"<!DOCTYPE html>
<html>
<body>
   <h1>This is bold heading</h1>
   <p>This is underlined paragraph</p>
   <h2>This is italic heading</h2>
</body>
</html>
"$
Return(html)
#EndSubRoutine
Return




EDIT: Jim, this worked for DetectEncoding [partial code]


Code (WINBATCH)


oHTML.LoadHtml(html)
oHTML.Save("C:\temp\test.html")
encoding = oHTML.DetectEncoding("C:\temp\test.html")
Message("Encoding",encoding.ToString())

JTaylor

Thanks.   Should have thought about the GAC...probably wouldn't have had to rewrite stuff but I did learn a lot as well as discovering that the documentation doesn't cover very much. 

Also, some of what they did didn't appear to make much sense, especially in regards to optional parameters.   Maybe I missed the point but redid most of those by removing what I considered unnecessary duplication of methods.   They created two methods.  One without a parameter whose only purpose seemed to be to call the other function with the desired default parameter.

Regasm.exe can be found under the C:\Windows\Microsoft.NET\Framework\ or C:\Windows\Microsoft.NET\Framework64\ path, depending on the versions you have installed. There is one for every version, but I don't know that it matters which you use. The more recent the better, I would assume. Again, you may not even need it since you are using the GAC.
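
For anyone hunting for it: .NET 4.5 through 4.8 install in place over 4.0, so their regasm.exe lives in the v4.0.30319 folder rather than in a folder named for the newer version. Below is a rough sketch of registering the DLL from a script. The DLL path is hypothetical, the Framework64 folder assumes a 64-bit caller (use Framework for 32-bit), and I have not run this exact script, so treat it as a starting point only.

```winbatch
; Assumed locations: adjust to the machine and to wherever the DLL was built.
regasm  = StrCat(DirWindows(0), "Microsoft.NET\Framework64\v4.0.30319\regasm.exe")
dllPath = "C:\Agility\HtmlAgilityPack.dll"   ; hypothetical path

If FileExist(regasm) == 0
   Message("regasm", StrCat("Not found: ", regasm))
   Exit
EndIf

; /codebase records the DLL's location in the registry so it does not have to be in the GAC.
; Registration writes to HKLM, so the script needs to run elevated.
RunWait(regasm, StrCat('"', dllPath, '" /codebase'))
```

The bitness of the regasm you pick should match the bitness of the WinBatch that will call ObjectCreate, otherwise the registration lands in the wrong registry view.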

Had actually forgotten about the note on DetectEncoding.   Thanks.   I will have to test that again.

Jim

JTaylor

I still get the "specified ole variant is invalid" error on DetectEncoding through COM. I don't know that I would use that Method, but I was trying to make "everything" work.

Jim

stanl

As I implied: COM vs. CLR is six of one....


Re-read some original posts, when I first brought up https://api.pwnedpasswords.com/range/ and needed TLS security. Tony provided code to resolve
Code (WINBATCH)


objUri = ObjectClrNew('System.Uri', strUrl)
objSvcManager = ObjectClrNew('System.Net.ServicePointManager')
protocols = ObjectClrType("System.Net.SecurityProtocolType",3072|768) 
objSvcManager.SecurityProtocol = protocols
objSvcPoint = objSvcManager.FindServicePoint(objUri)



so think I can work this in to use HtmlAgility against URL's. 
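
For reference, the numbers being OR'd together are members of the .NET System.Net.SecurityProtocolType flags enum: 48 is Ssl3, 192 is Tls (1.0), 768 is Tls11 and 3072 is Tls12. So 3072|768 asks for TLS 1.2 plus TLS 1.1. Here is a minimal sketch of the same idea with the values named. The variable names are only illustrative, and loading "System" with useany is my assumption about what a standalone script needs.

```winbatch
; Named values of System.Net.SecurityProtocolType (a flags enum)
SSL3  = 48     ; Ssl3
TLS10 = 192    ; Tls
TLS11 = 768    ; Tls11
TLS12 = 3072   ; Tls12

ObjectClrOption("useany", "System")   ; assumption: ServicePointManager is loaded from System.dll
objSvcManager = ObjectClrNew("System.Net.ServicePointManager")

; Same request as 3072|768 in the snippet above: TLS 1.2 plus TLS 1.1
protocols = ObjectClrType("System.Net.SecurityProtocolType", TLS12 | TLS11)
objSvcManager.SecurityProtocol = protocols
```

Leaving SSL3 and TLS10 out of the mask is a deliberate choice in this sketch, since many servers no longer accept them. Add them back with | only if an older server needs them.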


In reference to regasm.exe - I did a complete search up through .NET 4.8 and it doesn't exist anywhere, and I have version 1909 on my Win10 machine.  In any event, my biggest question is why 'appbase' fails if the .dll can easily be placed in the GAC and then used by CLR?

JTaylor

Interesting.   I did notice that it doesn't show in anything past 4.0 so maybe that has gone away.   I am running that version of Win10 as well.  I can send you a copy if you want to try it out.   If so, just email directly so I have your current email.

Maybe Tony has anthropomorphized his computer and now has to stay 6 feet away?   ;)

In any event, appreciate your input.   I am starting some real world use of it to see how it goes.   I have about 70 scripts where I use the DOM Extender so will change over a couple and see if it is stable in production and then work through the others if all is good.   This will be a big help with the foreign language stuff I have to parse.

Jim

stanl

Figured out the URL stuff... this worked via WB CLR with .dll in GAC
Code (WINBATCH)


IntControl(73,1,0,0,0)
gosub udfs
html=""
ObjectClrOption("useany", "System.Data")
oWeb = ObjectClrNew('System.Net.WebClient') ;to obtain html string from URL
ObjectClrOption('useany','HtmlAgilityPack')
oHTML = ObjectClrNew("HtmlAgilityPack.HtmlDocument")
;oNode = ObjectClrNew("HtmlAgilityPack.HtmlDocument.HtmlNode")


Message("HtmlAgilityPack",oHTML) ;test that object is created
;avoid secure channel errors
strUrl="http://www.simple-talk.com/blogbits/philf/quotations.html"
objUri = ObjectClrNew('System.Uri', strUrl)
objSvcManager = ObjectClrNew('System.Net.ServicePointManager')
protocols = ObjectClrType("System.Net.SecurityProtocolType",3072|768|192|48|0) 
objSvcManager.SecurityProtocol = protocols
objSvcPoint = objSvcManager.FindServicePoint(objUri)


html = oWeb.DownloadString(strUrl)


oHTML.LoadHtml(html)
htmlBody = oHTML.DocumentNode.SelectSingleNode("(//table)[2]").InnerText
Message("Insults",htmlBody)
oHTML.Save("C:\temp\test.html")
encoding = oHTML.DetectEncoding("C:\temp\test.html")
Message("Encoding",encoding.ToString())


oWeb=0
oHTML=0
Exit


:WBERRORHANDLER
oDiag=0
geterror()
Message("Error Encountered",errmsg)
Exit


:udfs
#DefineSubRoutine geterror()
   wberroradditionalinfo = wberrorarray[6]
   lasterr = wberrorarray[0]
   handlerline = wberrorarray[1]
   textstring = wberrorarray[5]
   linenumber = wberrorarray[8]
   errmsg = "Error: ":lasterr:@LF:textstring:@LF:"Line (":linenumber:")":@LF:wberroradditionalinfo
   Return(errmsg)
#EndSubRoutine


Return



JTaylor

Excellent.  Thanks again.

I haven't gotten around to testing that via COM yet.   Will post my script as well if it works.

Jim

JTaylor

Here is COM version for loading from URL.

Jim

Code (winbatch)


url  = "http://www.yourserver.com/test.html"

hDoc      = ObjectCreate("DomEx.HtmlDocument")
htmlNode  = ObjectCreate("DomEx.HtmlNode")
hWeb      = ObjectCreate("DomEx.HtmlWeb")

hDoc = hWeb.Load(url)

hBody = hDoc.DocumentNode.SelectSingleNode("//body")
htm = hBody.OuterHtml
Message("Body Node - OuterHtml",htm)


divNodes = hDoc.DocumentNode.SelectNodes("//div[@data-asin]")
For x = 0 to divNodes.count-1

  div_txt = divNodes.Item(x).OuterHTML
  Message("Get Data ASIN Attribute",div_txt)

Next

ObjectClose(hDoc)
ObjectClose(hWeb)
ObjectClose(htmlNode)



stanl

Nice. Still waiting for Tony's comment on why 'appbase' fails even though CLR works with .dll if placed in GAC.


EDIT: no need, just make the first line: ObjectClrOption("Appbase", 'C:\scripts\powershell\') ;or wherever you place the .dll. Appbase takes the folder that holds the .dll, not the path of the .dll itself. So now you can use the dll without either regasm or gacutil.

JTaylor

Most Excellent.  I thought there was a way to do that but found what I was thinking about and it was something different so gave up.  Thank you.

Jim

JTaylor

Does this work for you?  I get an error on "hDoc.LoadHtml(html) ".   I have not loaded this into GAC.

Jim

Code (winbatch)


html = FileGet("hap_test.html")

ObjectClrOption("Appbase", DirScript())
ObjectClrOption('useany','HtmlAgilityPack')
hDoc  = ObjectClrNew("HtmlAgilityPack.HtmlDocument")

hDoc.LoadHtml(html)


stanl


JTaylor

Same error.   So it works for you?   GAC may have affected that I guess???

Jim

stanl

Hmmm.  I ran gacutil.exe /u HTMLAgilityPack which uninstalled from GAC. I then ran my script with Appbase and it worked. I did notice a slight time difference in execution but could be the Internet.
Code (WINBATCH)


IntControl(73,1,0,0,0)
gosub udfs
html=""
ObjectClrOption("Appbase", 'C:\scripts\powershell\')
ObjectClrOption("useany", "System.Data")
oWeb = ObjectClrNew('System.Net.WebClient') ;to obtain html string from URL
ObjectClrOption('use','HtmlAgilityPack')
oHTML = ObjectClrNew("HtmlAgilityPack.HtmlDocument")
;avoid secure channel errors
strUrl="http://www.simple-talk.com/blogbits/philf/quotations.html"
objUri = ObjectClrNew('System.Uri', strUrl)
objSvcManager = ObjectClrNew('System.Net.ServicePointManager')
protocols = ObjectClrType("System.Net.SecurityProtocolType",3072|768|192|48|0) 
objSvcManager.SecurityProtocol = protocols
objSvcPoint = objSvcManager.FindServicePoint(objUri)


html = oWeb.DownloadString(strUrl)


oHTML.LoadHtml(html)
htmlBody = oHTML.DocumentNode.SelectSingleNode("(//table)[2]").InnerText
Message("Insults",htmlBody)
;save html and display encoding
oHTML.Save("C:\temp\test.html")
encoding = oHTML.DetectEncoding("C:\temp\test.html")
Message("Encoding",encoding.ToString())


oWeb=0
oHTML=0
Exit


:WBERRORHANDLER
oWeb=0
oHTML=0
geterror()
Message("Error Encountered",errmsg)
Exit


:udfs
#DefineSubRoutine geterror()
   wberroradditionalinfo = wberrorarray[6]
   lasterr = wberrorarray[0]
   handlerline = wberrorarray[1]
   textstring = wberrorarray[5]
   linenumber = wberrorarray[8]
   errmsg = "Error: ":lasterr:@LF:textstring:@LF:"Line (":linenumber:")":@LF:wberroradditionalinfo
   Return(errmsg)
#EndSubRoutine


Return

JTaylor

Happy Easter!!!

Here is the error I get.   Maybe I need to go back and Build the original project and see what happens???

Jim

Error: 1261
Ole: Exception
Line (21)
COM/CLR Exception:

    HtmlAgilityPack
    Object reference not set to an instance of an object.

stanl

I didn't really build anything. Just downloaded the latest html-agility-pack-master.zip, unzipped, and fleshed out the .dll to my scripts\powershell folder. Oh, and I had re-booted after uninstalling from the GAC and Appbase still works. On a tangent right now trying to integrate a filestream object into the mix as it seems to offer a way to save in specified encoding. Tried accessing a URL as a webstream and seeing if Agility can decode - but error was "URI encoding not supported".... but this has been fun

JTaylor

Apparently it is something with the changes I made because if I Build the solution right out of the box it works but I am back to the "Generic Types cannot be Marshaled to the COM Interface Pointers" message for some of the Methods even though I am using the CLR stuff via WinBatch.

Jim

stanl

So, bottom line: were you able to get CLR to initiate at all?

JTaylor

Yes, with the version right out of the box,  but many of the methods are unusable because it still gives me the same error I was getting via COM before all my changes.   

    "Generic Types cannot be Marshaled to the COM Interface Pointers"

Jim

JTaylor

Is WinBatch able to create a "Collection" natively?   That is, a list that ForEach can be used with?

Jim

JTaylor

FINALLY!!! Tracked down the problem with using CLR and my version.    Related to the Generic Problem but wasn't nice enough to tell me that.   It is their use of "Dictionary", which is a Generic Collection.   IDictionary is supposedly a non-generic version, which should theoretically work but haven't cracked the code yet.   Only took me a couple of hours and lots of MessageBoxes to track it down.

Jim

JTaylor

Just to pose this as a question...Is there a way to overcome the  "Generic Types cannot be Marshaled to the COM Interface Pointers" issue with some setting on compile or initialization or does it require a rewrite?   I load it as follows.  The load works fine and some methods are good.   Just those passing back Dictionaries and other such generic types.  Thanks.


Code (winbatch)


html = FileGet("hap_test.html")

ObjectClrOption("Appbase", DirScript())
ObjectClrOption('use','HtmlAgilityPack')
hDoc = ObjectClrNew("HtmlAgilityPack.HtmlDocument")

hDoc.LoadHtml(html)




Jim

stanl

Maybe a dumb answer, but since I never encountered that error, is it possible that your running REGASM is somehow interfering with loading via the CLR?

JTaylor

I unregistered it, but it is possible, I suppose. When registered it won't load, if I remember correctly. I have tried so many things it gets fuzzy. Here is one that fails.

Code (winbatch)

divNodes = hDoc.DocumentNode.SelectNodes("//div[@data-asin]")
ForEach iNode in divNodes

  div_txt = iNode.GetDataAttribute("asin")        ;Returns the attribute named "data-"+parameter
  Message("Get Data ASIN Attribute",div_txt.Value)

  div_txt = iNode.GetDataAttributes()             ;Returns Attributes that begin with "data-"
  ForEach y in div_txt
    Message("Get Data Attributes",y.Value)
  Next

Next


Jim

JTaylor

I went through the registry and cleaned out anything related but still fails.   I know everything you posted would have worked.   Appreciate it if you tried that last bit of code and see what happens. 

Jim