WB Webscraping 101

Started by spl, July 30, 2024, 12:57:16 PM


spl

My default browser is DuckDuckGo.... I have worked with both Agility Pack and Selenium for web scraping, and have submitted code for Power Query where the url contains tables. Now I am trying to code for itemizing a drill-down, using a public test url [Jim should like it because he works with books and has probably already parsed this]:

https://books.toscrape.com/

then to move down 2 levels and come up with the attached [tab-delimited] output.

[EDIT] Looking at AngleSharp, a C# open-source HTML parser.
Stan - formerly stanl [ex-Pundit]

bottomleypotts

Isn't this just a straightforward WinBatch app to scrape and search? I wouldn't contemplate using anything but WinBatch to accomplish this. The HTML is not difficult to decipher. What is WinBatch not doing that you need an external HTML parser?

spl

Quote from: bottomleypotts on July 31, 2024, 12:45:36 AM
Isn't this just a straightforward WinBatch app to scrape and search?

Not really. It is a test url for how one might code to get specific information. Not that it cannot be parsed in WB, but WB lacks internal DOM or XPath support, which can treat HTML as objects or array items. In this case, I tested to
  • Get all links for the book categories
  • Then for each category obtain links to individual books
 

At that point, one can send an HTTP request and obtain values from individual HTML tags. The site heavily uses JavaScript, DIVs, and classes. I wanted to include Availability in the export file I attached, but I was parsing from an <li> class and availability kept coming up 0. I did see it in a <table> class, so figured I could drop back to Power Query and obtain results. Ideally, I would display the final HTML requests for individual books in a list box, or the categories => books... whatever.
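For the first level, something like this plain-WB scan is what I mean [WinHttp plus simple string scanning; the href pattern is specific to this site, and the loop is only a sketch]:

#DefineFunction GetPage(url)
oHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
oHTTP.Open("GET",url,@FALSE)
oHTTP.Send()
html = oHTTP.ResponseText
oHTTP = 0
Return html
#EndFunction

base = "https://books.toscrape.com/"
html = GetPage(base)
; level 1: on the home page every category link looks like
; catalogue/category/books/<name>_<n>/index.html
catList = ""
pos = 1
While @TRUE
   pos = StrIndex(html,'href="catalogue/category/books/',pos,@FWDSCAN)
   If pos == 0 Then Break
   epos = StrIndex(html,'"',pos+6,@FWDSCAN)   ; closing quote of the href
   catList = catList : base : StrSub(html,pos+6,epos-pos-6) : @TAB
   pos = epos
EndWhile
; level 2 would run the same loop over each category page, resolving the
; relative ../../../<book>_<n>/index.html links against the catalogue folder
Message("Categories",StrReplace(catList,@TAB,@LF))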

Code for a single book, with the attached Excel output:
; Winbatch 2024A - Power Query to Obtain Table Values
;===============================================================================================================
gosub udfs
IntControl(73,1,0,0,0)

url = "https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html"
; try any of the other urls in the zip file I posted
qry = "Book"
mcode = $"let
    Source = Web.Page(Web.Contents("|url|")),
    Data0 = Source{0}[Data],
    ##"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}}),
    ##"Renamed Columns" = Table.RenameColumns(##"Changed Type",{{"Column1", "Item"}, {"Column2", "Value"}})
in
    ##"Renamed Columns"
$"

Display(2,"Creating Table","From ":url)
mcode = StrReplace(mcode,"|url|",url)
xl = CreateObject("Excel.Application")
xl.Visible = @true
xl.ScreenUpdating = @true
xl.UserControl = @true
xl.DisplayAlerts = @false
wk = xl.workbooks.add()
ws = wk.Worksheets.Item(1)
ws.Name = qry
xl.ActiveWorkBook.Queries.Add(::Name=qry,Formula=mcode,Description="Query")
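; the Mashup OLEDB provider below reads the query just added as a workbook-scoped data source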
cSource = "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=%qry%;Extended Properties=''"
qt = ws.ListObjects.Add(::SourceType=0,Source=cSource,Destination=ws.Range("$A$1")).QueryTable
qt.CommandText = "Select * FROM [%qry%]"
qt.Refresh()
; finish up
ws=0
wk=0
xl=0

Message("Table Transformed","Save or Discard Excel File")
Exit

:WBERRORHANDLER
ws=0
wk=0
xl=0
geterror()
Message("Error Encountered",errmsg)
Exit

;===============================================================================================================

:udfs
#DefineSubRoutine geterror()
   wberroradditionalinfo = wberrorarray[6]
   lasterr = wberrorarray[0]
   handlerline = wberrorarray[1]
   textstring = wberrorarray[5]
   linenumber = wberrorarray[8]
   errmsg = "Error: ":lasterr:@LF:textstring:@LF:"Line (":linenumber:")":@LF:wberroradditionalinfo
   Return(errmsg)
#EndSubRoutine

Return
;===============================================================================================================


Stan - formerly stanl [ex-Pundit]

bottomleypotts

Gotcha. I have written 100s of bots to scrape, and I would never contemplate doing things this way; I prefer to pull the page and, because everything is done with a template, search out the data I need.

This appeared to take 5-10 seconds to process one page. Is that common?

spl

Quote from: bottomleypotts on August 01, 2024, 06:50:38 AM
Gotcha. I have written 100s of bots to scrape, and I would never contemplate doing things this way; I prefer to pull the page and, because everything is done with a template, search out the data I need.

This appeared to take 5-10 seconds to process one page. Is that common?

Not sure you actually looked at the main url. It is not a one-page 'tells-all'. 90% of it is drilling down to the page you want, then searching the data or individual attributes. Yeah, it took about a minute, and I'm not really satisfied with the results. I put it in Excel, although I would prefer a database table [which is a few more lines]. And, to confirm: I'm not new at this, having parsed web data, tables, and APIs for years and assisted with Tech Database articles.

I was more or less looking into the test scraping url to come up with generic UDFs to handle multiple sites with the same product drill-downs [rough skeleton below]... i.e. 'the book you want costs $$, but costs less => here'.
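Roughly, the generic UDF I have in mind [names and parameters purely illustrative, body omitted]:

#DefineFunction GetProductLinks(baseUrl,catPattern,bookPattern)
   ; 1) scan baseUrl for hrefs matching catPattern
   ; 2) scan each category page for hrefs matching bookPattern
   ; 3) return a tab-delimited list of absolute product urls
   Return ""
#EndFunction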

I appreciate your comments, as the majority of the code I post doesn't get any. If you have a template for iterating a site such as I posted, I would be very interested. And good luck with Whisper.
Stan - formerly stanl [ex-Pundit]

bottomleypotts

I appreciate what everyone contributes to this forum. Especially code. It's a great opportunity to learn.

Here is how I would do things.

#DefineFunction BinarySearch(strThisString,beginSearch,endSearch)
bb=BinaryAlloc(StrLen(strThisString))
BinaryPokeStr(bb,0,strThisString)
foundString=``
s=BinaryTagInit(bb,beginSearch,endSearch)
s=BinaryTagFind(s)
If s!=`` Then   foundString=BinaryTagExtr(s,1)
BinaryFree(bb)
bb=0
Return foundString
#EndFunction

#DefineFunction UTF8toANSI(strUTF8)
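; round-trips the string through Unicode at codepage 65001 so UTF-8 bytes
; [e.g. the £ sign in the price fields below] come back as clean ANSI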
intCP=ChrSetCodepage(65001); translate using UTF-8
strUni=ChrStringToUnicode(strUTF8); convert the UTF-8 string to Unicode
intCP=ChrSetCodepage(intCP); back to the original code page
Return ChrUnicodeToString(strUni); convert the Unicode string to ANSI
#EndFunction

#DefineFunction getProduct(url) ; GET the page and parse the product table
oHTTP=CreateObject(`WinHttp.WinHttpRequest.5.1`)
Em=ErrorMode(@OFF)
oHTTP.Open(`GET`,url,@FALSE)
oHTTP.Send()
oHTTP.WaitForResponse()
webResponse=oHTTP.ResponseText
ErrorMode(Em)
productData=MapCreate()
If webResponse!=``
    tableData=BinarySearch(webResponse,`<table class="table table-striped">`,`</table>`)
    If tableData!=``
        productData[`UPC`]=BinarySearch(tableData,`<th>UPC</th><td>`,`</td>`)
        productData[`productType`]=BinarySearch(tableData,`<th>Product Type</th><td>`,`</td>`)
        productData[`priceEx`]=UTF8toANSI(BinarySearch(tableData,`<th>Price (excl. tax)</th><td>`,`</td>`))
        productData[`priceInc`]=UTF8toANSI(BinarySearch(tableData,`<th>Price (incl. tax)</th><td>`,`</td>`))
        productData[`tax`]=UTF8toANSI(BinarySearch(tableData,`<th>Tax</th><td>`,`</td>`))

        productData[`availability`]=BinarySearch(BinarySearch(tableData,`<th>Availability</th>`,`</tr>`),`<td>`,`</td>`)
        productData[`noReviews`]=BinarySearch(BinarySearch(tableData,`<th>Number of reviews</th>`,`</tr>`),`<td>`,`</td>`)
    EndIf
EndIf
Return productData
#EndFunction

ourProduct=getProduct(`https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html`)
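; quick look at the returned map [keys as set in getProduct above;
; these two Message calls are just for illustration]
Message(`UPC`,ourProduct[`UPC`])
Message(`Availability`,ourProduct[`availability`])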

Exit

spl

Quote from: bottomleypotts on August 01, 2024, 07:04:34 PM
I appreciate what everyone contributes to this forum. Especially code. It's a great opportunity to learn.

Here is how I would do things.
...........

Nice set of functions. And, to avoid going to war in this thread... I do have to respond that while it might get the specific data, it misses the point. If you already know the url and the table <tag>, it is just a matter of method as to how to parse a text return. The point is
  • how does one figure out the url from that base url
  • then, how does one figure out the table tag

I easily scripted my StdOut [PS] function with examples a month or so ago. Yes, it does use PS [assumed to be included with Windows OS, Win7-Win11], but it is a WB script that leverages PS's built-in DOM processing for HTML. The script basically takes a text string of PS code, runs it, and captures STDOUT; you can pretty much do anything with the results in WB code. The argument, in this case, is two lines of PS code:
args = $"$URL = '|url|'
$html = (Invoke-WebRequest -Uri $URL).Content |  ConvertFrom-HtmlTable | ConvertTo-Csv -NoTypeInformation
$html
$"

and would replace |url| with a valid url, in this case

url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'
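followed by the same substitution the full scripts use:

args = StrReplace(args,'|url|',url)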

For the simplest treatment of the return, I attached a jpeg. As for getting the url in the first place, that would involve pre-processing code to show the candidates in a list box, as I mentioned earlier when I posted the Excel file with valid urls.

So, back to the point: how 'semi-generic' can WB be made for webscraping? The original DOM we all used to play with circa 2005 [see the post on the Tech Database by Jay Alverson... he was a WB genius] is no longer part of browsing, unless you still use IE 9-11....

WB had a DOM extender back then. It had flaws but was pretty solid for XPath. The original author disappeared, but the talk was to re-write and extend the extender. I'm sure it can be done in WB.
Stan - formerly stanl [ex-Pundit]

bottomleypotts

No war. I'm always happy to learn faster/more efficient ways of obtaining data.

spl

Quote from: bottomleypotts on August 02, 2024, 01:21:44 PM
No war. I'm always happy to learn faster/more efficient ways of obtaining data.

Well, here you go. The script below does work, but may not work for you. I look at Powershell as a semi-extender, not as a rival to WB. One issue is that PS may involve unnecessary overhead; the benefit is that it can be called from a WB GUI. PS is a command-line language, and making a GUI for it normally requires Windows Forms coding, while WB is soooo... much easier. That being said, the script will
  • isolate the url to retrieve Table information
  • run PS code through a CLR object to obtain the results

When I run the script, I already have the requisite PS module installed, so I added a code line to check and, if not installed, install it. The assumption here is that the user running the code has Win10 or Win11, whose default PS includes the PowerShellGet module, which will handle the installation. Worst case, you get a StdErr message saying what went wrong, which can be easily fixed.

Oh, and if it works for you... you might be interested in https://www.powershellgallery.com/packages?q=whisper
;Parsing HTML Table via CLR
;save as yerfolder\StdOut_HTMLTable.wbt
;Stan Littlefield 8/3/2024
;==========================================================
Gosub udfs
book = 'Full Moon over Noahs Ark'
url =  'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'
args = $"If (-not(Get-InstalledModule PSParseHTML -ErrorAction silentlycontinue)) { Install-Module -Name PowerHTML -Scope CurrentUser -AllowClobber -force }
$URL = '|url|'
$html = (Invoke-WebRequest -Uri $URL).Content |  ConvertFrom-HtmlTable | ConvertTo-Csv -NoTypeInformation
$html
$"
args = StrReplace(args,'|url|',url)
cmd="Powershell"
msg=book

BoxOpen("Running...",cmd:" ":args)
vals = Get_stdout()
vals = StrReplace(vals,'"','')
Message(msg,vals)
;FilePut("c:\temp\stdout.txt",vals)
Exit
;==========================================================
:udfs
#DefineSubroutine Get_stdout()
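; runs cmd with args via System.Diagnostics.Process [no console window],
; returning captured STDOUT, plus STDERR if any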
ObjectClrOption("useany","System")
objInfo = ObjectClrNew("System.Diagnostics.ProcessStartInfo")
Output=""
objInfo.FileName = cmd
objInfo.RedirectStandardError = ObjectType("BOOL",@TRUE)
objInfo.RedirectStandardOutput = ObjectType("BOOL",@TRUE)
objInfo.UseShellExecute = ObjectType("BOOL",@FALSE)
objInfo.CreateNoWindow = ObjectType("BOOL",@TRUE)
objInfo.Arguments = args
oProcess = ObjectClrNew("System.Diagnostics.Process")
oProcess.StartInfo = objInfo
oProcess.Start()
oProcess.WaitForExit()
STDOUT = oProcess.StandardOutput.ReadToEnd()
STDERR = oProcess.StandardError.ReadToEnd()
Output = Output:STDOUT:@CRLF
If STDERR<>""
   Output = Output:"STDERR:":STDERR:@CRLF
Endif
oProcess = 0
objInfo = 0
BoxShut()
Return (Output)
#EndSubroutine
Return
;==========================================================
Stan - formerly stanl [ex-Pundit]

spl

Arrgghh! Install-Module -Name PowerHTML should be written Install-Module -Name PSParseHTML
Stan - formerly stanl [ex-Pundit]

spl

Finally, to end this thread: below is a StdOut script to flesh out the links from the main https://books.toscrape.com page.
;Parsing HTML Table via CLR
;save as ..\StdOut_HTMLLinks.wbt
;Stan Littlefield 8/5/2024
;==========================================================
Gosub udfs
url = 'https://books.toscrape.com'
regex = 'catalogue/category/.*'
args = $"$url = '|url|'
$links = (Invoke-WebRequest -Uri $url).Links.Href | Sort-Object -Unique
$regex =  '|regex|'
$matches = $links | Select-String $regex
foreach ($m in $matches)
{
  $n = $url + '/' + $m
  $n
}
$"
args = StrReplace(args,"|url|",url)
args = StrReplace(args,"|regex|",regex)
cmd="Powershell"
msg='Links'

BoxOpen("Running...",cmd:" ":args)
vals = Get_stdout()
vals = StrReplace(vals,'"','')
Message(msg,vals)
;optional - create output file with links, for further scraping
;FilePut("c:\temp\booklinks.txt",vals)
Exit
;==========================================================
:udfs
#DefineSubroutine Get_stdout()
ObjectClrOption("useany","System")
objInfo = ObjectClrNew("System.Diagnostics.ProcessStartInfo")
Output=""
objInfo.FileName = cmd
objInfo.RedirectStandardError = ObjectType("BOOL",@TRUE)
objInfo.RedirectStandardOutput = ObjectType("BOOL",@TRUE)
objInfo.UseShellExecute = ObjectType("BOOL",@FALSE)
objInfo.CreateNoWindow = ObjectType("BOOL",@TRUE)
objInfo.Arguments = args
oProcess = ObjectClrNew("System.Diagnostics.Process")
oProcess.StartInfo = objInfo
oProcess.Start()
oProcess.WaitForExit()
STDOUT = oProcess.StandardOutput.ReadToEnd()
STDERR = oProcess.StandardError.ReadToEnd()
Output = Output:STDOUT:@CRLF
If STDERR<>""
   Output = Output:"STDERR:":STDERR:@CRLF
Endif
oProcess = 0
objInfo = 0
BoxShut()
Return (Output)
#EndSubroutine
Return
;==========================================================
Stan - formerly stanl [ex-Pundit]

bottomleypotts

Appreciate the post!
spl

Quote from: bottomleypotts on August 05, 2024, 06:36:19 AM
Appreciate the post!

Yep. They all work. Just understand the Get_stdout() function is just that, a function. It takes an argument parameter that can, in the cases I posted, be a short PS snippet. Since PS cmdlets include DOM/XPath processing, at this point it is better to call PS from a WB script than to write the parsing in pure WB.
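For example, any one-liner can be fed to it the same way [cmd and args are the globals the subroutine reads; this snippet is just an illustration]:

cmd = "Powershell"
args = "$PSVersionTable.PSVersion.ToString()"
BoxOpen("Running...",cmd:" ":args)
Message("PS Version",Get_stdout())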

Stan - formerly stanl [ex-Pundit]