Scraping data from a website

Started by oradba4u, September 11, 2020, 12:03:09 AM


oradba4u

Hello:
I want to "scrape" the following table (see attachment) from this website:
https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml


Possible? How?

Thanks


stanl

Probably not worth writing code to scrape, just cut/paste into Excel. Hyperlinks preserved [attached]

oradba4u

Not worth writing code? Uh, do you know how many of these lineups there are?
Every team since the 1870s, with new lineups every year. I'll be looping through the years and leagues to "scrape" them all.
My apologies for not mentioning that originally, but I figured if I could do one, I could loop to get them all, so I gave one explicit example.
We'll probably have a vaccine for Covid-19 by the time I finish, if I start cutting & pasting right now :)

stanl

Quote from: oradba4u on September 12, 2020, 09:20:32 AM
Not worth writing code? Uh, do you know how many of these lineups there are?
Every team since the 1870s, with new lineups every year. I'll be looping through the years and leagues to "scrape" them all.
My apologies for not mentioning that originally, but I figured if I could do one, I could loop to get them all, so I gave one explicit example.
We'll probably have a vaccine for Covid-19 by the time I finish, if I start cutting & pasting right now :)


Thought your initial post was more of a one-off. If you look at the URL's page source, the table in your attachment is obtainable as a DOM table; more likely you'd need getElementById("grid_table_408961"). That being said, WB could use Excel automation to obtain it. But as for web scraping, the WB tech database has beaucoup examples using IE or Explorer; for non-IE browsers there is Selenium, and Jim recently introduced WebKitX.
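For the Excel-automation route, here is a rough, untested sketch (it assumes Excel is installed and that its legacy web-query interface can still see the table; the table id is the one noted above):

Code (WINBATCH)

;rough sketch only: pull the batting-order table into a workbook via a web query
url = "https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml"
xl = ObjectCreate("Excel.Application")
xl.Visible = @FALSE
xl.DisplayAlerts = @FALSE
wkb = xl.Workbooks.Add()
wks = wkb.Worksheets(1)
qt = wks.QueryTables.Add("URL;":url, wks.Range("A1"))
qt.WebSelectionType = 3                  ; xlSpecifiedTables
qt.WebTables = "grid_table_408961"       ; pull just the lineup table
qt.Refresh(@FALSE)                       ; synchronous refresh
wkb.SaveAs("C:\temp\CIN_1975.xlsx", 51)  ; 51 = xlOpenXMLWorkbook
xl.Quit()
xl = 0
Exit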


The code below, although in PowerShell, can get you started on the elements to consider for your scraping app. If you post your best effort at the scrape, others on the board will help out, but I doubt anyone is going to volunteer to write the whole program for you.




$url = "https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml"
$r = Invoke-WebRequest $url
$tbl = $r.ParsedHtml.getElementById("grid_table_408961").getElementsByTagName("TR")
ForEach ($tr in $tbl) {
   ForEach ($th in $tr.getElementsByTagName("TH")) {   
     $tgame = "Game: " + $th.innerText + ","
   }
   $tbat=""
   ForEach ($td in $tr.getElementsByTagName("TD")) {
      $tbat = $tbat + $td.innerText + ","   
   }
   $join = $tgame + $tbat.Substring(0,$tbat.Length -1)
   $join
}



oradba4u

Thanks... you've given me a lot to digest, and I'm not looking for a free program (lunch)... how can I learn if someone hands me the code?

Will post my progress, as time permits... THANKS AGAIN!

stanl

That was sloppy code. I refined it to just look for the class name, added a header, and output the table as a text file: pipe-delimited, with comma-separated batting positions. I attached the output. I don't work with IE or MSHTML anymore, but I'm sure it is easy to convert the PS to WB. If you are really interested in putting together a historical record, the base URL appears pretty consistent, i.e. https://www.baseball-reference.com/teams/CIN/2018-batting-orders.shtml or https://www.baseball-reference.com/teams/ATL/2018-batting-orders.shtml would work with the code [with a different output file].


$url = "https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml"
$r = Invoke-WebRequest $url
$tbl = $r.ParsedHtml.getElementsByClassName("R")
$hdr = "Game_Date|Batting_Order"
$out = $hdr + "`r`n"
ForEach ($tr in $tbl) {
   ForEach ($th in $tr.getElementsByTagName("TH")) { 
     $tgame = "Game: " + $th.innerText + "|"
   }
   $tbat=""
   ForEach ($td in $tr.getElementsByTagName("TD")) {
      $tbat = $tbat + $td.innerText + ","
   }
   $join = $tgame + $tbat.Substring(0,$tbat.Length -1) + "`r`n"
   $out = $out + $join
}
$out | Out-File "C:\temp\CIN_1975.txt" #change path/filename as needed



stanl

Here is a little extra WB code to generate URLs to pull up the batting lineups for MLB teams for a range of years. It uses WIL maps, which I'm getting more comfortable with, and putting it together surfaced an anomaly: the abbreviation for the Cincinnati Reds is CIN, which is now a reserved word in WinBatch. So in the attached map lookup it is CINN, changed back in the code.
Code (WINBATCH)


;Winbatch 2020A - create lookup URLS for MLB batting lineups
;uses WIL maps
;Stan Littlefield, September 13, 2020
;====================================================================================
BoxOpen("Building MLB URL Lookups","Please Wait")


mapfile = dirscript():"MLBTeams.csv"
If ! FileExist(mapfile) Then Terminate(@TRUE,"Cannot Run Script",mapfile:" is missing")


map = MapFileGetCSV(mapfile)
outfile =  dirscript():"BaseballRef.txt"
startyear = 2000  ;you can start with any year assuming data exists for team
endyear = 2020


baseurl = "https://www.baseball-reference.com/teams/TEAM/YEAR-batting-orders.shtml"


urls =""


ForEach key in map
   If key=="CINN" Then key="CIN"  ; CIN is reserved word in Winbatch
   For y=startyear to endyear
      url = StrReplace(baseurl,"TEAM",key)
      url = StrReplace(url,"YEAR",y)
      BoxText(url)
      urls := url:@CRLF
   Next
Next


BoxShut()
;Message("MLB Lookup URLS",urls)
FilePut(outfile,urls)
If FileExist(outfile) Then Message("Check Output File",outfile)


Exit
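
As an aside, the value side of the map is there too if you ever want the full team name. A minimal sketch, assuming the attached MLBTeams.csv holds two-column lines like ATL,Atlanta Braves (adjust if your lookup file differs):

Code (WINBATCH)

;minimal WIL map demo: the key is the abbreviation, map[key] the team name (assumed CSV layout)
map = MapFileGetCSV(dirscript():"MLBTeams.csv")
names = ""
ForEach key in map
   names := key:" = ":map[key]:@CRLF
Next
Message("MLB Teams", names)
Exit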

oradba4u

WOW! Thanks so much, Stan. I had been struggling with MAPS, and now I get what's going on! VERY HELPFUL!

stanl

Found another anomaly: change MIN to MINN in the lookup and add this line


   If key=="MINN" Then key="MIN"  ; MIN is reserved word in Winbatch


parsing is fun

td

Of course, the irony is that ParsedHtml is just a wrapper for the old IHTMLDocument COM interface, so you are still using IE's document COM objects under the hood.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

Here is a very rough, not-for-production hack at mixing FCL classes and COM Automation. In a sense, it illustrates how PS works under the hood.

Code (winbatch)
ObjectClrOption('useany', 'System')
objWebUtil = ObjectClrNew('System.Net.WebUtility')

strUrl = 'https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml'
objUri = ObjectClrNew('System.Uri', strUrl)
objSvcManager = ObjectClrNew('System.Net.ServicePointManager')
protocols = ObjectClrType("System.Net.SecurityProtocolType",3072|768) 
objSvcManager.SecurityProtocol = protocols
objSvcPoint = objSvcManager.FindServicePoint(objUri)

objWebRequest = ObjectClrNew('System.Net.WebRequest')
objWebRequest = objWebRequest.Create(objUri)
objWebRequest.Timeout = objWebRequest.Timeout * 6
objResponse = objWebRequest.GetResponse()
objResponseStream = objResponse.GetResponseStream()
Encoding = ObjectClrNew( 'System.Text.Encoding' )
objStreamReader = ObjectClrNew("System.IO.StreamReader", objResponseStream, Encoding.UTF8)

;; Load the CLR stream into a COM Automation object.
objDom = ObjectCreate('htmlfile')
objDom.Write(objStreamReader.ReadToEnd())
objResponse.Close()

strOut  = ''
objTbl = objDom.getElementsByClassName("R")
ForEach objTr in objTbl
   strGame = ''
   ForEach  objTh in objTr.getElementsByTagName("TH")
      strGame = "Game: " : objTh.innerText :"|"
   next
   strBat  = ''
   ForEach objTd in objTr.getElementsByTagName("TD")
      strBat := objTd.innerText : ","   
   next
   strOut := strGame : strSub(strBat,1,StrLen(strBat)-1):@CRLF
next

FilePut("C:\temp\CIN_1975.txt", strOut)
exit
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

oradba4u

@Stan:

To learn exactly what's going on in your terrific code sample, I noticed something strange in the output file (BaseballRef.txt):
the second set of 21 URLs (2000-2020) are not filled in with any team abbreviation (after PHI - Philadelphia).
I checked the rest of the output file, and all 30 teams are in there (more about that in a moment), so I can't for the life of me understand why there are 21 "ghost" records! In total, there are 651 records instead of the expected 21 years x 30 teams = 630 records!

You added "CINN" to your modified MAP and code (in lieu of "CIN), and later posted another reserved word change, "MINN" (instead of "MIN) which I understand why - It's a reserved word. However, I got that change AFTER I ran the program, and F-Y-I,  "MIN" went through OK!

Anyway, I'm baffled as to why these "ghost" records are in the text file.

UPDATE!
I just ran the program again using 2001-2020 and got the same "ghost" records as the 2nd set. I'll investigate!


I FOUND THE PROBLEM!!!
A blank line at the end of the CSV file! Now, it works like a charm! THANKS!
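
For anyone else who hits this, a one-line guard in the loop would also cover it. A sketch (if Continue isn't allowed inside ForEach in your WinBatch version, an If/EndIf block does the same thing):

Code (WINBATCH)

ForEach key in map
   If StrTrim(key)=="" Then Continue   ; skip the empty key a trailing blank line creates
   If key=="CINN" Then key="CIN"       ; CIN is a reserved word in WinBatch
   ; ... rest of the loop as before
Next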

kdmoyers

Quote from: td on September 15, 2020, 08:14:52 AM
Here is a very rough, not-for-production hack at mixing FCL classes and COM Automation. In a sense, it illustrates how PS works under the hood.
This was useful, thanks!
The mind is everything; What you think, you become.

stanl

Quote from: td on September 15, 2020, 08:14:52 AM
Here is a very rough, not-for-production hack at mixing FCL classes and COM Automation. In a sense, it illustrates how PS works under the hood.


I wasn't recommending PS as the solution, and I know IE is under the hood. Any new server with Windows Core won't have IE installed, and therefore the PS Invoke-WebRequest is useless. I responded to the OP while trying to work with the PS Selenium module, and since I was already in it, threw together the script. I was guessing someone would throw out some IE or mshtml code, or that you would, as you did, with some CLR.

ChuckC

In PowerShell, on Windows Core, use the -UseBasicParsing parameter with the Invoke-WebRequest command.  That causes the command to not make any attempt to use IE-specific functionality.

Please note that this is specific to PowerShell v5.1

https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/invoke-webrequest?view=powershell-5.1

-UseBasicParsing
Indicates that the cmdlet uses the response object for HTML content without Document Object Model (DOM) parsing. This parameter is required when Internet Explorer is not installed on the computers, such as on a Server Core installation of a Windows Server operating system.

For PowerShell v6 & newer, the following applies:

https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/invoke-webrequest?view=powershell-6

This parameter has been deprecated. Beginning with PowerShell 6.0.0, all Web requests use basic parsing only. This parameter is included for backwards compatibility only and any use of it has no effect on the operation of the cmdlet.


td

I was just proud of myself for coming up with a way to illustrate using a mix of COM Automation and CLR hosting to accomplish a task. Mashing different bits of functionality together is a WinBatch strength. I didn't mean to imply anything or offend anyone.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Quote from: ChuckC on September 15, 2020, 05:43:41 PM
In PowerShell, on Windows Core, use the -UseBasicParsing parameter with the Invoke-WebRequest command. 


My bad, Chuck. I meant to write that ParsedHtml would not work. If -UseBasicParsing were added to my original PS code, you could then add $HTML = New-Object -Com "HTMLFile", similar to Tony's script.

ChuckC

Getting proficient with WinBatch can easily become the programming equivalent of mastering duct tape, twine, baling wire, crazy glue, nails, rivets, and decking screws when it comes to pulling different technologies together and making them work for a common purpose.

I would argue that a well-written WinBatch script is a lot like a piñata made from duct tape... No matter how hard you beat it up, it's just not going to break...


Quote from: td on September 16, 2020, 12:42:28 AM
I was just proud of myself for coming up with a way to illustrate using a mix of COM Automation and CLR hosting to accomplish a task. Mashing different bits of functionality together is a WinBatch strength. I didn't mean to imply anything or offend anyone.

td

Quote
My bad, Chuck. I meant to write that ParsedHtml would not work. If -UseBasicParsing were added to my original PS code, you could then add $HTML = New-Object -Com "HTMLFile", similar to Tony's script.

MSFT is constantly changing both the dotNet Framework Class Library and the dotNet Core library. The last time I checked, MSFT was still loading MSHTML.DLL into PowerShell for HTML parsing. That may have already changed or will change in the future, and MSFT would not need a full IE installation to have MSHTML.DLL still available on systems. I do know that MSFT has a beta version of a replacement for the "WebBrowser" COM server that has both dotNet (for UWP?) and COM interfaces. It is likely to also support the HTML-parsing role MSHTML.DLL fills now. But for now, MSHTML.DLL is still present on Workstation installations of Windows 10 and likely will continue to be present for some time to come.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Well, GetElementsByTagName ain't what it used to be... meaning web scraping is a lot more complicated >:(

td

Obviously, a lot has to do with the increased use of dynamic Web pages.  That is where MSFT's new browser object may (or may not) be a boon.  It all depends on what they end up adding to the implementation - Chakra would be a nice start.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

ChuckC

Given that screen/web scraping is almost always hindered by arbitrary layout changes and "iffy" representations of a parsed web page, has any thought been given as to whether the site mentioned at the start of the thread by the OP offers any kind of REST [or otherwise] API for directly querying for data sets instead of having to parse tables of data from the content of web pages?

stanl

Quote from: ChuckC on September 16, 2020, 10:38:18 AM
has any thought been given as to whether the site mentioned at the start of the thread by the OP offers any kind of REST [or otherwise] API for directly querying for data sets instead of having to parse tables of data from the content of web pages?


No REST for the weary. I think a lot of the data is captured with Google Analytics.

oradba4u

Quote from: ChuckC on September 16, 2020, 10:38:18 AM
Given that screen/web scraping is almost always hindered by arbitrary layout changes and "iffy" representations of a parsed web page, has any thought been given as to whether the site mentioned at the start of the thread by the OP offers any kind of REST [or otherwise] API for directly querying for data sets instead of having to parse tables of data from the content of web pages?

As an inexperienced (as in NO experience) web developer, how would one go about finding out the answer to ChuckC's last question?

JTaylor

Ask the people at the place where you are getting the data.   Be prepared for the challenge in finding someone who actually knows what you are asking about though  :)

Jim

oradba4u

Yeah, and I'm sure they want $$$ (if I can even get to the people in the know) for access to their database... But, it's worth a try... I got nothin' to lose, and as ChuckC said, it's probably more reliable that way (database) than depending on a website scrape of the data.

stanl

We can put PowerShell to sleep here with a caveat. I tried to mimic Tony's code in PS by using $r = Invoke-WebRequest $url -UseBasicParsing and got this error:


[mshtml.HTMLDocumentClass] does not contain a method named 'getElementsByClassName'


but this worked:


$url = "https://www.baseball-reference.com/teams/CIN/2018-batting-orders.shtml"
$r = Invoke-WebRequest $url  -UseBasicParsing
$html = New-Object -Com "htmlfile"
$html.IHTMLDocument2_write($r.Content)
$tbl = $html.documentElement.getElementsByClassName("R")

stanl

Quote from: oradba4u on September 16, 2020, 05:21:56 PM
Yeah, and I'm sure they want $$$


They do have a subscription service for individual queries.


In terms of WB and mixing technologies:

  • Take the URL output text file as input and, for each line,
  • send the URL to Tony's code and obtain the output,
  • reformat the output as a true .csv [all comma-delimited],
  • use Jim's extender and insert into SQLite,

and you have a database for batting orders from 2000 on in one pure WB script. A rough sketch of the driver loop is below.
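
Something like this, where udfScrapeLineups() and udfToCsv() are hypothetical UDFs wrapping Tony's CLR/htmlfile scrape and the csv reformat (neither is written yet):

Code (WINBATCH)

;driver sketch only: udfScrapeLineups() and udfToCsv() are placeholders
urlfile = dirscript():"BaseballRef.txt"
allcsv = ""
hUrls = FileOpen(urlfile, "READ")
While @TRUE
   url = FileRead(hUrls)
   If url == "*EOF*" Then Break
   If StrTrim(url) == "" Then Continue   ; skip blank lines
   raw = udfScrapeLineups(url)           ; bar-delimited scrape output for one team/year
   allcsv := udfToCsv(raw, url)          ; comma-delimited rows; year comes from the url
EndWhile
FileClose(hUrls)
;from here: loop the csv rows and insert into SQLite with Jim's extender
Exit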

oradba4u

I'll see what I can do with all of this info from the various sources, and thanks a lot guys... Honestly, it's been a great learning experience for me

stanl

This has been an interesting thread. My original output and Tony's WB-specific output for the batting lineups were bar-delimited because a comma separates the day of the week from the date. To create a complete CSV output with headers (and I expect some will come in with sloppy-code cleanup hints :o ), the code below might be of help. And it definitely makes the output a candidate for Jim's SQLite extender.
Code (WINBATCH)


;reparsing batting lineup output into a true CSV

url = "https://www.baseball-reference.com/teams/CIN/1975-batting-orders.shtml"
;sample output row from the website parsing
cp = "Game: 1. Mon,4/7 vs LAD W (2-1)|Rose-LF,Morgan-2B,Bench-C,Perez-1B,Concepcion-SS,Geronimo-CF,Griffey-RF,Vukovich-3B,Gullett-P"
year = StrSub(ItemExtract(6,url,"/"),1,4)   ; "1975" from the url

;modified headers
hdr = "Game,Day,GameDate,Versus,p1,p2,p3,p4,p5,p6,p7,p8,p9"


;reparsing for csv: re-accommodate the comma within the | delimiter
cp1 = ItemExtract(1,cp,"|")                                   ; "Game: 1. Mon,4/7 vs LAD W (2-1)"
cp2 = ItemExtract(2,cp,"|")                                   ; the nine batting positions
cp11 = StrSub(cp1,1,StrIndex(cp1,".",0,@FWDSCAN)-1):","       ; "Game: 1,"
cp12 = StrTrim(StrSub(cp1,StrIndex(cp1,".",0,@FWDSCAN)+1,-1)) ; "Mon,4/7 vs LAD W (2-1)"
cpl21 = ItemExtract(1,cp12,","):","                           ; "Mon,"
cpl22 = ItemExtract(2,cp12,",")                               ; "4/7 vs LAD W (2-1)"
cp1221 = StrTrim(StrSub(cpl22,1,StrIndex(cpl22,"vs",0,@FWDSCAN)-1)):"/":year:","   ; "4/7/1975,"
cp1222 = StrTrim(StrSub(cpl22,StrIndex(cpl22,"vs",0,@FWDSCAN)-1,-1))               ; "vs LAD W (2-1)"
cp1 = cp11:cpl21:cp1221:cp1222:","
FilePut("C:\temp\test.txt", hdr:@CRLF:cp1:cp2)
Exit



stanl

I suggest you move that to a new thread.