Downloading a webpage

Started by hdsouza, December 19, 2019, 12:16:38 PM


hdsouza

Hi,

I am trying to download this webpage https://www.nextseed.com/offerings as I want to monitor the offerings on this page.
When I save the page to a file I see garbled characters in the file.

Please Help. Thanks

Code (winbatch)

Var_url = "https://www.nextseed.com/offerings"
File_HostConnect = "c:\temp\File_HostConnect.txt"

URL_Temp = strreplace (Var_url, "https://", "")
URL_Temp = strreplace (URL_Temp, "http://", "")
Url_Main =  strtrim(Itemextract(1, URL_Temp, "/" ))
Url_Remain = strtrim(strreplace(URL_Temp, "%Url_Main%/", ""))

tophandle=iBegin(0,"","")
connecthandle=iHostConnect(tophandle, Url_Main, @HTTP, "", "")
datahandle=iHttpInit(connecthandle, "GET", Url_Remain, "",0)
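If datahandle==0 then
   ; Request could not be initialized: record the error, clean up, and stop.
   err=iGetLastError()
   URL_Error = err
   iClose(connecthandle)
   iClose(tophandle)
   Page_Error = 1
   Display(1, "Error..", "Problem initializing request")
   Exit
EndIf

rslt=iHttpOpen(datahandle, "", 0, 0)
If rslt=="ERROR" || rslt!=200 then
   ; Anything other than HTTP 200 is treated as a failure; stop before reading.
   URL_Error = rslt
   iClose(datahandle)
   iClose(connecthandle)
   iClose(tophandle)
   Display(1, "Error..", "Problem opening webpage")
   Exit
EndIf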
iReadData(datahandle, File_HostConnect)
iClose(datahandle)
iClose(connecthandle)
iClose(tophandle)


td

I believe we have gone down this rabbit hole before. The page is generated when a JavaScript script is executed inside the browser. Unfortunately, WinInet is not a browser and therefore does not have a JavaScript engine to execute the code that creates the page. You can't use the WinInet extender to download a page that doesn't exist yet.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

hdsouza

Maybe someone had the same problem with nextseed.com, although I just started with it.
So can I use something other than WinInet? Any options?
Thanks TD

JTaylor

Automate Internet Explorer. There should be examples in the tech database. Depending on what is happening on the page, you may need to check for the existence of certain tags, IDs, etc. before grabbing the page content. You may also be able to make the embedded browser object work, but I am not certain it will if the page is heavy on JS.

Jim

hdsouza

Hey Jim, the webpage is being downloaded as junk characters.

JTaylor

It does this using IE?   Had only seen WinInet mentioned.

Jim

hdsouza

Yes, I am using the script from my first post. What options exist?

td

Thanks, Jim.  I should have mentioned the COM Automation approach in my first response in this topic.   Here is a link to an old but still informative article about different options available to perform Web scraping.

https://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+Tutorials+Working~With~Web~Pages.txt

JTaylor

As mentioned, automating IE is another option.   Again, you have to find some way to test to know the page is complete.    Sorry if I was unclear.   

See if this gets you headed in a useful direction.    Sorry for the sloppy code...just quickly chopped out some code from another script...

Code (winbatch)


  XMLURL = "https://www.nextseed.com/offerings"
  ybrowser = ObjectOpen("InternetExplorer.Application")
  ybrowser.addressbar = @TRUE
  ybrowser.statusbar = @TRUE
  ybrowser.menubar = @FALSE
  ybrowser.toolbar = @TRUE
  ybrowser.visible = 1
  ybrowser.Height =  600
  ybrowser.Width  =  800

  ybrowser.navigate(XMLURL)
  TimeDelay(1.5)


message("HEY","Wait until page is loaded.")
txt = ybrowser.document.getElementsByTagName("body").Item(0).OuterHTML
clipput(txt)
message("HEY",txt)
ybrowser.quit
ybrowser = 0
Exit


stanl

Here is a potentially useful subroutine I use to make sure the URL has loaded. I use oIE as my browser object.
Code (winbatch)

#DefineSubroutine ieready(n,msg)
   IntControl(73,1,0,0,0)
   t = 0
   If msg=="" Then msg="Loading... Please Be patient... WebSite may be busy..."
   While oIE.busy || oIE.readystate != 4
      t = t+1
      display(3,msg,"Attempt %t% on Page %page%")
      If WinExist("~Security")
         SendkeysTo("~Security","~")
         TimeDelay(.5)
      Endif
      If t>n Then Return(0)
   EndWhile

   t=0
   While oIE.Document.readystate != "complete"
      t = t+1
      display(3,"Document is...",oIE.Document.readystate)
      If WinExist("~Security")
         SendkeysTo("~Security","~")
         TimeDelay(.5)
      Endif
      If t>n Then Return(0)
   EndWhile
   Return(1)

   :WBERRORHANDLER
   Return(0)
#EndSubroutine

td

There might be a Tech Database article in this, but in the interim here is a crudely glued-together version of the two scripts.

Code (winbatch)
#DefineSubroutine ieready(n,msg)
   IntControl(73,1,0,0,0)
   
   t = 0 
   If msg=="" Then msg="Loading... Please Be patient... WebSite may be busy..."
   While oIE.busy || oIE.readystate != 4
      t = t+1
      display(3,msg,"Attempt %t% on Page %page%")
      If WinExist("~Security")
         SendkeysTo("~Security","~")
         TimeDelay(.5)
      Endif
      If t>n Then Return(0)
   EndWhile
   
   
   t=0
   While oIE.Document.readystate != "complete"
      t = t+1
      display(3,"Document is...",oIE.Document.readystate)
      If WinExist("~Security")
         SendkeysTo("~Security","~")
         TimeDelay(.5)
      Endif
   
      If t>n Then Return(0)
   EndWhile
   Return(1)
   
   
   :WBERRORHANDLER
   Return(0)
   
#EndSubroutine


XMLURL = "https://www.nextseed.com/offerings"
oIE = ObjectOpen("InternetExplorer.Application")
oIE.visible = @FALSE

oIE.navigate(XMLURL)

if IeReady(10, '')
   txt = oIE.document.getElementsByTagName("body").Item(0).OuterHTML
   clipput(txt)
   message("Web Page Body",txt)
endif
oIE.quit
oIE = 0
Exit


Thanks for the contributions.

JTaylor

Don't take this as any type of disagreement with what Stan or Tony posted. That approach is what I use as well, and I probably should have included it. But there have been instances where I have had to create a loop and check for the existence of an ID or other element to know the page has loaded before proceeding. I only mention this so you don't have to come back and ask again before trying such a thing, in case your script still seems not to get all the data.


Jim

td

It shouldn't be too difficult to add an additional ID parameter and check to Stan's IeReady subroutine.
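
The extra check suggested above could look roughly like the following. This is an untested sketch, not part of Stan's subroutine: WaitForMarker, the marker text, and the retry count are made-up names and values for illustration. Instead of probing for an element object, it searches the rendered body HTML for a marker string, such as an id attribute, and it assumes it is called only after ieready() has already returned 1.

```winbatch
; Hypothetical helper: polls the rendered body until the marker text
; (for example an id attribute) appears in the HTML, or the tries run out.
#DefineFunction WaitForMarker(oBrowser, marker, maxTries)
   For i = 1 To maxTries
      html = oBrowser.Document.getElementsByTagName("body").Item(0).outerHTML
      If StrIndexNc(html, marker, 1, @FWDSCAN) > 0 Then Return(1)
      TimeDelay(0.5)
   Next
   Return(0)
#EndFunction

; Usage, after ieready() has reported success. The marker value is an
; assumption; substitute an id that really appears on the loaded page.
If WaitForMarker(oIE, 'id="offerings"', 20)
   txt = oIE.Document.getElementsByTagName("body").Item(0).outerHTML
EndIf
```

A plain text search keeps the sketch free of any null-object handling, at the cost of re-reading the whole body on every pass.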

hdsouza

Thanks Jim, all.
Yes, the "outerhtml" did the trick. No garbled characters now.
I will just code around it by launching IE and doing it that way.
I may have to launch each offering in IE to extract the details, instead of using WinInet.

Thanks again

Code (winbatch)

FilePut((File_PageCheck), msie.document.GetElementsByTagName("HTML").item(0).outerHTML)


td

This is the week of the never-ending forum topics, it would appear. To correct a statement I made in my original post to the topic, it should be mentioned that the OP's site is compressed using gzip. That is why the download appears as "garbage" to the OP instead of a collection of references to off-site JavaScript libraries. The solution still remains the same.
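
One way to confirm the gzip explanation is to look at the first two bytes of the saved file, since every gzip stream starts with the bytes 1F 8B (31 and 139 in decimal). The snippet below is a minimal, untested sketch of that check. It assumes the file path from the first post and only reports what it finds; it does not decompress anything.

```winbatch
; Path taken from the first post; adjust as needed.
File_HostConnect = "c:\temp\File_HostConnect.txt"

size = FileSize(File_HostConnect)
If size >= 2
   ; Load the file into a binary buffer and peek at the first two bytes.
   buf = BinaryAlloc(size)
   BinaryRead(buf, File_HostConnect)
   b0 = BinaryPeek(buf, 0)
   b1 = BinaryPeek(buf, 1)
   BinaryFree(buf)

   ; 31 139 (hex 1F 8B) is the gzip signature.
   If b0 == 31 && b1 == 139
      Message("Check", "File starts with the gzip signature (1F 8B); the body is compressed.")
   Else
      Message("Check", "No gzip signature found; first bytes are %b0% %b1%.")
   EndIf
EndIf
```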

snowsnowsnow

Quote from: td on December 19, 2019, 02:23:12 PM
I believe we have gone down this rabbit hole before. The page is generated when a JavaScript script is executed inside the browser. Unfortunately, WinInet is not a browser and therefore does not have a JavaScript engine to execute the code that creates the page. You can't use the WinInet extender to download a page that doesn't exist yet.

Do we have an ETA on when that will be fixed?

JTaylor

Don't hold your breath... I did notice that the "(Almost) Psychic Support" header was removed (or I am blind), so I am guessing they now lack the psychic resources to produce information that doesn't exist if they can no longer claim to answer questions before they are asked. Perhaps this will be resolved with the release of the Quantum Extender. It is still uncertain what that will allow. It either will or it won't, or maybe both at the same time???

Jim

td

stanl