
Author Topic: Extract Webpage  (Read 989 times)

hdsouza

  • Full Member
  • ***
  • Posts: 162
Extract Webpage
« on: April 11, 2019, 05:09:51 pm »
I have used this approach numerous times on other sites and have always received the correct result, so I am not sure what I am doing wrong.
I want to download this webpage: http://www.tipranks.com/stocks/pdm/price-target
But what I get in File_Hostconnect.txt is vastly different from what appears on the page.

Code: Winbatch
; Load the WinInet extender
AddExtender("WWINT44I.DLL")
File_Hostconnect = "c:\temp\File_Hostconnect.txt"

Url_main = "www.tipranks.com"
Url_sub  = "stocks/pdm/price-target"
; Open a session, connect to the host and send a GET request for the page
tophandle=iBegin(0,"","")
connecthandle=iHostConnect(tophandle, Url_main, @HTTP, "", "")
datahandle=iHttpInit(connecthandle, "GET", Url_sub, "",0)
rslt=iHttpOpen(datahandle, "", 0, 0)
; Save the response body to the output file
iReadData(datahandle, File_Hostconnect)
; Release the handles in reverse order
iClose(datahandle)
iClose(connecthandle)
iClose(tophandle)
 

stanl

  • Pundit
  • *****
  • Posts: 1250
Re: Extract Webpage
« Reply #1 on: April 12, 2019, 03:05:10 am »
Your script does return the page source, but that source is primarily JS links. It probably requires some skill with Ajax to get anything useful out of it.

td

  • Tech Support
  • *****
  • Posts: 3494
    • WinBatch
Re: Extract Webpage
« Reply #2 on: April 12, 2019, 06:40:14 am »
A lot of references to the React javascript library. 
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Re: Extract Webpage
« Reply #3 on: April 12, 2019, 08:53:13 am »
Quote from: td
A lot of references to the React javascript library.

and some babel-polyfill ???

stanl

Re: Extract Webpage
« Reply #4 on: April 12, 2019, 09:16:20 am »
really interesting... try this
Code: Winbatch

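; On any error, jump to the :WBERRORHANDLER label
IntControl(73,1,0,0,0)
url = 'http://www.tipranks.com/stocks/pdm/price-target'
; Load the .NET System assembly and download the page with WebClient
ObjectClrOption("useany","System")
oWEB = ObjectClrNew('System.Net.WebClient')
content = oWEB.DownloadString(url)
oWEB=0
FilePut("c:\temp\content.txt",content)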
Exit


:WBERRORHANDLER
oWEB=0
Pause("oops, error",wberrortextstring:@CRLF:wberroradditionalinfo)
Exit
 

hdsouza

Re: Extract Webpage
« Reply #5 on: April 12, 2019, 10:44:55 am »
Thanks Stan. I ran your script with WinBatch version 2018B. It returned "405: method not allowed".

I also tried using WinBatch to drive IE:
-- opened IE
-- navigated to 'http://www.tipranks.com/stocks/pdm/price-target'
-- FilePut((File_PageCheck), msie.document.GetElementsByTagName("HTML").item(0).outerHTML)
The contents of File_PageCheck correctly had "Moderate Buy". So I would have assumed that iHostConnect would have returned the same contents too.
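
For readers following along, the IE steps listed above might look roughly like the sketch below. It is not the poster's actual script: the output path, the ReadyState wait loop and the extra five-second pause are assumptions added for illustration, and IE's scripting support may not render every site.

```winbatch
; Sketch only: drive Internet Explorer through COM and save the rendered DOM.
url            = "http://www.tipranks.com/stocks/pdm/price-target"
File_PageCheck = "c:\temp\File_PageCheck.txt"   ; hypothetical output path

msie = ObjectCreate("InternetExplorer.Application")
msie.Visible = @TRUE
msie.Navigate(url)

; Wait until the browser reports the document as loaded (ReadyState 4 = complete)
While msie.Busy || msie.ReadyState != 4
   TimeDelay(1)
EndWhile

; Assumed extra pause so client-side scripts can finish building the page
TimeDelay(5)

; Grab the HTML as it exists after scripts have run, not the raw source
html = msie.document.GetElementsByTagName("HTML").item(0).outerHTML
FilePut(File_PageCheck, html)

msie.Quit()
msie = 0
```

The point of going through `outerHTML` is that it reflects the page after the browser has run its scripts, which is why "Moderate Buy" shows up there.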


td

Re: Extract Webpage
« Reply #6 on: April 12, 2019, 01:03:43 pm »
Quote from: stanl
Quote from: td
A lot of references to the React javascript library.

and some babel-polyfill ???

I suppose it works better than a Babel Fish stuck in your ear in this case...

td

Re: Extract Webpage
« Reply #7 on: April 12, 2019, 02:37:51 pm »
Quote from: hdsouza
Thanks Stan. I ran your script with WinBatch version 2018B. It returned "405: method not allowed".

I also tried using WinBatch to drive IE:
-- opened IE
-- navigated to 'http://www.tipranks.com/stocks/pdm/price-target'
-- FilePut((File_PageCheck), msie.document.GetElementsByTagName("HTML").item(0).outerHTML)
The contents of File_PageCheck correctly had "Moderate Buy". So I would have assumed that iHostConnect would have returned the same contents too.

Stan, correct me if I am wrong, but I believe your point was that the page generates its contents by running JavaScript in the client browser. The MSIE COM object has somewhat limited browser scripting capabilities, but the WinInet extender is not a browser at all. It has no JavaScript engine to generate the output, so it cannot return the text as you see it in a Web browser.
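
A quick way to see this for yourself is to search the file the WinInet script saved for text you can see in the browser. The snippet below is a minimal sketch; it assumes the file path from the original post and uses "Moderate Buy" as the marker only because that is the text mentioned earlier in the thread.

```winbatch
; Sketch only: test whether visible page text exists in the raw downloaded source.
File_Hostconnect = "c:\temp\File_Hostconnect.txt"   ; path from the original post
marker           = "Moderate Buy"                   ; text seen in the browser

If !FileExist(File_Hostconnect)
   Message("Source check", "File not found: " : File_Hostconnect)
   Exit
EndIf

source = FileGet(File_Hostconnect)

; StrIndexNc returns 0 when the substring is absent (case-insensitive search)
If StrIndexNc(source, marker, 1, @FWDSCAN) > 0
   Message("Source check", "Marker found in the raw source.")
Else
   Message("Source check", "Marker not found. The text is probably generated by script in the browser.")
EndIf
```

If the marker is missing from the raw source but present in the IE-saved file, that is consistent with the page being built client-side.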

stanl

Re: Extract Webpage
« Reply #8 on: April 14, 2019, 01:35:15 pm »
Bottom line: it is an interesting site. I had some time and tried a Plan B by connecting from Excel, but got the same 405 error that the .NET script returned. The 405 makes for great reading. I thought about Selenium, but then read a thread about Selenium and React JavaScript, which was more interesting reading. My best guess is that the 'free' part of the site is look-but-don't-touch, and maybe they are a little more lenient with the paid stuff.
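
Since the 405 keeps coming up, it may be worth checking what status the server hands back to the WinInet script as well. This is only a sketch built on the original script: it assumes iHttpOpen returns the HTTP status code (or the string "ERROR" on failure) and that iHttpHeaders returns a tab-delimited header list, so check the WinInet extender help for your version before relying on it.

```winbatch
; Sketch only: report the HTTP status and response headers for the request.
AddExtender("WWINT44I.DLL")

Url_main = "www.tipranks.com"
Url_sub  = "stocks/pdm/price-target"

tophandle     = iBegin(0, "", "")
connecthandle = iHostConnect(tophandle, Url_main, @HTTP, "", "")
datahandle    = iHttpInit(connecthandle, "GET", Url_sub, "", 0)
rslt          = iHttpOpen(datahandle, "", 0, 0)

If rslt == "ERROR"
   ; Assumption: the request never completed, so ask the extender why
   Message("Request failed", "WinInet error: " : iGetLastError())
Else
   If rslt != 200
      ; Anything other than 200 (for example 405) is shown with its headers
      headers = iHttpHeaders(datahandle)
      Message("HTTP status " : rslt, StrReplace(headers, @TAB, @CRLF))
   Else
      Message("HTTP status", "200 OK")
   EndIf
EndIf

iClose(datahandle)
iClose(connecthandle)
iClose(tophandle)
```

A 200 here together with script-only content would support the client-side rendering explanation, whereas a 405 would suggest the server is rejecting the request itself.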

hdsouza

Re: Extract Webpage
« Reply #9 on: April 14, 2019, 05:19:43 pm »
Thanks Stan and td.
I appreciate the thought process and ideas.

mhall

  • Jr. Member
  • **
  • Posts: 84
Re: Extract Webpage
« Reply #10 on: April 22, 2019, 01:53:14 pm »
This is not an automated method, but when I was in a position like yours, I did the following (assuming Firefox as the browser):

1.) Opened the browser devTools (F12).
2.) From the Inspector tab, selected the HTML (root) element of the page.
3.) Right-clicked and copied the outer HTML of that element.

That gives you the generated content of the page, as you see it on screen.

That may not, however, be everything you need: various interactions with the page (making a selection, clicking a button) could cause content you want to be injected or removed.

But, it may be a start.

~Micheal