Extract Webpage

Started by hdsouza, April 11, 2019, 05:09:51 PM

hdsouza

I have used this numerous times on other sites and have always received the correct result, so I am not sure what I am doing wrong.
I want to download this webpage: http://www.tipranks.com/stocks/pdm/price-target
But what I get in File_Hostconnect.txt is vastly different from what appears on the page.

Code (winbatch)

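; Load the WinInet extender
AddExtender("WWINT44I.DLL")
File_Hostconnect = "c:\temp\File_Hostconnect.txt"

Url_main = "www.tipranks.com"
Url_sub  = "stocks/pdm/price-target"

; Open a session, connect to the host, and set up a GET request for the page
tophandle=iBegin(0,"","")
connecthandle=iHostConnect(tophandle, Url_main, @HTTP, "", "")
datahandle=iHttpInit(connecthandle, "GET", Url_sub, "",0)

; Send the request and write the response to the output file
rslt=iHttpOpen(datahandle, "", 0, 0)
iReadData(datahandle, File_Hostconnect)

; Close the handles in reverse order of creation
iClose(datahandle)
iClose(connecthandle)
iClose(tophandle)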


stanl

Your script returns the page source. It is primarily JS links. Probably requires skill in Ajax to get anything useful.

td

A lot of references to the React javascript library. 
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

stanl

Quote from: td on April 12, 2019, 06:40:14 AM
A lot of references to the React javascript library.


and some babel-polyfill ???

stanl

really interesting... try this
Code (WINBATCH)

IntControl(73,1,0,0,0)
url = 'http://www.tipranks.com/stocks/pdm/price-target'
ObjectClrOption("useany","System")
oWEB = ObjectClrNew('System.Net.WebClient')
content = oWEB.DownloadString(url)
oWEB=0
FilePut("c:\temp\content.txt",content)
Exit


:WBERRORHANDLER
oWEB=0
Pause("oops, error",wberrortextstring:@CRLF:wberroradditionalinfo)
Exit

hdsouza

Thanks Stan. I ran your script with WinBatch version 2018B. It returned "405: method not allowed".

I also tried the following in WinBatch:
-- opened IE
-- navigated to 'http://www.tipranks.com/stocks/pdm/price-target'
-- FilePut(File_PageCheck, msie.document.GetElementsByTagName("HTML").item(0).outerHTML)
The contents of File_PageCheck correctly included "Moderate Buy", so I would have assumed that iHostConnect would return the same contents too.


td

Quote from: stanl on April 12, 2019, 08:53:13 AM
Quote from: td on April 12, 2019, 06:40:14 AM
A lot of references to the React javascript library.


and some babel-polyfill ???

I suppose it works better than a Babel Fish stuck in your ear in this case...

td

Quote from: hdsouza on April 12, 2019, 10:44:55 AM
Thanks Stan. I ran your script with WinBatch version 2018B. It returned "405: method not allowed".

I also tried the following in WinBatch:
-- opened IE
-- navigated to 'http://www.tipranks.com/stocks/pdm/price-target'
-- FilePut(File_PageCheck, msie.document.GetElementsByTagName("HTML").item(0).outerHTML)
The contents of File_PageCheck correctly included "Moderate Buy", so I would have assumed that iHostConnect would return the same contents too.

Stan, correct me if I am wrong, but I believe your point was that the page generates its contents by running JavaScript in the client browser. The MSIE COM object has somewhat limited browser scripting capabilities, but the WinInet extender is not a browser at all. This means the WinInet extender cannot return the text as you see it in a web browser, because it has no JavaScript engine to generate the output.
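
To make the point above concrete, here is a minimal sketch that runs offline, written in Python rather than WinBatch. It strips the script and style blocks from an HTML string and reports how many script tags it saw and what readable text is left. The two sample strings are made up for illustration and are not the actual TipRanks markup: one mimics the bare source a plain HTTP fetch gets from a client-rendered page, the other mimics the DOM after a browser has run the scripts.

```python
from html.parser import HTMLParser

# Hypothetical samples, not the real page markup.
RAW_SOURCE = (
    '<html><head><title>Price target</title>'
    '<script src="/bundle.js"></script></head>'
    '<body><div id="root"></div></body></html>'
)
RENDERED_DOM = (
    '<html><body><div id="root"><span>Moderate Buy</span></div></body></html>'
)


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script>/<style>
        self.script_count = 0  # number of <script> start tags seen
        self.chunks = []       # non-empty text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def summarize(html):
    """Return (number of script tags, readable text) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.script_count, " ".join(parser.chunks)


if __name__ == "__main__":
    for name, html in (("raw source", RAW_SOURCE), ("rendered DOM", RENDERED_DOM)):
        scripts, text = summarize(html)
        print(name, "| scripts:", scripts, "| has rating:", "Moderate Buy" in text)
```

With these samples, the text "Moderate Buy" is missing from the raw source and present only in the rendered DOM, which matches the difference between the WinInet result and the IE result described in this thread.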

stanl

Bottom line: it is an interesting site. I had some time and tried a Plan B by connecting from Excel, but got the same 405 error that the .NET script returned. The 405 makes for great reading. I thought about Selenium, but then read a thread about Selenium and React JavaScript, which was more interesting reading. My best guess is that for 'free' the site is look-but-don't-touch, and maybe they are a little more lenient with the paid stuff.

hdsouza

Thanks Stan and Td.
I appreciate the thought process and ideas.

mhall

This is not an automated method, but when I was in a position like yours, I did the following (assuming Firefox as the browser):

1.) Opened the browser devTools (F12).
2.) In the Inspector tab, selected the HTML (root) element of the page.
3.) Right-clicked that element and copied its outer HTML.

That gives you the generated content of the page, as you see it on screen.

That may not be everything you need, however: various interactions with the page (making a selection, clicking a button) could cause content you want to be injected or removed.

But, it may be a start.

~Micheal
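
One way to see what an interaction injected is to save the copied outer HTML before and after it and compare the two snapshots. Below is a minimal sketch in Python using the standard difflib module. The function name and the sample snapshots are hypothetical, not taken from the actual page, and the comparison is line-based, so it assumes the saved HTML contains line breaks.

```python
import difflib

# Hypothetical snapshots of copied outer HTML, before and after an interaction.
BEFORE = '<div id="root">\n</div>'
AFTER = '<div id="root">\n<span>Moderate Buy</span>\n</div>'


def added_lines(before_html, after_html):
    """Return the lines that appear in after_html but not in before_html."""
    diff = difflib.ndiff(before_html.splitlines(), after_html.splitlines())
    # ndiff marks lines unique to the second input with a "+ " prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]


if __name__ == "__main__":
    for line in added_lines(BEFORE, AFTER):
        print("added:", line)
```

Swapping the two arguments lists what was removed instead. If the copied HTML arrives as one long line, it would need to be split (for example at tag boundaries) before a line-based comparison is useful.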