Reading text from a webpage

Started by Ron47, July 24, 2013, 07:00:12 AM

Ron47

Hi,

I'd like to read text from webpages and I'm trying the example in your documentation:
Internet_Explorer_Q02
which includes:
Code (winbatch)
objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate ("http://www.winbatch.com") ;to access an internet file




Your sample code works with some web pages but not others. For example:

http://www.hp.com/   returns no text.

http://www.sfchronicle.com/    and other sites return an error message:

Ole: Unknown name
Code (winbatch)
objBrowserBody=objBrowserDoc.Body

(I'm using IE 10)

Is there a better way to do this?

Thanks!
Ron

Deana

I tested that code sample on my Windows 7 system with IE 10. All three websites returned VT_NULL for the Text property of the TextRange object.

I managed to create some working code using the innerText property:

Code (winbatch)
;url = 'http://www.hp.com'
url = 'http://www.winbatch.com'
;url = 'http://www.sfchronicle.com'

objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate (url) ;to access an internet file
While objBrowser.readystate <> 4
       TimeDelay(0.5)
EndWhile
objBrowserDoc = objBrowser.Document
objBrowserBody = objBrowserDoc.Body
Message("InnerText", objBrowserBody.InnerText )

;Close objects
;objBrowserPage = 0
objBrowserBody = 0
objBrowserDoc = 0
objBrowser.Quit
objBrowser = 0
Exit


It is currently unclear why the createTextRange method and the Text property no longer work. Here is a link to some documentation about the innerText property: http://msdn.microsoft.com/en-us/library/ie/ms533899(v=vs.85).aspx

I will update the code sample in the help file.

Deana F.
Technical Support
Wilson WindowWare Inc.

stanl

Quote from: Deana on July 24, 2013, 09:05:56 AM

It is currently unclear why the createTextRange method and the Text property no longer work.

Just navigate to the HP site and hit F12. You will see why.

Deana

Stan,
Are you referring to the HTML Tab | Body Node |  Text - Empty Text Node?

Ron47

Thanks.

It's now working for the HP site but I still get the same Ole "Unknown name" error for
sfchronicle, yahoo and other sites.

Ron

Deana

I recommend using DebugTrace. Simply add DebugTrace(@on,"trace.txt") to the beginning of the script and inside any UDF, run it until the error or completion, then inspect the resulting trace file for clues as to the problem.

Feel free to post the trace file here (removing any private info) if you need further assistance.
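
As a rough sketch of where those DebugTrace calls would go: the UDF name GetBodyText below is made up purely for illustration, and its body just reuses the InnerText code from earlier in this thread. Treat it as an outline, not a tested script.

Code (winbatch)
; Start tracing in the main script. Output goes to trace.txt.
DebugTrace(@ON, "trace.txt")

; GetBodyText is a hypothetical UDF (not from the help file). It is here
; only to show that tracing has to be switched on again inside a UDF.
#DefineFunction GetBodyText(url)
   DebugTrace(@ON, "trace.txt")
   objBrowser = ObjectCreate("InternetExplorer.Application")
   objBrowser.visible = @FALSE
   objBrowser.navigate(url)
   While objBrowser.busy || objBrowser.readystate <> 4
      TimeDelay(0.5)
   EndWhile
   objBrowserDoc = objBrowser.Document
   objBrowserBody = objBrowserDoc.Body
   text = objBrowserBody.InnerText
   ; Release the objects before returning
   objBrowserBody = 0
   objBrowserDoc = 0
   objBrowser.Quit
   objBrowser = 0
   Return text
#EndFunction

Message("InnerText", GetBodyText("http://www.winbatch.com"))
Exit

When the script stops on the error, the last lines of trace.txt should show the statement that failed.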

Ron47

The objBrowserDoc.Body returns a value of 0 with Yahoo and some other sites.

The Trace file is attached

Deana

I wonder if it is some type of timing issue. I am unable to reproduce an error using any of these sites. As I understand it, the ReadyState property should be sufficient for waiting for the webpage to load completely. (Reference: http://stackoverflow.com/questions/10071048/internet-explorer-automation-busy-v-s-readystate-property) Some people use the Busy property, or even both:

Code (winbatch)
While objBrowser.busy || objBrowser.readystate <> 4
       TimeDelay(0.5)
EndWhile
Timedelay(1) ; give it another second


You could maybe add code to suppress that error and reattempt the call to the property. For example:

Code (winbatch)
;url = 'http://www.hp.com'
url = 'http://www.yahoo.com'
;url = 'http://www.sfchronicle.com'

objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate (url) ;to access an internet file
While objBrowser.busy || objBrowser.readystate <> 4
       TimeDelay(0.5)
EndWhile
Timedelay(1)
objBrowserDoc = objBrowser.Document

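objBrowserBody = 0 ; initialize so the tests below are valid even if every attempt fails
For x = 1 to 5
   ErrorMode(@off) ; suppress the "Unknown name" error
   objBrowserBody = objBrowserDoc.Body
   ErrorMode(@Cancel) ; restore normal error handling
   If objBrowserBody != 0 then break
Next
If objBrowserBody == 0
   Pause("Notice","Unable to obtain the Document's Body object.")
   objBrowserDoc = 0
   objBrowser.Quit ; do not leave a hidden IE instance running
   objBrowser = 0
   Exit
Endif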
Message("InnerText", objBrowserBody.InnerText )

;Close objects
objBrowserBody = 0
objBrowserDoc = 0
objBrowser.Quit
objBrowser = 0
Exit


stanl

More and more, sites are using jQuery, iframes, and hidden DIVs to deliver content. You might start by enumerating, say, any DIVs and then looking at their InnerText. A quick way to do this with IE 9/10 is to press F12 and inspect either the HTML or Script sections.
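
A minimal sketch of that kind of DIV enumeration, assuming the page exposes the usual IE document members (getElementsByTagName, plus length and item on the returned collection). The variable names are my own, and it only reports how much text each DIV holds, so adjust it to whatever you need to inspect.

Code (winbatch)
url = 'http://www.winbatch.com'

objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate(url)
While objBrowser.busy || objBrowser.readystate <> 4
   TimeDelay(0.5)
EndWhile
objBrowserDoc = objBrowser.Document

; Collect every DIV in the document
objDivs = objBrowserDoc.getElementsByTagName("DIV")
divCount = objDivs.length

; Build a short report: DIV index and the length of its InnerText
report = ""
For i = 0 to divCount - 1
   objDiv = objDivs.item(i)
   divText = objDiv.innerText
   report = StrCat(report, "DIV ", i, ": ", StrLen(divText), " characters", @CRLF)
   objDiv = 0
Next
Message(StrCat(divCount, " DIV elements found"), report)

;Close objects
objDivs = 0
objBrowserDoc = 0
objBrowser.Quit
objBrowser = 0
Exit

On a page with a lot of DIVs the message box will get long, so writing the report to a file may be more practical.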

Deana

Stan,
When I go to Yahoo.com and press F12 (to launch the IE developer tools), I select the BODY element on the HTML tab. In the right-hand pane of the developer tools, I select the ATTRIBUTES heading and then check the SHOW READ-ONLY PROPERTIES box. In the resulting attributes list, I see InnerText listed, and it does in fact contain text.

This user is actually having a problem obtaining a handle to the BODY object. I still suspect it is some sort of timing issue, where the object is not quite yet loaded.

Reference: http://javascript.info/tutorial/traversing-dom

nrr

I've found that I cannot rely on the ReadyState property, so I always include an error handler to catch the 'Unknown name' error. The code that does the actual page access then runs within a loop, up to a maximum count.

Also, in the code pasted above, I would add a slight time delay in the loop that checks for objBrowserBody != 0.

Nick
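
Something along these lines, as a sketch only. It uses the ErrorMode approach from the code posted above rather than a separate error handler, and the attempt count of 10 and the half-second delay are arbitrary values picked for illustration. Re-fetching the Document on each pass is an extra precaution of mine, not something confirmed to be necessary.

Code (winbatch)
url = 'http://www.yahoo.com'

objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate(url)
While objBrowser.busy || objBrowser.readystate <> 4
   TimeDelay(0.5)
EndWhile

; Initialize both so the tests below are valid even if every attempt fails
objBrowserDoc = 0
objBrowserBody = 0
For attempt = 1 to 10
   ErrorMode(@OFF)       ; suppress the "Unknown name" error
   objBrowserDoc = objBrowser.Document
   objBrowserBody = objBrowserDoc.Body
   ErrorMode(@CANCEL)    ; restore normal error handling
   If objBrowserBody != 0 then Break
   TimeDelay(0.5)        ; slight delay before the next attempt
Next

If objBrowserBody == 0
   Pause("Notice", "Unable to obtain the Document's Body object.")
Else
   Message("InnerText", objBrowserBody.InnerText)
Endif

;Close objects
objBrowserBody = 0
objBrowserDoc = 0
objBrowser.Quit
objBrowser = 0
Exit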


Ron47

I tried this on a number of other machines and it works on all except two: my Win7 machine and an old Vista machine.

On the two machines where it doesn't work, I press F12 and the HTML tab reports 'loading'. It stays on 'loading' and never finishes.

On these computers, the HTML code comes right up with a few other sites that I tried but not yahoo and sfchronicle. 

I guess I'll use this code and, on error, go to a fallback subroutine that'll just click on the page and press Ctrl+A to get the text. Pretty sad, but everything else I've tried fails.

Deana

If IE's very own developer tool is choking then WinBatch will not have much hope....

I currently suspect some IE add-ons might be to blame. On the problem systems try disabling each add-on to see if you can track it down to a particular one.

The following tech article may be of interest for resetting the IE settings on these systems: http://support.microsoft.com/kb/923737

Ron47

Of course you were right. Disabled some add-ons and it works.

Thanks and thanks for improving the code with the objBrowserBody.InnerText addition.

Ron

stanl

I'm glad to see the issue resolved. As an afterthought, I looked at some old code I had written to access a TOA website. Initially, my code had similar issues with document.body and other elements. This was due to the site using jQuery and hidden DIVs to populate menu elements. I ended up rewriting my ieready() UDF to first wait while readystate <> 4 and then, after a pause, check the .busy property. Then, as it turned out, a TimeDelay() of 10 seconds was also required. Not a great Plan B, and eventually a programmer used Firefox and Ruby to optimize...
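
For anyone curious, the general shape was roughly as follows. This is a reconstruction from memory of the steps described above, not the original UDF: the settleSeconds parameter and the one-second pause are illustrative assumptions, and there is no timeout, so it will wait forever on a page that never finishes loading.

Code (winbatch)
; Rough reconstruction of an ieready()-style wait, not the original code.
; settleSeconds is an assumed parameter for the final fixed delay.
#DefineFunction ieready(objBrowser, settleSeconds)
   ; Step 1: wait for the document to report complete (readystate 4)
   While objBrowser.readystate <> 4
      TimeDelay(0.5)
   EndWhile
   ; Step 2: pause, then wait until the browser is no longer busy
   TimeDelay(1)
   While objBrowser.busy
      TimeDelay(0.5)
   EndWhile
   ; Step 3: extra settle time for content that scripts fill in afterwards
   TimeDelay(settleSeconds)
   Return @TRUE
#EndFunction

objBrowser = ObjectCreate("InternetExplorer.Application")
objBrowser.visible = @FALSE
objBrowser.navigate("http://www.winbatch.com")
ieready(objBrowser, 10)
objBrowserDoc = objBrowser.Document
objBrowserBody = objBrowserDoc.Body
Message("InnerText", objBrowserBody.InnerText)

;Close objects
objBrowserBody = 0
objBrowserDoc = 0
objBrowser.Quit
objBrowser = 0
Exit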