File search problem with .doc file

Started by mcvpjd3, December 16, 2015, 07:31:54 AM


mcvpjd3

I've been given the task of searching through thousands of files for a few key words. So I created a script that loops through the various folders, the various document types, and the various words I'm looking for. I had the script dump the name of the file it had just searched to a temp file so I could monitor its progress. This all works great, except that every now and then it stopped (on the server the CPU went to 100% and the results and temp file stopped updating). I narrowed it down to some .DOC files. I would remove what I thought was the offending .DOC file and it would carry on until it came across another troublesome .DOC file. So I decided to find out what was going on.

I found a particular .DOC file that caused the issue, and started stripping it down bit by bit until it no longer failed, so that I could find out what the issue was. This .DOC had a table with loads of pictures and the file was about 100MB to start with, but no matter what I removed from the table, it kept failing. I eventually got it down to just one column in one table, with no pictures, only text and carriage returns, about 6 pages in all. This still fails. However, when I remove about a page of carriage returns, the search works fine. Undo the removal and it fails again. Has anyone got any idea what's going on and how I can get around this? I need to search these documents without having to remove tables.

I've enclosed a cut-down version of the script (which still stops at the a=SrchNext(hand) line) and the .DOC in question so you can see if you can replicate the issue. I'm using 2014A.

Thanks

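; Load the Search extender from the script's folder.
thisdir=dirscript()
extender=strcat(thisdir,"wsrch34i.dll")
AddExtender(extender)
display(3,"Start","Process Start")
; Search test-all.doc in C:\myfolder for the string "searchme".
hand=srchinit("C:\myfolder","test-all.doc","searchme","",50)
While 1
   ; The script stops responding on this call for the problem file.
   a=SrchNext(hand)
   If a=="" Then Break
EndWhile
SrchFree(hand)
display(3,"End","Process End")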

td

Without attempting to duplicate the problem, it can still be said that it is likely some kind of bug in the search DLL used by the extender. Unfortunately, that is a 3rd party DLL which WindowWare has no control over, so there isn't much that can be done about it in the short term.
"No one who sees a peregrine falcon fly can ever forget the beauty and thrill of that flight."
  - Dr. Tom Cade

td

We were able to confirm that you have encountered some kind of bug in the 3rd party search DLL that provides the functionality for the Search extender. You may wish to consider using the regular expression functionality available in WinBatch through either the CLR host (dotNet) or COM Automation. Another option would be to use WinBatch to drive a free third party tool like GAWK to perform your search.

Perhaps someone else has a tried-and-true suggestion for an alternative.

td

Here's a simple workaround. It has several limitations, so it may not work for you, but it might (or might not) inspire something that will.

Code (winbatch):
; Will not work with Unicode files,
; and minimally tested...
AddExtender("wwfaf44i.dll")

lFiles = ""
hFaf = fafOpen(DirScript(), "*.doc", 16)
While 1
   strFound = fafFind(hFaf)
   If strFound == '' Then Break

   ; Read the whole file into a binary buffer and scan it for the key word.
   hBin = BinaryAlloc(FileSize(strFound)+1)
   BinaryRead(hBin, strFound)
   ; BinaryIndexEx returns -1 when the string is not found.
   If BinaryIndexEx(hBin, 0, "Ensure", @FWDSCAN, 0) > -1
      lFiles = ItemInsert(strFound, -1, lFiles, @LF)
   EndIf
   BinaryFree(hBin)
EndWhile
fafClose(hFaf)

;; Test result.
Message("Found Files", lFiles)
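
The original task involved several key words and a progress file, so here is a rough, untested sketch of how the workaround above might be extended to cover both. The key word list, the "|" delimiter, and the progress.txt log name are made up for illustration. The fafOpen arguments are copied unchanged from the code above, so it assumes the script sits in the folder being searched. The same Unicode limitation applies.

```winbatch
; Sketch only: several key words plus a progress log.
; lWords and strLog are hypothetical values, not from the original script.
AddExtender("wwfaf44i.dll")

lWords = "searchme|another|third"               ; assumption: "|"-delimited key words
strLog = StrCat(DirScript(), "progress.txt")    ; assumption: progress log file

lHits = ""
hLog  = FileOpen(strLog, "WRITE")
hFaf  = fafOpen(DirScript(), "*.doc", 16)       ; same arguments as the code above
While 1
   strFound = fafFind(hFaf)
   If strFound == "" Then Break

   ; Log the file name before scanning it, so a stall points at the culprit.
   FileWrite(hLog, strFound)

   hBin = BinaryAlloc(FileSize(strFound)+1)
   BinaryRead(hBin, strFound)
   For i = 1 To ItemCount(lWords, "|")
      strWord = ItemExtract(i, lWords, "|")
      ; BinaryIndexEx returns -1 when the string is not found.
      If BinaryIndexEx(hBin, 0, strWord, @FWDSCAN, @FALSE) > -1
         lHits = ItemInsert(StrCat(strFound, " : ", strWord), -1, lHits, @LF)
      EndIf
   Next
   BinaryFree(hBin)
EndWhile
fafClose(hFaf)
FileClose(hLog)

Message("Matches", lHits)
```

Each file is read into the buffer once and then checked against every key word, which avoids re-reading large documents for each word.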


mcvpjd3

Sorry for the late reply.

Thanks for the replies. I've used the fafFind option and that worked great.

Thanks