Watch for files issue

Started by MW4, January 29, 2019, 06:56:16 PM

MW4

I have used the "watch for files then do something" for years without issue.

I have recently started to see an issue on a manually downloaded internet file.

It comes in like this:

40052985.pdf.crdownload
but the filename is actually 40052985.pdf

The first step is to copy the file to a working folder
The process errors out because of the .crdownload

Any ideas of what to do?
I have tried adding delays and I still have issues.

stanl


MW4

Yeah, I knew what they were, just trying to figure out how to deal with that with the watch program.

My first thought was to add some sort of delay, or to rename any file that comes in with crdownload to get rid of that, etc.

ChuckC

Arbitrarily choosing to rename the file would very likely introduce other problems.  The whole reason the file exists with that extension is that the browser is in the process of downloading it, and the content of the file isn't complete as long as that extension is present.  At the end of a successful download, the file will get renamed to remove that "download in progress" extension, at which point it is eligible for your script to perform some processing on it.  If the download fails and gets interrupted, then the file will persist on disk with that extension until something comes along and deletes it.  If the browser doesn't clean these files up periodically on its own, perhaps even waiting as long as when the browser is restarted, then some other method must be used to A) determine that the incomplete download file is not currently open for access [e.g. truly is not currently having content written to it as part of an active download], and B) confirm that it is "stale" or otherwise older than some threshold indicating that it's safe to delete the file.

In effect, to be safe, your script should ignore these files any time it encounters them.

MW4

Here is what I'm trying right before my "do something"...
Is this a good solution, or would it be better to search for .crdownload?

Code (winbatch)
; StrSub is 1-based, so the last 11 characters start at strlen - 10
if StrSub (NewFile, strlen(newfile) - 10, 11) == ".crdownload" then
   newfile = StrSub (NewFile, 1, strlen(newfile) - 11)
endif

FileCopy(StrCat(FTdir,"\",NewFile),strcat(FTdir,"\Transporter\archive\Rep.pdf"),@false)


Or would modifying this below from -1 to 8000 be better?

Code (winbatch)
  ; -----------------------------------------------------------------------------
    ; Note:  In the following call, "-1" means "No Timeout". 
    ; If you want a timeout you can substitute a time in ms.  (1000 = 1 second.)   
    ; WaitStatus == 258 indicates a timeout.
    ;------------------------------------------------------------------------------
    WaitStatus = DLLCall(User32,long:"MsgWaitForMultipleObjects",long:NumberOfFolders,lpbinary:HandleBinary,long:0,long:-1,long:255)

MW4

I'm trying this, that way I'm not messing with the original newfile...comments?
Overkill?

Code (winbatch)

newfile2 = newfile

if StrSub (NewFile2, strlen(newfile2) -10, 11) == ".crdownload" then
   newfile3 = StrSub (NewFile2, 1, strlen(newfile2) - 11)
   timedelay(20)
else
   NewFile3 = newfile2
endif

FileCopy(StrCat(FTPdir,"\",NewFile3),strcat(FTPdir,"\Transporter\archive\Rep.pdf"),@false)



ChuckC

The point is that if the file has an extension of ".crdownload", it is either open for write access by the browser and in the process of being downloaded, which makes it incomplete and not yet eligible for being copied, or it is not open for access and is incomplete due to the download having been interrupted.  In either case, it's incomplete and isn't eligible to be copied by your script.  That's why I recommended that your script simply ignore all files that have names ending in ".crdownload".  Any attempt on your part to rename or copy the file while its name ends with ".crdownload" is only going to result in creating additional problems for yourself.

If/when the browser completes the download, it will rename the file and remove the ".crdownload" extension.  From what I see in your examples, you'd see something along the lines of "MyFile.pdf.crdownload" being renamed to "MyFile.pdf".
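
A minimal, untested sketch of that "ignore it" check, assuming the watcher hands the script the new name in a variable called NewFile as in the snippets above. The function name IsPartialDownload and the sample file name are made up for illustration and are not part of the Tech Database script:

```winbatch
; Hypothetical helper: returns @TRUE when a file name ends in ".crdownload".
#DefineFunction IsPartialDownload(fname)
   suffix = ".crdownload"
   sufLen = StrLen(suffix)
   ; A name no longer than the suffix itself is not treated as a partial download.
   If StrLen(fname) <= sufLen Then Return @FALSE
   ; StrSub is 1-based, so the last sufLen characters start at length - sufLen + 1.
   tail = StrSub(fname, StrLen(fname) - sufLen + 1, sufLen)
   ; StriCmp returns 0 when the two strings match, ignoring case.
   Return (StriCmp(tail, suffix) == 0)
#EndFunction

; Assumed usage: NewFile stands in for the name the watcher just reported.
NewFile = "40052985.pdf.crdownload"
okToProcess = !IsPartialDownload(NewFile)
If okToProcess
   ; Only names without the suffix reach the "do something" step.
   Message("Watcher", StrCat("Ready to process ", NewFile))
EndIf
```

With the sample name above nothing is processed; a name such as "40052985.pdf" would fall through to the processing branch.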

MW4

I'm with you...

1. file comes in as xx.pdf.crdownload

2. Watch for files sets newfile as xx.pdf.crdownload

3. Using switch, the "do something" process is started

4. at that point newfile thinks it is xx.pdf.crdownload but is actually xx.pdf, so it errors out

I tried to get around that with the rename, but sometimes the download process takes too long.


How do I set it to ignore .crdownload files at the moment it drops into the folder?


stanl

Quote from: MW4 on January 30, 2019, 06:00:34 AM
Yeah, I knew what they were,


Then I would have assumed you knew the issues they presented. For general knowledge, how is the file being acquired? HTTP download via WB script? 3rd party app that is polling Chrome for files? Just a little surprised that a PDF would stay in a wait state that long. For other than general reasons, I have looked into using the Selenium WebDriver within a WB script to download from Chrome and would likely run into .crdownload files myself.


MW4

It's a manually downloaded file, saved directly into the watched folder.

ChuckC

You nearly suggested the solution to your problem when you numbered the steps being taken and where things start to blow up on you.

This is a concurrency / race condition issue where your script is running in one process and the download is occurring in another process [Chrome].  The directory watcher receives a signal that something has changed in the directory [file create/delete], and it figures out what the name of the new file is that's being downloaded.  Provided that the download doesn't complete nearly instantaneously, the file name that gets returned as having been created is the "download in progress" name ending in ".crdownload".  However, your WinBatch script is ever so slightly slower, and with relatively small files it is possible that, by the time your script gets around to interacting with the file on disk, the download has completed and the file name that the directory watcher gave you no longer exists.  In the case of a larger file being downloaded, your script starts to interact with the file that is still being downloaded.

The solution in this case is to always check the name that was given to you by the directory watcher.  If it doesn't end in ".crdownload", then you are free to do what you want with the file immediately as it has been completely downloaded and the browser is done with it.  If the file name you receive ends in ".crdownload", you need to go into a loop with a short delay testing the existence of the file.  As soon as you determine that the file does not exist, truncate the name to remove ".crdownload" from it and then take the result and use that as the name of the file that you are going to process.  I don't know if Chrome opens the temporary download file for exclusive access, but if it does, then the FileExist() function's alternate return value of 2 may be of interest.  Otherwise, as long as the ".crdownload" file exists, you need to test it by calling IntControl(39, ...) and specifying no sharing [P1 = 0], and then open the ".crdownload" file for read-only access.  If you get a sharing violation error [that you need to trap & handle], then Chrome still has the file open and is continuing to download it.  If you get success, then for some reason the download was aborted and the file is incomplete and you need to delete it rather than process it.  If opening it for read returns a file not found error, then in between your test for existence and testing for whether Chrome has it open, the download completed and the file has already been renamed.
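
A rough, untested sketch of the wait loop described above. The names watchDir, NewFile, maxWait and readyFile, the sample folder, and the 300-second limit are all assumptions for illustration. The sketch deliberately leaves out the IntControl(39, ...) sharing test and simply gives up after a timeout, so an aborted download is skipped rather than deleted:

```winbatch
; Assumed inputs: watchDir is the watched folder, NewFile is the name
; reported by the directory watcher. Both values here are placeholders.
watchDir  = "C:\Watched"
NewFile   = "40052985.pdf.crdownload"
suffix    = ".crdownload"
sufLen    = StrLen(suffix)
maxWait   = 300     ; arbitrary limit, in seconds, before giving up
readyFile = ""      ; stays empty unless a completed file is found

; Does the reported name end in ".crdownload"?  (StrSub is 1-based.)
isPartial = @FALSE
If StrLen(NewFile) > sufLen
   tail = StrSub(NewFile, StrLen(NewFile) - sufLen + 1, sufLen)
   If StriCmp(tail, suffix) == 0 Then isPartial = @TRUE
EndIf

If !isPartial
   ; The download had already finished when the watcher fired.
   readyFile = StrCat(watchDir, "\", NewFile)
Else
   partPath = StrCat(watchDir, "\", NewFile)
   waited = 0
   ; FileExist returns @FALSE once the browser has renamed or removed the
   ; file; a return of 1 or 2 means the ".crdownload" file is still there.
   While (FileExist(partPath) != @FALSE) && (waited < maxWait)
      TimeDelay(1)
      waited = waited + 1
   EndWhile
   If FileExist(partPath) == @FALSE
      ; Drop the suffix to get the name the finished download should have.
      donePath = StrSub(partPath, 1, StrLen(partPath) - sufLen)
      If FileExist(donePath) != @FALSE Then readyFile = donePath
   EndIf
EndIf

If readyFile != ""
   ; The "do something" step, mirroring the copy from the earlier posts.
   FileCopy(readyFile, StrCat(watchDir, "\Transporter\archive\Rep.pdf"), @FALSE)
EndIf
```

If the loop times out, or the ".crdownload" file vanishes without a finished file appearing [a cancelled download], readyFile stays empty and nothing is copied.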



td

Wow, a nice detailed explanation.  I can only wish I had that much patience.

Not saying that the following is a better solution.  Only that it might be something worth considering.  The OP's script appears to be a partial cut&paste from the Tech Database article:

http://techsupt.winbatch.com/webcgi/webbatch.exe?techsupt/nftechsupt.web+WinBatch/Samples~from~Users+Watch~for~Files~and~Do~Something.txt

Based on that, one suggestion would be to explore changing the dwNotifyFilter parameter of "FindFirstChangeNotification" so that only file renames, or some other event that indicates a completed download, would fire.  See the MSFT documentation:

https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-findfirstchangenotificationa

As another alternative, you could use "ReadDirectoryChangesW" to detect the type of change event.  The MSFT documentation:

https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-readdirectorychangesw

Note that "ReadDirectoryChangesW" appears to only have a Unicode version.