openFile gives "file is locked" error on Linux when creating a non-existing file
Harendra Kumar
harendra.kumar at gmail.com
Fri Nov 15 13:15:40 UTC 2024
Coming back to this issue after a break. I reviewed the code carefully
and I cannot find anything where we are doing something in the
application code that affects the RTS locking mechanism. Let me walk
through the steps of the test up to failure and what we are doing in
the code. The test output is like this:
Ensuring dir: /tmp/fsevent_common-3725879775a96dfe/watch-root
createFile: Creating file: /tmp/fsevent_common-3725879775a96dfe/watch-root/file1
FileSystem.Event:
/tmp/fsevent_common-3725879775a96dfe/watch-root/file1: openFile:
resource busy (file is locked)
These steps are done before we start the file watching tests. We first
create a "watch-root" directory using createDirectory and then use
openFile to create a file named "file1" in that directory which we
just created, this openFile operation fails with "file is locked"
error. These steps are executed serially. The directory path is
unique, this path is never used before in this test or in any other
tests. This is the first time we are creating this file.
Now let me explain a bit about the custom file opening/closing logic
in the program. In each test, first we create a watch fd using
inotify_init, and create a Handle from it:
watch_fd <- c_inotify_init
Note that the watch_fd does not belong to an actual file on the file
system, so locking is irrelevant for this fd. FWIW, we avoid the
lockFile code path in this case when creating the Handle from fd.
To watch a file or directory we add that file to the watch using something like:
watch_descriptor <- c_inotify_add_watch watch_fd path flags
Note that we never create a GHC Handle from watch_descriptor. To
remove a file from the watch we do something like this:
rm = c_inotify_rm_watch watch_fd watch_descriptor
This closes the watch_descriptor returned by inotify_add_watch.
In any of these operations we have not inserted anything in the GHC
RTS locking table. In fact we are not even creating a Handle in any
case other than the watch_fd which does not even represent a file on
the file system and we skip the lockFile code when creating this
Handle.
After adding the file to the watch, we open the file, perform
read/write operations on it using the regular GHC file operations, and
then close the file. I am not sure if closing the watch_descriptor
interferes in any way with GHC's file descriptor which is open for
normal reading/writing. I assume the watch_descriptor returned by
inotify_add_watch operation on the file and the regular fd in the GHC
Handle returned by the "openFile" operation on the file are different.
So our closing the watch_descriptor should not affect the removal of
the lock entry by RTS when the file Handle is closed.
Regarding, how to debug this - I have not been able to reproduce this
on a local Linux machine, even when using ext4 as temp directory's
file system. On the local system I can see that inodes are being
reused all the time on ext4, but still the problem is not reproduced.
I ran the tests for several days and did not get a single instance of
the issue. The only way forward seems to be to create a GHC build
which logs the lock table entry and removal, and then run the CI using
that GHC build. Even in CIs I can reproduce the problem only on lucky
days.
-harendra
On Thu, 10 Oct 2024 at 04:32, Viktor Dukhovni <ietf-dane at dukhovni.org> wrote:
>
> On Wed, Oct 09, 2024 at 12:15:32PM +0530, Harendra Kumar wrote:
> > We do use low level C APIs and GHC APIs to create a Handle in the
> > event watching module. But that is for the watch-root and not for the
> > file that is experiencing this problem. So here is how it works. We
> > have a top level directory which is watched for events using inotify.
> > We first create this directory, this directory is opened using
> > inotify_init which returns a C file descriptor. We then create a
> > Handle from this fd, this Handle is used for watching inotify events.
> > We are then creating a file inside this directory which is being
> > watched while we are reading events from the parent directory. The
> > resource-busy issue occurs when creating a file inside this directory.
> > So we are not creating the Handle for the file in question in a
> > non-standard manner, but the parent directory Handle is being created
> > in that manner. I do not know if that somehow affects anything. Or if
> > the fact that the directory is being watched using inotify makes any
> > difference?
> >
> > The code for creating the watch Handle is here:
> > https://github.com/composewell/streamly/blob/bbac52d9e09fa5ad760ab6ee5572c701e198d4ee/src/Streamly/Internal/FileSystem/Event/Linux.hs#L589
> > . Viktor, you may want to take a quick look at this to see if it can
> > make any difference to the issue at hand.
>
> I don't have the cycles to isolate the problem. I still suspect that
> your code is somehow directly closing file descriptors associated with a
> Handle. This then orphans the associated logical reader/writer lock,
> which then gets inherited by the next incarnation of the same (dev, ino)
> pair. However, if the filesystem underlying "/tmp" were actually "tmpfs",
> inode reuse would be quite unlikely, because tmpfs inodes are assigned
> from a strictly incrementing counter:
>
> $ for i in {1..10}; do touch /tmp/foobar; ls -i /tmp/foobar; rm
> /tmp/foobar; done
> 3830 /tmp/foobar
> 3831 /tmp/foobar
> 3832 /tmp/foobar
> 3833 /tmp/foobar
> 3834 /tmp/foobar
> 3835 /tmp/foobar
> 3836 /tmp/foobar
> 3837 /tmp/foobar
> 3838 /tmp/foobar
> 3839 /tmp/foobar
>
> but IIRC you mentioned that on Github "/tmp" is ext4, not "tmpfs"
> (perhaps RAM-backed storage is a more scarce resource), in which
> case indeed inode reuse is quite likely:
>
> $ for i in {1..10}; do touch /var/tmp/foobar; ls -i /var/tmp/foobar; rm
> /var/tmp/foobar; done
> 25854141 /var/tmp/foobar
> 25854142 /var/tmp/foobar
> 25854141 /var/tmp/foobar
> 25854142 /var/tmp/foobar
> 25854141 /var/tmp/foobar
> 25854142 /var/tmp/foobar
> 25854141 /var/tmp/foobar
> 25854142 /var/tmp/foobar
> 25854141 /var/tmp/foobar
> 25854142 /var/tmp/foobar
>
> But since normal open/close of Handles acquires the lock after open, and
> releases it before close, the evidence points to a bypass of the normal
> open file lifecycle.
>
> Your codebase contains a bunch of custom file management logic, which
> could be the source the of problem. To find the problem code path,
> you'd probably need to instrument the RTS lock/unlock code to log its
> activity: (mode, descriptor, dev, ino) tuples being added and removed.
> And strace execution to be able to identify descriptor open and close
> events. Ideally the problem will be reproducible even with strace.
>
> Good luck.
>
> --
> Viktor.
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
More information about the ghc-devs
mailing list