[Haskell-beginners] filesystem verification utility

Anand Mitra mitra at kqinfotech.com
Tue Jan 11 20:36:22 CET 2011


Hi Stephen,

Thanks for your reply. Let me give you my motivation behind this
application so that you have better context. We have been porting ZFS
to Linux and have written a few programs to stress the filesystem and
generate IO load similar to what would be expected in a production
environment. The key objective is to find bugs in the ZFS code. I have
been interested in Haskell for quite some time and used this as an
excuse to write something that might be useful. It is possible that
what I am doing would be much more efficient in C than in Haskell;
however, if my objective is fulfilled to a fair degree, the effort will
be fun as well as worth the time invested in it.

At a high level I expect the program to do the following:
- perform multi-threaded IO at both thread level and process level, to
increase contention on locks and shared resources in the filesystem
data structures and so expose race conditions;
- verify every write that has been issued, i.e. the content of the
file is a function of a random seed, which allows us to detect errors
such as misplaced writes, lost writes, etc.;
- make the contents self-identifying for debugging purposes: if
mysterious data appears within a file, the contents should make it
obvious where it actually belongs, i.e. file name, offset, size and
IO sequence number (a sketch follows this list);
- generate metadata load.
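
To make the self-identifying contents concrete, the kind of block I
have in mind looks roughly like this (an illustrative sketch with
made-up names, not the existing code):

import qualified Data.ByteString.Char8 as B
import System.Random (mkStdGen, randoms)

-- Illustrative sketch: the header names the file, offset, size and IO
-- sequence number, so stray data found elsewhere on disk points back
-- to where it was supposed to live.  The rest of the block is filled
-- from a PRNG with a known seed, so it can be regenerated and checked
-- during verification.  (Char8.pack keeps only the low 8 bits of each
-- random Char, which is fine for filler data.)
selfIdentifyingBlock :: FilePath -> Integer -> Int -> Int -> Int -> B.ByteString
selfIdentifyingBlock path offset size seqNo seed =
    B.take size (header `B.append` filler)
  where
    header = B.pack $ concat
        [ "file=", path
        , " off=", show offset
        , " size=", show size
        , " seq=", show seqNo
        , " seed=", show seed
        , "|" ]
    filler = B.pack (take size (randoms (mkStdGen seed)))

The verifier would recompute the same block from the seed and the
header fields and compare it against what is actually on disk.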

The existing code does only some of these, but it could be expanded to
cover all of them if its performance looks promising.

Getting back to the problem at hand: I had some luck in identifying
the cause of the "handle is closed" error. I suspected it was because
I was mixing calls from System.IO and System.Posix.IO, and after
making them uniform I at least no longer get a "handle is closed"
error, but the program then appeared to hang.
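
For reference, the uniform version does the positional IO entirely
through System.Posix.IO, roughly like the sketch below (simplified and
with illustrative names, not the actual code):

import System.IO (SeekMode (AbsoluteSeek))
import System.Posix.IO
import System.Posix.Types (FileOffset)

-- Simplified sketch: the write path stays on a System.Posix.IO file
-- descriptor, so no Handle from System.IO is mixed in (only the
-- SeekMode type is borrowed from System.IO).  Passing Just a file mode
-- asks openFd to create the file if it does not exist.
writeBlockAt :: FilePath -> FileOffset -> String -> IO ()
writeBlockAt path offset block = do
    fd <- openFd path WriteOnly (Just 0o644) defaultFileFlags
    _  <- fdSeek fd AbsoluteSeek offset   -- position the descriptor
    _  <- fdWrite fd block                -- write the block contents
    closeFd fd

In practice the descriptor would be kept open across many writes; the
point here is only that no Handle machinery is involved.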
Not being familiar with the debugger, I resorted to the more
traditional method of sprinkling putStrLn calls to see what was
happening, and it appeared that the program was hanging just before
starting the random IO. At that point I was distracted by some other
work while the apparently hung program was still running. When I came
back to the xterm, the debug output I had added was there, which told
me the program was not hung but simply taking a very long time. From
this evidence it was instantly clear what was happening.
To simplify the verification process I had been checking that no two
IOs overlapped. It turns out that a large number of the candidates
were overlapping, so an inordinate amount of time was being spent
generating the list; I would bet that is where my memory was going as
well.
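
For context, the check I dropped was roughly of the following shape
(an illustrative sketch, not the exact code): every candidate region
is tested against the whole list of regions accepted so far and
rejected if it overlaps any of them, so the accumulated list gets
rescanned over and over as the file fills up.

import System.Random (StdGen, randomR)

type Region = (Int, Int)          -- (offset, size) in bytes

overlaps :: Region -> Region -> Bool
overlaps (o1, s1) (o2, s2) = o1 < o2 + s2 && o2 < o1 + s1

-- Rejection sampling with a linear overlap check per candidate; the
-- accepted list is rescanned for every attempt, and if the requested
-- regions cannot all fit it never terminates.
nonOverlapping :: Int -> Int -> Int -> StdGen -> [Region]
nonOverlapping fileSize blockSize count g0 = go g0 count []
  where
    go _ 0 acc = acc
    go g n acc =
      let (off, g') = randomR (0, fileSize - blockSize) g
          r         = (off, blockSize)
      in if any (overlaps r) acc   -- scan everything accepted so far
           then go g' n acc        -- reject and try again
           else go g' (n - 1) (r : acc)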
With that check removed from the random list generator, the program is
now able to saturate the disk bandwidth without any trouble. Thanks
for your help; I'll welcome any suggestions to improve the code.

regards
-- 
mitra

On Tue, Jan 11, 2011 at 3:38 PM, Stephen Tetley
<stephen.tetley at gmail.com> wrote:
> Hi Anand
>
> Firstly apologies - my advice from yesterday was trivial advice,
> changing to a better representation of Strings and avoiding costly
> operations (++) is valuable and should improve the performance of the
> program, but it might be a small overall improvement and it doesn't
> get to the heart of things.
>
> Really you need to do two things - one is consider what you are doing
> and evaluate whether it is appropriate for a performance sensitive
> app, the other is to profile and find the bits that are too slow.
>
> I rarely use Control.Concurrent so I can't offer any real experience
> but I'm concerned that it is adding overhead for no benefit. Looking
> at the code and what the comments say it does - I don't think your
> situation benefits from concurrency. A thread in your program could do
> all its work in one go; it's not that you need to be servicing many
> clients (cf. a web server that needs to service many clients without
> individual long waits so it makes sense to schedule them) or that you
> are waiting on other processes making resources available. So for your
> program, any changes to execution caused by scheduling / de-scheduling
> threads (probably) just add to the total time.
>
> If you have a multi-core machine you could potentially benefit from
> parallelism - splitting the work amongst available cores. But in GHC
> forkIO spawns "green threads" which run in the same OS thread so you
> won't be getting any automatic parallelism from the code even if you
> have a multi-core machine.
>
> However don't take my word for this - I could easily be wrong. If you
> want performance you really do need to see what the profiler tells
> you.
>
> Best wishes
>
> Stephen
>
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://www.haskell.org/mailman/listinfo/beginners
>


