[Haskell-cafe] MD5?

Andrew Coppin andrewcoppin at btinternet.com
Sat Nov 17 09:45:29 EST 2007

Neil Mitchell wrote:
> Hi
>> The MD5SUM.EXE file I have chokes if you ask it to hash a file in
>> another directory. It will hash from stdin, or from a file in the
>> current directory, but point-blank refuses to hash anything else.
> Try http://www.cs.york.ac.uk/fp/yhc/dependencies/UnxUtils.zip - that
> has an MD5SUM program in it that seems to work fine on things in
> different directories. It also has many other great utilities in it.

Negative. It gives strange output if the pathname contains any 
backslashes. (Each backslash appears twice, and an additional backslash 
appears just before the hash value. Very odd...)

I spent a while playing with Google, and found many, many 
implementations of MD5. Every single one of them did *something* strange 
under certain conditions. Most frustrating! Well anyway, I eventually 
settled on a program called MD5DEEP.EXE, which seems to work just about 
well enough to be useful.

> I'm trying to imagine what mistake the authors of your version of
> MD5SUM must have made to screw up files in different directories, but
> it eludes me...

It seems Unix tools are typically compiled for Windows with the aid of a 
Unix emulation layer. These often do all sorts of strange path munging to 
make Windows look like Unix. That's probably the source of the problem...

BTW, while I'm here... I sat down and wrote my own MD5 implementation. 
It's now 95% working. (The padding algorithm goes wrong for certain 
message lengths.) I doubt it'll ever be fast, but I wanted to see how 
hard it would be to implement. The hard part, ridiculously enough, 
wasn't MD5 itself. It's all the datatype conversions. Nowhere in the 
Haskell libraries can I find any of these functions:

  pack8into16 :: [Word8] -> Word16
  pack8into32 :: [Word8] -> Word32
  unpack16into8 :: Word16 -> [Word8]
  unpack32into8 :: Word32 -> [Word8]
  pack8into16s :: [Word8] -> [Word16]
  pack8into32s :: [Word8] -> [Word32]

I had to write all these myself, by hand, and then check that I got 
everything the right way round and so forth. (And every now and then I 
find an edge case where these functions go wrong.) Of course, on top of 
that, MD5 uses something really stupid called "little endian integers". 
In other words, to interpret the data, you have to read it partially 
backwards, partially forwards. Really awkward to get right!
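
For illustration, here is a minimal sketch of how conversions with the signatures above might be written using only Data.Bits and Data.Word from base. It assumes little-endian order throughout (the first byte in the list is the least significant), which is what MD5 needs. The helper name `chunks` is hypothetical, and zero-extending a short trailing chunk is an assumption, not something the original functions are known to do.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word16, Word32)

-- Little-endian packing: the FIRST byte ends up in the LOWEST 8 bits.
-- Only the first 2 (or 4) bytes are used; a shorter list is zero-extended.
pack8into16 :: [Word8] -> Word16
pack8into16 = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 . take 2

pack8into32 :: [Word8] -> Word32
pack8into32 = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 . take 4

-- Little-endian unpacking: least significant byte comes first.
-- fromIntegral truncates to the low 8 bits when narrowing to Word8.
unpack16into8 :: Word16 -> [Word8]
unpack16into8 w = [fromIntegral (w `shiftR` s) | s <- [0, 8]]

unpack32into8 :: Word32 -> [Word8]
unpack32into8 w = [fromIntegral (w `shiftR` s) | s <- [0, 8, 16, 24]]

-- Hypothetical helper: split a list into pieces of length n
-- (the last piece may be shorter).
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (h, t) = splitAt n xs in h : chunks n t

pack8into16s :: [Word8] -> [Word16]
pack8into16s = map pack8into16 . chunks 2

pack8into32s :: [Word8] -> [Word32]
pack8into32s = map pack8into32 . chunks 4
```

With these definitions, `pack8into32 [0x78, 0x56, 0x34, 0x12]` gives `0x12345678`, and `unpack32into8` reverses it. Writing the byte order down once in a fold or a list of shift amounts keeps the "which way round" question in one place.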

But, after a few hours last night and a few more this morning, I was 
able to get the main program to work properly. If I can just straighten 
out the message padding code, I'll be all set... Then I can see about 
measuring just how slow it is. :-}
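
For reference, a minimal sketch of the padding step as RFC 1321 describes it: append a single 0x80 byte, then zero bytes until the length is 56 modulo 64, then the original length in bits as a 64-bit little-endian integer. The function name `md5Pad` and the list-of-bytes representation are assumptions made for illustration, not the code discussed above. The awkward message lengths are typically those of 56 to 63 bytes modulo 64, where the padding spills into an extra block.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word8, Word64)

-- Pad a message (as a list of bytes) to a multiple of 64 bytes.
md5Pad :: [Word8] -> [Word8]
md5Pad msg = msg ++ [0x80] ++ replicate zeros 0 ++ lenBytes
  where
    len = length msg
    -- After the 0x80 byte we need len + 1 + zeros == 56 (mod 64).
    -- Haskell's `mod` is non-negative for a positive divisor, so this
    -- also covers the case where len mod 64 > 55.
    zeros = (55 - len) `mod` 64
    -- Original length in BITS, as a 64-bit value.
    bitLen = fromIntegral len * 8 :: Word64
    -- Emitted least significant byte first (little endian).
    lenBytes = [fromIntegral (bitLen `shiftR` s) | s <- [0, 8 .. 56]]
```

A 55-byte message pads to exactly one 64-byte block, while a 56-byte message needs two, since the 0x80 byte and the 8-byte length no longer fit in the first block.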

Most amusing moment: Trying to run the GHC debugger, and then realising 
that you have to actually install the new version of GHC first...
