Sure, we will provide you with more details of our installation but
let me first say that, if memory serves, we did not pull that
number out of a hat. I believe that there is a formula in one of
the lustre tuning manuals for calculating the recommended timeout
value. I'll have to take a moment to go back and find it. Anyway,
if you use that formula for our cluster, the recommended timeout
value, I think, comes out to be *much* larger than 1000.
Later this morning, we will go back and find that formula and share
with the list how we came up with our timeout. Perhaps you can show
us where we are going wrong.
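For anyone following along, this is roughly how the timeout is inspected
and changed on Lustre 1.6 (a sketch only; "ufhpc" is a placeholder fsname
and 300 is just an example value, not the number from the formula):

```shell
# Read the current obd_timeout (in seconds) on a client or server:
cat /proc/sys/lustre/timeout

# Set it persistently for the whole file system, run on the MGS node
# ("ufhpc" is a placeholder fsname, 300 an example value):
lctl conf_param ufhpc.sys.timeout=300

# Or change it temporarily on a single node until the next remount:
echo 300 > /proc/sys/lustre/timeout
```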
One more comment.... We just brought up our second large lustre file
system. It is 80+ TB served by 24 OSTs on two (pretty beefy)
OSSs. We just achieved over 2GB/sec of sustained (large block,
sequential) I/O from an aggregate of 20 clients. Our design target
was 1.0 GB/sec/OSS and we hit that pretty comfortably. That said,
when we first mounted the new (126.96.36.199) file system across all 400
nodes in our cluster, we immediately started getting "transport
endpoint failures" and evictions. We looked rather intensively for
network/fabric problems (we have both o2ib and tcp nids) and could
find none. All of our MPI apps are/were running just fine. The
only way we could get rid of the evictions and transport endpoint
failures was by increasing the timeout. Also, we knew to do this
based on our experience with our first lustre file system (1.6.3 +
patches) where we had to do the same thing.
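Just to spell out the arithmetic behind the design target above (a trivial
check, using only the numbers quoted in this message):

```python
# Sanity check of the sustained-throughput numbers quoted above.
oss_count = 2                 # two OSS nodes serving the 24 OSTs
aggregate_gbps = 2.0          # measured sustained large-block sequential I/O, GB/s
design_target_per_oss = 1.0   # design target, GB/s per OSS

per_oss = aggregate_gbps / oss_count
print(per_oss)                # 1.0 GB/s per OSS, right at the target
assert per_oss >= design_target_per_oss
```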
Like I said, a little bit later, Craig or I will post more details
about our implementation. If we are doing something wrong with
regard to this timeout business, I would love to know what it is.
UF HPC Center
Post by Brian J. Murrell
Post by Aaron S. Knister
I think I tried that before and it didn't help, but I will try it
again. Thanks for the suggestion.
Just so you guys know, 1000 seconds for the obd_timeout is very, very
large! As you could probably guess, we have some very, very big Lustre
installations and to the best of my knowledge none of them are using
anywhere near that. AFAIK (and perhaps a Sun engineer with closer
experience to some of these very large clusters might correct me) the
largest value that the largest clusters are using is in the
neighbourhood of 300s. There has to be some other problem at play here
if you need 1000s.
Can you both please report your Lustre and kernel versions? I know you
said "latest", Aaron, but some version numbers would be more solid to go on.
Lustre-discuss mailing list