Discussion: Cannot send after transport endpoint shutdown (-108)
Aaron S. Knister
2008-03-04 20:31:19 UTC
This morning I've had both my InfiniBand and TCP Lustre clients hiccup. They are evicted from the server, presumably as a result of their high load and consequent timeouts. My question is: why don't the clients reconnect? Both the InfiniBand and TCP clients give the following message when I type "df" - Cannot send after transport endpoint shutdown (-108). I've been battling with this on and off for a few months now. I've upgraded my InfiniBand switch firmware, and all the clients and servers are running the latest version of Lustre and the Lustre-patched kernel. Any ideas?

-Aaron
Charles Taylor
2008-03-04 20:41:04 UTC
We've seen this before as well. Our experience is that the default
obd_timeout is far too small for large clusters (ours is 400+
nodes), and the only way we avoid these errors is by setting it to
1000, which seems high to us but appears to work and puts an end to
the transport endpoint shutdowns.

On the MDS....

lctl conf_param srn.sys.timeout=1000

You may have to do this on the OSSs as well unless you restart the
OSSs, but I could be wrong on that. You should check it everywhere
with...

cat /proc/sys/lustre/timeout
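
If you want to check every node quickly, a loop along these lines
should do it (just a sketch; substitute your real OSS and client
hostnames):

for h in oss01 oss02 client001; do
    ssh $h 'echo "$(hostname): $(cat /proc/sys/lustre/timeout)"'
done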
Aaron S. Knister
2008-03-04 20:55:13 UTC
I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.

-Aaron

Aaron Knister
2008-03-04 22:42:07 UTC
I made this change and clients are still being evicted. This is very
frustrating. It happens over TCP and InfiniBand. My timeout is 1000.
Does anybody know why the clients don't reconnect?
Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron-***@public.gmane.org
Craig Prescott
2008-03-05 00:37:05 UTC
Hi Aaron;

As Charlie mentioned, we have 400 clients and a timeout
value of 1000 is "enough" for us. How many clients do you
have? If it is more than 400, or the ratio of your o2ib/tcp
clients is not like ours (80/20), you may need a bigger value.

Also, we have observed that when we set the timeout
on our MGS/MDS machine via:

lctl conf_param <fsname>.sys.timeout=1000

it occasionally does not "take" everywhere. That is, you should check
your OSSes and clients to verify that the correct timeout
is reflected in /proc/sys/lustre/timeout. If it isn't, just echo
the correct number in there. If you have already checked this, maybe
try a bigger value?
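
To spell out the "echo" step, on any OSS or client where the value
looks wrong, something like this (just a sketch) is all it takes:

echo 1000 > /proc/sys/lustre/timeout
cat /proc/sys/lustre/timeout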

Hope that helps,
Craig Prescott
Brian J. Murrell
2008-03-04 21:04:32 UTC
Post by Aaron S. Knister
I think I tried that before and it didn't help, but I will try it
again. Thanks for the suggestion.
Just so you guys know, 1000 seconds for the obd_timeout is very, very
large! As you could probably guess, we have some very, very big Lustre
installations and to the best of my knowledge none of them are using
anywhere near that. AFAIK (and perhaps a Sun engineer with closer
experience to some of these very large clusters might correct me) the
largest value that the largest clusters are using is in the
neighbourhood of 300s. There has to be some other problem at play here
that you need 1000s.

Can you both please report your lustre and kernel versions? I know you
said "latest" Aaron, but some version numbers might be more solid to go
on.

b.
Charles Taylor
2008-03-05 11:56:46 UTC
Sure, we will provide you with more details of our installation, but
let me first say that, if recollection serves, we did not pull that
number out of a hat. I believe there is a formula in one of
the Lustre tuning manuals for calculating the recommended timeout
value. I'll have to take a moment to go back and find it. Anyway,
if you use that formula for our cluster, the recommended timeout
value, I think, comes out to be *much* larger than 1000.

Later this morning, we will go back and find that formula and share
with the list how we came up with our timeout. Perhaps you can show
us where we are going wrong.

One more comment... We just brought up our second large Lustre file
system. It is 80+ TB served by 24 OSTs on two (pretty beefy)
OSSs. We just achieved over 2 GB/sec of sustained (large-block,
sequential) I/O from an aggregate of 20 clients. Our design target
was 1.0 GB/sec per OSS and we hit that pretty comfortably. That said,
when we first mounted the new (1.6.4.2) file system across all 400
nodes in our cluster, we immediately started getting "transport
endpoint failures" and evictions. We looked rather intensively for
network/fabric problems (we have both o2ib and tcp NIDs) and could
find none. All of our MPI apps are/were running just fine. The
only way we could get rid of the evictions and transport endpoint
failures was by increasing the timeout. We also knew to do this
based on our experience with our first Lustre file system (1.6.3 +
patches), where we had to do the same thing.

Like I said, a little bit later, Craig or I will post more details
about our implementation. If we are doing something wrong with
regard to this timeout business, I would love to know what it is.

Thanks,

Charlie Taylor
UF HPC Center
Frank Leers
2008-03-05 16:03:14 UTC
Post by Brian J. Murrell
AFAIK (and perhaps a Sun engineer with closer experience to some of
these very large clusters might correct me) the largest value that
the largest clusters are using is in the neighbourhood of 300s.
I can confirm that at a recent large installation with several thousand
clients, the default of 100 is in effect.
Aaron Knister
2008-03-05 16:08:33 UTC
That's very strange. What interconnect is that site using?

My versions are -

Lustre - 1.6.4.2
Kernel (servers) - 2.6.18-8.1.14.el5_lustre.1.6.4.2smp
Kernel (clients) - 2.6.18-53.1.13.el5
Frank Leers
2008-03-05 16:33:53 UTC
Post by Aaron Knister
That's very strange. What interconnect is that site using?
Not really strange, but -

SDR IB/OFED

lustre 1.6.4.2
2.6.18.8 clients
2.6.9-55.0.9 servers
Aaron Knister
2008-03-05 18:37:33 UTC
Could you tell me what version of OFED was being used? Was it the
version that ships with the kernel?

-Aaron
Frank Leers
2008-03-05 19:03:11 UTC
Post by Aaron Knister
Could you tell me what version of OFED was being used? Was it the
version that ships with the kernel?
OFED version is 1.2.5.4
Aaron Knister
2008-03-05 23:00:40 UTC
Are the clients SuSE, Red Hat, or another distro? I can't get OFED
1.2.5.4 to build on RHEL 5, but I'm working on that.

Charles Taylor
2008-03-05 16:34:28 UTC
Well, go figure. We are running...

Lustre: 1.6.4.2 on clients and servers
Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
Platform: x86_64 (Opteron 275s, mostly)
Interconnect: IB, Ethernet
IB Stack: OFED 1.2

We already posted our procedure for patching the kernel, building
OFED, and building Lustre, so I don't think I'll go into that
again. Like I said, we just brought a new file system online.
Everything looked fine at first with just a few clients mounted.
Once we mounted all 408 (or so), we started getting all kinds of
"transport endpoint failures" and the MGSs and OSTs were evicting
clients left and right. We looked for network problems and could
not find any of any substance. Once we increased the
obd/lustre/system timeout setting as previously discussed, the
errors vanished. This was consistent with our experience with 1.6.3
as well. That file system has been online since early December.
Both file systems appear to be working well.
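
For the archives, the change we made boils down to what was posted
earlier in this thread (our fsname is "srn"; substitute your own):

# on the MGS/MDS, set it persistently:
lctl conf_param srn.sys.timeout=1000

# then confirm on every OSS and client that it took effect:
cat /proc/sys/lustre/timeout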

I'm not sure what to make of it. Perhaps we are just masking
another problem. Perhaps there are some other, related values
that need to be tuned. We've done the best we could but I'm sure
there is still much about Lustre we don't know. We'll try to get
someone out to the next class but until then, we're on our own, so to
speak.

Charlie Taylor
UF HPC Center
Aaron Knister
2008-03-05 18:09:53 UTC
Are you running DDR or SDR IB? Also what hardware are you using for
your storage?
Charles Taylor
2008-03-05 18:30:44 UTC
SDR on the IB side. Our storage is RAID Inc. Falcon 3s, host
attached via 4 Gb QLogic FC HBAs.

http://www.raidinc.com/falcon_III.php

Regards,

Charlie
Aaron Knister
2008-03-05 18:39:20 UTC
I wonder if the issue is related to the kernels being run on the
servers. Both Mr. Taylor's setup and mine are running the 2.6.18
kernel on the servers; however, the setup mentioned with a timeout
of 100 was using the 2.6.9 kernel on the servers.