Mailing List Archive: DRBD: Users

drbd + 10gig network

 

 



mike at dev-zero

Oct 14, 2009, 10:21 PM

Post #1 of 16
drbd + 10gig network

first off, hello everybody. i'm somewhat new to drbd and definitely new
to the mailing list.

i am trying to set up a cheap alternative to an iscsi san using some
somewhat commodity hardware and drbd. i happen to have some 10 gigabit
network interfaces around so i thought they would make a great interconnect
for the drbd replication and probably as the interconnect to the rest of
the network.

things were going well in my small proof of concept but when i made the
jump to the 10 gigabit network interfaces, i started running into
trouble with drbd not being able to complete a synchronization. it will
get anywhere between 5 and 15 percent done (on a 2TB volume) and then
stall. the only thing i have been able to do to get things going again
is to take down the network interface, stop drbd, bring the interface
back up, start drbd, and wait for it to stall again. i have to take
down the network interface because drbd won't respond until then.

in dmesg on the node with the UpToDate disk, i see errors like this in
the kernel log.

[191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
[191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
[191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
[191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
[191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
[191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
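
The numbers in that log can be checked with a few lines of Python. This is a reader's sketch, not something taken from the drbd source. It shows that the KB figure is exactly four times the bit count, that the first ko value equals 2**32 - 1, and how far apart the timeouts are. Reading ko as an unsigned 32-bit counter that was decremented from 0 is an assumption based only on these values; the thread does not confirm it.

```python
import re

# Kernel log lines from the post above, with the mail client's wrapping undone.
LOG = """\
[191401.876167] drbd0: Began resync as SyncSource (will sync 1809012776 KB [452253194 bits set]).
[191409.068152] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967295
[191416.533556] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967294
[191423.531804] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967293
[191429.888326] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967292
[191437.658299] drbd0: [drbd0_worker/24334] sock_sendmsg time expired, ko = 4294967291
"""

def parse(log):
    """Return (kb_to_sync, bits_set, [(timestamp, ko), ...]) from dmesg text."""
    kb = bits = None
    timeouts = []
    for line in log.splitlines():
        stamp = re.match(r"\[\s*(\d+\.\d+)\]", line)
        if not stamp:
            continue  # skip anything that is not a timestamped kernel line
        ts = float(stamp.group(1))
        sync = re.search(r"will sync (\d+) KB \[(\d+) bits set\]", line)
        if sync:
            kb, bits = int(sync.group(1)), int(sync.group(2))
        expired = re.search(r"time expired, ko = (\d+)", line)
        if expired:
            timeouts.append((ts, int(expired.group(1))))
    return kb, bits, timeouts

kb, bits, timeouts = parse(LOG)
kb_per_bit, remainder = divmod(kb, bits)
gaps = [later[0] - earlier[0] for earlier, later in zip(timeouts, timeouts[1:])]
ko_values = [ko for _, ko in timeouts]

print("KB per bitmap bit:", kb_per_bit, "remainder:", remainder)
print("first ko equals 2**32 - 1:", ko_values[0] == 2**32 - 1)
print("seconds between timeouts:", [round(g, 2) for g in gaps])
```

The timeouts arrive a little over six to just under eight seconds apart, and ko drops by one each time.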

in my troubleshooting, i tried changing the replication to use the
gigabit network interfaces already in the system and the synchronization
completed. i also tried a newer kernel and a newer version of drbd.

i am doing this on debian lenny using the 2.6.26 kernel and drbd 8.0.14
that ship with the distro. the system is a single opteron 2346 on a
supermicro h8dme-2 with an intel 10 gigabit nic. the underlying device is
a software raid10 with linux md. i did try a 2.6.30 kernel and drbd 8.3
but it didn't help.

has anyone seen anything like this or have any recommendations?

thanks in advance

mike

_______________________________________________
drbd-user mailing list
drbd-user[at]lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


muffaleta at gmail

Oct 15, 2009, 1:52 AM

Post #2 of 16
Re: drbd + 10gig network

I'm a little confused. DRBD and iSCSI are totally different things.
What exactly are you trying to do, architecturally?




--
Chris Chen <muffaleta[at]gmail.com>
"The fact that yours is better than anyone else's
is not a guarantee that it's any good."
-- Seen on a wall


mike at dev-zero

Oct 15, 2009, 9:26 AM

Post #3 of 16
Re: drbd + 10gig network

Christopher Chen wrote:
> I'm a little confused. DRBD and iSCSI are totally different things.
> What exactly are you trying to do, architecturally?
>
they are different things. i am planning on setting up two servers with
a lot of storage space on them, with drbd sync'ing the data between the
two, and then running one or more iscsi targets on top of the drbd volume
that multiple iscsi initiators will connect to. i'm also going to use either
heartbeat or pacemaker to set up automatic failover of the drbd roles
and the iscsi targets. this should give me a somewhat cheap alternative
to buying a san.

that is a brief overview. the current problem though is that drbd is
having trouble completing a sync over the 10 gig interfaces.

mike


mike at dev-zero

Oct 15, 2009, 9:36 AM

Post #4 of 16
Re: drbd + 10gig network

Johan Verrept wrote:
> <disclaimer> I am not an expert at drbd </disclaimer>
>
> I have seen similar things (stalling drbd) mentioned on the mailing
> list. Mostly the first reaction is to point a finger at your network
> interface/drivers. Perhaps you should look into that first? From your
> symptoms, I would strongly suspect the problem is there (especially
> since it works fine once you switch interfaces). Perhaps run a few iperf
> tests to see if it runs smoothly?
>
> J.
>
>
i realized right after i sent my request that i hadn't done any load or
integrity testing on the 10 gigabit interfaces since i moved them around
and reinstalled the OS. i had previously used these nics for stuff other
than drbd and so i assumed that things were still operating properly. i
am going to start some testing on the interfaces and see if i see any
problems but considering my previous experience with these cards, i'm
doubting that is the problem. no harm in checking though. i'll let the
list know the results of my test.

has anyone else on the list been able to do drbd over 10 gigabit links
before and been successful with it? if so, what was your hardware and
software set up to do it?

thx.


mike at dev-zero

Oct 16, 2009, 1:21 AM

Post #5 of 16
Re: drbd + 10gig network

i did some performance and load testing on the 10 gig interfaces today.
using a variety of methods, i moved > 10 TiB of data across the link
without dropped packets or connection interruptions. i did things like `cat
/dev/zero | nc` on one box to `nc > /dev/null` on the other, and ran iperf
and NPtcp between the nodes. no kernel errors, no connection drops, no
dropped packets listed in ifconfig for the devices. i even tried
building the latest drivers for the nic from intel and the problem remains.
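
For readers who want to reproduce the `cat /dev/zero | nc` style of test without netcat, here is a minimal Python sketch of the same idea: one side streams zeros over TCP and the other side discards them and counts bytes. The function names, chunk size, and byte counts are illustrative choices, not anything used in the thread. As written, the demo runs over loopback only; to exercise a real link, one would run the sink on one box and the sender on the other.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call; an arbitrary choice

def sink(server_sock, result):
    """Accept one connection and discard the data, like `nc -l > /dev/null`."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # peer closed the connection
                break
            total += len(data)
    result["received"] = total

def stream_zeros(host, port, nbytes):
    """Send nbytes of zeros, like a size-limited `cat /dev/zero | nc host port`."""
    buf = bytes(CHUNK)  # a buffer of zero bytes
    sent = 0
    with socket.create_connection((host, port)) as s:
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            s.sendall(buf[:n])
            sent += n
    return sent

def loopback_demo(nbytes=8 * 1024 * 1024):
    """Run sender and sink in one process over 127.0.0.1 and compare counts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=sink, args=(server, result))
    receiver.start()
    sent = stream_zeros("127.0.0.1", port, nbytes)
    receiver.join()
    server.close()
    return sent, result["received"]

if __name__ == "__main__":
    sent, received = loopback_demo()
    print(f"sent {sent} bytes, received {received} bytes")
```

A test like this only shows that a plain TCP stream survives; it says nothing about how drbd drives the socket during a resync.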

any other thoughts?

mike


lars.ellenberg at linbit

Oct 16, 2009, 4:14 AM

Post #6 of 16
Re: drbd + 10gig network

On Fri, Oct 16, 2009 at 02:21:40AM -0600, Mike Lovell wrote:
> any other thoughts?

try DRBD 8.3.4.
It handles some settings more gracefully.

On <= 8.3.2, try decreasing sync-rate and increasing "max-buffers".
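
As a rough illustration of where those two settings live, a drbd.conf fragment for DRBD 8.x might look like the sketch below. The resource name "r0" and both values are placeholders chosen for the example, not figures recommended anywhere in this thread; suitable values depend on the hardware.

```
# Sketch only: "r0" and the numbers are placeholders.
resource r0 {
  syncer {
    rate 30M;           # cap on resync bandwidth (the "sync-rate")
  }
  net {
    max-buffers 8000;   # raised as suggested above
  }
  # the existing per-host "on" sections stay as they are
}
```

After editing the file on both nodes, `drbdadm adjust r0` applies the changed settings to the running resource.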

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


mike at dev-zero

Oct 23, 2009, 2:40 PM

Post #7 of 16
Re: drbd + 10gig network

Lars Ellenberg wrote:
> try DRBD 8.3.4.
> It handles some settings more gracefully.
>
> On <= 8.3.2, try decreasing sync-rate, and increase "max-buffers".
>
>
i spent some more time on this problem and still haven't been able to
resolve it. i tried changing from the opteron platform that i was
originally using to a xeon (nehalem) platform which has the IOAT and DCA
optimizations but using the same nics. that didn't fix the problem. it
greatly improved the performance while it was sync'ing but also
exaggerated the problem. when the sync hangs, the drbd module is almost
completely unresponsive. i tried doing a pause-sync and then resume-sync
thinking that it would nudge the module into working but the commands
time out talking to the module. i can still cat /proc/drbd but that is
about it until i take down the network interface and drbd detects the
network change. if i then bring the interface back up, drbd detects it
can talk again but then only syncs a couple of megabytes before stalling
again. i have tried every way i can think of to check the integrity of
the network link between the hosts and everything says they are fine,
except that during a ping flood a few out of a couple hundred thousand
packets get dropped. but tcp should be able to handle that amount of
loss without coughing.

but, since i don't have any other 10gig equipment to test with, i can't
say for sure that it is not the driver or network cards. i was able to
convince my boss to let me buy two new 10gig nics so that i can test on
a different stack. does anyone on the list have any preferences on
network cards or chipsets for 10gig ethernet? i have been using
ones with an intel 82598 chipset. i am eyeing ones from myricom and
chelsio. does anyone have any experience with these or any other
recommendations?

thanks

mike


igor at 3gnt

Oct 29, 2009, 11:40 AM

Post #8 of 16
Re: drbd + 10gig network

Hi,

Are you still using software raid?

Have you tried using hardware raid?

The only time i tried drbd with mdraid, well, let's say i never
will try that again; from now on, we only use areca raid cards. Anyway,
software raid is slow. if you want to do this thing, buy raid
controllers. the best ones are areca, 3ware, and lsi is bad; i don't know
about the others.

Cheers,

--
Igor Neves<igor.neves[at]3gnt.net>
3GNTW - Tecnologias de Informação, Lda

SIP: igor[at]3gnt.net JID: igor[at]3gnt.net
ICQ: 249075444 MSN: igor[at]3gnt.net
TLM: 00351914503611 PSTN: 00351252377120




mike at dev-zero

Oct 29, 2009, 11:42 AM

Post #9 of 16
Re: drbd + 10gig network

Igor Neves wrote:
> Hi,
>
> Are you still using software raid?
>
> Have you tried using hardware raid?
>
yes, i am using software raid. i guess i didn't try changing that during
my hardware changes. i'll see if i can scrounge up some raid controllers
to test this with.

mike


Todd.Denniston at tsb

Oct 30, 2009, 12:34 PM

Post #10 of 16
Re: drbd + 10gig network

Mike Lovell wrote:
> yes, i am using software raid. i guess i didn't try changing that during
> my hardware changes. i'll see if i can scrounge up some raid controllers
> to test this with.
>
> mike
>
>

One other thing I would suggest looking at, which I don't think anyone else mentioned.

Network equipment in between the mirrors. (switches, cables, crossovers)

What is your MTU?

I experienced a _very_ similar problem with the set I maintained when a 10/100baseT hub/switch was
placed between the units instead of the crossover cable I had been using. The base problem was that
the hub only supported MTUs smaller than 1500, and I was using 6000. If you are using Gig equipment
I would expect a considerably larger maximum MTU, but there _may_ be a limit somewhere there.
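
One common way to check whether a given MTU really makes it across a path is to ping with fragmentation forbidden and a payload sized to fill exactly one packet. The Python sketch below only does the arithmetic and builds the command line for the Linux iputils `ping` (`-M do` sets don't-fragment, `-s` sets the payload size). The peer address 10.0.0.2 is a made-up placeholder, and the header sizes assume IPv4 without options.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header

def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in a single IPv4 packet of size mtu."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("mtu too small for an ICMP echo request")
    return mtu - IPV4_HEADER - ICMP_HEADER

def ping_command(peer, mtu, count=3):
    """Build an iputils ping command that refuses to fragment (-M do)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), peer]

# 10.0.0.2 is a hypothetical address for the peer node.
for mtu in (1500, 6000, 9000):
    print(mtu, " ".join(ping_command("10.0.0.2", mtu)))
```

If the ping at the configured MTU fails while a smaller size passes, something on the path is limiting frame size.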

hope this helps.
--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter


mike at dev-zero

Oct 30, 2009, 12:52 PM

Post #11 of 16
Re: drbd + 10gig network

Todd Denniston wrote:
> Mike Lovell wrote:
>>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>>> exaggerated the problem. when the sync hangs, the drbd module is
>>>> almost completely unresponsive. i tried doing a pause-sync and then
>>>> resume-sync thinking that it would nudge the module into working
>>>> but the commands timeout on talking to the module. i can still cat
>>>> /proc/drbd but that is about it until i take down the network
>>>> interface and drbd detects the network change. if i then bring back
>>>> up the interface, drbd detects it can talk again but then only
>>>> syncs a couple of megabytes before stalling again. i have tried
>>>> every way i can think of to check the integrity of the network link
>>>> between the hosts and everything says they are fine except for
>>>> during a ping flood there will be a few out of a couple h
>>>
>> yes, i am using software raid. i guess i didn't try changing that
>> during my hardware changes. i'll see if i can scrounge up some raid
>> controllers to test this with.
>>
>> mike
>>
>>
>
> One other thing I would suggest looking at, which I don't think anyone
> else mentioned.
>
> Network equipment in between the mirrors. (switches, cables, crossovers)
>
> What is your MTU?
>
> I experienced a problem _very_ similar with the set I maintained when
> a 10/100baseT hub/switch was placed between the units instead of the
> crossover cable I had been using. The base problem was that the hub
> only supported MTUs smaller than 1500, and I was using 6000. If you
> are using Gig equipment I would expect a considerably larger maximum
> MTU, but there _may_ be a limit somewhere there.
>
> hope this helps.
i am essentially using a crossover cable between the boxes and a mtu of
9000. i did try with the standard 1500 and 9000 frame sizes and had
failures with both.

mike
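
For anyone who wants to confirm that a given frame size really passes end to end, here is a minimal sketch. It assumes Linux with the iputils `ping` (whose `-M do` option forbids fragmentation) and plain IPv4 without IP options; the peer address `10.0.0.2` is a made-up placeholder. It only computes the largest ICMP payload that fits in one packet at a given MTU and builds the matching command line; it does not run anything.

```python
from typing import List

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU %d is too small for an ICMP echo" % mtu)
    return payload


def ping_command(peer: str, mtu: int, count: int = 5) -> List[str]:
    """Build a ping command that fails if the path cannot carry `mtu` bytes."""
    # -M do : set the don't-fragment bit (Linux iputils ping)
    # -s    : ICMP payload size in bytes
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), peer]


if __name__ == "__main__":
    # 10.0.0.2 is a hypothetical address for the replication peer.
    for mtu in (1500, 9000):
        print(" ".join(ping_command("10.0.0.2", mtu)))
```

If the command for 9000 fails while the one for 1500 succeeds, something on the path is not passing jumbo frames. If both succeed, as seems to be the case here, the MTU is probably not the culprit.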



mike at dev-zero

Oct 30, 2009, 8:59 PM

Post #12 of 16 (542 views)
Permalink
Re: drbd + 10gig network [In reply to]

Mike Lovell wrote:
> Igor Neves wrote:
>> Hi,
>>
>> Are you still using software raid?
>>
>> Have you tryed out using hardware raid?
>>
>> The only time i had tryed out drbd with mdraid, well let's say, i
>> never will try that again, from now on, we only use areca raid cards.
>> Anyway, software raid it's slow, if you want to do this thing, buy
>> raid controllers, the best ones are areca, 3ware and lsi it's bad,
>> don't know all the others.
>>
>> Cheers,
>>
>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>> exaggerated the problem. when the sync hangs, the drbd module is
>>> almost completely unresponsive. i tried doing a pause-sync and then
>>> resume-sync thinking that it would nudge the module into working but
>>> the commands timeout on talking to the module. i can still cat
>>> /proc/drbd but that is about it until i take down the network
>>> interface and drbd detects the network change. if i then bring back
>>> up the interface, drbd detects it can talk again but then only syncs
>>> a couple of megabytes before stalling again. i have tried every way
>>> i can think of to check the integrity of the network link between
>>> the hosts and everything says they are fine except for during a ping
>>> flood there will be a few out of a couple h
>>
> yes, i am using software raid. i guess i didn't try changing that
> during my hardware changes. i'll see if i can scrounge up some raid
> controllers to test this with.

i got a hold of a few 3ware controllers and used these for the disk
array instead of the software raid. unfortunately, it still broke.

mike


igor at 3gnt

Nov 2, 2009, 2:20 AM

Post #13 of 16 (486 views)
Permalink
Re: drbd + 10gig network [In reply to]

On 10/30/2009 07:52 PM, Mike Lovell wrote:
> Todd Denniston wrote:
>> Mike Lovell wrote:
>>>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>>>> exaggerated the problem. when the sync hangs, the drbd module is
>>>>> almost completely unresponsive. i tried doing a pause-sync and
>>>>> then resume-sync thinking that it would nudge the module into
>>>>> working but the commands timeout on talking to the module. i can
>>>>> still cat /proc/drbd but that is about it until i take down the
>>>>> network interface and drbd detects the network change. if i then
>>>>> bring back up the interface, drbd detects it can talk again but
>>>>> then only syncs a couple of megabytes before stalling again. i
>>>>> have tried every way i can think of to check the integrity of the
>>>>> network link between the hosts and everything says they are fine
>>>>> except for during a ping flood there will be a few out of a couple h
>>>>
>>> yes, i am using software raid. i guess i didn't try changing that
>>> during my hardware changes. i'll see if i can scrounge up some raid
>>> controllers to test this with.
>>>
>>> mike
>>>
>>>
>>
>> One other thing I would suggest looking at, which I don't think
>> anyone else mentioned.
>>
>> Network equipment in between the mirrors. (switches, cables, crossovers)
>>
>> What is your MTU?
>>
>> I experienced a problem _very_ similar with the set I maintained when
>> a 10/100baseT hub/switch was placed between the units instead of the
>> crossover cable I had been using. The base problem was that the hub
>> only supported MTUs smaller than 1500, and I was using 6000. If you
>> are using Gig equipment I would expect a considerably larger maximum
>> MTU, but there _may_ be a limit somewhere there.
>>
>> hope this helps.
> i am essentially using a crossover cable between the boxes and a mtu
> of 9000. i did try with the standard 1500 and 9000 frame sizes and had
> failures with both.
>
> mike


In my opinion, you should leave everything at the standard defaults. MTU
should only be changed when you need "the extra mile" and AFTER you have
your system stable.

Don't change defaults before you have that stability. You don't gain
much speed by changing the MTU from 1500 to 9000.

This is my rule when I'm having trouble or when I'm playing with
something new, and it goes for all options, not only the MTU.
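
A quick way to see which interfaces have drifted from the default is to read the MTU values the kernel exposes. This is only a sketch, assuming a Linux host where each interface has a `/sys/class/net/<iface>/mtu` file; the function name and the 1500 default are illustrative choices.

```python
from pathlib import Path

DEFAULT_MTU = 1500  # standard Ethernet MTU


def nondefault_mtus(sysfs_net="/sys/class/net", default=DEFAULT_MTU):
    """Return {interface: mtu} for every interface whose MTU is not `default`."""
    result = {}
    root = Path(sysfs_net)
    if not root.is_dir():
        return result
    for iface in sorted(root.iterdir()):
        mtu_file = iface / "mtu"
        if not mtu_file.is_file():
            continue  # not an interface entry
        mtu = int(mtu_file.read_text().strip())
        if mtu != default:
            result[iface.name] = mtu
    return result


if __name__ == "__main__":
    # The loopback interface normally has a much larger MTU, so expect
    # it to be listed as well.
    for name, mtu in nondefault_mtus().items():
        print("%s: mtu %d" % (name, mtu))
```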

I will be playing with exactly this setup one of these days; I will give
my feedback later.

Have you considered using EL5 (or CentOS 5)?

Good luck,

--
Igor Neves<igor.neves[at]3gnt.net>
3GNTW - Tecnologias de Informação, Lda

SIP: igor[at]3gnt.net JID: igor[at]3gnt.net
ICQ: 249075444 MSN: igor[at]3gnt.net
TLM: 00351914503611 PSTN: 00351252377120




mike at dev-zero

Nov 20, 2009, 9:01 AM

Post #14 of 16 (92 views)
Permalink
Re: drbd + 10gig network [In reply to]

Mike Lovell wrote:
> Mike Lovell wrote:
>> Igor Neves wrote:
>>> Hi,
>>>
>>> Are you still using software raid?
>>>
>>> Have you tryed out using hardware raid?
>>>
>>> The only time i had tryed out drbd with mdraid, well let's say, i
>>> never will try that again, from now on, we only use areca raid
>>> cards. Anyway, software raid it's slow, if you want to do this
>>> thing, buy raid controllers, the best ones are areca, 3ware and lsi
>>> it's bad, don't know all the others.
>>>
>>> Cheers,
>>>
>>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>>> exaggerated the problem. when the sync hangs, the drbd module is
>>>> almost completely unresponsive. i tried doing a pause-sync and then
>>>> resume-sync thinking that it would nudge the module into working
>>>> but the commands timeout on talking to the module. i can still cat
>>>> /proc/drbd but that is about it until i take down the network
>>>> interface and drbd detects the network change. if i then bring back
>>>> up the interface, drbd detects it can talk again but then only
>>>> syncs a couple of megabytes before stalling again. i have tried
>>>> every way i can think of to check the integrity of the network link
>>>> between the hosts and everything says they are fine except for
>>>> during a ping flood there will be a few out of a couple h
>>>
>> yes, i am using software raid. i guess i didn't try changing that
>> during my hardware changes. i'll see if i can scrounge up some raid
>> controllers to test this with.
>
> i got a hold of a few 3ware controllers and used these for the disk
> array instead of the software raid. unfortunately, it still broke.

after much more experimentation and buying new hardware, i got things
working. i did eventually try doing this on the hardware i had using
opensuse 11.2 and centos 5.4. same problems. luckily, i was able to
convince my boss to let me buy some new 10 gig nics, since those were the
only thing that hadn't changed. i ordered some cards from chelsio (pcie,
dual port cx4) and just got them on wednesday. put them in, configured
them, and started syncing. never hiccuped. so, i guess for anyone else
wanting to try this, avoid using 10gig pcie intel nics.

mike


igor at 3gnt

Nov 20, 2009, 9:35 AM

Post #15 of 16 (92 views)
Permalink
Re: drbd + 10gig network [In reply to]

On 11/20/2009 05:01 PM, Mike Lovell wrote:
> Mike Lovell wrote:
>> Mike Lovell wrote:
>>> Igor Neves wrote:
>>>> Hi,
>>>>
>>>> Are you still using software raid?
>>>>
>>>> Have you tryed out using hardware raid?
>>>>
>>>> The only time i had tryed out drbd with mdraid, well let's say, i
>>>> never will try that again, from now on, we only use areca raid
>>>> cards. Anyway, software raid it's slow, if you want to do this
>>>> thing, buy raid controllers, the best ones are areca, 3ware and lsi
>>>> it's bad, don't know all the others.
>>>>
>>>> Cheers,
>>>>
>>>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>>>> exaggerated the problem. when the sync hangs, the drbd module is
>>>>> almost completely unresponsive. i tried doing a pause-sync and
>>>>> then resume-sync thinking that it would nudge the module into
>>>>> working but the commands timeout on talking to the module. i can
>>>>> still cat /proc/drbd but that is about it until i take down the
>>>>> network interface and drbd detects the network change. if i then
>>>>> bring back up the interface, drbd detects it can talk again but
>>>>> then only syncs a couple of megabytes before stalling again. i
>>>>> have tried every way i can think of to check the integrity of the
>>>>> network link between the hosts and everything says they are fine
>>>>> except for during a ping flood there will be a few out of a couple h
>>>>
>>> yes, i am using software raid. i guess i didn't try changing that
>>> during my hardware changes. i'll see if i can scrounge up some raid
>>> controllers to test this with.
>>
>> i got a hold of a few 3ware controllers and used these for the disk
>> array instead of the software raid. unfortunately, it still broke.
>
> after much more experimentation and buying new hardware, i got a
> things working. i did eventually try doing this on the hardware i had
> using opensuse 11.2 and centos 5.4. same problems. luckily, i was able
> to convince my boss to let me buy some new 10 gig nics cause those are
> the only thing that didn't change. i ordered some cards from chelsio
> (pcie, dual port cx4) and just got them on wednesday. put them in,
> configure them, and start syncing. never hiccuped. so, i guess for
> anyone else wanting to try this, avoid using 10gig pcie intel nics.
>
> mike

Hi,

Can you please give me the exact part number of the Chelsio cards that
work?

And, by the way, the exact model/part number of the Intel ones that
didn't work?

Thanks.

--
Igor Neves<igor.neves[at]3gnt.net>
3GNTW - Tecnologias de Informação, Lda

SIP: igor[at]3gnt.net JID: igor[at]3gnt.net
ICQ: 249075444 MSN: igor[at]3gnt.net
TLM: 00351914503611 PSTN: 00351252377120




mike at dev-zero

Nov 20, 2009, 10:19 AM

Post #16 of 16 (92 views)
Permalink
Re: drbd + 10gig network [In reply to]

Igor Neves wrote:
>
>
> On 11/20/2009 05:01 PM, Mike Lovell wrote:
>> Mike Lovell wrote:
>>> Mike Lovell wrote:
>>>> Igor Neves wrote:
>>>>> Hi,
>>>>>
>>>>> Are you still using software raid?
>>>>>
>>>>> Have you tryed out using hardware raid?
>>>>>
>>>>> The only time i had tryed out drbd with mdraid, well let's say, i
>>>>> never will try that again, from now on, we only use areca raid
>>>>> cards. Anyway, software raid it's slow, if you want to do this
>>>>> thing, buy raid controllers, the best ones are areca, 3ware and
>>>>> lsi it's bad, don't know all the others.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On 10/23/2009 10:40 PM, Mike Lovell wrote:
>>>>>> exaggerated the problem. when the sync hangs, the drbd module is
>>>>>> almost completely unresponsive. i tried doing a pause-sync and
>>>>>> then resume-sync thinking that it would nudge the module into
>>>>>> working but the commands timeout on talking to the module. i can
>>>>>> still cat /proc/drbd but that is about it until i take down the
>>>>>> network interface and drbd detects the network change. if i then
>>>>>> bring back up the interface, drbd detects it can talk again but
>>>>>> then only syncs a couple of megabytes before stalling again. i
>>>>>> have tried every way i can think of to check the integrity of the
>>>>>> network link between the hosts and everything says they are fine
>>>>>> except for during a ping flood there will be a few out of a couple h
>>>>>
>>>> yes, i am using software raid. i guess i didn't try changing that
>>>> during my hardware changes. i'll see if i can scrounge up some raid
>>>> controllers to test this with.
>>>
>>> i got a hold of a few 3ware controllers and used these for the disk
>>> array instead of the software raid. unfortunately, it still broke.
>>
>> after much more experimentation and buying new hardware, i got a
>> things working. i did eventually try doing this on the hardware i had
>> using opensuse 11.2 and centos 5.4. same problems. luckily, i was
>> able to convince my boss to let me buy some new 10 gig nics cause
>> those are the only thing that didn't change. i ordered some cards
>> from chelsio (pcie, dual port cx4) and just got them on wednesday.
>> put them in, configure them, and start syncing. never hiccuped. so, i
>> guess for anyone else wanting to try this, avoid using 10gig pcie
>> intel nics.
>>
>> mike
>
> Hi,
>
> Can you please give me the exact Part-Number of the Cards that work
> from chelsio?
>
> And by the way the exact Model/Part-Number from intel, the ones that
> don't worked?

my intel cards are the supermicro aoc-stg-i2. i can't find the card on
supermicro's site any more but it is very similar to the aoc-utg-i2, which
is still listed. both use the 82598EB chip. the only difference is the
form factor.

my chelsio cards are the N320E-CXA.
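
For anyone comparing their own hardware against these, it can help to check which kernel driver an interface is bound to. A minimal sketch, assuming Linux, where `/sys/class/net/<iface>/device/driver` is a symlink to the driver directory for physical NICs; the interface name `eth2` is just an example. On an 82598EB-based card one would normally expect Intel's `ixgbe` driver.

```python
import os


def nic_driver(iface, sysfs_net="/sys/class/net"):
    """Return the kernel driver name bound to `iface`, or None if unknown."""
    link = os.path.join(sysfs_net, iface, "device", "driver")
    try:
        target = os.readlink(link)  # e.g. ../../../bus/pci/drivers/<name>
    except OSError:
        # Virtual interfaces (lo, bridges, ...) have no device/driver link.
        return None
    return os.path.basename(target)


if __name__ == "__main__":
    # "eth2" is a hypothetical name for the 10 gig interface.
    print(nic_driver("eth2"))
```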

mike
