bm_witness at yahoo
Apr 3, 2006, 12:29 PM
Post #1 of 3
Sun Gem (RIO GEM r01) errors...
I recently rebuilt a SunBlade 2000 system that was
running Solaris 8 to Gentoo 2006.0. The system sports
a Sun RIO GEM NIC, and worked quite well for the first
few days; however, we didn't hit it hard during that
period either. The system's primary task is to be
our source repository, and so needs to be network
The system was initially set up on 3/9/2006, and ran
fine until 3/15/2006 when we started getting the below
errors:
Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0:
transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out,
Mar 15 15:39:25 tsdfft1 eth0:
Mar 15 15:39:25 tsdfft1 eth0:
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps,
Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled
Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.
We're presently using the 2.6.16 kernel (vanilla) with
sungem driver version 0.98. We have also seen this
issue with the 220.127.116.11 kernel (vanilla) and the
2.4.32_r2 kernel (provided by Gentoo 2006.0).
The first one is sporadic, but happens from time to
time. (Same error message every time, save date &
time.) The second one is the most reproducible: all
I have to do is try to pull down source from the
repository (hosted on Apache2 via WebDAV), and after
about 6 MiB of data transfer, the link dies until
an ifconfig down/up is done, after which it runs for a
while longer and then requires a system reboot.
In researching the issue, I found that one of two
issues may be at play: either the card is going bad,
or there is a driver problem. I found a link to an
xmit underrun issue for Solaris, but was unable to
access it because it is locked behind
sunsolve.sun.com. So I have no guarantee that going
back to Solaris will solve the issue either.
I have had a hard time finding reports of an xmit
underrun issue under Linux. Most searches turn up
references to where the message is generated, not
users trying to find solutions to the problem.
I did, however, notice that there was a similar
problem with overflows on the RX portion of the chip,
which was solved by resetting the chip's RX unit.
My first attempt at a fix was to modify the driver at
the point of issue to schedule a reset, based on code
elsewhere in the driver. (See sungem-fix1.patch.txt)
At first this patch did not seem to work; however, I
have been running the kernel with it for about a week
now, and at least SSH and Apache seem to keep running.
So I do think it at least helped to improve the
situation, but it does not solve the problem on the
Subversion side (Apache/WebDAV) which still dies after
issues (just tested to make sure).
I then tried building a solution based on the
gem_rxmac_reset() and the various init functions, and
produced gem_txmac_reset(). However, my first use
locked up the kernel. It might be just that I tried to
gain a lock when I shouldn't have (I did try to get
the lock and tx_lock for the driver). However, I am
not sure that I did it correctly.
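
One possible explanation for the lockup is a self-deadlock: if the
code path that calls gem_txmac_reset() already holds the driver's
lock, taking it again never succeeds, because kernel spinlocks are
not recursive. The sketch below is only a user-space analogy in
Python, not driver code. The lock object, the calling function, and
the timeout are assumptions made for illustration; whether the real
caller holds the lock has to be checked in the driver source.

```python
import threading

# Hypothetical stand-in for the driver's lock. threading.Lock is
# non-reentrant, which is the property being illustrated.
driver_lock = threading.Lock()


def txmac_reset_taking_lock(timeout=0.1):
    """Model a reset routine that takes the lock itself.

    Returns True if the lock was obtained, False if the attempt timed
    out. A kernel spinlock has no timeout and would spin forever.
    """
    got_it = driver_lock.acquire(timeout=timeout)
    if got_it:
        # ... the reset work would happen here ...
        driver_lock.release()
    return got_it


def interrupt_path():
    """Model a caller that already holds the lock when it calls the reset."""
    with driver_lock:
        return txmac_reset_taking_lock()
```

Called on its own, txmac_reset_taking_lock() gets the lock; called
from interrupt_path(), it does not. If that matches what happens in
the driver, the usual remedy is for the reset routine to assume the
caller holds the lock rather than take it a second time.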
I would very much appreciate it if someone who is more
familiar with the sungem driver would look at the
patches and verify that (a) it is the correct thing to
do, and (b) I did it correctly.
I am aware that the network the system is running on
is supposed to be full duplex, 100 Mbps. However, I
have noticed that the card/driver seems to think it is
half-duplex. Could this simply be a duplex issue? I
have no control over the switch it is plugged into (so
far as settings go), and have not been able to find a
way to get ifconfig to force it to full-duplex. (We've
typically built the driver into the kernel.)
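
Duplex is normally inspected and set with ethtool rather than
ifconfig. The sketch below, in Python, parses the "Speed", "Duplex"
and "Auto-negotiation" lines that `ethtool eth0` prints, and builds
the `ethtool -s` command that would force 100 Mbps full duplex. It
assumes ethtool is installed and that the driver supports its
settings interface; the function names and the sample text are made
up for illustration. Forcing full duplex while the switch port still
autonegotiates typically leaves the switch side at half duplex, which
is itself a mismatch, so this is something to try with care.

```python
import re
import subprocess


def parse_link_settings(ethtool_output):
    """Pull speed, duplex and autonegotiation out of `ethtool <iface>` text."""
    settings = {}
    for key in ("Speed", "Duplex", "Auto-negotiation"):
        pattern = r"^\s*" + re.escape(key) + r":\s*(\S+)"
        match = re.search(pattern, ethtool_output, re.MULTILINE)
        if match:
            settings[key] = match.group(1)
    return settings


def force_full_duplex_command(iface, speed_mbps=100):
    """Build (but do not run) the command that forces speed and full duplex."""
    return ["ethtool", "-s", iface, "speed", str(speed_mbps),
            "duplex", "full", "autoneg", "off"]


def read_link_settings(iface):
    """Run `ethtool <iface>` and parse what it prints."""
    result = subprocess.run(["ethtool", iface], capture_output=True,
                            text=True, check=True)
    return parse_link_settings(result.stdout)


# Hypothetical sample, shaped like typical ethtool output.
SAMPLE = """Settings for eth0:
\tSpeed: 100Mb/s
\tDuplex: Half
\tAuto-negotiation: on
"""
```

parse_link_settings(SAMPLE) reports the duplex as "Half" for the
sample text. Running the command built by
force_full_duplex_command("eth0") requires root.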
If there is any information that I missed which would
be helpful, please let me know and I will be glad to
pass on what I can.
Patches and additional error log information on eth0
are available at the following URL:
Summary of system information:
System: Sun Microsystems SunBlade 2000
Purchased: roughly 11/03.
NIC: Sun RIO GEM 10/100, built-in on SunBlade 2000
Linux Distro: Gentoo 2006.0
Kernel Versions: 2.6.16, 18.104.22.168, Gentoo's 2.4.32_r2
NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: Link is up at 100 Mbps, half-duplex.
eth0: Pause is disabled
eth0: TX MAC xmit underrun.
Any advice, help, etc. would be greatly appreciated.
Benjamen R. Meyer
P.S. I also posted to the netdev list at
vger.kernel.org, but I have not heard anything.
gentoo-sparc [at] gentoo mailing list