systems hanging since patches applied

Discussion:

Jeff Wieland

2011-01-13 13:34:35 UTC

We patched our SPARC Solaris 10 servers to be current as of
1/02/2011. Starting a few days later, we've had three (or maybe
four) uniprocessor machines hang. The only way to get them back
is to send a break to the console, which triggers a panic:

Jan 10 08:54:01 mysystem unix: [ID 836849 kern.notice]
Jan 10 08:54:01 mysystem ^Mpanic[cpu0]/thread=2a10004fca0:
Jan 10 08:54:01 mysystem unix: [ID 754337 kern.notice] lock_set_spl:
600108283ae lock held and only one CPU
Jan 10 08:54:01 mysystem unix: [ID 100000 kern.notice]
Jan 10 08:54:01 mysystem genunix: [ID 723222 kern.notice]
000002a100077770 unix:lock_set_spl_spin+214 (600108283ae, c,
600108283a8, f, 0, 1840000)
Jan 10 08:54:01 mysystem genunix: [ID 179002 kern.notice] %l0-3:
00000000010a6000 000000000104fb4c 00000000f0000000 00000000fffe0000
Jan 10 08:54:01 mysystem %l4-7: 00000000018402c8 00000000f0061644
00000000fffde3b8 00000000fffdd7a0
Jan 10 08:54:01 mysystem genunix: [ID 723222 kern.notice]
000002a100077820 su:asy_polled_enter+c (60010b8b980, 18, 0, 3, 3, 1)
Jan 10 08:54:01 mysystem genunix: [ID 179002 kern.notice] %l0-3:
0000000000052b68 ffffffffffffffff 00000000f0051a5c 00000000ffffffff
Jan 10 08:54:01 mysystem %l4-7: 0000000000000000 ffffffffffffffff
0000000000000016 000000000180c000
Jan 10 08:54:01 mysystem genunix: [ID 723222 kern.notice]
000002a1000778d0 unix:vx_handler+80 (fffd4d00, 181f620, 183d800, 1,
181f710, f00a6ba5)
Jan 10 08:54:01 mysystem genunix: [ID 179002 kern.notice] %l0-3:
000000000181f710 0000000000000000 0000000000000001 0000000000000001
Jan 10 08:54:01 mysystem %l4-7: 0000000001810c00 00000000f0000000
0000000001000000 000000000104e124
Jan 10 08:54:02 mysystem genunix: [ID 723222 kern.notice]
000002a100077980 unix:callback_handler+20 (fffd4d00, fffdc290, 0, 0,
0, 0)
Jan 10 08:54:02 mysystem genunix: [ID 179002 kern.notice] %l0-3:
0000000000000016 000002a100077231 00000600115a6000 0000000000000001
Jan 10 08:54:02 mysystem %l4-7: 00000600113a4880 0000000000000000
0000000000000000 0000000000003000
Jan 10 08:54:02 mysystem unix: [ID 100000 kern.notice]
Jan 10 08:54:02 mysystem genunix: [ID 672855 kern.notice]
syncing file systems...
Jan 10 08:54:02 mysystem genunix: [ID 733762 kern.notice] 2
Jan 10 08:54:03 mysystem genunix: [ID 904073 kern.notice] done

The other two panics have looked essentially the same. These have all
been on V100s: two are on the 11/06 release, and the third is on the
10/09 release. We may also have experienced this on a Sun Blade 1500,
but it would not respond to a Stop-A, so we had to remove and reapply
the power and never got any useful error messages. We have not seen
this problem on multiprocessor V210s and V240s with the same patch
set.

Has anyone else been seeing this?

Richard B. Gilbert

2011-01-13 15:31:13 UTC

The moral of this story is: Test it before you trust it with "production"!

Jeff Wieland

2011-01-13 16:06:06 UTC

Post by Richard B. Gilbert

The moral of this story is: Test it before you trust it with "production"!

We did. The test machines continue to function normally.

Ian Collins

2011-01-13 22:11:27 UTC

Post by Jeff Wieland

Post by Richard B. Gilbert

The moral of this story is: Test it before you trust it with "production"!

We did. The test machines continue to function normally.

The other moral of this story is: Always create a new BE before applying
a patch cluster!

--
Ian Collins

Richard B. Gilbert

2011-01-13 23:04:33 UTC

Post by Jeff Wieland

Post by Richard B. Gilbert

The moral of this story is: Test it before you trust it with "production"!

We did. The test machines continue to function normally.

How do the test systems differ from the production system? Are you
using the same O/S release and patches in both test and production?

This is not rocket science. The test systems clearly differ from
production in a way that causes different behavior!

The answer is there! Start digging!!!!
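
One way to start digging, as a rough sketch only: save the output of `showrev -p` from a test box and from a production box, then compare the patch revisions. The Python below assumes each line of that output begins with `Patch: NNNNNN-RR`; the function names and the sample patch IDs are made up for illustration and are not from the systems in this thread. It compares patch levels only, so it will not show hardware differences such as uniprocessor versus multiprocessor machines.

```python
import re

# Assumed line shape from `showrev -p`: "Patch: NNNNNN-RR Obsoletes: ..."
PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b", re.MULTILINE)


def patch_levels(showrev_output):
    """Map each patch base ID to the highest revision found in the text."""
    levels = {}
    for base, rev in PATCH_RE.findall(showrev_output):
        levels[base] = max(levels.get(base, 0), int(rev))
    return levels


def diff_patch_levels(test_output, prod_output):
    """Return {base_id: (test_rev, prod_rev)} for every patch that differs.

    A revision of None means the patch is absent on that system.
    """
    test = patch_levels(test_output)
    prod = patch_levels(prod_output)
    diffs = {}
    for base in sorted(set(test) | set(prod)):
        if test.get(base) != prod.get(base):
            diffs[base] = (test.get(base), prod.get(base))
    return diffs


# Hypothetical sample data; these patch IDs are placeholders.
test_box = (
    "Patch: 100001-02 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr\n"
    "Patch: 100002-03 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu\n"
)
prod_box = (
    "Patch: 100001-02 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr\n"
    "Patch: 100002-05 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu\n"
    "Patch: 100003-01 Obsoletes: Requires: Incompatibles: Packages: SUNWckr\n"
)

for base, (t_rev, p_rev) in diff_patch_levels(test_box, prod_box).items():
    print(base, "test:", t_rev, "prod:", p_rev)
```

An empty result would suggest the patch sets match and the difference lies elsewhere, for example in hardware, firmware, or workload.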
