runaway processes

Date: Tue, 22 Sep 1998 11:57:52 -0300 (ADT)
From: David Potter <potter@csuite.ns.ca>
To: Gord Fisch <gfisch@gpfn.sk.ca>
cc: csuite-install@chebucto.ns.ca, techteam@gpfn.sk.ca, CCN TECHTEAM <ccn-tech@chebucto.ns.ca>
Precedence: bulk
Return-Path: <csuite-install-mml-owner@chebucto.ns.ca>

Hi Gord... We've just gone through this while getting CSuite running on
Solaris... it happens that we have a technical meeting this afternoon
during which I hope to review the last ten days and figure out what
happened and work out a hindsight strategy.

I'm generally of the opinion that we escaped by:

1) getting a 'reliable' getty that actually handled the modems the way it
should.

2) getting tq/timerd working to limit session length... It was calling
csuite:skill that was not working on Solaris...

3) perhaps more importantly (but this is the order things went...)
getting idled to actually kill idle sessions.

-----

At the worst, we discovered users were disconnecting when the load
average was high and thus contributing to the problem... I threatened
grievous action in the motd (the board considered calling in a 'pr'
wiz to temper my callous presentation... ;-)

We often had the load average over 90 and at one point we had it over
235...! 

Several times we had users connect to a modem that still had a session
going and get dumped into another user's session (often Pine... scary).

_This_ prompted us to take drastic action to try and get control of the
machine again...

1) disabled all modem sessions, telnet didn't seem to be a problem because
the ports were being rotated

... after a good night's sleep...

2)  conjured up a script to allow/disallow specific modems/modem sets

... about this time we discovered that the Solaris stmon (gettyish) was
leaving sessions hanging around and we switched to mgetty (after trying
agetty earlier)

3) for a while I was:

i) 'allowing' a bank of modems on a short (30 min) session.... 
ii) letting the bank fill up. 
iii) disallowing that bank, 
iv) letting all those sessions die off...
v) checking for lingering sessions and
vi) repeating the process.
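
For anyone wanting to automate steps i) to vi), here is a rough Python
sketch of one pass over the modem banks. It is not the script we used.
The callables `allow`, `disallow`, `sessions_on` and `is_full` are
hypothetical hooks you would have to wire up to your own allow/disallow
script and to `who` output; the 30 minute limit comes from step i), and
the polling interval is an assumption.

```python
import time
from typing import Callable, Iterable, List, Tuple


def rotate_banks(
    banks: Iterable[str],
    allow: Callable[[str], None],             # hypothetical: enable logins on a bank
    disallow: Callable[[str], None],          # hypothetical: disable logins on a bank
    sessions_on: Callable[[str], List[str]],  # hypothetical: sessions still on a bank
    is_full: Callable[[str], bool],           # hypothetical: True once the bank is full
    session_limit_s: float = 30 * 60,         # the short (30 min) session from step i)
    poll_s: float = 60.0,                     # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, List[str]]]:
    """Make one pass over the banks; return (bank, lingering sessions) pairs."""
    report = []
    for bank in banks:
        allow(bank)                           # i) allow the bank
        while not is_full(bank):              # ii) let the bank fill up
            sleep(poll_s)
        disallow(bank)                        # iii) disallow the bank
        waited = 0.0
        # iv) give the sessions up to the session limit to die off
        while sessions_on(bank) and waited < session_limit_s:
            sleep(poll_s)
            waited += poll_s
        # v) whatever is still there is a lingering session
        report.append((bank, list(sessions_on(bank))))
    return report                             # vi) the caller repeats the pass
```

Any bank that comes back with a non-empty list is one where the getty
(or whatever is managing the line) left something hanging around.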

When mgetty proved to be functioning reasonably well, we allowed all
modems and started working on tq and idled.

We're still tuning the system several days later....

I'm not sure if our experience will help but if nothing else you can
take heart in the fact that we survived...;-)

david potter


On Mon, 21 Sep 1998, Gord Fisch wrote:

> Dear Csuite team:
> 
> We have csuite 1.0 installed on a pc under redhat linux 4.
> 
> We have just switched over from a solaris sparcstation running csuite alpha
> 5. On the old machine we had runaway processes, usually from someone no
> longer logged on, that would cause machine loads to skyrocket. The csuite
> cronbin job "pkiller" did not work. I rewrote it as a perl script and had
> it kill off any lynx|pine|telnet|tin process taking over 50% cpu time and
> logged on over an hour. It took its input from 'top' and 'who'. This
> controlled the problem.
> 
> All tests on the new pc seemed fine. However, when we moved all the
> accounts over, the same problem reappeared, with the addition of sendmail
> not accepting connections under a large machine load. (we are using our own
> sendmail, not the one distributed with csuite).
> 
> I have tried adjusting the csuite version of "pkiller" but it is not
> solving the problem. Despite a huge machine load, 'top' does not seem to
> report the processes as taking up a large amount of cpu. The 'timerd' and
> 'idled' seem to work if one is logged on but not with these processes.
> 
> Any inkling as to what is causing the problem or how to address it?
> 
> Excerpts from our techteam talk:
> 
> >I killed aa655, attached to a terminal, idle for 10 hours, machine load
> >was near 50, at about 0900 (Regina time). aa655 did NOT show up as a big
> >CPU user with 'top'. Load immediately dropped like a stone after these
> >processes were killed.
> >
> >On Sun, 20 Sep 1998, Neale Partington wrote:
> >
> >> I noticed that aa655 is on twice with the who command, on once
> >> with the w command, and idle 2 hrs.
> >>
> >> Load was over 18, so couldn't send mail either.
> >>
> >
> >Robert H Greenfield, DSc <rhg@gpfn.sk.ca>
> >http://www.gpfn.sk.ca/~rhg/, voice +1.215.885.7950
> >FN20kc 40d 06' 15" N, 75d 07' 50" W, 73 m elev
> >++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> Great Plains Free Net Webmaster
> Gord Fisch
> 
> Gord  |Program Officer: SK Cultural Exchange Society |www.gpfn.sk.ca/sces
> Fisch |Webmaster: Great Plains Free Net              |www.gpfn.sk.ca
> 
> 

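The selection rule described in the quoted message (any
lynx|pine|telnet|tin process over 50% CPU whose user has been logged on
for over an hour) can be sketched as below. This is an illustration in
Python, not the original perl script or the csuite "pkiller". The `Proc`
record and `select_runaways` are hypothetical names, and the sketch
assumes the 'top' and 'who' output has already been parsed into those
records.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Commands named in the quoted message.
WATCHED = re.compile(r"^(lynx|pine|telnet|tin)$")


@dataclass
class Proc:
    """Hypothetical record merged from parsed 'top' and 'who' output."""
    pid: int
    user: str
    command: str
    cpu_percent: float    # from 'top'
    login_minutes: float  # from 'who'


def select_runaways(
    procs: Iterable[Proc],
    cpu_threshold: float = 50.0,
    min_login_minutes: float = 60.0,
) -> List[int]:
    """Return the PIDs that match the rule; killing them is left to the caller."""
    return [
        p.pid
        for p in procs
        if WATCHED.match(p.command)
        and p.cpu_percent > cpu_threshold
        and p.login_minutes > min_login_minutes
    ]
```

Note that a rule keyed on CPU percentage would not have selected the
aa655 processes mentioned above, since they did not show up as big CPU
users in 'top'.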