Article 137667 of comp.os.vms:
Newsgroups: comp.os.vms
From: lishka@vxaluw.cern.ch ()
Subject: Re: cpu hog
X-Nntp-Posting-Host: axuw15.cern.ch
Message-ID: <DL2r5z.CCw@news.cern.ch>
Sender: news@news.cern.ch (USENET News System)
Reply-To: lishka@vxaluw.cern.ch ()
Organization: University of Wisconsin @ CERN
X-Newsreader: mxrn 6.18-32
References: <820003509.460000.PYUN@CYCLOP.KCC.HAWAII.EDU> <4cg2n1$hmd@rs18.hrz.th-darmstadt.de> <1996Jan9.091158.7190@acs.eku.edu> <4d0gia$e94@blue.usps.gov>
Date: Fri, 12 Jan 1996 15:32:22 GMT
Lines: 50


[My apologies if somebody has mentioned this already.  I jumped into this
discussion late, and haven't seen this solution in the current articles.]

We had problems in our cluster with people running CPU-intensive programs
interactively (they should be using batch queues).  Our problem had more
to do with short CPU-intensive jobs causing the system to hiccup --
everybody else would be locked out for ten seconds while somebody's custom
fit routine ran at interactive priority.  Very often the offending
processes were too fast to catch.  This was possibly due to two problems:
(1) people running short hungry jobs; (2) a previous system manager modified
the default interactive priority to 5 to accommodate batch queues.

I used what seems like a common approach: write your own scheduler add-on.
I picked up DYNPRI (originally by Harry Flowers, rewritten into C by Matt
Madison) and tried it out.  It turned out to have a bug when handling
outswapped processes, plus I didn't want the same priority adjustment scheme,
so I rolled my own version (which I called CPU_VIGILANTE).

The basic idea is for the program to periodically wake up, get information
on the appropriate processes, determine if a process is taking up "too much
time", and then lower its priority if need be.  Also, the procedure should
raise the priority back for processes that have stopped using significant
CPU.  We used a two-track tactic: normal interactive priorities are at 4,
while CPU hogs are knocked down to priority 3.  (We also have batch queues
running at priorities 3 through 0.)  We also use a "penalty": a CPU-intensive
job that has been knocked down to priority 3 is kept at 3 for a penalty
period even >after< it stops hogging the CPU.  This fixes problems with
people running, pausing, running again, pausing, etc.
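
The two-track scheme with a penalty period can be sketched roughly as
follows.  This is not the CPU_VIGILANTE code, only a minimal illustration
in Python of the decision made for one process at each wake-up.  The
priorities 4 and 3 come from the description above; the threshold, the
penalty length, and all names (ProcState, update, CPU_THRESHOLD,
PENALTY_INTERVALS) are assumptions made up for the example.  On VMS the
CPU samples would presumably come from $GETJPI and the change be applied
with $SETPRI; neither is called here.

```python
from dataclasses import dataclass

NORMAL_PRIORITY = 4     # from the post: normal interactive priority
HOG_PRIORITY = 3        # from the post: priority for CPU hogs
CPU_THRESHOLD = 0.5     # assumption: CPU fraction per interval that is "too much"
PENALTY_INTERVALS = 3   # assumption: quiet intervals required before restoring


@dataclass
class ProcState:
    """Per-process bookkeeping kept between wake-ups (hypothetical)."""
    priority: int = NORMAL_PRIORITY
    last_cpu: float = 0.0       # cumulative CPU seconds at the previous sample
    quiet_intervals: int = 0    # consecutive quiet samples while demoted


def update(state: ProcState, cpu_now: float, interval: float) -> int:
    """Apply one sampling step and return the priority the process should have."""
    used = (cpu_now - state.last_cpu) / interval   # CPU fraction this interval
    state.last_cpu = cpu_now
    if used > CPU_THRESHOLD:
        # Hogging: demote, and restart the penalty clock.
        state.priority = HOG_PRIORITY
        state.quiet_intervals = 0
    elif state.priority == HOG_PRIORITY:
        # Quiet but still demoted: serve out the penalty before restoring.
        state.quiet_intervals += 1
        if state.quiet_intervals >= PENALTY_INTERVALS:
            state.priority = NORMAL_PRIORITY
            state.quiet_intervals = 0
    return state.priority
```

Because the penalty counter restarts every time the process hogs again, a
run/pause/run pattern never gets back to priority 4 during the short pauses.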

Running CPU_VIGILANTE and changing the default interactive priority back to
4 solved our problems.  Before, I would get many complaints about "poor
system performance"; after, the sailing is much smoother, with no
complaints.  We do have the occasional bumps due to heavy paging or some
process hitting the system disk too hard, but these are rare.  The system
feels much much better now.

CPU_VIGILANTE is not really packaged for outside use, but I would be willing
to send it to anyone who asks.  The code is gross (many nested if-thens in
a mistaken search for efficiency), but it is >heavily< commented (so others
could figure out what I was doing).  It has been tested on our VAX 9000 and
a DEC 3000m400.

I would also recommend DYNPRI if it suits your needs better.  However, watch
out for odd behaviour with outswapped processes -- my copy did not handle
these correctly.
					Chris Lishka
					Computer Systems Manager
					Wisconsin-Aleph Group
					PPE Division, CERN