Article 137667 of comp.os.vms:
Newsgroups: comp.os.vms
From: lishka@vxaluw.cern.ch ()
Subject: Re: cpu hog
Organization: University of Wisconsin @ CERN
References: <820003509.460000.PYUN@CYCLOP.KCC.HAWAII.EDU> <4cg2n1$hmd@rs18.hrz.th-darmstadt.de> <1996Jan9.091158.7190@acs.eku.edu> <4d0gia$e94@blue.usps.gov>
Date: Fri, 12 Jan 1996 15:32:22 GMT

[My apologies if somebody has mentioned this already. I jumped into this discussion late, and haven't seen this solution in the current articles.]

We had problems in our cluster with people running CPU-intensive programs interactively (they should be using batch queues). Our problem had more to do with short CPU-intensive jobs causing the system to hiccup -- everybody else would be locked out for ten seconds while somebody's custom fit routine ran at interactive priority. Very often the offending processes were too fast to catch. This was possibly due to two problems: (1) people running short hungry jobs; (2) a previous system manager modified the default interactive priority to 5 to accommodate batch queues.

I used what seems like a common approach: write your own scheduler add-on. I picked up DYNPRI (originally by Harry Flowers, rewritten into C by Matt Madison) and tried it out. It turned out to have a bug when handling outswapped processes, plus I didn't want the same priority adjustment scheme, so I rolled my own version (which I called CPU_VIGILANTE).

The basic idea is for the program to periodically wake up, get information on the appropriate processes, determine if a process is taking up "too much time", and then lower its priority if need be. The procedure should also raise the priority back for processes that have stopped using significant CPU. We used a two-track tactic: normal interactive priorities are at 4, while CPU hogs are knocked down to priority 3. (We also have batch queues running at priorities 3 through 0.) We also use a "penalty": a CPU-intensive job that has been knocked down to priority 3 is kept at 3 for a penalty period even >after< it stops hogging the CPU. This fixes problems with people running, pausing, running again, pausing, etc.

Running CPU_VIGILANTE and changing the default interactive priority back to 4 solved our problems. Before, I would get many complaints about "poor system performance"; after, the sailing is much smoother, with no complaints. We do have the occasional bumps due to heavy paging or some process hitting the system disk too hard, but these are rare. The system feels much, much better now.

CPU_VIGILANTE is not really packaged for outside use, but I would be willing to send it to anyone who asks. The code is gross (many nested if-thens in a mistaken search for efficiency), but it is >heavily< commented (so others could figure out what I was doing). It has been tested on our VAX 9000 and a DEC 3000m400.

I would also recommend DYNPRI if it suits your needs better. However, watch out for odd behaviour with outswapped processes -- my copy did not handle these correctly.

Chris Lishka
Computer Systems Manager
Wisconsin-Aleph Group
PPE Division, CERN
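
For anyone who wants to see the two-track scheme with a penalty period spelled out, here is a minimal sketch in Python. It is only an illustration of the policy described above, not the CPU_VIGILANTE source, and it does not query or alter real processes. The priorities 4 and 3 come from the description above; the threshold, the penalty length, and all names (ProcState, update, CPU_THRESHOLD, PENALTY_TICKS) are assumptions made up for the example.

```python
from dataclasses import dataclass

NORMAL_PRIORITY = 4    # normal interactive priority (from the post)
HOG_PRIORITY = 3       # priority for CPU hogs (from the post)
CPU_THRESHOLD = 0.5    # ASSUMED: fraction of a sampling interval that counts as hogging
PENALTY_TICKS = 3      # ASSUMED: quiet samples needed before priority is restored


@dataclass
class ProcState:
    """Per-process bookkeeping kept between wake-ups (hypothetical layout)."""
    priority: int = NORMAL_PRIORITY
    last_cpu: float = 0.0      # cumulative CPU seconds seen at the previous sample
    penalty_left: int = 0      # quiet samples still owed before restoring priority


def update(state: ProcState, cpu_now: float, interval: float) -> int:
    """Apply one sampling step and return the priority the process should get."""
    used = (cpu_now - state.last_cpu) / interval   # CPU share over the last interval
    state.last_cpu = cpu_now

    if used > CPU_THRESHOLD:
        # Hogging: knock the process down and (re)start the penalty period.
        state.priority = HOG_PRIORITY
        state.penalty_left = PENALTY_TICKS
    elif state.priority == HOG_PRIORITY:
        # Quiet, but still serving the penalty; restore only once it has run out.
        state.penalty_left -= 1
        if state.penalty_left <= 0:
            state.penalty_left = 0
            state.priority = NORMAL_PRIORITY
    return state.priority


if __name__ == "__main__":
    proc = ProcState()
    # Cumulative CPU seconds observed at successive 10-second wake-ups.
    for cpu in (1.0, 10.0, 10.5, 11.0, 11.5, 12.0):
        print(cpu, update(proc, cpu, 10.0))
```

Because any sample over the threshold resets the penalty counter, a job that runs, pauses briefly, and runs again stays at the lower priority throughout, which is the run/pause pattern the penalty is meant to catch.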