Zmailer: troubleshooting

Date: Sun, 31 Jan 1999 12:45:08 -0400 (AST)
From: "David L. Potter" <potter@chebucto.ns.ca>
To: tech-log@chebucto.ns.ca
cc: csuite-tech@chebucto.ns.ca
Precedence: bulk
Return-Path: <csuite-tech-mml-owner@chebucto.ns.ca>


This is an account of my 'newbie' attempt to solve a problem I had
personally never encountered before... anyone who can point me to a
quicker solution is encouraged/welcome to... ;-)

---

Early Friday morning the halifax:zmailer crashed completely, apparently
after encountering a message it didn't like...

First Indication
-------------------
My first indication of a problem was that only a few of the overnight
status reports were delivered to my mailbox.

Investigation
-------------------
The first thing I checked was the debug output from
CS_ROOT/cronbin/csuite-cron (see CS_LOG/debug.csuite_cron),
which calls the routines that produce the error reports.

There were no errors in the debug output so I checked the processes
running on the system and discovered that there were no (zmailer) 
transport agents running and that the (zmailer) scheduler which starts
these 'delivery agents' was not running.
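
A minimal sketch of that process check, for anyone repeating it. The
function name is made up for illustration, and the daemon names
(scheduler, router, smtpserver) are just the ones mentioned in this
message; adjust the list for your own installation.

```shell
# Filter a process listing (read from stdin) down to zmailer daemons.
# The pattern only matches the daemon name as a whole word at the end of
# a path or after a space, so the grep process itself is not reported.
zmailer_procs() {
    grep -E '(^|[ /])(scheduler|router|smtpserver)( |$)'
}

# Typical use:
#   ps -ef | zmailer_procs || echo "no zmailer daemons running"
```

An empty result (and a non-zero exit status from grep) corresponds to
what I saw: no scheduler and no transport agents.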


First crack at a fix... restart zmailer scheduler
-----------------------------------------------------
The first thing I tried was killing and re-starting zmailer.

<zmailer binary path>/zmailer kill
   ...kill anything running, I usually have to kill a few smtp* processes
that aren't killed by the above...

<zmailer binary path>/zmailer bootclean
   ...to clean up, and then,

<zmailer binary path>/zmailer start
 
The scheduler would start, hang around for a few minutes, and then die...

...repeated this several times... with the same results.


More investigation...
---------------------------
Found a core file in the postoffice/transport directory. Had a look at the
contents of the core file using

`strings core | less`

<output..>

CORE
scheduler
/var/csuite/csuite1.0/zmailer/bin/scheduler -f
/var/csuite/csuite1.0/zmailer/sc
CORE
SUNW,Ultra-1
CORE
CORE
CORE
iam.aven-owner@l
5r@liam.a
CORE
aven
CORE
scheduler
/var/csuite/csuite1.0/zmailer/bin/scheduler -f
/var/csuite/csuite1.0/zmailer/sc
CORE
iam.aven-owner@l
5r@liam.a
CORE
SUNW,Ultra-1
CORE
CORE
CORE
iam.aven-owner@l
5r@liam.a
<snip...>
</output>

This certainly looked consistent with the scheduler failure... ;-)
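
Rather than paging through the whole dump, the address-like fragments
can be tallied. This is only a sketch; the function name is hypothetical,
and it assumes the interesting strings contain an '@'.

```shell
# Read `strings` output on stdin, keep lines containing '@', and print
# each distinct fragment with its count, most frequent first.
addr_fragments() {
    grep '@' | sort | uniq -c | sort -rn
}

# Typical use:
#   strings core | addr_fragments | head
```

Fragments that repeat many times, as the two above do, are reasonable
candidates to search the queue for.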


Now starts the ugly work...
-------------------------------------
Using a printed copy of the zmailer manual and consulting the illustration
of Zmailer directories..

http://www.zmailer.org/zmanual/TUTimg3.gif

I started with the postoffice/queue and postoffice/transport
directories...

Both these directories contain 'loose' files along with a series
of sub-directories named 'A', 'B', 'C', 'D', etc., with a matching
directory structure in both queue/ and transport/.

After stopping all zmailer processes, I grepped for portions of:

iam.aven-owner@l
5r@liam.a

(which were found in the core file) ...this certainly looked like a
partial address....

I had no luck finding "iam.aven" in the queue/ or transport/
directories... or in the current or rotated mail logs
CS_LOG/mail , CS_LOG/mail/OLD ... ;-( 
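
For reference, a sketch of that kind of search. The function name is
made up; it uses find with grep -l so it does not depend on a recursive
grep being available. Since the strings in a core file may be truncated,
it is probably worth trying several short fragments rather than one long
one.

```shell
# Print the path of every regular file under the given directories that
# contains the fixed string passed as the first argument.
find_fragment() {
    fragment=$1
    shift
    find "$@" -type f -exec grep -l -F -e "$fragment" {} +
}

# Typical use, from inside the postoffice directory:
#   find_fragment 'aven-owner' queue transport
```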

I then proceeded to move all the 'loose' files into temporary quarters one
directory level above...
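
A sketch of that move, assuming zmailer is stopped first. The function
name and the holding directory are hypothetical; only plain files at the
top level are moved, and the 'A', 'B', 'C'... sub-directories are left
alone.

```shell
# Move the plain files sitting directly in $1 into holding directory $2,
# leaving sub-directories (and their contents) untouched.
park_loose_files() {
    dir=$1
    holding=$2
    mkdir -p "$holding" || return 1
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            mv "$f" "$holding"/ || return 1
        fi
    done
    return 0
}

# Typical use (directory names are examples only):
#   park_loose_files postoffice/transport postoffice/transport.loose
```

Moving the files back later is the same call with the two arguments
swapped.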

I then restarted zmailer with the same failure... 

----

I tried another approach... maybe the scheduler binary was corrupt... I
killed the parts of zmailer that were running, (router, smtpserver),
and...

----

I moved the postoffice directory tree to a temporary location and created a
new postoffice tree. I correctly guessed that zmailer would create any 'A',
'B', 'C' directories it needed, so I just created the upper-level
directories.
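
Roughly, the swap looks like the sketch below. The function name is
hypothetical. It copies only the names of the top-level directories from
the old tree, so nothing is guessed about the layout; it does not
reproduce ownership or permissions, which would need to be set to match
the old tree by hand.

```shell
# Move the existing postoffice tree $1 aside to $2, then create an empty
# tree at $1 containing only the same top-level directory names.
fresh_postoffice() {
    po=$1
    old=$2
    mv "$po" "$old" || return 1
    mkdir "$po" || return 1
    for d in "$old"/*; do
        if [ -d "$d" ]; then
            mkdir "$po/$(basename "$d")" || return 1
        fi
    done
    return 0
}

# Typical use, with zmailer stopped (paths are examples only):
#   fresh_postoffice /path/to/postoffice /path/to/postoffice.old
```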

----

I restarted zmailer and everything worked fine... so the problem is in the
old postoffice tree somewhere... I killed zmailer again and moved the old
tree back in place. Back to looking for the problem message...

-----

I finally found a match on a substring in transport/G with a corresponding
file in queue/G

I could see nothing wrong with either file but, by now with 5,000 +
messages backed up, I shot the wad and removed them.

----

I restarted zmailer... with about 2,800 messages to process. It worked!

----

I waited until the transport queue was down to about 1,600, killed
zmailer, moved the 'loose' files back into the queue/ and transport/
directories... and re-started zmailer with 5,500 messages in the
queue.
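
The message counts here were rough. One way to get such a figure is to
count the files under the transport directory, as in this sketch (the
function name is made up, and the count includes any stray non-message
files such as a core file).

```shell
# Print the number of regular files under directory $1, including those
# in its sub-directories.
queue_depth() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Typical use:
#   queue_depth postoffice/transport
```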

Back down to 1,600, killed zmailer and merged the mail that had
accumulated in the new postoffice during the short period that I'd had it
running.

=====

4 hours to get the mail backlog moving properly. In my mind this
particular problem was complicated by the fact that scheduler was dumping
core without providing any hint in the logs as to what it was hanging on...

comments..?


David Potter
