This is an account of my 'newbie' attempt to solve a problem I had personally never encountered before... anyone who can point me to a quicker solution is encouraged/welcome to... ;-)

---

Early Friday morning the halifax:zmailer crashed completely, apparently after encountering a message it didn't like...

First Indication
-------------------
My first indication of a problem was that only a few of the overnight status reports were delivered to my mailbox.

Investigation
-------------------
The first thing I checked was the debug output from CS_ROOT/cronbin/csuite-cron (see CS_LOG/debug.csuite_cron), which calls the routines that produce the error reports. There were no errors in the debug output, so I checked the processes running on the system and discovered that there were no (zmailer) transport agents running, and that the (zmailer) scheduler which starts these 'delivery agents' was not running either.

First crack at a fix... restart zmailer scheduler
-----------------------------------------------------
The first thing I tried was killing and re-starting zmailer.

    <zmailer binary path>/zmailer kill

...kill anything running; I usually have to kill a few smtp* processes that aren't killed by the above...

    <zmailer binary path>/zmailer bootclean

...to clean up, and then,

    <zmailer binary path>/zmailer start

The scheduler would start and hang around for a few minutes and then die...

...repeated this several times... with the same results.

More investigation...
---------------------------
Found a core file in the postoffice/transport directory.
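Both the initial check and the "starts, hangs around, then dies" observation come down to asking whether a scheduler process is present. A minimal sketch of that check follows. `has_process` is a made-up helper name, not part of zmailer, and the `ps -e -o comm=` invocation in the usage comment is the POSIX form, which I'm only assuming works unchanged on the system described here.

```shell
# has_process NAME
# Reads one command name (or full path) per line on stdin and succeeds
# if the basename of any line equals NAME. Hypothetical helper for
# illustration only.
has_process() {
    awk -v name="$1" '
        {
            # Keep only the last path component of the first field.
            n = split($1, parts, "/")
            if (parts[n] == name) found = 1
        }
        END { exit !found }
    '
}

# Possible usage (assumes a POSIX-style ps):
#   ps -e -o comm= | has_process scheduler || echo "scheduler is not running"
```

Reading the process list from stdin keeps the helper independent of any one `ps` flavour; only the usage line needs adjusting.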
Had a look at the contents of the core file using `strings core | less`

<output..>
CORE
scheduler
/var/csuite/csuite1.0/zmailer/bin/scheduler -f /var/csuite/csuite1.0/zmailer/sc
CORE
SUNW,Ultra-1
CORE
CORE
CORE
iam.aven-owner@l
5r@liam.a
CORE
aven
CORE
scheduler
/var/csuite/csuite1.0/zmailer/bin/scheduler -f /var/csuite/csuite1.0/zmailer/sc
CORE
iam.aven-owner@l
5r@liam.a
CORE
SUNW,Ultra-1
CORE
CORE
CORE
iam.aven-owner@l
5r@liam.a
<snip...>
</output>

This certainly looked consistent with the scheduler failure... ;-)

Now starts the ugly work...
-------------------------------------
Using a printed copy of the zmailer manual and consulting the illustration of Zmailer directories..

    http://www.zmailer.org/zmanual/TUTimg3.gif

I started with the postoffice/queue and postoffice/transport directories...

Both these directories contain 'loose' files along with a series of sub-directories named 'A', 'B', 'C', 'D', etc... with a matching directory structure in both queue/ and transport/.

After stopping all zmailer processes, I grepped for portions of:

    iam.aven-owner@l 5r@liam.a

(which were found in the core file) ...this certainly looked like a partial address....

I had no luck finding "iam.aven" in the queue/ or transport/ directories... or in the current or rotated mail logs (CS_LOG/mail, CS_LOG/mail/OLD)... ;-(

I then proceeded to move all the 'loose' files into temporary quarters one directory level above... I then restarted zmailer, with the same failure...

----

I tried another approach... maybe the scheduler binary was corrupt... I killed the parts of zmailer that were running (router, smtpserver), and...

----

I moved the postoffice directory tree to a temporary location and created a new postoffice tree. I correctly guessed that zmailer would create any 'A', 'B', 'C' directories it needed, so I just created the upper level directories.

----

I restarted zmailer and everything worked fine... so the problem is in the old postoffice tree somewhere...
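For anyone repeating the hunt for core-file fragments in the queue/ and transport/ trees, it could be scripted roughly like this. This is only a sketch: `find_fragment` is a made-up helper name, and it uses `find` with `grep -l -F` rather than a recursive grep, on the assumption that not every grep supports `-r`. The fragment is matched as a fixed string, so the dots in something like `r@liam.a` are taken literally.

```shell
# find_fragment FRAGMENT DIR [DIR...]
# Prints the path of every regular file under the given directories
# that contains FRAGMENT as a literal (non-regex) string.
# Hypothetical helper for illustration only.
find_fragment() {
    fragment=$1
    shift
    # One grep per file: slow on a big queue, but the exit status of
    # grep then does not affect the exit status of find.
    find "$@" -type f -exec grep -l -F -- "$fragment" {} \;
}

# Possible usage, trying several pieces of the garbled address in turn
# (run from the postoffice directory, with zmailer stopped):
#   for piece in 'iam.aven' 'aven-owner' 'r@liam.a'; do
#       find_fragment "$piece" queue transport
#   done
```

Trying several short pieces matters here because the text in the core file is evidently chopped up, so the longest-looking fragment is not necessarily one that appears intact in a message file.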
I killed zmailer again and moved the old tree back in place.

Back to looking for the problem message...

-----

I finally found a match on a substring in transport/G, with a corresponding file in queue/G. I could see nothing wrong with either file but, by now with 5,000+ messages backed up, I shot the wad and removed them.

----

I restarted zmailer... with about 2,800 messages to process. It worked!

----

I waited until the transport queue was down to about 1,600, killed zmailer, moved the 'loose' files back into the queue/ and transport/ directories... and re-started zmailer with 5,500 messages in the queue.

Once it was back down to 1,600, I killed zmailer and merged in the mail that had accumulated in the new postoffice during the short period that I'd had it running.

=====

4 hours to get the mail back-log moving properly.

In my mind this particular problem was complicated by the fact that the scheduler was dumping core without providing any hint in the logs as to what it was hanging on...

comments..?

David Potter