Nov 8th 2003

aLL wOrk anD no PLAY maKEs JEff a dULl Boy

As soon as I woke up at 9:30am I dialed into work. I checked the health of integration test and to my horror I saw that we had tons of core files. The 5GB log file system on each of the six nodes was at 100% due to all of the core files being created.

It was readily apparent that we had a major problem. I paged Carole and John with an update. I also paged Rodney. Rodney and Sameer were in the office this morning. I explained the situation to them and started sending over core files for them to look at. I also sent the stack trace, which wasn’t very helpful.

During this time I was in constant contact with Rodney and Sameer. Eventually they thought they found the ticket number of one of the tickets causing the core dumps. They requested that ODE send them the transaction for that ticket number.

Unfortunately ODE could not find that ticket in their error queue. I think Rodney took this to mean that there were _NO_ tickets at all in their error queue. He then came to the conclusion that the tickets were eventually making it through the system when retried enough times. This would mean that the problem wasn’t always reproducible, meaning this could be an intermittent memory issue.

Even though they didn’t know what was causing the problem, Rodney and Sameer requested that I rebuild one of the components to see if re-linking would fix the problem. I agreed but also suggested that I temporarily enable logging so we can capture the message causing one of the next core dumps. They agreed to this and I set out to do both tasks.

Around this time Mehdi from ODE sent us an email with the 976k logfile from the error queue. It looks like Sameer and Rodney misidentified the ticket number causing the problem. I suggested that they take a look at one of the tickets in the file Mehdi sent.

About an hour later I got a page from Rodney. They finally found the root cause of the problem causing the core dumps. The code fixes were checked in and they requested that I build the two components as soon as possible.

I labeled, merged, and kicked off the build of the code. It didn’t look like I was going to have it ready by 3pm.

Once 3pm rolled around I called into the bridge line for our scheduled 3pm status conference call. I gave everyone an update and then listened to the managers go over the schedule for the rest of the day. The game plan is to wait for me to activate the current build in system test. They are going to do a ‘comprehensive’ test in system test so we don’t have a snafu like the one we had last night. We agreed to have another bridge call at 7pm tonight.

While I was working today I watched ‘The Shining’ and then after that I watched ‘Rich Girls’ on MTV.

Around 4pm I had the code deployed and activated in the system test environment. While I waited, I paged Chris A. and Tim B. from mid-tier engineering to have the binaries deployed to the six integration servers. It wasn’t until around 6:30pm that they validated the results in system test. I paged John to see about activating in integration. He gave me the go-ahead.

It’s a tedious, time-consuming process to activate code in integration. I have to do it one node at a time. I start with the last node and work my way up. By the time I joined the bridge call at 7pm I had only completed the first node. I continued to work while on the conference call.

Finally around 8pm I finished activating and they turned the ticket flow back on. I monitored the system for the next hour and didn’t see any more problems. There were no core dumps and everything seemed smooth.

I paged Carole and John with an update. Before logging off for the night, I entered my time for this week into PlanView. This was by far the longest week of my career. I logged 73.5 hours.
