kernel panic

Shhhh… do you hear that? That’s the shit winds


Thursday night was otherwise uneventful. Assisting James with migrating a VM from the now-unsupported environment to the new one went smoothly. Initial testing proved fine, our internal customers experienced all of about two minutes of downtime, and we were anticipating this morning to be uneventful. Well, anyone that’s been in IT for a while knows that those easy moves always have this white noise lingering in the background – something just doesn’t settle right in the stomach until there’s absolute proof everything is a go. At around 0700 hours, before I even had the opportunity to put feet on the floor, I was roused with the siren.


Shit was broken. Our internal customers were panicking. We were losing money for every minute this problem was ongoing. The problem was, we didn’t know what the problem was. The code running on the new instance was not the code that had been running on the old instance. Our first concern was that some NAT rule had gotten crossed during the migration and was routing traffic to a non-production server currently in testing. The problem is, the server in question runs far newer code. There’s no way it could be serving what we were seeing – the UI is clearly different. We ended up bringing 15 people into the triage call just to figure out the full scope. One thing we knew right off was this wasn’t your typical problem.


Backtrack


Monday we brought on a Linux contractor to assist me with some heavy lifting upgrading a handful of long-outdated Linux servers. This guy knows what he’s doing – it’s a breath of fresh air being one of two Linux guys in a Microsoft shop. Anyways, he’s been wrapping up some work on a server that’s now ready to go into production. During triage I asked him if he’d seen anything like this – some kind of weird hard drive caching issue. Lo and behold, he has. He mentioned some strange behavior that can happen running Linux on VMware, where writes sit in the cache instead of landing on disk. The caveat is it’s super difficult to even confirm it’s happening: a graceful reboot ends up writing the cache to disk, so any trace of the behavior is essentially removed.
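For what it’s worth, the reason a graceful reboot destroys the evidence is that Linux flushes dirty pages from the page cache to disk on shutdown. As a rough illustration (not something we actually ran at the time), a quick read of /proc/meminfo shows how much data is still sitting in the write cache before you pull the trigger:

#!/usr/bin/env python3
"""Rough check of how much data is sitting in Linux's write cache.
Illustrative sketch only: reads the Dirty/Writeback counters from
/proc/meminfo, i.e. data that a graceful reboot would flush to disk."""

def dirty_cache_kb(meminfo_path="/proc/meminfo"):
    wanted = {"Dirty", "Writeback"}
    values = {}
    with open(meminfo_path) as f:
        for line in f:
            # Lines look like "Dirty:    1234 kB"
            key, _, rest = line.partition(":")
            if key in wanted:
                values[key] = int(rest.split()[0])
    return values

if __name__ == "__main__":
    for key, kb in dirty_cache_kb().items():
        print(f"{key}: {kb} kB waiting to hit disk")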


Given that detail, my thought was: why don’t we hash both VMs and compare? If the image on the new platform is a direct replica of the old VM, the hashes should be the same. This is where another issue crops up – the new platform requires some additional steps to configure the networking, and those changes touch the new VM’s disk. A single-bit difference between the VMs results in different hashes, so we hit another wall. At this point, we know the situation is this: we have code that’s weeks old – revisions that were pushed to the old server last week are gone. Changes made over the past three weeks: gone. We’re looking at a snapshot from weeks ago, but the replication process started on Sunday. There’s no reason we should be in the situation we are, but we’re here.
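The comparison we had in mind was something along these lines – hash both exported disk images chunk by chunk and compare the digests. The paths below are made up, but the idea is the same: any single-bit difference, including the networking changes the new platform requires, breaks the match.

#!/usr/bin/env python3
"""Sketch of the hash comparison idea (image paths are hypothetical)."""
import hashlib

def sha256_of(path, chunk_size=4 * 1024 * 1024):
    # Hash the image in chunks so we never hold the whole disk in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

old = sha256_of("/exports/old-vm/disk.vmdk")  # hypothetical path
new = sha256_of("/exports/new-vm/disk.img")   # hypothetical path
print("match" if old == new else f"differ:\n  old {old}\n  new {new}")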


We were looking at two options. Option one: I wipe the web root and load a tarball from the git repo on my local system. That brings a few minutes of downtime, but the potential for introducing bugs is significant – it could prolong the downtime and move us further from a positive resolution. Option two: boot up the old VM and redirect network traffic back to it. We’re still looking at a few minutes of downtime, but if it goes smoothly, we resolve the issue and can go back to investigating what exactly happened during replication to cause this behavior on the new VM. We chose option two.


At this point, I’m still confounded as to what exactly occurred. We’re in the process of sandboxing the new VM so I can dig into it and confirm the code in the web root is in fact weeks old. That confirmation won’t bring many answers, though. The question remains: why is that code weeks old? If the replication started on Sunday, and no changes were made during the replication, what caused this? We can’t do a full post-mortem to ensure this doesn’t happen again (this is the second failed VM replication to this new system) if we don’t know what happened in the first place.
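One quick sanity check once the VM is sandboxed – and this is a hypothetical sketch, the web root path is just a placeholder – is to walk the web root and find the newest file modification time. If the newest mtime is weeks old, that at least confirms the code really is as stale as it looks.

#!/usr/bin/env python3
"""Find the most recently modified file under a web root (placeholder path)."""
import os
from datetime import datetime

def newest_mtime(root):
    latest, latest_path = 0.0, None
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime > latest:
                latest, latest_path = mtime, path
    return latest_path, datetime.fromtimestamp(latest)

path, when = newest_mtime("/var/www/html")  # hypothetical web root
print(f"Newest file: {path} modified {when:%Y-%m-%d %H:%M}")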


Lessons learned


I’m quickly realizing there are a lot of nuances when dealing with legacy code bases and legacy systems. It’s one thing to run a FreeBSD 6 box that’s been chugging along for a decade and never move it. It’s another to take an Ubuntu 14 box and move it from a VMware host on bare metal to a cloud-based VM host. On top of that, this VM hosts code that was not built for resilience and was never maintained. I’m realizing I can’t come in and refactor the base layer of this system without detrimental impacts to uptime. We’re moving to a better situation, and there are growing pains associated with that. The positive outcome of this is we have a team that is growing closer together and working cohesively to address these situations when they come up. It’s times like this that redefine what it means to be full stack.