History of DRBD at ITEG & Clazzes.org
For many years (since CentOS 5) we at ITEG (also the main force behind Clazzes.org) have used DRBD to mirror the partition of each virtual machine (then OpenVZ, now mostly LXC).
Kernel 4.6 hiccups, with DRBD 8.4.6 as a possible cause
After shredding some data last year, DRBD now seems to be causing trouble again (Debian jessie, kernel 4.6 from jessie-backports, DRBD module 8.4.6).
Several times a single LXC guest became unreachable and the host's "load" started to rise slowly but continuously, up to ~1000 (!) and more. Fortunately the host still allowed login via ssh, but unfortunately there was only one way out: the reset button. Ouch. WTF.
With the exception of one case under Kernel 4.4 (DRBD module 8.4.5) the last non-trivial syslog entries always included DRBD:
Dismissed solution ideas: DRBD9? Commercial support?
Our first idea was to try DRBD9, maybe with commercial support.
So we filled out Linbit's contact form, and got called back quickly.
Conclusion 1: Don't migrate from DRBD8 to DRBD9 unless you need >2 nodes
DRBD9 is for multi-node operation.
For 2-node operation DRBD8 is fine, recommended, and will be supported for years to come (or so they say; this is 2016-08-29).
Conclusion 2: Commercial support prices not for us
We mirror within several pairs of off-the-shelf "pizza boxes", not between 2 top-of-the-line, rack-high elephants.
This doesn't fit their pricing, which is per node (and per year).
So long; we are staying with a pure Open Source approach then.
Trying out DRBD9 for curiosity's sake
One of our server pairs, running Debian jessie, is due to be decommissioned soon and hosts nothing of relevance.
So we used Linbit's semi-official Ubuntu PPA to upgrade to DRBD9 the Open Source way.
We managed to do it, somehow. But I don't recommend it for production systems. There are too many hiccups and not too much to gain, unless you want to migrate to a multi-node setup, in which case I strongly recommend using new nodes anyway.
Current solution attempt: Packaging Upstream DRBD 8.x Kernel Module for Debian
Even without checking the mailing list archives, it seems unlikely that a possible severe problem in DRBD 8.4.6 (almost 16 months old now) is still unfixed in the current 8.x module version.
DRBD's contact area with the rest of the kernel or userspace (libc6) has always been quite thin (Unix rules, Linux without SystemDisabler still is a Unix).
Building the module via DKMS (Thanks Dell!?) is no rocket science and widely documented, e.g. in Proxmox PVE's Wiki page on how to Build DRBD kernel module (Proxmox VE is a nice LXC+KVM+HA virtualization distro, using Debian OS, Ubuntu LTS kernels, and their own unified administration tools & UI).
And, after all, DRBD sources are still Open Source.
So, let's build ;-)
Up-to-date DRBD 8 packages in Clazzes.org's Debian repository
We are using the packages below on 4 nodes so far (as of 2016-08-30), with 6 more nodes going to use them after the next reboot.
Remark: On some nodes dkms triggered the installation of a linux-image-3.2.0-4-rt-amd64 which can be removed afterwards.
Clazzes.org's Deb server deb.clazzes.org contains a repository "jessie-drbdpkg-8" providing 2 DRBD packages:
This package contains the most recently available sources of the DRBD8 module, along with DKMS integration for Debian jessie (probably usable for jessie-based derivatives too).
On installation of drbd8-dkms 8.4.9-1 (or on later installation of a new linux-headers-*-amd64 alongside the matching linux-image-*-amd64 package) the up-to-date DRBD module is automatically built and installed.
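To confirm that the automatic build actually produced a module for a given kernel, one can look into the DKMS tree. The following is only a minimal sketch: the function name dkms_module_built is made up for illustration, and the directory layout is assumed to match the /var/lib/dkms/drbd/... path quoted further down this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of the packages): print the path of a
# DKMS-built drbd.ko for the given kernel release and return 0, or
# return 1 if none is found.
# Assumed layout: <root>/drbd/<module version>/<kernel>/<arch>/module/drbd.ko
dkms_module_built() {
    kernel=$1
    root=${2:-/var/lib/dkms}   # second argument only needed for testing
    for ko in "$root"/drbd/*/"$kernel"/*/module/drbd.ko; do
        # An unmatched glob stays literal, so test for a real file.
        if [ -f "$ko" ]; then
            printf '%s\n' "$ko"
            return 0
        fi
    done
    return 1
}

# Example use before rebooting into a new kernel:
#   dkms_module_built 4.6.0-0.bpo.1-amd64 || echo "no DKMS-built drbd.ko found"
```

Running this for the kernel you are about to boot, rather than only for the running one, is the more useful check, since a missing build only hurts after the reboot.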
We 'cheated' a bit here: drbd-utils_8.9.7-1 is the package from Debian's unstable/experimental repositories, re-integrated in our DRBD8-repository.
This way any Debian jessie installation has access to up-to-date DRBD8 packages without the need to care about compiling manually or adding
Debian unstable to
DKMS problems, solutions, hints
1 of 5 hosts rebooted so far, going from 4.4 to 4.6 through the reboot, failed to load the DRBD module, claiming
Possible root cause
This could be due to ABI changes in the kernel, from 4.4 to 4.6. Our other reboots so far have been without changing the kernel version, 4.6 before and after.
However, 2 more of our 8 nodes are running 4.5, and their /var/lib/dkms/drbd/8.4.8-1clazzes1/4.6.0-0.bpo.1-amd64/x86_64/module/drbd.ko has a different size from that on the nodes running 4.6, although /lib/modules/4.6.0-0.bpo.1-amd64/kernel/drivers/block/drbd/drbd.ko has the same size and MD5 sum on machines running 4.5 or 4.6.
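The size and MD5 comparison above is easier to repeat across nodes with a small helper whose output can be diffed. This is a rough sketch, not the command we actually used: the function name module_fingerprint is invented here, and it assumes md5sum and wc from coreutils are available.

```shell
#!/bin/sh
# Hypothetical helper: print "<size in bytes> <md5> <path>" for each
# file given, or "missing - <path>" if a file is not readable.
module_fingerprint() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            size=$(wc -c < "$f" | tr -d ' ')
            sum=$(md5sum "$f" | cut -d' ' -f1)
            printf '%s %s %s\n' "$size" "$sum" "$f"
        else
            printf 'missing - %s\n' "$f"
        fi
    done
}

# Example use, with the two paths mentioned above:
#   module_fingerprint \
#     /var/lib/dkms/drbd/8.4.8-1clazzes1/4.6.0-0.bpo.1-amd64/x86_64/module/drbd.ko \
#     /lib/modules/4.6.0-0.bpo.1-amd64/kernel/drivers/block/drbd/drbd.ko
```

Collecting this output from every node into one file per host makes a plain diff enough to spot the odd one out.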
I solved it with ...
Suspected faster solution
It would probably have been sufficient to just ...
Proposed check command
It might be a good idea to perform this every now and then: