James Coates

Computer Science student. Sydney, Australia.



Hardening the Home Lab: Custom Homebridge Logic and Automated TrueNAS Recovery

· Technology

A home lab earns its name when it stops being a toy. The moment a household relies on it to cool a bedroom or stream a film, every shortcut you took during the build becomes a future incident waiting to happen. This post documents two recent pieces of work that pushed my lab a little further out of "hobbyist tinkering" and into something closer to the systems I work on day to day at Lanrex: a custom Homebridge patch that fixed a broken Tuya air conditioner integration, and an automated TrueNAS recovery pipeline that mirrors my Plex server and Raspberry Pi to ZFS every week.

Neither of these is glamorous. Both are the kind of documented, repeatable work that separates a system that runs from a system you can actually trust.

The Problem: A Slider That Refused to Stay Put

The trigger for the Homebridge work was mundane. A Tuya-based split system air conditioner, exposed to HomeKit through the Homebridge Tuya platform plugin, had a fan speed slider that simply would not behave. Drag it to 50 percent, let go, and it would snap back to 0. Drag it to 100, and it might briefly register before resetting. Cooling worked. Fan mode worked in isolation. The slider, the one control most people actually touch, was unusable.

This is the kind of bug that looks trivial until you open the plugin source. The root cause sat at the intersection of two completely incompatible assumptions about what a "fan speed" actually is.

Apple's Scale vs Tuya's Scale

Apple's HomeKit specification defines RotationSpeed as a floating point value from 0 to 100, intended to be presented to the user as a percentage. Tuya's underlying wind speed datapoint is an integer enumeration with three valid steps, typically 1 for low, 2 for medium and 3 for high, with 0 representing off.

The plugin's original code declared the characteristic with a minValue of 0 and a maxValue of 3, then handed those values straight back to HomeKit. From HomeKit's perspective, the slider was now operating on a 0 to 3 scale, but the iOS UI still rendered it as a percentage bar from 0 to 100. Drag it to what visually looks like 50 percent, and HomeKit sends 50 to a characteristic that thinks its valid range is 0 to 3. The plugin clamps that down, the device responds with its current state of 0, and HomeKit obediently snaps the slider back. The user sees a slider that "resets." The protocol sees two systems talking past each other.

Rewriting the Characteristic

The fix had two parts. The first was to redefine the characteristic so HomeKit and the plugin agreed on the scale. Inside the plugin's accessory file, the RotationSpeed characteristic was reconfigured with setProps so that HomeKit could keep its native 0 to 100 range:

service
  .getCharacteristic(Characteristic.RotationSpeed)
  .setProps({
    minValue: 0,
    maxValue: 100,
    minStep: 1,
  });

The second part was a translation layer between the percentage HomeKit sends and the integer Tuya expects. A simple if/else mapping handles the three-step nature of the device cleanly, with the boundaries chosen so the slider feels natural to drag:

function percentToTuyaWind(percent) {
  if (percent <= 0) return 0;
  if (percent <= 33) return 1;
  if (percent <= 66) return 2;
  return 3;
}
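The mapping needs a mirror image for reads. When HomeKit polls the characteristic, the plugin has to report a percentage, and the value it reports should survive a round trip through percentToTuyaWind so the slider does not drift. A minimal sketch (the midpoint percentages here are my own choice, not the plugin's):

```javascript
// Map Tuya's 0–3 wind enum back to a representative percentage.
// The values are chosen so that feeding the result back through
// percentToTuyaWind returns the original enum unchanged.
function tuyaWindToPercent(wind) {
  const midpoints = { 0: 0, 1: 33, 2: 66, 3: 100 };
  return midpoints[wind] ?? 0;
}
```

With 33, 66 and 100 as the reported values, each sits exactly on a boundary of the forward mapping, so the get and set paths can never disagree about the slider position.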

That alone made the slider behave. What it did not do was stop the plugin from hammering the Tuya cloud API every time a finger moved across the screen.

Debouncing the Signal Spam

HomeKit emits a set event for almost every intermediate value as a slider is dragged. On a touchscreen, that can be dozens of events in under a second. Each one fires a network request to Tuya, and Tuya rate limits aggressively. Without debouncing, dragging the slider from 0 to 100 would briefly report success, then fail outright as the upstream API started rejecting requests, and the device would end up in whatever state corresponded to the last accepted call rather than the one the user actually wanted.

An 800 ms debounce wrapped around the set handler resolved the entire class of failure. The plugin now waits for the user to settle on a value before issuing a single API call:

let debounceTimer = null;

function setSpeedDebounced(value) {
  if (debounceTimer) clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => {
    const tuyaValue = percentToTuyaWind(value);
    device.set({ dps: WIND_DP, set: tuyaValue });
  }, 800);
}

The final piece was removing a conflicting "fan mode" override that the plugin was applying whenever cooling was active. The override was forcing the wind speed datapoint back to a default whenever the unit changed mode, which neatly undid every manual change the user made. Stripping that handler allowed cooling and fan speed to operate as the genuinely independent characteristics HomeKit assumes them to be.

Persistence: Making a node_modules Edit Survive an npm install

Editing files inside node_modules is, in almost every other context, a sin. The directory is treated as ephemeral, and any developer who has watched an npm install quietly wipe out a hand-edited dependency file knows why. Write-protecting the file is worse: it does not preserve the change, it just breaks the next update.

The professional answer is patch-package. The workflow is small and self-contained:

  1. Edit the file inside node_modules until the behaviour is correct.
  2. Run npx patch-package homebridge-tuya-platform to generate a diff, stored in a patches/ directory at the repository root.
  3. Add a postinstall hook to package.json so that every install run replays the patch automatically.
  4. Commit npm-shrinkwrap.json so the dependency tree is locked in lockstep with the patch.

The hook itself is a one-liner in package.json:

{
  "scripts": {
    "postinstall": "patch-package"
  }
}

The diff itself is small, human readable, and reviewable. When the upstream plugin eventually fixes the scale mismatch properly, the patch will fail to apply cleanly and force a conscious decision rather than silently masking the upstream fix. That is exactly the behaviour you want from a workaround. It is loud when it stops being necessary.

Disaster Recovery: rsync, ZFS, and the 100 Megabit Bottleneck

The second piece of work was less clever and more important. Two machines in the lab carry state that genuinely matters. The Homebridge instance runs on a Raspberry Pi 3B (doorpi), which also drives the GPIO relay for the front door lock. The Plex server runs on an Ubuntu host with roughly 12 GB of OS state on its system SSD and 8 TB of media spread across two external drives mounted at /mnt/Plex1 and /mnt/Plex2.

Until this month, neither machine had a real recovery story. If the Pi's SD card died, the door lock and every Homebridge accessory would go down for as long as it took to rebuild from memory. If the Plex SSD failed, the OS, the database and the watch history all went with it, even though the media itself was safe.

Hardware Constraints That Shape the Strategy

The Pi 3B has a single shared bus for USB and Ethernet, capped at 100 Mbps in practice. Any backup strategy that assumes gigabit throughput is wrong from the start. A full block-level image of even a modest 32 GB SD card over that link would take the better part of an hour (32 GB at a theoretical 100 Mbps is roughly 43 minutes, before protocol overhead), long enough to be operationally awkward, and would copy huge volumes of empty space along with the data that actually matters.

File-level rsync wins on every axis here. It transfers only what has changed since the last run, it preserves permissions and ownership when invoked correctly, and it can be told to ignore the parts of the filesystem that should never be backed up in the first place.

The Exclusion List

A naive rsync -aAX / target/ will happily try to copy /proc, /sys and /dev, none of which are real files on disk. They are kernel-managed virtual filesystems that exist only at runtime, and copying them is at best wasteful and at worst actively harmful when the backup is restored. The exclusion list is non-negotiable:

rsync -aAXH --delete \
  --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} \
  / backup@truenas.lan:/mnt/tank/backups/plex/

The /mnt/* exclusion is the one that matters most on the Plex host. Without it, the backup would dutifully attempt to mirror 8 TB of media into a TrueNAS dataset sized for OS state, and the job would never complete. Excluding /mnt/ cleanly isolates the 12 GB of OS, configuration and Plex database from the bulk media that lives on the externally mounted drives. The media is already redundant by virtue of the original source library; the irreplaceable data is the metadata, the watch history and the configuration that took hours to tune.
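For completeness, a sketch of what /usr/local/sbin/backup-plex.sh might contain. The rsync invocation is the one above; the shell options and the flock guard, so an overlong run can never overlap the next one, are my own additions rather than the lab's actual script:

```shell
#!/bin/bash
# Mirror the Plex host's OS state to TrueNAS. Assumes the dedicated
# SSH key for root is already in place.
set -euo pipefail

# Take an exclusive lock so overlapping runs skip cleanly instead of
# fighting over the same destination.
exec 9>/run/backup-plex.lock
flock -n 9 || { echo "previous backup still running, skipping"; exit 0; }

rsync -aAXH --delete \
  --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} \
  / backup@truenas.lan:/mnt/tank/backups/plex/
```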

Passwordless ED25519 and Why root Matters

Both backups run unattended on a schedule, which means the SSH transport has to be passwordless. The right answer is a dedicated ED25519 key pair with no passphrase, restricted on the TrueNAS side to a specific account that can only write into the backup dataset. ED25519 is preferred over RSA here for the usual reasons: smaller keys, faster handshakes, modern curve, no ambiguous parameter choices.
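Setting that up is two commands. The key path and comment below are stand-ins (on the real hosts the private key lives under /root/.ssh, since the jobs run as root), and the restrict option on the TrueNAS side strips the key of everything it does not need:

```shell
# Generate a dedicated, passphrase-less ED25519 key pair for the
# backup jobs. The path and comment here are illustrative.
KEYFILE="$(mktemp -d)/backup_ed25519"
ssh-keygen -q -t ed25519 -N "" -C "backup@plex-host" -f "$KEYFILE"

# The public key is appended to the backup account's authorized_keys
# on TrueNAS, prefixed with "restrict", which disables port forwarding,
# agent forwarding, X11 and PTY allocation for that key:
printf 'restrict %s\n' "$(cat "$KEYFILE.pub")"
```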

Both cron jobs run as root. This is not laziness. A full system mirror needs to read files that only root can read, including /etc/shadow, the Homebridge service's persistence directory, and the parts of /var that are mode 700 for the daemons that own them. Running rsync as a regular user would silently skip those files and produce a backup that looked complete and was not.

The schedule is staggered to keep the TrueNAS link uncontended:

45 5 * * 0 /usr/local/sbin/backup-plex.sh   >> /var/log/backup-plex.log   2>&1
45 6 * * 0 /usr/local/sbin/backup-doorpi.sh >> /var/log/backup-doorpi.log 2>&1

Plex runs at 05:45 every Sunday, the Pi at 06:45. The hour gap is enough to guarantee the Plex job has cleared its peak network usage before the Pi starts pushing its own delta over the same 100 Mbps link.

On the TrueNAS side, the destination is a dedicated dataset with ZFS snapshots configured to retain weekly checkpoints. The combination is the important part. rsync gives a current mirror; ZFS snapshots give a history. A file deleted accidentally on Wednesday and only noticed the following Monday is still recoverable from the previous snapshot, not silently overwritten by the next rsync run.
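On TrueNAS this is configured as a periodic snapshot task in the UI, but the CLI equivalent shows the mechanics. The dataset name matches the rsync destination above; the weekly- naming convention and the example file path are my own:

```shell
# Take a named weekly checkpoint of the backup dataset.
zfs snapshot tank/backups/plex@weekly-$(date +%F)

# List the checkpoints that exist for the dataset.
zfs list -t snapshot -r tank/backups/plex

# Recover a single file from an earlier checkpoint without rolling
# anything back: every snapshot is browsable read-only under the
# dataset's hidden .zfs directory.
cp /mnt/tank/backups/plex/.zfs/snapshot/weekly-2025-06-01/etc/fstab /tmp/fstab.recovered
```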

The Enterprise Perspective

None of this is novel. Everything in this post is the same set of techniques I see at Lanrex every week, applied at a smaller scale: standardised deployment, scripted recovery, documented exclusion lists, dedicated service accounts, logged scheduled jobs. The only real difference between a home lab and a managed enterprise environment is the number of zeros in the asset list.

What the home lab makes obvious, in a way that no theoretical framework ever quite manages, is that documentation is a first-class part of the process, not something you write afterwards. The patch only survives because the postinstall hook is committed and the diff is in version control. The backup only works because the exclusion list is in a script, not in someone's head. When a rebuild eventually happens, the recovery time is the time it takes to read the runbook, not the time it takes to remember.

There is a useful reality check in all of this. The clever part of the Homebridge fix took an afternoon. Making it survive an npm install took ten minutes and is the only reason it still works today. The clever part of the backup is nothing at all; it is just rsync. The valuable part is the schedule, the exclusion list, the key management and the ZFS retention policy underneath it.

Managing a home network at this level is, in the most direct way, a way of seeing enterprise IT from the inside without the politics. The constraints are real, the failure modes are real, and the consequences land on someone you actually live with. That is the part that turns hobbyist tinkering into something worth calling a system.

The lab is now in a state where the two machines that genuinely matter can be rebuilt from a single TrueNAS dataset, and the air conditioner does what its slider says it should. Neither of those was true a month ago. Both of them needed to be.

