Personal Infrastructure Part 5: Core Supporting Services

Published on September 25, 2024 | Last updated on September 25, 2024


In this post, I describe the design of the core supporting services that will underpin the rest of the Personal Cloud.

Introduction

The core supporting services provide the foundation for building more advanced pieces of the system, such as Kubernetes clusters and load balancers. They exist before any "user-facing" services and make deploying and managing the main services easier. Eventually, I will be running a reliable, highly available set of personal services: file hosting, calendar, AI services, email, messaging, web hosting, and more.

I will build out this system using commercial off-the-shelf components, somewhat in the "prosumer" category, along with virtual servers rented from various cloud providers.

Goals

The goal for this step is to build a core layer of services that will support building out the entire personal cloud. This foundational layer needs to be robust against a variety of failures and provide valuable services to be used later.

Specifically, I'd like to:

  1. Be able to quickly create and destroy the second layer of infrastructure.
  2. Securely manage secrets.
  3. Establish secure communication between all nodes, from bare metal physical servers to ephemeral containers and virtual machines.
  4. Observe what is happening in all services, including core support services, in the form of logs, tracing, error reporting, and more.
  5. Leave existing "legacy" infrastructure untouched until it is smoothly migrated.

Constraints, Requirements, and Considerations

  • I'd like this setup to be reliable and relatively easy to use once up.
  • I'd like to use hardware I already have.
  • My home internet connection is not very good.
  • My friend will let me put hardware in his home and use his connection.
  • I will move.
  • Power outages occur.
  • Internet outages occur.
  • Theft or total hardware failure may occur.
  • Disk failures may occur.
  • Brain outages will occur.
  • I have a bunch of "legacy" services running, primarily as docker compose workloads manually started on aslan and green-lion. I'd like these not to be disturbed until they can gracefully be migrated to the new system.

For this proof of concept, I'd like to use the hardware I have. If I can use a mix of different machines with different hardware and OSs and still build a solid foundation, it helps prove this can be done on commodity hardware.

I'd like to utilize normal home internet connections for the majority of the connectivity. Ideally, home internet connections would be better. Perhaps building systems like this will create more demand for better home internet. One can dream.

I have some friends who will let me install "servers" in their homes. However, out of respect for their privacy, and to minimize the chance that their home networks are compromised, no home IP addresses are ever used for public ingress into the network.

I live a semi-nomadic life. I've moved just about every year or two since graduating from college in 2013. Therefore, my personal cloud should move with me with minimal disruption.

Taken together, and chewed primarily by my right brain, these considerations spawn some derived requirements:

  1. I should be able to add or remove any single node without any ill effects.
  2. The core of the network should be almost bulletproof, and it should remain possible to fix things in many scenarios. I should be able to SSH into any ground-level server no matter what else I do.
  3. I should be able to create a functionally identical cluster from backups within hours, if not minutes, assuming "application" data is instantly available.
  4. There should be backups of all critical data. This includes all configuration for all servers, including keys, topology, and Terraform or Pulumi state, as well as any "application" data running on the cluster(s). A rough sketch of what this could look like follows this list.
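As an illustration of requirement 4, here is a minimal sketch of the sort of script I have in mind: bundle the critical state and push it to another node over the mesh. The paths, and the choice of snow-leopard as the destination, are placeholders rather than decisions I've made.

```python
#!/usr/bin/env python3
"""Sketch: archive critical config/state and ship it to another node.

The paths and the destination directory are hypothetical placeholders.
"""
import subprocess
import tarfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical locations of critical data: server config, keys, IaC state.
CRITICAL_PATHS = [
    Path("/etc/wireguard"),
    Path("/srv/iac/state"),        # Terraform/Pulumi state, if kept locally
    Path("/srv/cluster-config"),   # topology, node inventory, etc.
]
BACKUP_HOST = "snow-leopard"       # the disk-heavy node, reachable over the mesh
BACKUP_DIR = "/backups/core"

def main() -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = Path(f"/tmp/core-backup-{stamp}.tar.gz")

    # Archive everything that exists; missing paths are simply skipped.
    with tarfile.open(archive, "w:gz") as tar:
        for path in CRITICAL_PATHS:
            if path.exists():
                tar.add(path, arcname=path.name)

    # Ship the archive to another node over SSH (mesh address assumed reachable).
    subprocess.run(
        ["scp", str(archive), f"{BACKUP_HOST}:{BACKUP_DIR}/"],
        check=True,
    )

if __name__ == "__main__":
    main()
```

Anything fancier (encryption, retention, deduplication) can come later; the point is that recreating the cluster should start from artifacts like these.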

If I shake these requirements around and stir them a few times, a rough design begins to emerge.

Requirement Implementation Table

| Requirement | Mitigation |
| --- | --- |
| Handle power outage | Servers auto-boot |
| Handle internet outage | Multiple ingress nodes, multiple backend servers |
| Never lose connectivity | Have all servers join Wireguard P2P mesh, and rarely mess with "ground level" servers |
| Quickly create and destroy clusters | Automate deploying clusters. Have core supporting infrastructure to make this easy |
| Easily recover from serious failures | Create entire cluster automatically from description in code. Back up all the things. Test that the backups work, and intentionally break stuff |
| Easy adding and removing of nodes | Make sure that no node is critical. Make services runnable on any, or at least multiple nodes |
| I will move | There should be no requirement on the home internet connection. Don't use static IPs. Connect everything with Wireguard/Headscale |

Taking all these factors into consideration, here's what I've arrived at:

Physical Server Setup

Note: I consider all of these "physical" even though some of them are VPSes. They are the lowest-level computers my personal cloud will have knowledge of.

I will set up some physical servers at my home. They will be connected to a standard, boring Ethernet switch behind a pfSense router that also serves as a Wireguard server. Anything plugged into that switch, or connected remotely via Wireguard, is IP-routable on a private subnet. Right now, I have aslan and green-lion connected, and bagheera connected but powered off.
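To keep myself honest about "never lose connectivity," a tiny check like the following can run from any node and confirm the mesh is intact. The hostnames are from this post; the mesh addresses are made-up placeholders.

```python
#!/usr/bin/env python3
"""Sketch: ping every node over the Wireguard mesh and report reachability.

The addresses below are placeholders, not my real mesh assignments.
"""
import subprocess

MESH_NODES = {
    "aslan": "100.64.0.1",
    "green-lion": "100.64.0.2",
    "gateway": "100.64.0.10",
    "snow-leopard": "100.64.0.11",
}

def reachable(address: str) -> bool:
    # One ping, short timeout; assumes Linux ping(8) flags.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for name, addr in MESH_NODES.items():
        status = "up" if reachable(addr) else "DOWN"
        print(f"{name:>12} ({addr}): {status}")
```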

I will rent three virtual machines: gateway, the primary ingress node, a small VPS hosted geographically near me in Los Angeles; snow-leopard, a very disk-heavy dedicated server also located geographically close to me; and palantir (name not yet decided; I may pick another cat), a secondary ingress node from a different vendor in a different region.

Public traffic will enter my network through one of the two ingress nodes, which will have public IP addresses. A load balancer will route the traffic to the correct backend node. As the architecture advances, much traffic will flow through an API gateway, which will authenticate the agent making the request and pass on headers so that many backend services need not worry about authentication and authorization.
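To show why the gateway pattern is worth it, here is a sketch of how simple a backend can be once authentication happens upstream. The header name X-Auth-User is an assumption for illustration; whatever the gateway actually injects, the backend just trusts it on the private network.

```python
#!/usr/bin/env python3
"""Sketch: a backend that trusts an identity header set by the API gateway.

The header name (X-Auth-User) is hypothetical; only traffic that already
passed through the gateway should ever reach this listener.
"""
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # The gateway has already authenticated the caller and attached
        # their identity; the backend just reads the header.
        user = self.headers.get("X-Auth-User")
        if user is None:
            self.send_response(401)
            self.end_headers()
            self.wfile.write(b"request did not come through the gateway\n")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"hello, {user}\n".encode())

if __name__ == "__main__":
    # In practice this would listen only on the private mesh interface;
    # binding to all interfaces just keeps the sketch short.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```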

Probable Software Stack

  • Set up Headscale to create a management network for all "physical" servers
  • Use Packer to create standardized VM images
  • Use Terraform or Pulumi IaC to start VMs on physical servers (a Pulumi sketch follows this list)
  • Set up Nomad, with clients on all VMs and physical servers, and the Nomad server on one or more VMs
  • Deploy Vault for secrets management
  • Set up monitoring tools (e.g., Prometheus, Grafana)
  • Implement a logging solution (e.g., the ELK Stack)
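To give the IaC step some shape, here is a minimal Pulumi program in Python. Rather than starting a VM on one of my own machines (which needs a hypervisor-specific provider), it rents a small ingress VPS like gateway; the Hetzner Cloud provider, server type, image, and location are illustrative assumptions, not a vendor decision.

```python
"""Sketch: renting an ingress VPS with Pulumi (Python SDK).

The provider (Hetzner Cloud), server type, image, and location are
illustrative assumptions; any provider with a Pulumi package looks similar.
"""
import pulumi
import pulumi_hcloud as hcloud

# A small VPS to act as the primary ingress node.
gateway = hcloud.Server(
    "gateway",
    server_type="cx22",      # placeholder instance size
    image="ubuntu-24.04",    # placeholder image
    location="hil",          # placeholder region
)

# Export the public address so later steps (DNS, load balancer config,
# Headscale enrollment) can consume it.
pulumi.export("gateway_ipv4", gateway.ipv4_address)
```

Terraform would express the same thing in HCL; part of the point of the next steps is to try both and see which one I'd rather live with.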

Stack diagram showing physical, logical and containerized systems

Ongoing Thought Process

I am spending a lot of time trying to figure out the best bootstrap dance here. In essence, I am asking myself the question: What is the most repeatable way to wrangle a pile of miscellaneous compute devices located all over the world into a consistent platform that can be built upon with confidence?

VMs as Building Blocks

If the same VM image is running on two different hosts, it will behave similarly enough to be considered a stable building block. Whether two running instances can be treated identically is a different question, ideally one that can be answered programmatically at runtime. For example, each may have a different CPU, a different amount of RAM, a different GPU (or none), and other variations.

As long as these differences can be introspected at runtime, I will consider a running VM a stable building block. Some services may require stable building blocks that are not available, but that is a different problem.
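Concretely, "introspected at runtime" could be as small as a script every VM runs at boot to report what it actually has. This sketch assumes Linux and treats "nvidia-smi is on the PATH" as a stand-in for "has a usable GPU," both of which are simplifications.

```python
#!/usr/bin/env python3
"""Sketch: report the hardware a VM actually has, so the scheduler (or a
human) can decide whether two instances are interchangeable."""
import json
import os
import platform
import shutil

def hardware_facts() -> dict:
    return {
        "hostname": platform.node(),
        "arch": platform.machine(),
        "os": platform.system(),
        "cpu_count": os.cpu_count(),
        # Total RAM in GiB via sysconf (Linux); other OSs need another path.
        "ram_gib": round(
            os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30, 1
        ),
        # Crude GPU check: is the NVIDIA tool on the PATH?
        "has_nvidia_gpu": shutil.which("nvidia-smi") is not None,
    }

if __name__ == "__main__":
    print(json.dumps(hardware_facts(), indent=2))
```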

Known Unknowns

These are questions or areas that I am aware I have not explored fully.

  1. Network-aware scheduling of workloads. I'm not sure how to decide where workloads are scheduled based on the performance of different network segments. For example, I'd like my Plex server to be running in the same building that I live in. This may be manual, or maybe there are tools to automate this scheduling and placement (see the sketch after this list).
  2. Should I use Terraform or Pulumi IaC? How much of the stack should be managed by each of these tools? Should there be multiple instances of them to make a nested platform? For example, should I deploy a Vault instance to a VM and use that to store secrets used to deploy another Vault in a high-availability setup?
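For the first unknown, the manual version at least seems tractable: label each Nomad client with where it physically lives, and place latency-sensitive jobs accordingly. The sketch below assumes a hypothetical site meta attribute in each client's config; once nodes are labelled like this, a job constraint on ${meta.site} could pin Plex to the building I live in.

```python
#!/usr/bin/env python3
"""Sketch: group Nomad clients by a hypothetical `site` meta attribute, as a
first step toward placing latency-sensitive jobs (like Plex) near home.

Assumes each client's config sets something like meta { site = "home" };
the attribute name and the Nomad address are placeholders.
"""
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # placeholder; likely a mesh address

def get(path: str):
    # Fetch a Nomad HTTP API endpoint and parse the JSON response.
    with urllib.request.urlopen(f"{NOMAD_ADDR}{path}") as response:
        return json.load(response)

if __name__ == "__main__":
    for stub in get("/v1/nodes"):
        node = get(f"/v1/node/{stub['ID']}")
        site = (node.get("Meta") or {}).get("site", "unknown")
        print(f"{node['Name']:>15}  site={site}  status={node['Status']}")
```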

Next Steps

  1. Try out Terraform and Pulumi IaC and choose one for now.
  2. Build out the core infrastructure as described here.
  3. Work on the secondary core services. These will include a service mesh, API gateway, unified authentication server, and possibly a Kubernetes cluster for fun.