The architecture behind a one-person tech startup

Last modified on April 08, 2021

April 7, 2021 · 23 minute read

Here's a long-form post breaking down the setup I use to run a SaaS. From load balancing to cron job monitoring to subscriptions and payments. There's a lot of ground to cover, so buckle up!

As grandiose as the title of this article may sound, I should clarify we're talking about a low-stress, one-person company that I run from my flat here in Germany. It's fully self-funded, and I like to take things slow. It's probably not what most people imagine when I say "tech startup".

I would not be able to do this without the vast amount of open-source software and managed services at my disposal. I feel like I'm standing on the shoulders of giants, who did all the hard work before me, and I'm very grateful for that.

For context, I run a one-man SaaS, and this is a more detailed version of my post on the tech stack I use. Please consider your own circumstances before following my advice; your own context matters when it comes to technical choices, there's no holy grail.

I use Kubernetes on AWS, but don't fall into the trap of thinking you need this. I learned these tools over several years, mentored by a very patient team. I'm productive because this is what I know best, and I can focus on shipping stuff instead. Your mileage may vary.

By the way, I drew inspiration for the format of this post from Wenbin Fang's blog post. I really enjoyed reading his article, and you might want to check it out too!

With that said, let's jump right into the tour.

A bird's eye view

My infrastructure handles multiple projects at once, but to illustrate things I'll use Panelbear, my most recent SaaS, as a real-world example of this setup in action.

Panelbear's performance monitoring feature
Browser Timings chart in Panelbear, the example project I'll use for this tour.

From a technical point of view, this SaaS processes a large amount of requests per second from anywhere in the world, and stores the data in an efficient format for real-time querying.

Business-wise it's still in its infancy (I launched six months ago), but it has grown rather quickly beyond my own expectations, especially as I originally built it for myself as a Django app using SQLite on a single small VPS. For my needs at the time, it worked just fine and I could have probably pushed that model quite far.

However, I grew increasingly frustrated having to reimplement a lot of the tooling I was so used to: zero downtime deploys, autoscaling, health checks, automatic DNS / TLS / ingress rules, and so on. Kubernetes spoiled me; I was used to dealing with higher-level abstractions, while keeping control and flexibility.

Fast forward six months and a couple of iterations: even though my current setup is still a Django monolith, I'm now using Postgres as the app DB, ClickHouse for analytics data, and Redis for caching. I also use Celery for scheduled tasks, and a custom event queue for buffering writes. I run most of these things on a managed Kubernetes cluster (EKS).

SaaS AWS architecture diagram
A high-level overview of the architecture.

It may sound complicated, but it's practically an old-school monolithic architecture running on Kubernetes. Replace Django with Rails or Laravel and you know what I'm talking about. The interesting part is how everything is glued together and automated: autoscaling, ingress, TLS certificates, failover, logging, monitoring, and so on.

It's worth noting I use this setup across multiple projects, which helps keep my costs down and lets me launch experiments easily (just write a Dockerfile and git push). And since I get asked this a lot: contrary to what you might be thinking, I actually spend very little time managing the infrastructure, usually 0-2 hours per month total. Most of my time is spent developing features, doing customer support, and growing the business.

That said, these are the tools I've been using for several years now and I'm pretty familiar with them. I consider my setup simple for what it's capable of, but it took many years of production fires at my day job to get here. So I won't say it's all sunshine and roses.

I don't know who said it first, but what I tell my friends is: "Kubernetes makes the simple stuff complex, but it also makes the complex stuff simpler".

Automatic DNS, SSL, and Load Balancing

Now that you know I have a managed Kubernetes cluster on AWS and I run various projects in it, let's make the first stop of the tour: how to get traffic into the cluster.

My cluster is in a private network, so you won't be able to reach it directly from the public internet. There are a couple of pieces in between that control access and load balance traffic to the cluster.

Essentially, I have Cloudflare proxying all traffic to an NLB (AWS L4 Network Load Balancer). This Load Balancer is the bridge between the public internet and my private network. Once it receives a request, it forwards it to one of the Kubernetes cluster nodes. These nodes are in private subnets spread across several availability zones in AWS. It's all managed, by the way, but more on that later.

SaaS ingress diagram
Traffic gets cached at the edge, or forwarded to the AWS region where I operate.

"But how does Kubernetes know which service to ahead the set aside a question to of to?" - That’s the set aside aside ingress-nginx is obtainable in. Briefly: it's an NGINX cluster managed by Kubernetes, and it's the entrypoint for all web site web site guests inside the cluster.

NGINX applies rate-limiting and other traffic shaping rules I define before sending the request to the corresponding app container. In Panelbear's case, the app container is Django being served by Uvicorn.

It's not much different from a traditional nginx/gunicorn/Django in a VPS approach, with added horizontal scaling benefits and an automated CDN setup. It's also a "set up once and forget" kind of thing, mostly a few files shared between Terraform/Kubernetes, and it's shared by all deployed projects.

When I deploy a new project, it's essentially 20 lines of ingress configuration and that's it:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  namespace: example
  name: example-api
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/limit-rpm: "5000"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "true"
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: example-api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: example-api
              servicePort: http

Those annotations declare that I want a DNS record, with traffic proxied by Cloudflare, a TLS certificate via letsencrypt, and that it should rate-limit the requests per minute by IP before forwarding the request to my app.

Kubernetes takes care of making those infra changes to reflect the desired state. It's a little verbose, but it works well in practice.

Automatic rollouts and rollbacks

GitOps CI pipeline
The chain of actions that takes place when I push a new commit.

Whenever I push to master on one of my projects, it kicks off a CI pipeline on GitHub Actions. This pipeline runs some codebase checks, end-to-end tests (using Docker compose to set up a complete environment), and once these checks pass it builds a new Docker image that gets pushed to ECR (the Docker registry in AWS).
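For illustration, a GitHub Actions workflow along these lines would cover those steps. This is just a sketch, not my actual pipeline: the compose file, service name and registry URL are placeholders, and the ECR login step is omitted.

# .github/workflows/ci.yml (sketch)
name: ci
on:
  push:
    branches: [master]
jobs:
  test-build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: End-to-end tests
        run: docker-compose -f docker-compose.test.yml up --exit-code-from tests
      - name: Build and push Docker image
        # (authentication against ECR omitted for brevity)
        env:
          ECR_REGISTRY: 123456789012.dkr.ecr.eu-central-1.amazonaws.com  # placeholder
        run: |
          docker build -t $ECR_REGISTRY/panelbear-webserver:$GITHUB_SHA .
          docker push $ECR_REGISTRY/panelbear-webserver:$GITHUB_SHA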

As far as the application repo is concerned, a new version of the app has been tested and is ready to be deployed as a Docker image:

panelbear/panelbear-webserver:6a54bb3

"So what happens subsequent? There’s a model uncommon Docker picture, however no deploy?" - My Kubernetes cluster has a factor often known as flux. It robotically retains in sync what's at the moment working inside the cluster and the newest picture for my apps.

Fluxcd release commit
Flux automatically keeps track of new releases in my infrastructure monorepo.

Flux automatically triggers an incremental rollout when there's a new Docker image available, and keeps a record of these actions in an "Infrastructure Monorepo".

I want version controlled infrastructure, so that whenever I make a new commit on this repo, between Terraform and Kubernetes, they will make the necessary changes on AWS, Cloudflare and other services to synchronize the state of my repo with what is deployed.

It's all version-controlled with a linear history of every deployment made. That means less stuff for me to keep in my head over time, since I don't have any magic settings configured via clicky-clicky on some obscure UI.

Think of this monorepo as deployable documentation, but more on that later.

Let it crash

A few years ago I used the Actor model of concurrency for various company projects, and fell in love with a lot of the ideas around its ecosystem. One thing led to another and soon I was reading books about Erlang, and its philosophy around letting things crash.

I may be stretching the idea too far, but in Kubernetes I like to think of liveness probes and automatic restarts as a way to achieve a similar effect.

From the Kubernetes documentation:
"The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs."

In practice this has worked almost flawlessly for me. Containers and nodes are meant to come and go, and Kubernetes will gracefully shift the traffic to healthy pods while healing the unhealthy ones (more like killing). Brutal, but effective.
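For reference, a liveness probe on a container can be as small as the snippet below; this is a minimal sketch, and the /health path, port and timings are illustrative rather than my exact values.

# Inside a Deployment's container spec
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3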

Horizontal autoscaling

My app containers auto-scale based on CPU/Memory usage. Kubernetes will try to pack as many workloads per node as possible to fully utilize it.

In case there are too many Pods per node in the cluster, it will automatically spawn more servers to increase the cluster capacity and ease the load. Similarly, it will scale down when there's not much going on.

Here's what a Horizontal Pod Autoscaler might look like:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: panelbear-api
  namespace: panelbear
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: panelbear-api
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 50

In this example, it will automatically adjust the number of panelbear-api pods based on the CPU usage, starting at 2 replicas but capping at 8.

Static assets cached by CDN

When defining the ingress rules for my app, the annotation cloudflare-proxied: "true" is what tells Kubernetes that I want to use Cloudflare for DNS, and to proxy all requests via its CDN and DDoS protection too.

From then on, it's pretty easy to make use of it. I just set standard HTTP cache headers in my applications to specify which requests can be cached, and for how long.

# Cache this response for 5 minutes
response["Cache-Control"] = "public, max-age=300"

Cloudflare will use these response headers to control the caching behavior on the edge servers. It works amazingly well for such a simple setup.
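In Django, that header is typically set with a decorator or directly on the response object; a minimal sketch (the view name and payload are just examples):

from django.http import JsonResponse
from django.views.decorators.cache import cache_control

# Let Cloudflare (and browsers) cache this response for 5 minutes
@cache_control(public=True, max_age=300)
def pricing_plans(request):
    return JsonResponse({"plans": ["free", "basic", "premium"]})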

I use Whitenoise to serve static files directly from my app container. That way I avoid needing to upload static files to Nginx/Cloudfront/S3 on every deployment. It has worked really well so far, and most requests get cached by the CDN as it fills up. It's performant, and keeps things simple.

I also use NextJS for a few static websites, such as the landing page of Panelbear. I could serve it via Cloudfront/S3 or even Netlify or Vercel, but it was easy to just run it as a container in my cluster and let Cloudflare cache the static assets as they're being requested. There's zero added cost for me to do this, and I can re-use all the tooling for deployment, logging and monitoring.

Application data caching

Besides static file caching, there's also application data caching (eg. results of heavy calculations, Django models, rate-limiting counters, etc...).

On one hand I leverage an in-memory Least Recently Used (LRU) cache to keep frequently accessed objects in memory, and benefit from zero network calls (pure Python, no Redis involved).

However, most endpoints just use the in-cluster Redis for caching. It's still fast and the cached data can be shared by all Django instances, even after re-deploys, whereas an in-memory cache would get wiped.
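A rough sketch of what both layers can look like in practice (assuming the django-redis package for the shared cache; the function and alias names are illustrative, not my actual code):

# settings.py - the shared Redis cache used by most endpoints
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://redis.panelbear.svc.cluster:6379/0",
    },
}

# somewhere in the app - a per-process LRU cache for hot, rarely-changing lookups
from functools import lru_cache

@lru_cache(maxsize=1024)
def get_site_settings(site_id: int) -> dict:
    ...  # pure Python, zero network calls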

Here's a real-world example:

My Pricing Plans are based on analytics events per month. For this, some kind of metering is necessary to know how many events have been consumed within the current billing period and enforce limits. However, I don't interrupt the service immediately when a customer crosses the limit. Instead a "Capacity depleted" email is automatically sent, and a grace period is given to the customer before the API starts rejecting new data.

This is meant to give customers enough time to decide if an upgrade makes sense for them, while ensuring no data is lost. For example during a traffic spike in case their content goes viral, or if they're just enjoying the weekend and not checking their emails. If the customer decides to stay on the current plan and not upgrade, there is no penalty and things go back to normal once usage is back within their plan limits.

So for this feature I have a function that applies the rules above, which requires several calls to the DB and ClickHouse, but gets cached for 15 minutes to avoid recomputing it on every request. It's good enough and simple. Worth noting: the cache gets invalidated on plan changes, otherwise it could take 15 minutes for an upgrade to take effect.

@cache(ttl=60 * 15)
def has_enough_capacity(site: Site) -> bool:
    """
    Returns True if a Site has enough capacity to accept incoming events,
    or False if it already went over the plan limits, and the grace period is over.
    """

Per-endpoint rate-limiting

While I enforce global rate limits at the nginx-ingress on Kubernetes, I sometimes want more specific limits on a per endpoint/method basis.

For that I use the excellent Django Ratelimit library to easily declare the limits per Django view. It's configured to use Redis as a backend for keeping track of the clients making requests to each endpoint (it stores a hash based on the client key, not the IP).

For example:

class MySensitiveActionView(RatelimitMixin, LoginRequiredMixin):
    ratelimit_key = "user_or_ip"
    ratelimit_rate = "5/m"
    ratelimit_method = "POST"
    ratelimit_block = True

    def get(self):
        ...

    def post(self):
        ...

In the example above, if the client attempts to POST to this particular endpoint more than 5 times per minute, the subsequent call gets rejected with a HTTP 429 Too Many Requests status code.

Rate limited HTTP error
The friendly error message you get when being rate-limited.

App administration

Django gives me an admin panel for all my models for free. It's built-in, and it's pretty handy for inspecting data for customer support work on the go.

Django admin panel
Django's built-in admin panel is very useful for doing customer support on the go.

I added actions to help me manage things from the UI. Things like blocking access to suspicious accounts, sending out announcement emails, and approving full account deletion requests (first a soft delete, and within 72 hours a full destroy).

Security-wise: only staff users are able to access the panel (me), and I'm planning to add 2FA for extra security on all accounts.

Additionally, whenever a user logs in, I send an automatic security email with details about the new session to the account's email. Right now I send it on every new login, but I might change it in the future to skip known devices. It's not a very "MVP feature", but I care about security and it was not complicated to add. At least I'd be warned if somebody logged into my account.
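Django's user_logged_in signal makes this kind of notification straightforward to wire up; here's a minimal sketch of the idea (the addresses and message are placeholders, not my actual code):

from django.contrib.auth.signals import user_logged_in
from django.core.mail import send_mail
from django.dispatch import receiver

@receiver(user_logged_in)
def send_new_login_alert(sender, request, user, **kwargs):
    # Notify the account owner about the new session
    send_mail(
        subject="New login to your account",
        message=f"New login from {request.META.get('REMOTE_ADDR')} "
                f"({request.META.get('HTTP_USER_AGENT', 'unknown')}).",
        from_email="security@example.com",
        recipient_list=[user.email],
    )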

Of course, there's way more to hardening an application than this, but that's out of the scope of this post.

Panelbear security email notification
Example security notification email you might receive when logging in.

Running scheduled jobs

Another interesting use case is that I run a lot of different scheduled jobs as part of my SaaS. These are things like generating daily reports for my customers, calculating usage stats every 15 minutes, sending staff emails (I get a daily email with the most important metrics) and whatnot.

My setup is actually pretty simple: I just have a few Celery workers and a Celery beat scheduler running in the cluster. They are configured to use Redis as the task queue. It took me an afternoon to set up once, and luckily I haven't had any issues so far.

I want to get notified via SMS/Slack/Email when a scheduled task is not running as expected. For example when the weekly reports task is stuck or significantly delayed. For that I use Healthchecks.io, but check out Cronitor and CronHub too, I've been hearing great things about them as well.

Healthchecks.io cron job monitoring dashboard
The cron job monitoring dashboard from Healthchecks.io

To abstract their API, I wrote a small Python snippet to automate the monitor creation and status pinging:

def some_hourly_job():
    # Task logic
    ...

    # Ping the monitoring service once the task completes
    TaskMonitor(
        name="send_quota_depleted_email",
        expected_schedule=timedelta(hours=1),
        grace_period=timedelta(hours=2),
    ).ping()
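TaskMonitor is just my own thin wrapper. As a rough sketch of the pinging half (the monitor-creation part is omitted), assuming Healthchecks.io's ping-by-URL mechanism and the requests library; this class is illustrative, not my actual implementation:

import os
from dataclasses import dataclass
from datetime import timedelta

import requests

@dataclass
class TaskMonitor:
    name: str
    expected_schedule: timedelta
    grace_period: timedelta

    def ping(self):
        # Healthchecks.io exposes a unique ping URL per check;
        # here it's looked up from an environment variable.
        ping_url = os.environ.get(f"HEALTHCHECK_URL_{self.name.upper()}")
        if ping_url:
            requests.get(ping_url, timeout=5)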

App configuration

All my applications are configured via environment variables, old school but portable and well supported. For example, in my Django settings.py I'd set up a variable with a default value:

INVITE_ONLY = env.str("INVITE_ONLY", default=False)

And use it anywhere in my code like this:

from django.conf import settings

# If invite-only, then disable account creation endpoints
if settings.INVITE_ONLY:
    ...

I can override the environment variable in my Kubernetes configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: panelbear
  name: panelbear-webserver-config
data:
  INVITE_ONLY: "True"
  DEFAULT_FROM_EMAIL: "The Panelbear Team"
  SESSION_COOKIE_SECURE: "True"
  SECURE_HSTS_PRELOAD: "True"
  SECURE_SSL_REDIRECT: "True"

Keeping secrets

The way secrets are handled is quite interesting: I want to commit them to my infrastructure repo too, alongside other config files, but secrets must be encrypted.

For that I use kubeseal in Kubernetes. This component uses asymmetric crypto to encrypt my secrets, and only a cluster authorized to access the decryption keys can decrypt them.

For example, this is what you might find in my infrastructure repo:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: panelbear-secrets
  namespace: panelbear
spec:
  encryptedData:
    DATABASE_CONN_URL: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    SESSION_COOKIE_SECRET: oi7ySY1ZA9rO43cGDEq+ygByri4OJBlK...
    ...

The cluster will automatically decrypt the secrets and pass them to the corresponding container as an environment variable:

DATABASE_CONN_URL='postgres://user:pass@my-rds-db:5432/db'
SESSION_COOKIE_SECRET='this-is-supposed-to-be-very-secret'

To protect the secrets within the cluster, I use AWS-managed encryption keys via KMS, which are rotated regularly. This is a single setting when creating the Kubernetes cluster, and it's fully managed.

Operationally, what this means is that I write the secrets as environment variables in a Kubernetes manifest, then run a command to encrypt them before committing, and push my changes.

The secrets are deployed within a few seconds, and the cluster takes care of automatically decrypting them before running my containers.
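For illustration, the plaintext manifest before sealing could look like the one below, and kubeseal turns it into the SealedSecret shown earlier (assuming the standard kubeseal CLI; exact flags may differ by version):

# secrets.plain.yaml (never committed)
apiVersion: v1
kind: Secret
metadata:
  name: panelbear-secrets
  namespace: panelbear
stringData:
  DATABASE_CONN_URL: postgres://user:pass@my-rds-db:5432/db
  SESSION_COOKIE_SECRET: this-is-supposed-to-be-very-secret

# Encrypt it before committing:
#   kubeseal --format yaml < secrets.plain.yaml > secrets.encrypted.yaml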

Relational data: Postgres

For experiments I run a vanilla Postgres container within the cluster, and a Kubernetes cronjob that does daily backups to S3. This helps keep my costs down, and it's pretty simple when just starting out.
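A minimal sketch of what such a backup CronJob could look like (the image, bucket, namespace and schedule are illustrative, and the database/AWS credentials would come from a Secret):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: weekend-project
spec:
  schedule: "0 4 * * *"   # every day at 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # an image bundling pg_dump and the AWS CLI
              image: my-registry/pg-backup:latest
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_CONN_URL" | gzip | aws s3 cp - s3://my-backup-bucket/postgres/$(date +%F).sql.gz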

However, as a project grows, like Panelbear, I move the database out of the cluster into RDS, and let AWS take care of encrypted backups, security updates and all the other stuff that's no fun to get wrong.

For added security, the databases managed by AWS are still deployed within my private network, so they're unreachable via the public internet.

Columnar data: ClickHouse

I rely on ClickHouse for efficient storage and (soft) real-time queries over the analytics data in Panelbear. It's a fantastic columnar database, incredibly fast, and when you structure your data well you can achieve high compression ratios (less storage costs = higher margins).

I currently self-host a ClickHouse instance within my Kubernetes cluster. I use a StatefulSet with encrypted volume keys managed by AWS. I have a Kubernetes CronJob that periodically backs up all data in an efficient columnar format to S3. In case of disaster recovery, I have a couple of scripts to manually back up and restore the data from S3.

ClickHouse has been rock-solid so far, and it's an impressive piece of software. It's the only tool I wasn't already familiar with when I started my SaaS, but thanks to their docs I was able to get it up and running pretty quickly.

I think there's a lot of low hanging fruit in case I wanted to squeeze out even more performance (eg. optimizing the field types for better compression, pre-computing materialized tables and tuning the instance type), but it's good enough for now.

DNS-based service discovery

Besides Django, I also run containers for Redis, ClickHouse, NextJS, among other things. These containers have to talk to each other somehow, and that somehow is via the built-in service discovery in Kubernetes.

It's pretty simple: I define a Service resource for the container, and Kubernetes automatically manages DNS records within the cluster to route traffic to the corresponding service.

For example, given a Redis service exposed within the cluster:

apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: weekend-project
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - port: 6379
  selector:
    app: redis

I can access this Redis instance from anywhere in my cluster via the following URL:

redis://redis.weekend-project.svc.cluster:6379

Notice the service name and the project namespace are part of the URL. That makes it really easy for all your cluster services to talk to each other, regardless of where in the cluster they run.

For example, here's how I'd configure Django via environment variables to use my in-cluster Redis:

apiVersion: v1
kind: ConfigMap
metadata:
  name: panelbear-config
  namespace: panelbear
data:
  CACHE_URL: "redis://redis.panelbear.svc.cluster:6379/0"
  ENV: "production"
  ...

Kubernetes will automatically keep the DNS records in sync with healthy pods, even as containers get moved across nodes during autoscaling. How this works behind the scenes is pretty interesting, but out of the scope of this post. Here's a good explanation in case you find it interesting.

Version-controlled infrastructure

I want version-controlled, reproducible infrastructure that I can create and destroy with a few simple commands.

To achieve this, I use Docker, Terraform and Kubernetes manifests in a monorepo that contains all-things infrastructure, even across multiple projects. And for each application/project I use a separate git repo, but this code is not aware of the environment it will run on.

If you're familiar with The Twelve-Factor App, this separation may ring a bell or two. Essentially, my application has no knowledge of the exact infrastructure it will run on, and is configured via environment variables.

By describing my infrastructure in a git repo, I don't need to keep track of every little resource and configuration setting in some obscure UI. This enables me to restore my entire stack with a single command in case of disaster recovery.

Here's an example folder structure of what you might find in the infra monorepo:

# Cloud resources
terraform/
  aws/
    rds.tf
    ecr.tf
    eks.tf
    lambda.tf
    s3.tf
    roles.tf
    vpc.tf
  cloudflare/
    projects.tf

# Kubernetes manifests
manifests/
  cluster/
    ingress-nginx/
    external-dns/
    certmanager/
    monitoring/

  apps/
    panelbear/
      webserver.yaml
      celery-scheduler.yaml
      celery-workers.yaml
      secrets.encrypted.yaml
      ingress.yaml
      redis.yaml
      clickhouse.yaml
    another-saas/
    my-weekend-project/
    some-ghost-blog/

# Python scripts for disaster recovery, and CI
tasks/
  ...

# In case of a fire, some help for future me
README.md
DISASTER.md
TROUBLESHOOTING.md

Another benefit of this setup is that all the moving pieces are described in one place. I can configure and manage reusable components like centralized logging, application monitoring, and encrypted secrets, to name a few.

Terraform for cloud resources

I use Terraform to manage most of the underlying cloud resources. This helps me document and keep track of the resources and configuration that make up my infrastructure. In case of disaster recovery, I can spin up and roll back resources with a single command.

For example, this is one of my Terraform files for creating a private S3 bucket for encrypted backups which expire after 30 days:

resource "aws_s3_bucket" "panelbear_app" {
  bucket = "panelbear-app"
  acl    = "private"

  tags = {
    Name        = "panelbear-app"
    Environment = "production"
  }

  lifecycle_rule {
    id      = "backups"
    enabled = true
    prefix  = "backups/"

    expiration {
      days = 30
    }
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

Kubernetes manifests for app deployments

Similarly, all my Kubernetes manifests are described in YAML files in the infrastructure monorepo. I have split them into two directories: cluster and apps.

In the cluster directory I describe all cluster-wide services and configuration, things like the nginx-ingress, encrypted secrets, prometheus scrapers, and so on. Essentially the reusable bits.

On the other hand, the apps directory contains one namespace per project, describing what is needed to deploy it (ingress rules, deployments, secrets, volumes, and so on).

One of the cool things about Kubernetes is that you can customize almost everything about your stack. So for example, if I wanted to use encrypted SSD volumes that can be resized, I could define a new "StorageClass" in the cluster. Kubernetes, and in this case AWS, will coordinate and make the magic happen for me. For example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

I can now go ahead and attach this type of persistent storage to any of my deployments, and Kubernetes will manage the requested resources for me:

# Somewhere in the ClickHouse StatefulSet configuration
...
storageClassName: encrypted-ssd
resources:
  requests:
    storage: 250Gi
...

Subscriptions and Payments

I use Stripe Checkout to save myself all the work of handling payments, creating checkout screens, handling 3D secure requirements for credit cards, and even the customer billing portal.

I do not have access to the payment information itself, which is a huge relief and allows me to focus on my product instead of highly sensitive topics like credit card handling and fraud prevention.

Panelbear's Customer Billing Portal
An example Customer Billing Portal in Panelbear.

All I need to do is create a new customer session and redirect the customer to one of Stripe's hosted pages. I then listen for webhooks about whether the customer upgraded/downgraded/cancelled and update my database accordingly.

Of course there are a few important details, like validating that the webhook actually came from Stripe (you must validate the request signature with a secret), but Stripe's documentation covers all the details really well.
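As a rough sketch of both sides, assuming the official stripe Python library (the price ID, URLs, event type and endpoint secret are placeholders, and exact parameters vary with the API version):

import stripe

# Redirect the customer to a Stripe-hosted checkout page
session = stripe.checkout.Session.create(
    customer=customer.stripe_customer_id,
    mode="subscription",
    line_items=[{"price": "price_basic_monthly", "quantity": 1}],
    success_url="https://example.com/billing/success",
    cancel_url="https://example.com/billing/cancelled",
)
# ...then redirect the customer to session.url

# In the webhook view: verify the signature before trusting the payload
event = stripe.Webhook.construct_event(
    payload=request.body,
    sig_header=request.headers["Stripe-Signature"],
    secret="whsec_...",
)
if event["type"] == "customer.subscription.updated":
    ...  # update the billing info accordingly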

I only have a few plans, so it's pretty simple for me to manage them in my codebase. I actually have something like:

# Plan constants
FREE = Plan(
    code='free',
    display_name='Free Plan',
    features={'abc', 'xyz'},
    monthly_usage_limit=5e3,
    max_alerts=1,
    stripe_price_id='...',
)

BASIC = Plan(
    code='basic',
    display_name='Basic Plan',
    features={'abc', 'xyz'},
    monthly_usage_limit=50e3,
    max_alerts=5,
    stripe_price_id='...',
)


PREMIUM = Plan(
    code='premium',
    display_name='Premium Plan',
    features={'abc', 'xyz', 'special-feature'},
    monthly_usage_limit=250e3,
    max_alerts=25,
    stripe_price_id='...',
)

# Helpers for easy access
ALL_PLANS = [FREE, BASIC, PREMIUM]
PLANS_BY_CODE = {p.code: p for p in ALL_PLANS}

I can then use it in any API endpoint, cron job and admin task to determine which limits/features apply to a given customer. The current plan for a given customer is a column called plan_code on a BillingProfile model. I separate the user from the billing information since I'm planning to add organizations/teams at some point, and that way I can easily migrate the BillingProfile to the account owner / admin user.
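Looking up the plan from a customer's billing profile is then just a dict access; a trivial sketch, with a hypothetical helper name:

def get_plan_for(user) -> Plan:
    # plan_code lives on the BillingProfile, not on the user itself
    return PLANS_BY_CODE[user.billing_profile.plan_code]

plan = get_plan_for(request.user)
if usage_this_month > plan.monthly_usage_limit:
    ...  # send the "Capacity depleted" email and start the grace period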

Of course this model would not scale if you were offering thousands of individual products in an e-commerce store, but it works just fine for me since a SaaS usually only has a few plans.

Logging

I don't need to instrument my code with any logging agent or anything like that. I simply log to stdout and Kubernetes automatically collects, and rotates, the logs for me. I could also automatically ship these logs to something like Elasticsearch/Kibana using FluentBit, but I don't do that yet to keep things simple.

To inspect the logs I use stern, a small CLI tool for Kubernetes that makes it super easy to tail application logs across multiple pods.
For example, stern -n ingress-nginx would tail the access logs for my nginx pods, even across multiple nodes.

Monitoring and alerting

Initially I used a self-hosted Prometheus / Grafana to automatically monitor my cluster and application metrics. However, I didn't feel comfortable self-hosting my monitoring stack, because if something went wrong in the cluster, my alerting system would go down with it too (not great).

If there's one thing that should never go down, it's your monitoring system, otherwise you're essentially flying without instruments. That's why I swapped my monitoring / alerting system for a hosted service (New Relic).

All my services have a Prometheus integration that automatically records and forwards the metrics to a compatible backend, such as Datadog, New Relic, Grafana Cloud or a self-hosted Prometheus instance (what I used to do). To migrate to New Relic, all I had to do was use their Prometheus Docker image, and shut down the self-hosted monitoring stack.

Panelbear New Relic Dashboard
Example New Relic dashboard with a summary of the most important stats.

Panelbear New Relic Uptime Monitoring
I also monitor uptime across the world using New Relic's probes.

The migration from a self-hosted Grafana/Loki/Prometheus stack to New Relic reduced my operational surface. More importantly, I would still get alerted even if my AWS region went down.

You might be wondering how I expose metrics from my Django app. I leverage the excellent django-prometheus library, and simply register a new counter/gauge in my application:

from prometheus_client import Counter

EVENTS_WRITTEN = Counter(
    "events_total",
    "Total number of events written to the eventstore"
)

# We can increment the counter to record the number of events
# being written to the eventstore (ClickHouse)
EVENTS_WRITTEN.inc(count)

That will expose this and other metrics on the /metrics endpoint of my server (only reachable within my cluster). Prometheus will automatically scrape this endpoint every minute and forward the metrics to New Relic.
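For completeness, exposing that endpoint with django-prometheus is mostly a matter of wiring it into the settings and URLs; a minimal sketch:

# settings.py
INSTALLED_APPS = [
    # ...
    "django_prometheus",
]
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... the rest of the middleware ...
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py - exposes /metrics
from django.urls import include, path

urlpatterns = [
    # ...
    path("", include("django_prometheus.urls")),
]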

Prometheus metrics
The metric automatically shows up in New Relic thanks to the Prometheus integration.

Error tracking

Everybody thinks they don't have errors in their application, until they enable error tracking. It's too easy for an exception to get lost in the logs, or worse, you're aware of it but unable to reproduce the issue due to lack of context.

I use Sentry to aggregate and notify me about errors across my applications. Instrumenting my Django apps is pretty simple:

SENTRY_DSN = env.str("SENTRY_DSN", default=None)

# Init Sentry if configured
if SENTRY_DSN:
    sentry_sdk.init(
        dsn=SENTRY_DSN,
        integrations=[DjangoIntegration(), RedisIntegration(), CeleryIntegration()],
        # Do not send user PII data to Sentry
        # See also inbound rules for specific patterns
        send_default_pii=False,
        # Only sample a small amount of performance traces
        traces_sample_rate=env.float("SENTRY_TRACES_SAMPLE_RATE", default=0.008),
    )

It's been very helpful because it automatically collects a bunch of contextual information about what was going on when the exception occurred:

Panelbear Sentry error tracking
Sentry aggregates and notifies me in case of exceptions.

I use a Slack #alerts channel to centralize all my notifications: downtime, cron job failures, security alerts, performance regressions, application exceptions, and whatnot. It's great because I can often correlate issues when multiple services ping me around the same time, on seemingly unrelated problems.

Panelbear Slack alerts channel
Example Slack alert due to a CDN endpoint being down in Sydney, Australia.

Profiling and other goodies

When I need to dig deeper, I also use tools like cProfile and snakeviz to better understand allocations, number of calls and other stats about my app's performance. Sounds fancy, but they're pretty easy tools to use, and they have helped me identify various issues in the past that made my dashboards slow from seemingly unrelated code.
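Usage is essentially two commands: record a profile with cProfile, then explore it in the browser with snakeviz (the script name here is just an example):

python -m cProfile -o dashboard.prof generate_dashboard.py
snakeviz dashboard.prof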

cProfile and snakeviz
cProfile and snakeviz are great tools to profile your Python code locally.

I also use the Django debug toolbar on my local machine to easily inspect the queries a view triggers, preview outgoing emails during development, and many other goodies.

Django Debug Toolbar
Django's Debug Toolbar is great for inspecting stuff in local dev, and previewing transactional emails.

That's all folks

I hope you enjoyed this post if you've made it this far. It ended up being a lot longer than I originally intended, as there was a lot of ground to cover.

If you're not already familiar with these tools, consider using a managed platform first, for example Render or DigitalOcean's App Platform (not affiliated, just heard great things about both). They will let you focus on your product, and still get many of the benefits I talk about here.

"Pause you utilize Kubernetes for all of the items?" - No, completely totally different tasks, completely totally different wants. Shall we embrace this weblog is hosted on Vercel.

Interestingly, I spent more time writing this post than actually building everything I described. At more than 6k words, and several weeks of on-and-off work, it's pretty clear that I'm a slow writer.

That said, I do intend to write more follow-up posts on specific tips and tricks.
