Site Reliability Engineering (SRE) for Self-Contained SaaS Solutions

Sharesquare.co engineering blog by R. Vincelli

Dec 5, 2024

In today’s fast-paced digital world, ensuring software is both reliable and scalable is more crucial than ever. Enter Site Reliability Engineering (SRE) — a groundbreaking approach that blends the art of software engineering with the science of IT operations. The goal? To build and maintain systems that are not only robust but also capable of handling the ever-increasing demands of users.

The origins of SRE trace back to 2003 when Ben Treynor, a visionary at Google, coined the term. He envisioned a role that would bridge the gap between development and operations, focusing on keeping systems running smoothly while continuously improving them. Fast forward to today, and SRE practices have become a cornerstone for organizations of all sizes, from startups to tech giants. These practices help ensure that software solutions are not only stable and performant but also ready to scale as user demands grow.

But why is SRE such a big deal? Think of it as having a dedicated team that ensures your software is always at its best — like having a personal trainer for your tech stack. Whether it’s about preventing outages, optimizing performance, or handling unexpected traffic spikes, SRE professionals are the unsung heroes working behind the scenes to keep everything running like a well-oiled machine.

So, if you’ve ever wondered what keeps your favorite apps and services so reliable, SRE might just be the magic ingredient behind the curtain. Let’s dive deeper into how this discipline works and why it’s such a game-changer for self-contained SaaS solutions.

The key to SRE is to assume failure will happen and to focus relentlessly on resilience and recovery, not just preventing failures. Image credits: Umer Waqas.

1. Continuous Integration/Continuous Deployment (CI/CD)

CI/CD is like the secret sauce in our Site Reliability Engineering toolkit. It’s all about making sure we can update our software frequently, reliably, and without throwing a wrench into the works. By automating the process of building, testing, and deploying code, CI/CD pipelines help us deliver new features and fixes smoothly, while reducing the risk of human error.

CI/CD in SRE is the backbone of agility and reliability, enabling rapid innovation while ensuring stability through automated testing, deployment, and monitoring. Image credits: Sharesquare.

At Sharesquare, our CI/CD pipeline is designed to spring into action with every code commit. Here’s how it works:

Automated Triggers: As soon as code is committed, our pipeline kicks off automatically.
Build and Test: The pipeline builds the application, runs a suite of unit and integration tests, and ensures everything is in tip-top shape.
Packaging and Deployment: Only code that passes all tests makes it to production, keeping our service stable and reliable.
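
The gating behaviour described in these steps can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual pipeline: the stage names and the `run_pipeline` helper are hypothetical, and each stage is reduced to a function that reports success or failure.

```python
from typing import Callable, List, Tuple

# A stage is a (name, action) pair; the action returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that were executed).
    """
    executed: List[str] = []
    for name, step in stages:
        executed.append(name)
        if not step():
            # A failed stage blocks everything after it, including deployment.
            return False, executed
    return True, executed


if __name__ == "__main__":
    # Hypothetical stages; real ones would call the build and test tools.
    stages = [
        ("build", lambda: True),
        ("unit tests", lambda: True),
        ("integration tests", lambda: False),
        ("deploy", lambda: True),
    ]
    ok, executed = run_pipeline(stages)
    print(ok, executed)
```

Because the deploy stage sits last, a failing test stage means it never runs, which is exactly the property that keeps broken code out of production.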

Deployments are managed through GitHub Actions, which roll out our application to both staging and production environments on Azure App Service. The trigger section of the workflow looks like this:

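# Workflow name shown in the GitHub Actions UI (placeholder)
name: Your Project Name
on:
  # Run automatically on every push to main
  push:
    branches:
      - main
  # Allow manual runs, optionally against another branch or commit
  workflow_dispatch:
    inputs:
      target_branch:
        type: string
        description: Provide branch or commit hash
        required: false
        default: main
# The jobs section (build, test, deploy) is not shown in this excerpt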

Key Point: Even in a smaller SaaS environment, automated CI/CD processes are essential. Manual deployments introduce risks and inefficiencies that can be easily avoided by leveraging automation tools and practices.

2. Capacity Planning and Autoscaling

Capacity Planning and Autoscaling are like the dynamic duo of SRE, working together to keep our systems scalable and performing at their best.

2.1 Capacity Planning is all about predicting our resource needs based on anticipated traffic and usage patterns. It helps us provision resources efficiently, so we’re ready for whatever comes our way.
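
As a rough illustration of the arithmetic behind capacity planning, the sketch below estimates how many instances a forecast peak would need. All numbers and the `instances_needed` helper are assumptions made for the example; real per-instance throughput has to come from load tests.

```python
def instances_needed(peak_rps: int, rps_per_instance: int,
                     target_utilization_pct: int) -> int:
    """Smallest instance count that keeps each instance at or below
    the target utilization during the forecast peak."""
    if peak_rps < 0 or rps_per_instance <= 0:
        raise ValueError("request rates must be non-negative / positive")
    if not 0 < target_utilization_pct <= 100:
        raise ValueError("target utilization must be in (0, 100]")
    # Usable capacity per instance, scaled by 100 to stay in integers.
    usable_rps_x100 = rps_per_instance * target_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-(peak_rps * 100) // usable_rps_x100)
    # Always keep at least one instance running.
    return max(needed, 1)


if __name__ == "__main__":
    # Hypothetical forecast: 120 requests/s at peak, 50 requests/s per
    # instance, and a wish to stay at or below 60% utilization.
    print(instances_needed(120, 50, 60))
```

Planning against a target utilization below 100% leaves headroom for spikes that the forecast did not anticipate.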

2.2 Autoscaling takes this a step further by automatically adjusting our system resources — like CPU, memory, and network capacity — in real-time. This means we can handle sudden spikes in traffic without over-provisioning and wasting resources, keeping our applications responsive and cost-efficient.

Autoscaling ensures your application adapts seamlessly to demand, scaling up when needed and saving costs when traffic slows. Image credits: Thanakorn Lappattaranan.

At Sharesquare, we’re currently not using autoscaling because our traffic levels are manageable, and we’re mindful of the associated costs. However, Azure App Service offers horizontal autoscaling (scale-out), where new instances are automatically created if traffic surpasses a certain threshold. You can set minimum and maximum instance limits to manage this. Vertical autoscaling (scale-up), which involves upgrading to more powerful hardware, is not automated and must be done manually. Typically, scale-out is more efficient and cost-effective, which aligns with our approach and most computing scenarios.
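
To make the scale-out idea concrete, here is a minimal sketch of the decision a threshold-based rule takes, including the minimum and maximum instance limits. It is not how Azure implements autoscaling; the CPU thresholds and the `desired_instance_count` function are hypothetical, and in practice such rules are configured on the platform rather than coded by hand.

```python
def desired_instance_count(current: int, cpu_percent: float, *,
                           min_instances: int, max_instances: int,
                           scale_out_above: float = 70.0,
                           scale_in_below: float = 30.0) -> int:
    """Return the instance count after applying one scaling decision."""
    if min_instances < 1 or max_instances < min_instances:
        raise ValueError("need 1 <= min_instances <= max_instances")
    if cpu_percent > scale_out_above:
        target = current + 1      # load is high: add an instance
    elif cpu_percent < scale_in_below:
        target = current - 1      # load is low: remove an instance
    else:
        target = current          # inside the comfortable band: no change
    # Clamp to the configured limits so the rule can never run away.
    return max(min_instances, min(max_instances, target))


if __name__ == "__main__":
    print(desired_instance_count(2, 85.0, min_instances=1, max_instances=3))
```

The gap between the two thresholds matters: if they were equal, the instance count would flap up and down around that single value.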

Tips: For those diving into autoscaling, Azure App Service Autoscale is great for horizontal scaling, while Azure Monitor helps track traffic and set scaling triggers to ensure smooth performance under varying loads.

Key Point: By combining effective capacity planning with autoscaling capabilities, you can maintain top-notch performance during peak times without wasting resources when things are quieter. This balance helps keep our services running efficiently and cost-effectively.

3. Observability

Observability is the ability to infer the internal state of a system from the data it generates, including logs, metrics, and traces. Regardless of the scale of your application, observability is essential for detecting and addressing issues early, ideally before they affect users.

Without observability, you’re flying blind: insights are the compass to reliability. Image credits: Microsoft Azure.

At Sharesquare, our current approach to monitoring is more passive, without an alerting system in place. Ideally, we’d like both infrastructure-level issues (like problems with Azure resources) and application-level issues (such as errors in Laravel logs) to trigger notifications via Azure Alerts. For example, if there’s a spike in 500 errors from our web container, both the infrastructure and application teams would be notified. Conversely, an ERROR log from the web app would only alert the application team. Moving forward, implementing alerts for application failures is a key step to help us troubleshoot and resolve issues more proactively.
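
The routing we have in mind can be sketched as a small function. This is a sketch under stated assumptions, not an Azure Alerts configuration: the `Signal` type, the team labels, and the spike threshold of 10 errors per minute are all made up for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet

INFRA = "infrastructure"
APP = "application"


@dataclass(frozen=True)
class Signal:
    kind: str            # "http_5xx_rate" or "app_log"
    value: float = 0.0   # 5xx responses per minute (http_5xx_rate only)
    level: str = ""      # log level (app_log only)


def teams_to_notify(signal: Signal,
                    spike_threshold: float = 10.0) -> FrozenSet[str]:
    """Decide which teams should be alerted for a given signal."""
    if signal.kind == "http_5xx_rate" and signal.value >= spike_threshold:
        # A spike in 500s may be caused by either layer: notify both teams.
        return frozenset({INFRA, APP})
    if signal.kind == "app_log" and signal.level.upper() == "ERROR":
        # An ERROR log line concerns the application team only.
        return frozenset({APP})
    return frozenset()


if __name__ == "__main__":
    print(sorted(teams_to_notify(Signal(kind="http_5xx_rate", value=42.0))))
    print(sorted(teams_to_notify(Signal(kind="app_log", level="ERROR"))))
```

Keeping the routing rules in one place like this makes it easy to review who gets paged for what before wiring anything into a real alerting tool.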

Tips: Tools like Azure Monitor, Log Analytics, and Application Insights can be used to collect and analyze data, set up alerts for critical metrics, and visualize logs to detect anomalies. In addition to Azure-native tools, solutions like Sentry and New Relic can also be integrated to enhance error tracking and performance monitoring, offering more specialized features for detailed observability.

Key Point: Observability is a continuous process that evolves with your application. As your SaaS solution grows, enhancing your observability practices will ensure that you can maintain high levels of reliability and performance.

4. Application Backups

Backing up your application data is like having a safety net for your digital tightrope walk. It’s a crucial part of Site Reliability Engineering (SRE) that ensures your data is safe and recoverable in case things go sideways. Robust backup solutions should cover all bases, from your codebase and databases to configurations and storage.

Application backups are your insurance policy against data loss, ensuring recovery and resilience in the face of unforeseen events. Image credits: Icon ade.

At Sharesquare, we use the native Backup feature in Azure App Service, which provides automated, periodic backups that capture the entire instance image. This is great for quick restorations, assuming everything goes according to plan. However, we’ve had our share of hiccups where restoring to a new deployment slot or in place has failed, often without giving us a clear error message. And while these backups preserve the deployed code (which is in any case safely stored on GitHub), they don’t cover the database.

For database backups, we use a semi-automated process. We generate a MySQL dump, encrypt it, and store it in a file share attached to the production instance. This dump is then downloaded and kept in a secure, cold storage location. Documents are backed up in the same fashion. Although Azure MySQL offers automatic backups, restoring them can be a bit of a hassle — requiring us to create a new database instance rather than restoring in place.
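
The dump-and-encrypt step could look roughly like the sketch below, which only builds the two command lines so they can be inspected before anything runs. The choice of `gpg` for encryption, the file naming scheme, and every path and name in the example are assumptions made for illustration; our actual tooling may differ. The database password is expected to come from a MySQL option file rather than the command line.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def build_backup_commands(database: str, host: str, user: str,
                          out_dir: str, passphrase_file: str,
                          now: datetime) -> Tuple[List[str], List[str]]:
    """Return (dump command, encrypt command) for one backup run."""
    # Timestamp the file in UTC so backups sort chronologically.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"{out_dir}/{database}-{stamp}.sql"
    dump_cmd = [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        f"--result-file={dump_path}",
        database,
    ]
    encrypt_cmd = [
        "gpg", "--batch", "--pinentry-mode", "loopback",
        "--passphrase-file", passphrase_file,
        "--symmetric", "--cipher-algo", "AES256",
        "--output", f"{dump_path}.gpg",
        dump_path,
    ]
    return dump_cmd, encrypt_cmd


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    dump, encrypt = build_backup_commands(
        "appdb", "db.example.internal", "backup", "/mnt/share/backups",
        "/etc/backup/passphrase", datetime.now(timezone.utc))
    print(" ".join(dump))
    print(" ".join(encrypt))
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, and the plain-text dump should be deleted once the encrypted copy has been written.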

Tips: To streamline backups and recovery, services like Azure Backup and Site Recovery offer point-in-time restoration and disaster recovery capabilities. They help you quickly get your application back to a specific state if needed, turning potential data disasters into mere blips on the radar.

Key Point: Ensuring the safety and recoverability of application data is a fundamental part of SRE. By leveraging these backup and recovery tools, you can make sure your application is resilient against data loss and ready to bounce back from unexpected events. After all, it’s better to have a backup plan than to be caught without one!

From the trenches: with the managed MySQL service we use in Azure, restoring a past backup of an instance means creating a new instance in the very same region. If there’s no availability left for the MySQL service in that region, you are stuck and you won’t be able to restore your database backup. Not so well done, Azure: soft limits like this shouldn’t get in the way of critical activities such as restoring a production backup!

5. Failovers and Redundancy

Eliminating single points of failure and ensuring redundancy is crucial for building a reliable SaaS solution. Implementing geographic failover and redundancy strategies can help maintain service availability even during outages.

Failovers and redundancy are the safety nets of resilient systems, ensuring that when one part fails, the whole continues to operate seamlessly without disruption. Image credits: Microsoft.

At Sharesquare, we don’t have a dedicated failover strategy just yet. However, our Azure App Service runs on multiple instances, which provides a basic level of resilience. If one instance decides to take a nap, the other two can still keep serving requests. This setup is great for zero-downtime deployments but isn’t quite the failover magic we’d ideally want.

For data redundancy, we use the geo-redundant options of storage accounts and the MySQL database service, which replicate our data across different locations and regions. We haven’t rolled this out across the board yet, though, because of cost and our current needs. In an ideal world, we’d have geographic failover and redundancy to keep our services humming even if one region decides to take a holiday.

Tips: Azure offers a smorgasbord of tools to boost availability and redundancy. Azure Traffic Manager helps route traffic across multiple regions or endpoints to keep things running smoothly. Azure Front Door provides global load balancing and failover, making sure your app performs well no matter where users are. For greater resilience within a region, Azure Availability Zones spread resources across separate physical zones, while Azure Availability Sets offer redundancy for virtual machines within a data center, reducing the risk of a single point of failure.
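
The core idea behind priority-based failover fits in a few lines: send traffic to the most preferred endpoint that is still healthy. The sketch below illustrates that idea only. It is not how Traffic Manager or Front Door are implemented, and the `Endpoint` type and region names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int    # lower number means more preferred
    healthy: bool    # result of the latest health probe


def pick_endpoint(endpoints: Iterable[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the best priority, if any."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # total outage: nothing left to route to
    return min(healthy, key=lambda e: e.priority)


if __name__ == "__main__":
    # Hypothetical regions: the primary is down, so the secondary is chosen.
    regions = [
        Endpoint("primary-region", priority=1, healthy=False),
        Endpoint("secondary-region", priority=2, healthy=True),
    ]
    chosen = pick_endpoint(regions)
    print(chosen.name if chosen else "no healthy endpoint")
```

The quality of the health probe decides how well this works: a probe that only checks that the web server answers will happily route users to a region whose database is down.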

Key Point: Leveraging Azure’s traffic management, load balancing, and availability features can significantly enhance our system’s resilience. By putting these tools to work, we can build a more robust system that keeps delivering, even when things don’t go according to plan.

6. Security

When it comes to SaaS solutions, security isn’t just an afterthought — it’s a top priority. A robust security strategy is essential to protect your data, control access, and stay compliant with industry standards and regulations. Think of it as putting a high-tech security system around your digital fortress.

Strong IAM roles are the foundation of security, ensuring that only the right people have access to the right resources at the right time. Image credits: witsanu singkaew.

At Sharesquare, we take security seriously. Here’s how we keep our systems safe:

Identity and Access Management (IAM): Our IAM policy is like a strict doorman for our resources. It ensures that only authorized users can access sensitive areas, reducing the risk of unauthorized access or data breaches.
Secret Management: We rely on Azure Key Vault to securely manage our secrets, such as API keys and passwords. Each environment slot (development 1, 2, … N, staging, production green, blue, etc.) has its own vault to enhance isolation and security.
Vulnerability Monitoring: We use Snyk to keep an eye on library-level vulnerabilities. This helps us ensure that our dependencies are secure and up-to-date, so we’re not left exposed to known issues.
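
The one-vault-per-slot idea can be sketched as a mapping from slot name to vault URL. The `kv-myapp` prefix and the naming convention are hypothetical, not our real vault names; the name check mirrors the Key Vault rules (3 to 24 characters, letters, digits and hyphens). The `read_secret` helper assumes the `azure-identity` and `azure-keyvault-secrets` packages are installed.

```python
import re
from typing import Optional

# Key Vault names: 3-24 chars, start with a letter, end with a letter or
# digit, only letters, digits and hyphens (no consecutive hyphens).
_VAULT_NAME = re.compile(r"^[a-z][a-z0-9-]{1,22}[a-z0-9]$")


def vault_url_for(slot: str, prefix: str = "kv-myapp") -> str:
    """Map an environment slot to the URL of its dedicated vault."""
    name = f"{prefix}-{slot}".lower()
    if not _VAULT_NAME.match(name) or "--" in name:
        raise ValueError(f"invalid Key Vault name: {name!r}")
    return f"https://{name}.vault.azure.net"


def read_secret(slot: str, secret_name: str) -> Optional[str]:
    """Fetch one secret from the vault that belongs to the given slot."""
    # Imported lazily so vault_url_for works without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url_for(slot),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value


if __name__ == "__main__":
    for slot in ("dev-1", "staging", "production-blue"):
        print(slot, "->", vault_url_for(slot))
```

Because each slot resolves to a different vault, a credential that can read staging secrets gives no access to production ones.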

Tips: Microsoft Defender for Cloud (formerly Azure Security Center) offers unified security management and advanced threat protection across cloud resources, continuously assessing the environment, identifying vulnerabilities, and recommending best practices. Azure Key Vault securely handles secrets, encryption keys, and certificates, with strict access controls in place to safeguard sensitive information.

Key Point: Regular security assessments, vigilant patch management, and adherence to best practices are critical for maintaining a secure and resilient environment. By staying proactive and leveraging these tools, we ensure that our SaaS solutions remain safe and trustworthy.

Conclusion

Building and maintaining a reliable SaaS solution involves more than just writing code: it requires a holistic approach that includes Site Reliability Engineering (SRE) practices. At Sharesquare, we work across the key SRE components covered here: Continuous Integration/Continuous Deployment (CI/CD), Capacity Planning and Autoscaling, Observability, Application Backups, Failovers and Redundancy, and Security. Some are in place today, such as automated deployments, backups, and secret management; others, such as autoscaling, alerting, and geographic failover, are still ahead of us.

By automating our CI/CD pipelines, planning for capacity needs, monitoring our systems proactively, managing backups effectively, and ensuring failover and redundancy, we aim to deliver high-quality, dependable software. Our commitment to security through comprehensive IAM policies and secret management further fortifies our solutions against threats.

Through continuous improvement and leveraging the right tools, we strive to provide a seamless experience for our users, maintaining reliability and performance even in the face of challenges. In the ever-evolving digital landscape, staying proactive and adaptable is key to success.

Blog by Riccardo Vincelli and Umer Waqas brought to you by the engineering team at Sharesquare.