
Monitoring

Introduction

This document outlines the monitoring strategy for the Poq Platform. Our approach integrates various tools and services to ensure comprehensive monitoring of backend services, application performance, and API functionality.

Monitoring Goals

Performance Optimization: Continuously improve the performance and responsiveness of the app and backend services.
User Experience: Ensure a high-quality, consistent user experience.
Data-Driven Decisions: Use gathered data to inform development and operational decisions.

Reporting and Communication

Alerts: Configured to notify relevant teams via Slack for immediate action.

Continuous Monitoring

We monitor the following categories of issues in real time:

Availability Alerts: These are raised with P1 priority. We try to solve them immediately.
HTTP 5xx Alerts: These are raised when an API's 5xx rate exceeds a dynamic, machine-learning-based threshold. We investigate these immediately if they are not resolved within minutes.
HTTP 4xx Alerts: These alerts usually point to issues in external systems. We monitor and resolve these on the same basis as 5xx alerts.
Abnormal Crash Rates: These alerts are raised when more than 1% of an app's active sessions have a crash. We look at these immediately if the number of crashes is significant. For example, we would not prioritise an alert that is impacting 2 users, but we would immediately look into it if the same crash is impacting more than 5 users.
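
The crash-rate rule above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the actual alerting configuration: the names `CrashStats`, `alert_is_raised`, and `triage` are hypothetical, and the "monitor" outcome for alerts affecting 5 or fewer users is an assumption, since the text only gives the 2-user and more-than-5-user cases.

```python
from dataclasses import dataclass

# Thresholds taken from the text: alert above 1% of active sessions,
# immediate investigation when more than 5 users are affected.
CRASH_SESSION_PERCENT_THRESHOLD = 1
IMMEDIATE_USER_THRESHOLD = 5


@dataclass(frozen=True)
class CrashStats:
    """Hypothetical container for one app's crash figures."""
    active_sessions: int
    crashed_sessions: int
    affected_users: int


def alert_is_raised(stats: CrashStats) -> bool:
    """True when more than 1% of active sessions had a crash."""
    if stats.active_sessions <= 0:
        return False
    # Integer comparison avoids floating-point rounding at exactly 1%.
    return stats.crashed_sessions * 100 > (
        stats.active_sessions * CRASH_SESSION_PERCENT_THRESHOLD
    )


def triage(stats: CrashStats) -> str:
    """Classify an app's crash figures according to the rule above."""
    if not alert_is_raised(stats):
        return "no-alert"
    if stats.affected_users > IMMEDIATE_USER_THRESHOLD:
        return "investigate-now"
    # Assumption: anything at or below the user threshold is only monitored.
    return "monitor"
```

For instance, `triage(CrashStats(active_sessions=100, crashed_sessions=2, affected_users=2))` raises an alert but is not escalated, which matches the 2-user example in the text.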

Monthly Reporting

As part of their KPIs, each Squad Lead internally reports the performance of the apps for their squad's clients. This report includes:

API Error and App Crash Rates

We aim to keep both the API response success rate and the crash-free user rate at 99.5% or above. We report on these rates for each app monthly and internally raise bugs as we identify them. If an API's success rate or an app's crash-free rate drops below 99.5%, we prioritise the fixes.
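
As a rough illustration of the 99.5% target, the check could look like the following Python sketch. The function names are hypothetical, and what counts as a "successful" API response or a "crash-free" user is assumed to be decided upstream by whoever supplies the counts.

```python
TARGET_RATE_PERCENT = 99.5  # target stated in the text


def rate_percent(good: int, total: int) -> float:
    """Share of good outcomes as a percentage, e.g. successful API
    responses out of all responses, or crash-free users out of all users."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * good / total


def needs_prioritised_fix(good: int, total: int,
                          target: float = TARGET_RATE_PERCENT) -> bool:
    """True when the monthly rate has dropped below the target."""
    return rate_percent(good, total) < target
```

With these definitions, 995 successful responses out of 1,000 sits exactly on the target and is not flagged, while 994 out of 1,000 is.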

Negative App Reviews

We read every negative review to understand areas of improvement for each app. If negative reviews point to common defects in the apps, we try to identify the root cause and fix it. Where we identify patterns of non-technical issues, we report these to the client and work with them to remedy them.

Monitoring Tools and Services

Azure Monitor for Backend Services

Purpose: To track the performance, health, and availability of backend services.
Features: Utilizes Azure Monitor to collect, analyze, and act on telemetry data from Azure.
Implementation: Set up alerts for anomalies, track performance metrics, and analyze logs for troubleshooting.

Crashlytics for App Crash Rate Monitoring

Purpose: To monitor and manage application crashes effectively.
Features: Integrates with the development workflow to provide real-time crash reporting, issue tracking, and detailed crash analytics.
Implementation: Configured to capture and report unhandled exceptions and crashes, helping in identifying and resolving issues promptly.

Firebase Performance for App Responsiveness

Purpose: To monitor the app's performance and responsiveness.
Features: Firebase Performance offers insights into app speed and user experience.
Implementation: Monitors key metrics like startup time, network performance, and interaction latency to ensure a smooth user experience.

Postman for Production API Monitoring

Purpose: To ensure continuous functionality of Production APIs.
Features: Utilizes Postman for automated, scheduled testing of API endpoints.
Implementation: Regularly scheduled tests run against production APIs to validate API response data. Alerts are configured for any failures or performance degradation.
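
The kind of check such a scheduled test performs can be sketched as follows. Postman tests themselves are written in JavaScript inside Postman, so this Python version is purely illustrative: the function `check_response`, its parameters, and the 2,000 ms latency limit are all assumptions rather than values from the actual monitors.

```python
from typing import Any, List, Mapping, Sequence


def check_response(status_code: int,
                   body: Mapping[str, Any],
                   elapsed_ms: float,
                   required_fields: Sequence[str],
                   max_elapsed_ms: float = 2000.0) -> List[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures: List[str] = []

    # The endpoint should answer with a 2xx status.
    if not 200 <= status_code < 300:
        failures.append(f"unexpected status {status_code}")

    # Validate the response data: every expected field must be present.
    for field in required_fields:
        if field not in body:
            failures.append(f"missing field '{field}'")

    # Flag performance degradation (the limit is a hypothetical value).
    if elapsed_ms > max_elapsed_ms:
        failures.append(
            f"slow response: {elapsed_ms:.0f} ms > {max_elapsed_ms:.0f} ms"
        )

    return failures
```

A non-empty result from a check like this is what would trigger an alert for a failure or for performance degradation.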