Managed Prometheus
Assistance-operated metrics, alerting, dashboards, and reliability signals for production systems
Managed Prometheus is for teams that need dependable metrics and alerting but do not want Prometheus itself to become another production platform nobody owns. Assistance operates the observability stack while your team owns service meaning, response decisions, and product reliability priorities.
Best-fit use cases
| Use case | What Managed Prometheus covers |
|---|---|
| Infrastructure monitoring | Server, container, Kubernetes, network, storage, and platform metrics |
| Application health | Request rate, error rate, latency, saturation, queue depth, and custom metrics |
| Alerting cleanup | Replace noisy pages with actionable alerts tied to ownership and runbooks |
| SLO visibility | Build service-level indicators, error budget views, and reliability review dashboards |
| Managed service visibility | Monitor databases, Redis, Kafka, OpenSearch, registries, and platform dependencies |
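To make the SLO visibility row concrete, here is a minimal sketch of the arithmetic behind an error budget view for a request-based SLO. The function name, the 99.9% target, and the request counts are illustrative assumptions, not values from any Assistance plan; real objectives are chosen by the service owner.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based SLO over one window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # Failures the objective tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of that allowance already spent.
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


# Hypothetical 30-day window: one million requests, 250 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

With these hypothetical numbers, a 99.9% objective tolerates about 1,000 failed requests, so 250 failures consume roughly a quarter of the budget. In practice the request and failure counts would come from Prometheus queries over the SLO window rather than hard-coded values.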
What Assistance operates
| Area | Included managed service responsibility |
|---|---|
| Provisioning | Prometheus deployment, scrape topology, storage sizing, network placement, and secure defaults |
| Collection | Scrape configuration, service discovery patterns, exporter onboarding guidance, and target health monitoring |
| Alerting | Alertmanager setup, routing, severity labels, silences, inhibition rules, and integration with paging/chat tools |
| Dashboards | Grafana data source integration, base dashboards, and service health views where scoped |
| Retention | Local retention and long-term storage options such as Thanos/Cortex/Mimir-style patterns when required |
| Maintenance | Version lifecycle guidance, patching, configuration changes, maintenance windows, and rollback planning |
| Support | Platform incident response and escalation for covered observability services |
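As a rough illustration of how storage sizing and retention interact, the Prometheus documentation suggests estimating local disk needs as retention time in seconds × ingested samples per second × bytes per sample, with samples averaging roughly 1 to 2 bytes each. The sketch below applies that rule of thumb. The series count, scrape interval, and 15-day retention are hypothetical inputs, and the result is a starting estimate, not a sizing guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_second(active_series: int, scrape_interval_seconds: float) -> float:
    """Approximate ingestion rate: each active series yields one sample per scrape."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape_interval_seconds must be positive")
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(
    retention_days: float,
    ingest_rate: float,
    bytes_per_sample: float = 2.0,  # conservative end of the documented 1-2 byte range
) -> float:
    """Rule-of-thumb local storage estimate for a given retention period."""
    if retention_days <= 0 or ingest_rate < 0:
        raise ValueError("retention_days must be positive and ingest_rate non-negative")
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample


# Hypothetical environment: 500k active series scraped every 30 seconds, kept 15 days.
rate = samples_per_second(active_series=500_000, scrape_interval_seconds=30)
estimate = estimate_disk_bytes(retention_days=15, ingest_rate=rate)
print(f"{rate:.0f} samples/s, about {estimate / 1e9:.1f} GB before headroom")
```

An estimate like this is only an input to the design conversation. Series churn, label cardinality, and query load can move real usage well away from the formula, which is why sizing is reviewed during capacity checks rather than fixed once.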
Metrics are shared evidence, not automatic reliability
Assistance operates Prometheus and alerting infrastructure. Your team owns service intent, SLO decisions, business impact definitions, and whether an alert requires product or application remediation. We help turn signals into an operating model, but service ownership must be explicit.
Ownership boundary
| Responsibility | Assistance owns | Customer owns |
|---|---|---|
| Prometheus runtime | Deployment, scraping platform, retention, upgrades, monitoring, and platform incidents | Instrumenting application code and exposing meaningful metrics |
| Alert routing | Alertmanager configuration, integrations, routing mechanics, and noise-reduction implementation | Service owners, severity policy, escalation decisions, and response behavior |
| Dashboards | Platform dashboards and agreed service views | Business meaning, product KPIs, and interpretation of application-specific metrics |
| SLOs | Technical implementation of SLIs/SLO dashboards where scoped | Choosing user-facing objectives and accepting error-budget trade-offs |
| Access | Roles, data source permissions, and credential rotation support | User approval, identity source, and internal access reviews |
Deployment options
| Option | When to use it |
|---|---|
| Assistance physical servers | Development platform monitoring, staging observability, and internal services |
| Customer cloud account | Production observability inside existing cloud/network/compliance boundary |
| Hybrid observability | Central managed Prometheus with remote write or federation across environments |
| SRE engagement | Combine Managed Prometheus with service ownership, incident response, SLO, and runbook work |
Reliability and support model
| Topic | Managed Prometheus approach |
|---|---|
| Availability | Scoped by topology, retention design, and support plan; HA pairs or long-term storage used where required |
| Data retention | Retention and downsampling defined by operational and compliance needs |
| Alert delivery | Integrations configured for agreed channels; escalation ownership must be defined by customer/team |
| Platform monitoring | Prometheus monitors itself: scrape failures, query pressure, storage, rule evaluation, and Alertmanager health |
| Response | Critical response targets scoped in the support agreement; 24/7 coverage available for covered production observability platforms |
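The platform monitoring row can be illustrated with the Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The sketch below builds a query URL for `up == 0` (targets whose last scrape failed) and extracts the failing targets from a response in the documented JSON shape. The base URL, helper names, and sample payload are hypothetical; this is one possible health check, not a description of Assistance's internal tooling.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(payload: dict) -> list[tuple[str, str]]:
    """Return sorted (job, instance) pairs from an `up == 0` instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = payload["data"]["result"]
    return sorted(
        (s["metric"].get("job", ""), s["metric"].get("instance", "")) for s in series
    )


# Hypothetical response body in the shape the API returns for a vector result.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "postgres", "instance": "10.0.0.9:9187"},
             "value": [1700000000.0, "0"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "10.0.0.5:9100"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

print(instant_query_url("http://prometheus.example.internal:9090", "up == 0"))
for job, instance in down_targets(sample_response):
    print(f"scrape failing: job={job} instance={instance}")
```

In a running platform the same condition would normally live in an alerting rule rather than an ad hoc script. The script form is shown only because it makes the query and the response shape easy to see.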
Onboarding
1. Observability assessment
We review current metrics, dashboards, alert history, incident pain points, service ownership, environments, retention needs, and existing tools.
2. Platform design
Assistance defines scrape architecture, retention, long-term storage, dashboards, alert routing, integrations, access model, and support tier.
3. Signal implementation
We configure targets, exporters, rules, dashboards, Alertmanager routes, and runbook links. Where needed, we help teams define service-level indicators.
4. Operate and refine
After go-live, we monitor platform health, tune noisy alerts, review capacity, and keep dashboards aligned with service ownership and incident response.
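One way to keep alerts tied to ownership and runbooks, as described in steps 3 and 4, is to lint alerting rules before they are deployed. The sketch below checks rules expressed as dictionaries using the standard Prometheus rule fields (`alert`, `expr`, `labels`, `annotations`). The required label and annotation names (`severity`, `team`, `summary`, `runbook_url`) are assumed conventions for illustration, not a fixed Assistance standard, and the example rules and URL are hypothetical.

```python
# Assumed conventions; agree the real names with service owners.
REQUIRED_LABELS = ("severity", "team")
REQUIRED_ANNOTATIONS = ("summary", "runbook_url")


def lint_alert_rule(rule: dict) -> list[str]:
    """Return a list of problems that make an alert hard to route or act on."""
    name = rule.get("alert", "<unnamed>")
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}

    problems = []
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"{name}: missing label '{key}'")
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            problems.append(f"{name}: missing annotation '{key}'")
    return problems


# Hypothetical rules, as they might look after parsing a rule file.
rules = [
    {
        "alert": "HighErrorRate",
        "expr": 'sum(rate(http_requests_total{code=~"5.."}[5m])) > 10',
        "labels": {"severity": "page", "team": "checkout"},
        "annotations": {
            "summary": "Elevated 5xx responses",
            "runbook_url": "https://runbooks.example.internal/high-error-rate",
        },
    },
    {
        "alert": "HighLatency",
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1",
        "labels": {"severity": "page"},
        "annotations": {"summary": "p99 latency above 1s"},
    },
]

for rule in rules:
    for problem in lint_alert_rule(rule):
        print(problem)
```

A check like this does not make an alert actionable by itself, but it flags rules that would page someone without saying who owns the service or what to do next.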
Supported capabilities
- Prometheus servers, HA patterns, and federation/remote-write designs
- Alertmanager routing, silencing, inhibition, and notification integrations
- Grafana dashboards and data source configuration
- Exporter onboarding for Linux, Kubernetes, PostgreSQL, MySQL, Redis, MongoDB, Kafka, Nginx, HAProxy, and common infrastructure
- Long-term metric storage patterns where required
- SLO dashboard implementation when paired with reliability work
Not included by default
- Instrumenting every application endpoint
- Defining business KPIs without product owner input
- Providing blanket on-call response for services outside the support plan
- Guaranteeing alert actionability when service ownership is undefined
- Replacing all existing observability tools unless migration is scoped
Related products
- SRE as a Service — Turn metrics into SLOs, runbooks, and incident response practice
- Managed OpenSearch — Logs, search, and indexed operational data
- Managed Kafka — Metrics and alerting for streaming platforms
- Managed PostgreSQL — Database monitoring and operational dashboards
Getting started
Request an observability assessment. We will review current metrics, alerts, service ownership, and retention needs before proposing a managed Prometheus model.
Frequently asked questions
Can you work with our existing Grafana?
Yes. We can integrate with existing Grafana or operate Grafana as part of the managed observability platform when scoped.
Do you write application metrics?
We advise on instrumentation and can implement it as separate project work. By default, application teams own code-level metrics.
Can this reduce alert noise?
Yes, if service ownership and severity criteria are defined. We tune alerts to actionable conditions and connect them to dashboards and runbooks.
Do you provide on-call response for alerts?
Only for services explicitly covered by the support agreement. We can route alerts to your team, Assistance, or a shared model depending on scope.
What retention is available?
Retention is designed per plan and may include local storage plus long-term storage. We choose based on query needs, compliance, cost, and SLO review requirements.