Monitoring Cloudflare's planet-scale edge network with Prometheus
Matt Bostock

@mattbostock Platform Operations

Talk outline ● What’s Prometheus? ● Why use Prometheus? ● Our monitoring architecture ● Approaches for reducing alert fatigue

Key takeaways ● Prometheus is the new gold standard for monitoring ● Monitor within the same failure domain ● Good monitoring doesn’t happen for free ● Monitoring is about interaction with humans

What’s Prometheus? ● Monitoring system, using metrics ● Started in 2012, incubated by SoundCloud ● Governance: hosted project of CNCF ● Has a time-series database at its core ● Strong community: 162+ contributors on main repo

Why use Prometheus? ● Simple to operate and deploy ● Dynamic configuration using service discovery ● Succinct query language, PromQL ● Multi-dimensional metrics are very powerful http_requests_total{status="502", job="service_a", instance="98ea3"}

Scope

Prometheus for monitoring ● Alerting on critical production issues ● Incident response ● Post-mortem analysis ● Metrics, but not long-term storage

Cloudflare’s anycast edge network

5M HTTP requests/second

10% Internet requests every day

116 Data centers globally

1.2M DNS requests/second

6M+ websites, apps & APIs in 150 countries

Cloudflare’s Prometheus deployment

72k Samples ingested per second max per server

188 Prometheus servers currently in Production

4.8M Time-series max per server

4 Top-level Prometheus servers

250GB Max size of data on disk

Edge architecture ● Routing via anycast ● Points of Presence (PoPs) configured identically ● Each PoP is independent

Primary services in each PoP ● HTTP ● DNS ● Replicated key-value store ● Attack mitigation

Core data centers ● Enterprise log share (HTTP access logs for Enterprise customers) ● Customer analytics ● Logging: auditd, HTTP errors, DNS errors, syslog ● Application and operational metrics ● Internal and customer-facing APIs

Services in core data centers ● PaaS: Marathon, Mesos, Chronos, Docker, Sentry ● Object storage: Ceph ● Data streams: Kafka, Flink, Spark ● Analytics: ClickHouse (OLAP), CitusDB (sharded PostgreSQL) ● Hadoop: HDFS, HBase, OpenTSDB ● Logging: Elasticsearch, Kibana ● Config management: Salt ● Misc: MySQL

Prometheus queries

node_md_disks_active / node_md_disks * 100

count(count(node_uname_info) by (release))

rate(node_disk_read_time_ms[2m]) / rate(node_disk_reads_completed[2m])

Metrics for alerting

sum(rate(http_requests_total{job="foo", code=~"5.."}[2m])) / sum(rate(http_requests_total{job="foo"}[2m])) * 100 > 0

count(
  abs(
    (hbase_namenode_FSNamesystemState_CapacityUsed / hbase_namenode_FSNamesystemState_CapacityTotal)
    - ON() GROUP_RIGHT()
    (hadoop_datanode_fs_DfsUsed / hadoop_datanode_fs_Capacity)
  ) * 100 > 10
)

Prometheus architecture

Before, we used Nagios ● Tuned for high volume of checks ● Hundreds of thousands of checks ● One machine in one central location ● Alerting backend for custom metrics pipeline


Inside each PoP [diagram: a single Prometheus server scrapes every server in the PoP]

Metrics are exposed by exporters ● An exporter is a process ● Speaks HTTP, answers on /metrics ● Large exporter ecosystem ● You can write your own
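The slide above notes that you can write your own exporter. A minimal sketch in Python using only the standard library — the metric names mirror the node exporter's RAID gauges, and the values are hard-coded for illustration; a real exporter would read them from /proc/mdstat:

```python
# Minimal sketch of a custom exporter: a process that speaks HTTP and
# answers on /metrics. Real exporters typically use the official
# Prometheus client libraries; values here are hard-coded examples.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics():
    """Render metrics in the Prometheus text exposition format."""
    active, total = 7, 8  # a real exporter would read /proc/mdstat
    return (
        "# HELP node_md_disks_active Active disks in the RAID array.\n"
        "# TYPE node_md_disks_active gauge\n"
        f"node_md_disks_active {active}\n"
        "# HELP node_md_disks Total disks in the RAID array.\n"
        "# TYPE node_md_disks gauge\n"
        f"node_md_disks {total}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def scrape_once():
    """Start the exporter on an ephemeral port and scrape it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/metrics"
    with urllib.request.urlopen(url) as resp:
        payload = resp.read().decode()
    server.shutdown()
    return payload

if __name__ == "__main__":
    print(scrape_once())
```

The alerting rule shown later (node_md_disks - node_md_disks_active > 0) would fire against these sample values, since one of the eight disks is inactive.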


Inside each PoP: High availability [diagram: two Prometheus servers per PoP, each scraping every server]

Federation [diagram: a top-level Prometheus server in the core scrapes the PoP-level Prometheus servers in San Jose, Frankfurt and Santiago]


Federation: High availability [diagram: two top-level Prometheus servers in the core, each scraping every PoP]

Federation: High availability [diagram: top-level Prometheus servers in both the US and EU core data centers, each scraping every PoP]

Retention and sample frequency ● 15 days’ retention ● Metrics scraped every 60 seconds ○ Federation: every 30 seconds ● No downsampling
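The scrape settings above might look like this in configuration — a hypothetical sketch, with illustrative job names and targets; retention itself is set with a server flag in Prometheus 1.x (e.g. -storage.local.retention=360h for 15 days), not in the config file:

```yaml
# Hypothetical prometheus.yml for a top-level server.
global:
  scrape_interval: 60s          # default: scrape targets every minute

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s        # federation is scraped twice as often
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'         # in practice, federate a curated subset
    static_configs:
      - targets:
          - 'prometheus.pop1.internal:9090'   # illustrative targets
          - 'prometheus.pop2.internal:9090'
```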

Exporters we use ● Node exporter: system (CPU, memory, TCP, RAID, etc) ● Blackbox exporter: network probes (HTTP, TCP, ICMP ping) ● mtail: log matches (hung tasks, controller errors) ● cAdvisor: container/namespace metrics (resources)

Deploying exporters ● One exporter per service instance ● Separate concerns ● Deploy in same failure domain

Alerting

Alerting [diagram: Prometheus servers in San Jose, Frankfurt and Santiago send alerts to a central Alertmanager in the core]

Alerting: High availability (soon) [diagram: Alertmanager instances in both the US and EU core data centers]

Writing alerting rules ● Test the query on past data ● Descriptive name with adjective or adverb (RAID_Array → RAID_Health_Degraded) ● Must have an alert reference ● Must be actionable ● Keep it simple

Example alerting rule

ALERT RAID_Health_Degraded
  IF node_md_disks - node_md_disks_active > 0
  LABELS { notify="jira-sre" }
  ANNOTATIONS {
    summary = "{{ $value }} disks in {{ $labels.device }} on {{ $labels.instance }} are faulty",
    dashboard = "https://grafana.internal/disk-health?var-instance={{ $labels.instance }}",
    link = "https://wiki.internal/ALERT+Raid+Health",
  }

Dashboards ● Use drill-down dashboards for additional detail ● Template dashboards using labels ● Dashboards as living documentation

Monitoring == interaction with humans ● Understand the user journey ● Not everything is critical ● Pick the appropriate channels ● Alert on symptoms over causes ● Alert on services over machines

Monitoring your monitoring

Monitoring Prometheus ● Mesh: each Prometheus monitors the other Prometheus servers in the same datacenter ● Top-down: top-level Prometheus servers monitor datacenter-level Prometheus servers
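The mesh approach above can be expressed as an ordinary scrape job — a sketch with made-up target names — where each Prometheus scrapes its peers' own /metrics endpoints and an alert fires when a peer's up series drops to 0:

```yaml
# Hypothetical scrape job: each Prometheus in a datacenter scrapes the
# other Prometheus servers in the same datacenter.
scrape_configs:
  - job_name: 'prometheus-peers'
    static_configs:
      - targets:
          - 'prometheus-2.pop.internal:9090'
          - 'prometheus-3.pop.internal:9090'

# Alert when a peer stops responding (Prometheus 1.x rule syntax):
# ALERT Prometheus_Peer_Down
#   IF up{job="prometheus-peers"} == 0
#   FOR 5m
#   LABELS { notify="jira-sre" }
```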

Testing escalations: PagerDuty drill

ALERT SRE_Escalation_Drill
  IF (hour() % 8 == 1 and minute() >= 35)
     or (hour() % 8 == 2 and minute() < 20)
  LABELS { notify="escalate-sre" }
  ANNOTATIONS {
    dashboard = "https://cloudflare.pagerduty.com/",
    link = "https://wiki.internal/display/OPS/ALERT+Escalation+Drill",
    summary = "This is a drill to test that alerts are being correctly escalated. Please ack the PagerDuty notification."
  }

Alerting

Routing tree

amtool

$ go get -u github.com/prometheus/alertmanager/cmd/amtool
$ amtool silence add \
    --expire 4h \
    --comment https://jira.internal/TICKET-1234 \
    alertname=HDFS_Capacity_Almost_Exhausted

Lessons learnt

Pain points ● Alertmanager instability in earlier versions ● Phantom spikes in percentiles using histograms ● Self-inhibiting alerts ● Storage: tuning, cardinality explosions ● See forthcoming PromCon videos for more details

Standardise on metric labels early ● Especially probes: source versus target ● Identifying environments ● Identifying clusters ● Identifying deployments of same app in different roles

Be open and engage other teams early ● Document your proposal, solicit reviews ● Talks, documentation and training really important ● Highlight the benefits for them ● Instrumentation takes time, engage teams early ● Get backing from tech leadership and management

Keep iterating ● Learn from your alert data ● Tune thresholds and queries; iterate on your metrics ● Err on side of alerting less; not more ● Empower staff on-call to improve their experience ● Always keep the user journey in mind

Next steps

Still more to do ● Removed 87% of our Nagios alerts ● More instrumentation and services still to add ● Now have a lot of the tooling in place ● Continue reducing alert noise

Prometheus 2.0 ● Lower disk I/O and memory requirements ● Better handling of metrics churn

Integration with long term storage ● Ship metrics from Prometheus (remote write) ● One query language: PromQL

More improvements ● Federate one set of metrics per datacenter ● Highly-available Alertmanager ● Visual similarity search ● Alert menus; loading alerting rules dynamically ● Alert routing using service owner metadata

Further information blog.cloudflare.com github.com/cloudflare

Try Prometheus 2.0: prometheus.io/blog Questions? @mattbostock

Thanks! blog.cloudflare.com github.com/cloudflare
