DevOps
Topics
- Overview of DevOps
- Overview of SDLC model
- Waterfall
- Agile
- VM management
- VirtualBox
- Vagrant
- Google Cloud
- Linux Basics Overview
- Linux File Structure
- Linux file management
- Linux Network Management
- Linux Process Management
- Linux User Management
- Linux Service Management
- Web Server
- HTTP with SSL
- Reverse Proxy
- Let's Encrypt for SSL certificates
- SSL related commands
- Application Server
- Tomcat
- Securing Tomcat with SSL
- DNS Overview and testing utilities
- nslookup
- dig
- Tools Covered
- Code Management
- Git
- CI/CD Tools
- Jenkins
- GitLab
- GitHub
- Infrastructure as Code
- Terraform
- Configuration Management
- Ansible
- Containerization
- Docker
- Kubernetes
- Static Code Testing
- SonarQube
- Monitoring & Logging
- Prometheus
- Grafana
- Grafana Loki
DevOps Overview
DevOps is a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT teams. It emphasizes team empowerment, cross-team communication and collaboration, and technology automation.
DevOps combines development and operations to increase the efficiency, speed, and security of software development and delivery compared to traditional processes. A more nimble software development lifecycle results in a competitive advantage for businesses and their customers.
The DevOps lifecycle and how DevOps works
The DevOps lifecycle stretches from the beginning of software development through to delivery, maintenance, and security. The stages of the DevOps lifecycle are:
Plan
Organize the work that needs to be done, prioritize it, and track its completion.
Create
Write, design, develop and securely manage code and project data with your team.
Verify
Ensure that your code works correctly and adheres to your quality standards — ideally with automated testing.
Package
Package your applications and dependencies, manage containers, and build artifacts.
Secure
Check for vulnerabilities through static and dynamic tests, fuzz testing, and dependency scanning.
Release
Deploy the software to end users.
Configure
Manage and configure the infrastructure required to support your applications.
Monitor
Track performance metrics and errors to help reduce the severity and frequency of incidents.
Govern
Manage security vulnerabilities, policies, and compliance across your organization.
What is CI/CD?
Monitoring
- Prometheus
- Node Exporter
- WMI Exporter
- Blackbox Exporter
- Alertmanager
Prometheus
Prometheus is a free and open-source toolkit for time series monitoring and alerting. It was originally developed at SoundCloud, starting in 2012, and was then adopted by many companies and developers around the world. In 2016, it joined the Cloud Native Computing Foundation as the second hosted project after Kubernetes.
Features:
- It uses a multi-dimensional data model where each time series is identified by a metric name and key/value pairs (labels)
- It uses PromQL, a flexible query language that takes advantage of this dimensionality
- Time series metrics are collected via a pull model over HTTP
- It doesn't rely on distributed storage; single server nodes are autonomous
- Targets are discovered via service discovery or static configuration
- It supports multiple modes of graphing and dashboarding.
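The data model and the pull model in the list above can be made concrete with a small sketch. When Prometheus scrapes a target over HTTP, the target answers with plain-text sample lines; the Python below parses one such line into a metric name, a label set, and a value. The function names `parse_sample` and `series_key` are hypothetical, and the parser assumes simple lines without timestamps or comments, so it is an illustration rather than a full parser.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
# Assumption: no trailing timestamp and no "# HELP" / "# TYPE" comment lines.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, label_text, value = match.groups()
    labels = dict(LABEL_PAIR.findall(label_text or ""))
    return name, labels, float(value)


def series_key(name, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


name, labels, value = parse_sample(
    'http_requests_total{method="GET",handler="/api"} 1027'
)
print(series_key(name, labels), value)
```

Two samples with the same metric name but different label values produce different keys, which is why they are stored and queried as separate time series.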
The Prometheus ecosystem encompasses several components. They include:
- Prometheus server – scrapes and stores time series data
- Client libraries – used to instrument application code
- Pushgateway – to support short-lived jobs
- Special-purpose exporters – for services such as HAProxy, StatsD, Graphite, etc.
- Alertmanager – for alert handling
Prerequisites for Prometheus Installation
- VirtualBox
- Prometheus
- Linux server with 1 CPU and 2 GB of memory
- Create 2 Linux VMs
- Update the Linux OS
sudo dnf update -y
Add Prometheus Repositories
- Enable the EPEL repository
sudo dnf -y install epel-release vim
- Add the Prometheus repository
curl -s https://packagecloud.io/install/repositories/prometheus-rpm/release/script.rpm.sh | sudo bash
- Install Prometheus
sudo dnf install prometheus -y
- Once installed, Prometheus stores its data at /var/lib/prometheus and its config files at /etc/prometheus/.
- The default configuration file is /etc/prometheus/prometheus.yml
- Start and enable the service using the commands below:
sudo systemctl start prometheus
sudo systemctl enable prometheus
- Check the status of the Prometheus service
systemctl status prometheus
- Allow the port through the firewall if a firewall is in use
sudo firewall-cmd --add-port=9090/tcp --permanent
sudo firewall-cmd --reload
You can now access Prometheus at http://IP_Address:9090
- Check all targets on the Targets page of the Status menu (http://IP_Address:9090/targets)
- Review the configuration file and other details from the Status menu
Install and Configure Node Exporter
- Add the Prometheus repository on the server that will be monitored
curl -s https://packagecloud.io/install/repositories/prometheus-rpm/release/script.rpm.sh | sudo bash
- Install node_exporter
sudo dnf -y install node_exporter
- Start and enable the service after installation:
sudo systemctl enable --now node_exporter
- Check the node_exporter service
systemctl status node_exporter
- Check the port it listens on (9100 by default)
sudo ss -aplnt | grep node
- Allow the port through the firewall if a firewall is running
sudo firewall-cmd --add-port=9100/tcp --permanent
sudo firewall-cmd --reload
- On the Prometheus server, add the new node as a job under scrape_configs
sudo vim /etc/prometheus/prometheus.yml
  - job_name: 'node_exporter_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['SERVER-IP:9100']
- Restart the Prometheus service
sudo systemctl restart prometheus
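After the restart, the new job should appear as a healthy target. Besides the web UI, the Prometheus HTTP API exposes the target list at /api/v1/targets. The Python sketch below filters such a response for targets that are not up. The helper names and the `SAMPLE` payload are hypothetical: the payload is hand-written in the shape of the API response, not output captured from a real server.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url):
    """GET <base_url>/api/v1/targets and return the decoded JSON document."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every active target that is not up."""
    return [
        (target["labels"].get("job", ""), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Hand-written example payload (an assumption for illustration only).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "node_exporter_metrics", "instance": "SERVER-IP:9100"},
                "scrapeUrl": "http://SERVER-IP:9100/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

print(unhealthy_targets(SAMPLE))
```

Against a live server, `unhealthy_targets(fetch_targets("http://IP_Address:9090"))` would apply the same filter to the real target list.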
Install and Configure Blackbox exporter
The Blackbox exporter enables probing of endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. You can install it from the YUM repository or manually.
- Install the Blackbox exporter from the Prometheus repository
sudo dnf -y install blackbox_exporter
The settings of the Blackbox exporter are in the file /etc/prometheus/blackbox.yml. You can modify them, or just start and enable the service:
- Start and enable the blackbox_exporter service
sudo systemctl enable --now blackbox_exporter
- Allow the port through the firewall if a firewall is running
sudo firewall-cmd --add-port=9115/tcp --permanent
sudo firewall-cmd --reload
- Check the Prometheus configuration file
promtool check config /etc/prometheus/prometheus.yml
- Add a job for the probe to /etc/prometheus/prometheus.yml, under scrape_configs
  - job_name: 'Blackbox_ssh'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    static_configs:
      - targets:
          - 172.16.16.120:26
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.16.16.120:9115
- Restart the Prometheus service
sudo systemctl restart prometheus
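The relabel_configs in the Blackbox job are easier to follow when traced step by step. The Python sketch below mimics the three rules on a plain dictionary and then builds the URL that Prometheus would request from the exporter. It is a simplified model: the function names are hypothetical and real relabeling supports regexes, separators, and actions that are not modelled here.

```python
from urllib.parse import urlencode


def apply_blackbox_relabeling(target, exporter_address):
    """Trace the three relabel rules of the Blackbox job on one target."""
    labels = {"__address__": target}
    # Rule 1: copy the probed address into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed address visible as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual scrape to the Blackbox exporter instead.
    labels["__address__"] = exporter_address
    return labels


def probe_url(labels, module, metrics_path="/probe"):
    """Build the request URL from the relabeled address and URL parameters."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = apply_blackbox_relabeling("172.16.16.120:26", "172.16.16.120:9115")
print(probe_url(labels, "ssh_banner"))
```

The result shows the design: Prometheus always talks to the exporter on port 9115, while the probed endpoint travels as a query parameter and remains the `instance` label on the stored metrics.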
Configuring Alertmanager
Alertmanager handles the alerts sent by client applications such as the Prometheus server and notifies users via email, Slack, or other tools. Prometheus evaluates its alert rules at every evaluation interval against the metrics it has scraped; when a rule's condition holds, Prometheus pushes the alert to Alertmanager.
Alertmanager handles grouping, deduplication, and routing of alerts to the correct receiver. It manages alerts through the following mechanisms:
- Silencing: mutes alerts for a given period
- Grouping: groups alerts of a similar nature into a single notification to avoid sending multiple notifications.
- Inhibition: suppresses specific alerts while certain other alerts are already firing.
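Grouping and inhibition from the list above can be illustrated with a few lines of Python. This is a toy model under stated assumptions, not Alertmanager's implementation: alerts are plain dictionaries, matchers are exact label matches only, and the function names are hypothetical.

```python
from collections import defaultdict


def matches(alert, matchers):
    """True when every label in matchers has exactly that value on the alert."""
    return all(alert["labels"].get(name) == value for name, value in matchers.items())


def group_alerts(alerts, group_by):
    """Grouping: alerts that share the group_by label values form one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source_match, target_match, equal):
    """Inhibition: drop target alerts while a source alert with equal labels fires."""
    sources = [alert for alert in alerts if matches(alert, source_match)]
    kept = []
    for alert in alerts:
        suppressed = matches(alert, target_match) and any(
            all(alert["labels"].get(name) == source["labels"].get(name) for name in equal)
            for source in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "a", "severity": "critical"}},
    {"labels": {"alertname": "InstanceDown", "instance": "b", "severity": "critical"}},
    {"labels": {"alertname": "HighLatency", "instance": "a", "severity": "warning"}},
    {"labels": {"alertname": "HighLatency", "instance": "c", "severity": "warning"}},
]
print(sorted(group_alerts(alerts, ["alertname"])))
print(len(inhibit(alerts, {"severity": "critical"}, {"severity": "warning"}, ["instance"])))
```

Here the four alerts collapse into two groups, and the warning on instance `a` is suppressed because a critical alert is already firing for the same instance.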
- Install Alertmanager
sudo dnf install alertmanager -y
- Alertmanager config file (the location depends on the package; check which of these paths exists)
/etc/alertmanager/alertmanager.yml
/etc/prometheus/alertmanager.yml
- repeat_interval tells Alertmanager how long to wait before re-sending a notification for an alert that is still firing. Alertmanager's default is 4 hours; the configuration below sets it to 1 hour, and you can adjust it as desired.
- receiver: 'email' sets the default receiver to be used. For this tutorial, the default receiver is email.
- receivers: lists the available receivers with their configurations, for example web.hook, or email as configured below.
- Modify the Alertmanager configuration
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'Test@gmail.com'
        from: 'test@gmail.com'
        smarthost: smtp.gmail.com:587
        auth_username: 'test@gmail.com'
        auth_password: 'YOUR_APP_PASSWORD'
        send_resolved: true
- Check the Alertmanager configuration (use the path that exists on your system)
amtool check-config /etc/alertmanager/alertmanager.yml
amtool check-config /etc/prometheus/alertmanager.yml
- Modify the Prometheus configuration settings
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert_rules.yml"
  # - "second_rules.yml"
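- Create the alert rules file. The up metric is 1 when a scrape succeeds and 0 when it fails, so InstanceDown must test for 0.
sudo vi /etc/prometheus/alert_rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m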
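The `for:` clause delays an alert until its condition has held continuously for the given duration. The Python sketch below walks through that state machine for a series of `up` values, assuming the rule condition is `up == 0` (a failed scrape). The function name `alert_states` and the sample data are hypothetical, and the model only covers the inactive, pending, and firing states.

```python
def alert_states(samples, for_seconds, down_value=0.0):
    """Return the alert state after each rule evaluation.

    samples is a list of (timestamp_seconds, value) pairs in time order,
    one pair per rule evaluation.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value != down_value:
            # Condition no longer holds: the alert resets completely.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        # Pending until the condition has held for the full duration.
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# One evaluation every 15 seconds; the target is down from t=15 to t=75.
up_samples = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0), (90, 1)]
print(alert_states(up_samples, for_seconds=60))
```

With `for: 1m`, the alert stays pending for the first four failed evaluations and fires at t=75, which is why short outages never reach Alertmanager.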
PromQL (WIP)
PromQL: Basics
Prometheus Metric and Data Types
Prometheus has four metric types:
- Counters
- Gauges
- Histograms
- Summaries
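Of the four types, counters are the one most often misread: the raw value only ever grows (or restarts from zero), so it is the rate of change that is usually queried. The Python sketch below shows one way to compute an increase and a per-second rate from raw counter samples, treating any drop in value as a counter reset. It is a simplified model with hypothetical function names; the real PromQL `rate()` also extrapolates to the edges of the time window, which is not done here.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, treating a drop as a reset."""
    total = 0.0
    previous = None
    for _, value in samples:
        if previous is not None:
            if value < previous:
                # Counter restarted from zero: everything counted since then is new.
                total += value
            else:
                total += value - previous
        previous = value
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples with increasing timestamps")
    return counter_increase(samples) / elapsed


# The counter restarts between t=15 and t=30 (130 drops to 10).
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(counter_increase(samples), simple_rate(samples))
```

A gauge, by contrast, can go up and down freely, so its raw value is meaningful on its own and no reset handling applies.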
https://logz.io/blog/promql-examples-introduction/
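- Matchers and selectors: a selector filters series by label, for example {job="prometheus"} or {job="prometheus",node="node1"}
- Equality matcher (=): {job="prometheus"}
- Negative equality matcher (!=): {job!="node2"}
- Regular expression matcher (=~): prometheus_http_requests_total{handler=~"/api.*"}
- Negative regular expression matcher (!~): selects series whose label value does not match the expression
- Range vector: a duration in square brackets after a selector, for example [1d] or [1m]
- Binary operators: arithmetic (+, -, *, /, %, ^)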
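The four label matchers can be modelled in a few lines of Python. This sketch is an approximation under stated assumptions: the names `label_matches` and `select` are hypothetical, Python's `re` module stands in for the RE2 syntax that Prometheus uses, and series are plain dictionaries of labels. Two details it does reflect are that regular expression matchers are fully anchored and that a missing label behaves like an empty string.

```python
import re


def label_matches(value, operator, pattern):
    """Evaluate one matcher against one label value."""
    if operator == "=":
        return value == pattern
    if operator == "!=":
        return value != pattern
    if operator == "=~":
        # Anchored: the whole value must match, not just a prefix or substring.
        return re.fullmatch(pattern, value) is not None
    if operator == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {operator}")


def select(series, matchers):
    """Keep the series whose labels satisfy every (name, operator, pattern) matcher."""
    return [
        labels
        for labels in series
        if all(
            label_matches(labels.get(name, ""), operator, pattern)
            for name, operator, pattern in matchers
        )
    ]


series = [
    {"job": "prometheus", "handler": "/api/v1/query"},
    {"job": "prometheus", "handler": "/metrics"},
    {"job": "prometheus", "handler": "/v2/api"},
]
print(select(series, [("handler", "=~", "/api.*")]))
print(select(series, [("job", "=", "prometheus"), ("handler", "!~", "/api.*")]))
```

Because of the anchoring, `/v2/api` is not selected by `handler=~"/api.*"` even though it contains `/api`; a pattern such as `.*/api.*` would be needed for a substring match.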
Aggregation operators (example queries):
- Find the number of pods per namespace
sum by (namespace) (kube_pod_info)
- Find CPU overcommit
sum(kube_pod_container_resource_limits{resource="cpu"}) - sum(kube_node_status_capacity{resource="cpu"})
- Memory overcommit
sum(kube_pod_container_resource_limits{resource="memory"}) - sum(kube_node_status_capacity{resource="memory"})
- Find unhealthy Kubernetes pods
min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
- Find Kubernetes pods CrashLooping
increase(kube_pod_container_status_restarts_total[15m]) > 3
- Find the number of containers without CPU limits in each namespace
count by (namespace)(sum by (namespace,pod,container)(kube_pod_container_info{container!=""}) unless sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"}))
- Find PersistentVolumeClaim in the pending state
kube_persistentvolumeclaim_status_phase{phase="Pending"}
- Find unstable nodes
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2
- Find idle CPU cores
sum((rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m]) - on (namespace,pod,container) group_left avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="cpu"})) * -1 >0)
- Find idle memory
sum((container_memory_usage_bytes{container!="POD",container!=""} - on (namespace,pod,container) avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="memory"})) * -1 >0 ) / (1024*1024*1024)
- Find node status
sum(kube_node_status_condition{condition="Ready",status="true"})
sum(kube_node_status_condition{condition="Ready",status!="true"})
sum(kube_node_spec_unschedulable) by (node)