Original URL: https://trevorsmale.github.io/techblog/post/pacu12/
Baselining & Benchmarking
The purpose of a baseline is not to find fault, load, or to take corrective action. A baseline simply determines what is. You must know what is so that, when you make a change, you can test against it and say objectively whether there was or wasn't an improvement. You must know where you are to plan properly where you are going. A poor baseline assessment, whether from inflated numbers or inaccurate testing, does a disservice to the rest of your project. You must draw the first line accurately and understand your system's performance.
Discussion Post 1:
Your manager has come to you with another emergency. He has a meeting next week to discuss capacity planning and usage of the system with IT upper management. He doesn't want to lose his budget, but he has to prove that the system utilization warrants spending more.
- What information can you show your manager from your systems?
You could present your manager with a progressive trend graph showing time on the x-axis and several fields on the y-axis that represent changes from a baseline, assuming the necessary data has been collected. With this information, it would be possible to predict when various system resources will reach their maximum capacity.
- What type of data would prove system utilization? (Remember the big 4: compute, memory, disk, networking)
CPU load, process execution time, throughput. Disk Operations (IOPS). Networking Requests and Bandwidth. RAM utilization, memory paging/swapping rates.
- What would your report look like to your manager?
Capacity Planning Report
This report covers current and projected system utilization. By examining trends over time, we can predict when critical resources will reach their limits if no additional capacity is provisioned.
Key Areas of Focus
- Compute Usage
- Memory Load
- Disk Resources
- Networking Metrics
Historical Data and Trends
Below is a sample progressive trend graph over the last 6 months. The x-axis represents time (in weeks), while the y-axis shows percentage utilization relative to an established baseline.
Example Metrics (relative to baseline):
- CPU Utilization (% of baseline)
- Memory Load (% of baseline)
- Disk IOPS (% of baseline)
- Network Throughput (% of baseline)
Time (Weeks):    1   2   3   4   5   6  ...  20  21  22
CPU Util (%):   50  52  55  57  60  62  ...  80  82  85
Memory (%):     45  47  50  50  52  55  ...  70  73  75
Disk IOPS (%):  30  32  35  36  38  40  ...  60  63  68
Network (%):    40  42  45  47  49  51  ...  75  78  80
As time progresses, each of the key metrics is trending upward, indicating increasing load and approaching capacity thresholds.
Projections
We can estimate the "time to ceiling" for critical resources. For instance, if CPU load is rising at an average rate of 2-3% per month and we know that the system will experience performance degradation at 90% utilization, dividing the remaining headroom by the growth rate gives the time left before that threshold is reached.
Projected Time to CPU Ceiling: 3-5 months
Projected Time to Memory Ceiling: 6-8 months
Projected Time to Disk IOPS Ceiling: 8-10 months
Projected Time to Network Bandwidth Ceiling: 4-6 months
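The headroom arithmetic behind a "time to ceiling" figure can be sketched in a few lines of Python. This is only an illustration under a linear-growth assumption: the function name months_to_ceiling and the 80% starting utilization are hypothetical, not measurements from a real system.

```python
def months_to_ceiling(current_pct, ceiling_pct, growth_pct_per_month):
    """Months until utilization reaches the ceiling, assuming linear growth."""
    if growth_pct_per_month <= 0:
        raise ValueError("growth rate must be positive")
    # Headroom is the gap between today's utilization and the ceiling.
    headroom = max(ceiling_pct - current_pct, 0.0)
    return headroom / growth_pct_per_month


# Hypothetical inputs: CPU at 80% today, degradation expected at 90%,
# growth somewhere between 2% and 3% per month.
slow = months_to_ceiling(80, 90, 2.0)
fast = months_to_ceiling(80, 90, 3.0)
print(f"Projected time to CPU ceiling: {fast:.1f} to {slow:.1f} months")
```

Real growth is rarely perfectly linear, so a range built from the slowest and fastest observed rates is usually more honest than a single number.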
Recommendations
- Compute: Consider adding more CPU cores or upgrading processors before reaching the predicted 90% utilization mark.
- Memory: Upgrade RAM or optimize applications to reduce memory footprint.
- Disk: Enhance disk subsystems or switch to faster storage (e.g., SSDs) to handle projected IOPS.
- Networking: Increase network capacity (e.g., from 1Gb to 10Gb links) or optimize network traffic.
Conclusion
Investment in scaling resources now will prevent future performance bottlenecks, ensuring the system can continue to meet business demands effectively.
Discussion Post 2:
You are in a capacity planning meeting with a few of the architects. They have decided to add two more agents to your Linux systems: a Bacula agent and an Avamar agent. They expect these agents to start their work at 0400 every morning.
- What do these agents do? (May have to look them up)
Bacula is an open-source suite of tools designed to automate backup tasks. It's widely regarded for its flexibility and reliability. Dell Avamar, on the other hand, is a commercial backup automation solution. Both tools handle incremental backups using custom daemons that monitor changes over time, offering greater sophistication than simple scheduling systems like Cron. Additionally, they can manage backups across diverse, heterogeneous storage environments.
- Do you think there is a good reason not to use these agents at this timeframe?
The question comes down to balancing workload. If all processes start at the same fixed time, they compete for valuable resources simultaneously. The best schedule depends on the environment. For example, if the environment has predictable downtime, such as a traditional office setting, starting backups at 4 a.m. might be fine. However, if services run around the clock, it's better to stagger the tasks so they use only a fraction of the available resources at any given time. Staggering also reduces the impact of failures, since not all systems are involved at once.
- Is there anything else you might want to point out to these architects about these agents they are installing?
There are several factors architects should consider. However, in the context of this discussion, performance overhead is particularly relevant. They need to ensure that the chosen backup solutions won't overburden the system's resources and that there's enough "breathing room" to maintain smooth operations.
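The staggering idea discussed above can be sketched as a small Python helper that gives each host a stable start time inside a backup window instead of starting everything at 0400 sharp. The function name, the two-hour window, and the hostnames are all assumptions made for illustration; neither Bacula nor Avamar is driven by this code, which only shows one way to derive the offsets you would then enter into their schedulers.

```python
import hashlib
from datetime import datetime, timedelta


def staggered_start(hostname, window_start="04:00", window_minutes=120):
    """Map a hostname to a stable start time inside the backup window."""
    # A cryptographic hash keeps the offset identical across runs and machines
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % window_minutes
    start = datetime.strptime(window_start, "%H:%M") + timedelta(minutes=offset)
    return start.strftime("%H:%M")


# Hypothetical hostnames, for illustration only.
for host in ["web01", "web02", "db01"]:
    print(host, staggered_start(host))
```

Hashing spreads hosts roughly evenly across the window without a central list of assignments, at the cost of occasional collisions between two hosts.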
Discussion Post 3: "TODO"
Your team has recently tested a proof of concept of a new storage system. The vendor has published the blazing fast speeds that this storage system is capable of. You have a set of systems connected to both the old storage system and the new storage system.
- Write up a test procedure of how you may test these two systems.
I did a bit of research on tooling for such a task and found FIO ("Flexible Input/Output"), a program written for testing storage under various scenarios. Rather than hand-rolling tests in Bash, I can run more comprehensive tests with FIO and collect more data to analyze.
Baseline Test
fio --filename=/dev/new_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=1 --iodepth=32 --runtime=300 --time_based --name=new_storage_seq_read
fio --filename=/dev/old_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=1 --iodepth=32 --runtime=300 --time_based --name=old_storage_seq_read
Running Mixed Workload Tests
fio --filename=/dev/new_storage_lun --direct=1 --rw=randrw --rwmixread=70 --bs=4k --size=10G --numjobs=4 --iodepth=16 --runtime=300 --time_based --name=new_storage_mixed
Increased Concurrency and Scale
- Increase numjobs and iodepth in subsequent runs to measure how performance changes:
fio --filename=/dev/new_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=8 --iodepth=64 --runtime=300 --time_based --name=new_storage_high_concurrency
- Run the same tests on the old storage system and record all metrics.
Stress/Soak Tests
- 12-hour continuous I/O test on both storage systems.
fio --filename=/dev/new_storage_lun --direct=1 --rw=randwrite --bs=4k --size=100G --numjobs=1 --iodepth=32 --runtime=43200 --time_based --name=new_storage_soak
AWK Line Parsing
I would then pipe the output of these commands to AWK to separate out specific data points and append them to files for full analysis.
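As an alternative to parsing the human-readable output with AWK, fio can emit machine-readable results when run with --output-format=json. The Python sketch below swaps AWK for JSON parsing; it assumes the fio 3.x JSON layout (a "jobs" list with per-direction "iops", "bw" in KiB/s, and "lat_ns" fields), and the sample values are made up rather than measured.

```python
import json


def summarize_fio(json_text):
    """Return one summary row per job and I/O direction from fio JSON output."""
    report = json.loads(json_text)
    rows = []
    for job in report.get("jobs", []):
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            if not stats.get("iops"):
                continue  # this direction was not exercised by the job
            rows.append({
                "job": job.get("jobname"),
                "direction": direction,
                "iops": stats["iops"],
                "bw_kib_s": stats.get("bw"),
                # fio 3.x reports latency in nanoseconds; convert to ms.
                "mean_lat_ms": stats.get("lat_ns", {}).get("mean", 0) / 1e6,
            })
    return rows


# Made-up sample shaped like fio 3.x output, trimmed to the fields used above.
sample = json.dumps({"jobs": [{
    "jobname": "new_storage_seq_read",
    "read": {"iops": 8000.0, "bw": 1024000, "lat_ns": {"mean": 4000000.0}},
    "write": {"iops": 0.0, "bw": 0, "lat_ns": {"mean": 0.0}},
}]})

for row in summarize_fio(sample):
    print(row)
```

In practice the JSON would come from a file written with fio's --output option, and each row could be appended to a CSV for later comparison between the two storage systems.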
- How are you assuring these tests are objective?
By gathering multiple datasets with varying run parameters, I can reduce statistical noise and better isolate data of interest by comparing these datasets against one another.
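One way to make "reducing statistical noise" concrete is to repeat each test several times and report the spread alongside the average. The sketch below uses Python's statistics module; the IOPS figures are hypothetical placeholders, not results from either storage system.

```python
import statistics


def summarize_runs(iops_samples):
    """Mean, sample standard deviation, and coefficient of variation (%)."""
    mean = statistics.mean(iops_samples)
    stdev = statistics.stdev(iops_samples)  # needs at least two runs
    return {"mean": mean, "stdev": stdev, "cv_pct": 100 * stdev / mean}


# Hypothetical IOPS from five repeated runs of the same fio job.
old_runs = [4100, 3950, 4020, 4080, 3990]
new_runs = [7900, 8100, 8050, 7950, 8000]

old = summarize_runs(old_runs)
new = summarize_runs(new_runs)
print(f"old: {old['mean']:.0f} IOPS (CV {old['cv_pct']:.1f}%)")
print(f"new: {new['mean']:.0f} IOPS (CV {new['cv_pct']:.1f}%)")
print(f"speedup: {new['mean'] / old['mean']:.2f}x")
```

A high coefficient of variation suggests the runs were disturbed by something other than the storage system, which is a cue to rerun before drawing conclusions.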
- What is meant by the term Ceteris Paribus, in this context?
Ceteris paribus ("all other things being equal"), in the context of system benchmarking, means that when measuring the performance of one specific aspect of the system, all other variables and conditions are kept constant. This approach ensures that any observed changes in performance can be attributed directly to the variable under test, rather than being influenced by unrelated fluctuations in the environment or system load.
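A simple way to enforce this in practice is to generate the commands for both storage systems from one shared set of parameters, so the device path (and the job name that labels it) is the only thing that differs. The sketch below is an assumption about how one might organize that; the parameter values mirror the mixed-workload test earlier, and the old_storage_mixed job name is hypothetical.

```python
import shlex

# Shared parameters: identical for both storage systems.
COMMON = {
    "direct": 1, "rw": "randrw", "rwmixread": 70, "bs": "4k",
    "size": "10G", "numjobs": 4, "iodepth": 16, "runtime": 300,
}


def fio_command(device, name, params=COMMON):
    """Build a fio argument list where only device and name vary."""
    args = ["fio", f"--filename={device}", f"--name={name}",
            "--time_based", "--output-format=json"]
    args += [f"--{key}={value}" for key, value in params.items()]
    return args


new_cmd = fio_command("/dev/new_storage_lun", "new_storage_mixed")
old_cmd = fio_command("/dev/old_storage_lun", "old_storage_mixed")

print(shlex.join(new_cmd))
print(shlex.join(old_cmd))
```

Keeping the parameters in one place also means a later change to block size or queue depth applies to both systems at once, rather than drifting between two hand-edited commands.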
Definitions & Terminology
- Benchmark
- High watermark
- Scope
- Methodology
- Testing
- Control
- Experiment
- Analytics
- Descriptive
- Diagnostic
- Predictive
- Prescriptive
Digging Deeper (optional)
- Analyzing data may open up a new field of interest to you. Go through some of the free lessons on Kaggle, here: https://www.kaggle.com/learn
a. What did you learn?
b. How will you apply these lessons to data and monitoring you have already collected as a system administrator?
- Find a blog or article that discusses the 4 types of data analytics.
a. What did you learn about past operations?
b. What did you learn about predictive operations?
- Download Spyder IDE (Open source)
a. Find a blog post or otherwise try to evaluate some data.
b. Perform some linear regression. My block of code is below (it requires some additional libraries to be added; I can help with that if you need it).
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Pizza sizes (inches) and prices (cents); scikit-learn expects 2-D feature arrays
size = [[5.0], [5.5], [5.9], [6.3], [6.9], [7.5]]
price = [[165], [200], [223], [250], [278], [315]]

plt.title('Pizza Price plotted against the size')
plt.xlabel('Pizza Size in inches')
plt.ylabel('Pizza Price in cents')
plt.plot(size, price, 'k.')
plt.axis([5.0, 9.0, 99, 355])
plt.grid(True)

model = LinearRegression()
model.fit(X=size, y=price)

# Plot the regression line
plt.plot(size, model.predict(size), color='r')
plt.show()
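If the extra libraries are not available, the same straight-line fit can be approximated with plain Python. This is a minimal sketch of ordinary least squares using the pizza data above; fit_line is a hypothetical helper, not part of scikit-learn, and the 8-inch prediction is an extrapolation beyond the data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


sizes = [5.0, 5.5, 5.9, 6.3, 6.9, 7.5]    # inches
prices = [165, 200, 223, 250, 278, 315]   # cents

slope, intercept = fit_line(sizes, prices)
print(f"price ~= {slope:.1f} * size + {intercept:.1f}")
print(f"predicted price for an 8.0 inch pizza: {slope * 8.0 + intercept:.0f} cents")
```

The same function works on monitoring data, for example week number against CPU utilization, which is how a trend line for capacity planning could be produced.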
Reflection Questions
- What questions do you still have about this week?
- How can you apply this now in your current role in IT? If you're not in IT, how can you look to put something like this into your resume or portfolio?
Digging Deeper
1. Read the rest of the chapter https://sre.google/workbook/monitoring/ and note anything else of interest when it comes to monitoring and dashboarding.
2. Look up the "ProLUG Prometheus Certified Associate Prep 2024" in Resources -> Presentations in our ProLUG Discord. Study that for a deep dive into Prometheus.
3. Complete the project section of "Monitoring Deep Dive Project Guide" from the prolug-projects section of the Discord. We have a Youtube video on that project as well. https://www.youtube.com/watch?v=54VgGHr99Qg
Labs
https://killercoda.com/het-tanis/course/Linux-Labs/102-monitoring-linux-logs
https://killercoda.com/het-tanis/course/Linux-Labs/103-monitoring-linux-telemetry
https://killercoda.com/het-tanis/course/Linux-Labs/104-monitoring-linux-Influx-Grafana
- While completing each lab think about the following:
a. How does it tie into the diagram below?
b. What could you improve, or what would you change based on your previous administration experience.
Install Grafana on the Rocky Linux system by adding the Grafana repo manually.
Red = Inputs, Blue = Outputs
- Create a new repository configuration
sudo vim /etc/yum.repos.d/grafana.repo
Paste:
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
- Verify using DNF
sudo dnf repolist
sudo dnf clean all   # clears cached repo metadata so it is fetched fresh on the next DNF command
Should see:
repo id      repo name
appstream    Rocky Linux 8 - AppStream
baseos       Rocky Linux 8 - BaseOS
extras       Rocky Linux 8 - Extras
grafana      grafana/spl
- Check the grafana package on the official repository
sudo dnf info grafana
Should see something similar:
Importing GPG key 0x24098CB6:
 Userid     : "Grafana "
 Fingerprint: 4E40 DDF6 D76E 284A 4A67 80E4 8C8C 34C5 2409 8CB6
 From       : https://packages.grafana.com/gpg.key
Is this ok [y/N]: y
Should see:
Name         : grafana
Version      : 8.2.5
Release      : 1
Architecture : x86_64
Size         : 64 M
Source       : grafana-8.2.5-1.src.rpm
Repository   : grafana
Summary      : Grafana
URL          : https://grafana.com
License      : "Apache 2.0"
Description  : Grafana
- Install Grafana
sudo dnf install grafana -y
Takes a while...
- Enable and start the systemd unit
sudo systemctl enable --now grafana-server
Verify
sudo systemctl status grafana-server
5.5. Firewall (Security)
The firewall is managed by files present in /etc/firewalld
cd /etc/firewalld/
ls -l
cp /usr/lib/firewalld/services/ssh.xml /etc/firewalld/services/example.xml
sudo firewall-cmd --add-service=grafana --permanent
sudo firewall-cmd --add-port=3000/tcp --permanent
sudo firewall-cmd --reload
- Edit the config file
sudo vim /etc/grafana/grafana.ini
- Change the default values:
Set the "http_addr" option to "localhost", the "http_port" option to "3000", and the "domain" option to your domain name as below. For this example, the domain name is "grafana.example.io".
For a non-standard port, be sure to uncomment the setting (remove the leading ;):
[server]
http_port = 4000
# The public facing domain name used to access grafana from a browser
domain = grafana.example.io
7.1. Turn off the nasty default analytics reporting
[analytics]
reporting_enabled = false
7.2. Restart the grafana service to apply a new configuration.
sudo systemctl restart grafana-server
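Before or after restarting, it can be handy to confirm that the edited values are the ones actually in the file. The sketch below reads INI text with Python's configparser and reports the settings changed above. It is an illustration only: the read_grafana_settings helper is hypothetical, the sample text stands in for /etc/grafana/grafana.ini, and it assumes every setting in the file sits under a section header.

```python
import configparser


def read_grafana_settings(ini_text):
    """Extract the settings edited in this walkthrough from grafana.ini text."""
    # interpolation=None so literal % characters in values are not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        "http_addr": parser.get("server", "http_addr", fallback=None),
        # Fallbacks mirror Grafana's defaults when a line is still commented out.
        "http_port": parser.getint("server", "http_port", fallback=3000),
        "domain": parser.get("server", "domain", fallback=None),
        "reporting_enabled": parser.getboolean(
            "analytics", "reporting_enabled", fallback=True),
    }


# Stand-in for the real file; lines starting with ; are treated as comments.
sample = """
[server]
;http_port = 4000
http_addr = localhost
http_port = 3000
domain = grafana.example.io

[analytics]
reporting_enabled = false
"""

settings = read_grafana_settings(sample)
print(settings)
```

To check the real file, the text could be loaded with open("/etc/grafana/grafana.ini").read(), which typically requires root or membership in the grafana group.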
Reverse Proxy Setup
- Install NGINX
sudo dnf install nginx -y
- Create a new server block for grafana
/etc/nginx/conf.d/grafana.conf
# Required to proxy Grafana Live WebSocket connections
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 80;
    server_name grafana.example.io;
    rewrite ^ https://$server_name$request_uri? permanent;
}

server {
    listen 443 ssl http2;
    server_name grafana.example.io;
    root /usr/share/nginx/html;
    index index.html index.htm;

    ssl_certificate /etc/letsencrypt/live/grafana.example.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.example.io/privkey.pem;

    access_log /var/log/nginx/grafana-access.log;
    error_log /var/log/nginx/grafana-error.log;

    location / {
        proxy_pass http://localhost:3000/;
    }

    # Proxy Grafana Live WebSocket connections
    location /api/live {
        rewrite ^/(.*) /$1 break;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $http_host;
        proxy_pass http://localhost:3000/;
    }
}
- Next, verify the Nginx configuration
sudo nginx -t
Should see:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
- Start and enable the Nginx service
sudo systemctl enable --now nginx
sudo systemctl status nginx
Install Prometheus (Saturday)
Rocky Prometheus Install
- Add a new user "prometheus"
Create a system user with no home directory and no login shell to run Prometheus.
sudo adduser -M -r -s /sbin/nologin prometheus
- Create a new configuration directory "/etc/prometheus" and the data directory "/var/lib/prometheus"
(Only needed for running as service)
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
Note: All Prometheus configuration lives in the "/etc/prometheus" directory, and all Prometheus data will be saved to the directory "/var/lib/prometheus".
Installing Prometheus on Rocky Linux
Install Prometheus monitoring system manually from the tarball or tar.gz file.
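- Change the working directory to "/usr/src" and download the Prometheus binary
cd /usr/src
wget https://github.com/prometheus/prometheus/releases/download/v3.0.1/prometheus-3.0.1.linux-amd64.tar.gz
Extract
tar -xzf prometheus-3.0.1.linux-amd64.tar.gz
cd into the extracted folder and record its path (the later steps refer to it as $PROM_SRC)
cd prometheus-3.0.1.linux-amd64
export PROM_SRC=$PWD
Run the binary to confirm it works, for example:
./prometheus --version
If the binary works, proceed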
- Copy all Prometheus configurations to the directory "/etc/prometheus" and the binary file "prometheus" to the "/usr/local/bin" directory.
- Move the Prometheus configuration "prometheus.yml" to the directory "/etc/prometheus".
sudo mv $PROM_SRC/prometheus.yml /etc/prometheus/
- Move the binary files "prometheus" and "promtool" to the directory "/usr/local/bin/".
sudo mv $PROM_SRC/prometheus /usr/local/bin/
sudo mv $PROM_SRC/promtool /usr/local/bin/
- Move Prometheus console templates and libraries to the "/etc/prometheus" directory. mv moves directories without a -r flag; skip this step if your release tarball does not include these directories.
sudo mv $PROM_SRC/consoles /etc/prometheus
sudo mv $PROM_SRC/console_libraries /etc/prometheus
- Edit the Prometheus configuration "/etc/prometheus/prometheus.yml"
sudo vim /etc/prometheus/prometheus.yml
Under the "scrape_configs" option, you may need to add monitoring jobs.
The default configuration comes with the default monitoring job name "prometheus" and the target server "localhost" through the "static_configs" option.
Change the target from "localhost:9090" to the server IP address "192.168.1.10:9090" as below.
Note:
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.1.10:9090"]
- Change the ownership of the configuration and data directories to the user "prometheus".
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Basic Prometheus installation finished, hopefully.
Configure Prometheus
- Create a new systemd service
sudo vim /etc/systemd/system/prometheus.service
Copy and paste the following configuration.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
- Reload the systemd manager to apply a new config.
sudo systemctl daemon-reload
- Start and enable the Prometheus service
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
The Prometheus monitoring tool is now accessible on TCP port 9090.
- Visit the IP address with port "9090"
http://192.168.1.10:9090/
And you will see the Prometheus query dashboard below.
Prometheus query dashboard
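The same query the dashboard runs can also be issued from a script through the Prometheus HTTP API at /api/v1/query. The Python sketch below is a minimal illustration using only the standard library; the helper names are hypothetical, the address is the example IP used above, and the sample response is hand-written in the shape the API returns for an instant vector.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Map each series' instance label to its current value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def fetch_up(base_url="http://192.168.1.10:9090"):
    """Query the built-in 'up' metric (needs a reachable Prometheus server)."""
    with urlopen(query_url(base_url, "up"), timeout=5) as resp:
        return parse_instant_vector(json.load(resp))


# Hand-written sample response so the parsing can be tried offline.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"__name__": "up", "instance": "192.168.1.10:9090",
                "job": "prometheus"},
     "value": [1700000000.0, "1"]},
]}}
print(parse_instant_vector(sample))
```

A value of 1 for a target means the last scrape succeeded; calling fetch_up() against the server configured above would be a quick way to confirm the scrape job is working.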
Reflection Questions
- What questions do you still have about this week?
- How can you apply this now in your current role in IT? If you're not in IT, how can you look to put something like this into your resume or portfolio?
ProLUG Links
Discord: https://discord.com/invite/m6VPPD9usw
Youtube: https://www.youtube.com/@het_tanis8213
Twitch: https://www.twitch.tv/het_tanis
ProLUG Book: https://leanpub.com/theprolugbigbookoflabs
KillerCoda: https://killercoda.com/het-tanis