Original URL: https://trevorsmale.github.io/techblog/post/psc6/
Intro
Monitoring and parsing logs is essential to operational intelligence. Computers typically produce immense amounts of data, far more than a human can interpret in real time. To extract meaning from this data, we must intelligently filter event logs into clear, comprehensible, and actionable items.
Achieving this is easier said than done. This unit offers general advice on the art of making complex information comprehensible. 1
Worksheet
Discussion Post 1
Review chapter 15 of the SRE book:
2
There are 14 references at the end of the chapter. Follow them for more information. One of them, by Julia Evans 3, should be reviewed for question "c".
Question
- What are some concepts that are new to you?
Answer
- Core dumps, memory dumps, and stack traces.
I have heard the terms before and understand the concepts to a basic degree. I decided to do a bit of further reading to understand each of the dumps and traces, so here is the gist.
- A core dump is a snapshot of a process's state (its memory and registers) at the moment it crashed.
- A memory dump is a snapshot of the system's Random Access Memory (RAM) at the moment of the crash.
- A stack trace is a record of the chain of function calls that were active when an error occurred, read from the point of failure (the error) back to the beginning (the original call). The way I personally conceptualize this is through comparison to root cause analysis, something I am familiar with.
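To make the stack trace idea concrete, here is a minimal Python sketch. The function names (`parse_port`, `load_config`) are hypothetical and exist only for illustration. It triggers an error three calls deep and captures the resulting trace as text, so the chain from the first call down to the failing line can be read directly.

```python
import traceback


def parse_port(value):
    # Innermost frame: int() raises ValueError on non-numeric input.
    return int(value)


def load_config(settings):
    # Middle frame: passes the bad value further down the call chain.
    return parse_port(settings["port"])


def main():
    try:
        load_config({"port": "not-a-number"})
    except ValueError:
        # format_exc() renders the call chain as text: the outermost
        # call is listed first and the failing call last.
        return traceback.format_exc()
    return ""


trace_text = main()
print(trace_text)
```

Reading the output from the bottom up is the "root cause" direction: the last line names the error, and each frame above it shows who called whom.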
- Host intrusion detection systems (HIDS) or host agents
A few ideas from the book 2:
"Modern (sometimes referred to as 'next-gen') host agents use innovative techniques aimed at detecting increasingly sophisticated threats. Some agents blend system and user behavior modeling, machine learning, and threat intelligence to identify previously unknown attacks."
"Host agents always impact performance, and are often a source of friction between end users and IT teams. Generally speaking, the more data an agent can gather, the greater its performance impact may be because of deeper platform integration and more on-host processing."
Question
- There are 5 conclusions drawn, do you agree with them? Would you add or remove anything from the list?
Answer
- To begin with, here are the conclusions drawn:
- "Debugging is an essential activity whereby systematic techniques, not guesswork, achieve results."
- "Security investigations are different from debugging. They involve different people, tactics, and risks."
- "Centralized logging is useful for debugging purposes, critical for investigations, and often useful for business analysis."
- "Iterate by looking at some recent investigations and asking yourself what information would have helped you debug an issue or investigate a concern."
- "Design for safety. You need logs. Debuggers need access to systems and stored data. However, as the amount of data you store increases, both logs and debugging endpoints can become targets for adversaries."
Firstly, I would like to preface this answer with a disclaimer: I lack the competency to criticize or dissect O'Reilly's book. With that out of the way, I am going to target the first point.
My only criticism here is that the point is very broad in scope compared to the more granular topics specific to this book and chapter.
Question
- In Julia Evans's debugging blog, which shows that debugging is just another form of troubleshooting, what useful things do you learn about the relationship between these topics?
Answer
- Both debugging and troubleshooting involve:
- Proceduralization: If a clear procedure doesn't exist, begin documenting and formalizing the process into a repeatable method.
- Humility: Acknowledge that you might be the cause of the problem. This is especially important in development.
- Methodical Experimentation: Form a hypothesis, then devise a controlled method to test it: use unit tests in development, or targeted scripts and commands when debugging.
- One Step at a Time: Tackle problems incrementally: "eat the elephant one bite at a time."
- Strong Foundations: Write debuggable code and build robust systems. A good foundation makes issues easier to isolate.
- More Is Better: Verbose error messages provide more clues, so enable detailed output when possible.
Question
- Are there any techniques you already do that this helps solidify for you?
Answer
Yes, I try to create excellent documentation with respect for my future self or others I may need to share it with. This involves numbered procedural steps with inputs and outputs, if that is the nature of the work. Otherwise, I write in a general manner that is legible to others.
Discussion Post 2
Read Monitoring Distributed Systems 4
Question
- What interesting or new things do you learn in this reading? What may you want to know more about?
Answer
- Interesting Concept:
One of the general themes I gathered from this article is low cognitive overhead. It's a concept I'm very familiar with from accessibility-focused design. Too much information overwhelms our ability to observe, absorb, and decide effectively.
For example, public signage must be simple, legible, and self-descriptive through clear graphic composition, guiding the eye where to look first and in which direction to proceed. This closely parallels the need for simplicity in monitoring and alerting systems. When such systems become overly complex, they can lead to misinterpretation, miscommunication, and fatigue due to information overload.
Information must be derived and presented in a way that is easily consumable, where errors are unmistakable, without exhausting the viewer.
New concepts
- White box monitoring systems vs. Black box monitoring systems.
- Conducting ad hoc retrospective analysis (i.e., debugging)
- (4 Golden signals) Latency, Traffic, Saturation, Errors
  - This one in particular relates strongly to the `USE` acronym I had recently picked up from Het: Utilization, Saturation, Errors
Question
- What are the "4 golden signals"?
Answer
- Latency
- Traffic
- Saturation
- Errors
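As a rough illustration of how the four signals could be derived from raw data, here is a minimal sketch. Everything in it is an assumption made for the example: the `Request` record, counting only HTTP 5xx responses as errors, and using CPU usage as the saturation measure. Real services would pick the resource and error definition that fit them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: tail latency often says more than the mean.
    p95_index = math.ceil(0.95 * len(durations)) - 1
    # Assumption: only 5xx responses count as errors in this sketch.
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[p95_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


sample = [
    Request(20, 200),
    Request(25, 200),
    Request(30, 200),
    Request(40, 200),
    Request(900, 500),
]
signals = golden_signals(sample, window_seconds=10, cpu_used=3.2, cpu_capacity=4.0)
print(signals)
```

The single slow, failing request dominates both the latency and error figures here, which is the kind of outlier an average would hide.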
Question
- After reading these, why is immutability so important to logging?
Answer
- Tamper Resistance: Immutable logs cannot be altered or deleted without detection, which helps prevent covering up malicious activity or mistakes.
- Auditability: Logs serve as historical records. If they can be changed, audits and investigations lose their value.
- Debugging Integrity: Developers and operators rely on logs to trace errors. Mutable logs can introduce false positives or hide root causes.
- Regulatory Compliance: Standards like HIPAA, PCI-DSS, and GDPR often require tamper-evident or immutable log storage.
- Forensic Value: In incident response, immutable logs serve as trustworthy evidence for timelines and breach analysis.
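One common way to make tampering detectable is to chain entries together with hashes, so that changing any earlier entry invalidates everything after it. The sketch below is a simplified illustration of that idea, not a production design. The entry layout and function names are hypothetical, and it provides tamper evidence rather than true immutability.

```python
import hashlib

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, message):
    # Each hash covers the message AND the previous entry's hash.
    return hashlib.sha256(f"{prev_hash}\n{message}".encode("utf-8")).hexdigest()


def append_entry(chain, message):
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"message": message, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, message)}
    )


def verify_chain(chain):
    """Return True only if every entry still matches its recorded hashes."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["message"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
for line in ["user alice logged in", "alice ran sudo", "user alice logged out"]:
    append_entry(audit_log, line)

print(verify_chain(audit_log))  # True: untouched chain verifies

audit_log[1]["message"] = "nothing happened here"  # simulate tampering
print(verify_chain(audit_log))  # False: the edit is detected
```

One limitation worth noting: removing entries from the end of the chain is not detected by this check alone, so the most recent hash would need to be stored somewhere the attacker cannot reach.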
Question
- What do you think the other required items are for logging to be effective?
Answer
In order to be effective, logs must be:
- Trustworthy: Logs should be immutable.
- Time-stamped: Every entry needs a synced timestamp.
- Clear levels: Use `INFO`, `ERROR`, `DEBUG`, etc., to show importance.
- Structured: Format logs so machines and humans can read them.
- Context-rich: Include request IDs, user info, IPs, anything that helps trace the story.
- Centralized: Gather logs in one place for easy searching and alerting.
- Searchable: You should be able to find issues fast with good queries.
- Safe: Control who can see logs, since some contain sensitive info.
- Durable: Logs shouldn't disappear in a crash, so use backups and redundancy.
- Noise-controlled: Avoid flooding by rotating logs and capping log rates.
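Several of these properties (time-stamped, clear levels, structured, context-rich) can be shown in a few lines with Python's standard `logging` module. This is a minimal sketch under my own assumptions: the `JsonFormatter` class, the `checkout` logger name, and the `request_id` field are hypothetical names chosen for the example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            # UTC timestamp in ISO 8601 so entries from different hosts line up.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context passed through `extra=`; None when the caller omits it.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for a file or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.error("payment failed", extra={"request_id": "req-123"})
line = stream.getvalue().strip()
print(line)
```

Because each line is valid JSON, a central system can index fields such as `level` and `request_id` directly instead of parsing free text.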
Definitions
- Application: Logs from software applications showing events or errors.
- Host: Logs from the operating system, like kernel or authentication events.
- Network: Logs capturing traffic, connections, and protocol activity.
- DB: Logs from databases showing queries, errors, and access.
- Immutable: Logs that cannot be changed once written.
- RFC 3164 BSD Syslog: Older syslog format with simple priority and message structure.
- RFC 5424 IETF Syslog: Modern syslog format with structured data and better timestamps.
- Systemd Journal: Binary log format used by systemd with metadata support.
- Log rotation: Archiving or deleting old logs to manage disk space.
- Rsyslog: Advanced syslog daemon for filtering, formatting, and remote logging.
- Log aggregation: Collecting logs from multiple sources for analysis.
- ELK: Stack using Elasticsearch, Logstash, and Kibana to manage logs.
- Splunk: Tool for searching and analyzing machine-generated logs.
- Graylog: Open-source platform for central log collection and analysis.
- Loki: Log system that indexes by labels, optimized for Grafana.
- SIEM: Tools that collect and analyze security data for threat detection.
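Both syslog formats start with a priority field such as `<34>`, where the number is `facility * 8 + severity`. The sketch below decodes that field and guesses the format from what follows it. The version-digit check is only a heuristic I am assuming for illustration (RFC 5424 places a version number right after the priority), and `decode_header` is a hypothetical helper name.

```python
import re

# Severity names in numeric order 0..7, as defined by the syslog RFCs.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# "<PRI>" optionally followed by an RFC 5424 version number and a space.
HEADER_PATTERN = re.compile(r"^<(\d{1,3})>(\d{1,3} )?")


def decode_header(line):
    match = HEADER_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a syslog PRI field")
    pri = int(match.group(1))
    return {
        "facility": pri // 8,            # which subsystem produced the message
        "severity": SEVERITIES[pri % 8],  # how urgent it is
        # Heuristic: a version digit after the PRI suggests RFC 5424.
        "format": "RFC 5424" if match.group(2) else "RFC 3164",
    }


bsd_line = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick"
ietf_line = "<165>1 2003-08-24T05:14:15.000003-07:00 host app - ID47 - started"

print(decode_header(bsd_line))
print(decode_header(ietf_line))
```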
Lab
RSYSLOG
The rocket-fast system for log processing
Basic Steps:
- Ensure Rsyslog is installed and running on both the control-plane and target node.
- Configure log forwarding over a UDP port.
- Edit the Rsyslog config to split out logs.
Question
Why do we split out the logs in this lab? Why don't we just aggregate them to one place?
Answer
- We are aggregating the logs.
- So that we can tell where the logs are coming from.
- Each node gets its own directory in `/var/log`.
Question
- What do we split them out by?
Answer
- We split them by hostname.
Question
- How does that template configuration work?
Answer
- It builds the output path dynamically from each message's hostname, so logs are written to a directory named after the target host in `/var/log`.
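To show the idea behind a hostname-based template, here is a small Python emulation of the substitution step. It is not rsyslog configuration and not the lab's actual template. The template string, the file name portion, and the `expand` helper are assumptions for illustration. Only the `%HOSTNAME%` and `%PROGRAMNAME%` placeholders follow the property style rsyslog uses.

```python
import re

# Hypothetical template: one directory per sending host under /var/log.
TEMPLATE = "/var/log/%HOSTNAME%/%PROGRAMNAME%.log"


def expand(template, properties):
    """Replace each %NAME% placeholder with the message's property value."""

    def lookup(match):
        return properties[match.group(1).lower()]

    return re.sub(r"%([A-Za-z]+)%", lookup, template)


# Two messages from different hosts end up in different directories.
path_a = expand(TEMPLATE, {"hostname": "node01", "programname": "sshd"})
path_b = expand(TEMPLATE, {"hostname": "controlplane", "programname": "sshd"})
print(path_a)
print(path_b)
```

Because the path is computed per message, the receiver needs no per-host rules: a new node gets its own directory the first time it sends a log.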
Question
- Are we securing this communication in any way, or do we still need to configure that?
Answer
- No, this communication is not secured. That still needs to be configured.
Lab
- Complete the lab here: https://killercoda.com/het-tanis/course/Linux-Labs/102-monitoring-linux-logs
Question
- Does the lab work correctly, and do you understand the data flow?
Answer
- Yes
- Promtail (collects)
- Loki (stores)
- Grafana (visualizes)
loki-write.py
Question
- Can you see it in your Grafana?
Answer
- Yes Scott is too awesome!
Question
- Can you modify the file loki-write.py to say something related to your name?
Answer
msg = "On server {host} detected error - Treasure Wuz Here".format(host=host)
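For context, a message like this can be wrapped in the JSON body that Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts. The sketch below only builds the payload and does not send it. It is not the lab's `loki-write.py`, whose contents are not shown here, and the label values and `build_push_payload` helper are hypothetical.

```python
import json
import socket
import time


def build_push_payload(message, labels):
    """Wrap one log line in the stream format used by Loki's push API."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    timestamp_ns = str(time.time_ns())
    return {"streams": [{"stream": labels, "values": [[timestamp_ns, message]]}]}


host = socket.gethostname()
msg = "On server {host} detected error - Treasure Wuz Here".format(host=host)

# Hypothetical labels; Loki indexes these, not the message text.
payload = build_push_payload(msg, {"job": "loki-write-demo", "host": host})
body = json.dumps(payload)
print(body)

# To deliver it, POST `body` with the header "Content-Type: application/json"
# to the push endpoint of your Loki instance.
```

Keeping the label set small matters because Loki indexes labels rather than message content.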
Question
- Can you modify that to see the actual entries?
Answer
- Yes
Lab
Complete the killercoda lab found here: https://killercoda.com/het-tanis/course/Linux-Labs/108-kafka-to-loki-logging
Question
- Did you get it all to work?
Answer
- Yes
Question
- Does the flow make sense in the context of this diagram?
Answer
- Yes
- Uses `kcat` to write out to Kafka
- Uses Promtail to receive the messages from Kafka
- Promtail pushes to Loki
- Displayed by Grafana
Question
Can you find any configurations or blogs that describe why you might want to use this architecture or how it has been used in the industry?
Answer
- Kafka interconnects log `Producers` and log `Ingesters`. Kafka can ingest logs from all types of sources so they can be analyzed in real time.
- Apache Kafka, according to Apache, is the most popular open-source stream-processing software. 5
ProLUG Links
- Discord: https://discord.com/invite/m6VPPD9usw
- YouTube: https://www.youtube.com/@het_tanis8213
- Twitch: https://www.twitch.tv/het_tanis
- ProLUG PSC Repo: https://github.com/ProfessionalLinuxUsersGroup/psc
- ProLUG PSC Book: https://professionallinuxusersgroup.github.io/psc/
- ProLUG Book of Labs: https://leanpub.com/theprolugbigbookoflabs
- KillerCoda: https://killercoda.com/het-tanis