Series: “Vezha story week by week” • Issue dated 04/20/2026
In #098, we added real access.log parsing for RPS and error rate to Vezha for early detection of suspicious activity. After that step, the feedback from operations teams was very specific: “The signal is there; now give us a way to assemble an evidentiary picture of the incident quickly, without doing it by hand.”
That request became the focus of issue #099. We added a 7-day forensic PDF report on error IPs to web-stats. Its job is not to produce yet another pretty chart, but to give the team a ready-made document for action: localization, escalation, recovery.
Context: Where the process broke down prior to this release
In most production environments the picture is typical. The anomaly itself is quickly visible: the share of 4xx/5xx grows, the traffic profile shifts, repeated requests to sensitive endpoints appear. But between “we see the problem” and “we made a decision” there is a manual gap.
Usually this interval looks like this:
- the on-call engineer filters logs by time window;
- hand-picks the IP addresses with the highest error rates;
- summarizes intermediate findings in a table or ticket;
- explains to colleagues why this activity deserves a response right now.
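For illustration, the hand-picking step above can be sketched as a short script. Everything here is an assumption made for the example — a Combined Log Format access.log and arbitrary thresholds — not Vezha’s implementation:

```python
import re
from collections import defaultdict

# Matches the start of a Combined Log Format line: IP ... [timestamp] "request" status
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def top_error_ips(lines, min_requests=20, top_n=10):
    """Rank IPs by error rate (share of 4xx/5xx responses)."""
    total = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't parse
        ip = m.group("ip")
        total[ip] += 1
        if m.group("status")[0] in "45":
            errors[ip] += 1
    ranked = [
        (ip, errors[ip] / total[ip], total[ip])
        for ip in total
        if total[ip] >= min_requests  # ignore IPs with too little traffic
    ]
    ranked.sort(key=lambda r: r[1], reverse=True)
    return ranked[:top_n]
```

In practice you would pass an open access.log file object. The point is that even this minimal triage takes real engineering time during an incident, and it is exactly this preparation that the report automates.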
The weakest point here is obvious: the team spends its effort not on acting, but on preparing the material needed to act. During peak hours, this is the most expensive waste of time.
What appeared in #099
The new feature in web-stats generates the 7-day forensic PDF report in one controlled run. The operator stays in a single workflow: no switching between tools, no copying data by hand, no assembling an evidence base from scratch.
As a result, the team receives:
- a structured breakdown by error IP for the selected period;
- activity dynamics over time, to reveal recurring waves;
- a ready-made document for SOC/Security, SRE and the Incident Manager.
The key idea is simple: give people not raw fragments, but material they can base decisions on immediately.
How the process is built under the hood
We did not build a monolithic long-running query, which is hard to control and even harder to recover after a failure. Instead, we implemented a staged pipeline with visible progress in the UI.
- Task launch: the operator initiates forensic collection from web-stats.
- Window scan: the agent walks through the time interval sequentially and prepares the aggregated data.
- Phased transmission: data is sent in parts, without a heavy one-shot upload.
- Final assembly: the server compiles the results, forms the final package and renders the PDF.
- Export: the finished file is delivered into the same workflow where the team is handling the incident.
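As a sketch, the staged approach can look like this. All names, chunk sizes and the progress callback are illustrative assumptions, not Vezha’s actual internals (which are implemented in Rust):

```python
from typing import Callable, Iterable, Iterator, List

def scan_window(records: Iterable[dict], chunk_size: int = 1000) -> Iterator[List[dict]]:
    """Walk the time window and yield the prepared data in parts."""
    chunk: List[dict] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the last partial chunk
        yield chunk

def run_pipeline(records: Iterable[dict],
                 send: Callable[[List[dict]], None],
                 on_progress: Callable[[str], None] = print) -> int:
    """Phased transmission with visible progress; the server side
    would then assemble the received parts and render the PDF."""
    sent = 0
    for part in scan_window(records):
        send(part)  # each part is small; no heavy one-shot upload
        sent += len(part)
        on_progress(f"transmitted {sent} records")
    return sent
```

Chunking keeps memory usage flat on both sides and makes progress observable after every part, which is exactly the predictability/stable-load trade-off described above.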
This approach delivers two advantages at once: predictability for the operator and a stable load on the system.
What has changed for different team roles
We evaluated the release not only by its technical implementation, but also by how different roles use it within a single incident.
For NOC/on-call: no more manually explaining why a burst of errors is not “random noise”. There is a ready-made document with a clear structure.
For SRE: it is easier to agree on perimeter changes, rate limits or an isolation scenario when the evidence base is already collected and unified.
For Security/SOC: their own analysis starts faster, because the data arrives in a usable format rather than as raw excerpts.
For the shift manager: decisions come faster because everyone is looking at the same facts rather than at different versions of handwritten notes.
Anonymized practical case
In one environment, the team saw a modest but steady increase in the 4xx share during evening hours. Overall RPS stayed within expected limits, so without further analysis the situation could pass for temporary noise.
After running the 7-day forensic report, a repeating pattern became visible: the activity was concentrated on a small set of endpoints and came in waves at almost identical intervals. This made it possible to quickly coordinate steps between the teams:
- strengthen perimeter protection rules for specific access scenarios;
- tighten limits for specific request patterns;
- record incident analysis in a single document without manual duplication.
Most importantly, the team went from discussing symptoms to taking action within a single shift.
Security boundaries and the commercial model
The forensic PDF report on error IPs is available on Total– plans. This is a deliberate decision: it is this segment that has the highest demand for in-depth incident response, remote operations and a predictable response schedule.
At the same time, we do not mix up the goals: basic monitoring and daily metrics remain widely available, while the “heavy” incident layer is activated where the business genuinely needs it every day.
What this means for Vezha’s strategy
This release is a good illustration of neemle’s approach to product development. Vezha remains a small, fast and secure real-time platform built on Rust with centralized management of monitoring points. Each update should benefit production without increasing operational complexity.
The 7-day forensic report is exactly that:
- less manual routine at a critical moment;
- faster context transfer between roles;
- clearer decisions based on facts rather than assumptions.
Just as important, Vezha has built-in proxying to Prometheus, so new capabilities integrate naturally into existing monitoring loops without breaking the team’s processes.
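For example, if the proxied metrics are exposed over HTTP, scraping them from an existing Prometheus could look roughly like this. The job name, host, port and metrics path are all assumptions for illustration; check the Vezha documentation for the real endpoint:

```yaml
# Hypothetical scrape job for a Vezha metrics endpoint — names and
# ports here are placeholders, not documented values.
scrape_configs:
  - job_name: "vezha"
    metrics_path: /metrics
    static_configs:
      - targets: ["vezha-host.example.internal:8080"]
```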
Operational effect in process figures
We deliberately measured not only technical metrics, but also the team’s behavioral metrics during incidents. Since the forensic PDF arrived, the time to the first agreed decision has dropped, the number of manual intermediate artifacts has decreased, and synchronization between shifts has become more predictable.
In other words, the product doesn’t just “see the problem,” but helps the team consistently bring the incident to a conclusion.
How to use the report in the first 30 minutes of an incident
To get the most out of the release, we recommend a simple working ritual the team can run immediately after an anomaly triggers. It requires no separate tools and fits neatly into the standard operating process.
- Record the event. Define the time window when the deviation started and create incident context in your tracker.
- Run the forensic PDF in web-stats. This gives all incident participants a single set of facts.
- Separate the noise from the risk. See which IPs and request patterns drive the bulk of the errors.
- Make a tactical decision. Agree on the first set of actions: filtering, rate limits, perimeter changes, SOC escalation.
- Capture post-actions. Once the environment is stable, add structural improvements to the backlog to reduce recurrence.
When this cycle is practiced, the team improvises less in the “hot” mode and moves more quickly to a controlled response scenario.
Common mistakes that #099 helps to avoid
In our interviews with technical teams, we kept seeing several recurring anti-patterns. The new report doesn’t automatically cure them, but its data structure reduces the chance of error.
- Mistake #1: reacting to RPS alone. High traffic by itself is not always a problem. The key signal is often in the imbalance between volume and error rate.
- Mistake #2: fragmented analysis of a single time segment. Without a 7-day window, it’s easy to miss recurring waves.
- Mistake #3: verbal handoffs without documented facts. This creates chaos when transferring an incident between shifts.
- Mistake #4: escalating too late. If the evidence base is slow to build, the team misses the window for cheap and quick intervention.
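Mistake #1 can be made concrete with a small check: flag a window whose error rate far exceeds the baseline even when total traffic looks normal. The baseline and multiplier below are illustrative assumptions, not Vezha’s alerting logic:

```python
def is_suspicious(requests: int, errors: int,
                  baseline_rate: float = 0.02,
                  factor: float = 5.0) -> bool:
    """True when the window's error rate exceeds the baseline by `factor`,
    regardless of how ordinary the request volume itself looks."""
    if requests == 0:
        return False  # an empty window carries no signal
    return errors / requests > baseline_rate * factor
```

A window with normal RPS but a 15% error rate trips this check, while a high-traffic window with a 1% error rate does not: the imbalance, not the volume, is the signal.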
That’s why we focused on the PDF format: it disciplines the process, standardizes communication and reduces the “broken telephone” effect between roles.
Where this feature gives the greatest effect
In our experience, the greatest benefit goes to environments where the incident cycle already exists but stalls at the data-preparation stage. Most often these are:
- e-commerce and high-load services with wave-like traffic during the day;
- SaaS platforms where some endpoints are especially sensitive to automated requests;
- infrastructures with distributed teams working across multiple shifts or geographies.
In such environments, it is not detection itself but the speed of moving to a coordinated response that becomes decisive. #099 closes exactly that gap.
A practical checklist before enabling the feature in your environment
For the launch to go smoothly, we advise running a short readiness check before rollout:
- agree on who on the team owns incident decisions during evening/night shifts;
- define the standard escalation channel (SOC, SRE lead, on-call manager);
- fix an “action threshold” for launching the forensic process, to avoid unnecessary hesitation;
- agree on which post-incident actions must go into the backlog after stabilization.
It takes a little time, but dramatically improves the quality of the first month of using the feature.
Summary of issue #099
There are no loud promises this week. There is a practical step that saves time every day and reduces chaos in critical situations. These are the steps that form a mature product: when each release makes it easier for the team to work in a real environment.
If you want to see how this scenario works in your environment, leave a demo request at vezha.io. We’ll show how Vezha integrates into your current infrastructure with zero CAPEX and fair OPEX for scaling.
Want to test Vezha on your own infrastructure? Go to vezha.io and send a demo request.