Vezha Diary #099: 7-day forensic PDF report on error IPs for faster incident analysis

Reading Time: 6 minutes

Series: “Vezha story week by week” • Issue dated 04/20/2026

In #098 we added real access.log, RPS, and error-rate parsing to Vezha for early detection of suspicious activity. The feedback from operations teams afterward was very specific: “The signal is there. Now give us a way to assemble an evidentiary picture of the incident quickly, without doing it by hand.”

That request became the focus of issue #099. We have added a 7-day forensic PDF report on error IPs to web-stats. Its job is not to produce another “pretty” chart, but to give the team a ready-made document for action: localization, escalation, recovery.

Context: Where the process broke down before this release

In most production environments the picture is typical. The anomaly is visible quickly: the share of 4xx/5xx responses grows, the traffic profile changes, repeated requests to sensitive endpoints appear. But between “we see the problem” and “we made a decision” there is a manual gap.

Usually this gap looks like this:

the engineer on duty filters logs by time window;
picks out, one by one, the IP addresses with the highest share of errors;
summarizes intermediate conclusions in a table or ticket;
explains to colleagues why this particular activity deserves a reaction right now.

The weakest point here is obvious: the team spends its resources not on the action itself, but on preparing the material for the action. During peak hours, this is the most expensive way to lose time.

What appeared in #099

A new feature in web-stats generates a 7-day forensic report as a PDF in one controlled run. The user stays in one workspace: no switching between several tools, no copying data by hand, no assembling the evidence base from scratch.

As a result, the team receives:

a structured breakdown by error IP for the selected period;
activity dynamics over time, to detect repeating waves;
a document ready for SOC/Security, SRE, and the Incident Manager.

The key idea is simple: give people not “raw fragments” but material they can make decisions with immediately.

How the process is built under the hood

We did not build a monolithic “long” query, which is hard to control and even harder to recover after a failure. Instead, we implemented a staged pipeline with visible progress in the UI.

Starting the task: the operator initiates forensic collection from web-stats.
Scan window: the agent walks through the time interval sequentially and prepares aggregated data.
Step-by-step transfer: data is sent in parts, without one “heavy” bulk upload.
Final assembly: the server compiles the result, forms the final package, and renders the PDF.
Export: the finished file is delivered to the same workspace where the team is handling the incident.

This approach provides two advantages at the same time: predictability for the operator and stable load for the system.
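The staged idea can be illustrated with a minimal Python sketch. It is not Vezha's actual implementation: the record shape `(timestamp, ip, status)`, the one-day step, and the function names `scan_window` and `assemble` are assumptions made purely for illustration. The point is that the agent yields one small aggregate per step, and the server merges them while tracking progress.

```python
from collections import Counter
from datetime import datetime, timedelta


def scan_window(records, start, end, step=timedelta(days=1)):
    """Agent side (hypothetical): walk [start, end) in fixed steps and yield
    one small aggregate per step instead of a single heavy payload."""
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + step, end)
        errors = Counter(
            ip
            for ts, ip, status in records
            if chunk_start <= ts < chunk_end and status >= 400
        )
        yield chunk_start, errors
        chunk_start = chunk_end


def assemble(chunks, total_steps):
    """Server side (hypothetical): merge partial aggregates and record
    a progress percentage after each received step."""
    merged = Counter()
    progress = []
    for done, (_, errors) in enumerate(chunks, start=1):
        merged.update(errors)
        progress.append(round(100 * done / total_steps))
    return merged, progress


# Made-up sample data: (timestamp, client IP, HTTP status).
start = datetime(2026, 4, 13)
end = start + timedelta(days=7)
records = [
    (start + timedelta(days=0, hours=20), "203.0.113.7", 404),
    (start + timedelta(days=1, hours=20), "203.0.113.7", 403),
    (start + timedelta(days=1, hours=21), "198.51.100.2", 200),
    (start + timedelta(days=6, hours=20), "198.51.100.9", 500),
]

merged, progress = assemble(scan_window(records, start, end), total_steps=7)
print(merged.most_common(1))  # [('203.0.113.7', 2)]
print(progress)               # [14, 29, 43, 57, 71, 86, 100]
```

Because each step is small and independent, a failed step can in principle be retried on its own, which is the recovery property a single long query lacks.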

What has changed for different team roles

We evaluated the release not only by its technical implementation, but by how different roles use it within a single incident.

For NOC/on-duty engineers: there is no longer a need to explain by hand why a burst of errors is not “random noise”. There is a ready-made document with a clear structure.

For SRE: it is easier to agree on changes to the perimeter, rate limits, or an isolation scenario when the evidence base is already collected and unified.

For security/SOC: the team starts its own analysis faster, because the data arrives in a usable format rather than as raw excerpts.

For the shift manager: decisions are made faster because everyone is looking at the same facts, rather than at different versions of handwritten notes.

Anonymized practical case

In one environment, the team saw a moderate but steady increase in the share of 4xx responses during evening hours. Overall RPS stayed within expected limits, so without further analysis the situation could have looked like “temporary noise”.

After running the 7-day forensic report, a repeating pattern became visible: the activity was concentrated on a small set of endpoints and came in waves at almost identical intervals. This made it possible to quickly coordinate the steps between the teams:

strengthen the perimeter protection rules for specific request scenarios;
tighten limits for individual request patterns;
record the incident analysis in a single document without manual duplication.

The most important thing in this case: the team went from discussing symptoms to taking action within one shift.
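One way to check for “waves with almost identical intervals” is to find where bursts start and measure how regular the gaps between them are. The sketch below is only an illustration under stated assumptions: the hourly counts are made up, the burst threshold and the 10% tolerance on the coefficient of variation are arbitrary, and nothing here describes how the Vezha report itself detects patterns.

```python
from statistics import mean, pstdev


def burst_starts(counts, threshold):
    """Return indices where the series rises from below the threshold
    to at or above it (the start of each burst)."""
    starts = []
    previous = 0
    for i, value in enumerate(counts):
        if value >= threshold and previous < threshold:
            starts.append(i)
        previous = value
    return starts


def looks_periodic(starts, max_cv=0.1):
    """True when the gaps between bursts are nearly identical, i.e. the
    coefficient of variation of the gaps is at most max_cv."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if len(gaps) < 2:
        return False  # too few bursts to judge regularity
    return pstdev(gaps) / mean(gaps) <= max_cv


# Made-up hourly 4xx counts for one endpoint: a burst every 6 hours.
hourly_4xx = [3, 2, 40, 4, 3, 2, 1, 3, 42, 2, 3,
              1, 2, 3, 39, 4, 2, 1, 3, 2, 41, 3]
starts = burst_starts(hourly_4xx, threshold=20)
print(starts)                  # [2, 8, 14, 20]
print(looks_periodic(starts))  # True
```

Regular gaps of this kind are a hint of automated activity rather than organic traffic, which is why they are easier to spot over several days than in a single time slice.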

Boundaries of security and commercial model

The forensic PDF by error IP is available on the Total plan. This is a conscious decision, because this segment has the highest demand for in-depth incident response, remote operations, and a stable response schedule.

At the same time, we do not mix the goals: basic monitoring and daily metrics remain more widely available, and the “heavy” incident layer is activated where the business really needs it every day.

What does this mean for Vezha’s strategy

This release is a good representation of neemle’s approach to product development. Vezha remains a small, fast and secure real-time platform built on Rust with centralized management of monitoring points. Each update should benefit production without increasing operational complexity.

The 7-day forensic report is exactly about that:

less manual routine at a critical moment;
faster transfer of context between roles;
clearer decisions based on facts, not assumptions.

It is also important that Vezha has built-in proxying to Prometheus, so new capabilities integrate naturally into existing monitoring setups without the need to “break” team processes.

Operational effect in process numbers

We deliberately measured not only technical metrics, but also how teams behave during incidents. Since the forensic PDF appeared, the time to the first agreed decision has decreased, the number of manual intermediate artifacts has decreased, and handoffs between shifts have become more predictable.

In other words, the product does not just “see the problem”, but helps the team to steadily bring the incident to a conclusion.

How to use the report in the first 30 minutes of the incident

To get the most out of the release, we recommend a simple working routine that the team can run immediately after an anomaly trigger. It does not require separate tools and fits well into the standard operating process.

Record the event. Define the time window when the deviation started and create incident context in your tracker.
Run the forensic PDF in web-stats. This gives all participants in the incident a single set of facts.
Separate the noise from the risk. See which IPs and request patterns generate the bulk of the errors.
Make a tactical decision. Agree on the first set of actions: filtering, rate-limit, perimeter changes, SOC escalation.
Capture post-actions. After stabilizing the environment, add structural improvements to the backlog to reduce case repetition.

When this cycle is worked out, the team improvises less in the “hot” mode and moves more quickly to a controlled response scenario.
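The “separate the noise from the risk” step can be sketched as ranking IPs by their contribution to all errors. This is a minimal illustration, not the report's actual logic: the `(ip, status)` record shape, the function name `top_error_ips`, and the 80% cutoff are assumptions.

```python
from collections import Counter


def top_error_ips(records, share=0.8):
    """Return the IPs (highest error count first) that together account
    for at least `share` of all 4xx/5xx responses."""
    errors = Counter(ip for ip, status in records if status >= 400)
    total = sum(errors.values())
    selected, covered = [], 0
    for ip, count in errors.most_common():
        if covered / total >= share:
            break
        selected.append((ip, count))
        covered += count
    return selected


# Made-up sample: 10 errors in total, spread unevenly across three IPs.
records = (
    [("203.0.113.7", 404)] * 6
    + [("198.51.100.2", 403)] * 3
    + [("192.0.2.10", 500)]
    + [("192.0.2.10", 200)] * 20
)
print(top_error_ips(records))
# [('203.0.113.7', 6), ('198.51.100.2', 3)]
```

Here two addresses explain 90% of the errors, so they are the natural first candidates for filtering or rate limiting, while the third looks like ordinary traffic with an occasional failure.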

Typical mistakes that #099 helps to avoid

During interviews with technical teams, we consistently saw several recurring anti-patterns. The new report doesn’t automatically “cure” them, but it does reduce the chance of error thanks to the way the data is structured.

Mistake #1: Reacting only to RPS. High traffic in itself is not always a problem. The key signal is often in the imbalance between volume and error rate.
Mistake #2: Fragmentary analysis of a single time segment. Without a 7-day window, repeating waves are easy to miss.
Mistake #3: Verbal decisions without documented facts. This creates chaos when handing an incident over between shifts.
Mistake #4: Escalating too late. If the evidence base is slow to build, the team misses the window for cheap and quick intervention.

That’s why we emphasized the PDF format: it disciplines the process, standardizes communication, and reduces the “game of telephone” effect between roles.
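The first mistake in the list above, reacting only to RPS, can be made concrete with a small sketch that labels time windows by error rate first and by volume second. The window shape `(rps, errors, requests)`, the limits, and the label names are all assumptions chosen for illustration.

```python
def flag_windows(windows, rps_limit, error_rate_limit):
    """Label each (rps, error_count, request_count) window."""
    labels = []
    for rps, errors, requests in windows:
        error_rate = errors / requests if requests else 0.0
        if error_rate > error_rate_limit:
            labels.append("investigate")  # errors are high, whatever the volume
        elif rps > rps_limit:
            labels.append("busy")         # high volume, healthy error rate
        else:
            labels.append("ok")
    return labels


# Made-up one-minute windows.
windows = [
    (120, 12, 7200),     # normal traffic, ~0.2% errors
    (900, 540, 54000),   # traffic spike, 1% errors
    (130, 1560, 7800),   # normal traffic, 20% errors
]
print(flag_windows(windows, rps_limit=500, error_rate_limit=0.05))
# ['ok', 'busy', 'investigate']
```

An alert based on RPS alone would fire on the second window and stay silent on the third, which is exactly the one that deserves attention.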

Where this opportunity gives the greatest effect

In our experience, the environments that benefit most are those where an incident cycle already exists but “slips” at the data preparation stage. Most often these are:

e-commerce and high-load services with wave-like traffic during the day;
SaaS-platforms, where some endpoints have increased sensitivity to automated requests;
infrastructures with distributed teams working across multiple shifts or geographies.

In such environments, what becomes decisive is not only the fact of detection, but the speed of the transition to a coordinated response. #099 closes exactly that gap.

A practical checklist before enabling the feature in your environment

For the launch to go smoothly, we advise going through a short readiness check before the demo:

agree on who in the team owns incident resolution during evening and night shifts;
determine the standard escalation channel (SOC, SRE lead, on-call manager);
fix the “action threshold” for launching the forensic process, to avoid unnecessary back-and-forth;
agree on which post-incident actions must be added to the backlog after stabilization.

This takes little time, but dramatically improves the quality of the first month of using the feature.
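An “action threshold” is easier to agree on when it is written down as an explicit rule. The sketch below shows one possible form, assuming the team triggers a forensic run only when the error rate stays above a limit for several consecutive windows. The 5% limit, the three-window requirement, and the function name are hypothetical values, not product defaults.

```python
def should_launch_forensics(error_rates, threshold=0.05, sustained=3):
    """True once the error rate stays above `threshold` for `sustained`
    consecutive windows; a single spike does not trigger a run."""
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= sustained:
            return True
    return False


# Made-up error rates per window.
print(should_launch_forensics([0.01, 0.09, 0.02, 0.08, 0.01]))  # False
print(should_launch_forensics([0.01, 0.06, 0.07, 0.09]))        # True
```

Requiring a sustained deviation is one simple way to keep isolated spikes from launching the process again and again.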

Summary of issue #099

There are no loud promises this week. There is a practical step that saves time every day and reduces chaos in critical situations. These are the steps that form a mature product: when each release makes it easier for the team to work in a real environment.

If you want to see how this scenario works in your environment, leave a demo request at vezha.io. We will show how Vezha integrates into your current infrastructure with zero CAPEX and fair OPEX for scaling.

Want to test Vezha on your infrastructure? Go to vezha.io and send a demo request.