Legacy Server Performance Analysis 2

Problem

During real exams with 90 concurrent clients, teachers observed that student screenshot previews in the Proctor view updated only every ~3 minutes instead of the configured 5-10 second interval. This made effective proctoring impossible.

Agreement

We agreed to build a stress testing tool (openbox-dos) to reproduce the issue in a controlled environment and identify the problem.

Approach

Phase 1: Adding Observability

We added profiling metrics to both the server and frontend to track:

Screenshot request latency (end-to-end)
Image decode/encode times
Queue sizes and active request counts
WebSocket ping latency
Image download times
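
The percentile figures reported later (p50, p99) can be derived from raw timing samples. The following is a minimal sketch of such a recorder, not the instrumentation actually added to the server or frontend; the class name LatencyRecorder, its methods, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyRecorder:
    """Hypothetical in-memory recorder for named latency samples (seconds)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, seconds):
        self._samples[name].append(seconds)

    @contextmanager
    def timed(self, name):
        # Measure the wall-clock duration of the wrapped block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def percentile(self, name, p):
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        ordered = sorted(self._samples[name])
        if not ordered:
            raise ValueError(f"no samples recorded for {name!r}")
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


recorder = LatencyRecorder()
with recorder.timed("image_decode"):
    sum(range(1000))  # stand-in for real work
```

Wrapping each pipeline stage in its own timed block is what makes a per-stage breakdown possible.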

Phase 2: Building the Stress Test Tool

We built openbox-dos, a clone of the openbox Rust app with the UI removed and with added support for multiple simulated clients and screenshot mocking. Frames are pregenerated and then loaded into memory, so that network bandwidth is the bottleneck and the load generator itself is not limited by disk I/O or computation.
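
The pregeneration idea can be sketched as follows. openbox-dos itself is written in Rust; this Python sketch only illustrates the technique, and the names pregenerate_frames and MockClient, the frame count, and the use of random bytes in place of real encoded screenshots are assumptions.

```python
import itertools
import os


def pregenerate_frames(count, size_bytes):
    # Build all payloads once, up front. Random bytes stand in for
    # real encoded screenshots in this sketch.
    return [os.urandom(size_bytes) for _ in range(count)]


class MockClient:
    """Hypothetical simulated client that replays in-memory frames."""

    def __init__(self, client_id, frames):
        self.client_id = client_id
        # Cycle over the shared frame pool; no disk access or image
        # encoding happens while the test is running.
        self._frames = itertools.cycle(frames)

    def next_frame(self):
        return next(self._frames)


frames = pregenerate_frames(count=4, size_bytes=1024)
clients = [MockClient(client_id, frames) for client_id in range(150)]
```

Because every client shares the same frame objects, memory use stays flat as the client count grows.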

Phase 3: Load Testing and Observation

We ran the stress test with 150 clients on 2 machines against the server infrastructure and observed the metrics dashboard.

Observations

Server Metrics (150 connected clients)

Metric	Value	Significance
Queue Size	135	90% of clients always waiting
Active Requests	15	Hard limit on concurrent requests
Connected Clients	150	All clients connected successfully
Uploads/sec	7.2	Actual throughput achieved
Screenshot Request (p50)	1.46s	End-to-end request latency
Image Decode (p50)	92.3ms	Server-side decode time
Image Encode (p50)	79.7ms	Server-side encode time
Image Save Total (p50)	1.13s	Full save pipeline (see breakdown below)
Image Download (p50)	327.2ms	Proctor downloading screenshots for web view

Image Save Total Breakdown

The “Image Save Total” metric (1.13s) measures the entire saveFrameOfSession() method in ImageService.java, which includes:

Database lookup: Find participation by session ID
Permission check: Verify upload is allowed via ScreenshotRequestManager
State validation: Check if exam is ongoing
Database persist: Save image metadata to database
Directory creation: Create screenshot folder if needed
Image Decode (92ms): Decode uploaded image bytes
Beta frame processing (if applicable):
- DB query for latest alpha frame
- File read from disk (21ms)
- Graphics merge operation (3.8ms)
Image Encode (80ms): Encode and write PNG to disk

The sub-metrics (decode: 92ms, encode: 80ms, file read: 21ms, merge: 3.8ms) total ~197ms, but the full pipeline takes 1.13s, which leaves roughly 930ms unaccounted for. The hidden cost is database operations under high concurrency - multiple reactive DB queries that become slow when the connection pool is saturated.
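
The arithmetic behind this breakdown can be checked with a short sketch. It uses the rounded p50 values quoted above; note that medians of individual stages do not strictly add up to the median of the whole pipeline, so the remainder is an approximation.

```python
# Rounded p50 values from the breakdown above, in milliseconds.
total_ms = 1130.0
sub_metrics_ms = {
    "decode": 92.0,
    "encode": 80.0,
    "file read": 21.0,
    "merge": 3.8,
}

measured_ms = sum(sub_metrics_ms.values())
unaccounted_ms = total_ms - measured_ms
unaccounted_share = unaccounted_ms / total_ms

print(f"measured sub-steps: {measured_ms:.1f} ms")
print(f"unaccounted:        {unaccounted_ms:.1f} ms")
print(f"share of total:     {unaccounted_share:.0%}")
```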

Analysis

Screenshot request queue was constantly backed up - With 150 clients and only 15 concurrent request slots, 135 clients were always waiting in queue.
Screenshot request times were abnormally high - End-to-end screenshot requests took 1.46s (p50), with p99 reaching 2.96s.
Image downloads for the Proctor view were also slow - Even with just 3 web views open in the Proctor, image download times increased significantly (327ms p50, up to 1.27s max).
The concurrent request limit was the bottleneck - The server was configured to allow only 15 concurrent screenshot requests at a time.
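
The queueing behavior described above can be reproduced with a small simulation. This is a sketch of a generic semaphore-based limiter, not the server's ScreenshotRequestManager; the function name simulate and the shortened service time are assumptions chosen so the example runs quickly.

```python
import asyncio


async def simulate(clients=150, max_concurrent=15, service_time=0.01):
    # One semaphore permit per concurrent request slot.
    slots = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak_active": 0, "waiting": 0, "peak_waiting": 0}

    async def screenshot_request():
        stats["waiting"] += 1
        async with slots:
            # A slot was obtained, so this request leaves the queue.
            stats["waiting"] -= 1
            stats["active"] += 1
            stats["peak_active"] = max(stats["peak_active"], stats["active"])
            await asyncio.sleep(service_time)  # stand-in for processing
            stats["active"] -= 1
        return None

    async def sample_queue():
        # Let every request start, then sample the queue once.
        await asyncio.sleep(0)
        stats["peak_waiting"] = max(stats["peak_waiting"], stats["waiting"])

    requests = [asyncio.ensure_future(screenshot_request()) for _ in range(clients)]
    await sample_queue()
    await asyncio.gather(*requests)
    return stats


stats = asyncio.run(simulate())
print(stats["peak_active"], stats["peak_waiting"])
```

With 150 simultaneous requests and 15 slots, the limiter never runs more than 15 requests at once, and the remaining requests wait in the queue.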

Configuration Analysis

We examined the server configuration in application.properties:

# max concurrent screenshot requests
screenshots.max-concurrent-requests=15
# maximum amount of milliseconds to wait before invalidating a screenshot request
screenshots.upload-timeout=3000

Git history revealed these values were set conservatively during initial implementation (starting at 10, later bumped to 15) without load testing against production workloads.

Attempt: Increasing Concurrency Limits

We attempted to increase the limits to 50 concurrent requests:

screenshots.max-concurrent-requests=50
websocket.ping.max-concurrent-requests=50

Result: Client connections became extremely unstable. Clients received error responses from the server and were disconnected:

[Client 67] Upload error: hyper::Error(IncompleteMessage)
[Client 67] Upload error: Connection reset by peer
[Client 30] Disconnected by server
[Client 70] Disconnected by server
[Client 44] Disconnected by server
...

The server could not handle 50 concurrent image processing operations, leading to resource exhaustion and connection resets.

Conclusion

The 3-minute screenshot delay is caused by a queue backup due to the combination of:

Low concurrent request limit (15) - Only 15 screenshot requests can be processed simultaneously
High per-request processing time (~1.5s) - Each screenshot requires decode, merge, encode, and disk write operations
Math that doesn’t work:
- Maximum throughput: 15 slots × (1 request / 1.5s) = 10 requests/second
- Required throughput for 90 clients at 5s interval: 90 / 5 = 18 requests/second
- The server cannot keep up, causing queue growth and increasing delays
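
The capacity calculation above can be written as a small parameterized sketch. The function names are hypothetical, and the model assumes every slot is always busy and every request takes exactly the average processing time.

```python
def max_throughput(slots, service_time_s):
    # Each slot completes one request per service_time_s seconds.
    return slots / service_time_s


def required_throughput(clients, interval_s):
    # Each client needs one screenshot per interval_s seconds.
    return clients / interval_s


capacity = max_throughput(slots=15, service_time_s=1.5)       # requests/second
demand = required_throughput(clients=90, interval_s=5.0)      # requests/second
load_factor = demand / capacity
# Shortest refresh interval the server could sustain per client,
# even before any backlog accumulates.
best_case_interval_s = 90 / capacity

print(f"capacity {capacity:.1f}/s, demand {demand:.1f}/s, load factor {load_factor:.1f}")
print(f"best-case refresh interval: {best_case_interval_s:.1f} s")
```

A load factor above 1 means requests arrive faster than they can be served, so the configured interval cannot be met under this model.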

Note: Our test results were applied to the actual test environment.

The queue continuously grows faster than it drains, causing screenshot update times to degrade from seconds to minutes over the course of an exam.

Increasing the concurrency limit is not a viable solution - the server becomes unstable when processing more than ~15-20 concurrent image operations.

Solution

The throttling limits protect the server from overload; they are not the root cause. A real solution requires an architectural change:

Rewrite the architecture so that the server acts as a pure I/O relay and image encoding/decoding happens on the client.
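
As a rough illustration of what "pure I/O relay" could mean on the server side, the sketch below stores an already-encoded upload without decoding or re-encoding it. This is not a design from the project: the function name relay_frame, the choice to check only the PNG file signature, and the file layout are all assumptions, and frame merging would also have to move to the client for this to work.

```python
from pathlib import Path

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def relay_frame(upload: bytes, destination: Path) -> int:
    """Hypothetical handler: persist client-encoded bytes unchanged."""
    # Cheap sanity check on the header only; the image is never decoded.
    if not upload.startswith(PNG_SIGNATURE):
        raise ValueError("upload does not start with a PNG signature")
    destination.parent.mkdir(parents=True, exist_ok=True)
    # write_bytes returns the number of bytes written.
    return destination.write_bytes(upload)
```

The per-request server cost then reduces to a header comparison and a disk write, with no image processing on the request path.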

Last updated on March 16, 2026 • J.H.F.
