Building AI-Enabled Video Workflows for Near Real-Time Surveillance
Today, the camera lens is a data collector, and a video stream is a rich source of real-time intelligence. For architects in campus security, transportation, and public safety, the goal is no longer just visibility; it’s actionability. AI-enabled workflows for video surveillance are bridging that gap.
Systems need to do more than just record a parking lot. They need to alert teams when a vehicle is loitering. Traffic cameras need to show congestion and automatically feed data into light-timing optimization software.
However, a major barrier exists: aging infrastructure. Most organizations are sitting on millions of dollars of existing hardware that lacks native intelligence, including analog cameras, older IP models, and entrenched NVRs.
This post outlines the architectural patterns for injecting modern Computer Vision (CV) and AI into existing video workflows. We will explore how a flexible, programmable media infrastructure layer acts as the bridge between aging hardware and cutting-edge intelligence. This enables modernization without the massive cost of a “rip-and-replace” overhaul.
How to Integrate Intelligence into Video Surveillance Systems
The two fundamental components of an intelligent surveillance system are the capture device(s) and the intelligence layer. A media server sits in the middle, normalizing the video feed so it can be processed by any model, anywhere. Latency and bandwidth dictate where the intelligence engine should live. A flexible media engine can deploy the processing logic where it makes the most sense for the use case:
- Edge (On-Prem/Gateway): Best for bandwidth-constrained environments (e.g., drone feeds, remote traffic intersections). The media server processes the RTSP feed locally, and only the metadata (alerts) or low-res proxy streams are sent to the cloud, dramatically reducing bandwidth costs.
- Sidecar (Containerized): Best for heavy lifting. High-quality feeds are ingested by the media engine, which passes frames to a local Docker container running the vision model. The results are injected back into the stream immediately (see the sketch after this list).
- Cloud/Hybrid: Best for non-urgent forensics. Use the local media server to buffer and reliably transport streams to the cloud for deep learning analysis, like facial recognition.
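As a concrete illustration of the sidecar pattern, here is a minimal Python sketch that pulls frames from the media engine’s normalized RTSP output and posts them to a local inference container. The RTSP URL, the inference endpoint, and the response shape are assumptions for illustration; any HTTP-served vision model would slot in the same way.

```python
# Minimal sidecar-pattern sketch: read frames from the media engine's
# RTSP output and send them to a local, containerized vision model.
# The URLs and the response's JSON shape are hypothetical.
import cv2
import requests

RTSP_URL = "rtsp://localhost:554/live/camera01"   # normalized feed from the media engine
INFER_URL = "http://localhost:8000/v1/detect"     # local Docker inference container (assumed)

def run_sidecar() -> None:
    cap = cv2.VideoCapture(RTSP_URL)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # JPEG-encode the frame so it can travel over a plain HTTP POST.
        encoded, jpeg = cv2.imencode(".jpg", frame)
        if not encoded:
            continue
        resp = requests.post(INFER_URL, data=jpeg.tobytes(),
                             headers={"Content-Type": "image/jpeg"}, timeout=2)
        # Detections can now be injected back into the stream as metadata.
        for det in resp.json().get("detections", []):
            print(det["label"], det["confidence"])
    cap.release()

if __name__ == "__main__":
    run_sidecar()
```

In practice you would sample frames (every Nth frame, or keyframes only) rather than posting all 30fps to the model.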
Integrating Custom AI Functionality with “Bring Your Own Model” (BYOM) & MCP
One of the biggest risks in modern surveillance is buying a “smart camera” ecosystem that only supports the vendor’s proprietary analytics. Avoid vendor lock-in at the intelligence layer with a programmable media infrastructure that supports a Bring Your Own Model (BYOM) approach. Train custom models using standard frameworks like TensorFlow or PyTorch (or model families like YOLO), and use Model Context Protocol (MCP) to integrate AI tools.
The media infrastructure delivers video frames, regardless of model. The most powerful aspect of this architecture is its ability to modernize aging infrastructure. Many campuses and cities rely on older RTSP cameras that function perfectly well as optical devices but lack onboard processing. By routing these legacy feeds through a modern media engine like Wowza Streaming Engine, they can effectively upgrade the camera.
- The Workflow: The media engine ingests the legacy RTSP feed, transcodes it into a modern format (like WebRTC or SRT), and passes it to your AI model.
- The Result: You get advanced object detection and real-time alerts from a 10-year-old camera.
- The Benefit: You avoid the capital expenditure of replacing hardware. You connect the dots between old glass and new code.
One of the strongest arguments for a flexible media infrastructure is Capital Expenditure (CapEx) Avoidance. Smart City planners can still leverage analog-to-IP encoders by placing a programmable media server like Wowza Streaming Engine in the middle. This turns a legacy video system into an intelligent endpoint. The server handles the “translation” (ingesting legacy RTSP, outputting modern WebRTC) and the “intelligence” (running the stream through a Dockerized AI model). A robust media engine normalizes feeds into a standard H.264/H.265 mezzanine format. With this, modern AI developers don’t need to write code to support a discontinued Sony camera from 2012.
Sample Implementation Scenarios
Campus & Facility Security
When monitoring campuses and facilities remotely to ensure only approved people have access, standard HLS latency is unacceptable. Instead, send a low-latency WebRTC stream (sub-500ms) to the security desk’s dashboard for immediate viewing while simultaneously recording a high-bitrate SRT stream for evidence.
- The Workflow:
- Ingest RTSP feeds from existing IP cameras into a local Wowza instance.
- Use a ServerListener module to intercept raw packets. Decode frames to a shared memory buffer where a local lightweight model performs inference.
- If a specific object class is detected with >85% confidence, immediately trigger a REST API call to the access control system and alerting system, as sketched below.
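A minimal sketch of that final step, assuming the inference service returns a list of labeled detections and the access control system exposes a REST webhook (the URL and payload shape are placeholders):

```python
# Minimal alerting sketch: forward any detection above the confidence
# threshold to the access control / alerting systems.
# The webhook URL and payload shape are placeholders, not a real API.
import requests

CONFIDENCE_THRESHOLD = 0.85
ACCESS_CONTROL_WEBHOOK = "https://access-control.example.com/api/alerts"  # placeholder

def handle_detections(camera_id: str, detections: list[dict]) -> None:
    for det in detections:
        if det["confidence"] > CONFIDENCE_THRESHOLD:
            requests.post(ACCESS_CONTROL_WEBHOOK, json={
                "camera": camera_id,
                "label": det["label"],            # e.g. "person", "unattended_bag"
                "confidence": det["confidence"],
                "coordinates": det.get("coordinates"),
            }, timeout=2)

# Example:
# handle_detections("lobby-cam-03",
#     [{"label": "person", "confidence": 0.91,
#       "coordinates": {"x": 200, "y": 450, "w": 50, "h": 60}}])
```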
Traffic Monitoring & Transportation
When optimizing transit flow across 1,000+ RTSP inputs, cameras that go offline or “hang” (sending keep-alives but no video data) are a non-starter. Implement an Ingest Monitor that polls incomingStream.getDataRate(): if the bitrate drops below a threshold (e.g., 50kbps) for more than 10 seconds, it automatically issues a resetStream command to force the camera to re-negotiate the RTSP handshake, as sketched below.
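Here is a hedged sketch of that Ingest Monitor implemented against the Wowza Streaming Engine REST API rather than a Java module. The monitoring and resetStream paths follow the REST API’s documented pattern, but treat the exact paths and field names as assumptions to verify against your Engine version.

```python
# Ingest-monitor sketch using the Wowza Streaming Engine REST API.
# Endpoint paths and field names follow the REST API's usual pattern,
# but verify them against your Engine version's documentation.
import time
import requests

BASE = ("http://localhost:8087/v2/servers/_defaultServer_/vhosts/_defaultVHost_"
        "/applications/live/instances/_definst_/incomingstreams")
THRESHOLD_BPS = 50_000        # 50kbps
GRACE_SECONDS = 10

def monitor(stream_name: str) -> None:
    low_since = None
    while True:
        stats = requests.get(f"{BASE}/{stream_name}/monitoring/current",
                             headers={"Accept": "application/json"},
                             timeout=5).json()
        bitrate = stats.get("bytesInRate", 0) * 8   # bytes/s -> bits/s (field name assumed)
        if bitrate < THRESHOLD_BPS:
            low_since = low_since or time.time()
            if time.time() - low_since > GRACE_SECONDS:
                # Force the camera to re-negotiate the RTSP handshake.
                requests.put(f"{BASE}/{stream_name}/actions/resetStream", timeout=5)
                low_since = None
        else:
            low_since = None
        time.sleep(2)
```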
- The Workflow:
- Use Wowza’s REST API to dynamically create .stream files based on sensor triggers instead of applying static XML files (see the sketch after this list).
- Integrate a “Near-Miss” analysis model that flags non-collision anomaly patterns.
- Feed real-time insights to traffic signal systems to ease congestion at crowded intersections.
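For the first step above, a sketch of registering and connecting a .stream file through the REST API. The paths and query parameters follow the documented pattern, but verify them against your Engine version before relying on them.

```python
# Sketch: dynamically register a camera as a .stream file via the REST API
# when a sensor fires, instead of editing static XML. Paths and parameters
# follow the documented REST pattern; confirm against your Engine version.
import requests

VHOST = "http://localhost:8087/v2/servers/_defaultServer_/vhosts/_defaultVHost_"

def register_camera(name: str, rtsp_uri: str) -> None:
    # Create the .stream file...
    requests.post(f"{VHOST}/streamfiles",
                  json={"name": name, "uri": rtsp_uri},
                  headers={"Accept": "application/json",
                           "Content-Type": "application/json"},
                  timeout=5)
    # ...then tell the media caster to connect it.
    requests.put(f"{VHOST}/streamfiles/{name}/actions/connect",
                 params={"appName": "live", "appInstance": "_definst_",
                         "mediaCasterType": "rtp"},
                 timeout=5)

# Example trigger:
# register_camera("intersection42", "rtsp://10.0.0.42/axis-media/media.amp")
```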
Download our Solution Brief and learn how government agencies, including the Mississippi Department of Transportation (MDOT), are leveraging flexible video streaming infrastructure to intelligently monitor surveillance feeds.
Telehealth & Hybrid Care
HIPAA compliance requires that no PII (Personally Identifiable Information) touches the public cloud inference engine. In telemedicine or first responder situations, monitoring live feeds while preserving privacy and data security is of paramount importance. Use Session-Based Token Authentication for playback: generate a unique, short-lived token for the doctor’s session that expires immediately after the consult ends. This ensures the appropriate medical professionals can view their patients’ feeds at a moment’s notice.
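As a generic illustration of the pattern only (this is not Wowza’s SecureToken implementation), a minimal HMAC-signed, expiring token looks like this:

```python
# Generic short-lived playback token sketch: an HMAC-signed token that
# embeds an expiry. Illustrates the pattern only; it is NOT Wowza's
# SecureToken algorithm.
import hashlib
import hmac
import time

SHARED_SECRET = b"rotate-me-regularly"  # placeholder secret

def issue_token(stream_name: str, ttl_seconds: int = 900) -> str:
    expires = int(time.time()) + ttl_seconds
    payload = f"{stream_name}:{expires}"
    sig = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def validate_token(token: str, stream_name: str) -> bool:
    try:
        name, expires, sig = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False
    payload = f"{name}:{expires}"
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (name == stream_name
            and expires_at > time.time()
            and hmac.compare_digest(sig, expected))
```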
- The Workflow:
- Run a “Redaction Container” on the local Wowza server. A simple face-detection model can blur faces before the stream is transcoded (a minimal sketch follows this list).
- The redacted stream is sent to the cloud for custom vision analysis, such as for fall detection or emergency response.
- The original, un-redacted stream is encrypted (AES-128) and stored strictly on local write-once storage for patient records and liability.
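A minimal sketch of the redaction step, using OpenCV’s bundled Haar cascade face detector for brevity; a production pipeline would swap in a stronger detector, but the shape of the step is the same.

```python
# Minimal redaction sketch: blur detected faces in each frame before the
# stream is transcoded and sent to the cloud. Uses OpenCV's bundled Haar
# cascade for brevity; production systems would use a stronger detector.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def redact_frame(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        roi = frame[y:y + h, x:x + w]
        # Heavy Gaussian blur makes the face unrecoverable downstream.
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame
```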
Election Monitoring & Chain of Custody
For elections, surveillance teams aim to improve public trust in election proceedings while also protecting staff at voting locations. Ensuring real-time response and validating that the video has not been altered or interrupted during a network blip is critical.
- The Workflow:
- Configure an Active/Passive origin group. If the primary camera stream fails, the secondary takes over.
- Use an overlay (transcoder overlay image) that dynamically updates with a hash of the current video segment (see the sketch after this list).
- Employ custom vision models that identify and alert to bad actors in real-time.
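For the hash-overlay step, a minimal sketch of producing a per-segment digest that the overlay or an external audit log can display:

```python
# Chain-of-custody sketch: compute a digest of each finished video
# segment so the transcoder overlay (or an audit log) can display it.
import hashlib

def segment_digest(segment_path: str) -> str:
    h = hashlib.sha256()
    with open(segment_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:16]   # a short prefix is enough for a visual overlay

# Example: print(segment_digest("/recordings/precinct7/segment_0042.ts"))
```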
Download our Election Monitoring Solution Brief for more technical tips and implementation considerations.
Troubleshooting Common Issues
Fixing Timestamp Misalignment and a Drifting Bounding Box
- The Issue: Your AI inference takes 200ms. By the time the metadata is injected, the video frame has already passed. The box appears to “trail” the object.
- The Fix:
- Don’t rely on system clock time (NTP) alone.
- Do extract the PTS (Presentation Timestamp) from the video packet header in your Java module. When the model returns a result, calculate currentPTS + inferenceTime.
- Inject the metadata with this future timestamp so the player renders it at the exact moment the frame aligns (see the sketch below).
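A minimal sketch of that compensation arithmetic, assuming you can read each frame’s PTS and measure inference latency (units are milliseconds; the model callable is a placeholder):

```python
# Timestamp-compensation sketch: schedule metadata at the PTS where the
# frame will actually render, not at the wall-clock time inference ends.
import time

def run_inference_with_pts(frame, frame_pts_ms: int, model) -> dict:
    start = time.monotonic()
    detections = model(frame)                      # placeholder vision model
    inference_ms = int((time.monotonic() - start) * 1000)
    return {
        # Inject at currentPTS + inferenceTime so the player renders the
        # boxes on the frame they belong to, not 200ms behind it.
        "render_pts_ms": frame_pts_ms + inference_ms,
        "detections": detections,
    }
```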
Addressing RTSP Packet Loss & Artifacting
- The Issue: AI models are incredibly sensitive to compression artifacts. Macroblocking caused by packet loss can look like a person to a computer vision model, causing false positives.
- The Fix:
- Force TCP Interleave for RTSP ingest (forceInterleaved=true in MediaCaster settings) if the network is jittery. This adds slight latency but ensures packet order.
- Increase the RTP Jitter Buffer in VHost.xml to 500ms-1000ms to smooth out packet arrival before the transcoder (and the AI model) sees the frames.
Overcoming Metadata Bloat
- The Issue: Sending full JSON objects for every detected object in every frame (30fps) will bloat the stream manifest and cause player stalls on mobile networks.
- The Fix:
- Decimate: Only inject metadata when the state changes (e.g., “Object Entered”, “Object Left”) or at a lower frequency (e.g., 5fps); see the sketch after this list.
- Optimize: Use binary metadata formats or compact KLV instead of verbose JSON strings if the player supports it.
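A minimal decimation sketch, assuming an upstream tracker assigns stable track IDs (the identifiers and payload shape are illustrative):

```python
# Decimation sketch: emit metadata only when an object enters or leaves
# the scene, instead of 30 full JSON payloads per second.
def decimate(prev_ids: set, detections: list[dict]) -> tuple[set, list[dict]]:
    current_ids = {d["track_id"] for d in detections}   # assumes a tracker assigns IDs
    events = []
    for d in detections:
        if d["track_id"] not in prev_ids:
            events.append({"event_type": "object_entered", **d})
    for track_id in prev_ids - current_ids:
        events.append({"event_type": "object_left", "track_id": track_id})
    return current_ids, events   # inject only `events` into the stream
```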
Best practices involve embedding inference data directly into the streaming manifest using ID3 tags (for HLS) or KLV metadata. This ensures that the “bounding box” travels inside the video stream, staying frame-accurate regardless of buffering or network jitter.
Sample Metadata Payload (JSON injected into ID3):
```json
{
  "timestamp": "1733774822",
  "event_type": "object_detection",
  "label": "unattended_bag",
  "confidence": 0.92,
  "coordinates": { "x": 200, "y": 450, "w": 50, "h": 60 }
}
```
How a Flexible Media Infrastructure Solution Empowers Active Intelligence
The difference between a passive recording and an intelligent security workflow isn’t just the AI model; it’s the infrastructure that delivers the video to that model.
Wowza Streaming Engine empowers this transformation by providing the reliable, protocol-agnostic, and programmable foundation required for critical operations. Whether you are deploying on-premises for maximum security, at the edge for speed, or in the cloud for scale, Wowza connects your existing vision hardware to your future vision goals.
Don’t rebuild your infrastructure. Make it smarter. Get in touch today.