• International Journal of Technology (IJTech)
  • Vol 13, No 5 (2022)

Video Chunk Processor: Low-Latency Parallel Processing of 3 x 3-pixel Image Kernels

Video Chunk Processor: Low-Latency Parallel Processing of 3 x 3-pixel Image Kernels

Title: Video Chunk Processor: Low-Latency Parallel Processing of 3 x 3-pixel Image Kernels
Daniel Cheok Kiang Kho, Mohammad Faizal Ahmad Fauzi, Sin Liang Lim

Corresponding email:

Cite this article as:
Kho, D.C.K., Fauzi, M.F.A., Lim, S.L., 2022. Video Chunk Processor: Low-Latency Parallel Processing of 3 x 3-pixel Image Kernels. International Journal of Technology. Volume 13(5), pp. 1045-1054

Daniel Cheok Kiang Kho Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Mohammad Faizal Ahmad Fauzi Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Sin Liang Lim Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Email to Corresponding Author

Video Chunk Processor: Low-Latency Parallel Processing of 3 x 3-pixel Image Kernels

Video framebuffers are usually used in video processing systems to store an entire frame of video data required for processing. These framebuffers make extensive use of random access memory (RAM) technologies and interfaces that use them. Recent trends in the high-speed video discuss the use of higher speed memory interfaces such as DDR4 (double-datarate 4) and HBM (high-bandwidth memory) interfaces. To meet the demand for higher image resolutions and frame rates, larger and faster framebuffer memories are required. While it is not feasible for software to read and process parts of an image quickly and efficiently enough due to the high speed of the incoming video, a hardware-based video processing solution poses no such limitation. Existing discussions involve the use of framebuffers even in hardware-based implementations, which greatly reduces the speed and efficiency of such implementations. This paper introduces hardware techniques to read and process kernels without the need to store the entire image frame. This reduces the memory requirements significantly without losing the quality of the processed images.

Framebuffer; Image processing; Kernel processing; Parallel processing; Video processing


Video processing solutions (Park et al., 2020) typically require incoming video streams to be captured and stored first, and then to be processed later. This technique of buffering incoming frames is still the mainstream technique used to process videos today. However, as the consumer demands for clearer and faster videos continue to rise, initiating processing of videos after a time- lapse of several frames reduces efficiency and increases latency between the incoming video and the consumer of the video at the endpoint of the video pipeline. This paper introduces techniques that enable real-time processing of videos to reduce such latency while still maintaining the quality of the video being received, enabling potential improvements to existing hardware-based solutions that require the use of external RAM memory for frame buffering.

Traditionally, random-access memory (RAM) (Kechiche, 2018; Vervoort, 2007; Wang, 2019; Cerezuela-Mora et al., 2015; Gonzalez, 2002) has been extensively used to store entire image frame data before actual processing is performed in image and video processing. As the resolution of images increase, so do the need for more RAM. To be able to process video at high speed, access to the memory now becomes a bottleneck. Newer DDR and HBM RAM interfaces have been designed to cater for a surge in need for faster RAM access from the host (eg. graphics processing unit (GPU)). This paper proposes a technique to perform image processing without requiring the use of external DDR or HBM RAM memory. A much smaller amount of on-chip RAM is required to buffer the incoming video stream. Figure 1 shows a video processing system that incorporates the proposed image processing technique. This paper is an extension of work originally presented in the 2020 edition of the IEEE Region 10 Conference (TENCON) (Kho et al., 2018).


Literature Review

     Chen (Chen, 2008) has shown that it was feasible to implement Full-HD (1920×1080p)-resolution video processing on an application-specific integrated circuit (ASIC) at 55 fps. However, it is not known what type of edge detection or video sharpening algorithms were supported, or if any exist at all. Our research has shown that video sharpening, edge detection, and potentially many other video processing algorithms, can be implemented feasibly on a low-cost and much slower FPGA at a speed of at least 60 fps.
       He (He, 2019) showed that it was feasible to implement high-speed HDMI video transfer and display using an FPGA device. Similar to our research, they have also used the HDMI as the video input interface to receive and perform video capture. However, their solution focused more on video capture, transmission, and display, with median filtering being the only algorithmic block. We have also used something similar to the image storage module, in what we refer to as the Row Buffers shown in Figure 1. However, our research focused more on improving algorithm efficiency and quality (Kho et al., 2018; Kho et al., 2019; Kho et al., 2020), as well as efficient video data encoding techniques. Our Row Buffers re-encode the VGA or HDMI input video stream into a native 15×3-pixel chunk structure which is then provided to the Chunk Processor. Our Chunk Processor efficiently repacks a chunk into an array of fifteen 3×3 kernels, which are then passed directly into our algorithm blocks for further video processing. As shown in Figure 7, there are 15 such kernel processors within the Chunk Processor that processes each kernel in parallel. Our algorithmic blocks can receive these kernels directly and commence the processing without further delay.

       Kolonko (Kolonko, 2019) used a Raspberry Pi single-board computer to repack the standard VGA or HDMI input video source to a format suitable for their FPGA design to manipulate. Such a system was intended to test the algorithmic blocks within the FPGA, and it was not meant to be a production-ready solution. Our chunk processor has shown to quickly process incoming VGA or HDMI video and pass the information quickly to the algorithmic blocks, hence, this solution can be used as a stand-alone production-ready solution.
        Kechiche (Kechiche, 2018) shows a typical implementation that uses an ARM processor connected to an AXI bus, and making use of a 1-Gigabit DDR3 external RAM device. Sousa (Sousa, 2020) uses the million of connection updates per second (MCUPS) metric to characterize the throughput of their neural-network based application, which is different from the video processing throughput. Park (Park, 2020) discussed a 29×29-pixel Retinex algorithm at length and reported that the latency in a 1080p video processing environment at 60 fps is unnoticeable. The latency proposed in this paper is also unnoticeable and has been used successfully in gradient-based calculations. Our proposed techniques could be scaled to the same kernel size as Park’s, enabling the scaling of such applications to higher resolutions and framerates.

     Wahab (Wahab et al., 2015) shows an automatic face-image acquisition system designed using a microcontroller that can be used for facial recognition applications. Pasnur (Pasnur et al., 2016) described methods to retrieve a partial image by specifying a region of interest (ROI). Our proposed techniques aim to help improve the speed of recognition and retrieval processes. 


Hardware Architecture

3.1Devising a new kernel processing scheme

       Image processing algorithms usually implements with kernels, which are small fixed-size pieces of an image. These kernels contain pixels that form a square, with the width and height of the square usually as an odd number. Even-numbered kernel dimensions exist but are less common. Kernel sizes may range from 3-by-3, 5-by-5, 7-by-7, 9-by-9 pixels to larger dimensions. In this paper, we show how we have carefully devised a new image processing scheme that considers these odd-dimension kernel sizes. 

    We would like to process at least some kernels in parallel to leverage the parallel processing advantages inherent in hardware-based designs. A bunch of kernels will be transferred to the processing block to be processed in parallel. In our paper, we shall denote this group of kernels as a “chunk”. The block that processes these chunks is shown in Figure 1 as the Chunk Processor. In selecting a suitable size of a chunk, we considered the various kernel sizes from 3×3, 5×5, up to 15×15. We settled on a chunk size of 15×3 pixels due to practical reasons.

Figure 1 Video processing system block diagram

Figure 2 Examples of odd-sized kernel structures

    Figure 2 shows several examples of odd-sized image kernels. The dark pixels denote the centre pixels. Because we are not able to support each and every possible kernel dimension, we wanted to make sure at least several kernels could be processed simultaneously using common chunk. By selecting a 15×3 chunk size, 3×3-sized kernels can be supported. Although other kernel dimensions such as 5×5 and 7×7 are gaining in popularity, kernels having dimensions of 3×3 pixels are still very much in use, therefore will be the focus of this paper.

        To perform video processing continuously, one would shift the center pixel by one pixel, (either to the right, or to the bottom) by assuming that processing is done from the left to right and top to bottom. For example, by shifting the center pixel to the right, this would result in having 13 full kernels in a 15×3 chunk, and 2 partial kernels. The partial kernels are what we denote as the edge kernels or boundary kernels, as shown in Figure 3.

Figure 3 Boundary kernel structures

    We will show that, from the viewpoint of the Kernel Processors, the pixel shifts are occurring in parallel, which means that we are processing all 15 kernels within a chunk simultaneously. This means that within a single chunk_valid cycle, all 15 kernels within a chunk are being processed in parallel, while the next chunk of video data is being buffered by the row buffers and transformed into chunks by the chunk processor.

3.2Kernel processing scheme

    The input video source is read in pixel-by-pixel by a pre-processing block, which is comprised of row buffers that stores several rows of pixels of a frame. These Row Buffers, as shown in Figure 1, receive the incoming pixels serially and buffer them in internal RAM memory. This block then transforms the buffered pixels into our chunk format, which is subsequently transferred to the chunk processor block. The row buffers will also generate a chunk valid pulse every time a 15×3-pixel chunk has been buffered and the data is available for the chunk processor. The chunk processor then transforms chunks into individual kernels which are fed into the parallel algorithmic kernel processors. This will process all the 15 kernels within the chunk in parallel.

    In our research, we wanted to perform Sobel edge detection with a 3×3 kernel. The pixel arrangement of such a kernel is shown in Figure 2, where the highlighted pixel in black is the center pixel. For a typical software-based solution, the algorithm would sweep across the kernels from the left-most of the frame to the right, each time reading a 3×3-pixel kernel across 3 rows of the image. For our proposed hardware-based solution, we would read in 3 rows of 15 pixels directly (a chunk) and begin to process all the 15 kernels simultaneously. The proposed chunk structure is shown in Figure 5.

       In the case of our Sobel gradient computations, it takes 3 clock cycles to complete the processing of a kernel. Because all the kernels within a chunk commences processing in parallel in which the same algorithm is used to process each kernel, all the kernels will also complete the processing at the same time. We shall denote this duration as the chunk processing time ?Tc. The chunk processing time varies depending on which algorithm is being used to process the kernels, and the actual hardware implementation involved including the number of pipeline stages used.

       After a chunk has been processed, the hardware moves on to the next chunk and repeats the process until the end of the frame is reached. In the context of a Full-HD (1920x1080p) frame, our proposed chunk structure would be as shown in Figure 6.

         Figure 7 shows a block diagram of our proposed chunk processing scheme. The Chunk Processor in Figure 7 corresponds to the same Chunk Processor module as was shown in Figure 1. Likewise, the Kernel Processors in Figure 7 corresponds to the same module as depicted in Figure 1. Here, we show the parallel processing architecture of the 15 Kernel Processors.
      The kernel outputs from the Chunk Processor are fed to the algorithmic Kernel Processors for parallel computation. The kernel processor outputs q0 to q14 correspond to the processed pixels after every kernel has been processed. Each kernel processor works on the center pixel of a 3×3-pixel kernel and computes the resulting output pixel qN based on the other 8 pixels adjacent to the center pixel.

Figure 4 Chunk valid timing

Figure 5 Chunk structure

Figure 6 Chunk structure for a Full-HD Frame

Figure 7 Parallel kernel processing scheme

At every chunk_valid pulse, data from 15 kernels is available. The 15 Kernel Processors will commence the parallel processing of all the kernels coming from the Chunk Processor, producing the pixel field q consisting of 15 pixels per chunk_valid cycle.

Experimental Methods

Figure 8 shows the experimental setup to characterize our video buffering scheme and processing algorithms. Our hardware implementation currently supports Full-HD (1080p) resolution video running at 60 frames-per-second (fps).

In our laboratory environment, we used a 100-MHz oscillator input from our board as our system clock. A Xilinx Mixed-mode Clock Manager (MMCM) module is used to synthesize the output frequencies required for 1080p video. From the 100-MHz input clock, a pixel clock running at 148.5 MHz is generated by the MMCM. Also, a high-speed clock running at 750 MHz, required for transmitting and receiving serialized video data at the HDMI connectors, is also generated by the MMCM.

Figure 8 Experimental laboratory setup

To characterize the performance of our chunk processor in terms of signal latencies, we shall first denote the signals of interest. The reset signal resets the system to default values. During this time, all counters and registers are zeroed out. The Vsync signal is the vertical synchronization signal of the incoming video input. Together with the vertical blanking (Vblank) signal, the Vsync signal is used to flag that a new video frame has arrived. Both the reset and Vsync signals are inputs to our chunk processing module.

The chunk_valid signal is an output of our chunk processor block, indicating that a chunk of video data (15×3 pixels) has been processed and it is available for use by the downstream algorithmic processing blocks.

As part of our characterization work, all these signals are captured using Xilinx ChipScope, and they produce similar results with our logic simulations. The Results section discusses these in terms of signal latencies and performance.

Results and Discussion

Our simulation testbench closely emulates our hardware configuration. All hardware blocks have been written in the VHDL (Ashenden, 2010) hardware language. The design was functionally simulated in Mentor Graphics ModelSim. A sample screen capture of the final simulation results are shown in Figure 9.

Figure 9 Partial view of output data from chunk processor

After gaining enough confidence from our simulations, we synthesised our design to FPGA hardware using Xilinx Vivado. The final implementation was downloaded into the Xilinx Artix XC7A200T device and the hardware results are acquired in real-time using Xilinx ChipScope. The realtime hardware results are compared and verified against the ModelSim simulation results.

5.1. Functional Simulations

   We have implemented this design on a Xilinx Artix 7 (XC7A200T) FPGA device and used Xilinx’s ChipScope integrated logic analyzer to acquire real-time waveforms from our development hardware. Figure 9 shows a partial view of the functional simulation results in ModelSim.

   From Figure 10, we note that the latency from the deassertion of the reset signal until the first assertion of the chunk_valid signal is 59.3535 microseconds (?s). A similar amount of time will be taken between the assertion of the Vsync video synchronization signal until the assertion of the chunk_valid signal. This means that the initial latency after a reset or power-up -up event, until data becomes available for processing by the algorithmic blocks, is 59.3535 ?s.

     Figure 11 shows that the processing latency between one chunk_valid pulse and the next is 101.01 nanoseconds (ns).

Figure 10 Start-up latency between reset and first chunk_valid

Figure 11 Latency between two consecutive chunk_valid pulses

5.2. Hardware Synthesis, Place & Route, and Design Assembly

   The Xilinx post-synthesis report shows that our row buffers and chunk processing scheme utilize only 397 look-up tables (LUTs), which on the Artix XC7A200T device, translates to just 0.29% of all the LUT resources in the chip. The post-synthesis results are shown in Figure 12.