Abstract
In the era of digital media, the rapidly increasing volume and complexity of multimedia data make it difficult to store, process, and query information in a reasonable time. Feature extraction and processing time play an extremely important role in large-scale video retrieval systems and currently receive much attention from researchers. We therefore propose an efficient approach to feature extraction on big video datasets using deep learning techniques. It focuses on the main content features, namely subtitles, speech, and objects in video frames, by combining three techniques: optical character recognition (OCR), automatic speech recognition (ASR), and object detection with deep learning. We provide three network models developed from Faster R-CNN ResNet, Faster R-CNN Inception ResNet V2, and Single Shot Detector (SSD) MobileNet V2. The approach is implemented in Spark, a next-generation parallel and distributed computing environment, which reduces the time and space costs of the feature extraction process. Experimental results show that our proposal achieves an accuracy of 96% and reduces processing time by 50%, demonstrating the feasibility of the approach for content-based video retrieval systems in a big data context.
Introduction
The demand for multimedia data and information systems is growing rapidly due to the development of the internet, big data, and broadband networks. However, multimedia data requires significant storage and processing, posing a challenge for efficiently extracting, indexing, storing, and retrieving video information from large databases. This paper proposes a feature extraction method using distributed deep learning in Spark for content-based video indexing and retrieval based on subtitles, speech, and objects in video frames, as an alternative to traditional keyword-based video search.
In this section, we summarize the techniques used in the proposed method, including techniques for extracting video content, such as on-screen text (optical characters), speech, and image objects, which are then used to query videos.
Proposed Method
In this section, we present our proposed approach for efficient feature extraction on big video datasets using deep learning techniques. It includes the distributed and parallel processing model for better processing time, the techniques for content feature extraction (speech, subtitles, and objects), and content indexing for video retrieval.
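As a rough illustration of the distributed processing model, the sketch below parallelizes per-video extraction over a Spark cluster with PySpark. The HDFS paths and the `extract_features` placeholder are illustrative assumptions, not the paper's actual implementation.

```python
import json

from pyspark.sql import SparkSession


def extract_features(video_path):
    # Hypothetical placeholder: run pre-processing, OCR, ASR, and object
    # detection on one video and return its feature record.
    return {"video": video_path, "subtitles": [], "speech": "", "objects": []}


spark = SparkSession.builder.appName("VideoFeatureExtraction").getOrCreate()

# One video path per line; each partition is processed by a worker
# independently, so extraction time scales with the number of executors.
video_paths = spark.sparkContext.textFile("hdfs:///videos/path_list.txt")
features = video_paths.map(extract_features)

# Persist one JSON record per video for the indexing step.
features.map(json.dumps).saveAsTextFile("hdfs:///videos/features")
```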
- Feature Extraction: The proposed method uses techniques of Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and deep neural networks to extract features suitable for video retrieval in a big data context. The feature extraction process includes the following steps:
- Pre-Processing: The first step is to extract images and audio clips from the input videos. Standard video files typically have 25-30 frames per second (fps), and the extracted images and audio clips serve as the input data for feature extraction (a pre-processing sketch is given after this list).
- Content Extraction: After the pre-processing step, the method extracts features from the video content (a content-extraction sketch follows this list), including:
- Extracting speech from the audio clips using ASR
- Extracting subtitles from the video frames using OCR
- Detecting and extracting objects from the video frames using deep neural networks
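A minimal sketch of the pre-processing step, assuming OpenCV and the ffmpeg command-line tool are available; the one-frame-per-second sampling rate and the 16 kHz mono audio format are illustrative choices, not parameters reported in the paper.

```python
import subprocess

import cv2


def extract_frames(video_path, out_dir, frame_step=25):
    """Save roughly one frame per second (for ~25 fps video) as JPEG images."""
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved


def extract_audio(video_path, out_wav):
    """Dump the audio track as 16 kHz mono WAV, a common input format for ASR."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", out_wav],
        check=True,
    )
```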
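The content-extraction step can be sketched as follows, assuming pytesseract for subtitle OCR and a pretrained SSD MobileNet V2 detector from TensorFlow Hub as a stand-in for the detectors trained in the paper; the hub handle, the subtitle region, and the score threshold are assumptions. A comparable ASR call would transcribe the WAV files produced in pre-processing.

```python
import cv2
import numpy as np
import pytesseract
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained COCO detector, used here only as a stand-in for the paper's models.
detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")


def extract_subtitles(frame_bgr):
    """OCR the lower third of the frame, where subtitles usually appear."""
    height = frame_bgr.shape[0]
    band = cv2.cvtColor(frame_bgr[2 * height // 3:, :], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(band).strip()


def detect_objects(frame_bgr, score_threshold=0.5):
    """Return the class ids of detections whose confidence passes the threshold."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    inputs = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(inputs)
    scores = outputs["detection_scores"][0].numpy()
    classes = outputs["detection_classes"][0].numpy().astype(int)
    return [int(c) for c, s in zip(classes, scores) if s >= score_threshold]
```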
Results
The proposed method for content-based video retrieval achieves high accuracy, from 85% to 96%, across Scenarios 1 to 9. Scenario 8 obtains the highest accuracy, 96%, using the proposed distributed deep learning model on Spark. The experimental results also show that, compared to a normal computing environment, the processing time in Spark is shortened by 50% without reducing accuracy even when the dataset is increased sixfold (Scenario 3 versus Scenario 1). The average execution time for Scenario 3 is the lowest, as it only extracts speech and subtitle features.
Some illustrative results of the object recognition for Scenarios 4–9 are presented in Figure 4. Scenario 8 detects the bee object with the highest accuracy, 94%, while the bee detection in Scenarios 6 and 9 yields the lowest accuracies, 83% and 86%, respectively.