Abstract
In the era of digital media, the rapidly increasing volume and complexity of multimedia data make it difficult to store, process, and query information in a reasonable time. Feature extraction and processing time play an extremely important role in large-scale video retrieval systems and currently receive much attention from researchers. We therefore propose an efficient approach to feature extraction on big video datasets using deep learning techniques. It focuses on the main content features, namely subtitles, speech, and objects in video frames, by combining three techniques: optical character recognition (OCR), automatic speech recognition (ASR), and object detection with deep learning. We provide three network models developed from Faster R-CNN ResNet, Faster R-CNN Inception ResNet V2, and Single Shot Detector (SSD) MobileNet V2. The approach is implemented in Spark, a next-generation parallel and distributed computing environment, which reduces the time and space costs of the feature extraction process. Experimental results show that our proposal achieves an accuracy of 96% and reduces processing time by 50%, demonstrating the feasibility of the approach for content-based video retrieval systems in a big data context.
Introduction
The demand for multimedia data and information systems is growing rapidly due to the development of the internet, big data, and broadband networks. However, multimedia data requires significant storage and processing, posing a challenge for efficiently extracting, indexing, storing, and retrieving video information from large databases. This paper proposes a feature extraction method using distributed deep learning in Spark for content-based video indexing and retrieval based on subtitles, speech, and objects in video frames, as an alternative to traditional keyword-based video search.
In this section, we summarize the techniques used in the proposed method, including techniques for extracting video content, such as on-screen text (optical characters), speech, and image objects, which are then used to query videos.
Proposed Method
In this section, we present our proposed approach for efficient feature extraction on big video datasets using deep learning techniques. It includes the distributed and parallel processing model for better processing time, the techniques for content feature extraction (speech, subtitles, and objects), and content indexing for video retrieval.
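As a rough illustration of the distributed processing model, the sketch below parallelizes per-video extraction over a Spark cluster with PySpark. The HDFS paths and the `extract_features` placeholder are illustrative assumptions, not the paper's actual implementation.

```python
import json

from pyspark.sql import SparkSession


def extract_features(video_path):
    # Hypothetical placeholder: run pre-processing, OCR, ASR, and object
    # detection on one video and return its feature record.
    return {"video": video_path, "subtitles": [], "speech": "", "objects": []}


spark = SparkSession.builder.appName("VideoFeatureExtraction").getOrCreate()

# One video path per line; each partition is processed by a worker
# independently, so extraction time scales with the number of executors.
video_paths = spark.sparkContext.textFile("hdfs:///videos/path_list.txt")
features = video_paths.map(extract_features)

# Persist one JSON record per video for the indexing step.
features.map(json.dumps).saveAsTextFile("hdfs:///videos/features")
```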
- Feature Extraction: The proposed method uses techniques of Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and deep neural networks to extract features suitable for video retrieval in a big data context. The feature extraction process includes the following steps:
- Pre-Processing: The first step is to extract images and audio clips from the input videos. Standard video files typically have 25-30 frames per second (fps), and the extracted images and audio clips serve as the input data for feature extraction (a pre-processing sketch is given after this list).
- Content Extraction: After the pre-processing step, the method extracts features from the video content (a content-extraction sketch follows this list), including:
- Extracting speech from the audio clips using ASR
- Extracting subtitles from the video frames using OCR
- Detecting and extracting objects from the video frames using deep neural networks
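A minimal sketch of the pre-processing step, assuming OpenCV and the ffmpeg command-line tool are available; the one-frame-per-second sampling rate and the 16 kHz mono audio format are illustrative choices, not parameters reported in the paper.

```python
import subprocess

import cv2


def extract_frames(video_path, out_dir, frame_step=25):
    """Save roughly one frame per second (for ~25 fps video) as JPEG images."""
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved


def extract_audio(video_path, out_wav):
    """Dump the audio track as 16 kHz mono WAV, a common input format for ASR."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", out_wav],
        check=True,
    )
```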
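The content-extraction step can be sketched as follows, assuming pytesseract for subtitle OCR and a pretrained SSD MobileNet V2 detector from TensorFlow Hub as a stand-in for the detectors trained in the paper; the hub handle, the subtitle region, and the score threshold are assumptions. A comparable ASR call would transcribe the WAV files produced in pre-processing.

```python
import cv2
import numpy as np
import pytesseract
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained COCO detector, used here only as a stand-in for the paper's models.
detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")


def extract_subtitles(frame_bgr):
    """OCR the lower third of the frame, where subtitles usually appear."""
    height = frame_bgr.shape[0]
    band = cv2.cvtColor(frame_bgr[2 * height // 3:, :], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(band).strip()


def detect_objects(frame_bgr, score_threshold=0.5):
    """Return the class ids of detections whose confidence passes the threshold."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    inputs = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(inputs)
    scores = outputs["detection_scores"][0].numpy()
    classes = outputs["detection_classes"][0].numpy().astype(int)
    return [int(c) for c, s in zip(classes, scores) if s >= score_threshold]
```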
Results
The proposed method for content-based video retrieval achieves high accuracy, from 85% to 96%, across Scenarios 1 to 9. Scenario 8 obtains the highest accuracy, 96%, using the proposed distributed deep learning model on Spark. The experimental results also show that, compared to a normal computing environment, the processing time in Spark is shortened by 50% without reducing accuracy even when the dataset is increased sixfold (Scenario 3 versus Scenario 1). The average execution time for Scenario 3 is the lowest, as it only extracts speech and subtitle features.
Some illustrative results of the object recognition for Scenarios 4–9 are presented in Figure 4. Scenario 8 detects the bee object with the highest accuracy, 94%, while the bee detection in Scenarios 6 and 9 yields the lowest accuracies, 83% and 86%, respectively.