Video object recognition based on deep learning

Stojanoski, Stefan

doi:10.34726/hss.2019.55534

Record link:

https://doi.org/10.34726/hss.2019.55534
http://hdl.handle.net/20.500.12708/8488

Citation:

Stojanoski, S. (2019). Video object recognition based on deep learning [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2019.55534

CatalogPlus:

AC15355653

Publication Type:

Thesis - Diplomarbeit

Language:

English

Advisor:

Eidenberger, Horst

Organisational Unit:

E193 - Institut für Visual Computing and Human-Centered Technology

Date (published):

2019

Number of Pages:

102

Keywords:

Deep Learning; Convolutional Neural Networks; Video Object Recognition

Abstract:

In this master's thesis we designed a client-server system for automatic billboard recognition in video streams. The client side is an Android application that collects video streams for the server side. For the server side we designed a deep neural network called StefanNet. StefanNet is a fully convolutional neural network that classifies and localizes billboards within a video frame. Its feature extractor contains 23 convolutional layers, and it uses a single shot detector (SSD) for object detection. StefanNet was trained on the self-designed BillboardDataset, which contains 4042 images of billboards located throughout the metro stations in Vienna. Additionally, data augmentation techniques were applied to artificially enlarge the dataset by 25%. Furthermore, quantization was applied to the StefanNet model to reduce the bit width needed to store the network weights from float32 to float16. We evaluated StefanNet by comparing it against the state-of-the-art networks ResNet, MobileNet, Inception and VGG16. The validation dataset contains both frontal and side views of the billboards. StefanNet achieved 91% mean average precision (mAP) on the test dataset, 98% mAP on the frontal-view validation dataset and 82% mAP on the side-view validation dataset. The inference rate was 40 frames per second (FPS) on an Nvidia 1080 graphics card. The quantized version of StefanNet achieved 91% mAP on the test dataset, 96% mAP on the frontal-view validation dataset and 85% mAP on the side-view validation dataset, at an inference rate of 45 FPS. Both StefanNet and its quantized version outperform the other evaluated networks on all datasets. This indicates that the StefanNet architecture is currently the most suitable of the evaluated architectures for the specific problem of automatic billboard detection in video streams.
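
The float32-to-float16 weight quantization mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the weights are randomly generated stand-ins, and the helper names `pack_weights` and `unpack_weights` are hypothetical. It only shows that storing weights as 16-bit floats halves the storage size and introduces a small rounding error.

```python
import random
import struct


def pack_weights(weights, fmt):
    """Serialize floats with a struct format: 'f' = float32, 'e' = float16."""
    return struct.pack(f"<{len(weights)}{fmt}", *weights)


def unpack_weights(blob, fmt):
    """Deserialize a byte string produced by pack_weights with the same format."""
    count = len(blob) // struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{count}{fmt}", blob))


# Assumption: random values stand in for trained network weights.
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(1000)]

fp32_blob = pack_weights(weights, "f")  # 4 bytes per weight
fp16_blob = pack_weights(weights, "e")  # 2 bytes per weight

# Compare the float16 values against the float32 reference values.
reference = unpack_weights(fp32_blob, "f")
restored = unpack_weights(fp16_blob, "e")

ratio = len(fp16_blob) / len(fp32_blob)
max_err = max(abs(a - b) for a, b in zip(reference, restored))

print(f"storage ratio float16/float32: {ratio}")
print(f"largest absolute rounding error: {max_err:.2e}")
```

Whether such rounding affects detection accuracy depends on the network and the data. The abstract reports the measured mAP and FPS values for StefanNet and its quantized version.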

License:

In Copyright

Appears in Collections:

Thesis