Plant Detection and State Classification with Machine Learning

DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Software Engineering & Internet Computing by Tobias Eidelpes, BSc, Registration Number 01527193, to the Faculty of Informatics at the TU Wien. Advisor: Ao.Univ.-Prof. Dr. Horst Eidenberger. Vienna, 30th December, 2023.

Declaration of Authorship

I hereby declare that I have written this thesis independently, that I have fully listed all sources and aids used, and that I have marked as borrowed, with a reference to the source, all passages of this work (including tables, maps and figures) that were taken from other works or from the internet, either verbatim or in substance. Vienna, 30th December, 2023. Tobias Eidelpes

Acknowledgements

I would especially like to thank the supervisor of this thesis—Horst Eidenberger—for the prompt handling of my requests as well as the meaningful feedback I received on an ongoing basis. The roadmap he provided made this thesis considerably easier, not only in the initial phase but also later on. My parents are the ones who made my studies possible in the first place. Their ongoing support, even in the later years—primarily in the form of shared lunches—allowed me to concentrate on what is important. Last but not least, I would like to thank my fellow students, who were always available for questions, discussions and relaxation. I always enjoy the occasional exchange and I hope that we can keep it up in the future.

Abstract

Water deficiency in household plants can adversely affect growth. Existing solutions to monitor water stress are primarily intended for agricultural contexts where only a small selection of plants is of interest. To date, there has been no research in household settings, where the variety of plants is considerably higher and it is thus more difficult to obtain accurate measures of water stress. Furthermore, current approaches either do not detect plants in images first or use traditional feature extraction for plant detection. We develop a prototype to detect plants and classify them as water-stressed or not using deep learning based methods exclusively. Our two-stage approach consists of a detection and a classification step. In the detection step, plants are identified and cut out from the original image. The cutouts are passed to the classifier, which outputs a probability for water stress. We use transfer learning to start from a robust base and fine-tune both models for their respective tasks. Each model is optimized using hyperparameter optimization and evaluated first individually and then in aggregate on a custom dataset. We deploy both models to an Nvidia Jetson Nano, which is able to survey plants autonomously via an attached camera. The results of the pipeline are published continuously via an API.
Downstream watering systems can use the water stress predictions to water the plants without human intervention. The two models in aggregate achieve a mAP of 0.3581 for the non-optimized version. Both constituent models have robust feature extraction capabilities and are able to cope with various lighting conditions, different angles and a wide variety of household plants. The optimized pipeline achieves a mAP of 0.3838 on unseen images with higher precision for the non-stressed but lower precision for the stressed class. Recall for the non-stressed class remains at the same level compared to the non-optimized baseline but is 12.1 percentage points higher for the stressed class. The weighted F1-score across both classes was improved by 2.4 percentage points. These results show that our two-stage approach is viable and a promising first step for plant state classification for household plants.

Contents

Abstract
1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Methodological Approach
  1.3 Thesis Structure
2 Theoretical Background
  2.1 Machine Learning
    2.1.1 Supervised Learning
    2.1.2 Artificial Neural Networks
    2.1.3 Activation Functions
    2.1.4 Loss Function
    2.1.5 Backpropagation
  2.2 Object Detection
    2.2.1 Traditional Methods
    2.2.2 Deep Learning Based Methods
    2.2.3 Two-Stage Detectors
    2.2.4 One-Stage Detectors
  2.3 Image Classification
    2.3.1 Traditional Methods
    2.3.2 Deep Learning Based Methods
  2.4 Transfer Learning
  2.5 Hyperparameter Optimization
    2.5.1 Grid Search
    2.5.2 Random Search
    2.5.3 Evolution Strategies
  2.6 Related Work
3 Prototype Design
  3.1 Requirements
  3.2 Design
  3.3 Selected Methods
    3.3.1 You Only Look Once
    3.3.2 ResNet
    3.3.3 Data Augmentation
4 Prototype Implementation
  4.1 Object Detection
    4.1.1 Dataset
    4.1.2 Training Phase
    4.1.3 Hyperparameter Optimization
  4.2 Classification
    4.2.1 Dataset
    4.2.2 Hyperparameter Optimization
  4.3 Deployment
5 Evaluation
  5.1 Methodology
  5.2 Results
    5.2.1 Object Detection
    5.2.2 Classification
    5.2.3 Aggregate Model
    5.2.4 Non-optimized Model
    5.2.5 Optimized Model
  5.3 Discussion
6 Conclusion
  6.1 Future Work
List of Figures
List of Tables
Acronyms
Bibliography

CHAPTER 1
Introduction

Machine learning has seen an unprecedented rise in various research fields during the last few years. Large-scale distributed computing and advances in hardware manufacturing have allowed machine learning models to become more sophisticated and complex. Multi-billion parameter deep learning models show best-in-class performance in Natural Language Processing (NLP) [BMR+20], fast object detection [BWL20] and various classification tasks [ZHT22; AH22]. Agriculture is one of the areas which profits substantially from the automation possible with machine learning. Large-scale as well as small local farmers are able to survey their fields and gardens with drones or stationary cameras to determine soil and plant condition as well as when to water or fertilize [RRL+20]. Machine learning models play an important role in that process because they allow automated decision-making in real time. While machine learning has been used in large-scale agriculture, it is also a valuable tool for household plants and gardens. By using machine learning to monitor and analyze plant conditions, homeowners can optimize their plant care and ensure their plants are healthy and thriving.

1.1 Motivation and Problem Statement

The challenges to implement an automated system for plant surveying are numerous. First, gathering data in the field requires a network of sensors which are linked to a central server for processing. Since communication between sensors is difficult without proper infrastructure, there is a high demand for processing the data on the sensor itself [MWL22]. Second, differences in local soil, plant and weather conditions require models to be optimized for these diverse inputs. Centrally trained models often lose the nuances present in the data because they have to provide actionable information for a larger area [Awa19]. Third, specialized methods such as hyper- or multispectral imaging in the field provide fine-grained information about the object of interest but come with substantial upfront costs and are of limited interest for gardeners.
To address all of the aforementioned problems, there is a need for an installation which is deployable by homeowners, gathers data using readily available hardware and performs computation on the device without a connection to a central server. The device should be able to visually determine whether the plants in its field of view need water or not and output its recommendation. The recommendation should then serve as a data point on which homeowners can base automated watering of their plants.

The aim of this work is to develop a prototype which can be deployed by gardeners to survey plants and recommend whether to water them or not. To this end, a machine learning model will be trained to first identify the plants in the field of view and then to determine if the plants need water. The model should be suitable for edge devices equipped with a Tensor Processing Unit (TPU) or Graphics Processing Unit (GPU) but with otherwise limited processing capabilities. Examples of such systems include Google's Coral development board and the Nvidia Jetson series of single-board computers (SBCs). The model should make use of state-of-the-art algorithms from either classical machine learning or deep learning. The literature review will yield an appropriate machine learning method. Furthermore, the adaptation of existing object detection models to the domain of plant recognition (transfer learning) may provide higher performance than would otherwise be achievable within the time constraints. The model will be deployed to the SBC and evaluated using established and well-known metrics from the field of machine learning. The evaluation will seek to answer the following questions:

1. How well does the model work in theory and how well in practice?
We will measure the performance of our model with common metrics such as accuracy, F-score, Receiver Operating Characteristic (ROC) curve, Area Under the Curve (AUC), Intersection over Union (IOU) and various mean Average Precision (mAP) measures. These measurements will allow comparisons between our model and existing models. We expect the plant detection part of the model to achieve high scores on the test dataset. However, the classification of plants into stressed and non-stressed will likely prove to be more difficult. The model is limited to physiological markers of water stress and thus will have difficulties with plants which do not overtly display such features. Even though models may work well in theory, some do not easily transfer to practical applications. It is, therefore, important to examine if the model is suited for productive use in the field. The evaluation will contain a discussion of the model's transferability, because theoretical performance does not automatically guarantee real-world performance under different environmental conditions.

2. What are possible reasons for it to work or not work?
Even if a model scores high on performance metrics, there might be a mismatch between how researchers think it achieves its goal and how it actually achieves its goal. The results have to be plausible and explainable in terms of the model's inputs. Otherwise, there can be no confidence in the model's outputs. Conversely, if the model does not work, there must be a reason. We estimate that the curation of the dataset for the training and test phases will play a significant role.
Explanations for model out- or underperformance are likely to be found in the structure and composition of the model’s inputs. 3. What are possible improvements to the system in the future? The previous two questions will yield the data for possible improvements to the model and/or our approach. With the decision to include a plant detection step at the start, we hope to create consistent conditions for the stress classification. A downside to this approach is that errors during detection can be propagated through the system and result in adverse effects to overall performance. Although we estimate this problem to be negligible, additional feedback regarding our approach in this way might offer insight into potential improvements. If the model does not work as well as expected, which changes to the approach will yield a better result? Similarly to the previous question, the answer will likely lie in the dataset. A heavy focus on dataset construction and curation will ensure satisfactory model performance. 1.2 Methodological Approach The methodological approach consists of the following steps: 1. Literature Review: The literature review informs the type of machine learning methods which are later applied during the implementation of the prototype. 2. Dataset Curation: After selecting the methods to use for the implementation, we have to create our own dataset or use existing ones, depending on availability. 3. Model Training: The selected models will be trained with the datasets curated in the previous step. 4. Optimization: The selected models will be optimized with respect to their parameters. 5. Deployment to SBC: The software prototype will be deployed to the SBC. 6. Evaluation: The models will be evaluated extensively and compared to other state-of-the-art systems. During evaluation, the author seeks to provide a basis for answering the research questions. During the literature review, the search is centered around the terms plant classification, plant state classification, plant detection, water stress detection, machine learning agri- culture, crop machine learning and remote sensing. These terms provide a solid basis for 3 1. Introduction understanding the state of the art in plant detection and stress classification. We will use multiple search engines such as Google Scholar, Semantic Scholar, the ACM Digital Library, and IEEE Xplore. It is common to only publish research papers in preprint form in the data science and machine learning fields. For this reason, we will also reference arXiv.org for these papers. The work discovered in this way will also lead to further insights about the type of models which are commonly used. In order to find and select appropriate datasets to train the models on, we will survey the existing big datasets for classes we can use. Datasets such as the Common Objects in Context (COCO) [LMB+15] and PASCAL Visual Object Classes (VOC) [EVW+10] contain the highly relevant class Potted Plant. By extracting only these classes from multiple datasets and concatenating them together, it is possible to create one unified dataset which only contains the classes necessary for training the model. The training of the models will happen in an environment where more computational resources are available than what the SBC offers. We will deploy the final model with the Application Programming Interface (API) to the SBC after training and optimization. Furthermore, training will happen in tandem with a continuous evaluation process. 
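To make the continuous evaluation step concrete, the sketch below computes precision, recall and the weighted F1-score referred to in section 1.1 from raw predictions. It is a minimal plain-Python/NumPy illustration under an assumed label convention (1 = water-stressed, 0 = not stressed) and is not the evaluation code of the prototype.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive_class):
    """Precision, recall and F1 for one class, treated one-vs-rest."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive_class) & (y_true == positive_class))
    fp = np.sum((y_pred == positive_class) & (y_true != positive_class))
    fn = np.sum((y_pred != positive_class) & (y_true == positive_class))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def weighted_f1(y_true, y_pred, classes=(0, 1)):
    """F1 per class, weighted by the number of true samples of each class."""
    y_true = np.asarray(y_true)
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in classes]
    weights = [np.sum(y_true == c) for c in classes]
    return float(np.average(scores, weights=weights))

# Assumed convention: 1 = water-stressed, 0 = not stressed.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_recall_f1(truth, preds, positive_class=1))
print(weighted_f1(truth, preds))
```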
After every iteration of the model, an evaluation run against the test set determines if there has been an improvement in performance. The results of the evaluation feed back into the parameter selection at the beginning of each training phase. Small changes to the training parameters, augmentations or structure of the model are followed by another test phase. The iterative nature of the development of the prototype increases the likelihood that the model’s performance is not only locally maximal but also as close as possible to the global maximum. In the final evaluation phase, we will measure the resulting model against the test set and evaluate its performance with common metrics. The aim is to first provide a solid basis of facts regarding the model(s). Second, the results will be discussed in detail. Third, we will cross-check the results with the hypotheses from section 1.1 and determine whether the aim of the work has been met, and—if not—give reasons for the rejection of all or part of the hypotheses. Overall, the development of our application follows an evolutionary prototyping process [Dav92; SJJ07]. Instead of producing a full-fledged product from the start, development happens iteratively in phases. The main phases and their order for the prototype at hand are: model selection, implementation, and evaluation. The results of each phase—for example, which model has been selected—inform the decisions which have to be made in the next phase (implementation). In other words, every subsequent phase is dependent on the results of the previous phase. All three phases, in turn, constitute one iteration within the prototyping process. At the start of the next prototype, the results of the previous iteration determine the path forward. The decision to use an evolutionary prototyping process follows in large part from the problem to be solved (as specified in section 1.1). Since the critical requirements have been established from the start, it is possible to build a solid prototype from the beginning 4 1.3. Thesis Structure by implementing only those features which are well-understood. The aim is to allow the developer to explore the problem further so that additional requirements which arise during development can be incorporated properly. The prototyping process is embedded within the concepts of the Scientific Method. This thesis not only produces a prototype but also explores the problem of plant detection and classification scientifically. Exploration of the problem requires making falsifiable hypotheses (see section 1.1), gathering empirical evidence (see section 5.2), and accepting or rejecting the initial hypotheses (see section 5.3). Empirical evidence is provided by measuring the model(s) against out-of-sample test sets. This provides the necessary foundation for acceptance or rejection of the hypotheses. 1.3 Thesis Structure The first part of the thesis (chapter 2) contains the theoretical basis of the models which we use for the prototype. Chapter 3 goes into detail about the requirements for the prototype, the overall design and architecture of the recognition and classification pipeline, and the structure and unique properties of the selected models. Chapter 4 expands on how the datasets are used during training as well as how the prototype publishes its classification results. Chapter 5 shows the results of the testing phases as well as the performance of the aggregate model. 
Furthermore, the results are compared with the expectations and it is discussed whether they are explainable in the context of the task at hand as well as benchmark results from other datasets (COCO [LMB+15]). Chapter 6 concludes the thesis with a summary and an outlook on possible improvements and further research questions. 5 CHAPTER 2 Theoretical Background This chapter is split into five parts. First, we introduce general machine learning concepts (section 2.1). Second, we provide a survey of object detection methods from early traditional methods to one-stage and two-stage deep learning based methods (section 2.2). Third, we go into detail about image classification in general and which approaches have been published in the literature (section 2.3). Fourth, we give a short explanation of transfer learning and its advantages and disadvantages (section 2.4). The chapter concludes with a section on hyperparameter optimization (section 2.5). 2.1 Machine Learning The term machine learning was first used by Samuel [Sam59] in 1959 in the context of teaching a machine how to play the game Checkers. Mitchell [Mit97] defines learning in the context of programs as: [Mit97, p.2] A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P , if its performance at tasks in T , as measured by P , improves with experience E. In other words, if the aim is to learn to win at a game, the performance measure P is defined as the ability to win at that game. The tasks in T are playing the game multiple times, and the experience E is gained by letting the program play the game against itself. Machine learning is thought to be a sub-field of Artificial Intelligence (AI). AI is a more general term for the scientific endeavor of creating things which possess the kind of intelligence we humans have. Since those things will not have been created naturally, their intelligence is termed artificial. Within the field of AI there have been other approaches than what is commonly referred to as machine learning today. 7 2. Theoretical Background A major area of interest in the 1980s was the development of expert systems. These systems try to approach problem solving as a rational decision-making process. Starting from a knowledge base, which contains facts and rules about the world and the problem to be solved, the expert system applies an inference engine to arrive at a conclusion. An advantage of these systems is that they can often explain how they came to a particular conclusion, allowing humans to verify and judge the inference process. This kind of explainability is missing in the neural network based approaches of today. However, an expert system needs a significant base of facts and rules to be able to do any meaningful inference. Outside of specialized domains such as medical diagnosis, expert systems have always failed at commonsense reasoning. Machine learning can be broadly divided into two distinct approaches: supervised and unsupervised. Supervised learning describes a process where the algorithm receives input values as well as their corresponding output values and tries to learn the function which maps inputs to outputs. This is called supervised learning because the model knows a target to map to. In unsupervised learning, in contrast, algorithms do not have access to labeled data or output values and therefore have to find patterns in the underlying inputs. 
There can also be mixed approaches, as in semi-supervised learning, where a model receives a small amount of labeled data as an aid to better extract the patterns in the unlabeled data. Which type of learning to apply depends heavily on the problem at hand. Tasks such as image classification and speech recognition are good candidates for supervised learning. If a model is required to generate speech, text or images, an unsupervised approach makes more sense. We will go into detail about the general approach in supervised learning because it is used throughout this thesis when training the models.

2.1.1 Supervised Learning

The overall steps when training a model with labeled data are as follows:

1. Determine which type of problem is to be solved and select adequate training samples.
2. Gather enough training samples and obtain their corresponding targets (labels). This stage usually requires humans to create a body of ground truth with which the model can compare itself.
3. Select the type of representation of the inputs which is fed to the model. The choice of representation depends heavily on the amount of data which the model can process in a reasonable amount of time. For speech recognition, for example, raw waveforms are rarely fed to any classifier. Instead, humans have to select a less granular and more meaningful representation of the waveforms, such as Mel-frequency Cepstral Coefficients (MFCCs). Selecting the representation to feed to the model is also referred to as feature selection or feature engineering.
4. Select the structure of the model or algorithm and the learning function. Depending on the problem, possible choices are Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs) and many more.
5. Train the model on the training set.
6. Validate the results on out-of-sample data by computing common metrics and comparing the results to other approaches.
7. Optionally go back to step 4 to select different algorithms or to train the model with different parameters or adjusted training sets. Depending on the results, one can also employ computational methods such as hyperparameter optimization to find a better combination of model parameters.

These steps are generally the same for every type of supervised or semi-supervised machine learning approach. The implementation for solving a particular problem differs depending on the type of problem, how much data is available, how much can reasonably be labeled, and any other special requirements such as favoring speed over accuracy.

2.1.2 Artificial Neural Networks

Artificial neural networks are the building blocks of most state-of-the-art models in use today. Computer science has adopted the term from biology, where it describes the complex structure in the human brain which allows us to experience and interact with the world around us. A neural network is composed of neurons which act as gatekeepers for the signals they receive. Depending on the inputs—electrochemical impulses, numbers, or other—the neuron excites and produces an output value if the right conditions are met. This output value travels via connections to other neurons and acts as an input on their side. Each neuron and connection between the neurons has an associated weight which changes when the network learns. The weights increase or decrease the signal from the neuron. The neuron itself only passes a signal on to its output connections if the conditions of its activation function have been met. This activation function is typically non-linear.
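As a minimal illustration of the signal flow just described, the following NumPy sketch computes the output of a single artificial neuron: a weighted sum of its inputs plus a bias, passed through a non-linear activation (here a sigmoid, one of the functions discussed in section 2.1.3). All numbers are arbitrary illustrative values, not parameters of any trained network.

```python
import numpy as np

def sigmoid(x):
    """A common non-linear activation; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    weighted_sum = np.dot(weights, inputs) + bias
    return sigmoid(weighted_sum)

x = np.array([0.2, 0.7, 0.1])    # incoming signals from three other neurons
w = np.array([0.5, -1.2, 0.8])   # connection weights adjusted during learning
print(neuron_output(x, w, bias=0.1))
```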
Multiple neurons are usually grouped together to form a layer within the network. Multiple layers are stacked one after the other, with connections in between, to form a neural network. Layers between the input and output layers are commonly referred to as hidden layers. Figure 2.1 shows the structure of a three-layer fully-connected artificial neural network.

[Figure 2.1: Structure of an artificial neural network with an input, a hidden and an output layer. Information travels from left to right through the network using neurons and the connections between them.]

The earliest attempts at describing learning machines were made by McCulloch and Pitts [MP43] with the idea of the perceptron. This idea was implemented in a more general sense by Rosenblatt [Ros57; Ros62] as a physical machine. At its core, the perceptron is the simplest artificial neural network, with only one neuron in the center. The neuron takes all its inputs, aggregates them with a weighted sum, and outputs 1 if the result is above some threshold θ and 0 if it is not (see equation 2.1). This function is called the activation function of the perceptron. A perceptron is a type of binary classifier which can only classify linearly separable variables.

y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i \cdot x_i \geq \theta \\ 0 & \text{if } \sum_{i=1}^{n} w_i \cdot x_i < \theta \end{cases} \qquad (2.1)

Because perceptrons can only classify linearly separable data, Multilayer Perceptrons (MLPs) have become the bedrock of modern artificial neural networks. By adding an input layer, a hidden layer, and an output layer, and by requiring the activation function of each neuron to be non-linear, an MLP can also classify data that is not linearly separable. Every neuron in each layer is fully connected to all of the neurons in the next layer, which makes the MLP the most straightforward case of a feedforward network. Figure 2.1 shows the skeleton of an MLP.

There are two types of artificial neural networks: feedforward and recurrent networks. Their names refer to the way information flows through the network. In a feedforward network, the information enters the network and flows only uni-directionally to the output nodes. In a recurrent network, information can also feed back into previous nodes. Which type is best suited depends on the task at hand. Recurrent networks are usually necessary when context is needed. For example, if the underlying data to classify is a time series, individual data points have some relation to the previous and next points in the series. Maintaining a bit of state is beneficial because the network should be able to capture these dependencies. However, the additional machinery for feeding information back into previous neurons and layers comes with increased complexity. A feedforward network, as depicted in Figure 2.1, represents the simpler structure.

2.1.3 Activation Functions

Activation functions are the functions inside each neuron which receive inputs and produce an output value. The nature of these functions is that they need a certain amount of excitation from the inputs before they produce an output, hence the name activation function. Activation functions are either linear or non-linear. Linear functions are limited in their capabilities because they cannot approximate certain functions. For example, a perceptron, whose single unit outputs a thresholded linear combination of its inputs, cannot approximate the XOR function [MP17]. Non-linear activation functions, however, are a requirement for neural networks to become universal approximators [HSW89].
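The XOR limitation just mentioned can be made concrete by implementing the perceptron rule from equation 2.1. In the sketch below, the weights and threshold are hand-picked illustrative values rather than learned ones; with them the perceptron realizes a logical AND, whereas no choice of weights and threshold can realize XOR, since XOR is not linearly separable.

```python
import numpy as np

def perceptron(x, w, theta):
    """Binary perceptron output (equation 2.1): 1 if the weighted sum of the
    inputs reaches the threshold theta, 0 otherwise."""
    return 1 if np.dot(w, x) >= theta else 0

# Hand-picked illustrative parameters (a trained perceptron would learn these).
w = np.array([1.0, 1.0])
theta = 1.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    # Only (1, 1) reaches the threshold, so this perceptron computes AND.
    print(x, "->", perceptron(np.array(x, dtype=float), w, theta))
```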
We will introduce several activation functions which are used in the field of machine learning in the following sections. Many more exist than can be discussed within the scope of this thesis, but the selection should give an overview of the most widely used and, in the author's opinion, most influential ones.

Identity

The simplest activation function is the identity function. It is defined as

g(x) = x \qquad (2.2)

If all layers in an artificial neural network use the identity activation function, the network is equivalent to a single-layer structure. The identity function is often used for layers which do not need an activation function per se but require one to remain consistent with the rest of the network structure.

Heaviside Step

The Heaviside step function, also known as the unit step function, is a mathematical function that is commonly used in control theory and signal processing to represent a signal that switches on at a specified time and stays on. The function is named after Oliver Heaviside, who introduced it in the late 19th century. It is defined as

H(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases} \qquad (2.3)

In engineering applications, the Heaviside step function is used to describe functions whose values change abruptly at specified values of time t. We have already encountered the Heaviside step function in section 2.1.2 when introducing the perceptron. A network using it can only classify linearly separable variables and it is, therefore, not suitable for capturing complex relationships within the data. A major downside of the Heaviside step function is that it is not differentiable at x = 0 and has a derivative of 0 everywhere else. These properties make it unsuitable for use with gradient descent during backpropagation (section 2.1.5).

Sigmoid

The sigmoid activation function is one of the most important functions for introducing non-linearity into the outputs of a neuron. It is a special case of a logistic function and is used synonymously with the logistic function in machine learning. It is defined as

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.4)

It has a characteristic S-shaped curve, mapping each input value to a number between 0 and 1, regardless of input size. This squashing property is particularly desirable for binary classification problems because the outputs can be interpreted as probabilities. In addition to the squashing property, the sigmoid is also a saturating function: large positive values map to values close to 1 and large negative values to values close to 0. Saturated neurons are problematic during training because their gradients are close to zero and their outputs provide little information for the weight updates. In contrast to the Heaviside step function, the sigmoid is differentiable, which allows it to be used with gradient descent optimization algorithms. Unfortunately, the sigmoid function exacerbates the vanishing gradient problem, which makes it unsuitable for training deep neural networks.

Rectified Linear Unit

The Rectified Linear Unit (ReLU) function is defined as

f(x) = \max(0, x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \qquad (2.5)

which means that it returns the input value if it is positive and returns zero if it is negative. It was first introduced by Fukushima [Fuk69] in a modified form to construct a visual feature extractor. The ReLU function is nearly linear, and it thus preserves many of the properties that make linear models easy to optimize with gradient-based methods [GBC16].
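The definitions in equations 2.3 to 2.5 translate directly into code; the following NumPy sketch implements the Heaviside step, sigmoid and ReLU functions purely for illustration (the test values are arbitrary).

```python
import numpy as np

def heaviside(x):
    """Unit step (equation 2.3): 1 for x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    """Logistic sigmoid (equation 2.4): squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit (equation 2.5): element-wise max(0, x)."""
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(heaviside(z))   # [0. 0. 1. 1. 1.]
print(sigmoid(z))     # S-shaped values strictly between 0 and 1
print(relu(z))        # [0.  0.  0.  0.5 2. ]
```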
In contrast to the sigmoid activation function, the ReLU function partially mitigates the vanishing gradient problem and is therefore suitable for training deep neural networks. Furthermore, the ReLU function is cheaper to compute than the sigmoid function, which allows networks to be trained more quickly. Even though it is not differentiable at 0, it is differentiable everywhere else and is commonly used with gradient descent during optimization.

The ReLU function suffers from the dying ReLU problem, which can cause some neurons to become permanently inactive. Large gradients, which are passed back through the network to update the weights, are typically the source of this. If many neurons are pushed into this state, the model's capability of learning new patterns is diminished. There are two ways to address this problem. One solution is to make sure that the learning rate is not set too high, which reduces the problem but does not fully remove it. Another solution is to use one of the several variants of the ReLU function, such as leaky ReLU, the Exponential Linear Unit (ELU), and the Sigmoid Linear Unit (SiLU). In recent years, the ReLU function has become the most popular activation function for deep neural networks and is recommended as the default activation function in modern neural networks [GBC16]. Despite its limitations, the ReLU function has become an essential tool for deep learning practitioners and has contributed to the success of many state-of-the-art models in computer vision, natural language processing, and other domains.

Softmax

The softmax activation function is often used as the last activation function of a neural network to normalize the output of the network into a probability distribution over the predicted output classes. It takes a vector of numbers, known as logits, and scales them into probabilities. The output of the softmax function is a vector with one probability per possible outcome, and the probabilities in the vector sum to one. In mathematical terms, the function is defined as

\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } \vec{z} = (z_1, \ldots, z_K) \in \mathbb{R}^K \qquad (2.6)

where the standard exponential function is applied to each value in the vector \vec{z} and the result is normalized by the sum of the exponentials.

2.1.4 Loss Function

Loss functions play a fundamental role in machine learning, as they are used to evaluate the performance of a model and guide its training. The choice of loss function can significantly impact the accuracy and generalization of the model. There are various types of loss functions, each with its strengths and weaknesses, and the appropriate choice depends on the specific problem being addressed. In terms of the definition of a learning program from section 2.1, the loss function constitutes the performance measure P against which the results of the learning program are measured. Only by minimizing the error obtained from the loss function and updating the weights within the network is it possible to gain experience E at carrying out a task T. How the weights are updated depends on the algorithm which is used during the backward pass to minimize the error; this procedure is referred to as backpropagation (see section 2.1.5).

One common type of loss function is the mean squared error (MSE), which is widely used in regression problems. The MSE is a popular choice because it is easy to compute and has a closed-form solution, making it efficient to optimize. It does have some limitations, however.
For instance, it is sensitive to outliers, and it may not be appropriate for problems with non-normal distributions. MSE measures the average squared difference between predicted and actual values. It is calculated as

\mathrm{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{y}^{(\text{test})} - y^{(\text{test})} \right)_i^2 \qquad (2.7)

where \hat{y}^{(\text{test})} contains the predictions of the model on the test set and y^{(\text{test})} refers to the target labels [GBC16]. It follows that, if \hat{y}^{(\text{test})} = y^{(\text{test})}, the error is 0 and the model has produced a perfect prediction. We cannot, however, use the error on the test set to update the weights during training, because the test set must only contain samples which the model has not seen before. If the model is trained to minimize the MSE on the test set and then evaluated against the same set, the results show how well the model fits the test set and not how well it generalizes. The goal, therefore, is to minimize the error on the training set and to compare the results against an evaluation on the test set. If the model achieves very low error rates on the training set but not on the test set, it is likely suffering from overfitting. Conversely, if the model does not achieve low error rates even on the training set, it is likely suffering from underfitting.

Goodfellow, Bengio, and Courville [GBC16] write on MSE: "MSE was popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community" [GBC16, p.222]. Cross-entropy measures the difference in information between two distinct probability distributions. Specifically, it quantifies the average number of bits needed to encode an event drawn from the first probability distribution when using a code that is optimal for the second. In the case of binary random variables, i.e. when only two classes are to be classified, the measure is called binary cross-entropy. Cross-entropy loss is known to outperform MSE for classification tasks and allows the model to be trained faster [SSP03].

2.1.5 Backpropagation

So far, information only flows forward through the network whenever a prediction for a particular input should be made. In order for a neural network to learn, information about the computed loss has to flow backward through the network. Only then can the weights at the individual neurons be updated. This type of information flow is termed backpropagation [RHW86]. Backpropagation computes the gradient of a loss function with respect to the weights of a network for an input-output pair. The algorithm computes the gradient iteratively, starting from the last layer and working its way backward through the network until it reaches the first layer. Strictly speaking, backpropagation only computes the gradient but does not determine how the gradient is used to learn the new weights. Once the backpropagation algorithm has computed the gradient, that gradient is passed to an optimization algorithm which uses it to move the weights toward a local minimum of the loss. This step is usually performed by some variant of gradient descent [Cau47]. A minimal numerical sketch of this loss-and-update loop is given below.

2.2 Object Detection

From facial detection to fully automated driving, object detection provides the basis for a wide variety of tasks within the computer vision world. While most implementations in the 1990s and early 2000s relied on cumbersome manual feature extraction, current methods almost exclusively leverage a deep learning based approach.
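Before continuing with object detection, here is the numerical sketch promised above. It ties sections 2.1.4 and 2.1.5 together by fitting a single linear unit to a toy dataset, repeatedly computing the MSE from equation 2.7 on the training data and taking one gradient-descent step per iteration. The data, learning rate and iteration count are illustrative assumptions and have nothing to do with the prototype.

```python
import numpy as np

# Toy training set generated by y = 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0            # initial weight and bias
learning_rate = 0.05

for step in range(200):
    y_hat = w * x + b                       # forward pass (prediction)
    loss = np.mean((y_hat - y) ** 2)        # mean squared error
    # For a single linear unit, backpropagation reduces to these two
    # partial derivatives of the MSE with respect to w and b.
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    # Gradient-descent update: step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2), round(loss, 4))   # w and b approach 2 and 1
```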
This chapter gives an introduction to object detection, explains common problems researchers have faced and how they have been solved, and discusses the two main approaches to object detection via deep learning. 2.2.1 Traditional Methods Before the advent of powerful GPUs, object detection was commonly done by manually extracting features from images and passing these features on to a classical machine learning algorithm. Early methods were generally far from being able to detect objects in real time. Viola-Jones Detector The first milestone was the face detector by Viola and Jones [VJ01; VJ01] which is able to perform face recognition on 384 by 288 pixel (grayscale) images with 15 fps on a 700 MHz Intel Pentium III processor. The authors use an integral image representation where every pixel is the summation of the pixels above and to the left of it. This representation allows them to quickly and efficiently calculate Haar-like features. The Haar-like features are passed to a modified AdaBoost algorithm [FS95] which only selects the (presumably) most important features. At the end there is a cascading stage of classifiers where regions are only considered further if they are promising. Every additional classifier adds complexity, but once a classifier rejects a sub-window, the processing stops and the algorithm moves on to the next window. Despite their final structure containing 32 classifiers, the sliding-window approach is fast and achieves comparable results to the state of the art in 2001. HOG Detector The Histogram of Oriented Gradients (HOG) [DT05] is a feature descriptor used in computer vision and image processing to detect objects in images. It is a detector which detects shape like other methods such as Scale-Invariant Feature Transform (SIFT) [Low99]. The idea is to use the distribution of local intensity gradients or edge directions to describe an object. To this end, the authors divide the image into a grid of cells and calculate a histogram of edge orientations within each cell. Additionally, each histogram is normalized by taking a larger region and adjusting the local histograms based on the 15 2. Theoretical Background larger region’s intensity levels. The resulting blocks of normalized gradients are evenly spaced out across the image with some overlap. These patches are then passed as a feature vector to a classifier. Dalal and Triggs [DT05] successfully use the HOG with a linear SVM for classification to detect humans in images. They work with images of 64 by 128 pixels and make sure that the image contains a margin of 16 pixels around the person. Decreasing the border by either enlarging the person or reducing the overall image size results in worse performance. Unfortunately, their method is far from being able to process images in real time—a 320 by 240 image takes roughly a second to process. Deformable Part-Based Model Deformable Part-Based Models (DPMs) [FMR08] were the winners of the VOC challenge in the years 2007, 2008, and 2009. The method is heavily based on the previously discussed HOG since it also uses HOG descriptors internally. The authors addition is the idea of learning how to decompose objects during training and classifying/detecting the decomposed parts during inference. The HOG descriptors are computed on different scales to form a HOG feature pyramid. Coarse features are more easily identified at the top of the pyramid while details are present at the lower end of the pyramid. 
The coarse features are obtained by calculating the histograms over fairly large areas, whereas smaller image patches are used for the detailed levels. A root filter works on the coarse levels by detecting general features of the object of interest. If the goal is to detect a face, for example, the root filter detects the contours of the face. Smaller part filters provide additional information about the individual parts of the object. For the face example, these filters capture information about the eyes, mouth and nose. The idea of detecting detail at different scales is not unlike what happens with the later CNNs. The individual layers of a CNN often describe higher level features in the earlier layers and provide additional lower level information as the network increases in depth. Girshick et al. [GID+15] argue that DPMs are in fact CNNs because they can be formulated as CNNs by unrolling each step of the algorithm into a corresponding CNN layer. 2.2.2 Deep Learning Based Methods After the publication of the DPM, the field of object detection did not make significant advances regarding speed or accuracy until 2012. Only the (re-)introduction of CNNs by Krizhevsky, Sutskever, and Hinton [KSH12] with their AlexNet architecture and their subsequent win of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 gave the field a new influx of ideas. The availability of the 12 × 106 labeled images in the ImageNet dataset [DDS+09] allowed a shift from focusing on better methods to being able to use more data to train models. Earlier models had difficulties with making use of the large dataset since training was unfeasible. AlexNet, however, provided an architecture which was able to be trained on two GPUs within six days. For an in depth 16 2.2. Object Detection overview of AlexNet see section 2.3.2. Object detection networks from 2014 onward either follow a one-stage or two-stage detection approach. The following sections go into detail about each model category. 2.2.3 Two-Stage Detectors As their name implies, two-stage detectors consist of two stages which together form a complete object detection pipeline. Commonly, the first stage extracts Regions of Interest (ROIs) which might contain relevant objects to detect. The second stage operates on the extracted ROIs and returns a vector of class probabilities. Since the computation in the second stage is performed for every ROI, two-stage detectors are often not as efficient as one-stage detectors. R-CNN Girshick et al. [GDD+14] were the first to propose using feature representations of CNNs for object detection. Their approach consists of generating around 2000 region proposals and passing these on to a CNN for feature extraction. The fixed-length feature vector is used as input for a linear SVM which classifies the region. They name their method R-CNN, where the R stands for region. R-CNN uses selective search to generate region proposals [UvdSG+13].The authors use selective search’s fast mode to generate the 2000 proposals and warp (i.e. aspect ratios are not retained) each proposal into the image dimensions required by the CNN. The CNN, which matches the architecture of AlexNet [KSH12], generates a 4096-dimensional feature vector and each feature vector is scored by a linear SVM for each class. Scored regions are selected/discarded by comparing each region to other regions within the same class and rejecting them if there exists another region with a higher score and greater IOU than a threshold. 
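The score-and-IOU-based rejection described above is a greedy non-maximum suppression step. The following NumPy sketch shows the idea for boxes given in (x1, y1, x2, y2) format; the boxes, scores and threshold are invented for illustration and do not reproduce R-CNN's actual implementation details.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes; drop any box that overlaps an
    already-kept box by more than iou_threshold."""
    order = np.argsort(scores)[::-1]     # indices sorted by score, descending
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 160]], float)
scores = np.array([0.9, 0.75, 0.8])
print(non_max_suppression(boxes, scores))  # keeps boxes 0 and 2; box 1 overlaps box 0
```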
The linear SVM classifiers are trained to only label a region as positive if the overlap, as measured by IOU, is above 0.3. While the approach of generating region proposals is not new, using a CNN purely for feature extraction is. Unfortunately, R-CNN is far from being able to operate in real time. The authors report that it takes 13 s/image on a GPU and 53 s/image on a Central Processing Unit (CPU) to generate the region proposals and feature vector. In some sense, these processing times are a step backward from the DPMs introduced in section 2.2.1. However, the authors showed that CNNs can function perfectly well as feature extractors, even if their processing performance is not yet up to par with traditional methods. Furthermore, R-CNN crushes DPMs on the VOC 2007 challenge with a mAP of 58.5% [GDD+14] versus 33.7% (DPM-v5 [GFM; FGM+10]). This was enough to spark renewed interest in CNNs and—with better availability of large datasets and GPU processing capabilities—opened the way for further research in that direction. 17 2. Theoretical Background SPP-net A year after the publication of R-CNN, He et al. [HZR+15] introduce the concept of Spatial Pyramid Pooling (SPP) to allow CNNs to accept arbitrarily sized instead of fixed-size input images. They name their method SPP-net and it outputs a fixed-length feature vector of the input image. SPP layers operate in-between the convolutional and fully-connected layers of a CNN. Since the fully-connected layers require fixed-size inputs but the convolutional layers do not, SPP layers aggregate the information from convolutional layers and pass the resulting fixed-size outputs to the fully-connected layers. This approach allows only passing the full image through the convolutional layers once and calculating features with the SPP layer from these results. This avoids the redundant computations for each ROI present in R-CNN and provides a speedup of 24-102 times while achieving even better metrics on the VOC 2007 data set at a mAP of 59.2%. Fast R-CNN Fast R-CNN was proposed by Girshick [Gir15] to fix the three main problems R-CNN and SPP-net have. The first problem is that the training for both models is multi-stage. R-CNN fine-tunes the convolutional network which is responsible for feature extraction and then trains SVMs to classify the feature vectors. The third stage consists of training the bounding box regressors. The second problem is the training time which is on the order of multiple days for deep convolutional networks. The third problem is the processing time per image which is (depending on the convolutional network) upwards of 13 s/image. Fast R-CNN deals with these problems by having an architecture which allows it to take in images and object proposals at once and process them simultaneously to arrive at the results. The outputs of the network are the class an object proposal belongs to and four scalar values representing the bounding box of the object. Unfortunately, this approach still requires a separate object proposal generator such as selective search [UvdSG+13]. Faster R-CNN Faster R-CNN [RHG+15; RHG+17]—as the name implies—is yet another improvement building on R-CNN, SPP-net and Fast R-CNN. Since the bottleneck in performance with previous approaches has been the object proposal generator, the authors of Faster R-CNN introduce a Region Proposal Network (RPN) to predict bounding boxes and objectness in one step. As with previous networks, the proposals are then passed to the detection network. 
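The fixed-size pooling trick introduced by SPP-net, and reused in simplified single-level form as ROI pooling in Fast and Faster R-CNN, can be sketched in a few lines: a region of a feature map of arbitrary size is max-pooled over an evenly divided grid, yielding a feature vector of constant length. The grid size, feature map and function name below are illustrative assumptions, not the implementation of any of these networks (SPP additionally concatenates several grid resolutions into a pyramid).

```python
import numpy as np

def pool_to_fixed_grid(feature_map, out_size=4):
    """Max-pool an H x W feature map into a fixed out_size x out_size grid,
    regardless of the input dimensions."""
    h, w = feature_map.shape
    # Bin edges splitting the map into roughly equal parts along each axis.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                               xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled.ravel()                  # fixed-length feature vector

rng = np.random.default_rng(1)
print(pool_to_fixed_grid(rng.normal(size=(13, 9))).shape)   # (16,)
print(pool_to_fixed_grid(rng.normal(size=(37, 52))).shape)  # (16,) as well
```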
RPNs work by using the already present convolutional features in Fast R-CNN and adding additional layers on top to also regress bounding boxes and objectness scores per location. Instead of relying on a pyramid structure such as with SPP-net (see section 2.2.3), RPNs use anchor boxes as a basis for the bounding box regressor. These 18 2.2. Object Detection anchor boxes are predefined for various scales and aspect ratios and serve as starting points for the regressor to properly fit a bounding box around an object. The RPN makes object proposal generation inexpensive and possible on GPUs. The whole network operates on an almost real time scale by being able to process 5 images/s and maintaining high state-of-the-art mAP values of 73.2% (VOC 2007). If the detection network is switched from VGGNet [LD15] to ZF-Net [ZF14], Faster R-CNN is able to achieve 17 images/s, albeit at a lower mAP of 59.9%. Feature Pyramid Network Feature Pyramid Networks (FPNs) were first introduced by Lin et al. [LDG+17] to use the hierarchical pyramid structure inherent in CNNs to compute feature maps on different scales. Previously, detectors were only using the features of the top most (coarse) layers because it was computationally too expensive to use lower (fine-grained) layers. By leveraging feature maps on different scales, FPNs are able to better detect small objects because predictions are made independently on all levels. FPNs are an important building block of many state-of-the-art object detectors. A FPN first computes the feature pyramid bottom-up with a scaling step of two. The lower levels capture less semantic information than the higher levels but include more spatial information due to the higher granularity. In a second step, the FPN upsamples the higher levels such that the dimensions of two consecutive layers are the same. The upsampled top layer is merged with the layer beneath it via element-wise addition and convolved with a one by one convolutional layer to reduce channel dimensions and to smooth out potential artifacts introduced during the upsampling step. The results of that operation constitute the new top layer and the process continues with the layer below it until the finest resolution feature map is generated. In this way, the features of the different layers at different scales are fused to obtain a feature map with high semantic information but also high spatial information. Lin et al. [LDG+17] report results on COCO with a mAP@0.5 of 59.1% with a Faster R- CNN structure and a Residual Neural Network (ResNet)-101 backbone. Their submission does not include any specific improvements such as hard negative mining [SGG16] or data augmentation. 2.2.4 One-Stage Detectors One-stage detectors, in contrast to two-stage detectors, combine the proposal generation and detection tasks into one neural network such that all objects can be retrieved in a single step. Since the proposal generation in two-stage detectors is a costly operation and usually the bottleneck, one-stage detectors are significantly faster overall. Their speeds allow them to be deployed to low-resource devices such as mobile phones while still providing real time object detection. Unfortunately, their detection accuracy trailed the two-stage approaches for years, especially for small and/or dense objects. 19 2. Theoretical Background You Only Look Once You Only Look Once (YOLO) was the first one-stage detector introduced by Redmon et al. [RDG+16]. 
It divides each image into regions and predicts bounding boxes and classes of objects simultaneously. This allows it to be extremely fast at up to 155 fps with a mAP of 52.7% on VOC 2007. The accuracy results were not state of the art at the time because the architecture trades localization accuracy for speed, especially for small objects. These issues have been gradually dealt with in later versions of YOLO as well as in other one-stage detectors such as Single Shot MultiBox Detector (SSD). Since a later version of YOLO is used in this work, we refer to section 3.3.1 for a thorough account of its architecture. Single Shot MultiBox Detector SSD was proposed by Liu et al. [LAE+16] and functions similarly to YOLO in that it does not need an extra proposal generation step but instead detects and classifies objects in one go. The aim of one-stage detectors is to be considerably faster and at least as accurate as two-stage detectors. While YOLO paved the way for one-stage detectors, the detection accuracy is significantly lower than state-of-the-art two-stage detection approaches such as Faster RCNN. SSD combines generating detections on multiple scales and an end-to-end architecture to achieve high accuracy as well as high speed. SSD is based on a standard CNN such as VGG16 [LD15] and adds additional feature layers to the network. The CNN, which the detector is using to extract features, has its last fully-connected layer removed such that the output of the CNN is a scaled down representation of the input image. The extra layers are intended to capture features at different scales and compare them during training to a range of default anchor boxes. This idea comes from MultiBox [EST+14] but is implemented in SSD with a slight twist: during matching of default boxes to the ground truth, boxes with a Jaccard overlap (IOU) of less than 0.5 are discarded. In one-stage detector terms, the feature extractor is the backbone whereas the extra layers constitute the head of the network. The outputs of the extra layers contain features for smaller regions with higher spatial information. Making use of these additional feature maps is what sets SSD apart from YOLO and results in SSD being able to detect smaller and denser objects as well. The authors report results on VOC 2007 for their SSD300 and SSD512 model varieties. The number refers to the size of the input images. SSD300 outperforms Fast R-CNN by 1.1 percentage points (mAP 66.9% vs 68%). SSD512 outperforms Faster R-CNN by 1.7% mAP. If trained on the VOC 2007, 2012 and COCO train sets, SSD512 achieves a mAP of 81.5% on the VOC 2007 test set. SSD’s speed is at 46 fps which, although lower than Fast YOLO’s 155 fps, is still in real time. Furthermore, SSD has a mAP which is almost 22% higher than Fast YOLO. 20 2.3. Image Classification RetinaNet One-stage detectors before 2017 always trailed the accuracy of top two-stage detectors on common and difficult benchmark datasets such as COCO. Lin et al. [LGG+17] investigated what the culprit for the lower accuracy scores could be and found that the severe class imbalance between foreground and background instances is the problem. They introduce a novel loss function called Focal Loss which replaces the standard cross-entropy loss. Focal loss down-weights the importance of easy negative examples during training and instead focuses on instances which are harder but provide more information. 
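A minimal sketch of this idea for the binary case may help (assuming PyTorch and the commonly reported default values α = 0.25 and γ = 2; this is an illustration, not the authors' reference implementation):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets are expected as floats in {0, 1}, with the same shape as logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss contribution of easy, well-classified examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()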
Focal loss is based on cross-entropy loss but includes a scaling factor which decreases while the classification confidence increases. In other words, if the confidence that an object belongs to a particular class is already high, focal loss outputs a small value such that the weight updates during backpropagation are only marginally affected by the current example. The model can thus focus on examples which are harder to achieve a good confidence score on. Lin et al. [LGG+17] implement their focal loss with a simple one-stage detector called RetinaNet. It makes use of previous advances in object detection and classification by including a FPN on top of a ResNet [HZR+16] as the backbone and using anchors for the different levels in the feature pyramid. Attached to the backbone are two subnetworks which classify anchor boxes and regress them to the ground truth boxes. The results are that the RetinaNet-101-500 version (with an input size of 500 px) achieves a mAP of 34.4% at a speed of around 11 fps on the COCO dataset. 2.3 Image Classification Image classification, in contrast to object detection, is a slightly easier task because there is no requirement to localize objects in the image. Instead, image classification operates always on the image as a whole rather than individual parts of it. As has been demonstrated in the last chapter, object detection methods often rely on advances in image classification to accurately detect objects. After objects have been localized, we humans want to know what kind of object it is and that is where image classification methods become useful. This section goes into detail about various image classification methods. We first give a short summary on how image classification was commonly done before CNNs became the de facto standard. Afterwards, we will introduce common and influential approaches leveraging CNNs and discuss problems and solutions for training large networks. 2.3.1 Traditional Methods Similarly to early object detection algorithms, traditional methods rely on manual feature extraction and subsequent classification with classical algorithms. Passing raw images to the algorithms is often not feasible due to the immense information contained in just one 21 2. Theoretical Background image. Furthermore, a raw image contains a signal to noise ratio which is too low for a computer to successfully learn properties about the image. Instead, humans—with the aid of image processing methods—have to select a lower-dimensional representation of the input image and then pass this representation to a classifier. This process of manually reducing the dimensions and complexity of an image to the part which is relevant is termed feature engineering. Manual feature engineering requires selecting an appropriate representation for the task at hand. For example, if the task is to classify images which show an object with a special texture, a feature engineer will likely select an image representation which clearly pulls the texture into the foreground. In other words, engineers help the classifier by preprocessing the image such that the most discriminative features are easily visible. The methods with which an image representation is created is called feature descriptor. In line with the different ways objects can present themselves on images, there have been many feature descriptors proposed. 
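As an illustration of such a traditional pipeline, the following sketch computes HOG descriptors and trains a linear SVM on them (assuming scikit-image 0.19 or newer and scikit-learn; train_images and train_labels are hypothetical placeholders for a labeled dataset):

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def extract_hog_features(images):
    # Reduce every image to a fixed-length, hand-crafted HOG descriptor.
    features = []
    for image in images:
        image = resize(image, (128, 128))
        features.append(hog(image, orientations=9, pixels_per_cell=(8, 8),
                            cells_per_block=(2, 2), channel_axis=-1))
    return np.array(features)

# train_images: list of RGB arrays, train_labels: array of class indices.
# classifier = LinearSVC().fit(extract_hog_features(train_images), train_labels)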
Most of the feature descriptors used in object detection are also used in image classification (see HOG and SIFT from section 2.2.1) because their representational power is useful in both domains. 2.3.2 Deep Learning Based Methods Manual feature engineering is a double-edged sword. Although it allows to have a high amount of control, it also necessitates the engineer to select a meaningful representation for training the downstream classifier. Often, humans make unconscious assumptions about the problem to be solved as well as the available data and how best to extract features. These assumptions can have a detrimental effect on classification accuracy later on because the best-performing feature descriptor lies outside of the engineer’s purview. Therefore, instead of manually preparing feature vectors for the classifier, researchers turned to allowing an Artificial Neural Network (ANN) to recognize and extract the most relevant aspects of an image on its own, without human intervention. Attention is thus mostly given to the structure of the ANN and less to the preparation of inputs. The idea of automatic generation of feature maps via ANNs gave rise to CNNs. Early CNNs [LBD+89] were mostly discarded for practical applications because they require much more data during training than traditional methods and also more processing power during inference. Passing 224 by 224 pixel images to a CNN, as is common today, was simply not feasible if one wanted a reasonable inference time. With the development of GPUs and supporting software such as the Compute Unified Device Architecture (CUDA) toolkit, it was possible to perform many computations in parallel. The architecture of CNNs lends itself well to parallel processing and thus CNNs slowly but surely overtook other image classification methods. LeNet-5 LeNet-5, developed and described by Lecun et al. [LBB+98], laid the foundation of CNNs as we still use them today. The basic structure of convolutional layers with pooling 22 2.3. Image Classification layers in-between and one or more fully-connected layers at the end has been iterated on many times since then. LeCun et al. [LBD+89] introduced the first version of LeNet when describing their system for automatic handwritten zip code recognition. They applied backpropagation with Stochastic Gradient Descent (SGD) and used the scaled hyperbolic tangent as the activation function. The error function with which the weights are updated is MSE. The architecture of LeNet-5 is composed of two convolutional layers, two pooling layers and a dense block of three fully-connected layers. The input image is a grayscale image of 32 by 32 pixels. The first convolutional layer generates six feature maps, each with a scale of 28 by 28 pixels. Each feature map is fed to a pooling layer which effectively downsamples the image by a factor of two. By aggregating each two by two area in the feature map via averaging, the authors are more likely to obtain relative (to each other) instead of absolute positions of the features. To make up for the loss in spatial resolution, the following convolutional layer increases the amount of feature maps to 16 which aims to increase the richness of the learned representations. Another pooling layer follows which reduces the size of each of the 16 feature maps to five by five pixels. A dense block of three fully-connected layers of 120, 84 and 10 neurons serves as the actual classifier in the network. 
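A sketch of this layer layout in PyTorch terms may make the structure more concrete (the original used scaled hyperbolic tangent activations and an RBF output layer, as described next; the sketch simplifies both):

import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 32x32 grayscale input -> 6 feature maps of 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                   # downsample by a factor of two -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # -> 16 feature maps of 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                   # -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # dense block of 120, 84 and 10 neurons
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                 # ten digit classes
)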
The last layer uses the euclidean Radial Basis Function (RBF) to compute the class an image belongs to (0-9 digits). The performance of LeNet-5 was measured on the Modified National Institute of Standards and Technology (MNIST) database which consists of 70 000 labeled images of handwritten digits. The error rate on the test set is 0.95%. This result is impressive considering that character recognition with a CNN had not been done before. However, standard machine learning methods of the time, such as manual feature engineering and SVMs, achieved a similar error rate, even though they are much more memory-intensive. LeNet-5 was conceived to take advantage of the (then) large MNIST database. Since there were not many datasets available at the time, especially with more samples than in the MNIST database, CNNs were not widely used even after their viability had been demonstrated by Lecun et al. [LBB+98]. Only in 2012 did Krizhevsky, Sutskever, and Hinton [KSH12] reintroduce CNNs (see section 2.2.2), and since then most state-of-the-art image classification methods have used them. AlexNet AlexNet's main contributions are the use of ReLUs, training on multiple GPUs, Local Response Normalization (LRN) and overlapping pooling [KSH12]. As mentioned in section 2.1.3, ReLUs introduce non-linearity into the network. Instead of using the traditional non-linear activation function tanh, where the output is bounded between −1 and 1, ReLUs allow the outputs to grow as large as training requires. Normalization before an activation function is usually used to prevent the neuron from saturating, as would be the case with tanh. Even though ReLUs do not suffer from saturation, the authors found that LRN reduces the top-1 error rate by 1.4% [KSH12]. Overlapping pooling, in contrast to regular pooling, lets neighboring pooling windows overlap so that a single dominant pixel value does not determine the output of an entire window on its own. By smoothing out the pooled information, bias is reduced and networks are slightly more resilient to overfitting. Overlapping pooling reduces the top-1 error rate by 0.4% [KSH12]. In aggregate, these improvements result in a top-5 error rate of 16.4%, well below the 25% mark of competing approaches. These results demonstrated that CNNs can extract highly relevant feature representations from images. While AlexNet was only concerned with the classification of images, it did not take long for researchers to apply CNNs to the problem of object detection. ZFNet ZFNet's [ZF14] contributions to the image classification field are twofold. First, the authors develop a way to visualize the internals of a CNN with the use of deconvolution techniques. Second, with the added knowledge gained from looking inside a CNN, they improve AlexNet's structure. The deconvolution technique is essentially the reverse operation of a CNN layer. Instead of pooling (downsampling) the results of the layer, Zeiler and Fergus [ZF14] unpool the max-pooled values by recording the position of the maximum value per pooling region. During unpooling, the maximum values are then placed back at their recorded positions within each two by two area (depending on the pooling kernel size). This process loses information because a max-pooling layer is not invertible. The subsequent ReLU function can be easily inverted because negative values are squashed to zero and positive values are retained. The final deconvolution operation concerns the convolutional layer itself. In order to reconstruct the original spatial dimensions (before convolution), a transposed convolution is performed. This process reverses the downsampling which happens during convolution.
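The following sketch shows these two reverse operations in isolation (assuming PyTorch; it is only an illustration of the idea, not ZFNet's actual implementation):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# Unpooling: remember where each maximum came from and put it back there.
pool = nn.MaxPool2d(2, return_indices=True)
unpool = nn.MaxUnpool2d(2)
pooled, indices = pool(x)              # 32x32 -> 16x16
restored = unpool(pooled, indices)     # maxima placed back at their recorded positions

# Transposed convolution: reverse the spatial downsampling of a strided convolution.
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)               # 32 -> 16
deconv = nn.ConvTranspose2d(8, 3, kernel_size=3, stride=2, padding=1,
                            output_padding=1)                            # 16 -> 32
reconstructed = deconv(conv(x))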
With these techniques in place, the authors visualize the first and second layers of the feature maps present in AlexNet. They identify multiple problems with AlexNet's structure such as aliasing artifacts and a mix of low and high frequency information without any mid frequencies. These results indicate that the filter size in AlexNet is too large at 11 by 11 and the authors reduce it to seven by seven. Additionally, they modify the original stride of four to two. These two changes result in an improvement in the top-5 error rate of 1.6 percentage points over their own replicated AlexNet result of 18.1%. GoogLeNet GoogLeNet, also known as Inception v1, was proposed by Szegedy et al. [SLJ+15] to increase the depth of the network without introducing too much additional complexity. Since the relevant parts of an image can often be of different sizes but kernels within convolutional layers are fixed, there is a mismatch between what can realistically be detected by the layers and what is present in the dataset. Therefore, the authors propose to perform multiple convolutions with different kernel sizes in parallel and to concatenate their outputs before sending the result to the next layer. Unfortunately, three by three and five by five kernel sizes within a convolutional layer can make the network too expensive to train. The authors add one by one convolutions to the outputs of the previous layer before passing the result to the three by three and five by five convolutions. The one by one convolutions have the effect that the channels of the inputs (feature maps) are reduced and are thus easier to process by the subsequent larger filters. GoogLeNet consists of nine Inception modules stacked one after the other and a stem with convolutions at the beginning as well as two auxiliary classifiers which help retain the gradient during backpropagation. The auxiliary classifiers are only used during training. The authors submitted multiple model versions to the 2014 ILSVRC and their ensemble prediction model consisting of seven GoogLeNets achieved a top-5 error rate of 6.67%, which resulted in first place. VGGNet In the quest for ever-more layers and deeper networks, Simonyan and Zisserman [SZ15] propose an architecture which is based on small-resolution kernels (receptive fields) for each convolutional layer. They make extensive use of stacked three by three kernels and one by one convolutions with ReLUs in-between to decrease the number of parameters. Their choice relies on the fact that two stacked three by three convolutional layers have an effective receptive field of one five by five layer. The advantage is that they introduce additional non-linearities by having two ReLUs instead of only one. The authors provide five different networks with an increasing number of parameters based on these principles. The smallest network has a depth of eight convolutional layers and three fully-connected layers for the head (11 in total). The largest network has 16 convolutional and three fully-connected layers (19 in total). The fully-connected layers are the same for each architecture, only the layout of the convolutional layers varies. The deepest network with 19 layers achieves a top-5 error rate of 9% on ILSVRC 2014. If trained with different image scales in the range of S ∈ [256, 512], the same network achieves a top-5 error rate of 8% (test set at scale 256).
By combining their two largest architectures and multi-crop as well as dense evaluation, they achieve an ensemble top-5 error rate of 6.8%, while their best single network with multi-crop and dense evaluation results in 7%, thus beating the single-net submission of GoogLeNet (see section 2.3.2) by 0.9%. ResNet The 22-layer structure of GoogLeNet [SLJ+15] and the 19-layer structure of VGGNet [SZ15] showed that going deeper is beneficial for achieving better classification performance. However, the authors of VGGNet already note that stacking even more layers does not lead to better performance because the model is saturated. He et al. [HZR+16] provide a solution to the vanishing gradient as well as the degradation problem by introducing skip connections to the network. They call their resulting network architecture ResNet and since it is used in this work, we will give a more detailed account of its structure in section 3.3.2. 25 2. Theoretical Background DenseNet The authors of DenseNet [HLV+17] go one step further than ResNets by connecting every convolutional layer to every other layer in the chain. Previously, each layer was connected in sequence with the one before and the one after it. Residual connections establish a link between the previous layer and the next one but still do not always propagate enough information forward. These shortcut connections from earlier layers to later layers are thus only taking place in an episodic way for short sections in the chain. DenseNets are structured in a way such that every layer receives the feature map of every previous layer as input. In ResNets, information from previous layers is added on to the next layer via element-wise addition. DenseNets concatenate the features of the previous layers. The number of feature maps per layer has to be kept low so that the subsequent layers can still process their inputs. Otherwise, the last layer in each dense block would receive too many channels which increases computational complexity. The authors construct their network from multiple dense blocks which are connected via a batch normalization layer, a one by one convolutional layer and a two by two pooling layer to reduce the spatial resolution for the next dense block. Each dense block consists of a Batch Normalization (BN) layer, a ReLU layer and a three by three convolutional layer. In order to keep the number of feature maps low, the authors introduce a growth rate k as a hyperparameter. The growth rate can be as low as k = 4 and still allow the network to learn highly relevant representations. In their experiments, the authors evaluate different combinations of dense blocks and growth rates against ImageNet. Their DenseNet-161 (k = 48) achieves a top-5 error rate with single-crop of 6.15% and with multi-crop 5.3%. Their DenseNet-BC variant requires only one third of the amount of parameters of a ResNet-101 network to achieve the same test error on the CIFAR-10 dataset. MobileNet v3 MobileNet v3 by Howard et al. [HSC+19] is the third iteration of the original MobileNet architecture [HZC+17]. MobileNets use depthwise separable convolution instead of regular convolution. In the latter, the kernel in each convolutional layer is applied to all channels of the input simultaneously. Depthwise convolution applies the kernel to each channel separately instead and the output is then convolved in a second layer with a one by one kernel over all channels. 
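A rough sketch of such a block (assuming PyTorch; activations are kept minimal and normalization layers are omitted):

import torch.nn as nn

def depthwise_separable(m_channels, n_channels, kernel_size=3):
    return nn.Sequential(
        # Depthwise step: groups=m_channels applies one kernel per input channel.
        nn.Conv2d(m_channels, m_channels, kernel_size,
                  padding=kernel_size // 2, groups=m_channels),
        nn.ReLU(),
        # Pointwise step: a one by one convolution mixes the channels.
        nn.Conv2d(m_channels, n_channels, kernel_size=1),
        nn.ReLU(),
    )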
The second step is also called a pointwise convolution because it squeezes the number of channels per one by one input field into N output channels. The effect of using depthwise separable convolutions is that the amount of computation needed is severely reduced compared to standard convolutions. A standard convolutional layer with a kernel size of D_K × D_K, an output feature map size of D_F × D_F, M input channels and N output channels has a computational cost of

D_K · D_K · M · N · D_F · D_F. (2.8)

A depthwise separable convolution, however, has a computational cost of

D_K · D_K · M · D_F · D_F + M · N · D_F · D_F. (2.9)

The first summand refers to the cost of the depthwise convolution and added to it is the cost for the pointwise convolution. The authors demonstrate that the reduction in computational cost is

1/N + 1/D_K^2, (2.10)

which—at a kernel size of three by three—results in a computational cost roughly eight to nine times smaller. MobileNet v2 [SHZ+18] introduced inverted residuals and linear bottlenecks and MobileNet v3 [HSC+19] brought squeeze and excitation layers among other improvements. These concepts led to better classification accuracy at the same or smaller model size. The authors evaluate a large and a small variant of MobileNet v3 on ImageNet on single-core phone processors and achieve a top-1 accuracy of 75.2% and 67.4% respectively. 2.4 Transfer Learning Transfer learning refers to the application of a learning algorithm to a target domain by utilizing knowledge already learned from a different source domain [ZQD+21]. The learned representations from the source domain are thus transferred to solve a related problem in another domain. Transfer learning works because semantically meaningful information an algorithm has learned from a (large) dataset is often meaningful in other contexts as well, even though the new problem is not exactly the one for which the original model had been trained. An analogy to everyday human life can be drawn with sports. Intuitively, skills learned during soccer such as ball control, improved endurance and strategic thinking are often also useful in other ball sports. Someone who is adept at certain kinds of sports will likely be able to pick up similar types much faster. In mathematical terms, Pan and Yang [PY10] define transfer learning as: [PY10, p.1347] Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T. In the machine learning world, collecting and labeling data for training a model is often time consuming, expensive and sometimes not possible. Deep learning based models especially require substantial amounts of data to be able to robustly classify images or solve other tasks. Semi-supervised or unsupervised (see section 2.1) learning approaches can partially mitigate this problem, but having accurate ground truth data is usually a requirement nonetheless. Through the publication of large labeled datasets such as via the ILSVRCs, a basis for (pre-)training exists from which the model can be optimized for downstream tasks. Transfer learning is not a panacea, however. Care has to be taken to only use models which have been pretrained in a source domain which is similar to the target domain in terms of feature space.
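In practice, such a transfer often amounts to loading a pretrained backbone and adapting its classification head, as in the following sketch (assuming torchvision 0.13 or newer; the two target classes and the decision to freeze the backbone are illustrative choices, not a prescription):

import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 as the source-domain model.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Option 1: keep the pretrained feature extractor fixed...
for param in model.parameters():
    param.requires_grad = False

# ...and replace only the fully-connected head for a two-class target task.
model.fc = nn.Linear(model.fc.in_features, 2)

# Option 2 (fine-tuning all parameters) would simply skip the freezing loop.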
While this may seem to be an easy task, it is often not known in advance if transfer learning is the correct approach. Furthermore, choosing whether to only remove the fully-connected layers at the end of a pretrained model or to fine-tune all parameters introduces at least one additional hyperparameter. These decisions have to be made by comparing the source domain with the target domain, how much data in the target domain is available, how much computational resources are available and observing which layers are responsible for which features. Since earlier layers usually contain low-level and later layers high-level information, resetting the weights of the last few layers or replacing them with different ones entirely is also an option. To summarize, while transfer learning is an effective tool and is likely a major factor in the proliferation of deep learning based models, not all domains are suited for it. The additional decisions which have to be made as a result of using transfer learning can introduce more complexity than would otherwise be necessary for a particular problem. It does, however, allow researchers to get started quickly and to iterate faster because popular network architectures pretrained on ImageNet are integrated into the major machine learning frameworks. Transfer learning is used extensively in this work to train a classifier as well as an object detection model. 2.5 Hyperparameter Optimization While a network is learning, the parameters of its layers are updated. These parameters are learnable in the sense that changing them should bring the model closer to solving a problem. Updating these parameters happens during the learning/training phase. Hyperparameters, on the other hand, are not included in the learning process because they are fixed before the model starts to train. They are fixed because hyperparameters concern the structure, architecture and learning parameters of the model and without having those in place, a model cannot start training. Model designers have to carefully define values for a wide range of hyperparameters. Which hyperparameters have to be set is determined by the type of model which is being used. A SVM, for example, has a penalty parameter C which indicates to the network how lenient it should be when misclassifying training examples. The type of kernel to use is also a hyperparameter for any SVM and can only be answered by looking at the distribution of the underlying data. In neural networks the range of hyperparameters is even greater because every part of the network architecture such as how many layers to stack, which layers to stack, which kernel sizes to use in each CNN layer and which activation function(s) to use in-between the layers is a parameter which can be altered. 28 2.5. Hyperparameter Optimization Finding the best combination of some or all of the available hyperparameters is called hyperparameter tuning. Hyperparameter tuning can be and is often done manually by researchers where they select values which have been known to work well. This approach—while it works to some extent—is not optimal because adhering to best practice precludes parameter configurations which would be closer to optimality for a given data set. Furthermore, manual tuning requires a deep understanding of the model itself and how each parameter influences it. Biases present in a researcher’s understanding are detrimental to finding optimal hyperparameters and the amount of possible combinations can quickly get intractable. 
Instead, automated methods to search the hyperparameter space offer an unbiased and more efficient approach to hyperparameter tuning. This type of algorithmic search is called hyperparameter optimization. 2.5.1 Grid Search There are multiple possible strategies to opt for when optimizing hyperparameters. The straightforward approach is to do grid search. In grid search, all hyperparameters are discretized and all possible combinations are mapped to a search space. The search space is then sampled for configurations at evenly spaced points and the resulting vectors of hyperparameter values are evaluated. For example, if a model has seven hyperparameters and three of those can take on a continuous value, these three variables have to be discretized. In practical terms this means that the model engineer chooses suitable discrete values for said hyperparameters. Once all hyperparameters are discrete, all possible combinations of the hyperparameters are evaluated. If each of the seven hyperparameters has three discrete values, the number of possible combinations is

3 · 3 · 3 · 3 · 3 · 3 · 3 = 3^7 = 2187. (2.11)

For this example, evaluating 2187 possible combinations can already be intractable depending on the time required for each run. Further, grid search requires that the resolution of the grid is determined beforehand. If the points on the grid (combinations) are spaced too far apart, the chance of finding a global optimum is lower than if the grid is dense. However, a dense grid results in a higher number of possible combinations and thus more time is required for an exhaustive search. Additionally, grid search suffers from the curse of dimensionality because the number of evaluations scales exponentially with the number of hyperparameters. 2.5.2 Random Search Random search [PDD+09] is an alternative to grid search which often finds configurations that are similar to or better than those obtained with grid search in the same amount of time [BB12]. Random search performs especially well in high-dimensional environments because the hyperparameter response surface is often of low effective dimensionality [BB12]. That is, a low number of hyperparameters disproportionately affects the performance of the resulting model and the rest has a negligible effect. We use random search in this work to improve the hyperparameters of our classification model. 2.5.3 Evolution Strategies Evolution strategies follow a population-based model where the search strategy starts from initial random configurations and evolves the hyperparameters through mutation and crossover. Mutation randomly changes the value of a hyperparameter and crossover creates a new configuration by mixing the values of two configurations. Hyperparameter optimization with evolutionary strategies roughly goes through the following stages [BBL+23]; a minimal sketch of the loop follows the list.

1. Set the hyperparameters to random initial values and create a starting population of configurations.
2. Evaluate each configuration.
3. Rank all configurations according to a fitness function.
4. The best-performing configurations are selected as parents.
5. Child configurations are created from the parent configurations by mutation and crossover.
6. Evaluate the child configurations.
7. Go to step three and repeat the process until a termination condition is reached.
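A minimal, purely illustrative sketch of this loop in Python (the search space is hypothetical and the fitness function is a placeholder for training and validating a model with the given configuration):

import random

SPACE = {"lr": (1e-5, 1e-1), "momentum": (0.6, 0.99), "weight_decay": (0.0, 1e-3)}

def random_config():
    return {name: random.uniform(*bounds) for name, bounds in SPACE.items()}

def mutate(config, rate=0.3):
    return {name: (random.uniform(*SPACE[name]) if random.random() < rate else value)
            for name, value in config.items()}

def crossover(parent_a, parent_b):
    return {name: random.choice((parent_a[name], parent_b[name])) for name in parent_a}

def evolve(fitness, generations=10, pop_size=20, n_parents=5):
    population = [random_config() for _ in range(pop_size)]            # step 1
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)         # steps 2-3
        parents = ranked[:n_parents]                                   # step 4
        children = [mutate(crossover(*random.sample(parents, 2)))      # step 5
                    for _ in range(pop_size - n_parents)]
        population = parents + children                                # steps 6-7
    return max(population, key=fitness)

In practice the fitness values would be cached, since every evaluation corresponds to a full training and validation run.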
This strategy is more efficient than grid search or random search but requires a substantial amount of iterations for good solutions and can thus be too expensive for hyperparameter optimization [BBL+23]. We use an evolution strategy based on a genetic algorithm in this work to optimize the hyperparameters of our object detection model. 2.6 Related Work The literature on machine learning in agriculture is broadly divided into four main areas: livestock management, soil management, water management, and crop management [BTD+21]. Of those four, water management only makes up about 10% of all surveyed papers during the years 2018–2020. This highlights the potential for research in this area to have a high real-world impact. Besides agriculture, algorithmic approaches to watering house plants have not been studied at all to the best of our knowledge. Related work thus mostly focuses on a small selection of plants which are used for agricultural purposes. Nevertheless, the methods presented in those works are of interest for our own work. 30 2.6. Related Work Su et al. [SCL+20] used traditional feature extraction and preprocessing techniques to train various machine learning models for classifying water stress for a wheat field. They took top-down images of the field using an Unmanned Aerial Vehicle (UAV), segmented wheat pixels from background pixels and constructed features based on spectral intensities and color indices. The features are fed into a SVM with a Gaussian kernel and optimized using Bayesian optimization. Their results of 92.8% accuracy show that classical machine learning approaches can offer high classification scores if meaningful features are chosen. One disadvantage is that feature extraction is often a tedious task involving trial and error (see section 2.3.1). Advantages are the small data set and the short training time (3 s) required to obtain a good result. Similarly, López-García et al. [LIM+22] investigated the potential for UAVs to determine water stress for vineyards using RGB and multispectral imaging. The measurements of the UAV were taken at 80 m with a common off-the-shelf Advanced Photo System type-C (APS-C) sensor. At the same time, stem water measurements were taken with a pressure chamber to be able to evaluate the performance of an ANN against the ground truth. The RGB images were used to calculate the Green Canopy Cover (GCC) which was also fed to the model as input. The model achieves a high determination coefficient R2 of 0.98 for the 2018 season on RGB data with a relative error of RE = 10.84 %. However, their results do not transfer well to the other seasons under survey (2019 and 2020). Zhuang et al. [ZWJ+17] showed that water stress in maize can be detected early on and, therefore, still provide actionable information before the plants succumb to drought. They installed a camera which took 640 by 480 pixel RGB images every two hours. A simple linear classifier (SVM) segmented the image into foreground and background using the green color channel. The authors constructed a 14-dimensional feature space consisting of color and texture features. A Gradient Boosted Decision Tree (GBDT) model classified the images into water stressed and non-stressed and achieved an accuracy of 90.39 %. Remarkably, the classification was not significantly impacted by illumination changes throughout the day. An et al. [ALL+19] used the ResNet50 model (see section 2.3.2) as a basis for transfer learning and achieved high classification scores (ca. 95%) on maize. 
Their model was fed with 640 by 480 pixel images of maize from three different viewpoints and across three different growth phases. The images were converted to grayscale, which turned out to slightly lower the classification accuracy. Their results also highlight the superiority of Deep Convolutional Neural Networks (DCNNs) compared to manual feature extraction and GBDTs. Chandel et al. [CCR+21] investigated deep learning models in depth by comparing three well-known CNNs. The models under scrutiny were AlexNet (see section 2.3.2), GoogLeNet (see section 2.3.2), and Inception v3. Each model was trained with a dataset containing images of maize, okra, and soybean at different stages of growth and under stress and no stress. The researchers did not include an object detection step before image classification and compiled a fairly small dataset of 1200 images. Of the three models, GoogLeNet beat the other two with a sizable lead at a classification accuracy of >94% for all three types of crop. The authors attribute its success to its inherently deeper structure and application of multiple convolutional layers at different stages. Unfortunately, all of the images were taken at the same 45° ± 5° angle and it stands to reason that the models would perform significantly worse on images taken under different conditions. Ramos-Giraldo et al. [RRL+20] detected water stress in soybean and corn crops with a pretrained model based on DenseNet-121 (see section 2.3.2). Low-cost cameras deployed in the field provided the training data over a 70-day period. They achieved a classification accuracy of 88% for the degree of wilting. In a later study, the same authors [RRM+20] deployed their machine learning model in the field to test it for production use. They installed multiple Raspberry Pis with attached Raspberry Pi Cameras which took images in 30 min intervals. The authors had difficulties with cameras not working and power supply issues. Furthermore, running the model on the resource-constrained RPis proved difficult and they had to port their TensorFlow model to a TensorFlow Lite model. This conversion lowered their classification scores slightly since the predictions were sometimes off by one water stress level. Nevertheless, their architecture allowed for reasonably high classification scores on corn and soybean with a low-cost setup. Azimi, Kaur, and Gandhi [AKG20] demonstrate the efficacy of deep learning models versus classical machine learning models on chickpea plants. The authors created their own dataset in a laboratory setting for stressed and non-stressed plants. They acquired 8000 images at eight different angles in total. For the classical machine learning models, they extracted feature vectors using SIFT and HOG. The features are fed into three classical machine learning models: SVM, k-Nearest Neighbors (k-NN), and a Decision Tree (DT) using the Classification and Regression Tree (CART) algorithm. On the deep learning side, they used their own CNN architecture and the pretrained ResNet-18 (see section 2.3.2) model. The accuracy scores for the classical models were in the range of 60% to 73% with the SVM outperforming the two others. The CNN achieved higher scores at 72% to 78% and ResNet-18 achieved the highest scores at 82% to 86%. The results clearly show the superiority of deep learning over classical machine learning. A downside of their approach lies in the collection of the images.
The background in all images was uniformly white and the plants were prominently placed in the center. It should, therefore, not be assumed that the same classification scores can be achieved on plants in the field with messy and noisy backgrounds as well as illumination changes and so forth. Venal, Fajardo, and Hernandez [VFH19] combine a standard CNN architecture with a SVM for classification. The CNN acts as a feature extractor and instead of using the last fully-connected layers of an off-the-shelf CNN, they replace them with a SVM. They use this classifier to determine which biotic or abiotic stresses soybeans suffer from. Their dataset consists of 65 184 RGB images of 64 by 64 pixels, of which around 40 000 were used for training and 6000 for testing. All images show a close-up of a soybean leaf. Their CNN architecture makes use of three Inception modules (see section 2.3.2) with Squeeze-Excitation (SE) blocks and BN layers in-between. Their model achieves an average F1-score of 97% and an average accuracy of 97.11% on the test set. Overall, the hybrid structure of their model is promising, but it is not clear why only using the CNN as a feature extractor provides better results than using it also for classification. Aversano, Bernardi, and Cimitile [ABC22] perform water stress classification on images of tomato crops obtained with a UAV. Their dataset consists of 6600 thermal and 6600 optical images which have been segmented using spectral clustering. They use two VGG-19 networks (see section 2.3.2) which extract features from the thermal (network one) and optical (network two) images. Both feature extractors are merged together via a fully-connected and softmax layer to predict one of three classes: water excess, well-watered and water deficit. The authors select three hyperparameters (image resolution, optimization algorithm and batch size) and optimize them for accuracy. The best classifier works with a resolution of 512 px, SGD and a batch size of 32. This configuration achieves an accuracy of 80.5% and an F1-score of 79.4% on the validation set. To test whether the optical or thermal images are more relevant for classification, the authors conduct an ablation study. The results show that the network with the optical images alone achieves an F1-score of 74% while only using the thermal images gives an F1-score of 62%. A significant problem in the detection of water stress is posed by the evolution of indicators across time. Since physiological features such as leaf wilting progress as time passes, the additional time domain has to be taken into account. To make use of these spatiotemporal patterns, Azimi, Wadhawan, and Gandhi [AWG21] propose the application of a CNN Long Short-Term Memory Network (CNN-LSTM) architecture. The model was trained on chickpea plants and achieves a robust classification accuracy of >97%. All of the previously mentioned studies solely focus on either one specific type of plant or on a small number of them. Furthermore, the researchers construct their datasets in homogeneous environments which often do not mimic real-world conditions. Finally, there exist no studies on common household or garden plants. This fact may be attributed to the propensity for funding to come from the agricultural sector. It is thus desirable to explore how plants other than crops show water stress and if there is additional information to be gained from them.
33 CHAPTER 3 Prototype Design The following sections establish the requirements as well as the general design philosophy of the prototype. We will then go into detail about the selected model architectures and data augmentations which are applied during training. 3.1 Requirements The basic requirements for the prototype have been introduced in section 1.1 and stem from the research questions defined in the same section. The aim of this work is to detect household plants, classify them into water-stressed or healthy, and to continuously publish the results via a Representational State Transfer (REST) API. To this end, a portable SBC such as the Nvidia Jetson Nano stores the trained models locally and uses them for inference on images which are periodically taken with an attached camera. The prototype is thus required to be running the models on its own without help from a central server or other computational resource. However, because the results are published via a REST service, internet access is necessary to be able to retrieve the predictions. Other technical requirements are that the inference on the device for both models does not take too long (i.e. not longer than a few seconds per image). Even though plants are not known to grow extremely rapidly from one minute to the next, keeping the inference time low results in a more resource efficient prototype. As such, it is possible to run the device off of a battery which completes the self-contained nature of the prototype. From an evaluation perspective, the models should have high specificity and sensitivity. In order to be useful for plant water-stress detection, it is necessary to identify as many water-stressed plants as possible while keeping the number of false positives as low as possible (specificity). If the number of water-stressed plants is severely overestimated, downstream watering systems could damage the plants by overwatering. Conversely, if the number of water-stressed plants is underestimated, some plants are likely to die 35 3. Prototype Design because no water-stress is detected (sensitivity). Furthermore, the models are required to attain a reasonable level of precision as well as good localization of plants. It is difficult to determine said levels beforehand, but considering the task at hand as well as general object detection and classification benchmarks such as COCO [LMB+15], we expect a mAP of around 40% and precision and recall values of 70%. Other basic model requirements are robust object detection and classification as well as good generalizability. The prototype should be able to function in different environments where different lighting conditions, different backgrounds, and different angles do not have an impact on model performance. Where feasible, models should be evaluated with cross validation to ensure that the performance of the model on the test set is a good indicator of its generalizability. In the same vein, models should not overfit or underfit the training data which also results in bad generalizability. During the iterative process of training the models as well as for evaluation purposes, the models should be interpretable. Especially when there is comparatively little training data available, verifying if the model is focusing on the right parts of an image gives insight into its robustness and generalizability which can increase trust. Furthermore, if a model is clearly not focusing on the right parts of an image, interpretability can help debug where the problem lies. 
Interpretability is thus an important property of any model so that the model engineer is able to steer the training and inference process in the right direction. 3.2 Design Figure 3.1 shows the overall processing loop which happens on the device. The camera is directly attached to the Nvidia Jetson Nano via a Camera Serial Interface (CSI) cable. Since the cable is quite rigid, the camera must be mounted on a small stand such as a tripod. Images coming in from the camera are then passed to the object detection model running on the Nvidia Jetson Nano. The model detects all plants in the image and returns the coordinates of a bounding box per plant. These coordinates are used to cut out each plant from the original image. The cutout is then passed to the second model running on the Nvidia Jetson Nano which determines if the plant is water-stressed or not. The percentage values of the prediction are mapped to a scale between one and ten, where ten indicates that the plant is in a very dire state. This number is available via a REST endpoint with additional information such as current time as well as how long it has been since the state has been better than three. The endpoint publishes this information for every plant which has been detected. The water stress prediction itself consists of two stages. First, plants are detected and, second, each individual plant is classified. This two-stage approach lends itself well to a two-stage model structure. Since the first stage is an object detection task, we employ an object detection model and pass the individual plant images to a second model—the classifier. 36 3.2. Design Figure 3.1: Methodological approach for the prototype. The prototype will run in a loop which starts at the top left corner. First, the camera attached to the prototype takes images of plants. These images are passed to the models running on the prototype. The first model generates bounding boxes for all detected plants. The bounding boxes are used to cut out the individual plants and pass them to the state classifier in sequence. The classifier outputs a probability score indicating the amount of stress the plant is experiencing. After a set amount of time, the camera takes a picture again and the process continues indefinitely. While most object detection models could be trained to determine the difference between water-stressed and healthy, the reason for this two-stage design lies in the availability of data. To our knowledge, there are no sufficiently large enough datasets available which contain labeling information for water-stressed and healthy. Instead, most datasets only classify common objects such as plane, person, car, bicycle, and so forth (e.g. COCO [LMB+15]). However, the classes plant and houseplant are present in most datasets and provide the basis for our object detection model. The size of these datasets allows us to train the object detection model with a large number of samples which would have been unfeasible to label on our own. The classifier is then trained with a smaller data set which only comprises individual plants and their associated classification (stressed or healthy). Both datasets (object detection and classification) only allow us to train and validate each model separately. A third dataset is needed to evaluate the detection/classification pipeline as a whole. To this end, we construct our own dataset where all plants per image are labeled with bounding boxes as well as the classes stressed or healthy. 
This dataset is small in comparison to the one with which the object detection model is trained but 37 3. Prototype Design suffices because it is only used for evaluation. Labeling each sample in the evaluation dataset manually is still a laborious task which is why each image is preannotated by the already existing object detection and classification model. The task of labeling thus becomes a task of manually correcting the annotations which have been generated by the models. 3.3 Selected Methods In the following sections we will go into detail about the two selected architectures for our prototype. The object detector we chose—YOLOv7—is part of a larger family of models which all function similarly but have undergone substantial changes from version to version. In order to understand the used model, we trace the improvements to the YOLO family from version one to version seven. For the classification stage, we have opted for a ResNet architecture which is also described in detail. 3.3.1 You Only Look Once The YOLO family of object detection models started in 2015 when [RDG+16] published the first version. Since then there have been up to 16 updated versions depending on how one counts. The original YOLO model marked a shift from two-stage detectors to one-stage detectors as is evident in its name. Two-stage detectors (see section 2.2.3) rely on a proposal generation step and then subsequent rejection or approval of each proposal to detect objects. Generating proposals, however, is an expensive procedure which limits the amount of object detections per second. YOLO dispenses with the extra proposal generation step and instead provides a unified one-stage detection approach. The first version of YOLO [RDG+16] framed object detection as a single regression problem which allows the model to directly infer bounding boxes with class probabilities from image pixels. This approach has the added benefit that YOLO sees an entire image at once, allowing it to capture more contextual information than with sliding window or region proposal methods. However, YOLO still divides an image into regions which are called grid cells, but this is just a simple operation and does not rely on external algorithms such as selective search [UvdSG+13]. The number of bounding box proposals within YOLO is much lower than with selective search as well (98 versus 2000 per image). The architecture of YOLO is similar to GoogleNet (see section 2.3.2), but the authors do not use inception modules directly. The network contains 24 convolutional layers in total where most three by three layers are fed a reduced output from a one by one layer. This approach reduces complexity substantially—as has been demonstrated with GoogleNet. Every block of convolutional layers is followed by a two by two maxpool layer for downsampling. The model expects an input image of size 448 by 448 pixels but has been pretrained on ImageNet with half that resolution (i.e. 224 by 224 pixels). After the convolutional layers, the authors add two fully-connected layers to produce an output of size 7 × 7 × 30. This output tensor is chosen because the VOC data set has 20 38 3.3. Selected Methods classes C and each grid cell produces two bounding boxes B where each bounding box is described by x, y, w, h and the confidence. With a grid size of S = 7, the output is thus S × S × (B · 5 + C) = 7 × 7 × 30. Each grid cell is responsible for a detected object if the object’s center coordinates (x, y) fall within the bounds of the cell. 
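A small sketch of this responsibility assignment and of the resulting output dimensions (plain Python; the center coordinates are assumed to be normalized to the range [0, 1)):

S, B, C = 7, 2, 20                  # grid size, boxes per cell, VOC classes
output_shape = (S, S, B * 5 + C)    # = (7, 7, 30)

def responsible_cell(x_center, y_center, grid_size=S):
    # The grid cell whose bounds contain the object's center point predicts the object.
    return int(x_center * grid_size), int(y_center * grid_size)

# Example: a box centered at (0.52, 0.13) falls into cell (3, 0).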
Furthermore, every cell can only predict one object, which leads to problems with images of dense objects. In that case, a finer grid size is needed. The w and h of a bounding box are relative to the image as a whole, which allows the bounding box to span more than one grid cell. Since the authors frame object detection as a regression problem of bounding box coordinates (center point (x, y), width w, and height h), object probabilities per box, and class probabilities, they develop a loss function which is a sum of five parts. The first part describes the regression for the bounding box center coordinates (sum of squared differences), the second part the width and height of the box, the third part the confidence of there being an object in a box, the fourth part the confidence if there is no actual object in the box, and the fifth part the individual class probabilities (see equation 3.1). The two constants λ_coord and λ_noobj are weighting factors which increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes without objects. These are set to λ_coord = 5 and λ_noobj = 0.5.

λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
+ λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
+ Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
+ λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
+ Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))² (3.1)

The original YOLO model has a few limitations. It only predicts one class per bounding box and can only accommodate two bounding boxes per grid cell. YOLO thus has problems detecting small and dense objects. The most severe problem, however, is the localization accuracy. The loss function treats errors in small bounding boxes similarly to errors in big bounding boxes even though small errors have a higher impact on small bounding boxes than big ones. This results in a more lenient loss function for IOUs of small bounding boxes and, therefore, worse localization. YOLOv2 YOLOv2 [RF17] incorporates multiple improvements such as BN layers, higher resolution inputs, a fully-convolutional architecture, anchor boxes, dimension priors, and multi-scale training. Of particular interest is the use of anchor boxes to localize bounding boxes. Instead of regressing arbitrary bounding box sizes, YOLOv2 predicts the bounding box offsets from a set of predefined boxes which are called anchor boxes. The authors note that finding a good set of prior anchor boxes by hand is error-prone and suggest finding them via k-means clustering (dimension priors). They select five anchor boxes per grid cell which still results in high recall but does not introduce too much complexity. These additional details result in an improved mAP of 78.6% on the VOC 2007 dataset compared to 63.4% for the previous YOLO version. YOLOv2 still maintains a fast detection rate at 40 fps (mAP 78.6%) and up to 91 fps (mAP 69%). YOLOv3 YOLOv3 [RF18] provided additional updates to the YOLOv2 model. To be competitive with the deeper network structures of state-of-the-art models at the time, the authors introduce a deeper feature extractor called Darknet-53. It makes use of the residual connections popularized by ResNet [HZR+16] (see section 2.3.2). Darknet-53 is more accurate than Darknet-19, is comparable in accuracy to ResNet-101, and can process more images per second (78 fps versus 53 fps). The activation function throughout the network is still leaky ReLU, as in earlier versions. YOLOv3 uses multi-scale predictions to achieve better detection ratios across object sizes.
Inspired by FPNs (see section 2.2.3), YOLOv3 uses predictions at different scales from the feature extractor and combines them to form a final prediction. Combining the features from multiple scales is often done in the neck of the object detection architecture. Around the time of the publication of YOLOv3, researchers started to use the terminology backbone, neck and head to describe the architecture of object detection models. The feature extractor (Darknet-53 in this case) is the backbone and provides the feature maps which are aggregated in the neck and passed to the head which outputs the final predictions. In some cases there are additional postprocessing steps in the head such as Non Maximum Suppression (NMS) to eliminate duplicate or suboptimal detections. While YOLOv2 had problems detecting small objects, YOLOv3 performs much better on them (Average Precision (AP) of 18.3% versus 5% on COCO). The authors note, however, that the new model sometimes has comparatively worse results with larger objects. The reasons for this behavior are unknown. Additionally, YOLOv3 is still lagging behind other detectors when it comes to accurately localizing objects. The COCO evaluation metric was changed from the previous AP0.5 to the mAP between 0.5 to 0.95 which penalizes detectors which do not achieve close to perfect IOU scores. This change highlights YOLOv3’s weakness in that area. 40 3.3. Selected Methods YOLOv4 Keeping in line with the aim of carefully balancing accuracy and speed of detection, Bochkovskiy, Wang, and Liao [BWL20] publish the fourth version of YOLO. The authors investigate the use of what they term bag of freebies—methods which increase training time while increasing inference accuracy without sacrificing inference speed. A prominent example of such methods is data augmentation (see section 3.3.3). Specifically, the authors propose to use mosaic augmentation which lowers the need for large mini-batch sizes. They also use new features such as weighted residual connections [SGZ16], a modified Spatial Attention Module (SAM) [WPL+18], a modified Path Aggregation Network (PANet) [LQQ+18] for the neck, Complete Intersection over Union (CIoU) loss [ZWL+20] for the detector and the Mish activation function [Mis20]. Taken together, these additional improvements yield a mAP of 43.5% on the COCO test set while maintaining a speed of above 30 fps on modern GPUs. YOLOv4 was the first version which provided results on all scales (S, M, L) that were better than almost all other detectors at the time without sacrificing speed. YOLOv5 The author of YOLOv5 [Joc20] ported the code from YOLOv4 from the Darknet framework to PyTorch which facilitated better interoperability with other Python utilities. New in this version is the pretraining algorithm called AutoAnchor which adjusts the anchor boxes based on the dataset at hand. This version also implements a genetic algorithm for hyperparameter optimization (see section 2.5.3) which is used in our work as well. Version 5 comes in multiple architectures of various complexity. The smallest—and there- fore fastest—version is called YOLOv5n where the n stands for nano. Additional versions with increasing parameters are YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large). The smaller models are intended to be used in resource constrained environments such as edge devices but come with a cost in accuracy. 
Conversely, the larger models are for tasks where high accuracy is paramount and enough computational resources are available. The YOLOv5x model achieves a mAP of 50.7% on the COCO test dataset. YOLOv6 The authors of YOLOv6 [LLJ+22] use a new backbone based on RepVGG [DZM+21] which they call EfficientRep. They also use different losses for classification (varifocal loss [ZWD+21]) and bounding box regression (Scylla Intersection over Union (SIoU) [Gev22]/Generalized Intersection over Union (GIoU) [RTG+19]). YOLOv6 is made available in eight scaled version of which the largest achieves a mAP of 57.2% on the COCO test set. 41 3. Prototype Design YOLOv7 At the time of implementation of our own plant detector, YOLOv7 [WBL22] was the newest version within the YOLO family. Similarly to YOLOv4, it introduces more trainable bag of freebies which do not impact inference time. The improvements include the use of Extended Efficient Layer Aggregation Networks (E-ELANs) (based on Efficient Layer Aggregation Networks (ELANs) [WLY22]), joint depth and width model scaling techniques, reparameterization on module level, and an auxiliary head—similarly to GoogleNet (see section 2.3.2)—which assists during training. The model does not use a pretrained backbone, it is instead trained from scratch on the COCO dataset. These changes result in much smaller model sizes compared to YOLOv4 and a mAP of 56.8% with a detection speed of over 30 fps. We use YOLOv7 in our own work during the plant detection stage because it was the fastest and most accurate object detector at the time of implementation. 3.3.2 ResNet Early research [BSF94; GB10] already demonstrated that the vanishing/exploding gradi- ent problem with standard gradient descent and random initialization adversely affects convergence during training and results in worse performance than would be otherwise achievable with the same architecture. If a neural network is trained with gradient descent by the application of the chain rule (backpropagation), weight updates are passed from the later layers back through the network to the early layers. Unfortunately, with some activation functions (notably tanh), the gradient can be very small and decreases exponentially the further it passes through the network. The effect being that the early layers do not receive any weight updates which can stop the learning process entirely. There are multiple potential solutions to the vanishing gradient problem. Different weight initialization schemes [GB10; SA15] as well as BN layers [IS15] can help mitigate the problem. The most effective solution yet, however, was proposed as residual connections by He et al. [HZR+16]. Instead of connecting each layer only to the previous and next layer in a sequential way, the authors add the input of the previous layer to the output of the next layer. This is achieved through the aforementioned residual or skip connections (see figure 3.3). He et al. [HZR+16] develop a new architecture called ResNet based on VGGNet (see section 2.3.2) which includes residual connections after every second convolutional layer. The filter sizes in their approach are smaller than in VGGNet which results in much fewer trainable parameters overall. Since residual connections do not add additional parameters and are relatively easy to add to existing network structures, the authors compare four versions of their architecture: one with 18 and the other with 34 layers, each with (ResNet) and without (plain ResNet) residual connections. 
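A residual connection is easiest to see in code. The following PyTorch sketch shows a basic block in the style of ResNet-18/34, assuming input and output have the same number of channels; shortcuts that change the channel count or resolution, as well as the bottleneck variant used by ResNet-50, are omitted for brevity.

    import torch
    import torch.nn as nn

    class BasicResidualBlock(nn.Module):
        """Two 3x3 convolutions with an identity shortcut, as used in ResNet-18/34."""

        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                      # parameter-free skip connection
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out = out + identity              # element-wise addition before the final ReLU
            return self.relu(out)

Because the shortcut is an identity mapping, the block adds no parameters, and the gradient can flow through the addition unattenuated, which is what mitigates the vanishing gradient problem described above.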
Curiously, the 34-layer plain network performs worse on ImageNet classification than the 18-layer plain network. Once residual connections are used, however, the 34-layer network outperforms the 18-layer version by 2.85 percentage points on the top-1 error metric of ImageNet.

Figure 3.2: Residual connections: information from previous layers flows into subsequent layers before the activation function is applied. The shortcut connection provides a path for information to skip multiple layers. These connections are parameter-free because of the identity mapping. The block output is F(x) + x, where the addition is a simple element-wise addition. Figure redrawn from He et al. [HZR+16].

Figure 3.3: A bottleneck building block used in the ResNet-50, ResNet-101 and ResNet-152 architectures (1×1, 64 → 3×3, 64 → 1×1, 256 for a 256-dimensional input). The one by one convolutions serve as a reduction and then inflation of dimensions. The dimension reduction results in lower input and output dimensions for the three by three layer and thus improves training time. Figure redrawn from He et al. [HZR+16] with our own small changes.

We use the ResNet-50 model developed by He et al. [HZR+16] pretrained on ImageNet in our own work. The 50-layer model uses bottleneck building blocks instead of the two three by three convolutional layers which lie in-between the residual connections of the smaller ResNet-18 and ResNet-34 models. We chose this model because it provides a suitable trade-off between model complexity and inference time.

3.3.3 Data Augmentation

Data augmentation is an essential part of every training process throughout machine learning. By perturbing already existing data with transformations, model engineers achieve an artificial enlargement of the dataset which allows the machine learning model to learn more robust features. It can also reduce overfitting for smaller datasets. In the object detection world, special augmentations such as mosaic help with edge cases which might crop up during inference. For example, by combining four or more images of the training set into one, the model better learns to draw bounding boxes around objects which are cut off and at the edges of the individual images. Since we use data augmentation extensively during the training phases, we list a small selection of them:

HSV-hue: Randomly change the hue of the color channels.
HSV-saturation: Randomly change the saturation of the color channels.
HSV-value: Randomly change the value of the color channels.
Translation: Randomly translate, i.e., move the image by a specified amount of pixels.
Scaling: Randomly scale the image up and down by a factor.
Rotation: Randomly rotate the image.
Inversion: Randomly flip the image along the x or the y-axis.
Mosaic: Combine multiple images into one in a mosaic arrangement.
Mixup: Create a linear combination of multiple images.

These augmentations can either be defined to happen with a fixed value and a specified probability, or they can be applied to all images but with a value that is not fixed. For example, one can specify a range for the degree of rotation and every image is rotated by a random value within that range. Or these two options are combined to rotate an image by a random value within a range with a specified probability.

CHAPTER 4 Prototype Implementation

In this chapter we describe the implementation of the prototype.
Part of the implemen- tation is how the two models were trained and with which datasets, how the models are deployed to the SBC, and how they were optimized. 4.1 Object Detection As mentioned before, our approach is split into a detection and a classification stage. The object detector detects all plants in an image during the first stage and passes the cutouts on to the classifier. In this section, we describe what the dataset the object detector was trained with looks like, what the results of the training phase are and how the model was optimized with respect to its hyperparameters. 4.1.1 Dataset The object detection model has to correctly detect plants in various locations, different lighting conditions, and in partially occluded settings. Fortunately, there are many datasets available which contain a large amount of classes and samples of common everyday objects. Most of these datasets contain at least one class about plants and multiple related classes such as houseplant and potted plant can be merged together to form a single plant class which exhibits a great variety of samples. One such dataset which includes the aforementioned classes is the Open Images Dataset (OID) [KRA+20; KDA+17]. The OID has been published in multiple versions starting in 2016 with version one. The most recent iteration is version seven which has been released in October 2022. We use version six of the dataset in our own work which contains 9 011 219 training, 41 620 validation, and 125 436 testing images. The dataset provides image-level labels, bounding boxes, object segmentations, visual relationships, and localized narratives on 45 4. Prototype Implementation those images. For our own work, we are only interested in the labeled bounding boxes of all images which belong to the classes Houseplant and Plant with their respective class identifiers /m/03fp41 and /m/05s2s. These images have been extracted from the dataset and arranged in the directory structure which YOLOv7 requires. The bounding boxes themselves are collapsed into one single label Plant and converted to the YOLOv7 label format. In total, there are 79 204 images with 284 130 bounding boxes in the training set. YOLOv7 continuously validates the training progress after every epoch on a validation set of 3091 images with 4092 bounding boxes. 4.1.2 Training Phase We use the smallest YOLOv7 model which has 36.9 × 106 parameters [WBL22] and has been pretrained on the COCO dataset [LMB+15] with an input size of 640 by 640 pixels. The object detection model was then fine-tuned for 300 epochs on the training set. The weights from the best-performing epoch were saved. The model’s fitness for each epoch is calculated as the weighted average of mAP@0.5 and mAP@0.5:0.95: fepoch = 0.1 · mAP@0.5 + 0.9 · mAP@0.5:0.95 (4.1) Figure 4.1 shows the model’s fitness over the training period of 300 epochs. The gray vertical line indicates the maximum fitness of 0.61 at epoch 133. The weights of that epoch were frozen to be the final model parameters. Since the fitness metric assigns the mAP at the higher range the overwhelming weight, the mAP@0.5 starts to decrease after epoch 30, but the mAP@0.5:0.95 picks up the slack until the maximum fitness at epoch 133. This is an indication that the model achieves good performance early on and continues to gain higher confidence values until performance deteriorates due to overfitting. Overall precision and recall per epoch are shown in figure 4.2. 
The values indicate that neither precision nor recall change materially during training. In fact, precision starts to decrease from the beginning, while recall experiences a barely noticeable increase. Taken together with the box and object loss from figure 4.3, we speculate that the pretrained model already generalizes well to plant detection because one of the categories in the COCO [LMB+15] dataset is potted plant. Any further training solely impacts the confidence of detection but does not lead to higher detection rates. This conclusion is supported by the increasing mAP@0.5:0.95 until epoch 133. Further culprits for the flat precision and recall values may be found in bad ground truth data. The labels from the OID are sometimes not fine-grained enough. Images which contain multiple individual—often overlapping—plants are labeled with one large bounding box instead of multiple smaller ones. The model recognizes the individual plants and returns tighter bounding boxes even if that is not what is specified in the ground truth. Therefore, it is prudent to limit the training phase to relatively few epochs in order to not penalize the more accurate detections of the model. The smaller bounding boxes make more sense considering the fact that the cutout is passed to the classifier in a later stage. Smaller bounding boxes help the classifier to only focus on one plant at a time and to not get distracted by multiple plants in potentially different stages of wilting.

Figure 4.1: Object detection model fitness for each epoch calculated as in equation 4.1. The vertical gray line at 133 marks the epoch with the highest fitness.

Figure 4.2: Overall precision and recall during training for each epoch. The vertical gray line at 133 marks the epoch with the highest fitness.

Figure 4.3: Box and object loss measured against the validation set of 3091 images and 4092 ground truth labels. The class loss is omitted because there is only one class in the dataset and the loss is therefore always zero.

The box loss decreases slightly during training, which indicates that the bounding boxes become tighter around objects of interest. With increasing training time, however, the object loss increases, indicating that fewer and fewer plants are present in the predicted bounding boxes. It is likely that overfitting is a cause for the increasing object loss from epoch 40 onward. Since the best weights as measured by fitness are found at epoch 133 and the object loss accelerates from that point, epoch 133 is arguably the correct cutoff before overfitting occurs.

4.1.3 Hyperparameter Optimization

To further improve the object detection performance, we perform hyperparameter optimization using a genetic algorithm. Evolution of the hyperparameters starts from the initial 30 default values provided by the authors of YOLO. Of those 30 values, 26 are allowed to mutate. During each generation, there is an 80% chance that a mutation occurs with a variance of 0.04. To determine which generation should be the parent of the new mutation, all previous generations are ordered by fitness in decreasing order. At most five top generations are selected and one of them is chosen at random (a loose sketch of this selection and mutation scheme is given below).
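The sketch below illustrates the selection-and-mutation scheme in plain Python. It is not the YOLO evolve code itself: the fitness-weighted parent choice and the per-parameter interpretation of the 80% mutation probability are our reading of the procedure, and the data structures are illustrative.

    import random

    def select_parent(history, top_k=5):
        """Order previous generations by fitness, keep at most the top five,
        and pick one of them with probability proportional to its fitness."""
        ranked = sorted(history, key=lambda g: g["fitness"], reverse=True)[:top_k]
        weights = [g["fitness"] for g in ranked]
        return random.choices(ranked, weights=weights, k=1)[0]

    def mutate(hyperparameters, mutable_keys, probability=0.8, sigma=0.2):
        """Perturb each mutable hyperparameter with the given probability by a
        Gaussian factor; sigma = 0.2 corresponds to a variance of 0.04."""
        mutated = dict(hyperparameters)
        for key in mutable_keys:
            if random.random() < probability:
                mutated[key] = hyperparameters[key] * random.gauss(1.0, sigma)
        return mutated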
Better generations have a higher chance of being selected as the selection is weighted by fitness. The parameters of that chosen generation are then mutated with the aforementioned probability and variance. Each generation is trained for three epochs and the fitness of the best epoch is recorded. In total, we ran 87 iterations of which the 34th generation provides the best fitness of 0.6076. Due to time constraints, it was not possible to train each generation for more epochs or to run more iterations in total. We assume that the performance of the first few epochs is a reasonable proxy for model performance overall. The optimized version 48 4.1. Object Detection 0 10 20 30 40 50 60 70 epoch 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 fi tn es s Figure 4.4: Object detection model fitness for each epoch calculated as in equation 4.1. The vertical gray line at 27 marks the epoch with the highest fitness of 0.6172. of the object detection model is then trained for 70 epochs using the parameters of the 34th generation. Figure 4.4 shows the model’s fitness during training for each epoch. After the highest fitness of 0.6172 at epoch 27, the performance quickly declines and shows that further training would likely not yield improved results. The model converges to its highest fitness much earlier than the non-optimized version, which indicates that the adjusted parameters provide a better starting point in general. Furthermore, the maximum fitness is 0.74 percentage points higher than in the non-optimized version. Figure 4.5 shows precision and recall for the optimized model during training. Similarly to the non-optimized model from figure 4.2, both metrics do not change materially during training. Precision is slightly higher than in the non-optimized version and recall hovers at the same levels. The box and object loss during training is pictured in figure 4.6. Both losses start from a lower level which suggests that the initial optimized parameters allow the model to converge quicker. The object loss exhibits a similar slope to the non-optimized model in figure 4.3. The vertical gray line again marks epoch 27 with the highest fitness. The box loss reaches its lower limit at that point and the object loss starts to increase again after epoch 27. 49 4. Prototype Implementation 0 10 20 30 40 50 60 70 epoch 0.0 0.2 0.4 0.6 0.8 1.0 metric precision recall Figure 4.5: Overall precision and recall during training for each epoch of the optimized model. The vertical gray line at 27 marks the epoch with the highest fitness. 0 20 40 60 epoch 0.0200 0.0225 0.0250 0.0275 0.0300 b ox lo ss 0 20 40 60 epoch 0.0050 0.0055 0.0060 0.0065 0.0070 ob je ct lo ss Figure 4.6: Box and object loss measured against the validation set of 3091 images and 4092 ground truth labels. The class loss is omitted because there is only one class in the dataset and the loss is therefore always zero. 50 4.2. Classification 4.2 Classification The second stage of our approach consists of the classification model which determines whether the plant in question is water-stressed or not. The classifier receives the cutouts for each plant from stage one (object detection). We chose a ResNet-50 model (see section 3.3.2) which has been pretrained on ImageNet. We chose the ResNet architecture due to its popularity and ease of implementation as well as its consistently high performance on various classification tasks. While its classification speed in comparison with networks optimized for mobile and edge devices (e.g. 
MobileNet) is significantly lower, the deeper structure and the additional parameters are necessary for the fairly complex task at hand. Furthermore, the generous time budget for object detection and classification allows for more accurate results at the expense of speed. The 50-layer architecture (ResNet-50) is adequate for our use case. In the following sections we describe the dataset the classifier was trained on, the metrics of the training phase, and how the performance of the model was further improved with hyperparameter optimization.

4.2.1 Dataset

The dataset we used for training the classifier consists of 452 images of healthy and 452 images of stressed plants. It has been made public on Kaggle Datasets (https://www.kaggle.com/datasets) under the name Healthy and Wilted Houseplant Images [Cha20]. The images in the dataset were collected from Google Images and labeled accordingly. The dataset was split 85/15 into training and validation sets. The images in the training set were augmented with a random crop to arrive at the expected image dimensions of 224 pixels. Additionally, the training images were modified with a random horizontal flip to increase the variation in the set and to train a rotation invariant classifier. All images, regardless of their membership in the training or validation set, were normalized with the mean and standard deviation of the ImageNet [DDS+09] dataset, which the original ResNet-50 model was pretrained with. Training was done for 50 epochs and the best-performing model as measured by validation accuracy was selected as the final version. Figure 4.7 shows accuracy and loss on the training and validation sets. There is a clear upwards trend until epoch 20 when validation accuracy and loss stabilize at around 0.84 and 0.3, respectively. The quick convergence and resistance to overfitting can be attributed to the model already having robust feature extraction capabilities.

Figure 4.7: Accuracy and loss during training of the classifier. The model converges quickly, but additional epochs do not cause validation loss to increase, which would indicate overfitting. The maximum validation accuracy of 0.9118 is achieved at epoch 27.

4.2.2 Hyperparameter Optimization

In order to improve the aforementioned accuracy values, we perform hyperparameter optimization across a wide range of parameters. Table 4.1 lists the hyperparameters and their possible values. Since the number of all combinations of values is 11 520 and each combination is trained for ten epochs with a training time of approximately six minutes per combination, exhausting the search space would take 48 days. Due to time limitations, we have chosen not to search exhaustively but to pick random combinations instead. Random search works surprisingly well—especially compared to grid search—in a number of domains, one of which is hyperparameter optimization [BB12].

Parameter       Values
Optimizer       Adam, SGD
Batch Size      4, 8, 16, 32, 64
Learning Rate   0.0001, 0.0003, 0.001, 0.003, 0.01, 0.1
Step Size       2, 3, 5, 7
Gamma           0.1, 0.5
Beta One        0.9, 0.99
Beta Two        0.5, 0.9, 0.99, 0.999
Eps             0.00000001, 0.1, 1

Table 4.1: Hyperparameters and their possible values during optimization.

The random search was run for 138 iterations, which equates to a 75% probability that the best solution lies within 1% of the theoretical maximum (see equation 4.2).
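The following sketch shows how such a random search over the space of table 4.1 could be set up; train_and_evaluate() stands in for the ten-epoch training run and is a placeholder, not part of our actual code.

    import random

    # Search space from table 4.1.
    SEARCH_SPACE = {
        "optimizer": ["adam", "sgd"],
        "batch_size": [4, 8, 16, 32, 64],
        "learning_rate": [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.1],
        "step_size": [2, 3, 5, 7],
        "gamma": [0.1, 0.5],
        "beta_one": [0.9, 0.99],
        "beta_two": [0.5, 0.9, 0.99, 0.999],
        "eps": [0.00000001, 0.1, 1],
    }

    def train_and_evaluate(config):
        """Placeholder: fine-tune the classifier for ten epochs with this
        configuration and return the validation F1-score."""
        raise NotImplementedError

    def sample_configuration(space):
        """Draw one random combination from the search space."""
        return {name: random.choice(values) for name, values in space.items()}

    def random_search(space, iterations=138):
        """Run short training runs for randomly sampled configurations and
        keep the one with the best F1-score."""
        best_score, best_config = 0.0, None
        for _ in range(iterations):
            config = sample_configuration(space)
            score = train_and_evaluate(config)
            if score > best_score:
                best_score, best_config = score, config
        return best_config, best_score

The 75% figure quoted above follows from the probability 1 − (1 − 0.01)^n that at least one of n independent samples lands in the best 1% of the search space, evaluated at n = 138 (equation 4.2).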
Figure 4.8 shows three of the eight parameters and their impact on a high F1-score. SGD has less variation in its results than Adam [KB17] and manages to provide eight out of the ten best results. The number of epochs to train for was chosen based on the observation that almost all configurations converge well before reaching the tenth epoch. The assumption that a training run with ten epochs provides a good proxy for final performance is supported by the quick convergence of validation accuracy and loss in figure 4.7. Table 4.2 lists the final hyperparameters which were chosen to train the improved model.

\[ 1 - (1 - 0.01)^{138} \approx 0.75 \tag{4.2} \]

Figure 4.8: This figure shows three of the eight hyperparameters and their performance measured by the F1-score during 138 trials. Differently colored markers show the batch size, with darker colors representing a larger batch size. The type of marker (circle or cross) shows which optimizer was used. The x-axis shows the learning rate on a logarithmic scale. In general, a learning rate between 0.003 and 0.01 results in more robust and better F1-scores. Larger batch sizes more often lead to better performance as well. As for the type of optimizer, SGD produced the best iteration with an F1-score of 0.9783. Adam tends to require more customization of its parameters than SGD to achieve good results.

Optimizer   Batch Size   Learning Rate   Step Size
SGD         64           0.01            5

Table 4.2: Chosen hyperparameters for the final, improved model. The difference to the parameters listed in table 4.1 comes as a result of choosing SGD over Adam. The missing four parameters are only required for Adam and not for SGD.

4.3 Deployment

After training the two models (object detector and classifier), we export them to the Open Neural Network Exchange (ONNX, https://github.com/onnx) format and move the model files to the Nvidia Jetson Nano. On the device, a Flask application (server) provides a REST endpoint from which the results of the most recent prediction can be queried. The server periodically performs the following steps (a minimal sketch of this loop is given at the end of this chapter):

1. Call a binary which takes an image and writes it to a file.
2. Take the image and detect all plants as well as their status using the two models.
3. Draw the returned bounding boxes onto the original image.
4. Number each detection from left to right.
5. Coerce the prediction for each bounding box into a tuple ⟨I, S, T, ∆T⟩.
6. Store the image with the bounding boxes and an array of all tuples (predictions) in a dictionary.
7. Wait two minutes.
8. Go to step one.

The binary uses the accelerated GStreamer implementation by Nvidia to take an image. The tuple ⟨I, S, T, ∆T⟩ consists of the following items: I is the number of the bounding box in the image, S the current state from one to ten, T the timestamp of the prediction, and ∆T the time since the state S last fell under three. The server performs these tasks asynchronously in the background and is always ready to respond to requests with the most recent prediction.

This chapter detailed the training and deployment of the two models used for the plant water-stress detection system—the object detector and the classifier. Furthermore, we have specified the API which publishes the results continuously. We will now turn towards the evaluation of the two separate models as well as the aggregate model.
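The serving loop described above can be sketched as a small Flask application. The endpoint name, the payload layout, and the run_pipeline() helper (which would wrap the GStreamer capture binary and the two ONNX models) are illustrative assumptions; only the two-minute interval and the tuple contents follow the description above.

    import threading
    import time
    from flask import Flask, jsonify

    app = Flask(__name__)
    latest = {"image": None, "predictions": []}  # most recent pipeline output

    def run_pipeline():
        """Placeholder: capture an image via the camera binary, run the ONNX
        detector and classifier, and return (image_path, list_of_tuples)."""
        raise NotImplementedError

    def survey_loop():
        """Background loop: capture, detect, classify, store, sleep, repeat."""
        while True:
            image_path, predictions = run_pipeline()
            latest["image"] = image_path
            latest["predictions"] = predictions  # list of (I, S, T, delta_T) tuples
            time.sleep(120)                      # wait two minutes

    @app.route("/prediction")
    def prediction():
        """Return the most recent detections and classifications."""
        return jsonify(latest)

    if __name__ == "__main__":
        threading.Thread(target=survey_loop, daemon=True).start()
        app.run(host="0.0.0.0", port=5000)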
CHAPTER 5 Evaluation

The following sections contain a detailed evaluation of the model in various scenarios. First, we describe the test datasets as well as the metrics used for assessing model performance. Second, we present the results of the evaluation and analyze the behavior of the classifier with Gradient-weighted Class Activation Mapping (Grad-CAM). Finally, we discuss the results and identify the limitations of our approach.

5.1 Methodology

In order to evaluate the object detection model and the classification model, we analyze their predictions on test datasets. For the object detection model, the test dataset is a 10% split of the original dataset which we describe in section 4.1.1. The classifier is evaluated with a 10-fold cross validation on the original dataset (see section 4.2.1). After the evaluation of both models individually, we evaluate the model in aggregate on a new dataset. This is necessary because the prototype uses the two models as if they were one. The aggregate performance is ultimately the most important measure to decide if the prototype is able to meet the requirements.

The test set for the aggregate model contains 640 images which were obtained from a Google search using the terms thirsty plant, wilted plant and stressed plant. Images which clearly show one or multiple plants with some amount of visible stress were added to the dataset. Care was taken to include plants with various degrees of stress and in various locations and lighting conditions. The search not only provided images of stressed plants but also of healthy plants. The dataset is biased towards potted plants which are commonly put on display in western households. Furthermore, many plants, such as succulents, are sought after for home environments because of their ease of maintenance. Due to their inclusion in the dataset and the way they exhibit water stress, the test set contains a wide variety of scenarios.

After collecting the images, the aggregate model was run on them to obtain initial bounding boxes and classifications for ground truth labeling. Letting the model do the work beforehand and then correcting the labels allowed us to include more images in the test set because they could be labeled more easily. Additionally, going over the detections and classifications provided a comprehensive view of how the models work and what their weaknesses and strengths are. After the labels have been corrected, the ground truth of the test set contains 766 bounding boxes of healthy plants and 494 of stressed plants.

5.2 Results

This section presents the results of the evaluation of the constituent models as well as the aggregate model. First, we evaluate the object detection model before and after hyperparameter optimization. Second, we evaluate the performance of the classifier after hyperparameter optimization and present the results of Grad-CAM. Finally, we evaluate the aggregate model before and after hyperparameter optimization.

5.2.1 Object Detection

Of the 91 479 images, around 10% were used for the test phase. These images contain a total of 12 238 ground truth labels. Table 5.1 shows precision, recall and the harmonic mean of both (F1-score). The results indicate that the model errs on the side of sensitivity because recall is higher than precision. Although some detections are not labeled as plants in the dataset, if there is a labeled plant in the ground truth data, the chance is high that it will be detected.
This behavior is in line with how the model's detections are handled in practice. The detections are drawn on the original image and the user is able to check the bounding boxes visually. If there are wrong detections, the user can ignore them and focus on the relevant ones instead. A higher recall will thus serve the user's needs better than a high precision.

         Precision   Recall      F1-score    Support
Plant    0.547 571   0.737 866   0.628 633   12 238.0

Table 5.1: Precision, recall and F1-score for the object detection model.

Figure 5.1 shows the AP for the IOU thresholds of 0.5 and 0.95. Predicted bounding boxes with an IOU of less than 0.5 are not taken into account for the precision and recall values of table 5.1. The lower the detection threshold, the more plants are detected. Conversely, a higher detection threshold leaves potential plants undetected. The precision-recall curves confirm this behavior because the area under the curve for the threshold of 0.5 is higher than for the threshold of 0.95 (0.66 versus 0.41). These values are combined in COCO's [LMB+15] main evaluation metric, which is the AP averaged across the IOU thresholds from 0.5 to 0.95 in 0.05 steps. This value is then averaged across all classes and called mAP. The object detection model achieves a state-of-the-art mAP of 0.5727 for the Plant class.

Figure 5.1: Precision-recall curves for IOU thresholds of 0.5 (AP = 0.66) and 0.95 (AP = 0.41). The AP of a specific threshold is defined as the area under the precision-recall curve of that threshold. The mAP across IOU thresholds from 0.5 to 0.95 in 0.05 steps (mAP@0.5:0.95) is 0.5727.

Hyperparameter Optimization

Turning to the evaluation of the optimized model on the test dataset, table 5.2 shows precision, recall and the F1-score for the optimized model. Comparing these metrics with the non-optimized version from table 5.1, precision is significantly higher by more than 8.5 percentage points. Recall, however, is 3.5 percentage points lower. The F1-score is higher by more than 3.7 percentage points, which indicates that the optimized model is better overall despite the lower recall. We argue that the lower recall value is a suitable trade-off for the substantially higher precision considering that the non-optimized model's precision is quite low at 0.55.

         Precision   Recall      F1-score    Support
Plant    0.633 358   0.702 811   0.666 279   12 238.0

Table 5.2: Precision, recall and F1-score for the optimized object detection model.

The precision-recall curves in figure 5.2 for the optimized model show that the model draws looser bounding boxes than the non-optimized model. The AP for both IOU thresholds of 0.5 and 0.95 is lower, indicating worse performance. It is likely that more iterations during evolution would help increase the AP values as well. Even though the precision and recall values from table 5.2 are better, the mAP@0.5:0.95 is lower by 1.8 percentage points.

5.2.2 Classification

In order to confirm that the optimized classification model does not suffer from overfitting or is a product of chance due to a coincidentally advantageous train/test split, we perform
Figure 5.2: Precision-recall curves for IOU thresholds of 0.5 (AP = 0.64) and 0.95 (AP = 0.40).
The AP of a specific threshold is defined as the area under the precision-recall curve of that threshold. The mAP across IOU thresholds from 0.5 to 0.95 in 0.05 steps mAP@0.5:0.95 is 0.5546. stratified 10-fold cross validation on the dataset. Each fold contains 90% training and 10% test data and was trained for 25 epochs. Figure 5.3 shows the performance of the epoch with the highest F1-score of each fold as measured against the test split. The mean ROC curve provides a robust metric for a classifier’s performance because it averages out the variability of the evaluation. Each fold manages to achieve at least an AUC of 0.94, while the best fold reaches 0.99. The mean ROC has an AUC of 0.96 with a standard deviation of 0.02. These results indicate that the model is accurately predicting the correct class and is robust against variations in the training set. The classifier shows good performance so far, but care has to be taken to not overfit the model to the training set. Comparing the F1-score during training with the F1-score during testing gives insight into when the model tries to increase its performance during training at the expense of generalizability. Figure 5.4 shows the F1-scores of each epoch and fold. The classifier converges quickly to 1 for the training set at which point it experiences a slight drop in generalizability. Training the model for at most five epochs is sufficient because there are generally no improvements afterwards. The best-performing epoch for each fold is between the second and fourth epoch which is just before the model achieves an F1-score of 1 on the training set. Class Activation Maps Neural networks are notorious for their black-box behavior, where it is possible to observe the inputs and the corresponding outputs, but the stage in-between stays hidden from view. Models are continuously developed and deployed to aid in human decision-making and sometimes supplant it. It is, therefore, crucial to obtain some amount of interpretability of what the model does inside to be able to explain why a decision was made in a certain way. The research field of Explainable Artificial Intelligence (XAI) gained significance during the last few years because of the development of new methods to peek inside these black boxes. One such method, Class Activation Mapping (CAM) [ZKL+15], is a popular tool to 58 5.2. Results 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 T ru e P os it iv e R at e Mean ROC curve with variability (Positive label ‘wilted’) Fold 0 (AUC = 0.94) Fold 1 (AUC = 0.96) Fold 2 (AUC = 0.96) Fold 3 (AUC = 0.99) Fold 4 (AUC = 0.98) Fold 5 (AUC = 0.95) Fold 6 (AUC = 0.98) Fold 7 (AUC = 0.97) Fold 8 (AUC = 0.97) Fold 9 (AUC = 0.98) chance level (AUC = 0.5) Mean ROC (AUC = 0.96 ± 0.02 ± 1 std. dev. Figure 5.3: This plot shows the ROC curve for the epoch with the highest F1-score of each fold as well as the AUC. To get a less variable performance metric of the classifier, the mean ROC curve is shown as a thick line and the variability is shown in gray. The overall mean AUC is 0.96 with a standard deviation of 0.02. The best-performing fold reaches an AUC of 0.99 and the worst an AUC of 0.94. The black dashed line indicates the performance of a classifier which picks classes at random (AUC = 0.5). The shapes of the ROC curves show that the classifier performs well and is robust against variations in the training set. 59 5. 
Evaluation 0.70 0.75 0.80 0.85 0.90 0.95 1.00 tr ai n /f 1- sc or e fold 0 1 2 3 4 5 6 7 8 9 0 5 10 15 20 25 epoch 0.65 0.70 0.75 0.80 0.85 0.90 te st /f 1- sc or e fold 0 1 2 3 4 5 6 7 8 9 Figure 5.4: These plots show the F1-score during training as well as testing for each of the folds. The classifier converges to 1 by the third epoch during the training phase, which might indicate overfitting. However, the performance during testing increases until epoch three in most cases and then stabilizes at approximately 2-3 percentage points lower than the best epoch. We believe that the third, or in some cases fourth, epoch is detrimental to performance and results in overfitting, because the model achieves an F1-score of 1 for the training set, but that gain does not transfer to the test set. Early stopping during training alleviates this problem. 60 5.2. Results Precision Recall F1-score Support Healthy 0.665 0.554 0.604 766 Stressed 0.639 0.502 0.562 494 Micro Avg 0.655 0.533 0.588 1260 Macro Avg 0.652 0.528 0.583 1260 Weighted Avg 0.655 0.533 0.588 1260 Table 5.3: Precision, recall and F1-score for the aggregate model. produce visual explanations for decisions made by CNNs. Convolutional layers essentially function as object detectors as long as no fully-connected layers perform the classification. This ability to localize regions of interest, which play a significant role in the type of class the model predicts, can be retained until the last layer and used to generate activation maps for the predictions. A more recent approach to generating a CAM via gradients is proposed by Selvaraju et al. [SCD+20]. Their Grad-CAM approach works by computing the gradient of the feature maps of the last convolutional layer with respect to the specified class. The last layer is chosen because the authors find that “[. . . ] Grad-CAM maps become progressively worse as we move to earlier convolutional layers as they have smaller receptive fields and only focus on less semantic local features.” [SCD+20, p.5] Turning to our classifier, figure 5.5 shows the CAMs for healthy and stressed. While the regions of interest for the healthy class lie on the healthy plant, the stressed plant is barely considered and mostly rendered as background information (blue). Conversely, when asked to explain the inputs to the stressed classification, the regions of interest predominantly stay on the thirsty as opposed to the healthy plant. In fact, the large hanging leaves play a significant role in determining the class the image belongs to. This is an additional data point confirming that the model focuses on the semantically meaningful parts of the image during classification. 5.2.3 Aggregate Model In this section we turn to the evaluation of the aggregate model. We have confirmed the performance of the constituent models: the object detection and the classification model. It remains to evaluate the complete pipeline from gathering detections of potential plants in an image and forwarding them to the classifier to obtaining the results as either healthy or stressed with their associated confidence scores. 5.2.4 Non-optimized Model Table 5.3 shows precision, recall and the F1-score for both classes Healthy and Stressed. Precision is higher than recall for both classes and the F1-score is at 0.59. Unfortunately, 61 5. Evaluation Figure 5.5: The top left image shows the original image of the same plant in a stressed (left) and healthy (right) state. 
In the top right image, the CAM for the class healthy is laid over the original image. The classifier draws its conclusion mainly from the healthy plant, which is indicated by the red hot spots around the tips of the plant. The bottom right image shows the CAM for the stressed class. The classifier focuses on the hanging leaves of the thirsty plant. The image was classified as stressed with a confidence of 70%. these values do not take the accuracy of bounding boxes into account and thus have only limited expressive power. Figure 5.6 shows the precision and recall curves for both classes at different IOU thresholds. The left plot shows the AP for each class at the threshold of 0.5 and the right one at 0.95. The mAP is 0.3581 and calculated across all classes as the median of the IOU thresholds from 0.5 to 0.95 in 0.05 steps. The cliffs at around 0.6 (left) and 0.3 (right) happen at a detection threshold of 0.5. The classifier’s last layer is a softmax layer which necessarily transforms the input into a probability of showing either a healthy or stressed plant. If the probability of an image showing a healthy plant is below 0.5, it is no longer classified as healthy but as stressed. The threshold for discriminating the two classes lies at the 0.5 value and is therefore the cutoff for either class. 5.2.5 Optimized Model So far the metrics shown in table 5.3 are obtained with the non-optimized versions of both the object detection and classification model. Hyperparameter optimization of the classifier led to significant model improvements, while the object detector has improved 62 5.2. Results 0.0 0.2 0.4 0.6 0.8 1.0 Recall 0.0 0.5 1.0 P re ci si on AP = 0.45, class = Healthy AP = 0.42, class = Stressed 0.0 0.2 0.4 0.6 0.8 1.0 Recall 0.0 0.5 1.0 P re ci si on AP = 0.18, class = Healthy AP = 0.16, class = Stressed Figure 5.6: Precision-recall curves for IOU thresholds of 0.5 and 0.95. The AP of a specific threshold is defined as the area under the precision-recall curve of that threshold. The mAP across IOU thresholds from 0.5 to 0.95 in 0.05 steps mAP@0.5:0.95 is 0.3581. Precision Recall F1-score Support Healthy 0.711 0.555 0.623 766 Stressed 0.570 0.623 0.596 494 Micro Avg 0.644 0.582 0.611 1260 Macro Avg 0.641 0.589 0.609 1260 Weighted Avg 0.656 0.582 0.612 1260 Table 5.4: Precision, recall and F1-score for the optimized aggregate model. precision but lower recall and slightly lower mAP values. To evaluate the final aggregate model which consists of the individual optimized models, we run the same test described in section 5.2.3. Table 5.4 shows precision, recall and F1-score for the optimized model on the same test dataset of 640 images. All of the metrics are better for the optimized model. In particular, precision for the healthy class could be improved significantly while recall remains at the same level. This results in a better F1-score for the healthy class. Precision for the stressed class is lower with the optimized model, but recall is significantly higher (0.502 vs. 0.623). The higher recall results in a three percentage point gain for the F1-score in the stressed class. Overall, precision is the same, but recall has improved significantly, which also results in a noticeable improvement for the average F1-score across both classes. Figure 5.7 confirms the performance increase of the optimized model established in table 5.4. The mAP@0.5 is higher for both classes, indicating that the model better detects plants in general. 
The mAP@0.95 is slightly lower for the healthy class, which means that the confidence for the healthy class is slightly lower compared to the non- optimized model. The result is that more plants are correctly detected and classified overall, but the confidence scores tend to be lower with the optimized model. The mAP@0.5:0.95 could be improved by about 0.025. 63 5. Evaluation 0.0 0.2 0.4 0.6 0.8 1.0 Recall 0.0 0.5 1.0 P re ci si on AP = 0.48, class = Stressed AP = 0.46, class = Healthy 0.0 0.2 0.4 0.6 0.8 1.0 Recall 0.0 0.5 1.0 P re ci si on AP = 0.16, class = Healthy AP = 0.16, class = Stressed Figure 5.7: Precision-recall curves for IOU thresholds of 0.5 and 0.95. The AP of a specific threshold is defined as the area under the precision-recall curve of that threshold. The mAP across IOU thresholds from 0.5 to 0.95 in 0.05 steps mAP@0.5:0.95 is 0.3838. 5.3 Discussion Overall, the performance of the individual models is state of the art when compared with object detection benchmarks such as the COCO dataset. The mAP of 0.5727 for the object detection model is in line with most other object detectors. Even though the results are reasonably good, we argue that they could be better for the purposes of plant detection in the context of this work. The OID was labeled by humans and thus exhibits characteristics which are not optimal for our purposes. The class plant does not seem to have been defined rigorously. Large patches of grass, for example, are labeled with large bounding boxes. Trees are sometimes labeled, but only if their size suggests that they could be bushes or similar types of plant. Large corn fields are also labeled as plants but again with one large bounding box. If multiple plants are densely packed, the annotators often label them as belonging to one plant and thus one bounding box. Sometimes the effort has been made to delineate plants accurately and sometimes not which results in inconsistent bounding boxes. These inconsistencies and peculiarities as well as the always present error rate introduced by humans complicate the training process of our object detection model. During a random sampling of labels and predictions of the object detection model on the validation set, it became clear that the model tries to always correctly label each individual plant when it is faced with an image of closely packed plants. For images where one bounding box encapsulates all of the plants, the IOU of the model’s predictions is too far off from the ground truth which lowers the mAP accordingly. Since arguably all datasets will have some inconsistencies and errors in their ground truth, model engineers can only hope that the sheer amount of data available evens out these problems. In our case, the 79 204 training images with 284 130 bounding boxes might be enough to provide the model with a smooth distribution from which to learn from, but unless every single label is analyzed and systematically categorized this remains speculation. The hyperparameter optimization of the object detector raises further questions. The mAP of the optimized model is 1.8 percentage points lower than the non-optimized 64 5.3. Discussion version. Even though precision and recall of the model improved, the bounding boxes are worse. We argue that the hyperparameter optimization has to be run for more than 87 iterations to provide better results. 
Searching for the optimal hyperparameters with genetic methods usually requires many more iterations than that because it takes a significant amount of time to evolve the parameters away from the starting conditions. However, as mentioned before, our time constraints only allowed optimization to run for 87 iterations. Furthermore, we only train each iteration for three epochs and assume that those already provide a good measure of the model’s performance. It can be seen in figure 4.4 that the fitness during the first few epochs exhibits some amount of variation before it stabilizes. In fact, the fitness of the non-optimized object detector (figure 4.1) only achieves a stable value at epoch 50. An optimized model is often able to converge faster which is supported by figure 4.4, but even in that case it takes more than ten epochs to stabilize the training process. We argue that three epochs are likely not enough to support the hyperparameter optimization process. Unfortunately, if the number of epochs per iteration is increased by one, the complete number of epochs over all iterations increases by the total number of iterations. Every additional epoch thus contributes to a significantly longer optimization time. For our purposes, 87 iterations and three epochs per iteration are close to the limit. Further iterations or epochs were not feasible within our time budget. The optimized classifier shows a strong performance in the 10-fold cross validation where it achieves a mean AUC of 0.96. The standard deviation of the AUC across all folds is small enough at 0.02 to indicate that the model generalizes well to unseen data. We are confident in these results provided that the ground truth was labeled correctly. The CAM (figure 5.5) constitute another data point in support of this conclusion. Despite these points, the results come with a caveat. The ground truth was not created by an expert in botany or related sciences and thus could contain a significant amount of errors. Even though we manually verified most of the labels in the dataset and agree with the labels, we are also not expert labelers. The aggregate model achieves a mAP of 0.3581 before and 0.3838 after optimization. If we look at the common benchmarks (COCO) again where the state of the art achieves mAP values of between 0.5 and 0.58, we are confident that our results are reasonably good. Comparing the mAP values directly is not a clear indicator of how good the model is or should be because it is an apples to oranges comparison due to the different test datasets. Nevertheless, the task of detecting objects and classifying them is similar across both datasets and the comparison thus provides a rough guideline for the performance of our prototype. We argue that the task of classifying the plants into healthy and stressed on top of detecting plants is a more difficult task than just object detection. Additionally to having to discriminate between different common objects, our model also has to discriminate between plant states which requires further knowledge. The lower mAP values are thus attributable to the more difficult task posed by our research questions. We do not know the reason for the better performance of the optimized versus the 65 5. Evaluation non-optimized aggregate model. Evidently, the optimized version should be better, but considering that the optimized object detector performs worse in terms of mAP, we would expect to see this reflected in the aggregate model as well. 
It is possible that the optimized classifier balances out the worse object detector and even provides better results beyond that. Another possibility is that the better performance is in large part due to the increased precision and recall of the optimized object detector. In fact, these two possibilities taken together might explain the optimized model results. Nevertheless, we caution against putting too much weight on the 2.5 percentage point mAP increase because both models have been optimized separately instead of in aggregate. By optimizing the models separately to increase the accuracy on a new dataset instead of optimizing them in aggregate, we do not take the dependence between the two models into account. As an example, it could be the case that new, better configurations of both models are worse in aggregate than some other option would be. Even though both models are locally better (w.r.t. their separate tasks), they are worse globally when taken together to solve both tasks in series. A better approach to optimization would be to either combine both models into one and only optimize once or to introduce a different metric against which both models are optimized. Apart from these concerns, both models on their own as well as in aggregate are a promising first step into plant state classification. The results demonstrate that solving the task is feasible and that good results can be obtained with off-the-shelf object detectors and classifiers. As a consequence, the baseline set forth in this work is a starting point for further research in this direction. 66 CHAPTER 6 Conclusion In this thesis, we have developed a prototype system for plant detection and classification using a machine learning model deployed on an edge device. The model consists of a two-stage approach wherein the first stage detects plants and the second stage classifies them. This approach has been chosen because of the limited availability of data to train one model end-to-end and comes with downsides such as an increased error rate and additional training, optimization, and evaluation complexity. Despite these downsides, the prototype performs well in the homeowner context where the variety of plants is limited. This conclusion is supported by the metrics discussed in chapter 5. The optimization of the model has been shown to require a substantial amount of computational resources and proved to be difficult to get right. The object detection model in particular needs many iterations during the hyperparameter search to converge to a global optimum. We attribute these difficulties to the model complexity of the YOLO series and the numerous hyperparameters which are available. The classifier, however, is comparatively simpler from an architectural standpoint and lends itself more easily to optimization. Revisiting the research questions posed in section 1.1, we can now assess the extent to which our findings have addressed them. 1. How well does the model work in theory and how well in practice? The optimized model achieves a mAP of 0.3838 which suggests that the prototype works well on unseen data. The plant detection is robust, particularly for household plants and the classifier shows strong performance for a wide array of common plants. Contrary to our expectations, the stress classification is not more difficult than the detection step. In fact, the problems we encountered during the optimization of the detection model are likely to stem from the increased complexity of the detection 67 6. 
Conclusion versus the classification task. The various different ways in which plants show water stress does not seem to be a limiting factor for stress classification. 2. What are possible reasons for it to work/not work? We have demonstrated possible reasons for why either the constituent models or the aggregate model underperform. In general, we conclude that the prototype does work and can be used within the context established in chapter 1. Our expectation that dataset curation will play a major role in successfully implementing the prototype turned out to be true. For example, some of the problems with the plant detection model can be attributed to the inconsistent labeling information present in the OID. Care had to be taken during the creation of the dataset the aggregate model was evaluated on to not introduce a bias which favors the predictions. 3. What are possible improvements to the system in the future? Specific improvements to the prototype include curating bigger datasets to train on, running the hyperparameter optimization for more iterations and more epochs per iteration, and including experts such as botanists to establish higher confidence in the ground truth. Unfortunately, the first two suggestions result in a significantly higher computational cost during training, optimization, and evaluation. This observation applies to most machine learning models and there is always a trade-off between model performance and training/optimization time. 6.1 Future Work An interesting further research direction for plant detection and classification is exploring the viability of single-stage approaches. Even though our two-stage approach leads to acceptable results, we believe that incorporating the classification step into the plant detection step would likely yield better results. A unified single-stage approach does not fully deal with the problem of propagated errors but should substantially reduce it. An advantage of this approach is that the resulting model could be optimized more easily because the loss function is dependent on object detection as well as classification. A disadvantage, however,—and this is the reason why we have not adopted such an approach—is that a unified model also needs large datasets it can be trained on. Additional datasets to train a plant detection and classification model on are needed. While we were able to find separate datasets to train the individual models on, it also meant that we were not able to implement the aforementioned single-stage approach. If there is enough interest in this research problem, it should be possible to create large datasets which encode expert knowledge. Since there is such a variety of plants and how they express nutrient deficiencies or illnesses, only experts are able to create correct ground truth labels for all or most of them. In the limited context of this thesis, we were able to label common household plants with additional information from the Internet. As soon as more exotic plants are added to the datasets, layman knowledge reaches its 68 6.1. Future Work limits. Having more and better ground truth labels should result in better detection and classification performance as well as a more robust evaluation. Future research could add additional information to the datasets such that models are able to work with more data. For example, including images of plants in the infrared spectrum would provide a visual measure of evapotranspiration. 
This additional perspective might allow the model to better discriminate between stressed and non-stressed plants. Other valuable perspectives could be provided by sensor data which track soil moisture, humidity and radiant flux. Although this has been done in single-plant agricultural settings, it has not been tried in a multi-plant household context. 69 List of Figures 2.1 Structure of an artificial neural network . . . . . . . . . . . . . . . . . . . 10 3.1 Methodological approach for the prototype. . . . . . . . . . . . . . . . . . 37 3.2 Residual connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3 Bottleneck building block . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Object detection fitness per epoch. . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Object detection precision and recall during training. . . . . . . . . . . . . 47 4.3 Object detection box and object loss. . . . . . . . . . . . . . . . . . . . . . 48 4.4 Optimized object detection fitness per epoch. . . . . . . . . . . . . . . . . 49 4.5 Hyperparameter optimized object detection precision and recall during training 50 4.6 Hyperparameter optimized object detection box and object loss . . . . . . 50 4.7 Classifier accuracy and loss during training. . . . . . . . . . . . . . . . . . 52 4.8 Classifier hyperparameter optimization results. . . . . . . . . . . . . . . . 53 5.1 Object detection AP@0.5 and AP@0.95. . . . . . . . . . . . . . . . . . . . 57 5.2 Hyperparameter optimized object detection AP@0.5 and AP@0.95 . . . . 58 5.3 Mean ROC and variability of hyperparameter-optimized model. . . . . . . 59 5.4 F1-score of stratified 10-fold cross validation. . . . . . . . . . . . . . . . . 60 5.5 Classifier CAMs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.6 Aggregate model AP@0.5 and AP@0.95. . . . . . . . . . . . . . . . . . . . 63 5.7 Optimized aggregate model AP@0.5 and AP@0.95. . . . . . . . . . . . . . 64 71 List of Tables 4.1 Hyperparameters and their possible values during optimization. . . . . . . 52 4.2 Hyperparameters for the optimized classifier. . . . . . . . . . . . . . . . . 53 5.1 Precision, recall and F1-score for the object detection model. . . . . . . . 56 5.2 Precision, recall and F1-score for the optimized object detection model. . 57 5.3 Precision, recall and F1-score for the aggregate model. . . . . . . . . . . . 61 5.4 Precision, recall and F1-score for the optimized aggregate model. . . . . . 63 73 Acronyms AI Artificial Intelligence. 7 ANN Artificial Neural Network. 22, 31 AP Average Precision. 40, 56–58, 62–64 API Application Programming Interface. 4, 35, 54 APS-C Advanced Photo System type-C. 31 AUC Area Under the Curve. 2, 58, 59, 65 BN Batch Normalization. 26, 32, 40, 42 CAM Class Activation Mapping. 58, 61, 62, 65, 71 CART Classification and Regression Tree. 32 CIoU Complete Intersection over Union. 41 CNN Convolutional Neural Network. 9, 16–24, 28, 31–33, 61 CNN-LSTM CNN Long Short-Term Memory Network. 33 COCO Common Objects in Context. 4, 5, 19–21, 36, 37, 40–42, 46, 64, 65 CPU Central Processing Unit. 17 CSI Camera Serial Interface. 36 CUDA Compute Unified Device Architecture. 22 DCNN Deep Convolutional Neural Networks. 31 DPM Deformable Part-Based Model. 16, 17 DT Decision Tree. 32 75 E-ELAN Extended Efficient Layer Aggregation Network. 42 ELAN Efficient Layer Aggregation Network. 42 ELU Exponential Linear Unit. 13 FPN Feature Pyramid Network. 19, 21, 40 GBDT Gradient Boosted Decision Tree. 31 GCC Green Canopy Cover. 
GIoU Generalized Intersection over Union. 41
GPU Graphics Processing Unit. 2, 15–17, 19, 22, 23, 41
Grad-CAM Gradient-weighted Class Activation Mapping. 55, 56, 61
HOG Histogram of Oriented Gradients. 15, 16, 22, 32
ILSVRC ImageNet Large Scale Visual Recognition Challenge. 16, 25, 28
IOU Intersection over Union. 2, 17, 20, 39, 40, 56–58, 62–64
k-NN k-Nearest Neighbors. 32
LRN Local Response Normalization. 23
mAP mean Average Precision. 2, 17–21, 36, 40–42, 46, 57, 58, 62–67
MFCC Mel-frequency Cepstral Coefficient. 8
MLP Multilayer Perceptron. 10
MNIST Modified National Institute of Standards and Technology. 23
MSE mean squared error. 13, 14, 23
NLP Natural Language Processing. 1
NMS Non Maximum Suppression. 40
OID Open Images Dataset. 45, 46, 64, 68
ONNX Open Neural Network Exchange. 54
PANet Path Aggregation Network. 41
RBF Radial Basis Function. 23
ReLU Rectified Linear Unit. 12, 13, 23–26, 40
ResNet Residual Neural Network. 51
REST Representational State Transfer. 35, 36, 54
ROC Receiver Operating Characteristic. 2, 58, 59, 71
ROI Region of Interest. 17, 18
RPN Region Proposal Network. 18, 19
SAM Spatial Attention Module. 41
SBC single-board computer. 2–4, 35, 45
SE Squeeze-Excitation. 32
SGD Stochastic Gradient Descent. 23, 33, 52, 53
SIFT Scale-Invariant Feature Transform. 15, 22, 32
SiLU Sigmoid Linear Unit. 13
SIoU Scylla Intersection over Union. 41
SPP Spatial Pyramid Pooling. 18
SSD Single Shot MultiBox Detector. 20
SVM Support Vector Machine. 9, 16–18, 23, 28, 31, 32
TPU Tensor Processing Unit. 2
UAV Unmanned Aerial Vehicle. 31, 33
VOC PASCAL Visual Object Classes. 4, 16–20, 38, 40
XAI Explainable Artificial Intelligence. 58
YOLO You Only Look Once. 20, 38–42, 46, 48, 67