Human Gait Analysis
Machine Learning-Based Classification of Gait Disorders

DISSERTATION
submitted in partial fulfillment of the requirements for the degree of Doktor der Technischen Wissenschaften
by Dipl.-Ing. Djordje Slijepčević, BSc
Registration Number 00925240
to the Faculty of Informatics at the TU Wien
Advisor: Univ.-Prof. Dipl.-Ing. Dr.techn. Christian Breiteneder
Second advisor: FH-Prof. Priv.-Doz. Dipl.-Ing. Mag. Dr. Matthias Zeppelzauer
The dissertation has been reviewed by: Neil Cronin, Morgan Sangeux
Vienna, 21st May, 2024
Djordje Slijepčević
Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.at

Declaration of Authorship
Dipl.-Ing. Djordje Slijepčević, BSc
I hereby declare that I have written this thesis independently, that I have fully cited all sources and aids used, and that I have clearly marked as borrowed, with reference to the source, all parts of the thesis (including tables, maps, and figures) that are taken from other works or from the internet, whether verbatim or in substance.
Vienna, 21st May, 2024
Djordje Slijepčević

Acknowledgements
I am deeply grateful for the guidance, support, and expertise of my first supervisor, Prof.
Christian Breiteneder, whose insights and direction were invaluable throughout this journey. His profound knowledge has been a driving force and inspiration behind this dissertation. A heartfelt thanks to Matthias Zeppelzauer, my second supervisor and daily mentor at the university. Matthias, your involvement in the day-to-day aspects of my academic work and your mentorship have been fundamental in shaping the quality and direction of this research. My gratitude extends to Brian Horsak, my mentor in the field of human gait analysis. Brian, your leadership in the gait analysis projects at the university and your insightful feedback have been essential in my understanding and exploration of this complex subject. A special thanks to Fabian Horst for the enriching collaboration over the last years. Your moral and content-wise support, especially during the crucial stages of this thesis, has been of immense value. I would like to express my appreciation to all the co-authors of the publications related to this thesis. Your contributions and collaborative spirit have been essential in achieving the quality of the presented work. A special thanks to Jürgen Pannosch and Lukas Daniel Klausner for their meticulous proofreading and feedback. To my parents, Nataša and Radomir, and my brother Dimitrije, your unwavering support and belief in me have been the foundation for this achievement. This journey would not have been possible without your love, encouragement, and sacrifice. Your words from the Master’s thesis period, “Piši, piši ponekad dva-tri reda”, have motivated me throughout this endeavor as well. Last but not least, to my wife Birgit and my two children, Lora and Iva, your patience, understanding, and unending support have been my greatest strength. Balancing family life with academic endeavors is a challenge, one that would have been impossible to overcome without your constant love and support (e.g., during the late-night proofreading sessions).

Kurzfassung
Clinical gait analysis enables the assessment of human gait. It provides the basis for clinicians to make precise diagnoses and to develop effective treatment plans. Clinical three-dimensional gait analysis, which is considered the gold standard in clinical practice, comprises various data modalities such as kinematic data (e.g., joint angles), which are calculated using optical motion capture systems, and kinetic data (e.g., ground reaction forces), which are recorded via force plates. These data are multivariate high-dimensional time series that exhibit temporal dependencies and non-linear relationships with one another. The complexity of these data and of the corresponding clinical tasks, particularly in the identification of specific pathological gait patterns, has led to the application of machine learning (ML) methods. The use of ML methods aims to increase the efficiency of clinical gait analysis and to contribute to better-informed decision-making. By using ML methods, researchers and clinicians can analyze large amounts of gait data to gain new insights that would be difficult to obtain with conventional methods. However, many of these approaches are subject to limitations, such as the use of small datasets or simplified tasks with few classes.
This dissertation addresses existing limitations in clinical gait analysis and makes a methodological contribution to tackling complex multi-class classification tasks through the development of explainable ML approaches. For this purpose, traditional ML and deep learning approaches are developed, and their applicability to gait data and corresponding classification tasks is investigated. This dissertation presents, for the first time, explainability approaches for ML methods (including deep learning methods) for clinical gait data that make decisions traceable for clinicians. Furthermore, it demonstrates the usefulness of explainability methods in identifying biases within the data and the trained ML models. In addition to a systematic evaluation of data preparation strategies with respect to feature scaling and extraction as well as data imbalance, the discriminative power of ground reaction forces and kinematic data is also investigated.
This dissertation makes a significant contribution by introducing a large real-world dataset named GaitRec. This dataset is intended to serve as a benchmark dataset and forms a crucial foundation for the standardized evaluation of the performance of ML approaches. In this work, two use cases with classification tasks of varying complexity are investigated, both of which draw on large amounts of clinical data. The first use case uses the GaitRec dataset and comprises ground reaction force data from healthy individuals as well as from patients with functional gait disorders. The second use case comprises kinematic data (e.g., joint angles) and ground reaction force data from patients with cerebral palsy. Finally, this work identifies future research directions that have the potential to advance the field of automated classification of clinical gait data.

Abstract
Clinical gait analysis is a central approach for assessing human gait, which forms the foundation for clinicians to make accurate diagnoses and to develop effective treatment plans.
Clinical three-dimensional gait analysis, considered the gold standard in clinical practice, involves various data modalities such as kinematic data (e.g., joint angles) calculated using optical motion capture systems and kinetic data (e.g., ground reaction forces) recorded via force plates. These data represent multivariate high-dimensional time series signals that exhibit temporal dependencies and non-linear relationships. The complexity of these data and corresponding clinical tasks, particularly in the identification of specific pathological gait patterns, has motivated researchers to investigate the suitability of machine learning (ML) methods to solve gait analysis tasks. The use of ML methods aims to improve the efficiency of clinical gait analysis and to contribute to better-informed decision-making. By using ML, researchers and clinical experts can analyze large amounts of gait data to gain new insights, which would be difficult with conventional methods. However, many of these approaches are also subject to limitations, such as using small datasets for training or addressing simplified tasks with only a few classes. The present thesis addresses existing gaps and limitations and makes a significant methodological contribution to explainable ML approaches for complex multi-class gait classification tasks. For this purpose, traditional ML and deep learning approaches are developed, and their suitability for gait data and corresponding classification tasks is investigated. This thesis proposes, for the first time, explainability approaches for ML methods (including deep learning methods) for clinical gait data that enable clinicians to trace decisions. Additionally, it demonstrates the usefulness of explainability methods in identifying biases within ML pipelines and gait data.
In addition to a systematic evaluation of data handling strategies concerning feature scaling, feature extraction, and data imbalances, this thesis investigates the discriminative power of ground reaction force and joint angle data. A significant contribution of the current thesis lies in the publication of a large-scale real-world dataset named GaitRec. This dataset serves as a benchmark, providing a crucial foundation for assessing the performance of ML approaches in a standardized way. In this work, two use cases with complex binary and multi-class classification tasks are investigated, utilizing large-scale clinical datasets. The first use case utilizes the GaitRec dataset and involves ground reaction force data from both healthy individuals and patients with functional gait disorders. The second use case encompasses kinematic (i.e., joint angles) and ground reaction force data from patients with cerebral palsy. Finally, the present thesis identifies future research directions that have the potential to advance the field of automated classification of clinical gait data.

Contents
Kurzfassung
Abstract
Preface
1 Introduction
1.1 Motivation
1.2 Aims of the Thesis
1.3 Synopsis and Publications
1.4 Methodology
1.5 Results and Discussion
1.6 Limitations and Future Work
1.7 Conclusion
Bibliography
2 Publications
2.1 GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait
2.2 Automatic Classification of Functional Gait Disorders
2.3 Input Representations and Classification Strategies for Automated Human Gait Analysis
2.4 Explaining Machine Learning Models for Clinical Gait Analysis
2.5 Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy

Preface
The present thesis follows a cumulative approach, encompassing a collection of selected publications derived from my contributions to machine learning in the field of clinical gait analysis. The following five journal publications constitute the main body of this thesis:
• Brian Horsak, Djordje Slijepcevic, Anna-Maria Raberger, Caterine Schwab, Marianne Worisch, and Matthias Zeppelzauer. GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait. Scientific Data, 7(1):143, 2020. DOI: 10.1038/s41597-020-0481-z
• Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, and Brian Horsak. Automatic Classification of Functional Gait Disorders. IEEE Journal of Biomedical and Health Informatics, 22(5):1653–1661, 2017. DOI: 10.1109/JBHI.2017.2785682
• Djordje Slijepcevic, Matthias Zeppelzauer, Caterine Schwab, Anna-Maria Raberger, Christian Breiteneder, and Brian Horsak. Input Representations and Classification Strategies for Automated Human Gait Analysis. Gait & Posture, 76:198–203, 2020. DOI: 10.1016/j.gaitpost.2019.10.021
• Djordje Slijepcevic, Fabian Horst, Sebastian Lapuschkin, Brian Horsak, Anna-Maria Raberger, Andreas Kranzl, Wojciech Samek, Christian Breiteneder, Wolfgang Immanuel Schöllhorn, and Matthias Zeppelzauer. Explaining Machine Learning Models for Clinical Gait Analysis. ACM Transactions on Computing for Healthcare (HEALTH), 3(2):1–27, 2021.
DOI: 10.1145/3474121
• Djordje Slijepcevic, Matthias Zeppelzauer, Fabian Unglaube, Andreas Kranzl, Christian Breiteneder, and Brian Horsak. Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy. IEEE Access, 11:65906–65923, 2023. DOI: 10.1109/ACCESS.2023.3289986

Chapter 1 motivates the undertaking of the thesis (Section 1.1), outlines the aims (Section 1.2), offers a synopsis and summary of the publications (Section 1.3), details the methodology employed (Section 1.4), discusses the results (Section 1.5), and addresses both limitations and future research directions (Section 1.6). Section 1.1 provides a general overview of the field of clinical gait analysis and motivates the use of machine learning for the automated analysis of clinical gait data. Section 1.1 concludes with the current gaps and limitations of the application of machine learning in clinical gait analysis, which serve as the foundation for motivating the aims of the thesis. The aims and corresponding research questions of the thesis are presented in Section 1.2. Section 1.3 provides concise summaries of the publications contributing to this thesis, along with individual contributions classified according to the contributor roles taxonomy (CRediT) [1]. Section 1.4 provides an overview of the methodology (i.e., investigated datasets, machine learning methods, and explainability methods) that was utilized to address the research questions and to achieve the goals of this thesis. Section 1.4 concludes with a general overview of the field of explainable artificial intelligence and matches the utilized methods to an established taxonomy. Section 1.5 provides a comprehensive discussion of the research findings from the perspective of the defined goals and the corresponding research questions. Section 1.6 offers an extensive exploration of limitations in the addressed research field while identifying potential directions for future research.
Section 1.7 summarizes the scientific contributions of the thesis. Chapter 2 contains the publications that constitute the main body of this thesis in their original form as they were published. The order in which the publications are presented is selected to ensure logical coherence for the reader and alignment with the research questions (independent of the chronological order of publication dates).

CHAPTER 1
Introduction

1.1 Motivation
Diseases or injuries of the musculoskeletal locomotor system as well as neurological disorders can affect people of any age (although they are more prevalent in older populations), regardless of gender and social status, and are one of the main causes of pathological impairments of human motor function. Various factors, including infections, inflammations, degenerative processes, traumatic events, neoplastic and vascular diseases, as well as neurological conditions such as cerebral palsy, Parkinson’s disease, multiple sclerosis, and stroke, can cause impairments in human motor function [2]. Affected people may lose the ability to interact with their environment and participate fully in social activities or the labor market as a result of these disabilities. According to the Global Burden of Disease Study 2019, musculoskeletal disorders were identified as one of the leading factors contributing to the growing burden on health systems across 202 countries [3]. In Austria, similar trends can be observed, with diseases related to the musculoskeletal system and connective tissues accounting for approximately 21.9% of the causes of sick leave in 2021 [4]. Furthermore, among the population aged 15 and above in Austria, 9.1% experienced challenges when walking longer distances, while 11.3% encountered difficulties while climbing stairs [5].
The aforementioned statistics, their implications, and the burden they impose on national health care systems provide strong motivation for conducting extensive research on causes and symptoms associated with diseases related to the musculoskeletal system. Various factors can be linked to diseases and injuries related to the musculoskeletal system, such as physical activity, diet, obesity, and smoking. Extensive studies have been dedicated to human gait, due to its role as an indicator of both physical activity and overall quality of life. To better understand gait impairments and enable optimal patient treatment, an accurate assessment of underlying movement mechanisms is essential. Different gait analysis approaches of varying complexity have been developed for this purpose.

1.1.1 Clinical Gait Analysis
Clinical gait analysis serves as a tool for the evaluation of human gait, with the primary aim of identifying impairments that affect a patient’s gait pattern [6]. Clinical gait analysis supports clinicians in making accurate diagnoses and developing individualized treatment plans for their patients. For this reason, clinical gait analysis has become a crucial assessment tool in hospitals and rehabilitation centers. There are different gait analysis approaches with varying levels of complexity and accuracy, as well as different requirements for equipment and personnel. These approaches range from observational gait analysis [7] to more quantitative measures such as kinematic (e.g., joint angles) and kinetic (e.g., joint moments) data derived from instrumented three-dimensional gait analysis (3DGA) [8]. Clinical 3DGA is well-established in clinical practice and regarded as the gold standard for the quantification of a patient’s gait performance due to the high accuracy and quality of derived information [6].
This approach relies primarily on motion capture techniques in which retro-reflective markers are placed at specific anatomical landmarks on the human body. Using the 3D trajectories of these markers in conjunction with geometric biomechanical models, kinematic data such as joint angles can be accurately calculated [6]. Alternatively, recent approaches use inertial measurement units (IMUs) to extract kinematic information outside the gait laboratory [9, 10, 11]. In addition to kinematic data, assessments often include the measurement of muscle activation via electromyography as well as ground reaction forces via force plates [6]. The ground reaction force (GRF) is the reaction force exerted by the ground, equal to the (averaged) force applied by the human body to the ground (i.e., body weight) [8]. Because these data modalities capture complementary information, utilizing them together yields a comprehensive understanding of the patient’s walking behavior. The general drawbacks of 3DGA are the need for highly trained staff as well as high acquisition and maintenance costs. In addition, a major drawback is the time-consuming and complex setup process, which includes the system calibration and the attachment of markers or sensors to specific landmarks on the patient’s body. For certain groups of patients, this time investment might not be feasible, leading to the recording of only GRF data.

1.1.2 Automated Classification of Gait Data
The prevalent approach in the current clinical setting involves the manual analysis of gait data obtained via 3DGA by representing it in the form of line plots during the assessment and diagnostic process (see Figure 1.1). However, this approach is subjective, time-consuming, and potentially costly.
Furthermore, experienced and qualified domain experts are necessary for the manual analysis of gait data due to the high-dimensional nature and the presence of temporal dependencies, strong variability, non-linear relationships, and inter-correlations within the different signals [12]. Clinical gait data and medical history are stored in databases that implicitly contain a vast amount of valuable clinical knowledge, which is, however, currently hardly accessible.

Figure 1.1: Data typically presented in a 3D gait analysis report used in clinical practice. Retro-reflective markers (depicted as pink spheres) are attached to specific anatomical landmarks on the human body, enabling the quantification of human locomotion through a 3D motion capture system. The 3D trajectories of these markers combined with geometrical biomechanical models are utilized to determine joint angles. In addition, ground reaction forces are determined via force plates. In clinical practice, these data are typically used to inform medical decision-making. The data from clinical gait analysis reports are typically presented as simple line plots. Blue and red colors encode the right and left body sides, respectively. Deriving a diagnosis from these abstract line plots is a challenging task that requires the expertise of trained medical professionals.

Automated data analysis methods that utilize machine learning (ML) bear the potential to exploit this implicit knowledge and provide an efficient and data-driven way for the automated detection of pathological gait patterns. The application of automated data analysis methods can assist clinicians by providing efficient insights into gait data without the need for extensive manual analysis of the complex data. The aim of developing automated analysis approaches is not to replace clinicians, but rather to enhance their capabilities and provide them with valuable tools for faster and more accurate assessment.
Clinicians could benefit from immediate insights and data-driven support that enable them to make better-informed decisions and create personalized treatment plans. Furthermore, accelerating the diagnosis and decision-making process through ML-based assistance systems would also save time and thus healthcare costs. In recent years, ML has made significant contributions to healthcare applications. For example, in the medical domain, ML methods have already been able to detect skin and breast cancer more efficiently and accurately than clinicians [13, 14, 15]. However, the field of clinical gait analysis still lags behind despite having accumulated a wealth of data through different measurement methods over the past decades. The demand for rapid and accurate decision-making, coupled with the complexity of gait data, has driven research efforts to leverage ML [16]. However, existing literature addressing ML approaches in clinical gait analysis exhibits limitations, which are outlined in Section 1.1.4. To overcome the limitations of existing ML approaches based on handcrafted domain-specific features and linear compression using principal component analysis (PCA), there has been a strong motivation to explore non-linear representations [17]. Utilizing deep learning enables the autonomous learning of non-linear representations through a data-driven approach. The motivation for the application of deep learning builds on the idea that human motor actions consist of elementary building blocks, so-called motion primitives, at different levels (e.g., neural or kinematic) [18, 19, 20]. Different transformations and combinations of these motion primitives to more complex modules form increasingly complex motor actions. Thus, the human gait as such a complex motor action is also based on redundant modules on every level of the motor hierarchy. This structural property makes gait data especially interesting for feature learning [18]. 
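
The linear compression via PCA mentioned above can be illustrated with a minimal sketch. It is not the pipeline used in the publications of this thesis: the waveforms are synthetic stand-ins for time-normalized GRF curves, and the function name `pca_compress`, the 95% variance threshold, and all signal parameters are assumptions chosen purely for illustration.

```python
import numpy as np


def pca_compress(X, var_threshold=0.95):
    """Project the rows of X onto the fewest principal components whose
    cumulative explained variance reaches var_threshold (hypothetical helper)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    # First index at which the cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(S))
    components = Vt[:k]
    scores = Xc @ components.T
    return scores, components, mean, explained[:k]


# Synthetic, GRF-like waveforms (assumption): 50 trials, 101 time points each.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
base = np.sin(np.pi * t) + 0.25 * np.sin(3 * np.pi * t)
amplitude = rng.normal(1.0, 0.1, size=(50, 1))
asymmetry = rng.normal(0.0, 0.05, size=(50, 1))
noise = rng.normal(0.0, 0.01, size=(50, 101))
X = amplitude * base + asymmetry * np.sin(2 * np.pi * t) + noise

scores, components, mean, explained = pca_compress(X)
# Linear reconstruction from the retained components.
X_hat = scores @ components + mean
print(scores.shape, float(explained.sum()))
```

Each trial is reduced from 101 samples to a handful of scores, and the reconstruction `X_hat` is a purely linear combination of the retained components. This linearity is the restriction that motivates non-linear, learned representations.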
For hierarchically structured data, e.g., images, music, or speech, deep learning methods have been shown to be particularly suitable for learning hierarchical representations that combine basic building blocks into complex and abstract concepts [18]. The basic assumption for biomechanical gait data is that specific pathologies are associated with distinct motion primitives that compose the gait pattern of a patient. Deep learning-based approaches represent a promising candidate to learn meaningful gait representations from the data. Furthermore, the application of explainability methods could serve as a valuable tool in identifying the location and pattern of motion primitives associated with the investigated pathologies.

1.1.3 Significance of Explainability
While ML approaches show promising outcomes in terms of classification performance, they often suffer from a significant drawback, which is their black-box nature [21]. This implies that even if we understand the underlying mathematical principles of these methods, their decision-making process is often incomprehensible and their predictions are hard to trace. Thus, the problem in the context of clinical gait analysis is that it remains unclear to clinical experts whether predictions are based on clinically relevant patterns or if they are influenced by spurious correlations or biases in the data that are not causally related to the targeted pathologies. The inability to validate the functioning of complex ML models and the challenge of understanding the learned patterns and rules are currently restricting the application of ML-based decision-support systems in clinical practice. Clinicians require full transparency of decisions [22, 23]. The absence of transparency in ML approaches poses a significant challenge in offering justifications for their predictions.
These justifications are essential for compliance with regulations such as the General Data Protection Regulation (GDPR, EU 2016/679) [24] and the recently proposed Artificial Intelligence Act [25] by the European Commission. For simpler ML models that are inherently explainable, such as decision trees, generating decision and model explanations can be relatively straightforward (e.g., by utilizing feature importance). However, to identify patterns within the input data that contribute to the predictions of complex ML models, methods from the field of explainable artificial intelligence (XAI) are necessary. In general, these explainability methods aim to reveal the workings of complex non-linear ML models and the way they produce their predictions.

1.1.4 Current Gaps and Limitations
In the context of clinical gait analysis and the automated analysis of gait data, various limitations become evident. These limitations manifest across different levels, and the subsequent listing is not intended to present a comprehensive compilation but a summary of limitations that serve as motivation for the present thesis.

Data and annotation availability. Existing literature on ML approaches in clinical gait analysis has primarily focused on simple use cases and small-scale datasets. Furthermore, in the field of clinical gait analysis, there is a lack of comprehensive publicly available datasets containing data from patients and healthy controls. An important constraint for gait data is also the absence of annotations, which are particularly crucial in clinical scenarios (e.g., pathological gait patterns) as the annotation process typically involves a subjective and time-consuming evaluation by clinical experts. Each laboratory collects the data independently, and the use of different laboratory settings further complicates the merging of different data modalities.
However, it is critical to consider incorporating heterogeneous and large datasets to train and validate robust ML models. This approach is central to ensuring the applicability of these models in diverse populations and to improving their generalizability.

Differently expressive data modalities. As the gold standard for assessing human gait, 3DGA considers the kinematic and kinetic aspects of movement. However, in everyday clinical practice, clinicians and therapists face challenges due to the necessity to examine a large number of patients. There is a trade-off between the accuracy and time efficiency of 3DGA. Additionally, motion capture systems utilized for 3DGA are expensive and the operation of such systems requires trained personnel, which further complicates their integration into clinical practice. Thus, an alternative that is sometimes used involves recording only GRFs using force plates. Considering the time-efficient process of collecting only GRF data, as opposed to 3DGA, the accumulation of datasets suitable for automated analysis becomes more feasible. Moreover, the availability of GRF data is higher, as they can be obtained from regular 3DGA, as well as from gait laboratories without motion capture systems and can include historic data from periods when such complex recording systems were not utilized. In addition to simplified data collection, GRF data also offer advantages such as easier integration of datasets from different gait laboratories. This facilitates the setup of multi-center studies, while the integration of kinematic data from different gait laboratories is more complicated due to differences in marker setups and biomechanical models across gait laboratories. As a result, numerous studies in the literature have utilized GRF data and demonstrated high classification performance.
However, these studies primarily focused on distinguishing between one or two specific pathological gait patterns and healthy controls (physiological gait) [16]. These studies investigated pathological gait patterns associated with conditions such as Parkinson’s disease [26, 27, 28], cerebral palsy [29], multiple sclerosis [29], osteoarthritis [30], transfemoral amputation [31], and lower limb fracture [32]. The main drawback of utilizing only GRF data is that the view on the biomechanical processes of the lower extremities is narrowed compared to data derived from 3DGA, as kinematic processes are not explicitly represented. This is also the reason why GRF data have often only been utilized for binary classification tasks, e.g., to distinguish between healthy controls and a single pathological gait pattern. The quantitative assessment of the discriminative power of GRF data for multi-class classification tasks, in comparison to 3DGA data, remains unexplored in the literature.

Lack of systematic evaluation of data handling strategies. In the existing literature on automated gait classification, evaluating the impact of different data processing strategies on performance has not yet been thoroughly addressed for complex multi-class classification tasks. In particular, there is a gap in the study of how factors such as feature scaling and feature extraction, data imbalance, and dealing with multiple trials per individual affect the performance of ML approaches. Understanding the impact of these data factors on complex, clinically relevant datasets is crucial for optimizing the performance and robustness of such approaches.

Limitations in systematically evaluating traditional ML and deep learning approaches. The main difference between deep learning and traditional ML relates to the concept of feature extraction.
In deep learning, there is no need for explicit feature extraction since the model inherently learns the features directly from the data (i.e., feature learning). This capability is enabled by the architecture of deep neural networks. These models are composed of multiple stacked layers that facilitate the learning of higher-level, more abstract features from the raw input data. These high-level features enable deep learning models to handle complex multi-class classification tasks [17]. There has been an increasing trend towards the use of deep learning methods for gait data in the literature in recent years [17]. These studies are often subject to limitations such as very small datasets or simplified classification tasks with few classes. Therefore, uncertainties remain regarding the suitability of deep learning for complex multi-class gait classification tasks and how deep learning methods compare to traditional ML methods.

Lack of explainability. Explainability methods (see Section 1.4.3) have been successfully used to explain ML models in a variety of domains and their application in the medical field has also received considerable attention [33]. The motivation behind this is to increase transparency and thereby trust in ML models among medical professionals [34]. However, the use of explainability methods in the context of clinical gait analysis still needs to be explored. This is particularly interesting because most explainability methods have been developed for image data and structured data, and evaluating explanations becomes particularly challenging when dealing with more abstract data such as multivariate time series. The suitability and usefulness of explainability methods for gait analysis and for clinical practice in general is currently an open question.

1.2 Aims of the Thesis

The primary aim of this thesis is to address and overcome the aforementioned gaps and limitations through the development and investigation of novel ML approaches. These approaches are intended for the automated analysis of measurement data obtained from clinical gait analysis with the overall aim of supporting clinical decision-making. The present thesis focuses on developing and evaluating the performance of explainable ML and deep learning methods in modeling motion primitives at both kinematic (i.e., joint angle) and kinetic (i.e., GRF) levels while addressing complex classification tasks and larger datasets compared to the current state of the art. For the development and evaluation of these methods, two use cases with binary and multi-class classification tasks will be explored: i) the use case on functional gait disorders (UC: functional gait disorders), which includes GRF data from healthy controls and four classes with functional gait disorders related to the hip, knee, ankle, or calcaneus, and ii) the use case on cerebral palsy (UC: cerebral palsy), which utilizes a dataset containing GRF and joint angle data from patients with cerebral palsy with four distinct pathological gait patterns. The UC: cerebral palsy aims to enable a quantitative assessment of the discriminative power of both GRF and joint angle data for classifying multiple pathological gait patterns.

Overall, this thesis proposes a set of methodologies (published in high-ranked peer-reviewed journals) designed and implemented to achieve the following research goals:

• Goal 1 – Creation of high-quality dataset: Creation and publication of a high-quality (from a biomechanical point of view) gait dataset that contains clinically relevant annotations and GRF data and is comprehensive with respect to the number of participants and the number of trials per participant, as well as the diversity of pathological gait patterns.
• Goal 2 – Evaluation of discriminative power of 3DGA modalities: Evaluation of the discriminative power of different 3DGA data modalities, i.e., GRF and joint angle data, for automated gait classification.
• Goal 3 – Evaluation of data handling strategies: Evaluation of the impact of various data handling strategies (including feature scaling, feature extraction, data imbalance, and different aggregation strategies) on the performance of automated gait classification.
• Goal 4 – Comparison of traditional ML and deep learning: Development and comparison of traditional ML models and deep neural networks in terms of classification performance.
• Goal 5 – Evaluation of explainability approaches: Development and evaluation of explainability approaches for traditional ML models and deep neural networks, and assessment of whether these models utilize clinically relevant input features for gait classification.

In the context of the above-mentioned challenges and goals, the main research questions (RQs) addressed in the present thesis are the following.

Research question related to Goal 1:

• RQ1.1: Which steps should a preprocessing pipeline for GRF data include to facilitate the collaborative use of data gathered from different gait laboratories?

A common challenge in using ML for gait analysis is the limited availability of large datasets. Typically, ML models are trained and evaluated on small datasets from a single gait laboratory. The absence of comprehensive benchmark datasets makes it challenging to provide clear guidance on appropriate data preprocessing and classification methods for specific classification tasks. Regarding data preprocessing in the field of gait analysis, it is important to introduce and assess domain-specific as well as ML-related preprocessing procedures. These procedures, including data thresholding, data filtering, outlier detection, and data normalization, have been evaluated on a large real-world dataset.
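The preprocessing steps named above can be illustrated with a small sketch for a single vertical GRF signal. It is written in plain Python under stated assumptions and is not the pipeline evaluated in the thesis: the 25 N threshold, the moving-average filter (a simple stand-in for a proper low-pass filter), the resampling to 101 points, the normalization by body weight, and all function names are illustrative choices. Outlier detection is omitted for brevity.

```python
def threshold_stance(fz, threshold_n=25.0):
    """Keep samples from the first to the last frame that reaches the threshold.

    The 25 N default is an assumed value for illustration.
    """
    above = [i for i, force in enumerate(fz) if force >= threshold_n]
    if not above:
        return []
    return fz[above[0]:above[-1] + 1]


def moving_average(signal, window=3):
    """Centred moving average; an assumed stand-in for a low-pass filter."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def time_normalize(signal, n_points=101):
    """Resample to n_points samples (0-100% of stance) by linear interpolation."""
    if len(signal) < 2 or n_points < 2:
        raise ValueError("need at least two input samples and two output points")
    last = len(signal) - 1
    resampled = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional index into the input
        i = min(int(pos), last - 1)       # left neighbour
        frac = pos - i
        resampled.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
    return resampled


def amplitude_normalize(signal, body_weight_n):
    """Express forces as multiples of body weight (given in newtons)."""
    return [force / body_weight_n for force in signal]


def preprocess_vertical_grf(fz, body_weight_n, threshold_n=25.0, n_points=101):
    """Chain the illustrative steps: threshold, filter, resample, normalize."""
    stance = threshold_stance(fz, threshold_n)
    smoothed = moving_average(stance)
    resampled = time_normalize(smoothed, n_points)
    return amplitude_normalize(resampled, body_weight_n)


# Toy signal in newtons; real recordings contain far more samples.
raw = [0.0, 10.0, 30.0, 400.0, 800.0, 400.0, 30.0, 10.0, 0.0]
curve = preprocess_vertical_grf(raw, body_weight_n=800.0)
```

Bringing every trial to the same length and to a body-weight-relative scale is what makes signals from different persons, and potentially from different laboratories, directly comparable as fixed-size input vectors.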
The outcome is a standardized preprocessing pipeline for GRF data that can be applied across various gait laboratory settings, enabling the collaborative use of these data from different research laboratories.

Research questions related to Goal 2:

• RQ2.1: What level of classification performance can be achieved using only GRF data for automated gait classification?

In the related literature, GRF data are commonly utilized for binary classification tasks to distinguish between physiological and pathological gait. Multi-class classification tasks using GRF data are less common and are usually employed to identify patient groups exhibiting large differences in their gait patterns. To evaluate the discriminative power of GRF data, two complex multi-class datasets were employed. For the UC: functional gait disorders, the classification performance was evaluated in a binary classification task by merging all pathological classes and distinguishing them from physiological gait. Subsequently, the classification performance was evaluated on the more complex multi-class classification task (as originally defined in the GaitRec dataset). The UC: cerebral palsy examined the discriminative power of GRF data for different gait patterns within the cerebral palsy population. The outcome is a quantitative comparison of classification performance, based exclusively on the use of GRF data, for the two given use cases.

• RQ2.2: What is the advantage in classification performance when using kinematic data compared to GRF data for automated gait classification, and is there an improved classification performance when using both inputs together as opposed to using them separately?

Recording only GRF data is a more time- and resource-efficient approach compared to 3DGA. However, the exclusive use of GRF data represents a significant limitation for understanding the biomechanical processes of the human body.
The amount of relevant information is significantly limited compared to the data obtained from 3DGA, as the exclusive use of GRF data excludes the explicit representation of gait kinematics. This drawback emphasizes the importance of incorporating 3DGA data, in particular for multi-class classification tasks. The experiments to assess the classification performance of the different data modalities were carried out on the complex multi-class classification task within the UC: cerebral palsy (see Section 1.4.1). The outcome is a quantitative comparison of classification performance, evaluating the effectiveness of each individual data modality separately and in combination.

• RQ2.3: To what degree do the signals from the affected and unaffected sides differ in terms of their discriminative power for automated gait classification?

Conditions affecting the musculoskeletal locomotor system or neurological disorders have consequences not only for the (more) affected leg but also for the unaffected (less affected) leg. Individuals experiencing these conditions tend to develop compensatory strategies in the unaffected side, primarily influenced by the increased use of this side [35]. Leveraging these additional compensatory strategies encoded in the data from the unaffected side could potentially enhance classification performance. This research question was addressed by evaluating the classification performance on the classification tasks defined in the UC: functional gait disorders. The outcome is a quantitative comparison of classification performance, evaluating the discriminative power of data from the affected and unaffected sides, both separately and in combination.

Research questions related to Goal 3:

• RQ3.1: To what extent do different feature scaling and feature extraction techniques impact the performance of automated gait classification?
In ML practice, it is well established that feature scaling and feature extraction techniques can greatly aid in the training process of ML models. Feature scaling is a necessary step before applying ML models to ensure uniform numerical ranges across different input features and signals. Thus, feature scaling prevents signals with larger numeric ranges (amplitudes) from dominating those with smaller dynamic ranges. The primary focus in this thesis is on different feature scaling techniques (e.g., min-max normalization and z-standardization). Furthermore, various methods for feature extraction were evaluated, aiming to obtain diverse data parameterizations. The investigated parameterizations encompass handcrafted domain-specific parameters as well as PCA-derived representations of the raw data and handcrafted parameters. The assessment of these feature scaling variants and representations involved evaluating the classification performance on the tasks defined in the UC: functional gait disorders. The outcome is a quantitative comparison of classification performance, evaluating the different feature scaling and feature extraction approaches.

• RQ3.2: What is the impact of data imbalance on the performance of automated gait classification?

One of the most significant factors influencing the classification performance of ML models is class imbalance [36]. Imbalanced data refers to a scenario with an unequal distribution of samples among different classes. Imbalanced data can result in the training of biased ML models, which in turn can lead to lower classification performance, especially for the minority classes. Real-world datasets in the field of human gait analysis exhibit various imbalances. Certain pathological classes are inherently much rarer than others.
Furthermore, in some conditions, such as those involving long therapy processes, subjects may have significantly more sessions recorded than in other cases (in which only a low number of sessions are recorded). In a single session, the number of recorded trials can also vary, influenced by factors such as the patient’s condition. In this thesis, two causes of imbalance, i.e., variations in the number of patients and sessions per patient, were investigated both individually and in combination in the UC: functional gait disorders. To this end, the classification performance was evaluated on subsets that are balanced with respect to these two causes of imbalance. The outcome is a quantitative comparison of achievable classification performance in balanced and imbalanced scenarios.

• RQ3.3: To what extent do different data aggregation methods impact the performance of automated gait classification?

In clinical practice, multiple trials are often recorded during a recording session. Clinicians usually analyze these trials by averaging them to achieve more robust representations. With multiple trials per recording session in the datasets, the question arises whether these trials can be combined or aggregated to enhance the prediction robustness of ML models. The baseline approach involved using all available trials from a session without aggregation to train ML models. Furthermore, different early fusion techniques were investigated, such as aggregating (i.e., averaging) and subselecting (i.e., using the median or the most representative trial) trials from a session prior to training the ML model. Additionally, a late fusion strategy was evaluated, which aggregated the predictions of the ML model trained on all trials (i.e., baseline approach) using majority voting. The evaluation of these aggregation methods involved assessing the classification performance on the classification tasks within the UC: functional gait disorders.
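The fusion strategies described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the thesis implementation: the function names are hypothetical, the median is computed point-wise, the most representative trial is taken to be the one closest (in squared Euclidean distance) to the session mean, and ties in the vote are broken by first occurrence. All of these are assumptions made for the example.

```python
from collections import Counter
from statistics import median


def mean_waveform(trials):
    """Early fusion: average all trials of a session point by point."""
    n_trials = len(trials)
    return [sum(column) / n_trials for column in zip(*trials)]


def median_waveform(trials):
    """Early fusion: point-wise median across trials (assumed definition)."""
    return [median(column) for column in zip(*trials)]


def most_representative_trial(trials):
    """Early fusion by subselection: the trial closest to the session mean.

    Squared Euclidean distance is an assumed criterion for illustration.
    """
    mean = mean_waveform(trials)

    def distance(trial):
        return sum((a - b) ** 2 for a, b in zip(trial, mean))

    return min(trials, key=distance)


def majority_vote(predictions):
    """Late fusion: most frequent per-trial prediction (first one wins ties)."""
    if not predictions:
        raise ValueError("no predictions to aggregate")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Three toy trials of one session, each with three time-normalized samples.
session = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [9.0, 9.0, 9.0]]
fused_mean = mean_waveform(session)
fused_pick = most_representative_trial(session)

# Late fusion acts on per-trial predictions of an already trained classifier.
session_label = majority_vote(["hip", "knee", "hip"])
```

Early fusion reduces each session to a single input before training, whereas late fusion leaves training untouched and only combines the per-trial decisions at inference time.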
The outcome is a quantitative comparison of classification performance for the three early fusion approaches, the late fusion approach, and the baseline approach.

Research question related to Goal 4:

• RQ4.1: How do traditional ML models compare to deep neural networks for automated gait classification in terms of performance?

Deep learning and traditional ML methods differ in their learning paradigms, as elaborated in Section 1.1.4 and Section 1.4.2. To assess the classification performance of these methods, systematic comparisons using the datasets of the two use cases were performed. Within the UC: functional gait disorders, the classification performance of convolutional neural networks (CNNs), multi-layer perceptrons (MLPs), and support vector machines (SVMs) was evaluated across six tasks (comprising four binary and two multi-class tasks). In the UC: cerebral palsy, the performance of convolutional neural networks, self-normalizing neural networks, random forests, decision trees, support vector machines, and gradient boosting classifiers was evaluated. The outcome is a quantitative comparison of classification performance that should provide information regarding the strengths and limitations of the ML methods when utilized in specific gait classification tasks.

Research questions related to Goal 5:

• RQ5.1: To what extent can explainability approaches be employed to determine the input features on which ML models base their decisions for automated gait classification, and are these relevant input features statistically justified and in line with clinical assessment?
To develop decision-support systems for clinical practice using ML, it is essential to integrate explainability approaches, which can be implemented at the following levels (see Section 1.4.3): i) at the data level (i.e., data exploration), ii) at the decision level (i.e., explanation of a specific prediction), and iii) at the model level (i.e., explanation of class-specific and model-specific patterns and learning strategies). Different explainability approaches were implemented, and their explanations were investigated at the three levels using the datasets of the two use cases. The evaluation of explainability at the data level was performed through the use of linear discriminant analysis (LDA). The evaluation of explainability at the decision and model level was conducted using two state-of-the-art methods, i.e., layer-wise relevance propagation (LRP) [37] and gradient-weighted class activation mapping (Grad-CAM) [38]. For the UC: functional gait disorders, LRP [37] was utilized to explain convolutional neural networks, multi-layer perceptrons, and support vector machines across the binary tasks. For the UC: cerebral palsy, Grad-CAM [38] was employed to generate explanations for convolutional neural networks and self-normalizing neural networks. Additionally, for random forests and decision trees, the feature importance based on Gini impurity served as model explanation. The outcomes are i) a quantitative evaluation from a statistical perspective using statistical parametric mapping (SPM) [39] to assess whether relevant input features also exhibit statistical differences between the classes, and ii) a qualitative examination of the explainability results and the differences in these results among the different ML methods, conducted via a series of focus group interviews with clinical experts.

• RQ5.2: How effective are explainability approaches in detecting bias in ML models used for automated gait classification?
In practice, ML approaches are prone to biases. These biases are often present in the training data and originate from factors such as imbalanced data distributions, differences in walking speeds among different populations (e.g., physiological vs. pathological classes), or inadequate data preprocessing (e.g., unequal data scaling). An explainability approach was utilized to identify biases in ML models caused by the absence of feature scaling and variations in walking speed between healthy controls and patients. The outcome is a qualitative evaluation of the explainability results via focus group interviews with clinical experts to identify specific biases, followed by experiments designed to address and mitigate the underlying causes of these biases.

1.3 Synopsis and Publications

In the following, the reader will find a brief summary of the publications that contribute to this thesis. The emphasis lies on five journal publications encompassing methodological advancements beyond the respective state of the art. Table 1.1 illustrates the relationship between the publications and the research goals and questions defined in Section 1.2. Table 1.2 states the personal contributions (indicated with ✓) for each paper according to the contributor roles taxonomy (CRediT) [1].

Publications                        | Goal 1 | Goal 2                | Goal 3                | Goal 4 | Goal 5
                                    | RQ1.1  | RQ2.1 | RQ2.2 | RQ2.3 | RQ3.1 | RQ3.2 | RQ3.3 | RQ4.1  | RQ5.1 | RQ5.2
Horsak & Slijepcevic et al. (2020)* | ✓      |       |       |       |       |       |       |        |       |
Slijepcevic et al. (2017)           |        | ✓     |       |       | ✓     | ✓     |       |        | ✓     |
Slijepcevic et al. (2020)           |        | ✓     |       | ✓     |       |       | ✓     |        |       |
Slijepcevic & Horst et al. (2022)*  |        | ✓     |       | ✓     |       |       |       | ✓      | ✓     | ✓
Slijepcevic et al. (2023)           |        | ✓     | ✓     |       |       |       |       | ✓      | ✓     |

Table 1.1: Correspondence (indicated with ✓) between the publications comprising the present thesis and the research goals and questions (RQs) addressed within the thesis. The asterisk (*) indicates publications with shared first authorship.

The research question RQ1.1 related to Goal 1 is addressed in Horsak et al.
(2020), a publication that was released alongside a real-world dataset containing clinical gait data. This dataset also forms the fundamental basis for three of the other publications. Several RQs have been explored in multiple publications, with RQ2.1 being the most frequently addressed and covered in four of the five publications. In Slijepcevic et al. (2017) [40], Slijepcevic et al. (2020) [41], Slijepcevic et al. (2022) [42], and Slijepcevic et al. (2023) [43], we utilized GRF data from the UC: functional gait disorders and examined various classification tasks, offering a comprehensive evaluation related to RQ2.1. RQ2.2 was addressed in Slijepcevic et al. (2023) [43] due to the availability of both GRF and joint angle data in the UC: cerebral palsy. The assessment of the discriminative power between the affected and unaffected side (RQ2.3) was conducted in Slijepcevic et al. (2020) [41] and Slijepcevic et al. (2022) [42]. Aspects related to RQ3.1, such as exploring different feature extraction methods (i.e., handcrafted domain-specific parameters as well as PCA-derived representations of the data) and investigating the influence of feature scaling techniques on classification performance, were examined in Slijepcevic et al. (2017) [40]. In Slijepcevic et al. (2017) [40], we also examined the impact of data imbalance (RQ3.2), which guided our approach to utilize balanced datasets in subsequent publications. Slijepcevic et al. (2020) [41] evaluated the influence of different data aggregation approaches, i.e., early and late fusion strategies, on the classification performance in scenarios with multiple trials per person (RQ3.3).

The research question RQ4.1 related to Goal 4 was predominantly addressed in Slijepcevic et al. (2022) [42] and Slijepcevic et al. (2023) [43]. These two publications explored the comparison between traditional ML models and deep neural networks concerning classification performance. Slijepcevic et al.
(2017) [40], Slijepcevic et al. (2022) [42], and Slijepcevic et al. (2023) [43] examined RQ5.1 from different perspectives. In Slijepcevic et al. (2017) [40], explainability was investigated on the data level by utilizing linear discriminant analysis. This approach allowed the assessment of the discriminative power of handcrafted domain-specific features and PCA representations. In Slijepcevic et al. (2022) [42] and Slijepcevic et al. (2023) [43], various explainability approaches were proposed to obtain explanations at the prediction, class, and model level. Finally, the investigation of how explainability methods enable the identification of bias related to walking speed differences and data scaling (RQ5.2) was addressed in Slijepcevic et al. (2022) [42].

Publications | Conceptualization | Data Curation | Formal Analysis | Funding Acquisition | Investigation | Methodology | Project Administration | Resources | Software | Supervision | Validation | Visualization | Writing – Original Draft | Writing – Review & Editing
Horsak & Slijepcevic et al. (2020)* ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Slijepcevic et al. (2017) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Slijepcevic et al. (2020) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Slijepcevic & Horst et al. (2022)* ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Slijepcevic et al. (2023) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Table 1.2: Correspondence (indicated with ✓) between the publications comprising the present thesis and the personal contributions based on the contributor roles taxonomy (CRediT) [1]. The asterisk (*) indicates publications with shared first authorship.

The following subsections contain a brief summary of each publication included in the present thesis. For more detailed information, please refer to the corresponding publication in Chapter 2.

1.3.1 GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait

Brian Horsak, Djordje Slijepcevic, Anna-Maria Raberger, Caterine Schwab, Marianne Worisch, and Matthias Zeppelzauer. GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait. Scientific Data, 7(1):143, 2020. DOI: 10.1038/s41597-020-0481-z

The GaitRec dataset is derived from a clinical gait database maintained by an Austrian rehabilitation center. The dataset contains anonymized GRF measurements from 2,085 patients with various musculoskeletal impairments and data from 211 healthy controls, along with accompanying metadata such as age, sex, footwear, and walking speed. The dataset covers the entire rehabilitation progress of a patient during the patient’s stay. The labels provided in the dataset indicate the anatomical joint level of orthopedic impairment, i.e., hip, knee, ankle, and calcaneus. During data collection, patients and healthy controls were asked to walk unassisted at a self-selected walking speed on a walkway equipped with two centrally embedded force plates that recorded bilateral GRF data. The dataset contains multiple left and right foot contacts of one person from one session. In addition to the unprocessed GRF data, the dataset also contains preprocessed data that are ready for immediate use. We developed and published a preprocessing pipeline (including data filtering and thresholding) with the aim of standardizing GRF datasets from various gait laboratories. This pipeline facilitates the consolidation of GRF data from diverse laboratories, allowing for their collaborative utilization.

1.3.2 Automatic Classification of Functional Gait Disorders

Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, and Brian Horsak. Automatic Classification of Functional Gait Disorders. IEEE Journal of Biomedical and Health Informatics, 22(5):1653–1661, 2017.
DOI: 10.1109/JBHI.2017.2785682

This publication presents a comprehensive investigation on the UC: functional gait disorders (i.e., automated classification of functional gait disorders using GRF data). The main objective of the study was to assess the effectiveness of i) handcrafted domain-specific GRF parameters and ii) PCA-based representations of GRF data for distinguishing functional gait disorders, as well as to establish a performance baseline for the automated classification of functional gait disorders using a large-scale dataset. This study was one of the first to examine such a comprehensive dataset of domain-specific gait features for automated gait classification.

We utilized a subset of the GaitRec dataset that included measurements from 279 patients with gait disorders and data from 161 healthy controls, resulting in a total of 9,496 gait measurements. The study included two classification experiments: i) a binary task that distinguishes between healthy and impaired gait (healthy controls vs. gait disorder patients) and ii) a multi-class task that distinguishes between healthy gait and all four gait disorder classes. Various data parameterization methods were examined, including domain-specific handcrafted GRF parameters, PCA-based representations, and a combined representation using PCA on these handcrafted GRF parameters. Boxplots were generated for each parameter and each class, allowing for an initial assessment of both intra- and inter-class variation. These boxplots offered valuable insights into the potential of the parameters to distinguish between the different classes. A more comprehensive assessment of the discriminative power of each parameterization was conducted using linear discriminant analysis.
The results of the experiments showed promising outcomes, but also highlighted the impact of factors such as data imbalance (i.e., differences in class sizes and varying numbers of measurements per patient) and feature scaling (i.e., min-max normalization and z-standardization) on the classification performance. The overall results showed that: i) for the multi-class classification task, the accuracy was 54.3%, and for the binary classification task, it was 90.8%; ii) when considering balanced data with an equal number of persons and sessions, the accuracies were 59.2% for multi-class and 85.4% for binary classification (it should be noted that, relative to the random baseline, the accuracy was significantly higher in this case than in the unbalanced setting); iii) the support vector machine with a linear kernel outperformed the one with a radial basis function kernel in terms of classification performance; and iv) the application of PCA-based parameterization of the raw GRF data yielded better results compared to using handcrafted domain-specific GRF features, with a difference of 7.5% in classification accuracy.

1.3.3 Input Representations and Classification Strategies for Automated Human Gait Analysis

Djordje Slijepcevic, Matthias Zeppelzauer, Caterine Schwab, Anna-Maria Raberger, Christian Breiteneder, and Brian Horsak. Input Representations and Classification Strategies for Automated Human Gait Analysis. Gait & Posture, 76:198–203, 2020. DOI: 10.1016/j.gaitpost.2019.10.021

In this study, we compared two data aggregation methods, i.e., early fusion and late fusion, within the UC: functional gait disorders. In the gait classification literature, prior approaches employed either an early fusion method, which involved averaging multiple recorded trials of a subject into a single waveform, or a classification approach without data aggregation, in which all available trials were used to train the ML models.
In addition to these two methods, we further investigated an early fusion approach which determined the most representative trial based on a statistical method. Additionally, we explored a late fusion approach where the model was trained on all trials, but during inference, a majority voting scheme was used to combine the decisions from individual trials. Subsequently, we explored the optimal input representations and combinations thereof for automated gait classification. This involved various options, such as using raw gait waveforms, relative changes within these waveforms, or signal differences between the affected and unaffected side.

We utilized a subset of the GaitRec dataset, which included measurements from 728 patients with gait disorders and data from 182 healthy controls. The dataset was balanced in terms of the number of persons per class, recorded sessions per person, and trials per person. The multi-class classification task focused on distinguishing between healthy controls and each gait disorder class associated with the hip, knee, ankle, and calcaneus. In line with the results from Slijepcevic et al. (2017) [40], we employed an ML pipeline that involved PCA, z-standardization, and support vector machines as the classifier.

The results demonstrated the advantage of aggregating multiple trials from a single subject, especially when using late fusion or the mean waveform approach. In addition, the results suggested that the inclusion of both the original signals and their derived representations increased the informativeness of the data in feature extraction and classification. Even when certain input signals or representations contain redundancies, the combination of these signals, such as the GRF and center of pressure components with derived representations, improved classification performance in this study.
Thus, the main finding from these experiments is that using a larger number of input signals and representations, even when redundancies exist, can lead to better results. This observation is especially true when combining GRF and center of pressure data and using derivatives from both the affected and unaffected sides. In addition, the inclusion of both the affected and unaffected side, whether explicitly or implicitly, seems to be beneficial.

1.3.4 Explaining Machine Learning Models for Clinical Gait Analysis

Djordje Slijepcevic, Fabian Horst, Sebastian Lapuschkin, Brian Horsak, Anna-Maria Raberger, Andreas Kranzl, Wojciech Samek, Christian Breiteneder, Wolfgang Immanuel Schöllhorn, and Matthias Zeppelzauer. Explaining Machine Learning Models for Clinical Gait Analysis. ACM Transactions on Computing for Healthcare (HEALTH), 3(2):1–27, 2021. DOI: 10.1145/3474121

This publication investigated explainability methods to enhance transparency in automated gait classification within the UC: functional gait disorders. The main goal was to investigate and explain the class-specific characteristics learned by ML models from these data. To this end, various classification models, i.e., convolutional neural networks, multi-layer perceptrons, and support vector machines, were trained for different gait classification tasks, and prediction explanations were derived using a popular explainability method for the image domain, i.e., layer-wise relevance propagation (LRP). In addition, we also proposed two types of model explanations using the individual prediction explanations: The first approach involved averaging relevance scores across all samples within a specific class. However, to conduct a more comprehensive analysis capable of identifying different learning strategies employed by the ML models, we adapted spectral relevance analysis (SpRAy) [44] for GRF data.
This approach clustered the relevance scores obtained from various samples and classes and allowed a detailed examination of the resulting clusters and subclusters. The evaluation of the obtained explanations followed a two-step approach. First, a statistical analysis was conducted using statistical parametric mapping (SPM) [39] to assess whether relevant input features also exhibit statistical differences between the classes. Second, two clinical experts interpreted the explainability results from a clinical perspective to assess whether the explanations align with clinical practice. Additionally, the investigation explored various aspects that could influence classification performance and explainability. These aspects included the impact of different classification methods, feature scaling techniques, and the role of various input signal components (i.e., horizontal forces and measurements of the affected and unaffected side). The study utilized a subset of the GaitRec dataset, comprising GRF measurements during barefoot walking from 132 patients with lower-body gait disorders and data from 62 healthy controls with varying physical composition and gender. The dataset comprised three classes of orthopedic gait disorders related to the hip, knee, and ankle, in addition to a class representing healthy controls. The results emphasize that ML models used in various clinical gait classification tasks base their predictions mostly on meaningful features from GRF data. These features have been validated through statistical and clinical evaluation. Within the scope of the analysis, several significant observations were made. First, highly relevant regions were identified in both the affected and unaffected sides, suggesting that the unaffected side contains complementary information that is relevant for the classification. Second, statistical parametric mapping proved to be a suitable statistical reference for the explainability results.
Regions identified as highly relevant by the explainability method were generally found to be significantly different according to statistical parametric mapping and aligned with clinical evaluation. Furthermore, our results showed that not only the vertical GRF component but also the other force components exhibit highly relevant regions. This observation is consistent with the existing literature on clinical gait analysis. The results suggest that ML models tend to learn an over-complete set of features that may contain redundant information. This finding potentially explains why certain changes, such as occluding certain force components and using different input normalization methods, had negligible influence on the classification performance. Furthermore, ML models for gait classification exhibited the capability to learn different strategies for individual persons and patient groups, reflecting the capability to adapt to different patterns in the data. Finally, the implementation of the proposed explainability approaches allowed clinical experts to identify a bias related to the walking speed in ML models and accurately assess their functionality. This aspect is crucial for clinicians, as it is the only way to strengthen their trust in the predictions generated by these models.

1.3.5 Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy

Djordje Slijepcevic, Matthias Zeppelzauer, Fabian Unglaube, Andreas Kranzl, Christian Breiteneder, and Brian Horsak. Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy. IEEE Access, 11:65906–65923, 2023. DOI: 10.1109/ACCESS.2023.3289986

The main objective of this publication was to explore the effectiveness of various ML methods for the UC: cerebral palsy. Similar to the work presented in Slijepcevic et al.
(2022) [42], this research also aimed to develop explainability approaches to assess the clinical relevance of the features learned by these models. In our study, we conducted a comparison between various traditional ML methods, such as random forests, decision trees, and gradient boosting classifiers, and deep learning methods, including convolutional neural networks and self-normalizing neural networks. For decision trees and random forests, Gini impurity-based feature importance served as the basis for the model explanation. For the deep neural networks, individual prediction explanations, determined via the gradient-weighted class activation mapping (Grad-CAM) [38] method, were aggregated at different levels to provide insights at the decision, class, and model levels. The study investigated the discriminative power of two different data modalities recorded during 3DGA, i.e., joint angle and GRF data, for classifying gait patterns associated with cerebral palsy. We conducted experiments using a 3DGA dataset comprising 302 patients with cerebral palsy exhibiting four distinct gait patterns associated with this condition. The results indicate that joint angle data (peak performance of 93.4%) significantly outperform GRF data (peak performance of 47.2%) for this classification task. Moreover, traditional ML approaches like random forests and decision trees achieved better results and focused on clinically relevant regions more effectively than deep neural networks. The best configuration, utilizing sagittal knee and ankle angles with a random forest, achieved a classification accuracy of 93.4%. Deep neural networks employed not only clinically relevant features but also additional features for their predictions. These additional features could offer novel insights into the data and raise new research questions. Overall, this publication highlights the significance of explainability in fostering understanding of ML models for clinical practice.

1.4 Methodology

This section outlines the methodology that was utilized to address the research questions and achieve the goals of this thesis. A more detailed specification of the employed data, approaches, and methods can be found in the publications presented in Chapter 2. For each RQ, a systematic approach is followed, beginning with the experimental design, followed by the selection of a suitable dataset or subset, the implementation of the ML pipelines, and finally an evaluation of the results. Almost all RQs were evaluated quantitatively using standard performance metrics (e.g., classification accuracy, precision, recall, and F1 score) of the ML methods, either through a dedicated train/validation/test split or a k-fold cross-validation approach. In all four publications, when evaluating classification accuracy, a comparison was made with the zero-rule baseline (i.e., the theoretical accuracy obtained by always assigning the most frequent class in the dataset). In the case of Slijepcevic et al. (2020) [41], the zero-rule baseline equals the random baseline due to the perfectly balanced nature of the utilized dataset. RQs associated with explainability were evaluated qualitatively in a series of focus group interviews with the clinical experts. In addition to the qualitative assessment, this thesis proposes an additional assessment of the explainability results based on statistical analysis of the underlying data.

1.4.1 Clinical Use Cases and Datasets

To develop methods which are capable of learning higher-level and non-linear features, certain prerequisites regarding the quality and size of the utilized datasets have to be fulfilled. Primarily, the dataset should contain informative gait data and clinically relevant annotations determined by clinical experts. Additionally, the dataset has to exhibit substantial variability in the metadata that is specific to the respective population.
These metadata include anthropometric properties, such as body weight and height, as well as other factors that influence gait, such as age, sex, and walking speed. Furthermore, a subset of clinically relevant pathological classes has to be identified for the purpose of automated analysis and classification. The literature demonstrates a broad range of use cases, ranging from orthopedic issues related to post-joint replacement surgery, ligament ruptures, and osteoarthritis, to complex diseases that cause neuromuscular mobility impairments in the lower extremities, including, e.g., cerebral palsy, Parkinson’s disease, and Alzheimer’s disease. The dataset should contain data from several hundred people, to allow for the modeling of inter-individual variability within a specific pathological group. Generally, clinical gait analysis involves the recording of several trials (i.e., steps) in order to account for intra-individual step variability. Thus, a representative sample size per subject has to be ensured [45]. The imbalance of a dataset represents an additional challenge. This imbalance can naturally occur when the number of patients with different pathologies varies significantly. In order to also model healthy gait, a comprehensive dataset should comprise not only data from pathological gait, but also data from healthy controls.

This thesis investigates two clinical use cases employing two comprehensive real-world datasets: i) a dataset comprising GRF data from patients with different functional deficits associated with a patient’s condition after joint replacement surgery, fractures, ligament ruptures, and osteoarthritis, and ii) a dataset containing joint angle data and GRF data obtained via 3DGA from patients with cerebral palsy.
For the UC: functional gait disorders, different subsets of the GaitRec dataset [46] served as the basis for addressing and exploring research questions RQ1.1, RQ2.1, RQ2.3, RQ3.1, RQ3.2, RQ3.3, RQ4.1, RQ5.1, and RQ5.2. This dataset is one of the largest publicly available collections of clinical gait data and it was published within the scope of this thesis. The GaitRec dataset comprises anonymized GRF and center of pressure data from an existing clinical gait database maintained by a rehabilitation center of the Austrian Workers’ Compensation Board (Allgemeine Unfallversicherungsanstalt, AUVA). Kinematic data was not recorded during the gait analyses. The entire dataset comprises GRF measurements from 2,085 patients with gait disorders and data from 211 healthy controls, both of various physical composition and sex. Data were manually classified into four classes – hip, knee, ankle, and calcaneus – by a physical therapist, based on the available medical diagnosis of each patient. The individual pathological gait patterns are related to joint replacement surgery, fractures, ligament ruptures, and related disorders associated with the hip, knee, ankle, or calcaneus. Participants walked unassisted at a self-selected walking speed on an approximately 10 m long walkway with two force plates. In healthy controls, measurements were also conducted at different walking speeds, which differ from the habitual speed. Each participant performed one or several measurement sessions. In each session, at least eight valid recordings for two consecutive steps were performed, leading to a total of 75,732 bilateral individual measurements for the entire dataset. The preprocessed GRF data, which includes the vertical, anterior-posterior, and medio-lateral force components, along with the center of pressure data, were normalized to 100% stance phase and to the body weight. 
The UC: cerebral palsy focuses on a dataset comprising anonymized 3DGA data from an existing database created and maintained by the Laboratory of Gait and Human Movement of the Orthopaedic Hospital Vienna-Speising (Austria). The dataset includes 3DGA measurements, which consist of simultaneously recorded kinematic and GRF data. The dataset comprises anonymized data from 302 patients with cerebral palsy. Furthermore, the dataset includes anthropometric data, along with annotations of four pathological gait patterns associated with cerebral palsy: true equinus, jump gait, apparent equinus, and crouch gait. This dataset served as the basis for addressing and exploring research questions RQ2.1, RQ2.2, RQ4.1, and RQ5.1. The 3DGA was conducted on a 12 m walkway using a motion capture system consisting of a minimum of 14 infrared cameras and three force plates. Patients walked without a walking aid and at a self-selected walking speed until a minimum of five valid recordings had been collected. Kinematic data in terms of joint angles were computed using the raw marker trajectories. Additionally, the kinematic data were time-normalized to 100% of the corresponding gait cycle (or stance phase in the case of GRFs). Time normalization of the gait data to 100% of the gait cycle or stance phase ensures that the input data has a uniform length and therefore no additional padding is required. To obtain more robust data, average curves were computed for each joint angle (for the pelvis, hip, knee, and ankle) and GRF component by aggregating data from all gait cycles within one recording session.

1.4.2 Representation Learning and Classification Methods

The initial classification baseline was established on the UC: functional gait disorders by applying traditional ML methods, such as k-nearest neighbor (k-NN) classifier, multi-layer perceptron, and support vector machine.
These methods were trained on handcrafted domain-specific features derived from raw gait signals (e.g., local minima and maxima of the waveforms, as well as spatio-temporal gait parameters such as cadence, walking speed, and step length), which clinicians commonly use in clinical practice. For this purpose, Slijepcevic et al. (2017) [40] examined a comprehensive set of handcrafted domain-specific features. In the automated analysis of gait data, another state-of-the-art approach employs PCA as a feature extraction method on the raw data and combines it with traditional ML methods [16]. Despite providing only a linear feature representation, PCA has shown great suitability for biomechanical gait data, resulting in higher classification performances in the literature compared to handcrafted features [16]. Slijepcevic et al. (2017) [40] assessed the suitability of using PCA and kernel PCA (with a polynomial kernel) [47] as a feature extraction method for the UC: functional gait disorders. In line with RQ3.1, the suitability of various feature extraction techniques and the resulting data parameterizations for gait analysis data is investigated in Slijepcevic et al. (2017) [40]. The ML methods most frequently used in the literature are support vector machines with different kernel functions [48, 49, 50, 30, 51, 28]. This served as motivation for employing support vector machines as either the main classifier [40, 41, 42] or as a baseline method [43] in the publications of this thesis. In addition, this thesis applied the following traditional ML methods to handcrafted gait features, PCA-based representations, or the raw gait data: k-nearest neighbor [40], multi-layer perceptron [40, 42], random forest [43], decision tree [43], and gradient boosting classifier [43]. In order to compensate for the limitations of existing representations for clinical gait data, higher-level and non-linear feature representations are investigated within the scope of this thesis.
Deep learning methods inherently learn feature representations directly from the input data (i.e., feature learning) and do not demand specific feature engineering. Therefore, no domain-specific knowledge is needed to derive specific input features. The architecture of deep neural networks incorporates multiple stacked layers and enables the learning of higher-level hierarchically related features that are employed by the top-most (classification) layers to tackle complex tasks. Recently, various deep learning approaches have been employed for the analysis of human gait data [52, 17]. Matsushita et al. [17] identified convolutional neural networks, recurrent neural networks, and auto-encoders as the most commonly employed deep learning methods within the existing literature for gait analysis. To this end, the present thesis investigates the suitability of convolutional neural networks [42, 43] and self-normalizing neural networks [43] for analyzing GRF and kinematic data. Additionally, within the scope of this thesis, bi-directional long short-term memory (LSTM) networks have been explored. However, despite being designed to capture temporal dependencies in time series data, this recurrent network architecture yielded poorer results when compared to other methods. The increasing use of deep learning has raised questions concerning the suitability and efficiency of deep learning versus traditional ML methods for gait analysis (see RQ4.1). The comparison of the classification performance between traditional ML and deep learning methods in Slijepcevic et al. (2023) [43] addresses this question.

1.4.3 Explainability Approaches

The lack of transparency in complex ML models has led to significant progress in the development of explainability methods.
These methods are specifically designed to provide explanations for automated predictions and model behavior, aiding clinical experts in understanding the patterns and rules behind specific predictions. The present thesis involved the development and investigation of explainability approaches to tackle the research questions RQ4.1 and RQ5.1. Explainability methods can be classified based on the type of explanation they offer. According to the taxonomy proposed by Arya et al. [53], these approaches can be categorized into three coarse types: i) data exploration approaches, ii) decision explanations (also known as local model explanations), and iii) global model explanations. These different types of explanations complement each other. Data exploration approaches do not provide explanations for an ML model, but instead focus on the data that were used to train the model. These approaches aim to visualize and transform the data, enabling domain experts to uncover significant structures and patterns within the data with the final goal of generating novel insights from the data. In the context of this thesis, various data exploration methods were employed, with a focus on visualizing the various distributions in the data (Figure 1.2: static → data → distributions). Slijepcevic et al. (2017) [40] employed boxplots to evaluate each manually crafted gait parameter. The examination of boxplots for each parameter and class allowed an assessment of both intra-class and inter-class variability, providing insights into the parameters’ ability to distinguish between different classes. Subsequently, linear discriminant analysis was applied to the individual parameters and their combinations, as well as to the higher-dimensional PCA-based representations. The purpose of this analysis was to quantify the discriminative power of the studied representations and assess their suitability for the classification task.
Furthermore, we employed one-dimensional statistical parametric mapping [39], a method that allows for the statistical analysis of time series data, to identify statistically significant differences in clinical gait data among various patient groups [42].

Figure 1.2: A taxonomy introduced by Arya et al. [53] classifies explanations based on the following criteria: what is being explained (e.g., data or the model), the way in which the explanation is determined/provided (e.g., direct or post-hoc explanations; static or interactive explanations), and the level of explanation (whether it is local or global). The color of the leaves indicates whether the explainability approach has been implemented and evaluated within the context of this thesis. Blue leaves indicate the explainability approaches that have been implemented for clinical gait data. Adopted from [53].

Decision explanation methods explain the local behavior of ML models. Thus, such explanations can reveal the contributing regions of the input data responsible for the prediction of a particular data sample. Most decision explanation methods are post-hoc approaches that provide a certain flexibility as they can be directly applied to already trained ML models [53]. These methods typically produce saliency maps, which highlight the input features that are most relevant for a specific prediction [38].
When applied to gait data, these methods have the ability to detect distinctive regions in the input data that the ML model associates with a particular gait disorder [42]. Within the scope of this thesis, two decision explanation methods were applied, with a specific emphasis on post-hoc explanations of the features utilized by ML models (Figure 1.2: static → model → local → post-hoc → features). First, we implemented layer-wise relevance propagation (LRP) [37] for clinical gait data [42]. This method propagates relevance scores from the output layer to the input layer throughout the entire network. The final relevance scores at the input layer can be mapped back to the original signals, thereby highlighting the input features that contributed to the prediction. For the final publication [43], we implemented gradient-weighted class activation mapping (Grad-CAM) [38] for clinical gait data. Grad-CAM is a method that provides explanations based on abstract features learned in the last convolutional layer. Instead of propagating gradients (or relevance scores) all the way back to the input space, Grad-CAM propagates the gradients with respect to the class to be explained back only to the last convolutional layer of a convolutional neural network. Subsequently, the activation map of the last convolutional layer is weighted with these gradients and then averaged over all channels of the layer. This results in an activation map that captures more abstract patterns used for the prediction. For the final decision explanation, the activation pattern is upscaled and mapped to the input signal. We recently developed gaitXplorer [54], a visual analytics approach for classifying gait patterns associated with cerebral palsy, which utilizes Grad-CAM to provide explanations for the predictions made by convolutional neural networks.
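The Grad-CAM steps described above can be sketched for one-dimensional signals as follows. This is a minimal illustration of the commonly used Grad-CAM formulation (channel weights obtained by averaging the gradients over time, followed by a ReLU), not the implementation used in [43]; the function name and the toy values are hypothetical, and the activations and gradients are assumed to have been extracted from a trained network beforehand.

```python
def grad_cam_1d(activations, gradients, input_length):
    """Grad-CAM relevance curve for a 1D convolutional layer.

    activations, gradients: nested lists of shape [channels][time_steps] with
    the feature maps of the last convolutional layer and the gradients of the
    class score with respect to these feature maps.
    """
    if input_length < 2:
        raise ValueError("input_length must be at least 2")
    n_channels = len(activations)
    n_steps = len(activations[0])

    # One weight per channel: the gradient averaged over the time axis.
    weights = [sum(channel) / n_steps for channel in gradients]

    # Weight the feature maps, average over channels, keep positive evidence.
    cam = []
    for t in range(n_steps):
        value = sum(weights[c] * activations[c][t] for c in range(n_channels))
        cam.append(max(value / n_channels, 0.0))

    # Upscale the coarse map to the input resolution (linear interpolation).
    if n_steps == 1:
        return [cam[0]] * input_length
    upscaled = []
    for i in range(input_length):
        pos = i * (n_steps - 1) / (input_length - 1)
        lo = int(pos)
        hi = min(lo + 1, n_steps - 1)
        frac = pos - lo
        upscaled.append((1.0 - frac) * cam[lo] + frac * cam[hi])
    return upscaled


# Hypothetical toy input: two channels, three time steps, five input samples.
toy_activations = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
toy_gradients = [[1.0, 1.0, 1.0], [-2.0, -2.0, -2.0]]
toy_cam = grad_cam_1d(toy_activations, toy_gradients, input_length=5)
```

Because the map is computed at the resolution of the last convolutional layer, the upscaled curve is smoother and coarser than an input-level relevance map such as the one produced by LRP.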
Both of the aforementioned methods are regarded as propagation-based approaches because they identify the impact of input features on the model’s prediction by (partially) back-propagating either the gradient or relevance scores from the output to the input of the model. In this thesis, the focus has been on propagation-based methods instead of perturbation-based methods, mainly due to the computational efficiency of the former and the well-documented issues with reliability and consistency of the latter [55]. Additionally, perturbation-based methods are highly dependent on the choice of hyperparameters, such as the number of perturbations. Model explanation methods aim to explain which learning strategies and patterns a trained ML model has learned at a global level. Model explanations enable the assessment of whether an ML model has been trained correctly and whether the modeled classes rely on meaningful patterns. As a result, model explanations facilitate the identification of ambiguous features and biases that the model has learned, while also enabling the detection of overlaps in learning strategies between different classes. In the context of this thesis, multiple model explanation approaches have been developed that rely on aggregating individual decision explanations (Figure 1.2: static → model → global → post-hoc → visualize). In Slijepcevic et al. (2022) [42], we averaged the individual decision explanations for each class, which allowed us to derive common patterns that ML models use to predict a specific class. Building upon this approach, we developed an explanation based on the median, which proved to be more robust for Grad-CAM explanations than the mean [43]. Additionally, we introduced a visualization of individual decision explanations to allow visual evaluation of the distribution rather than relying only on the median/mean relevances.
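The class-wise aggregation of decision explanations can be sketched as follows. The function name and the relevance values are hypothetical and serve only to show why the median is less sensitive to a single atypical explanation than the mean.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_explanations(relevance_curves, labels, reducer=median):
    """Reduce per-sample relevance curves to one curve per class.

    relevance_curves: list of equally long lists (one curve per sample).
    labels: class label of each sample.
    reducer: function applied point-wise across samples (median or mean).
    """
    by_class = defaultdict(list)
    for curve, label in zip(relevance_curves, labels):
        by_class[label].append(curve)
    return {
        label: [reducer(values) for values in zip(*curves)]
        for label, curves in by_class.items()
    }


# Hypothetical relevance curves; the third one contains an outlier.
curves = [[0.1, 0.9], [0.2, 0.8], [5.0, 0.7]]
labels = ["crouch gait", "crouch gait", "crouch gait"]
median_curve = aggregate_explanations(curves, labels, reducer=median)["crouch gait"]
mean_curve = aggregate_explanations(curves, labels, reducer=mean)["crouch gait"]
```

In this toy case the outlier pulls the first point of the mean curve far above the typical relevance, while the median curve stays at the value shared by most samples.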
Furthermore, we adopted a model explanation approach based on SpRAy [44], which clusters individual decision explanations, enabling the identification of learning strategies for subgroups in the data utilized by the ML models [42]. The aforementioned approaches have been explored to explain sex- and age-dependent gait patterns utilized by ML models [56, 57]. For inherently explainable models like decision trees, we employed feature importance based on Gini impurity (Figure 1.2: static → model → global → direct) [43]. For more complex tree-based models like random forests, we adopted a similar approach. In the area of clinical gait analysis, explainability methods have rarely been used to unveil the inner workings of black-box models and facilitate their application in clinical settings. Our efforts in this domain have played an important role in introducing and promoting explainability approaches specifically tailored for clinical gait analysis.

1.5 Results and Discussion

The five goals and the corresponding research questions of the thesis, outlined in Section 1.2, are utilized to present and discuss the obtained results.

1.5.1 Goal 1 – Creation of High-Quality Dataset

RQ1.1: Which steps should a preprocessing pipeline for GRF data include to facilitate the collaborative use of data gathered from different gait laboratories?

Despite the existence of publicly available gait datasets, access to fully annotated, comprehensive datasets with patient data remains quite limited. The survey conducted by Matsushita et al. [17] indicates that many of the publicly available gait datasets involve only a limited number of subjects, typically in the range of a few dozen. Exceptions to this trend are the dataset provided by Hausdorff via PhysioNet [58], which includes insole force data from 93 patients with Parkinson’s disease and 73 healthy controls, as well as the dataset by Ferrari et al.
[59], which contains kinematic data (i.e., marker trajectories) from 178 patients with cerebral palsy. The GaitRec dataset [46] is currently among the largest publicly accessible gait datasets, containing GRF and center of pressure data from 211 healthy controls and 2,085 patients with various musculoskeletal impairments. The dataset exhibits a remarkable degree of diversity due to several factors, including the number of subjects, multiple sessions, and trials within each session. This diversity also extends to different orthopedic conditions as well as the heterogeneous conditions under which the data were collected. For instance, each healthy control subject performed walking trials at three different walking speeds (i.e., slow, self-selected, and fast), both with and without footwear, and patients also walked under different conditions, such as barefoot, with orthopedic or normal shoes, and with or without orthopedic insoles. In combination with the dataset, we introduced a universal preprocessing pipeline suitable for GRF data from different gait laboratories. Clinical experts were consulted to validate this preprocessing pipeline, which was designed to address RQ1.1. This pipeline includes several steps: i) ensuring uniform orientation of the medio-lateral and anterior-posterior signals (independent of the walking direction in the gait laboratory); ii) applying a threshold of 25 N to remove noise at the signal edges; iii) noise reduction using a second-order low-pass Butterworth filter with a cutoff frequency of 20 Hz; iv) time normalization to 100% stance; and v) normalization based on body weight. For the preprocessing of center of pressure signals, we applied a threshold of 80 N with respect to the vertical GRF component, aiming to reduce inaccuracies in the calculation of the center of pressure at low force values. Additionally, the medio-lateral and anterior-posterior center of pressure components were mean-centered and zero-centered, respectively.
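Steps ii) to v) of the GRF pipeline can be sketched for the vertical force component as follows. This is a simplified illustration under several assumptions that are not specified above: the sampling rate `fs_hz` is a free parameter, the filter is applied in a single forward pass, the time-normalized curve has 101 points, and the body weight is given in newtons. The orientation step i) and the center of pressure processing are omitted, and all function names are hypothetical.

```python
import math


def trim_to_stance(vertical_force, threshold_n=25.0):
    """Keep the samples between the first and last value at or above the threshold."""
    above = [i for i, f in enumerate(vertical_force) if f >= threshold_n]
    if not above:
        raise ValueError("signal never reaches the threshold")
    return vertical_force[above[0]:above[-1] + 1]


def butterworth_lowpass(signal, cutoff_hz, fs_hz):
    """Second-order low-pass Butterworth filter (bilinear transform, forward pass)."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b1, b2 = 2.0 * b0, b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - math.sqrt(2.0) * k + k * k) * norm
    # Remove the initial offset before filtering to keep the start-up
    # transient small; the filter has unit gain for constant signals.
    offset = signal[0]
    x1 = x2 = y1 = y2 = 0.0
    filtered = []
    for sample in signal:
        x0 = sample - offset
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        filtered.append(y0 + offset)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return filtered


def time_normalize(signal, n_points=101):
    """Resample to a fixed number of points by linear interpolation."""
    last = len(signal) - 1
    if last < 1:
        raise ValueError("signal must contain at least two samples")
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return resampled


def preprocess_vertical_grf(raw_force_n, fs_hz, body_weight_n):
    """Apply thresholding, filtering, time and body-weight normalization in turn."""
    stance = trim_to_stance(raw_force_n, threshold_n=25.0)
    smoothed = butterworth_lowpass(stance, cutoff_hz=20.0, fs_hz=fs_hz)
    normalized = time_normalize(smoothed, n_points=101)
    return [value / body_weight_n for value in normalized]
```

After these steps every trial has the same length and is expressed as a multiple of body weight, which is what makes recordings from different persons and laboratories directly comparable.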
Furthermore, to ensure a high level of data quality, we applied an outlier detection algorithm proposed by Sangeux and Polak [60] to the data of a single session per individual. In the course of this thesis, the same pipeline was also employed to preprocess and publish the Gutenberg Gait Database [61]. This dataset is one of the largest publicly available GRF datasets containing data from healthy controls. Additionally, the same pipeline was applied to the publicly available AIST Gait Database [62]. The combination of these three datasets opened up unprecedented data dimensions and led us to the exploration of research questions related to the uniqueness of gait data in person re-identification [63] and the identification of sex-related [64] and age-related [65] walking patterns. Furthermore, several research papers proposed ML approaches based on the GaitRec dataset [66, 67, 68, 69, 70]. Additionally, the dataset has been employed for transfer learning for an auditory feedback system based on GRF data [71, 72], and in the context of biomechanical analysis [73]. This demonstrates that the proposed standardization of GRF data from different gait laboratories already offers the possibility to investigate innovative aspects in the field of gait analysis.

1.5.2 Goal 2 – Evaluation of Discriminative Power of 3DGA Modalities

RQ2.1: What level of classification performance can be achieved using only GRF data for automated gait classification?

The results indicate that bilateral GRF data (especially when combining all three force components) can be effectively utilized for a binary task of distinguishing physiological gait from pathological gait. This was demonstrated using various subsets of the GaitRec dataset.
In the task of distinguishing the healthy control class from a combined gait disorder class (i.e., encompassing all pathological patterns), we achieved peak classification accuracies of 89.5% (90.8% in combination with center of pressure data) [40] and 88.8%[42] using GRF data. Furthermore, in the tasks of distinguishing the healthy control class from individual pathological classes (e.g., healthy control class versus hip class), we also achieved high accuracies ranging from 86.5% to 88.8% [42]. In multi-class tasks, GRF data did not yield the desired results in either of the use cases. For the UC: functional gait disorders based on the GaitRec data, accuracies of 51.6% (54.3% in combination with center of pressure data) [40], and a maximum of 60% (62.0% in combination with center of pressure data) for a balanced setting [41] were achieved in the task with five classes. When the calcaneus class was removed and only the barefoot condition was selected, comparable results were achieved with an accuracy of 59.5% [42]. It is noteworthy that lower classification results with a peak accuracy of 51.8% were obtained when attempting to classify the hip, knee, and ankle classes [42]. This observation implies that GRF data may have limited capability in capturing distinct patterns among pathological classes in this specific context. This observation was also confirmed in experiments conducted within the UC: cerebral palsy. In this multi-class task consisting of four classes, a peak performance of 47.2% was achieved [43]. With respect to RQ2.1, it can be concluded that GRF data have the potential to classify physiological and pathological gait patterns (as a binary classification task). However, GRF data lack sufficient discriminative power for multi-class classification tasks. 28 1.5. 
RQ2.2: What is the advantage in classification performance when using kinematic data compared to GRF data for automated gait classification, and is there an improved classification performance when using both inputs together as opposed to using them separately?

The UC: cerebral palsy revealed that kinematic data are significantly more discriminative than GRF data. In the multi-class task involving four pathological gait patterns, the kinematic data achieved a peak performance of 93.4%, a substantial difference of 46.2 percentage points compared to GRF data [43]. From a clinical perspective, this outcome may not be surprising, as kinematic data alone often contain sufficient information for the analysis and diagnosis. However, it is important to assess the discriminative power to identify potential use cases where utilizing only GRF data might be sufficient. Furthermore, the classification results showed that combining kinematic and GRF data does not yield any advantage [43]. Moreover, for almost all classification methods, adding GRF data resulted in a slight decrease in performance. This observation suggests that GRF data do not offer complementary information compared to the kinematic data for the task at hand. However, it is important to consider the potential benefits of including both types of data for a more comprehensive analysis, even though the combination did not yield benefits in this use case. By integrating kinematic and GRF data, ML models could gain a more holistic picture of gait patterns.

RQ2.3: To what degree do the signals from the affected and unaffected sides differ in terms of their discriminative power for automated gait classification?

In relation to RQ2.3, several studies within the UC: functional gait disorders revealed that leveraging data from both the affected and unaffected side provides a slight advantage [74, 41].
Including both the affected and unaffected sides, either explicitly or implicitly through calculating the sample-wise difference between them, proves beneficial for certain input scenarios. The explainability results presented in Slijepcevic et al. (2022) [42] provide further support for this finding. In all classification tasks, relevant regions are evident not only in the GRF data of the affected side but also in the unaffected side, although to a slightly lesser degree. This observation suggests that the unaffected side contains complementary information for the classification task. Furthermore, it is essential to note that using only the GRF data from the unaffected side resulted in significantly poorer classification results compared to utilizing only the GRF data from the affected side [74, 41]. This observation contradicts the findings of Williams et al. [75], who obtained higher classification performance for the less affected side in classifying six pathological gait patterns associated with traumatic brain injury.

1.5.3 Goal 3 – Evaluation of Data Handling Strategies

RQ3.1: To what extent do different feature scaling and feature extraction techniques impact the performance of automated gait classification?

With respect to RQ3.1, we explored various parameterizations for clinical gait data, including handcrafted domain-specific parameters, PCA-based representations of raw gait data, and a combined representation using PCA on GRF parameters [40]. The first parameterization involved 52 handcrafted parameters extracted from the GRF and center of pressure data. To address the significant variation in parameter value ranges, feature scaling was crucial. We evaluated both min-max normalization and z-standardization, with z-standardization showing slightly better results. The second parameterization relied on PCA of raw GRF data.
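To make the two scaling schemes evaluated for the handcrafted parameters concrete, the following minimal Python sketch applies min-max normalization and z-standardization to a single parameter column. The function names and the example values are hypothetical and are not taken from the experiments in [40].

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical parameters with very different value ranges (assumed values).
peak_vertical_force = [1.05, 1.12, 0.98, 1.20]  # e.g., multiples of body weight
stance_time_ms = [610.0, 655.0, 700.0, 580.0]

for name, column in [("force", peak_vertical_force), ("stance", stance_time_ms)]:
    print(name, min_max_normalize(column), z_standardize(column))
```

After either transformation, both parameters occupy comparable value ranges. In a classification setting, the scaling statistics would typically be estimated on the training data only and then applied unchanged to the test data.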
PCA representations obtained from only the three force components performed better than the handcrafted GRF parameters. Incorporating PCA representations of the center of pressure further improved results for both tasks. Normalization of PCA-based representations proved to be vital, as performance significantly dropped without it. The third parameterization applied PCA on the normalized handcrafted GRF parameters. However, the results did not exhibit an improvement compared to using handcrafted GRF parameters without PCA. Based on these findings, PCA-representations of raw GRF data are recommended as input instead of relying only on handcrafted GRF parameters. These results are consistent with a study by Burdack et al. [76], where the highest performance was also obtained by employing PCA on raw GRF data along with support vector machines for the task of person re-identification in healthy controls. When taking into account additional aspects such as the explainability of the utilized ML methods, it is advisable to employ raw input data. In the publications in which we focused on explainability [42, 43], we intentionally avoided using PCA, as it introduces an additional abstract feature space prior to the application of ML methods. Our main goal was to provide explanations at the input level, which is crucial because it is the domain where clinical experts analyze the data. RQ3.2: What is the impact of data imbalance on the performance of automated gait classification? In Slijepcevic et al. (2017) [40], three experiments were conducted for both tasks of UC: functional gait disorders to investigate the impact of imbalanced data on the classification results, specifically addressing RQ3.2. The classification results were compared to those of the unbalanced setting, which served as the baseline. 
• Balanced number of sessions: The dataset was balanced by randomly selecting only one session per person (while the number of individuals per class remained unbalanced) to assess the effect of balanced numbers of recorded sessions per individual.

• Balanced number of persons: The dataset was balanced by randomly subselecting individuals per class to match the size of the smallest class (while including all sessions from these individuals) to examine the effect of balanced numbers of individuals per class.

• Balanced number of persons and sessions: A fully balanced dataset was created, containing only one session per person and equal numbers of persons per class (i.e., with respect to the smallest class), to explore the combined effect of balancing the number of individuals and sessions.

In all three experimental settings, and especially in the last case, balancing the dataset led to significant improvements in terms of the deviation from the random baseline compared to the results without balancing. These findings emphasize the importance of considering intra-patient variability and data imbalance when conducting automated analysis of clinical gait data. Moreover, these findings demonstrate that using balanced datasets, in terms of both the number of sessions per person and the number of persons per class, can lead to considerable improvements in classification performance (with respect to the random baseline). This observation served as motivation to predominantly use balanced datasets in subsequent publications.

RQ3.3: To what extent do different data aggregation methods impact the performance of automated gait classification?

To address RQ3.3, we explored the effectiveness of different aggregation methods for classifying gait analysis data. The baseline approach involved using all available trials from a session without aggregation.
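Aggregation of this kind can take place before training (early fusion) or on the classifier outputs (late fusion). As a minimal illustration, the sketch below averages the trials of one session into a mean waveform and combines per-trial predictions by majority voting. The function names, the toy waveforms, and the tie-breaking rule are assumptions made for this example and do not reproduce the implementation used in the publications.

```python
from collections import Counter


def mean_waveform(trials):
    """Early fusion: average equally long trials sample by sample."""
    n_trials = len(trials)
    return [sum(samples) / n_trials for samples in zip(*trials)]


def majority_vote(predictions):
    """Late fusion: return the most frequent per-trial prediction.

    Ties are resolved in favour of the label encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical, time-normalized trials of one session (toy values).
session_trials = [
    [0.0, 1.0, 0.5],
    [0.2, 1.2, 0.3],
    [0.1, 0.8, 0.4],
]
print(mean_waveform(session_trials))

# Hypothetical per-trial predictions of a classifier for the same session.
print(majority_vote(["hip", "knee", "hip"]))
```

With early fusion the classifier sees one aggregated sample per session, whereas with late fusion it is trained on all trials and only its predictions are combined.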
We explored various early fusion approaches, which involved aggregating or subselecting data samples from an individual before training the ML model. Additionally, we considered a late fusion approach that aggregated the predictions of the ML model, trained on all trials from an individual, using majority voting. The median waveform and most representative trial approaches failed to surpass the baseline performance. In contrast, among the early fusion approaches, the mean waveform method showed the most significant improvement. The late fusion approach demonstrated better results compared to early fusion methods, suggesting that introducing an abstraction layer to the classifier’s outputs could enhance robustness.

1.5.4 Goal 4 – Comparison of Traditional ML and Deep Learning

RQ4.1: How do traditional ML models compare to deep neural networks for the automated gait classification in terms of performance?

Regarding RQ4.1, we observed somewhat diverse outcomes when comparing deep learning and traditional ML methods in the two examined use cases. In the UC: functional gait disorders, convolutional neural networks, support vector machines, and multi-layer perceptrons were examined across six classification tasks [42]. The classification results revealed that there were no significant performance differences among the ML methods. In the UC: cerebral palsy, we investigated the performance and explainability of various ML models, including convolutional neural networks, self-normalizing neural networks, random forests, and decision trees. For performance comparison, support vector machines and gradient boosting classifiers served as baseline models. The results revealed that random forests outperformed all other ML methods, achieving consistent results across diverse input scenarios with kinematic data (peak performance of 93.4%).
Gradient boosting exhibited slightly lower performance (peak performance of 92.0%), while decision trees ranked third in most input scenarios (peak performance of 89.7%). Convolutional neural networks and self-normalizing neural networks achieved peak performances of 86.6% and 85.7%, respectively, which were slightly inferior to the performance of decision trees. Surprisingly, support vector machines achieved the lowest overall performance, reaching only a peak performance of 78.8%. The higher performance of tree-based ML models can be attributed to their robust generalization ability with limited training data, a characteristic not shared by convolutional neural networks and self-normalizing neural networks. Deep learning methods tend to overfit when trained on smaller datasets, which may have contributed to their comparatively lower performance.

1.5.5 Goal 5 – Evaluation of Explainability Approaches

RQ5.1: To what extent can explainability approaches be employed to determine the input features on which ML models base their decisions for automated gait classification, and are these relevant input features statistically justified and in line with clinical assessment?

The evaluation of RQ5.1 was conducted on different levels in three publications. The study presented in Slijepcevic et al. (2017) [40] evaluated explainability on the data level by employing linear discriminant analysis to assess the discriminative power of handcrafted domain-specific features and PCA representations. The publications focusing on explainability [42, 43] proposed various explainability approaches to provide explanations at the prediction, class, and model level.
Utilizing linear discriminant analysis and the visual assessment of boxplots of the handcrafted domain-specific features revealed that discrete parameters identified at the local minima and maxima within the GRF signals, as well as spatio-temporal parameters, showed the highest discriminative properties. These results are in line with clinical research, as this subset of domain-specific features is frequently employed to evaluate the progress of therapy in clinical practice [77].

For the UC: functional gait disorders, the model explanations for the three investigated ML methods, i.e., convolutional neural networks, multi-layer perceptrons, and support vector machines, exhibited a high degree of overlap, particularly regarding the location of relevant regions in the input data [42]. In certain signal regions, there were only slight differences in the amplitude of relevance scores. Furthermore, in Slijepcevic et al. (2022) [42], we proposed the use of statistical parametric mapping for the statistical assessment of input data. By employing this method, we were able to identify regions in the input data that exhibit significant statistical differences between the classes. This analysis played a crucial role in evaluating the explainability results from a statistical perspective. The results demonstrate that in the majority of cases, statistical parametric mapping reveals statistically significant differences in regions that are highly relevant according to the explainability method. Furthermore, according to clinical experts, relevant regions are strongly linked to the existing clinical literature and are considered clinically plausible.

Similarly, for the UC: cerebral palsy, model explanations demonstrated the highest relevance in the two clinically most relevant signals, i.e., sagittal knee and ankle angles [43].
This observation aligns with clinical expectations and is consistent with findings from other studies, which have also identified these signals as the most promising for distinguishing crouch gait, apparent equinus, jump gait, and true equinus. The explainability results revealed that deep neural networks showed a tendency to learn patterns from a wide range of input signals, including clinically relevant regions but also less relevant and potentially unrelated regions. In contrast, random forests and decision trees focused specifically on the clinically relevant regions. Interestingly, for the deep neural networks, some of the relevant regions outside the sagittal knee and ankle angles were also considered clinically meaningful by clinicians, such as the sagittal hip angle. On the other hand, some regions were not considered clinically meaningful. These regions can either be attributed to a bias in the data or might not have been considered in clinical practice because they exhibit subtle differences that have not yet been recognized as clinically relevant. These findings highlight the potential of explainability approaches not only to assist in evaluating the behavior of ML models but also to gain novel clinical insights into the underlying data.

RQ5.2: How effective are explainability approaches in detecting bias in ML models used for automated gait classification?

Two experiments presented in Slijepcevic et al. (2022) [42] investigated the suitability of explainability approaches for detecting biases related to differences in walking speed between healthy controls and patients and the absence of feature scaling in ML models. During the evaluation of the explainability results in the UC: functional gait disorders, clinicians identified relevant regions in the unaffected side that they believed were not directly linked to the specific gait disorders.
The clinicians hypothesized that these regions might be influenced by differences in walking speed between healthy controls and patients (rather than compensatory strategies of the unaffected side in patients’ data), suggesting a potential bias in the trained ML model. We were able to confirm this hypothesis in an experiment using a subset of the data in which walking speed was not statistically significantly different between the two groups. We trained the same model architecture on this subset and observed that relevant regions remained consistent between the two models, except for the regions previously identified by the clinicians. In this case, the explainability results provided the necessary information that led to a deeper understanding of the ML model and the underlying data. This allowed clinicians to identify the bias related to differences in walking speeds between the healthy controls and patients with functional gait disorders. To investigate the effect of feature scaling on ML models, we conducted experiments with and without min-max normalization of the input data for the UC: cerebral palsy. For the classification of non-normalized data, the most relevant input features were found in the vertical GRF component. The absence of relevant regions in the horizontal forces suggests that the ML models might not effectively utilize them, as a result of their small value range. On the other hand, explainability results for min-max normalized input data revealed highly relevant regions in the vertical and horizontal forces. The normalization process expanded the value range of the horizontal forces, allowing them to contribute at a level comparable to the vertical component. Despite the slightly better classification results achieved with non-normalized data for the multi-class tasks, the explainability results suggest that normalization is crucial for obtaining unbiased predictions. 
These results underscore the effectiveness of explainability approaches in identifying biases introduced by the absence of feature scaling techniques.

1.6 Limitations and Future Work

This section discusses limitations observed in the research related to this thesis and identifies future research directions that hold the potential to advance the field of automated classification of clinical gait data.

1.6.1 Performance Considerations

The results obtained in this thesis have demonstrated that the classification performance is highly dependent on the data modality and the classification task at hand. For example, when utilizing only GRF data, the multi-class classification yields relatively moderate results, with the highest accuracy reaching 62.0% in the UC: functional gait disorders [41] and 47.2% in the UC: cerebral palsy [43]. However, in the case of employing GRF data for a binary task, such as distinguishing between physiological and pathological gait, classification accuracy can reach up to 90.8% [40]. Compared to GRF data, joint angles offer a more detailed representation of the kinematics during walking. Consequently, the utilization of joint angles results in a significant enhancement of performance in the multi-class classification task for the UC: cerebral palsy, achieving a classification accuracy of 93.4% (i.e., a difference of 46.2 percentage points) [43]. In comparison to previous studies addressing the same classification task (i.e., classification of four gait patterns associated with cerebral palsy as defined by Rodda et al. [78]), our results achieved a similar level of classification performance. Reported performances in the literature ranged from 93.5% in the study by Zhang and Ma [79], which was based on a dataset comprising 200 samples, to 94.0% as reported by Darbandi et al. [80], utilizing a dataset of 60 samples.
In comparison, our study comprises a much larger dataset consisting of 302 children and shows that this performance level can be achieved even for large-scale data [43].

Generally, the observed results leave room for improvement and may still not meet clinical requirements. However, the assessment of whether this level of performance is adequate and clinically suitable predominantly relies on comprehending the human baseline. A promising direction for future research involves establishing a human baseline for a range of classification tasks in the field of human gait analysis. To define this baseline performance, an analysis of evaluations and annotations from multiple clinical experts from different gait laboratories is essential. In the course of this thesis, we conducted an evaluation of a human baseline for the multi-class classification task within the UC: functional gait disorders. Interestingly, the performance of the human baseline was significantly lower than the ML performance. One possible reason for this outcome is that clinical experts had to assess only the GRF data, which deviated from their typical clinical practice, without access to any contextual information (e.g., observing the patients while walking). Moreover, it is important to acknowledge the presence of uncertainties and class overlaps in annotations, resulting in classification outcomes that may not be perfect.

1.6.2 Unexplored ML Methodologies

Multi-label learning. Through the interviews with the clinical experts conducted during the evaluation of the explainability results, we identified that the classes in the UC: cerebral palsy are not always mutually exclusive in practice. These classes can exhibit overlaps (e.g., different trends in the patterns of the knee and ankle) but clinical experts inherently assess which pattern is more pronounced and use this for the annotation.
These uncertainties and class overlaps can introduce bias into the annotation process. Within this context, a potential future research direction involves exploring the appropriateness of a multi-label classification approach for gait pattern classification. With suitable ML approaches, dependencies between variables that are relevant to different classes can be modeled with greater accuracy and flexibility. A multi-label approach might be closer to the real-world setting and therefore more suitable for clinical practice.

Few-shot and zero-shot learning. A related topic is the challenge of modeling out-of-distribution samples, which include patterns that deviate from predefined categories, as well as handling “unseen” classes, referring to patterns or conditions not encountered during the training process. This is frequently the case in situations where data collection is limited, especially when encountering rare or unusual pathological gait patterns. These challenges underscore the significance of exploring few-shot and zero-shot learning in future research. Few-shot learning makes it possible to model even sparsely sampled classes effectively by extracting information from only a few training samples per class. Zero-shot learning, an extreme case of few-shot learning, represents a learning paradigm that enables the detection of classes that were not part of the initial training data at all. Few-shot and zero-shot learning approaches have rarely been investigated in the context of gait data analysis [81, 82, 83]. However, addressing the aforementioned challenges is crucial for developing more adaptable and clinically relevant decision-support systems.

Multi-modal learning. An aspect we realized while determining the aforementioned human baseline is that clinical experts utilize far more than just raw gait data during the assessment process of patients.
Clinical experts also utilize contextual information, e.g., they can visually observe individuals walking and estimate anthropometric data. This implies that clinicians employ a multi-modal approach when assessing gait patterns. A promising future direction is multi-modal learning for human gait data. A preliminary step in this direction was taken in this thesis by combining joint angles and GRF data. The additional inclusion of GRF data did not impact the results in this case, but it may yield different outcomes in other classification tasks. Promising modalities for modeling gait patterns might encompass not only joint angles and GRF data, but also subject-specific metadata (sex, age, walking speed, and anthropometric data), muscle activation determined via electromyography, data from inertial measurement units, and video recordings of patients.

Physics-informed ML approaches. Nowadays, ML approaches typically learn gait representations in a completely data-driven way, with the consequence of neglecting the biomechanical context and constraints in the data modeling process. Data-driven approaches have demonstrated limitations in effectively capturing gait primitives with respect to biomechanical constraints. A promising direction for future research is to incorporate kinematic and kinetic constraints directly in the ML process via physics-informed ML methods (e.g., [84]). The loss function of these methods can be constrained to follow biomechanical principles. This enables a more accurate modeling of the underlying physics of biomechanical data and motion primitives. Consequently, the ML models could demonstrate greater generalizability.

1.6.3 Effects of Influencing Factors on Gait Data

Human gait data exhibit a high level of inter-subject [63] and intra-subject [85] variability.
Furthermore, pathological and physiological gait patterns are strongly influenced by numerous interacting factors, including sex, age, body height, body weight, walking speed, and the use of footwear and prostheses. Hence, when evaluating a pathology using gait data during walking, it is crucial to consider that these data can be affected not only by the presence of an underlying pathology but also by the aforementioned influencing factors. Preliminary investigations of some of these influencing factors, i.e., sex [56] and age [57], were conducted within the scope of this thesis. Figure 1.3 illustrates the outcomes with respect to the influencing factor of sex. In subfigures B) and C), the color coding represents relevance scores (obtained via layer-wise relevance propagation), highlighting relevant input features for the distinction between male and female healthy controls. The results show a certain agreement of relevant features (according to the explainability method), the gait literature, and statistical assessment. However, there are also discrepancies among these three approaches. This motivates future research regarding sex differences on larger datasets. Future research should conduct similar investigations to explore the effects of all types of influencing factors on human gait as well as their interactions. These investigations can provide insights into the actual extent of these influences, opening up follow-up questions on how to incorporate these factors into modeling and how to make automated gait analysis robust to their effects.

1.6.4 Data Availability

The limited size of gait datasets could be the reason why deep learning methods fail to meet expectations in terms of outperforming traditional ML methods. Therefore, this thesis emphasizes the importance of considering heterogeneous and large-scale benchmark datasets to train and evaluate robust ML models and ensure their generalizability across different populations and gait laboratories.
Another important aspect of large-scale benchmark datasets would be the potential to train foundational models for human gait data (e.g., for various gait data modalities), which can then be adapted to specific gait use cases using a transfer learning approach [92]. In addition to data quantity, benchmark datasets should be controlled for various influencing factors, such as age, sex, body height, body mass, and speed differences (i.e., to ensure they represent a wide range of population variability and are balanced), enabling unbiased training and evaluation of ML models. Moreover, benchmark datasets should also incorporate data obtained from different walking surfaces (i.e., including indoor and outdoor environments), as well as various footwear conditions. This inclusion is important to enable future ML models to capture the inter- and intra-subject variability observed in real-world biomechanical data.

Figure 1.3: Explainability results for sex classification (adapted from [56]). A) Averaged GRF signals for both classes. The first three signals represent the three GRF components of the right side and are followed by the three GRF components of the left side. The shaded areas highlight the input features where statistical parametric mapping (two-sample t-test (p < 0.05)) indicated a statistically significant difference between both classes. B)–C) Averaged GRF signals for female/male class, with a band of one standard deviation, color-coded via relevance scores. D) Effect size obtained from statistical parametric mapping and total relevance (absolute sum of input relevance scores of both classes). The total relevance indicates the common relevance of the input signal for the classification task. E) Significant (filled boxes) and non-significant (empty boxes) handcrafted GRF parameters according to the literature [86, 87, 88, 89, 90, 91].
To reach this goal, collaboration with various research institutes and health care facilities is essential to gain access to a broader range of clinical gait data. In this regard, we have made an initial step by merging the GRF data from the GaitRec [46] dataset and the Gutenberg Gait Database [61] in a consistent and directly comparable data format. In the future, it is crucial to expand such data sharing initiatives to also encompass further data modalities (e.g., joint angles, joint moments, and muscle activation).

The imbalance within a dataset poses an additional challenge when working with real-world data. Inequalities in class cardinality may arise naturally, as certain pathologies may be less common than others, or healthy controls may be measured less frequently in gait laboratories than pathological cases. There are various strategies for dealing with data imbalances, which can be divided into two main groups, i.e., data-centered and algorithm-centered approaches [93]. In the present thesis, to address RQ3.2, the most commonly employed data-centered approach of data subsampling has been explored. Another frequently used data-centered approach involves upsampling the data using either new measurements, data augmentation techniques, or the generation of synthetic data. In the course of the present thesis, various data augmentation techniques for time series data, as presented in the survey by Iwana and Uchida [94] (i.e., jittering, magnitude warping, scaling, window slicing, window warping, guided warping, and time aligned averaging) have been experimented with for the data in UC: cerebral palsy. However, none of the augmentation approaches yielded improvements in the results for random forests and convolutional neural networks.
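For illustration, two of the simpler techniques from this list, jittering and scaling, are sketched below in a common formulation (additive Gaussian noise and multiplication by a single random factor, respectively). The function names, the toy waveform, and the default noise levels are assumptions of this sketch and not the settings used in the experiments.

```python
import random


def jitter(series, sigma=0.03, rng=random):
    """Add independent zero-mean Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def scale(series, sigma=0.1, rng=random):
    """Multiply the whole series by one random factor drawn around 1."""
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in series]


# Hypothetical, time-normalized joint angle waveform (toy values).
knee_angle = [5.0, 20.0, 60.0, 30.0, 5.0]

rng = random.Random(42)  # fixed seed so that the example is reproducible
augmented = [jitter(knee_angle, rng=rng), scale(knee_angle, rng=rng)]
print(augmented)
```

Jittering perturbs each sample independently, whereas scaling preserves the shape of the waveform and changes only its overall magnitude.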
Promising algorithm-centered approaches include the use of cost-sensitive methods, where upweighting is utilized for the samples of the minority class (i.e., assigning more weight to those samples in terms of cost). For gait analysis data, for example, Chia et al. [95] investigated the weighted Brier score as a cost function for the classification of musculoskeletal impairments in cerebral palsy, while Dumphart et al. [96] utilized the weighted cross-entropy loss to address the high imbalance in gait event classification. Future work should investigate various data-centered and algorithm-centered approaches to address data imbalances in real-world gait datasets.

In addition to the recorded data collected in laboratories, synthetic data can be employed to expand the volume of training data and compensate for data imbalances in the datasets. For this purpose, generative adversarial networks (GANs) [97] have been employed for other domains [98, 99]. In generative adversarial networks, two models are trained simultaneously: a generative model that learns the distribution of the training data and generates synthetic data, and a discriminative model that decides whether a sample originates from the training data or was generated artificially. The use of generative methods could be particularly valuable for imbalanced datasets and pathological gait patterns that are rare. Alternatively, auxiliary classifier generative adversarial networks (AC-GANs) [100] could be utilized to simultaneously generate synthetic data and model the classification task. For this purpose, the discriminator is trained not only to discriminate between real and artificial data, but also to classify the input data according to the task at hand. Future research should explore whether generative methods are appropriate for human gait data and the available datasets and whether the synthetic data they generate can enhance the learning process of ML models.
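The cost-sensitive idea mentioned above can be illustrated with a class-weighted cross-entropy. The sketch below is a minimal example that assumes inverse-frequency class weights and normalization by the total weight. Both choices, as well as the function names and the toy labels, are assumptions of this illustration and not necessarily the schemes used in [95] or [96].

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight."""
    total_weight = sum(weights[y] for y in labels)
    loss = sum(-weights[y] * math.log(p[y]) for p, y in zip(probabilities, labels))
    return loss / total_weight


# Hypothetical imbalanced labels: three samples of class "a", one of class "b".
labels = ["a", "a", "a", "b"]
weights = inverse_frequency_weights(labels)

# Hypothetical predicted class probabilities for the four samples.
probabilities = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.2},
    {"a": 0.7, "b": 0.3},
    {"a": 0.6, "b": 0.4},  # the minority sample is misclassified
]
print(weights, weighted_cross_entropy(probabilities, labels, weights))
```

With these weights, an error on the minority class contributes three times as much to the loss as an error on the majority class, so that both classes carry the same total weight.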
1.6.5 From Explainability to Trustworthiness

The present thesis in particular underscores the significance of explainability in automated gait classification. The proposed explainability approaches enable the identification and comparison of learning strategies across various classification methods. They effectively highlight the signal regions on which predictions of specific classes are grounded. However, approaches based on saliency maps explain which features are relevant to a certain prediction and to what extent, but they fall short in exploring the underlying reasons for this relevance or the specific patterns and concepts involved. This circumstance sometimes complicates the interpretation of explainability results and consequently limits trustworthiness. Therefore, there is a need for the development of human-centered interactive explanation methods (see Figure 1.2: interactive) that would enable clinicians to manipulate the input data, create counterexamples, and observe the behavior of ML models in near real-time.

Another approach that could be combined with interactive methods to further increase trust in ML models is the development of self-explaining (deep) learning methods (see Figure 1.2: static → model → local → self-explaining) that are inherently explainable. Baumhauer et al. [101] introduced an appropriate explainability method for this purpose, known as bounded logit attention. This approach introduces a trainable explanation module that can be integrated into a deep neural network (typically a one- or two-dimensional convolutional neural network), whether it is pretrained or not. By training this module or the entire network, it serves as a feature extractor at the final convolutional layer, inherently providing the features used for classification as an explanation.
A promising direction for future research is the development of self-explaining and interactive explainability approaches, as these would provide a deeper understanding of the ML models’ decision-making process and aid clinicians in gaining valuable insights from ML predictions. Developing models with transparent decision-making processes and providing insightful explanations for predictions would not only improve trust among clinicians but also pave the way for wider adoption of such methods in clinical practice.

1.7 Conclusion

The present thesis has made significant scientific contributions to the field of clinical gait analysis, addressing key challenges and gaps in the automated analysis of human gait. According to Google Scholar, the publications of this thesis had accumulated a total of 159 citations as of May 2024 (Horsak et al. (2020) [46]: 42 citations, Slijepcevic et al. (2017) [40]: 57 citations, Slijepcevic et al. (2020) [41]: 17 citations, Slijepcevic et al. (2022) [42]: 39 citations, and Slijepcevic et al. (2023) [43]: 4 citations). For more than two decades, machine learning has been applied to clinical gait data with the aim of enhancing the efficiency of clinical gait analysis and contributing to better-informed decision-making. However, many of the ML approaches proposed prior to the present thesis fail to satisfy the prerequisites for clinical practice. Their limitations include reliance on small datasets for training, tackling simplified tasks, and utilizing non-transparent ML approaches. The present thesis addressed these gaps and limitations. The main conclusions from the thesis are summarized in the following:
• The creation of a high-quality publicly available dataset establishes a solid foundation for further research, providing a valuable resource for the development and validation of gait analysis algorithms.
• The evaluation of the discriminative power of 3DGA modalities demonstrated significant quantitative benefits in favor of kinematic data over GRF data, especially in complex multi-class classification tasks. However, employing GRF data could be advantageous in efficiently differentiating between physiological and pathological gait patterns (i.e., binary classification) or in longitudinally monitoring the gait pattern of individuals, for instance, on a population basis utilizing wearable pressure insoles.
• The evaluation of data handling strategies offers effective solutions to manage the complexities inherent in large-scale gait datasets. These complexities encompass, e.g., variations in value ranges of gait signals, imbalances in the dataset, and the requirement to handle multiple trials for each individual. The findings can serve as valuable guidelines for data preprocessing and data aggregation in the domain of automated gait analysis.
• This thesis involves the development and evaluation of various classification approaches, ranging from traditional ML to deep learning methods. Interestingly, the performance of deep learning approaches did not meet expectations, as they either performed at a comparable level to or were surpassed by traditional ML methods. To further assess the full potential of deep learning for gait analysis, it is essential to support data sharing initiatives and conduct experiments on datasets even larger than those utilized in the present thesis.
• Furthermore, the development and evaluation of explainability approaches for the utilized classification models addresses a crucial aspect of translating automated gait classification approaches and findings into real-world applications. By offering techniques for data exploration and presenting methods for both decision and model explanations, this thesis lays the groundwork for clinicians to establish trust in automated gait classification.
In conclusion, the goals and outcomes attained in this thesis create a fertile foundation for significant progress in patient care, elevating diagnostic standards and contributing to the development of more efficient treatment plans in the future.

Bibliography

[1] A. Brand, L. Allen, M. Altman, M. Hlava, and J. Scott, “Beyond Authorship: Attribution, Contribution, Collaboration, and Credit,” Learned Publishing, vol. 28, no. 2, pp. 151–155, 2015.
[2] National Academies of Sciences, Engineering, and Medicine, Selected Health Conditions and Likelihood of Improvement with Treatment. National Academies Press, 2020.
[3] T. Vos, S. S. Lim, C. Abbafati, K. M. Abbas, M. Abbasi, M. Abbasifard, M. Abbasi-Kangevari, H. Abbastabar, F. Abd-Allah, A. Abdelalim, et al., “Global Burden of 369 Diseases and Injuries in 204 Countries and Territories, 1990–2019: A Systematic Analysis for the Global Burden of Disease Study 2019,” The Lancet, vol. 396, no. 10258, pp. 1204–1222, 2020.
[4] C. Mayrhuber, B. Bittschi, et al., “Fehlzeitenreport 2022. Krankheits- und unfallbedingte Fehlzeiten in Österreich,” WIFO Studies, 2022.
[5] J. Klimont, “Österreichische Gesundheitsbefragung 2019: Hauptergebnisse des Austrian Health Interview Survey (ATHIS) und methodische Dokumentation,” 2020.
[6] R. Baker, Measuring Walking: A Handbook of Clinical Gait Analysis. Mac Keith Press, 2013.
[7] B. Toro, C. Nester, and P. Farren, “A Review of Observational Gait Assessment in Clinical Practice,” Physiotherapy Theory and Practice, vol. 19, no. 3, pp. 137–149, 2003.
[8] C. Kirtley, Clinical Gait Analysis: Theory and Practice. Elsevier Health Sciences, 2006.
[9] E. Dorschky, M. Nitschke, A.-K. Seifer, A. J. van den Bogert, and B. M. Eskofier, “Estimation of Gait Kinematics and Kinetics From Inertial Sensor Data Using Optimal Control of Musculoskeletal Models,” Journal of Biomechanics, vol. 95, p. 109278, 2019.
[10] M. Mundt, W. R. Johnson, W. Potthast, B. Markert, A. Mian, and J.
Alderson, “A Comparison of Three Neural Network Approaches for Estimating Joint Angles and Moments From Inertial Measurement Units,” Sensors, vol. 21, no. 13, p. 4535, 2021.
[11] R. Caldas, M. Mundt, W. Potthast, F. B. de Lima Neto, and B. Markert, “A Systematic Review of Gait Analysis Methods Based on Inertial Sensors and Adaptive Algorithms,” Gait & Posture, vol. 57, pp. 204–210, 2017.
[12] T. Chau, “A Review of Analytical Techniques for Gait Data. Part 1: Fuzzy, Statistical and Fractal Methods,” Gait & Posture, vol. 13, no. 1, pp. 49–66, 2001.
[13] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-Level Classification of Skin Cancer With Deep Neural Networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[14] H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo, A. B. H. Hassen, L. Thomas, A. Enk, et al., “Man Against Machine: Diagnostic Performance of a Deep Learning Convolutional Neural Network for Dermoscopic Melanoma Recognition in Comparison to 58 Dermatologists,” Annals of Oncology, vol. 29, no. 8, pp. 1836–1842, 2018.
[15] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. S. Corrado, A. Darzi, et al., “International Evaluation of an AI System for Breast Cancer Screening,” Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[16] J. Figueiredo, C. P. Santos, and J. C. Moreno, “Automatic Recognition of Gait Patterns in Human Motor Disorders Using Machine Learning: A Review,” Medical Engineering & Physics, 2018.
[17] Y. Matsushita, D. T. Tran, H. Yamazoe, and J.-H. Lee, “Recent Use of Deep Learning Techniques in Clinical Applications Based on Gait: A Survey,” Journal of Computational Design and Engineering, vol. 8, no. 6, pp. 1499–1532, 2021.
[18] M. Längkvist, L. Karlsson, and A. Loutfi, “A Review of Unsupervised Feature Learning and Deep Learning for Time-Series Modeling,” Pattern Recognition Letters, vol. 42, pp. 11–24, 2014.
[19] T. Flash and B. Hochner, “Motor Primitives in Vertebrates and Invertebrates,” Current Opinion in Neurobiology, vol. 15, no. 6, pp. 660–666, 2005.
[20] J. Guerra, J. Uddin, D. Nilsen, J. Mclnerney, A. Fadoo, I. B. Omofuma, S. Hughes, S. Agrawal, P. Allen, and H. M. Schambra, “Capture, Learning, and Classification of Upper Extremity Movement Primitives in Healthy Controls and Stroke Patients,” in 2017 International Conference on Rehabilitation Robotics (ICORR), pp. 547–554, IEEE, 2017.
[21] A. Adadi and M. Berrada, “Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI),” IEEE Access, vol. 6, pp. 52138–52160, 2018.
[22] A. Holzinger, C. Biemann, C. S. Pattichis, and D. B. Kell, “What Do We Need to Build Explainable AI Systems for the Medical Domain?,” CoRR, vol. abs/1712.09923, 2017.
[23] W. Samek, T. Wiegand, and K.-R. Müller, “Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models,” ITU Journal: ICT Discoveries, vol. 1, no. 1, pp. 39–48, 2017.
[24] European Union, “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons With Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation),” Official Journal of the European Union, vol. L 119, pp. 1–88, 2016.
[25] European Commission, “Proposal for a Regulation of the European Parliament and the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts,” EUR-Lex-52021PC0206, 2021.
[26] I. El Maachi, G.-A. Bilodeau, and W. Bouachir, “Deep 1D-Convnet for Accurate Parkinson Disease Detection and Severity Prediction From Gait,” Expert Systems with Applications, vol. 143, p. 113075, 2020.
[27] W. Zeng, F. Liu, Q. Wang, Y. Wang, L. Ma, and Y.
Zhang, “Parkinson’s Disease Classification Using Gait Analysis via Deterministic Learning,” Neuroscience Letters, vol. 633, pp. 268–278, 2016.
[28] F. Wahid, R. K. Begg, C. J. Hass, S. Halgamuge, and D. C. Ackland, “Classification of Parkinson’s Disease Gait Using Spatial-Temporal Gait Features,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 6, pp. 1794–1802, 2015.
[29] M. Alaqtash, T. Sarkodie-Gyan, H. Yu, O. Fuentes, R. Brower, and A. Abdelgawad, “Automatic Classification of Pathological Gait Patterns Using Ground Reaction Forces and Machine Learning Algorithms,” in Engineering in Medicine and Biology Society, 2011 Annual International Conference of the IEEE, pp. 453–457, IEEE, 2011.
[30] C. Nüesch, V. Valderrabano, C. Huber, V. von Tscharner, and G. Pagenstert, “Gait Patterns of Asymmetric Ankle Osteoarthritis Patients,” Clinical Biomechanics, vol. 27, no. 6, pp. 613–618, 2012.
[31] D. Soares, M. de Castro, E. Mendes, and L. Machado, “Principal Component Analysis in Ground Reaction Forces and Center of Pressure Gait Waveforms of People With Transfemoral Amputation,” Prosthetics and Orthotics International, vol. 40, no. 6, pp. 729–738, 2016.
[32] A. Muniz and J. Nadal, “Application of Principal Component Analysis in Vertical Ground Reaction Force to Discriminate Normal and Abnormal Gait,” Gait & Posture, vol. 29, no. 1, pp. 31–35, 2009.
[33] E. Tjoa and C. Guan, “A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 11, pp. 4793–4813, 2020.
[34] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller, “Causability and Explainability of Artificial Intelligence in Medicine,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 4, p. e1312, 2019.
[35] C. Beyaert, R. Vasa, and G. E.
Frykberg, “Gait Post-stroke: Pathophysiology and Rehabilitation Strategies,” Neurophysiologie Clinique/Clinical Neurophysiology, vol. 45, no. 4-5, pp. 335–355, 2015.
[36] W. Zheng and M. Jin, “The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study,” SN Computer Science, vol. 1, pp. 1–13, 2020.
[37] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On Pixel-Wise Explanations for Non-linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLoS One, vol. 10, no. 7, p. e0130140, 2015.
[38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
[39] T. C. Pataky, “Generalized n-dimensional Biomechanical Field Analysis Using Statistical Parametric Mapping,” Journal of Biomechanics, vol. 43, no. 10, pp. 1976–1982, 2010.
[40] D. Slijepcevic, M. Zeppelzauer, A.-M. Gorgas, C. Schwab, M. Schüller, A. Baca, C. Breiteneder, and B. Horsak, “Automatic Classification of Functional Gait Disorders,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 5, pp. 1653–1661, 2017.
[41] D. Slijepcevic, M. Zeppelzauer, C. Schwab, A.-M. Raberger, C. Breiteneder, and B. Horsak, “Input Representations and Classification Strategies for Automated Human Gait Analysis,” Gait & Posture, vol. 76, pp. 198–203, 2020.
[42] D. Slijepcevic, F. Horst, S. Lapuschkin, B. Horsak, A.-M. Raberger, A. Kranzl, W. Samek, C. Breiteneder, W. I. Schöllhorn, and M. Zeppelzauer, “Explaining Machine Learning Models for Clinical Gait Analysis,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 2, pp. 1–27, 2021.
[43] D. Slijepcevic, M. Zeppelzauer, F. Unglaube, A. Kranzl, C. Breiteneder, and B.
Horsak, “Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy,” IEEE Access, vol. 11, pp. 65906–65923, 2023.
[44] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, “Unmasking Clever Hans Predictors and Assessing What Machines Really Learn,” Nature Communications, vol. 10, no. 1, p. 1096, 2019.
[45] G. Giakas and V. Baltzopoulos, “Time and Frequency Domain Analysis of Ground Reaction Forces During Walking: An Investigation of Variability and Symmetry,” Gait & Posture, vol. 5, no. 3, pp. 189–197, 1997.
[46] B. Horsak, D. Slijepcevic, A.-M. Raberger, C. Schwab, M. Worisch, and M. Zeppelzauer, “GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait,” Scientific Data, vol. 7, no. 1, p. 143, 2020.
[47] D. Slijepcevic, B. Horsak, C. Schwab, A. Raberger, M. Schüller, A. Baca, C. Breiteneder, and M. Zeppelzauer, “Ground Reaction Force Measurements for Gait Classification Tasks: Effects of Different PCA-Based Representations,” Gait & Posture, vol. 57, pp. 4–5, 2017.
[48] D. Janssen, W. I. Schöllhorn, K. M. Newell, J. M. Jäger, F. Rost, and K. Vehof, “Diagnosing Fatigue in Gait Patterns by Support Vector Machines and Self-Organizing Maps,” Human Movement Science, vol. 30, no. 5, pp. 966–975, 2011.
[49] B. M. Eskofier, P. Federolf, P. F. Kugler, and B. M. Nigg, “Marker-Based Classification of Young–Elderly Gait Pattern Differences via Direct PCA Feature Extraction and SVMs,” Computer Methods in Biomechanics and Biomedical Engineering, vol. 16, no. 4, pp. 435–442, 2013.
[50] J. Christian, J. Kröll, G. Strutzenberger, N. Alexander, M. Ofner, and H. Schwameder, “Computer Aided Analysis of Gait Patterns in Patients With Acute Anterior Cruciate Ligament Injury,” Clinical Biomechanics, vol. 33, pp. 55–60, 2016.
[51] R. Altilio, M. Paoloni, and M.
Panella, “Selection of Clinical Features for Pattern Recognition Applied to Gait Analysis,” Medical & Biological Engineering & Computing, vol. 55, no. 4, pp. 685–695, 2017.
[52] E. J. Harris, I.-H. Khoo, and E. Demircan, “A Survey of Human Gait-Based Artificial Intelligence Applications,” Frontiers in Robotics and AI, vol. 8, p. 749274, 2022.
[53] V. Arya, R. K. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, et al., “One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques,” arXiv preprint arXiv:1909.03012, 2019.
[54] A. Rind, D. Slijepčević, M. Zeppelzauer, F. Unglaube, A. Kranzl, and B. Horsak, “Trustworthy Visual Analytics in Clinical Gait Analysis: A Case Study for Patients with Cerebral Palsy,” in 2022 IEEE Workshop on TRust and EXpertise in Visual Analytics (TREX), pp. 8–15, IEEE, 2022.
[55] D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju, “Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods,” in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180–186, 2020.
[56] F. Horst, D. Slijepcevic, M. Zeppelzauer, A. Raberger, S. Lapuschkin, W. Samek, W. Schöllhorn, C. Breiteneder, and B. Horsak, “Explaining Automated Gender Classification of Human Gait,” Gait & Posture, vol. 81, pp. 159–160, 2020.
[57] D. Slijepcevic, F. Horst, M. Simak, S. Lapuschkin, A.-M. Raberger, W. Samek, C. Breiteneder, W. I. Schöllhorn, M. Zeppelzauer, and B. Horsak, “Explaining Machine Learning Models for Age Classification in Human Gait Analysis,” Gait & Posture, vol. 97, pp. S252–S253, 2022.
[58] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
[59] A. Ferrari, L.
Bergamini, G. Guerzoni, S. Calderara, N. Bicocchi, G. Vitetta, C. Borghi, R. Neviani, and A. Ferrari, “Gait-Based Diplegia Classification Using LSMT Networks,” Journal of Healthcare Engineering, vol. 2019, 2019.
[60] M. Sangeux and J. Polak, “A Simple Method to Choose the Most Representative Stride and Detect Outliers,” Gait & Posture, vol. 41, no. 2, pp. 726–730, 2015.
[61] F. Horst, D. Slijepcevic, M. Simak, and W. I. Schöllhorn, “Gutenberg Gait Database, a Ground Reaction Force Database of Level Overground Walking in Healthy Individuals,” Scientific Data, vol. 8, no. 1, p. 232, 2021.
[62] Y. Kobayashi, N. Hida, K. Nakajima, M. Fujimoto, and M. Mochimaru, “AIST Gait Database 2019.” https://unit.aist.go.jp/harc/ExPART/GDB2019_e.html, 2019. Online; accessed 28 July 2023.
[63] F. Horst, D. Slijepcevic, M. Simak, B. Horsak, W. I. Schöllhorn, and M. Zeppelzauer, “Modeling Biological Individuality Using Machine Learning: A Study on Human Gait,” Computational and Structural Biotechnology Journal, 2023.
[64] F. Horst, D. Slijepcevic, M. Zeppelzauer, A. Raberger, S. Lapuschkin, W. Samek, W. Schöllhorn, C. Breiteneder, and B. Horsak, “Explaining Automated Gender Classification of Human Gait,” Gait & Posture, vol. 81, pp. 159–160, 2020. ESMAC 2020 Abstracts.
[65] D. Slijepcevic, F. Horst, M. Simak, S. Lapuschkin, A. Raberger, W. Samek, C. Breiteneder, W. Schöllhorn, M. Zeppelzauer, and B. Horsak, “Explaining Machine Learning Models for Age Classification in Human Gait Analysis,” Gait & Posture, vol. 97, pp. S252–S253, 2022. ESMAC 2022 Abstracts.
[66] M. N. I. Shuzan, M. E. Chowdhury, M. B. I. Reaz, A. Khandakar, F. F. Abir, M. A. A. Faisal, S. H. M. Ali, A. A. A. Bakar, M. H. Chowdhury, Z. B. Mahbub, et al., “Machine Learning-Based Classification of Healthy and Impaired Gaits Using 3D-GRF Signals,” Biomedical Signal Processing and Control, vol. 81, p. 104448, 2023.
[67] D. Jani, V. Varadarajan, R. Parmar, M. H. Bohara, D. Garg, A. Ganatra, and K.
Kotecha, “An Efficient Gait Abnormality Detection Method Based on Classification,” Journal of Sensor and Actuator Networks, vol. 11, no. 3, p. 31, 2022.
[68] C. Pandey, D. S. Roy, R. C. Poonia, A. Altameem, S. R. Nayak, A. Verma, and A. K. J. Saudagar, “GaitRec-Net: A Deep Neural Network for Gait Disorder Detection Using Ground Reaction Force,” PPAR Research, vol. 2022, 2022.
[69] R. Yun, M. Salama, and L. Elrefaei, “An Exploratory Study on the Effect of Applying Various Artificial Neural Networks to the Classification of Lower Limb Injury,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 31, no. 2, pp. 448–461, 2023.
[70] J. Chakraborty, S. Upadhyay, and A. Nandy, “Musculoskeletal Injury Recovery Assessment Using Gait Analysis With Ground Reaction Force Sensor,” Medical Engineering & Physics, vol. 103, p. 103788, 2022.
[71] M. Iber, B. Dumphart, V.-A. de Jesus Oliveira, S. Ferstl, J. M. Reis, D. Slijepčević, M. Heller, A.-M. Raberger, and B. Horsak, “Mind the Steps: Towards Auditory Feedback in Tele-Rehabilitation Based on Automated Gait Classification,” in Proceedings of the 16th International Audio Mostly Conference, pp. 139–146, 2021.
[72] V. A. de Jesus Oliveira, D. Slijepčević, B. Dumphart, S. Ferstl, J. Reis, A.-M. Raberger, M. Heller, B. Horsak, and M. Iber, “Auditory Feedback in Tele-Rehabilitation Based on Automated Gait Classification,” Personal and Ubiquitous Computing, pp. 1–14, 2023.
[73] J. E. Deffeyes and D. M. Peters, “Time-Integrated Propulsive and Braking Impulses Do Not Depend on Walking Speed,” Gait & Posture, vol. 88, pp. 258–263, 2021.
[74] D. Slijepcevic, M. Zeppelzauer, C. Schwab, A.-M. Raberger, B. Dumphart, A. Baca, C. Breiteneder, and B. Horsak, “P 011–Towards an Optimal Combination of Input Signals and Derived Representations for Gait Classification Based on Ground Reaction Force Measurements,” Gait & Posture, vol. 65, pp. 249–250, 2018.
[75] G. Williams, D. Lai, A. Schache, and M.
Morris, “Classification of Gait Disorders Following Traumatic Brain Injury,” The Journal of Head Trauma Rehabilitation, vol. 30, no. 2, pp. E13–E23, 2015.
[76] J. Burdack, F. Horst, S. Giesselbach, I. Hassan, S. Daffner, and W. I. Schöllhorn, “Systematic Comparison of the Influence of Different Data Preprocessing Methods on the Performance of Gait Classifications Using Machine Learning,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 260, 2020.
[77] S. Winiarski and A. Rutkowska-Kucharska, “Estimated Ground Reaction Force in Normal and Pathological Gait,” Acta of Bioengineering & Biomechanics, vol. 11, no. 1, 2009.
[78] J. Rodda and H. Graham, “Classification of Gait Patterns in Spastic Hemiplegia and Spastic Diplegia: A Basis for a Management Algorithm,” European Journal of Neurology, vol. 8, pp. 98–108, 2001.
[79] Y. Zhang and Y. Ma, “Application of Supervised Machine Learning Algorithms in the Classification of Sagittal Gait Patterns of Cerebral Palsy Children With Spastic Diplegia,” Computers in Biology and Medicine, vol. 106, pp. 33–39, 2019.
[80] H. Darbandi, M. Baniasad, S. Baghdadi, A. Khandan, A. Vafaee, and F. Farahmand, “Automatic Classification of Gait Patterns in Children With Cerebral Palsy Using Fuzzy Clustering Method,” Clinical Biomechanics, vol. 73, pp. 189–194, 2020.
[81] C. Dindorf, J. Konradi, C. Wolf, B. Taetz, G. Bleser, J. Huthwelker, F. Werthmann, P. Drees, M. Fröhlich, and U. Betz, “Machine Learning Techniques Demonstrating Individual Movement Patterns of the Vertebral Column: The Fingerprint of Spinal Motion,” Computer Methods in Biomechanics and Biomedical Engineering, vol. 25, no. 7, pp. 821–831, 2022.
[82] P. Krondorfer, D. Slijepčević, F. Unglaube, A. Kranzl, C. Breiteneder, M. Zeppelzauer, and B. Horsak, “Deep Learning-Based Similarity Retrieval in Clinical 3D Gait Analysis,” Gait & Posture, vol. 90, pp. 127–128, 2021.
[83] K. A. Duncanson, S. Thwaites, D. Booth, G. Hanly, W. S. Robertson, E. Abbasnejad, and D.
Thewlis, “Deep Metric Learning for Scalable Gait-Based Person Re-identification Using Force Platform Data,” Sensors, vol. 23, no. 7, p. 3392, 2023.
[84] J. Zhang, Y. Zhao, F. Shone, Z. Li, A. F. Frangi, S. Q. Xie, and Z.-Q. Zhang, “Physics-Informed Deep Learning for Musculoskeletal Modeling: Predicting Muscle Forces and Joint Kinematics From Surface EMG,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 31, pp. 484–493, 2022.
[85] F. Horst, F. Kramer, B. Schäfer, A. Eekhoff, P. Hegen, B. Nigg, and W. Schöllhorn, “Daily Changes of Individual Gait Patterns Identified by Means of Support Vector Machines,” Gait & Posture, vol. 49, pp. 309–314, 2016.
[86] E. Chao, R. Laughman, E. Schneider, and R. Stauffer, “Normative Data of Knee Joint Motion and Ground Reaction Forces in Adult Level Walking,” Journal of Biomechanics, vol. 16, no. 3, pp. 219–233, 1983.
[87] B. Nigg, V. Fisher, and J. Ronsky, “Gait Characteristics as a Function of Age and Gender,” Gait & Posture, vol. 2, no. 4, pp. 213–220, 1994.
[88] M.-C. Chiu and M.-J. Wang, “The Effect of Gait Speed and Gender on Perceived Exertion, Muscle Activity, Joint Motion of Lower Extremity, Ground Reaction Force and Heart Rate During Normal Walking,” Gait & Posture, vol. 25, no. 3, pp. 385–392, 2007.
[89] M.-J. Chung and M.-J. J. Wang, “The Change of Gait Parameters During Walking at Different Percentage of Preferred Walking Speed for Healthy Adults Aged 20–60 Years,” Gait & Posture, vol. 31, no. 1, pp. 131–135, 2010.
[90] H. Toda, A. Nagano, and Z. Luo, “Age and Gender Differences in the Control of Vertical Ground Reaction Force by the Hip, Knee and Ankle Joints,” Journal of Physical Therapy Science, vol. 27, no. 6, pp. 1833–1838, 2015.
[91] K. A. Boyer, G. S. Beaupre, and T. P. Andriacchi, “Gender Differences Exist in the Hip Joint Moments of Healthy Older Walkers,” Journal of Biomechanics, vol. 41, no. 16, pp. 3360–3365, 2008.
[92] N. J.
Cronin, “Using deep neural networks for kinematic analysis: Challenges and opportunities,” Journal of Biomechanics, vol. 123, p. 110460, 2021.
[93] B. Krawczyk, “Learning from imbalanced data: open challenges and future directions,” Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.
[94] B. K. Iwana and S. Uchida, “An empirical survey of data augmentation for time series classification with neural networks,” PLoS One, vol. 16, no. 7, p. e0254841, 2021.
[95] K. Chia, I. Fischer, P. Thomason, K. Graham, and M. Sangeux, “Is It Feasible to Use an Automated System to Identify Gait Impairments?,” Gait & Posture, vol. 57, pp. 167–168, 2017.
[96] B. Dumphart, D. Slijepcevic, M. Zeppelzauer, A. Kranzl, F. Unglaube, A. Baca, and B. Horsak, “Robust deep learning-based gait event detection across various pathologies,” PLoS One, vol. 18, no. 8, p. e0288555, 2023.
[97] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[98] E. Brophy, Z. Wang, Q. She, and T. Ward, “Generative Adversarial Networks in Time Series: A Systematic Literature Review,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–31, 2023.
[99] S. Kazeminia, C. Baur, A. Kuijper, B. van Ginneken, N. Navab, S. Albarqouni, and A. Mukhopadhyay, “GANs for Medical Image Analysis,” Artificial Intelligence in Medicine, vol. 109, p. 101938, 2020.
[100] A. Odena, C. Olah, and J. Shlens, “Conditional Image Synthesis With Auxiliary Classifier GANs,” in International Conference on Machine Learning, pp. 2642–2651, PMLR, 2017.
[101] T. Baumhauer, D. Slijepcevic, and M. Zeppelzauer, “Bounded Logit Attention: Learning to Explain Image Classifiers,” in NeurIPS’22 Workshop on All Things Attention: Bridging Different Perspectives on Attention, 2022.
CHAPTER 2
Publications

2.1 GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait

Brian Horsak, Djordje Slijepcevic, Anna-Maria Raberger, Caterine Schwab, Marianne Worisch, and Matthias Zeppelzauer. GaitRec, a Large-Scale Ground Reaction Force Dataset of Healthy and Impaired Gait. Scientific Data, 7(1):143, 2020. DOI: 10.1038/s41597-020-0481-z

The final version of this publication is available at: https://doi.org/10.1038/s41597-020-0481-z. Permission for reprint granted, © 2020 Slijepcevic

Scientific Data | (2020) 7:143 | https://doi.org/10.1038/s41597-020-0481-z | www.nature.com/scientificdata

GaitRec, a large-scale ground reaction force dataset of healthy and impaired gait

Brian Horsak1,4 ✉, Djordje Slijepcevic2,4, Anna-Maria Raberger1, Caterine Schwab1, Marianne Worisch3 & Matthias Zeppelzauer2

The quantification of ground reaction forces (GRF) is a standard tool for clinicians to quantify and analyze human locomotion. Such recordings produce a vast amount of complex data and variables which are difficult to comprehend. This makes data interpretation challenging. Machine learning approaches seem to be promising tools to support clinicians in identifying and categorizing specific gait patterns. However, the quality of such approaches strongly depends on the amount of available annotated data to train the underlying models. Therefore, we present GaitRec, a comprehensive and completely annotated large-scale dataset containing bi-lateral GRF walking trials of 2,084 patients with various musculoskeletal impairments and data from 211 healthy controls. The dataset comprises data of patients after joint replacement, fractures, ligament ruptures, and related disorders at the hip, knee, ankle or calcaneus during their entire stay(s) at a rehabilitation center.
The data sum up to a total of 75,732 bi-lateral walking trials and enable researchers to classify gait patterns at a large scale as well as to analyze the entire recovery process of patients.

Background & Summary

The quantification of ground reaction forces (GRF) is a standard tool for clinicians to objectively measure human locomotion and to describe and analyze a patient’s gait performance in detail. The primary aim of instrumented gait analysis, regardless of the technology used, is to identify impairments that affect a patient’s gait pattern and to describe those quantitatively1. Recordings obtained during clinical gait analyses produce a vast amount of data which are difficult to comprehend and analyze due to their high dimensionality, temporal dependencies, strong variability, non-linear relationships and correlations within the data2. This makes data interpretation challenging and requires an experienced clinician to draw valid conclusions. Therefore, there is a constantly growing interest in applying machine learning techniques to clinical gait analysis data for the purpose of pattern identification and automated classification. Such systems might bear potential to assist clinicians in identifying and categorizing specific gait patterns into clinically relevant categories2,3. Machine learning methods employed in this context comprise, but are not limited to, neural networks4–6, support vector machines7–9, nearest neighbor classifiers10,11, and different clustering approaches12.

Our research group is collaborating with a local Austrian rehabilitation center of the Austrian Workers’ Compensation Board (AUVA). The AUVA is the social insurance for occupational risks for more than 3.3 million employees and 1.4 million pupils and students in Austria. The center has been using GRF assessments during walking to diagnose, plan and evaluate therapy outcomes for more than two decades.
1 St. Pölten University of Applied Sciences, Institute of Health Sciences, St. Pölten, Austria. 2 St. Pölten University of Applied Sciences, Institute of Creative Media Technologies, St. Pölten, Austria. 3 Rehabilitation Center Weißer Hof, Austrian Workers’ Compensation Board (AUVA), Klosterneuburg, Austria. 4 These authors contributed equally: Brian Horsak, Djordje Slijepcevic. ✉ e-mail: brian.horsak@fhstp.ac.at

Our main research goal within this collaboration was to develop automatic classification algorithms which support clinicians during data inspection and interpretation. To this end, we have developed a machine learning framework for gait classification and have performed comprehensive experiments13–16. One conclusion of our experiments is that the performance of automatic classification methods strongly depends on the amount of available training data. One reason for this is that state-of-the-art classifiers such as deep neural networks17 are extremely data hungry and require large-scale data to learn meaningful and generalizable patterns from the data. The training process, however, requires each walking trial in the dataset to be annotated and categorized exactly. Even though there are datasets available that are relevant to instrumented gait analysis, e.g.18, completely annotated large-scale datasets are very scarce. Our collaboration with the AUVA and their gait laboratory gave us the unique opportunity to process and manually annotate thousands of walking GRF trials from several years of clinical practice. These data have been used in our previous research and show a large potential for further research in gait analysis (see section Usage Notes) to achieve the long-term goal of putting assistive machine learning techniques into clinical gait analysis practice.
For this purpose, we make these data available to the public as the GaitRec dataset.

Methods

Data recording & testing protocol. The presented dataset is part of an existing clinical gait database maintained by a local Austrian rehabilitation center, which offers care to patients across all of Austria. Prior to the experiments involved and the publication of the dataset, approval was obtained from the local Ethics Committee of Lower Austria (GS1-EK-4/299-2014). Data were recorded during clinical practice between 2007 and 2018. Bi-lateral GRF were recorded by asking patients and healthy controls to walk unassisted and without a walking aid at self-selected walking speed on an approximately 10 m walkway with two centrally embedded force plates (Kistler, Type 9281B12, Winterthur, CH). The force plates were placed in a consecutive order and flush with the ground. Both plates were covered with the same walkway surface material, so that targeting was not an issue. During one session, subjects walked until a minimum number of (usually) ten valid recordings were available. These recordings were defined as valid by the assessor when the participant walked naturally (e.g. with respect to targeting) and there was a clean foot strike on each force plate. Left and right foot contacts for each force plate were identified and set by visual inspection by the assessor during each recording. Patients were asked to walk at their self-selected walking speed. Healthy controls walked at three different walking speeds (mean and standard deviation, m/s): slow 0.98 (0.14), self-selected 1.27 (0.13), and fast 1.55 (0.15). In accordance with the internal rehabilitation center’s standards, patients walked either barefoot, with their orthopedic or normal shoes, and with or without orthopedic insoles. Healthy controls walked either barefoot or with their normal shoes. Prior to the gait analysis session, each participant underwent a thorough physical examination by a physician.
The three analog GRF signals (vertical, anterior-posterior and medio-lateral force components) as well as the center of pressure (COP) were converted to digital signals using a sampling rate of 2000 Hz and a 12-bit analog-digital converter (DT3010, Data Translation Incorporation, Marlboro, MA, USA) with a signal input range of ±10 V. COP and GRF were recorded in the local force plate coordinate system (reaction-orientated). For easier usage, the orientation of the medio-lateral and anterior-posterior signals was unified for all data, so that medial and anterior forces are always represented as positive values. Due to the center’s internal standards, raw signals were only available down-sampled to 250 Hz. To avoid noise and signal peaks at the beginning and end of the signals, a threshold of 25 N was applied to all force data and the COP was calculated afterwards. These data are referred to as unprocessed (raw) GRF signals. Additionally, we have generated processed “ready to use” data. For this purpose, the COP was only calculated when the vertical force reached 80 N to avoid inaccuracies in COP calculation at small force values. Additionally, the medio-lateral COP coordinates were mean-centered and the anterior-posterior coordinates zero-centered. This was in line with the internal standards of the rehabilitation center. The processed force signals were then filtered using a 2nd order low-pass Butterworth filter with a cut-off frequency of 20 Hz to reduce noise and were time-normalized to 100% stance (i.e. 101 points). The choice of an appropriate cut-off frequency varies widely in the literature; 20 Hz seems to be a good trade-off between reducing noise and retaining as much physiological frequency content as possible19. The interested reader may also refer to ref. 20, p. 49. Amplitude values of the three force components were expressed as a multiple of body weight (BW) by dividing the force by the product of body mass and acceleration due to gravity (g).
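The two normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the function names, the use of linear interpolation for resampling, and the value g = 9.81 m/s² are assumptions made for illustration, and the low-pass filtering step is omitted.

```python
G = 9.81  # assumed gravitational acceleration in m/s^2 (not stated in the text)


def time_normalize(signal, n_points=101):
    """Resample one stance-phase signal to n_points samples (0-100% stance).

    Linear interpolation is an assumption; the text only states that the
    signals were time-normalized to 101 points.
    """
    if len(signal) < 2:
        raise ValueError("need at least two samples")
    last = len(signal) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return resampled


def to_body_weight(force_newton, body_mass_kg, g=G):
    """Express forces as multiples of body weight (BW = body mass * g)."""
    body_weight = body_mass_kg * g
    return [f / body_weight for f in force_newton]
```

For example, `to_body_weight(time_normalize(raw_force), body_mass_kg=75.0)` would turn a raw vertical force trace (here a hypothetical list `raw_force`) into a 101-point, body-weight-normalized signal.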
Amplitude and time normalization are both necessary operations to reduce effects of covariates (such as anthropometry) on the signals and to reduce temporal differences which make comparisons of different steps difficult, e.g.21,22. Note that the processed and amplitude-normalized data show small variations at the first and last frame of each signal. This might affect machine learning outcomes and therefore needs to be taken into account. Sessions with fewer than three bi-lateral trials per participant were not included in the dataset. Additionally, we have used an algorithm proposed by Sangeux and Polak to eliminate any outliers before trials were included in the GaitRec dataset23. This algorithm is based on the notion of depth, where the deepest signal is the equivalent of the median for univariate data, and it is sensitive to both shape and position of the signals. As suggested by Sangeux and Polak, we have used a score of three to run their algorithm. All processing steps were performed in Matlab 2019a (The MathWorks Inc., Natick, MA, USA).

Dataset & annotation. The presented dataset comprises completely anonymized GRF measurements from 2,084 patients with different musculoskeletal impairments (“gait disorders”, GD) and data from 211 healthy controls (HC) including additional metadata such as age, sex, shod condition, walking speed condition, etc. For details see Table 1. Note that there is a considerably large gender imbalance in all GD classes. Healthy controls were recruited in the geographical region around the clinic by public posting and considered eligible if they were free of pain and complaints at the lower extremity and spine and did not have any orthotics or orthopedic insoles. Exclusion criteria were any history of surgery or trauma at the spine or lower extremities. This was assessed by an experienced therapist.
A typical stay of a patient at the rehabilitation center ranges from a few days to several weeks and depends on factors such as diagnosis, administered therapy/surgery, and progress in recovery. During that time a patient is usually referred to gait analysis once a week. At the beginning of a patient’s stay, therapy outcomes are mutually defined between the therapist and the patient. After reaching these goals in whole or in part, patients are usually discharged. However, they can be readmitted if necessary. The present dataset contains the data gathered during the entire stay(s) of each patient and covers a patient’s entire rehabilitation progress. Different types of analyses can thus be performed on the dataset: an inter-participant analysis based on the initial assessment (first measurement session), e.g. for gait pattern classification, an intra-participant analysis, e.g. for the assessment of rehabilitation progress, or combinations.

Regarding annotation, the dataset was manually labeled by a well-experienced physical therapist (with more than a decade of clinical experience) based on the available medical diagnosis of each patient. The annotation labels are formed by two strings concatenated with an underscore “X_xxx”, where “X” denotes the general anatomical joint level at which the orthopedic impairment was located, i.e. at the hip “H”, knee “K”, ankle “A”, or calcaneus “C”. The second string (“xxx”) gives a more detailed localization and is joint dependent, see the following paragraphs for details. An overview of the class structure is shown in Fig. 1.
• Hip class (H_xxx): The most common injuries present in the hip class are fractures of the pelvis and thigh as well as luxation of the hip joint, coxarthrosis, and total hip replacement. The second string “xxx” refers to the following specific anatomical regions: pelvis (H_P), coxa (H_C), the femur (H_F), and their combinations when two or more anatomical areas are affected (H_PC, H_PF, H_CF, H_PCF), as well as one class for other diagnoses (H_O).
• Knee class (K_xxx): The knee class comprises patients after patella, femur or tibia fractures, ruptures of the cruciate or collateral ligaments or the meniscus, and total knee replacements. The second string “xxx” refers to the following specific anatomical regions or diagnosis: patella (K_P), a fracture near the knee joint of the femur or the tibia (K_F), rupture of ligaments or the menisci (K_R), and their combinations (K_PF, K_PR, K_FR, K_PFR), as well as one class for other diagnoses (K_O).
• Ankle class (A_xxx): The ankle class includes patients after fractures of the malleoli, talus, tibia, or lower leg, and ruptures of ligaments or the Achilles tendon. The second string “xxx” refers to the following specific anatomical regions or diagnosis: fracture of the tibia, fibula or talus near the ankle joint (A_F), rupture of ligaments or the Achilles tendon (A_R), lower leg shaft fracture (A_L), and their combinations (A_FR, A_FL, A_RL, A_FRL), as well as one class for other diagnoses (A_O).
• Calcaneus class (C_xxx): The calcaneus class comprises patients after calcaneus fractures or ankle fusion surgery. The second string “xxx” refers to the following specific anatomical regions or diagnosis: fracture (C_F) or arthrodesis (C_A).

The hierarchical multi-level categorization allows for grouping the data into a dataset with four GD classes (H ∪ K ∪ A ∪ C) and one healthy controls (HC) class, but also holds more details if needed. Figure 1 and Table 1 give a brief overview of the dataset.
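The label scheme lends itself to simple programmatic handling, for instance to collapse detailed labels into the four top-level GD classes. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that healthy controls carry the plain label "HC" as in the class taxonomy.

```python
# Joint levels as described in the annotation scheme ("X" in "X_xxx").
JOINT_LEVELS = {"H": "hip", "K": "knee", "A": "ankle", "C": "calcaneus"}


def parse_label(label):
    """Split a detailed label such as 'K_FR' into (joint_level, detail).

    Healthy controls are assumed to be labeled 'HC' and have no detail part.
    """
    if label == "HC":
        return ("HC", None)
    joint, sep, detail = label.partition("_")
    if joint not in JOINT_LEVELS or not sep or not detail:
        raise ValueError(f"unexpected label: {label!r}")
    return (joint, detail)


def top_level_class(label):
    """Map a detailed label to one of the five classes H, K, A, C, or HC."""
    return parse_label(label)[0]
```

Grouping all trials by `top_level_class` would yield the five-class setting (H, K, A, C, HC), while the detail part keeps the finer localization available when needed.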
Although the metadata includes a structured labelling of musculoskeletal impairments for each subject, there is no information available about the history of similar or other types of musculoskeletal injuries for both the patients and the healthy controls. This limiting factor needs to be recognized when using GaitRec.

Class | N | Age (yrs.), mean (SD) | Body mass (kg), mean (SD) | Sex (m/f) | Bi-lateral trials
Healthy C. | 211 | 34.7 (13.9) | 73.9 (15.6) | 104/107 | 7,755
Hip | 450 | 42.6 (12.8) | 82.4 (15.6) | 373/77 | 12,748
Knee | 625 | 41.6 (12.0) | 84.3 (18.6) | 426/199 | 19,873
Ankle | 627 | 41.6 (11.4) | 87.0 (18.0) | 498/129 | 21,386
Calcaneus | 382 | 43.5 (10.4) | 84.0 (14.5) | 339/43 | 13,970
Total | 2,295 | 41.5 (12.1) | 83.6 (17.3) | 1,740/555 | 75,732
Table 1. Demographic overview of the dataset and the pre-defined classes.

[Figure 1: tree diagram of the class taxonomy. HC; GD with H (H_P, H_C, H_F, H_O, H_PC, H_PF, H_CF, H_PCF), K (K_P, K_F, K_R, K_O, K_PF, K_PR, K_FR, K_PFR), A (A_F, A_R, A_L, A_O, A_FR, A_FL, A_RL, A_FRL), and C (C_F, C_A).]
Fig. 1 Class taxonomy. The class structure and the dependencies between the classes of the GaitRec dataset: Healthy Controls (HC), Gait Disorders (GD), Hip (H), Knee (K), Ankle (A), and Calcaneus (C). Details of the subclasses are described in Section Dataset & Annotation.

Data Records

All published data are fully anonymized. The data records are available online from figshare24. The dataset consists of twenty files holding the GRF data (see Table 2) and one file holding the metadata, including the annotations and additional subjects’ information, e.g. category label, sex, body mass, etc. All files are available as comma-separated value files (CSV). The twenty GRF data files are organized according to the following naming convention: “GRF_type_processing_side.csv”.
Variables | Associated file | Format | Dimension | Unit | Description
Vertical GRF | GRF_F_V_RAW_*.csv | double | 1 × n | Newton | Raw vertical ground reaction force
Anterior-posterior GRF | GRF_F_AP_RAW_*.csv | double | 1 × n | Newton | Raw braking and propulsive shear force
Medio-lateral GRF | GRF_F_ML_RAW_*.csv | double | 1 × n | Newton | Raw medio-lateral shear force
COP anterior-posterior | GRF_COP_AP_RAW_*.csv | double | 1 × n | Centimeter | Raw COP coordinate in walking direction
COP medio-lateral | GRF_COP_ML_RAW_*.csv | double | 1 × n | Centimeter | Raw COP coordinate in medio-lateral direction
Vertical GRF | GRF_F_V_PRO_*.csv | double | 1 × n | Multiple of body weight | Post-processed vertical ground reaction force
Anterior-posterior GRF | GRF_F_AP_PRO_*.csv | double | 1 × n | Multiple of body weight | Post-processed braking and propulsive shear force
Medio-lateral GRF | GRF_F_ML_PRO_*.csv | double | 1 × n | Multiple of body weight | Post-processed medio-lateral shear force
COP anterior-posterior | GRF_COP_AP_PRO_*.csv | double | 1 × n | % stance | Post-processed COP coordinate in walking direction
COP medio-lateral | GRF_COP_ML_PRO_*.csv | double | 1 × n | % stance | Post-processed COP coordinate in medio-lateral direction
Table 2. Description of the data stored in the “GRF_*.csv” files. “*” in the associated file name is a placeholder for “right” and “left”. n is either the number of frames during one step across the force plate for the unprocessed data (“RAW”) or a time-normalized vector of 101 points for the post-processed (“PRO”) data. Note that the first three columns of each file hold the SUBJECT_ID, SESSION_ID, and TRIAL_ID.

The type denotes whether the file holds the vertical (“F_V”), anterior-posterior (“F_AP”), medio-lateral (“F_ML”) or the anterior-posterior or medio-lateral COP (“COP_AP”, “COP_ML”)
time-series. Processing denotes whether the files hold the unprocessed raw data (“RAW”) or the post-processed data (“PRO”). The side denotes whether the data are from the “left” or “right” body side. The common prefix for all files is “GRF_”. An example filename is thus: “GRF_F_V_RAW_left.csv”.

Category | Variable | Format | Unit | Description
Identifiers | SUBJECT_ID | integer | - | Unique identifier of a subject
Identifiers | SESSION_ID | integer | - | Unique identifier of a session
Labels | CLASS_LABEL | string | - | Annotated class labels
Labels | CLASS_LABEL_DETAILED | string | - | Annotated class labels for subclasses
Subject Metadata | SEX | binary | - | female = 0, male = 1
Subject Metadata | AGE | integer | years | Age at recording date
Subject Metadata | HEIGHT | integer | centimeter | Body height in centimeters
Subject Metadata | BODY_WEIGHT | double | kg·m/s² | Body weight in Newton
Subject Metadata | BODY_MASS | double | kg | Body mass
Subject Metadata | SHOE_SIZE | double | EU | Shoe size in the Continental European System
Subject Metadata | AFFECTED_SIDE | integer | - | left = 0, right = 1, both = 2
Trial Metadata | SHOD_CONDITION | integer | - | barefoot & socks = 0, normal shoe = 1, orthopedic shoe = 2
Trial Metadata | ORTHOPEDIC_INSOLE | binary | - | without insole = 0, with insole = 1
Trial Metadata | SPEED | integer | - | slow = 1, self-selected = 2, fast = 3 walking speed
Trial Metadata | READMISSION | integer | - | indicates the number of re-admission = 0 … n
Trial Metadata | SESSION_TYPE | integer | - | initial measurement = 1, control measurement = 2, initial measurement after readmission = 3
Trial Metadata | SESSION_DATE | string | - | date of recording session in the format “DD-MM-YYYY”
Train-Test Split Information | TRAIN | binary | - | is part (=1) or is not part (=0) of TRAIN
Train-Test Split Information | TRAIN_BALANCED | binary | - | is part (=1) or is not part (=0) of TRAIN_BALANCED
Train-Test Split Information | TEST | binary | - | is part (=1) or is not part (=0) of TEST
Table 3. Description of the information stored in the metadata file.
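As an illustration of the naming scheme, the following Python sketch enumerates the expected GRF file names. It assumes the underscore-separated pattern of the example filename given in this section; the helper name `grf_filename` is hypothetical and not part of the import scripts shipped with the dataset.

```python
from itertools import product

# Building blocks of the file names as described in the text.
SIGNAL_TYPES = ("F_V", "F_AP", "F_ML", "COP_AP", "COP_ML")
PROCESSING = ("RAW", "PRO")
SIDES = ("left", "right")


def grf_filename(signal_type, processing, side):
    """Compose one GRF file name, e.g. 'GRF_F_V_RAW_left.csv'."""
    if signal_type not in SIGNAL_TYPES:
        raise ValueError(f"unknown signal type: {signal_type!r}")
    if processing not in PROCESSING:
        raise ValueError(f"unknown processing level: {processing!r}")
    if side not in SIDES:
        raise ValueError(f"unknown side: {side!r}")
    return f"GRF_{signal_type}_{processing}_{side}.csv"


# Five signal types x two processing levels x two sides = twenty files.
ALL_GRF_FILES = [
    grf_filename(t, p, s) for t, p, s in product(SIGNAL_TYPES, PROCESSING, SIDES)
]
```

Generating the names from the three components, rather than hard-coding them, makes it easy to load matching left/right or RAW/PRO pairs in a loop.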
Each of the “GRF_type_processing_side.csv” files is structured as a matrix with N rows × M columns. Each row holds the data of one subject and trial. The first column identifies each subject (“SUBJECT_ID”), the second column each recording session (“SESSION_ID”), and the third column each single trial within a recording session (“TRIAL_ID”). Note that due to the non-normalized nature of the data and the resulting different vector lengths in the “RAW” files, non-available numbers have been replaced by “NaN” to maintain a constant matrix dimension. The metadata file, which contains annotations and additional subject-related information, is available in “GRF_metadata.csv”. It is structured as a matrix with N rows × M columns (see Table 3). Here, the first two columns hold the SUBJECT_ID and SESSION_ID; the other columns hold information such as class labels, sex, body mass, age, and shod condition, see Table 3 for details. Note that this information is not available in all records. Missing values are identified as “NaN”. A particularly notable field is “AFFECTED_SIDE”, which indicates which leg is affected by a certain impairment (e.g. left knee) or if both sides are affected. To foster comparability of classification results derived from the GaitRec dataset, we included a predefined randomized partitioning of the dataset into three subsets for training and testing. This information is stored in the metadata file. The GaitRec dataset is split into an unbalanced training set (TRAIN) and a test set (TEST). The first can be used for training and optimization of the machine learning models (e.g. by cross-validation) and the latter for the final evaluation. However, unbalanced classes might negatively affect the optimization of machine learning models; therefore, we have created a balanced subset of TRAIN, referred to as TRAIN_BALANCED.
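Code that reads these files has to cope with the NaN padding in the RAW files and with the split flags in the metadata. The sketch below shows one possible way to do this with the Python standard library. It runs on small made-up CSV snippets, assumes that each file starts with a header row and that NaN occurs only as trailing padding, and uses hypothetical function names; only the column names are taken from Table 3.

```python
import csv
import io
import math


def read_trials(csv_text):
    """Parse GRF rows: three ID columns, then samples; trailing NaNs are dropped."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (assumed to be present)
    trials = []
    for row in reader:
        ids = tuple(int(v) for v in row[:3])  # SUBJECT_ID, SESSION_ID, TRIAL_ID
        samples = [float(v) for v in row[3:]]
        while samples and math.isnan(samples[-1]):
            samples.pop()  # remove padding added to keep the matrix rectangular
        trials.append((ids, samples))
    return trials


def sessions_in_split(metadata_text, split="TRAIN"):
    """Return the (SUBJECT_ID, SESSION_ID) pairs flagged with 1 for the given split."""
    reader = csv.DictReader(io.StringIO(metadata_text))
    return {
        (int(r["SUBJECT_ID"]), int(r["SESSION_ID"]))
        for r in reader
        if float(r[split]) == 1.0
    }


# Made-up demo data, not taken from the dataset.
GRF_DEMO = (
    "SUBJECT_ID,SESSION_ID,TRIAL_ID,f1,f2,f3,f4\n"
    "1,10,1,25.0,300.5,NaN,NaN\n"
    "2,20,1,26.0,310.0,290.0,27.5\n"
)
META_DEMO = (
    "SUBJECT_ID,SESSION_ID,TRAIN,TRAIN_BALANCED,TEST\n"
    "1,10,1,0,0\n"
    "2,20,0,0,1\n"
)
```

Selecting trials via the session keys returned by `sessions_in_split` keeps all trials of one session in the same subset, which matches the idea of a predefined split stored in the metadata file.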
The TRAIN_BALANCED subset comprises only data from initial assessments (first measurement session) which hold at least five trials for each body side per session. This is also the reason why the balanced splits contain slightly different numbers of trials. The data allocation to the different subsets was always performed on a random basis. Details of the train/test split configuration are depicted in Fig. 2.

Technical Validation

The provided data are available in raw format and post-processed with well-established de-noising and normalization procedures. This allows future researchers to either use the raw data and post-process them as desired (e.g., filtering, thresholding, normalization, etc.) or to employ the ready-to-use post-processed data. The accuracy of the force plates was not specifically assessed during the data capturing period. However, the force plates and the measurement equipment have been checked and serviced regularly during clinical practice. To get a picture of the data integrity, the post-processed data are plotted in Fig. 3.

Usage Notes

The data records are stored in *.csv files and can be easily imported into any desired software package for further data analysis. The dataset also contains two scripts which allow easy data import for Matlab (The MathWorks, Inc., Natick, Massachusetts, United States, 2019a) and Python (Python Software Foundation, 3.7). Benchmarks for automatically classifying the presented data based on the first annotation level into five classes, i.e. H vs. K vs. A vs.
C vs. HC, can be found in our earlier work13–15. These works also provide a baseline approach that employs a signal representation based on Principal Component Analysis (PCA) combined with a Support Vector Machine (SVM) as a classifier for orientation and comparison. Note, however, that the presented dataset is an extended version of the dataset used in these studies and that results may thus slightly deviate from those of our previous studies. The studies further elaborate on the optimization of post-processing of GRF data for the purpose of gait classification.

[Figure 2: pie charts of the train/test split. Values shown in the charts:
Unbalanced training set: Train 52745 (70%), Test 22987 (30%)
Classes in Train split: A 15213 (29%), C 10728 (20%), H 7900 (15%), HC 5563 (11%), K 13341 (25%)
Classes in Test split: A 6173 (27%), C 3242 (14%), H 4848 (21%), HC 2192 (10%), K 6532 (28%)
Balanced training set: Train 6308 (22%), Test 22987 (78%)
Classes in Train split: A 1182 (19%), C 1230 (19%), H 1245 (20%), HC 1434 (23%), K 1217 (19%)
Classes in Test split: A 6173 (27%), C 3242 (14%), H 4848 (21%), HC 2192 (10%), K 6532 (28%)]
Fig. 2 Dataset composition. Configuration of the balanced and unbalanced train/test splits of the GaitRec dataset. The pie charts show the number of trials (in total and as a percentage) within each class and split.

Future work with the GaitRec dataset might focus on one of the research questions stated below. However, one should be aware that depending on the research question not all subsets of our dataset might be perfectly applicable due to their reduced sample size (i.e. for the balanced subsamples).
• Classifying healthy vs.
pathological gait
• Building statistical models of normative walking
• Classifying gait disorders
• Evaluation and prediction of therapy progress
• Gait data-record retrieval and similarity retrieval of trials
• Identification of subject-specific gait patterns
• Modeling dependencies between anthropometric/demographic data and the GRF signals

Fig. 3 Data overview. Visualization of all body-weight normalized vertical, anterior-posterior, and medio-lateral GRF signals of the affected side available per subject and class. For healthy controls all available recordings are visualized. The plots also show the mean (solid line) and its one-fold standard deviation (dotted line). Note that for easier usage the orientation of the medio-lateral and anterior-posterior signals was unified, so that medial and anterior forces are always represented as positive values.

For the purpose of comparability of derived results from the GaitRec dataset, we highly recommend performing model optimization (e.g. by cross-validation) on the training set only and keeping the test set untouched until the final evaluation. However, it has to be noted that the train/test set split does not coincide exactly with the splits in our baseline experiments because both are larger now13–15.

Received: 20 December 2019; Accepted: 6 April 2020; Published: 12 May 2020

References
1. Baker, R. Measuring Walking: A Handbook of Clinical Gait Analysis (Mac Keith Press, London, 2013).
2. Chau, T. A review of analytical techniques for gait data. Part 1: fuzzy, statistical and fractal methods. Gait Posture 13, 49–66 (2001).
3. Chau, T. A review of analytical techniques for gait data. Part 2: neural network and wavelet methods. Gait Posture 13, 102–120 (2001).
4. Lozano-Ortiz, C. A., Muniz, A. M.
S. & Nadal, J. Human gait classification after lower limb fracture using Artificial Neural Networks and principal component analysis. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2010, 1413–1416 (2010).
5. Zeng, W. et al. Parkinson’s disease classification using gait analysis via deterministic learning. Neurosci. Lett. 633, 268–278 (2016).
6. Vieira, A. et al. Software for human gait analysis and classification. In 2015 IEEE 4th Portuguese Meeting on Bioengineering (ENBENG), 1–1 (2015).
7. Wu, J., Wang, J. & Liu, L. Feature extraction via KPCA for classification of gait patterns. Hum. Movement Sci. 26, 393–411 (2007).
8. Wu, J. & Wang, J. PCA-based SVM for automatic recognition of gait patterns. J. Appl. Biomech. 24, 83–87 (2008).
9. Levinger, P., Lai, D., Begg, R. K., Webster, K. E. & Feller, J. A. The application of support vector machines for detecting recovery from knee replacement surgery using spatio-temporal gait parameters. Gait Posture 29, 91–96 (2009).
10. Mezghani, N. et al. Automatic classification of asymptomatic and osteoarthritis knee gait patterns using kinematic data features and the nearest neighbor classifier. IEEE T. Bio-Med. Eng. 55, 1230–1232 (2008).
11. Alaqtash, M. et al. Automatic classification of pathological gait patterns using ground reaction forces and machine learning algorithms. In Conf. Proc. IEEE. Eng. Med. Biol. Soc., 453–457 (2011).
12. Ferrarin, M. et al. Gait pattern classification in children with Charcot–Marie–Tooth disease type 1a. Gait Posture 35, 131–137 (2012).
13. Slijepcevic, D. et al. Automatic classification of functional gait disorders. IEEE J. Biomed. Health 22, 1653–1661 (2017).
14. Slijepcevic, D. et al. Ground reaction force measurements for gait classification tasks: Effects of different PCA-based representations. Gait Posture 57, 4–5 (2017).
15. Slijepcevic, D. et al. P 011–Towards an optimal combination of input signals and derived representations for gait classification based on ground reaction force measurements.
Gait Posture 65, 249–250 (2018).
16. Slijepcevic, D. et al. Input representations and classification strategies for automated human gait analysis. Gait Posture 76, 198–203 (2020).
17. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
18. Brantley, J., Luu, T., Nakagome, S., Zhu, F. & Contreras-Vidal, J. Full body mobile brain-body imaging data during unconstrained locomotion on stairs, ramps, and level ground. Sci. Data 5, 180133 (2018).
19. Mai, P. & Willwacher, S. Effects of low-pass filter combinations on lower extremity joint moments in distance running. J. Biomech. 95, 109311 (2019).
20. Winter, D. A. Biomechanics and Motor Control of Human Movement, 4th edn (Wiley, Hoboken, NJ, 2009).
21. Mullineaux, D. R., Milner, C. E., Davis, I. S. & Hamill, J. Normalization of ground reaction forces. J. Appl. Biomech. 22, 230–233 (2006).
22. Helwig, N. E., Hong, S., Hsiao-Wecksler, E. T. & Polk, J. D. Methods to temporally align gait cycle data. J. Biomech. 44, 561–566 (2011).
23. Sangeux, M. & Polak, J. A simple method to choose the most representative stride and detect outliers. Gait Posture 41, 726–730 (2015).
24. Horsak, B. et al. GaitRec, a large-scale ground reaction force dataset of healthy and impaired gait. figshare, https://doi.org/10.6084/m9.figshare.c.4788012 (2020).

Acknowledgements
This work was partly funded by the NFB - Lower Austrian Research and Education Company (NFB) and the Provincial Government of Lower Austria, Department of Science and Research (LSC14-005 and FTI17-014). We want to thank Marianne Worisch, Szava Zoltán, and Theresa Fischer for their great assistance in data preparation and their great support in clinical and technical questions.

Author contributions
B.H. and M.Z. developed the research agenda behind this work and raised the funding for this research. Both supervised the team during the entire project. B.H. and A.M.R.
drafted the first manuscript of this article and coordinated the manuscript with all co-authors. M.W. was responsible for dataset annotation. D.S. was responsible for data cleaning, dataset construction and creating the final files. D.S. was supported by C.S., M.W., and B.H. D.S. (post-)processed the GRF data, verified their validity in classification experiments and created the main data record files. D.S. implemented the data import scripts. All authors contributed to the writing of the manuscript and to proof-reading.

Competing interests
The authors declare no competing interests.

Additional information
Correspondence and requests for materials should be addressed to B.H. Reprints and permissions information is available at www.nature.com/reprints. Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article. © The Author(s) 2020

2.2 Automatic Classification of Functional Gait Disorders

Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, and Brian Horsak. Automatic Classification of Functional Gait Disorders. IEEE Journal of Biomedical and Health Informatics, 22(5):1653–1661, 2017. DOI: 10.1109/JBHI.2017.2785682

The final version of this publication is available at: https://doi.org/10.1109/JBHI.2017.2785682. Permission for reprint granted, © 2018 IEEE

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 22, NO. 5, SEPTEMBER 2018

Automatic Classification of Functional Gait Disorders

Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, and Brian Horsak

Abstract—This paper presents a comprehensive investigation of the automatic classification of functional gait disorders (GDs) based solely on ground reaction force (GRF) measurements. The aim of this study is twofold: first, to investigate the suitability of the state-of-the-art GRF parameterization techniques (representations) for the discrimination of functional GDs; and second, to provide a first performance baseline for the automated classification of functional GDs for a large-scale dataset. The utilized database comprises GRF measurements from 279 patients with GDs and data from 161 healthy controls (N). Patients were manually classified into four classes with different functional impairments associated with the “hip”, “knee”, “ankle”, and “calcaneus”.
Different parameterizations are investigated: GRF parameters, global principal component analysis (PCA) based representations, and a combined representation applying PCA on GRF parameters. The discriminative power of each parameterization for different classes is investigated by linear discriminant analysis. Based on this analysis, two classification experiments are pursued: distinction between healthy and impaired gait (N versus GD) and multiclass classification between healthy gait and all four GD classes. Experiments show promising results and reveal, among other things, that several factors, such as imbalanced class cardinalities and varying numbers of measurement sessions per patient, have a strong impact on the classification accuracy and therefore need to be taken into account. The results represent a promising first step toward the automated classification of GDs and a first performance baseline for future developments in this direction.

Manuscript received June 28, 2017; revised October 30, 2017; accepted December 11, 2017. Date of publication December 20, 2017; date of current version August 31, 2018. This work was supported by the NFB Lower Austrian Research and Education Company and the Provincial Government of Lower Austria, Department of Science and Research (LSC14-005). (Corresponding author: Djordje Slijepcevic.) D. Slijepcevic and M. Zeppelzauer are with the ICMT Institute of Creative Media Technologies, St. Pölten University of Applied Sciences, St. Pölten 3100, Lower Austria, Austria (e-mail: djordje.slijepcevic@fhstp.ac.at; matthias.zeppelzauer@fhstp.ac.at). A.-M. Gorgas, C. Schwab, and B. Horsak are with the St. Pölten University of Applied Sciences, St. Pölten 3100, Austria (e-mail: Anna-Maria.Gorgas@fhstp.ac.at; Caterine.Schwab@fhstp.ac.at; brian.horsak@fhstp.ac.at). M. Schüller and A. Baca are with the University of Vienna, Vienna 1010, Austria (e-mail: schueller.michael@gmx.at; arnold.baca@univie.ac.at). C.
Breiteneder is with the TU Wien, Vienna 1040, Austria (e-mail: christian.breiteneder@tuwien.ac.at). This paper has supplementary downloadable material available at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2017.2785682

Index Terms—Ground reaction force (GRF), gait classification, principal component analysis (PCA), gait parameters, machine learning.

I. INTRODUCTION

Gait analysis is a tool for clinicians to objectively quantify human locomotion and to describe and analyze a patient’s gait performance. The primary aim is to identify impairments that affect a patient’s gait pattern [1]. Recordings obtained during clinical gait analyses produce a vast amount of data which are difficult to comprehend and analyze due to their high dimensionality, temporal dependences, strong variability, non-linear relationships and correlations within the data [2]. This makes data interpretation challenging and requires an experienced clinician to draw valid conclusions. Several automatic analysis approaches based on machine learning have been published in recent years to tackle these problems and to support clinicians in identifying and categorizing specific gait patterns into clinically relevant categories [2], [3]. Machine learning methods employed in this context comprise neural networks [4]–[6], support vector machines (SVMs) [7]–[9], nearest neighbor classifiers [10], [11], and different clustering approaches (hierarchical, k-means, etc.) [12]. The performance of such methods strongly depends on the input data representation [13]. Frequently used representations in gait analysis comprise discrete kinematic gait parameters (e.g. local minima and maxima of gait signals and time-distance parameters) [11], [14], [15].
Additionally, previous research has shown that global signal representations obtained by principal component analysis (PCA) [16], [17], kernel-based PCA (KPCA) [18], [19] and discrete wavelet transformation (DWT) [10], [11] are suitable for subsequent classification [10], [16].

Typical use cases for automatic gait analysis described in the literature show a moderate to high accuracy in distinguishing between different pathologies or patient groups [4], [7]–[9], [11], [16], [17]. However, most of the existing literature investigated rather simple cases, such as the differentiation between the affected and non-affected limb in hemiplegic patients [20], and the distinction between healthy gait and the gait of people with neurological disorders [5], [11], transfemoral amputation [16], and lower limb fractures [4], [17]. A more complex study is presented in [21], where several disorders associated with traumatic brain injuries are classified. The majority of published articles employed kinematic and kinetic data derived from three-dimensional gait analysis (3DGA), which provide a vast amount of kinematic and kinetic information for multiple joints. Drawbacks of such 3DGA measurement systems are the relatively time-consuming data recording, the need for highly trained staff as well as high acquisition and maintenance costs. Therefore, such analysis tools are often not suitable for daily use in clinical practice.

2168-2194 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 22, NO. 5, SEPTEMBER 2018
To manage the high patient throughput in rehabilitation centers, a frequently used approach is to combine simple visual inspection or 2D video recordings with the quantification of ground reaction forces (GRF) by force platforms, as changes in the morphology of the GRF waveforms reflect pathological gait [11], [17]. One major drawback of this approach is the loss of clinically relevant and quantifiable information (e.g. gait kinematics), causing a potential decrease in classification accuracy [22]. However, such simple approaches are common in clinical practice as they overcome the aforementioned limitations of 3DGA.

To date, few attempts have been published that use only GRF data for automated gait pattern classification [16], [23]. Most of these gait classification approaches show promising results. However, the majority of previous works employed relatively small datasets. Alaqtash et al. [11], for example, compared the data of 12 healthy adults to those of patients with cerebral palsy and multiple sclerosis (4 patients each), Muniz and Nadal [17] used data from 38 healthy controls and 13 patients with lower limb fracture, and Soares et al. [16] classified GRF data of 20 able-bodied individuals and 12 patients with transfemoral amputation. Such small datasets make it difficult to train robust and reliable classifiers that are applicable in complex real-world scenarios. Furthermore, a majority of studies [10], [17], [23] rely solely on the vertical ground reaction force for classification purposes, rather than considering all available GRF components, including the center of pressure (COP), for a more conclusive picture of the underlying gait pattern. Previous classification attempts mainly focused on the differentiation between specific diseases rather than drawing a distinction between functional gait disorders.
The work of Köhle and Merkl [24], [25], who clustered and classified GRF measurements into deficits of different body regions, represents an exception in this regard. Their dataset was about half of the size of the one presented in this article and their work also focused on patients walking with a prosthesis. In this article we define a functional gait disorder as the cause of a gait impairment, which is reflected by the individual gait patterns. These may be associated with a patient's condition after joint replacement surgery, fractures, ligament ruptures, osteoarthritis or related disorders. The classification of functional gait disorders is of particular interest in clinical examinations, as it may play a key role in detecting arthropathies or diseases at an early stage. In addition, such a classification may also indicate secondary disorders that otherwise might be easily overlooked during clinical examination.

The aim of this article is to present a detailed investigation of the automated classification of several functional gait disorders solely based on GRF data. The presented approach builds upon the aforementioned studies, e.g. [16], [17], [23], investigates the suitability of frequently used state-of-the-art GRF parameterization techniques for gait classification and analyzes their discriminative power. In the experiments we evaluate the individual representations on a large-scale and real-world dataset for different classification tasks. This paper therefore presents a first performance baseline for the automatic classification of different gait disorders in a real-world setting.

II. MATERIAL AND METHODS
A. Patients and Dataset
The presented retrospective study was approved by the local Ethics Committee of Lower Austria (GS1-EK-4/299-2014). The anonymized data used in this study are part of an existing clinical gait database maintained by a rehabilitation center of the Austrian Workers' Compensation Board (AUVA).
The AUVA is the social insurance for occupational risks for more than 3.3 million employees and 1.4 million pupils and students in Austria. The utilized database comprises GRF measurements from 279 patients with gait disorders (GD) and data from 161 healthy controls (N), both of various physical composition and gender (see Table I for details on the dataset). Patients were manually classified into four classes, namely calcaneus “C” (n = 82), ankle “A” (n = 62), knee “K” (n = 69), and hip “H” (n = 66), by a physical therapist, based on the available medical diagnosis of each patient. Thus, GD refers to C ∪ A ∪ K ∪ H. The individual GD classes include patients after joint replacement surgery, fractures, ligament ruptures, and related disorders associated with the above-mentioned anatomical areas. The most common injuries present in the hip class are fractures of the pelvis and thigh as well as luxation of the hip joint, coxarthrosis, and total hip replacement. The knee class comprises patients after patella, femur or tibia fractures, ruptures of the cruciate or collateral ligaments or the meniscus and total knee replacements. The ankle class includes patients after fractures of the calcaneus, malleoli, talus, tibia or lower leg, and ruptures of ligaments or the Achilles tendon. The calcaneus class comprises patients after calcaneus fractures or ankle fusion surgery. All of the above-mentioned injuries may occur individually or in combination within each class. Each patient performed one or several measurement sessions. In each session, eight recordings for two consecutive steps were performed. Each bilateral recording is referred to as one trial in this paper. Thus, the utilized dataset contains 1,187 sessions comprising 9,496 individual trials (see Table I for details).

B. Data Recording and Preprocessing
Gait analysis was performed on a 10 m walkway with two centrally embedded force plates (Kistler, Type 9281B12).
The force plates were placed in a consecutive order, allowing a person to walk across by placing one foot on each plate. Both plates were flush with the ground and covered with the same walkway surface material, so that targeting was not an issue. During a session, participants walked unassisted and without a walking aid at a self-selected walking speed until a minimum of eight valid recordings were available. These recordings were defined as valid by a supervisor when the participant walked naturally and there was a clean foot strike on the force plate. Prior to the gait analysis session, each participant underwent a rigorous physical examination by a physician.

TABLE I
DETAILS OF THE DATASET AND CLASSES
Class | Amount | Age (yrs.) Mean ± SD | Body Mass (kg) Mean ± SD | Sex (m/f) | Num. Sessions | Num. Trials
Healthy Control (N) | 161 | 32.4 ± 13.6 | 74.1 ± 16.2 | 84/77 | 161 | 1,288
Calcaneus (C) | 82 | 42.4 ± 9.9 | 84.5 ± 12.1 | 74/8 | 320 | 2,560
Ankle (A) | 62 | 40.0 ± 11.5 | 88.3 ± 16.9 | 56/6 | 259 | 2,072
Knee (K) | 69 | 41.5 ± 11.4 | 83.7 ± 19.6 | 44/25 | 258 | 2,064
Hip (H) | 66 | 43.6 ± 14.7 | 81.6 ± 18.3 | 53/13 | 189 | 1,512
SUM | 440 | 38.4 ± 13.3 | 80.7 ± 17.3 | 311/129 | 1,187 | 9,496

All processing steps and subsequent analyses were performed in Matlab 2016a (The MathWorks Inc., Natick, MA, USA). The three analog GRF signals as well as the two COP signals were converted to digital signals using a sampling rate of 2000 Hz and a 12-bit analog-digital converter (DT3010, Data Translation Incorporation, Marlboro, MA, USA) with a signal input range of ±10 V. A threshold of 10 N was used for step detection and 30 N for COP calculation. Raw signals were filtered using a 2nd order low-pass Butterworth filter with a cut-off frequency of 20 Hz.
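The filtering and step-detection steps described above can be illustrated with a minimal Python sketch (the study itself used Matlab). Several details are assumptions made only for illustration: the filter is applied in a single forward pass (the text does not say whether zero-phase filtering was used), the 10 N threshold is applied to the vertical force, and the function names `butter2_lowpass` and `detect_stance` are hypothetical.

```python
import math


def butter2_lowpass(signal, cutoff_hz=20.0, fs_hz=2000.0):
    """2nd-order Butterworth low-pass (bilinear transform), forward pass only.

    Assumption: a single causal pass; the paper does not state whether
    a zero-phase (forward-backward) variant was used.
    """
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b1 = 2.0 * b0
    b2 = b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - math.sqrt(2.0) * k + k * k) * norm

    out = []
    x1 = x2 = y1 = y2 = 0.0  # filter state starts at rest
    for x0 in signal:
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def detect_stance(fv_newton, threshold_n=10.0):
    """Return (start, end) of the first run of samples above the threshold.

    `end` is exclusive. Returns None if the force never exceeds the threshold.
    Assumption: detection is done on the vertical force component.
    """
    start = None
    for i, force in enumerate(fv_newton):
        if force > threshold_n and start is None:
            start = i
        elif force <= threshold_n and start is not None:
            return start, i
    if start is not None:
        return start, len(fv_newton)
    return None
```

A forward-only filter introduces a small phase lag; if timing parameters matter, a forward-backward pass is a common alternative, at the cost of a steeper effective roll-off.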
All gait measurements were temporally aligned so that they all started with the initial contact and ended with toe-off. They were further time-normalized to 100% stance by re-sampling the data to 1000 points. The processed signals are referred to as waveforms in this article. Amplitude values of the three force components, i.e. vertical (V), medio-lateral (ML), and anterior-posterior (AP), were expressed as a multiple of body weight (BW) by dividing the force by the product of body mass times acceleration due to gravity (g). The COP waveforms from each trial were normalized by the foot length (FL) determined during each session, expressed as a multiple of foot length.

C. Signal Representation
The representations employed in our investigation comprise (1) discrete GRF parameters (DP) in combination with time-distance parameters (TDP) [11], [14], [15]; (2) PCA-based parameterizations of the entire GRF waveforms [4], [8], [16]; and (3) a combination of the first two approaches, i.e. PCA applied to DPs and TDPs [7]. In the following, all three approaches are described in detail.

DPs were calculated for the affected limb and extracted from all three force components, FV(t) (vertical), FAP(t) (anterior-posterior), and FML(t) (medio-lateral), as well as from the COP displacement in the anterior-posterior (walking) direction COPAP(t) and in the medio-lateral direction COPML(t). An example of the GRF and corresponding COP waveforms is presented in Fig. 1. Furthermore, a more detailed visualization of the mean GRF waveforms over each class is illustrated in Fig. S1 (supplementary material). DPs include a set of predefined (most prominent) local minima and maxima of the waveforms, which were extracted by peak detection in a fully automatic way from each trial. Furthermore, impulses were calculated over different segments of the waveform by multiplying the average force (in N) by the time this force is active. To account for differences in body mass between participants [26], all impulses were divided by the product of body mass times acceleration due to gravity (g) and then multiplied by 100 (%BW·s). TDPs such as cadence (CAD), double support time (DS), gait velocity (GV), step length (STEPLEN), and stance time (ST) were calculated from two consecutive steps (affected and unaffected limb) and averaged over the eight valid trials. Table II lists all 52 extracted parameters.

Fig. 1. (Top) The characteristic shape of the three components of the GRF: the vertical force (FV), the anterior-posterior shear (FAP), and the medio-lateral shear (FML). (Bottom) The corresponding COP path for one step. Note that x and y axes are scaled slightly differently for better visualization.

In contrast to the GRF parameters (DPs and TDPs), the PCA takes the entire waveforms¹ of the affected limb into account and provides a holistic representation of the data. Complementary information to the parameters is thus captured. The main goal of PCA is to reduce the dimensionality of a dataset by transforming the data into a set of uncorrelated variables, i.e. the principal components (PCs) [27]. Each PC points in (and thus explains) one orthogonal direction of variance in the data.

¹ For the purpose of the present study, every third sample was used in order to reduce redundancy in the data, thereby improving the robustness of the decomposition.
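As a rough illustration of the time normalization, the body-weight normalization, and the impulse definition given in Sections II-B and II-C, the following Python sketch may help. It is not the authors' Matlab implementation: linear interpolation for the re-sampling, the value g = 9.81 m/s², and all function names are assumptions introduced here.

```python
G = 9.81  # assumed acceleration due to gravity in m/s^2


def time_normalize(samples, n_points=1000):
    """Re-sample a stance-phase signal to n_points by linear interpolation.

    Assumption: linear interpolation; the paper only states that the data
    were re-sampled to 1000 points.
    """
    if len(samples) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(samples) - 1
    out = []
    for j in range(n_points):
        pos = j * last / (n_points - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def to_body_weight(force_newton, body_mass_kg):
    """Express forces as multiples of body weight (BW = mass * g)."""
    body_weight = body_mass_kg * G
    return [f / body_weight for f in force_newton]


def impulse_percent_bw_s(force_newton, fs_hz, body_mass_kg):
    """Impulse as average force times duration, normalized to %BW*s."""
    duration_s = len(force_newton) / fs_hz
    mean_force = sum(force_newton) / len(force_newton)
    return 100.0 * mean_force * duration_s / (body_mass_kg * G)
```

For example, a constant force equal to one body weight acting for 0.6 s gives an impulse of 60 %BW·s, independent of the participant's mass, which is the purpose of the normalization.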
TABLE II
DISCRETE AND TIME-DISTANCE PARAMETERS, DESCRIPTION, TYPE OF NORMALIZATION AND PHYSICAL UNIT
Abbreviation | Description | Normalization | Unit
ST | Stance time is the duration of the stance phase of one foot | − | s
FV1 | Maximum value of FV within the braking phase of stance | Body weight | BW
TV1 | Time of FV1 | Stance time | %ST
FV2 | Minimum value of FV between TV1 and TV3 | Body weight | BW
TV2 | Time of FV2 | Stance time | %ST
FV3 | Maximum value of FV within the propulsive phase of stance | Body weight | BW
TV3 | Time of FV3 | Stance time | %ST
FAP1 | Maximum value of FAP between initial contact and TAP2 | Body weight | BW
TAP1 | Time of FAP1 | Stance time | %ST
FAP2 | Minimum value of FAP within the braking phase of stance | Body weight | BW
TAP2 | Time of FAP2 | Stance time | %ST
FAP3 | Maximum value of FAP within the propulsive phase of stance | Body weight | BW
TAP3 | Time of FAP3 | Stance time | %ST
FML1 | Minimum value of FML within the braking phase of stance | Body weight | BW
TML1 | Time of FML1 | Stance time | %ST
FML2 | Maximum value of FML within the braking phase of stance | Body weight | BW
TML2 | Time of FML2 | Stance time | %ST
FML3 | Maximum value of FML within the propulsive phase of stance | Body weight | BW
TML3 | Time of FML3 | Stance time | %ST
FVAVG | Mean value of FV | Body weight | BW
FAPAVG | Mean value of FAP | Body weight | BW
FMLAVG | Mean value of FML | Body weight | BW
IFV | Impulse of FV during stance | Body weight | %BW·s
IFAP | Impulse of FAP during stance | Body weight | %BW·s
IFML | Impulse of FML during stance | Body weight | %BW·s
IFV1 | Impulse of FV between initial contact and TV1 | Body weight | %BW·s
IFV2 | Impulse of FV between initial contact and TV2 | Body weight | %BW·s
IFV3 | Impulse of FV between initial contact and TV3 | Body weight | %BW·s
IFAPDEC | Impulse of FAP during the braking phase | Body weight | %BW·s
IFAPACC | Impulse of FAP during the propulsive phase | Body weight | %BW·s
IFLAT | Impulse of the lateral component of FML | Body weight | %BW·s
IFMED | Impulse of the medial component of FML | Body weight | %BW·s
COPANG | COP angle is the horizontal angle between the COP linear regression line and the x-axis (= foot rotation) | − | deg
COPDEV | COP deviation is the root mean square error of the COP linear regression | Foot length | FL
COPAP | COP range is the range in the anterior-posterior direction during stance phase | Foot length | FL
COPV | COP velocity is calculated as the ratio of foot length and stance time | Foot length | FL/s
COPML | COP range is the range in the medio-lateral direction during stance phase | Foot length | FL
DECT | Deceleration time (braking phase) is the duration of FAP being negative | − | s
ACCT | Acceleration time (propulsive phase) is the duration of FAP being positive | − | s
LR0080 | Loading rate represented as the slope of FV from the initial contact to 80% of FV1 | Body weight | N/s
LR2080 | Loading rate represented as the slope of FV from 20% to 80% of FV1 | Body weight | N/s
UR8000 | Unloading rate represented as the slope of FV from 80% of FV3 to the toe-off | Body weight | N/s
UR8020 | Unloading rate represented as the slope of FV from 80% to 20% of FV3 | Body weight | N/s
DS | Double support time during one stride | − | s
STEPLEN | Step length is the distance of the COP position from initial contact to following contralateral initial contact | − | m
STEPV | Step velocity is calculated as the ratio of step length and step time | − | km/h
STRIDET | Stride time is the duration from initial contact to initial contact of the ipsilateral foot | − | s
BF | Basic frequency is the mean number of strides per second (1/STRIDET) | − | Hz
CAD | Cadence is the number of steps per minute | − | 1/min
STEPWD | Step width is the medio-lateral distance of the mean COP between both feet | − | m
STRLEN | Stride length is the distance of the COP position from initial contact to following ipsilateral initial contact | − | m
GV | Gait velocity is calculated as the mean step velocity of both feet | − | km/h
Body weight (BW): product of body mass and acceleration due to gravity; %ST: percentage of stance time; %BW: percentage of body weight; FL: multiple of foot length.

The main intention is to obtain a lower-dimensional representation of our time- and weight-normalized waveforms similar to [4], [8], [16] by projecting the data onto those PCs which explain most variance in the data. This dimensionality reduction fosters subsequent machine learning [3]. We performed PCA on each of the five signals separately and concatenated the resulting PCs to obtain a feature vector for classification. This approach proved to be superior to other PCA-based representations in a preliminary study [28]. The final dimensionality of the obtained representations is specified by the amount of variance preserved in a particular projection, i.e. 98%, 95%, or 90%. An exemplary visualization of the different PCA representations is presented in Fig. S2 (supplementary material). A preliminary evaluation indicated that preserving 98% of the variance results in a good trade-off between data reduction and classification performance. Thus, all results presented in the following are based on the approach in which 98% of the variance is preserved (the number of resulting PCs is waveform-specific and ranges from four to twelve, i.e. for all five signals there are 39 PCs in total).

As a third representation, PCA was applied to the previously extracted DPs and TDPs (a vector comprising 52 parameters), similarly to Wu et al. [7]. This approach combines both methodologies and aims at extracting the most important information from the (possibly redundant) parameters.

D. Statistical Analysis
Our first aim was to investigate the suitability of different parameterization techniques for subsequent gait classification.
For this purpose we analyzed the variance and discriminative power of each DP and TDP across the different classes by descriptive statistics in a first step. We calculated the median, interquartile range (IQR) and range of each parameter within each class and visualized them as boxplots. This enabled us to visually inspect variances and distributions in and across the classes, thereby allowing a first estimation of the discriminative power of each parameter.

In a second step, we investigated the discriminative power of the parameters and the global PCA-based representations by linear discriminant analysis (LDA). A natural measure to describe the separation of two distributions (classes) is the Fisher criterion, which represents the core of LDA [29]. We applied (multi-class) LDA to assess the discriminative power of individual parameters for two (or more) classes. The advantage of this approach is that the discriminative power of a parameter (even across multiple classes) can be expressed by one scalar value that directly reflects the statistical properties of the input data. Hence, there is no need to apply additional modelling and data transformations (which may influence results) prior to LDA, which would be necessary for other methods such as SVM. Furthermore, this approach can easily be extended to estimate the discriminative power of a combination of several parameters by multi-dimensional LDA (e.g. in case of PCA-based representations). We computed the accuracy of LDA and reported the divergence from a random baseline [30] to quantify to which degree an input parameter or input representation is able to separate the underlying classes. The random baseline was estimated by the zero rule (always choosing the most frequent class in the dataset). Thus, in the case of five classes where the largest class contains 30% of the data, the random baseline equals 30%.

E. Classification
We applied two classification tasks to the dataset by using SVMs as classifiers: (1) (binary) classification between normal gait and all gait disorders (N/GD) and (2) (multi-class) classification between N and each of the four GD classes (N/C/A/K/H). In the first task, the class priors are imbalanced, i.e. there are many more observations in the combined GD class than in the normal class (see Table I). The second task separates each type of disorder from each other and from the normal class.

For the classification experiments the dataset was split into a training set (65%) and a test set (35%) such that the groups of patients in the two sets were mutually disjoint. The training dataset in combination with a k-fold cross-validation approach served to train the classifiers and to optimize their parameters (model selection), whereas the test dataset was used to evaluate the generalization ability of the trained models (and was not considered during model selection and hyper-parameter optimization). The calculated DPs and TDPs as well as the PCA-based representations served as input to classification. The parameters (DPs and TDPs) were normalized (each independently) in a twofold way, by min-max normalization and z-standardization, in order to determine the more suitable approach. The PCA representations were z-standardized. We employed SVMs for the classification with linear and radial basis function (RBF) kernels, provided by the LIBSVM library [31]. For hyper-parameter selection we applied a grid search over the regularization parameter $C \in [2^{-5}, 2^{15}]$ for the linear SVM and over $C \in [2^{-1}, 2^{15}]$ and the kernel hyper-parameter $\gamma \in [2^{-15}, 2^{5}]$ for the RBF SVM. During the grid search, a 5-fold cross-validation was performed on the training dataset. Finally, an SVM with the best parameters estimated during model selection was trained on the entire training set and evaluated on the test set.
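The patient-disjoint split and the hyper-parameter grids described above could be set up along the following lines. This is a sketch under stated assumptions, not the procedure used in the study: the 65% fraction is applied here to the number of patients (the text does not say whether it refers to patients, sessions, or trials), no stratification by class is performed, the grid uses a step of 2 in the exponent (only the ranges are given in the text), and the names `patient_disjoint_split` and `power_of_two_grid` are hypothetical.

```python
import random


def patient_disjoint_split(patient_ids, train_fraction=0.65, seed=0):
    """Split trial indices so that no patient occurs in both sets.

    patient_ids[i] is the patient identifier of trial i.
    Assumption: the fraction refers to the number of patients.
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    n_train = round(train_fraction * len(unique_patients))
    train_patients = set(unique_patients[:n_train])
    train_idx = [i for i, p in enumerate(patient_ids) if p in train_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p not in train_patients]
    return train_idx, test_idx


def power_of_two_grid(min_exp, max_exp, step=2):
    """Candidate values 2^min_exp ... 2^max_exp (exponent step is assumed)."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


# Search ranges as stated in the text; the spacing within them is assumed.
C_GRID_LINEAR = power_of_two_grid(-5, 15)
C_GRID_RBF = power_of_two_grid(-1, 15)
GAMMA_GRID_RBF = power_of_two_grid(-15, 5)
```

Splitting by patient rather than by trial matters here because each patient contributes several sessions with eight trials each; trials of one patient in both sets would make the test accuracy overly optimistic.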
Additionally, a k-nearest neighbor (k-NN) classifier and a multi-layer perceptron (MLP) were employed as a reference to compare their results to the performance of the SVM. Grid search was performed over various values of k for the k-NN. For the MLPs, different numbers and sizes of hidden layers were employed. As a performance measure we use the classification accuracy, which is the percentage of correct classifications among all classes and input samples. Since the random baseline varies between experiments, the absolute values of accuracy are of limited expressiveness. To enable a fair comparison, we employ the divergence from a random baseline approach [30] and thus provide for each experiment the difference between the random baseline and the absolute classification accuracy.

III. RESULTS AND DISCUSSION
This section presents and discusses the results of the statistical analysis and the classification experiments.

A. Statistical Analysis
The statistical analysis aimed at assessing the suitability of the individual GRF parameters (DPs and TDPs) for distinguishing different classes of gait disorders. For a parameter to be considered "good", its intra-class variation should be low (e.g. small IQR inside a given class), while its inter-class variation should be high (e.g. significantly different means or medians between the samples of different classes) [15].

The visual inspection of the boxplots for each parameter enables a first assessment of the intra- and inter-class variation and thereby gives an impression of the parameters' potential to differentiate between different classes. Fig. 2 shows boxplots for selected parameters. A presentation of boxplots for all 52 investigated parameters for all classes is provided in Fig. S3 (supplementary material). Parameters such as FV3 (see Fig. 2(a)) show a clear difference in the median and the IQR between the healthy controls and all four GD classes.
This indicates a high potential to discriminate between normal gait and arbitrary gait disorders. However, the overlap of the distributions within the GD classes indicates a low potential to discriminate between them.

Fig. 2. Example of boxplots for three parameters. Each boxplot shows the median and the IQR (box) for each class (outliers were removed for better visualization). Box-whiskers correspond to 1.5 times the box length and thus cover approximately ±2.7 standard deviations. The overlap of distributions between the classes gives an impression of the parameters' discriminative power. (a) FV3 [BW]. (b) TAP3 [%ST]. (c) IFAP [%BW·s].

Other parameters such as TAP3 (see Fig. 2(b)) vary strongly in the IQR and the median across the classes. While the IQR is high for calcaneus and ankle, the classes hip, knee, and the normal controls exhibit a very similar distribution. Thus, such a parameter has only limited potential to separate normal gait from general gait disorders. There may be, however, a certain potential to separate individual classes (in this case calcaneus) from other classes. Other parameters lack discriminative power. An example is IFAP (see Fig. 2(c)), which shows a similar median and overlapping distributions with a similar IQR across all classes. Several parameters are discriminative for particular classes or a group of classes. However, none of the observed parameters discriminates well between all classes. Therefore, the combination of several parameters for the distinction between classes seems advisable. These assumptions are further corroborated by the LDA results.
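The notion of a "good" parameter (low intra-class, high inter-class variation) and the zero-rule baseline used throughout the evaluation can be made concrete with a small Python sketch. The two-class Fisher criterion is written here in one common textbook form, squared mean difference over the sum of the within-class variances; the exact variant used by the authors is not stated, and the study reports LDA accuracy relative to the baseline rather than the raw criterion value. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def fisher_criterion(values_a, values_b):
    """Two-class Fisher criterion for a single scalar parameter.

    Assumption: population variances, classes weighted equally.
    """
    between = (mean(values_a) - mean(values_b)) ** 2
    within = pvariance(values_a) + pvariance(values_b)
    if within == 0.0:
        raise ValueError("within-class variance is zero")
    return between / within


def zero_rule_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def deviation_from_baseline(accuracy, baseline):
    """Positive values mean the classifier beats the zero rule."""
    return accuracy - baseline
```

With made-up values, `fisher_criterion([1, 2, 3], [7, 8, 9])` is much larger than `fisher_criterion([1, 2, 3], [2, 3, 4])`, mirroring the difference between a parameter with well-separated boxplots and one with overlapping distributions.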
LDA was applied to the individual parameters and their combination, as well as to the higher-dimensional PCA-based representations. This analysis aimed at quantifying the discriminative power of the investigated representations and thereby evaluating their suitability for automated classification. Fig. 3 illustrates discriminativity scores obtained by LDA in terms of deviation from the random baseline (zero rule). In detail, results for different combinations of classes (rows) are illustrated: rows 1–4 provide results for the discrimination of normal gait vs. ankle, calcaneus, hip or knee (each class separately). Row 5 shows how well all 5 classes can be differentiated from each other. Row 6 illustrates how well normal gait can be differentiated from all types of gait disorders. Rows 7–12 show how all possible pairs of gait disorder classes can be differentiated from each other. Positive discriminativity scores are represented by a color scale from blue (corresponding to low values) to yellow (representing high values), whereas negative values are colored in gray. Positive values mean that the random baseline is exceeded and that the respective input parameter or input representation exhibits a certain discriminative power (the higher the value the better). Negative values indicate the absence of discriminative power, i.e. the random baseline is not reached. It has to be noted that, since the different class partitions represented by the rows of Fig. 3 have different random baselines, the values across rows cannot be compared directly. Comparisons are solely valid along the rows. In general, however, columns including a larger number of high values indicate parameters or representations with a higher discriminative power. Similarly, rows with higher values represent tasks that are easier to solve than others.

The leftmost part of Fig. 3 illustrates the discriminativity scores for the individual parameters.
Several parameters achieve high scores for individual classes or combinations of classes, e.g. FAP3, FV3, FVAVG, FV1, TV3, TAP3, GV, STEPV, DS, STRLEN, FV2, STEPWD, CAD, BF, and STRIDET. No parameter, however, performs well across all tasks. This indicates that individual parameters are quite limited in expressiveness. The second part (ALL PARAMS) of Fig. 3 illustrates the results from the combination of all parameters. The combination yields much better discrimination across all rows of Fig. 3. This demonstrates that the individual parameters contain complementary information and attain synergies when they are combined. The third part of Fig. 3 visualizes the results of the PCA-based representations of the five input signals FV, FAP, FML, COPAP, and COPML. The three GRF components achieved higher scores compared to the COP signals. The rightmost part of Fig. 3 shows the discriminativity scores for combined PCA representations, i.e. all three GRF components combined (PCA FALL), both COP signals combined (PCA COPAP,ML), and all five components combined (PCA ALL). In general, the combination of components improved the results, which indicates that the individual GRF components are complementary to each other. The addition of the COP further improved the discriminative power. Thus, adding COP to a classification may contribute positively to the results. The combined representations (PCA ALL and ALL PARAMS) are able to contribute to all evaluated tasks (rows) of Fig. 3.

The evaluated representations are more suitable for differentiating between the healthy control group and a functional gait disorder (rows 1–4) than between two functional gait disorders (rows 7–12). Regarding the task N/GD, only a few parameters are able to exceed the random baseline. This is due to the fact that the combined set of all gait disorders contains many more samples than the class of healthy controls (i.e. 279 vs. 161 samples).
This yields a random baseline of around 87.1%, which is more difficult to exceed than the random baselines in other tasks.

Fig. 3. Discriminativity scores obtained by LDA for different selections of classes (rows). The figure is divided into four blocks. Each column represents a different input parameter or higher-dimensional input representation. Best viewed in color.

B. Classification

The results of the classification experiments, which were performed on data from the test set, are summarized in Table III. The test set was not presented to the classifier during the training phase and the selection of its parameters. Results are provided for the two classification tasks (N/C/A/K/H and N/GD) and for three different parameterizations. Additional experiments with other classifiers, such as the multi-layer perceptron (MLP) and the k-nearest neighbors algorithm (k-NN), were all outperformed by the SVM, which also confirms the results of Janssen, Schöllhorn et al. [32]. Therefore, and due to the limited space available, these results will not be discussed in detail.

TABLE III
CLASSIFICATION RESULTS (%) OF TWO TASKS - N/C/A/K/H AND N/GD - AND THREE DIFFERENT PARAMETERIZATION APPROACHES

Parameterization | Norm. | Dim. | N/C/A/K/H (RB: 31.8%), linear SVM | N/C/A/K/H (RB: 31.8%), RBF SVM | N/GD (RB: 87.1%), linear SVM | N/GD (RB: 87.1%), RBF SVM
GRF Parameters (DPs and TDPs) | z-score | 52 | 15.0 (46.8) | 8.8 (40.6) | 2.4 (89.5) | −0.8 (86.3)
GRF Parameters (DPs and TDPs) | min-max | 52 | 14.3 (46.1) | 9.5 (41.3) | 1.6 (88.7) | −3.8 (83.3)
PCA on F_V, F_AP, F_ML | z-score | 30 | 19.8 (51.6) | 15.4 (47.2) | 2.4 (89.5) | 2.0 (89.1)
PCA on F_V, F_AP, F_ML, COP_AP, COP_ML | z-score | 39 | 22.5 (54.3) | 19.4 (51.2) | 3.7 (90.8) | 1.9 (89.0)
PCA on z-standardized GRF parameters | z-score | 28 | 13.8 (45.6) | 8.8 (40.6) | 2.6 (89.7) | −0.6 (86.5)
PCA on min-max normalized GRF parameters | z-score | 28 | 13.5 (45.3) | 7.9 (39.7) | 2.8 (89.9) | 0.1 (87.2)

Note that the random baseline (RB) is stated next to the task name and that the values in the table represent the deviation from the random baseline (RB) and the corresponding absolute accuracy in brackets.

The first evaluated parameterization comprises 52 GRF parameters (DPs and TDPs) that are extracted from all five GRF input signals. Due to the strong variation in the parameters' value ranges, a suitable normalization of the data is essential. We evaluated min-max normalization as well as z-standardization. The use of z-standardization resulted in a slightly higher deviation from the RB for both tasks (except for the RBF SVM in task N/C/A/K/H) compared to min-max normalization. Furthermore, the RBF SVM failed to exceed the random baseline for both methods in the task N/GD.

The second parameterization was obtained by PCA of the raw GRF waveforms. PCA representations obtained solely from the three force components clearly outperform the GRF parameters (DPs and TDPs). Adding the COP measurements further improved the results for both tasks. Normalization of the PCA-based representations is crucial as performance otherwise drops significantly.

The third parameterization applied PCA on the z-standardized and min-max normalized DPs and TDPs. The dimensionality reduction resulted in a 28-dimensional vector which was also z-standardized prior to classification.
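The two normalizations evaluated above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names `z_standardize` and `min_max_normalize`, the example values, and the use of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of two parameters with very different value ranges
# (a force peak in body weights and a loading rate in N/s).
force_peak = [0.9, 1.0, 1.1, 1.2, 1.3]
loading_rate = [2000.0, 4000.0, 6000.0, 8000.0, 10000.0]

for name, column in [("force_peak", force_peak), ("loading_rate", loading_rate)]:
    print(name, "z-score:", [round(v, 2) for v in z_standardize(column)])
    print(name, "min-max:", [round(v, 2) for v in min_max_normalize(column)])
```

After either rescaling, both columns occupy comparable ranges, so no single parameter dominates a distance-based classifier merely because of its unit. Both helpers assume a non-constant column; a constant one would cause a division by zero.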
For this third parameterization, results for both normalizations (last two rows of Table III) improved for task N/GD compared to the representation with the original GRF parameters (first two rows of Table III). However, this is not the case for task N/C/A/K/H, where the deviation from the RB slightly decreased.

In summary, the best performance (marked in bold in Table III) was achieved by applying PCA to all five GRF signals. The linear SVM achieved the highest deviation from the RB (22.5%) for task N/C/A/K/H as well as for task N/GD (3.7%). Alternative classifiers which were also evaluated yielded a lower deviation from the RB: MLP 21.0% and k-NN 13.4% for task N/C/A/K/H and MLP 2.6% and k-NN 2.2% for task N/GD. In terms of accuracy and deviation from the RB, the linear SVM performed better in all experiments. The RBF SVM has an advantage solely in terms of runtime.

The main reason for the great difference in performance between the two tasks is the strong class imbalance in task N/GD, which makes this task particularly difficult to solve. One way of dealing with unbalanced datasets in SVMs is the use of different weights for different classes, thereby emphasizing the importance of the under-represented classes. Therefore, additional class-weighted experiments were performed. Results with different cost functions showed that no further performance increase can be achieved. The uniform cost function seems to work best on the data.

C. Discussion and Further Aspects

We presented a study on the classification of different functional gait disorders, stemming from a wide range of possible impairments, into categories that represent the main affected body region. The motivation for selecting these broad categories is that identifying the region of impairment is essential for clinical practice and may allow impairments to be pinpointed at an early stage.
In addition, it could indicate secondary impairments which may easily be overlooked by the physician during clinical examination.

TABLE IV
RESULTS (%) OF ANALYSES ASSESSING THE INFLUENCE OF SEVERAL FACTORS ON THE RESULTS OF THE TWO TASKS - N/C/A/K/H AND N/GD

Partitions of the dataset | N/C/A/K/H | N/GD
Sessions are balanced | 23.7 (60.2) | 20.6 (84.1)
Persons are balanced | 28.3 (59.5) | 5.3 (84.7)
Persons & sessions are balanced | 39.2 (59.2) | 35.4 (85.4)
Male population | 20.9 (51.3) | 0.6 (91.4)

The experiments are performed with a PCA on all five signals in combination with a linear SVM. Note that the values represent the deviation from the random baseline (RB) and the corresponding absolute accuracy in brackets.

The present study represents a first performance baseline for the classification of gait disorders. Results are particularly promising for task N/GD. However, an absolute classification accuracy of 91% still lies below an acceptable threshold for clinical practice. For the classification of individual disorder categories, the results indicate that further improvements are necessary. Even so, the proposed approach could already serve as a support for clinicians by indicating the presence of (additional) arthropathies or diseases. In order to reduce the classification complexity, while still providing support for clinicians, similar classes could be merged, i.e., the hip and knee classes into a thigh class and the ankle and calcaneus classes into a shank class. The results of this additional experiment showed a deviation from the RB of 26.8% (using a linear SVM, RB: 51%, absolute accuracy: 77.8%).
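How the reported figures relate to each other can be checked with a few lines of Python. The sketch assumes only what the table notes state, namely that each entry is the deviation from the random baseline (RB) followed by the absolute accuracy in brackets. The helper names `deviation_from_rb` and `implied_rb` are illustrative and do not appear in the study.

```python
def deviation_from_rb(accuracy, random_baseline):
    """Deviation from the random baseline in percentage points."""
    return round(accuracy - random_baseline, 1)


def implied_rb(accuracy, deviation):
    """Random baseline implied by an accuracy and its reported deviation."""
    return round(accuracy - deviation, 1)


# Best configuration in Table III (PCA on all five signals, linear SVM).
print(deviation_from_rb(54.3, 31.8))  # task N/C/A/K/H -> 22.5
print(deviation_from_rb(90.8, 87.1))  # task N/GD      -> 3.7

# Merged thigh/shank experiment reported in the text (RB: 51%).
print(deviation_from_rb(77.8, 51.0))  # -> 26.8

# Table IV, row "Persons & sessions are balanced".
print(implied_rb(59.2, 39.2))  # task N/C/A/K/H -> 20.0
print(implied_rb(85.4, 35.4))  # task N/GD      -> 50.0
```

The implied baselines of 20% and 50% in the fully balanced setting are consistent with chance level for a balanced five-class and a balanced two-class problem, respectively. This also shows that the RB differs between the partitions in Table IV, so deviations should be compared rather than absolute accuracies.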
Compared to the distinction of all five classes (N/C/A/K/H), the merged classes yield a clear increase in accuracy and deviation from the RB.

Different influencing factors, i.e., the imbalance of the class priors, the variability in the number of sessions per person, and gender-specific aspects, may introduce a bias into the aforementioned analyses. To investigate the effect of these factors on classification performance, we performed additional experiments. To this end, we used the best configuration found so far as a baseline, i.e., PCA on all five signals with a linear SVM (fourth row of Table III), and applied it to different balanced subsets of our dataset. The results are presented in Table IV and are discussed in the following.

For the experiments in Section III-B we decided to use all available sessions of patients recorded in the course of their rehabilitation to account for different progression stages of impairments. This, however, may introduce a bias in the experiments as more trials exist for some patients than for others. To evaluate to what extent the varying number of recorded sessions per patient influences the overall result, we balanced the dataset by selecting only one random session per person. Interestingly, the deviation from the RB improved for task N/C/A/K/H to 23.7% (+1.2%) and for task N/GD to 20.6% (+16.9%), as presented in the first row of Table IV. These results show that intra-patient variability needs to be taken into account and requires additional modeling in a classification approach.

Another factor causing an imbalance in the data is the difference in class cardinalities, i.e., different numbers of persons per class. In order to investigate the influence of this imbalance we performed an experiment for both tasks with a dataset containing the same number of participants per class (but keeping all sessions in the dataset).
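The two balancing steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the record layout (`person`, `session`, `label`), the function names, the assumption that each person belongs to exactly one class, and the fixed random seed are all choices made for the example.

```python
import random
from collections import defaultdict


def one_session_per_person(trials, rng):
    """Keep only the trials of one randomly chosen session per person."""
    sessions = defaultdict(set)
    for t in trials:
        sessions[t["person"]].add(t["session"])
    chosen = {p: rng.choice(sorted(s)) for p, s in sessions.items()}
    return [t for t in trials if t["session"] == chosen[t["person"]]]


def equal_persons_per_class(trials, rng):
    """Subsample persons so that every class has the same number of persons."""
    persons = defaultdict(set)
    for t in trials:
        persons[t["label"]].add(t["person"])
    n = min(len(p) for p in persons.values())
    keep = set()
    for label in sorted(persons):
        keep.update(rng.sample(sorted(persons[label]), n))
    return [t for t in trials if t["person"] in keep]


# Hypothetical toy data: two healthy controls (N) and three knee patients (K),
# one of whom (p3) was recorded in two sessions.
trials = [
    {"person": "p1", "session": 1, "label": "N"},
    {"person": "p2", "session": 1, "label": "N"},
    {"person": "p3", "session": 1, "label": "K"},
    {"person": "p3", "session": 2, "label": "K"},
    {"person": "p4", "session": 1, "label": "K"},
    {"person": "p5", "session": 1, "label": "K"},
]

rng = random.Random(0)
balanced = equal_persons_per_class(one_session_per_person(trials, rng), rng)
print(len(balanced), "trials remain after balancing sessions and persons")
```

Applying only the first function corresponds to the session-balanced experiment, only the second to the person-balanced experiment, and both together to the fully balanced dataset.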
In this person-balanced setting, the dataset for task N/C/A/K/H is composed of data from 62 persons from each class (overall 310 persons, 7616 trials). For task N/GD the balanced dataset contained data from 160 healthy controls and 160 persons with a deficit (40 from each GD class, overall 320 persons and 6096 trials). The deviation from the RB improved for task N/C/A/K/H to 28.3% (+5.8%) and for task N/GD to 5.3% (+1.6%), as shown in the second row of Table IV. Although the results show that balancing the number of patients among classes is beneficial, the results of task N/GD reveal the imbalance that still exists in the dataset (because healthy controls have only one session, whereas patients have up to several sessions).

The next question deals with the effect of balancing the number of patients and the number of sessions at the same time. We performed experiments with a completely balanced version of our dataset for each task, containing only one session per person and equal numbers of persons per class. For task N/C/A/K/H the balanced dataset is composed of data from 62 persons from each class (overall 310 persons, 2480 trials). For task N/GD the balanced dataset contained data from 160 healthy controls and 160 persons with a deficit (40 from each GD class, overall 320 persons and 2560 trials). The results of our experiments showed clear performance improvements of +16.7% in the deviation from the RB compared to the baseline for task N/C/A/K/H and +31% compared to the baseline for task N/GD (see the third row in Table IV).

Other biases in the data may be introduced by variations in gender, walking velocity, leg length and other parameters [33], leading to a variability of GRF parameterizations in the individual disorder classes. Additional normalization of the input data may be necessary to reduce intra-class variation and improve classification accuracy. Several studies have shown that gender in particular causes strong variability in gait signals [34], [35].
To assess the influence of gender on our results, an experiment was performed on a reduced dataset containing only data from male participants (note that the number of female participants in our dataset is not sufficient to perform separate experiments). Surprisingly, the results did not improve (see the last row in Table IV). This indicates that for our data, gender has rather little influence on the results, which, however, does not imply that the influence of gender can be neglected a priori. A detailed study on the influence of gender is subject to future investigation.

IV. CONCLUSIONS

The present study aimed at distinguishing patients with different orthopedic gait impairments at the hip, knee, ankle, and calcaneus from healthy controls using GRF measurements. For this purpose a dataset of 9,496 gait measurements from clinical practice was utilized. In a first step we investigated the suitability of state-of-the-art GRF parameterizations and analyzed their statistical properties and discriminative power among the classes. Based on these results, the use of entire GRF waveform parameterizations as input (such as PCA), rather than relying on GRF parameters (DPs and TDPs), seems advisable. Furthermore, the use of GRF force components paired with the respective COP measurements yielded the best results. Our experiments further showed that balancing the dataset significantly improves results (e.g., increasing the deviation from the random baseline by +16.7% for the classification into healthy controls and all four GD classes and by +31% for distinguishing between healthy controls and patients).

The presented study shows that results heavily depend on the employed GRF representation.
Future work will investigate and evaluate adaptively learned signal representations [36], [37] to obtain more discriminative and expressive parameterizations of GRF measurements. Furthermore, we will focus on establishing a large, open-source, and balanced dataset to foster further developments in this area. Our results thereby provide a first performance baseline for the classification of functional gait disorders and can serve as a reference for future improvements.

ACKNOWLEDGMENT

The authors want to thank Marianne Worisch and Szava Zoltán for their great assistance in data preparation and their great support in clinical and technical questions.

REFERENCES

[1] R. Baker, Measuring Walking: A Handbook of Clinical Gait Analysis. London, U.K.: Mac Keith Press, 2013.
[2] T. Chau, “A review of analytical techniques for gait data. Part 1: Fuzzy, statistical and fractal methods,” Gait Posture, vol. 13, no. 1, pp. 49–66, 2001.
[3] T. Chau, “A review of analytical techniques for gait data. Part 2: Neural network and wavelet methods,” Gait Posture, vol. 13, no. 2, pp. 102–120, 2001.
[4] C. A. Lozano-Ortiz, A. M. S. Muniz, and J. Nadal, “Human gait classification after lower limb fracture using artificial neural networks and principal component analysis,” in Proc. IEEE 2010 Annu. Int. Conf. Eng. Med. Biol. Soc., 2010, pp. 1413–1416.
[5] W. Zeng, F. Liu, Q. Wang, Y. Wang, L. Ma, and Y. Zhang, “Parkinson’s disease classification using gait analysis via deterministic learning,” Neurosci. Lett., vol. 633, pp. 268–278, 2016.
[6] A. Vieira et al., “Software for human gait analysis and classification,” in Proc. 2015 IEEE 4th Portuguese Meeting Bioeng., Porto, 2015, pp. 1–1.
[7] J. Wu, J. Wang, and L. Liu, “Feature extraction via KPCA for classification of gait patterns,” Hum. Movement Sci., vol. 26, no. 3, pp. 393–411, 2007.
[8] J. Wu and J. Wang, “PCA-based SVM for automatic recognition of gait patterns,” J. Appl. Biomech., vol. 24, no. 1, pp. 83–87, 2008.
[9] P. Levinger, D. T. H. Lai, R. K. Begg, K. E. Webster, and J. A. Feller, “The application of support vector machines for detecting recovery from knee replacement surgery using spatio-temporal gait parameters,” Gait Posture, vol. 29, no. 1, pp. 91–96, 2009.
[10] N. Mezghani et al., “Automatic classification of asymptomatic and osteoarthritis knee gait patterns using kinematic data features and the nearest neighbor classifier,” IEEE Trans. Biomed. Eng., vol. 55, no. 3, pp. 1230–1232, Mar. 2008.
[11] M. Alaqtash, T. Sarkodie-Gyan, H. Yu, O. Fuentes, R. Brower, and A. Abdelgawad, “Automatic classification of pathological gait patterns using ground reaction forces and machine learning algorithms,” in Proc. IEEE 2011 Annu. Int. Conf. Eng. Med. Biol. Soc., 2011, pp. 453–457.
[12] M. Ferrarin et al., “Gait pattern classification in children with Charcot–Marie–Tooth disease type 1A,” Gait Posture, vol. 35, no. 1, pp. 131–137, 2012.
[13] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[14] G. Giakas and V. Baltzopoulos, “Time and frequency domain analysis of ground reaction forces during walking: An investigation of variability and symmetry,” Gait Posture, vol. 5, no. 3, pp. 189–197, 1997.
[15] R. Lafuente, J. M. Belda, J. Sánchez-Lacuesta, C. Soler, and J. Prat, “Design and test of neural networks and statistical classifiers in computer-aided movement analysis: A case study on gait analysis,” Clin. Biomech., vol. 13, no. 3, pp. 216–229, 1998.
[16] D. P. Soares, M. P. de Castro, E. A. Mendes, and L. Machado, “Principal component analysis in ground reaction forces and center of pressure gait waveforms of people with transfemoral amputation,” Prosthetics Orthotics Int., vol. 40, no. 6, pp. 729–738, 2016.
[17] A. M. S. Muniz and J. Nadal, “Application of principal component analysis in vertical ground reaction force to discriminate normal and abnormal gait,” Gait Posture, vol. 29, no. 1, pp. 31–35, 2009.
[18] Z. Peng, C. Cao, Q. Liu, and W. Pan, “Human walking pattern recognition based on KPCA and SVM with ground reflex pressure signal,” Math. Probl. Eng., vol. 2013, 2013, Art. no. 143435.
[19] Y. Xu, D. Zhang, Z. Jin, M. Li, and J. Y. Yang, “A fast kernel-based nonlinear discriminant analysis for multi-class problems,” Pattern Recognit., vol. 39, no. 6, pp. 1026–1033, 2006.
[20] R. LeMoyne, W. Kerr, T. Mastroianni, and A. Hessel, “Implementation of machine learning for classifying hemiplegic gait disparity through use of a force plate,” in Proc. IEEE 2014 13th Int. Conf. Mach. Learn. Appl., 2014, pp. 379–382.
[21] G. Williams, D. Lai, A. Schache, and M. E. Morris, “Classification of gait disorders following traumatic brain injury,” J. Head Trauma Rehabil., vol. 30, no. 2, pp. E13–E23, 2015.
[22] W. I. Schöllhorn, B. M. Nigg, D. J. Stefanyshyn, and W. Liu, “Identification of individual walking patterns using time discrete and time continuous data sets,” Gait Posture, vol. 15, no. 2, pp. 180–186, 2002.
[23] K. L. Goh, K. H. Lim, A. A. Gopalai, and Y. Z. Chong, “Multilayer perceptron neural network classification for human vertical ground reaction forces,” in Proc. 2014 IEEE Conf. Biomed. Eng. Sci., 2014, pp. 536–540.
[24] M. Köhle and D. Merkl, “Analyzing human gait patterns for malfunction detection,” in Proc. 2000 ACM Symp. Appl. Comput., 2000, vol. 1, pp. 41–45.
[25] M. Köhle and D. Merkl, “Identification of gait patterns with self-organizing maps based on ground reaction force,” in Proc. Eur. Symp. Artif. Neural Netw., 1996, vol. 96, pp. 24–26.
[26] K. C. Moisio, D. R. Sumner, S. Shott, and D. E. Hurwitz, “Normalization of joint moments during gait: A comparison of two techniques,” J. Biomech., vol. 36, no. 4, pp. 599–603, 2003.
[27] I. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[28] D. Slijepcevic et al., “Ground reaction force measurements for gait classification tasks: Effects of different PCA-based representations,” Gait Posture, vol. 57, pp. 4–5, 2017.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York, NY, USA: Wiley, 2012.
[30] C. M. De Vries, S. Geva, and A. Trotman, “Document clustering evaluation: Divergence from a random baseline,” in Workshop Information Retrieval 2012 (IR-2012), Dortmund, Germany, 2012.
[31] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, pp. 27:1–27:27, 2011.
[32] D. Janssen, W. I. Schöllhorn, K. M. Newell, J. M. Jäger, F. Rost, and K. Vehof, “Diagnosing fatigue in gait patterns by support vector machines and self-organizing maps,” Hum. Movement Sci., vol. 30, no. 5, pp. 966–975, 2011.
[33] M. R. Pierrynowski and V. Galea, “Enhancing the ability of gait analyses to differentiate between groups: Scaling gait data to body size,” Gait Posture, vol. 13, no. 3, pp. 193–201, 2001.
[34] B. M. Eskofier, M. Kraus, J. T. Worobets, D. J. Stefanyshyn, and B. M. Nigg, “Pattern classification of kinematic and kinetic running data to distinguish gender, shod/barefoot and injury groups with feature ranking,” Comput. Methods Biomech. Biomed. Eng., vol. 15, no. 5, pp. 467–474, 2012.
[35] M. C. Chiu and M. J. Wang, “The effect of gait speed and gender on perceived exertion, muscle activity, joint motion of lower extremity, ground reaction force and heart rate during normal walking,” Gait Posture, vol. 25, no. 3, pp. 385–392, 2007.
[36] Y. Zhang, P. O. Ogunbona, W. Li, B. Munro, and G. G. Wallace, “Pathological gait detection of Parkinson’s disease using sparse representation,” in Proc. IEEE 2013 Int. Conf. Digit. Image Comput., Techn. Appl., 2013, pp. 1–8.
[37] J. Hannink, T. Kautz, C. F. Pasluosta, K. G. Gaßmann, J. Klucken, and B. M. Eskofier, “Sensor-based gait parameter extraction with deep convolutional neural networks,” IEEE J. Biomed. Health Informat., vol. 21, no. 1, pp. 85–93, Jan. 2017.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, DECEMBER 2017

Supplementary Material for: Automatic Classification of Functional Gait Disorders

Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, Brian Horsak

Fig. S1. Mean pattern of the three ground reaction forces (GRF) enveloped by ± 1 standard deviation for each class. Panels: (a) F_V, (b) F_AP, (c) F_ML. Data normalized by body weight and 100% stance.

Fig. S2. Comparison of different PCA representations. Panels: (a) F_V, (b) F_AP, (c) F_ML. The final dimensionality of the obtained representations is specified by the amount of variance preserved in a particular projection, i.e., 98%, 95%, and 90%. Data normalized by body weight and 100% stance.
[Figure S3: one boxplot panel per parameter, each showing the classes normal, calcaneus, ankle, knee, and hip. Panels (parameter [unit]): ST [s]; F_V1, F_V2, F_V3 [BW]; T_V1, T_V2, T_V3 [%ST]; F_AP1, F_AP2, F_AP3 [BW]; T_AP1, T_AP2, T_AP3 [%ST]; F_ML1, F_ML2, F_ML3 [BW]; T_ML1, T_ML2, T_ML3 [%ST]; F_VAVG, F_APAVG, F_MLAVG [BW]; IF_V, IF_AP, IF_ML, IF_V1, IF_V2, IF_V3, IF_APDEC, IF_APACC, IF_LAT, IF_MED [%BW·s]; COPANG [deg]; COPDEV, COP_AP, COP_ML [FL]; COPV [FL/s]; DECT, ACCT [s]; LR0080, LR2080, UR8000, UR8020 [N/s]; DS [s]; STEPLEN [m]; STEPV [km/h]; STRIDET [s]; BF [Hz]; CAD [1/min]; STEPWD [m]; STRLEN [m]; GV [km/h].]

Fig. S3. Boxplots for all 52 GRF parameters. Each boxplot shows the median and the IQR (box) for each class (outliers were removed for better visualization). Box-whiskers correspond to 1.5 of the box-length, thus show approximately ± 2.7 standard deviations. The overlap of distributions between the classes gives an impression of the parameters’ discriminative power (inter-class variation).
Data normalized by body weight and 100% stance, prior to the calculation of the parameters.

2.3 Input Representations and Classification Strategies for Automated Human Gait Analysis

Djordje Slijepcevic, Matthias Zeppelzauer, Caterine Schwab, Anna-Maria Raberger, Christian Breiteneder, and Brian Horsak. Input Representations and Classification Strategies for Automated Human Gait Analysis. Gait & Posture, 76:198–203, 2020. DOI: 10.1016/j.gaitpost.2019.10.021

The final version of this publication is available at: https://doi.org/10.1016/j.gaitpost.2019.10.021. Permission for reprint granted, © 2019 Slijepcevic

Input representations and classification strategies for automated human gait analysis

Djordje Slijepcevic (a,*), Matthias Zeppelzauer (a), Caterine Schwab (b), Anna-Maria Raberger (b), Christian Breiteneder (c), Brian Horsak (b)

(a) St. Pölten University of Applied Sciences, Institute for Creative Media Technologies, St. Pölten, Austria
(b) St. Pölten University of Applied Sciences, Institute of Health Sciences, St. Pölten, Austria
(c) TU Wien, Institute of Visual Computing and Human-Centered Technology, Vienna, Austria

ARTICLE INFO

Keywords: Ground reaction force; Gait classification; Machine learning; Gait disorders; Support vector machine

ABSTRACT

Background: Quantitative gait analysis produces a vast amount of data, which can be difficult to analyze. Automated gait classification based on machine learning techniques bears the potential to support clinicians in comprehending these complex data. Even though these techniques are already frequently used in the scientific community, there is no clear consensus on how the data need to be preprocessed and arranged to assure optimal classification accuracy outcomes.
Research question: Is there an optimal data aggregation and preprocessing workflow for maximizing classification accuracy?

Methods: Based on our previous work on automated classification of ground reaction force (GRF) data, a sequential setup was followed: firstly, several aggregation methods (early fusion and late fusion) were compared, and secondly, based on the best aggregation method identified, the expressiveness of different combinations of signal representations was investigated. The employed dataset included data from 910 subjects, with four gait disorder classes and one healthy control group. The machine learning pipeline comprised principal component analysis (PCA), z-standardization and a support vector machine (SVM).

Results: The late fusion aggregation, i.e., utilizing majority voting on the classifier's predictions, performed best. In addition, the use of derived signal representations (relative changes and signal differences) seems to be advantageous as well.

Significance: Our results indicate that great caution is needed when data preprocessing and aggregation methods are selected, as these can have an impact on classification accuracies. These results shall serve future studies as a guideline for the choice of data aggregation and preprocessing techniques to be employed.

1. Introduction

Gait disorders can affect anyone, regardless of age, and often impede an individual's ability to participate in daily living activities such as walking and might even reduce movement efficiency in terms of energy consumption [1,2]. Gait analysis based on ground reaction force (GRF) assessment is a well-established method to diagnose the mechanisms that underlie gait disorders. The quantitative analysis of such data can provide relevant information for clinicians in diagnosing gait impairments, planning therapies and surgeries, supporting rehabilitation processes, or evaluating treatment outcomes [3].
However, quantitative gait analysis produces a vast amount of data, which are difficult to comprehend and analyze due to their high dimensionality, temporal dependencies, strong variability, non-linear relationships, and inter-correlations [4]. Therefore, there is growing interest in employing machine learning techniques that allow for a cost-effective, fast and objective analysis of large amounts of gait measurements. Recently, automated gait classification has been successfully used for various patient groups [5] affected by stroke [6], Parkinson's disease [7], cerebral palsy [8], multiple sclerosis [9], osteoarthritis [10], or by age-related impairments [11].

https://doi.org/10.1016/j.gaitpost.2019.10.021
Received 22 December 2018; Received in revised form 8 October 2019; Accepted 14 October 2019
* Corresponding author at: St. Pölten University of Applied Sciences, Matthias Corvinus-Straße 15, 3100 St. Pölten, Austria. E-mail address: djordje.slijepcevic@fhstp.ac.at (D. Slijepcevic).
Gait & Posture 76 (2020) 198–203
0966-6362/ © 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

Automated classification of gait is, however, a complex task consisting of many different processing steps which have to be carried out in a methodically correct way and for which various approaches exist. According to Figueiredo et al. [5] gait pattern recognition comprises the following main steps: (1) feature extraction, (2) feature normalization, (3) feature selection, (4) forming a training and a testing dataset, (5) training a classification model, and (6) evaluating the performance. To date, there is no clear consensus on how to proceed in each of these steps. For steps (2) to (6), the systematic review by Figueiredo et al. [5] might serve as a first guideline.
For the first step of feature extraction, a variety of options can be found in the current literature, but so far no clear recommendation can be derived. However, different approaches in feature extraction might significantly effect classification accuracies. Firstly, in the literature on gait classification, several recorded trials of a subject are usually either averaged to a single waveform [12–14] or all available trials are provided to a classifier [15,16,10]. To date, it is unclear which of these data aggregation strategies serves best for gait classification. Recently, a statistical method based on the notion of depth was suggested which identifies the most representative trial [17]. This approach, however, has not been employed in the gait classification community yet. Secondly, there is no clear consensus on how the raw signals should be preprocessed and transformed to form an appropriate input feature vector for the machine learning algorithm. Regarding the available input data (ground reaction force (GRF) and center of pressure (COP) components), it is still unclear which form of representation (i.e. raw data, relative changes, or signal differences) is best suited for machine learning. Based on our earlier work [18] the two primary aims of this article are: to (i) evaluate the effects of different data aggregation methods on gait classification performance and to (ii) investigate which input representations and combinations of representations perform best for automated gait classification. To facilitate the comparability of machine learning approaches and to optimize performance, it is critical to identify best practice procedures for the individual steps of gait classification. The results of this article shall serve future studies as a guideline on machine learning for gait analysis. 2. Methods 2.1. 
Patients and dataset The anonymized data used in this study are part of an existing clinical gait database maintained by a rehabilitation center of the Austrian Workers’ Compensation Board (AUVA). The AUVA is the social insurance for occupational risks for more than 3.3 million employees and 1.4 million pupils and students in Austria. This retrospective study was approved by the local Ethics Committee of Lower Austria (GS1-EK- 4/299-2014). The dataset utilized comprises GRF measurements from 728 pa- tients with gait disorders (GD) and data from 182 healthy controls, both of various physical composition and gender (see Table 1).The dataset is balanced regarding the number of persons per class, the number of recorded sessions per person and the number of trials per person. The dataset includes gait disorders associated with the calcaneus (n = 182), ankle (n = 182), knee (n = 182), and hip (n = 182). A well-experi- enced physical therapist (with more than a decade of clinical experi- ence) has manually labeled the dataset based on the available medical diagnosis of each patient. The individual GD classes include patients after joint replacement surgery, fractures, ligament ruptures, and re- lated disorders associated with the above-mentioned anatomical areas. The most common injuries present in the hip class are fractures of the pelvis and thigh as well as luxation of the hip joint, coxarthrosis, and total hip replacement. The knee class comprises patients after patella, femur or tibia fractures, ruptures of the cruciate or collateral ligaments or the meniscus, and total knee replacements. The ankle class includes patients after fractures of the malleoli, talus, tibia or lower leg, and ruptures of ligaments or the Achilles tendon. The calcaneus class comprises patients after calcaneus fractures or ankle fusion surgery. All of the injuries mentioned above may occur individually or in combi- nations within each class. 2.2. 
Data recording and preprocessing

Gait analysis was performed on a 10 m walkway with two centrally embedded force plates (Kistler, Type 9281B12). The force plates were placed in consecutive order, allowing a person to walk across by placing one foot on each plate. Both plates were flush with the ground and covered with the same walkway surface material, so that targeting was not an issue. During a session, participants walked unassisted and without a walking aid at a self-selected walking speed until a minimum of eight valid recordings were available.

All processing steps and subsequent analyses were performed in Matlab 2017b (The MathWorks Inc., Natick, MA, USA). The three analog GRF signals, as well as the two COP signals, were converted to digital signals using a sampling rate of 2000 Hz and a 12-bit analog-digital converter (DT3010, Data Translation Incorporation, Marlboro, MA, USA) with a signal input range of ±10 V. A threshold of 10 N was used for step detection and 30 N for COP calculation. Raw signals were filtered using a 2nd order low-pass Butterworth filter with a cut-off frequency of 20 Hz. All gait measurements were time-normalized to 1000 points (100% stance). Amplitude values of the three force components, i.e., vertical (V), medio-lateral (ML), and anterior–posterior (AP), were expressed as a multiple of body weight by dividing the force by the product of body mass and acceleration due to gravity.

2.3. Gait classification

The present paper builds upon the general gait classification pipeline established by Slijepcevic et al. [18] and uses it as a baseline for the performed experiments. A schematic illustration of the pipeline is shown in Fig. 1. In a first step, Principal Component Analysis (PCA) is applied to the raw input data, i.e.
to each input representation separately (feature extraction; see footnote 1). Next, the resulting features, i.e., the principal components that retain 98% of the overall variance in the input data, are concatenated and z-standardized (feature normalization). The features are provided to a classifier which is trained and evaluated in a cross-validation manner. With the best parameters found during cross-validation, the model is trained on the entire training set. To assess the generalizability of this model, we evaluated it on a completely independent and unseen dataset (see Figure S1 in the supplementary material). As demonstrated in [18], Support Vector Machines (SVM) are a suitable classifier for gait data, outperforming several competitors, e.g., multi-layer perceptrons and the k-nearest neighbors algorithm. The SVM is trained in a multi-class fashion using a one-vs-one strategy.

Footnote 1: For each original input signal and derived representation a PCA is performed on a matrix of size 334 (samples) × t (trials), where t depends on the considered dataset.

Table 1: Details on the dataset employed, the demography of the participants and the pre-defined classes.

| Class | n | Age (yrs.), Mean (SD) | Body mass (kg), Mean (SD) | Sex (m/f) | Num. trials |
|---|---|---|---|---|---|
| Healthy controls | 182 | 34.3 (14.0) | 74.6 (15.8) | 94/88 | 1,456 |
| Calcaneus | 182 | 44.3 (10.5) | 86.3 (16.4) | 167/15 | 1,456 |
| Ankle | 182 | 40.6 (10.9) | 88.3 (18.2) | 151/31 | 1,456 |
| Knee | 182 | 40.4 (12.3) | 86.2 (20.3) | 133/49 | 1,456 |
| Hip | 182 | 40.6 (12.8) | 81.5 (15.0) | 153/29 | 1,456 |
| Total | 910 | 40.0 (12.1) | 83.4 (17.1) | 698/212 | 7,280 |

D. Slijepcevic, et al. Gait & Posture 76 (2020) 198–203

2.3.1. Data aggregation methods

Usually, several trials per person are recorded during gait analysis. Thus, the question arises whether and how the information from these different trials can be aggregated. Such an aggregation step could be implemented in an early fusion or a late fusion manner (see Fig. 1).
The former directly affects the input data and thus precedes the feature extraction step, whereas the latter is applied directly to the classifier's predictions and mostly affects the classification scheme.

Popular early fusion approaches include: (i) mean waveform, (ii) median waveform, and (iii) the most representative trial. The mean waveform approach consists of averaging each measurement from a session (in this case, eight trials) pointwise. The resulting waveform should provide a more robust representation than the original signals by removing inter-trial variations and retaining the overall characteristic shape. The median waveform approach is similar but utilizes the point-wise median instead. It is more robust to outliers but may generate less smooth waveforms than the mean waveform approach. Both approaches could diminish informative waveform characteristics, or even cause artifacts that provide a distorted representation [19]. To overcome this problem, Sangeux et al. [17] proposed a statistical method to determine the most representative trial. This approach ensures that original measurement data are used. For machine learning, however, performance might be affected by the fact that not all available and potentially essential information is considered. A schematic illustration of the early fusion approaches is given in Fig. 2.

The late fusion approach utilizes all available original trials for the training of the model. As a result, the classifier returns one prediction per trial. These predictions are considered weak because they are based on individual measurements. The late fusion approach combines these weak predictions into a strong prediction. A robust approach for the combination of several predictions is majority voting. The majority vote is calculated based on the statistical mode, which returns the element (class label) that occurs most often in a set of predictions.
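The aggregation strategies described above can be illustrated with a minimal Python sketch (the study itself used Matlab and LIBSVM). The function names and the dictionary-based representation of per-trial class probabilities are assumptions made for illustration; the 0.4 threshold corresponds to the 40% likelihood criterion that the text applies to majority voting, and the tie-breaking rule (first label encountered) is likewise an assumption.

```python
from collections import Counter
from statistics import mean, median


def mean_waveform(trials):
    """Early fusion: point-wise mean over equally long trials."""
    return [mean(samples) for samples in zip(*trials)]


def median_waveform(trials):
    """Early fusion: point-wise median over equally long trials."""
    return [median(samples) for samples in zip(*trials)]


def majority_vote(trial_probabilities, min_confidence=0.4):
    """Late fusion: combine per-trial predictions into one subject-level label.

    trial_probabilities: one dict per trial, mapping class label -> probability.
    Trials whose most likely class does not exceed min_confidence are ignored.
    Returns None if no trial passes the threshold. Ties are resolved in favour
    of the label that was voted for first (an assumption of this sketch).
    """
    votes = []
    for probabilities in trial_probabilities:
        label, probability = max(probabilities.items(), key=lambda item: item[1])
        if probability > min_confidence:
            votes.append(label)
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```

With eight trials per subject, `mean_waveform` and `median_waveform` would each reduce the eight waveforms to one input per subject before feature extraction, whereas `majority_vote` would be applied to the eight per-trial classifier outputs.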
For majority voting, only predictions with a likelihood of more than 40% for one of the five classes are used. Thereby, the negative influence of ambiguous trials is reduced. A schematic illustration of the late fusion approach is presented in Fig. 2.

To provide a baseline without aggregation of the available data, we employ all eight trials per person individually during training and testing. Thus, each trial was predicted separately, and the information about the membership of the trial to a specific person was not utilized.

Fig. 1. Illustration of the employed gait classification framework. The dataset consisted of a training set (blue, dashed) and an independent test set (orange, solid). The latter was used to evaluate the generalizability of our classification.

Fig. 2. (a) Schematic for the early fusion aggregation, i.e., mean, median, and most representative trial (MRT) approaches. Prior to training, the eight signals of one subject are aggregated by calculating a mean or median waveform, respectively, or one trial is selected by MRT. (b) Schematic for the late fusion aggregation, which employs majority voting. For the training of an SVM, all recorded trials of the subjects are used. For the actual prediction of the test set, majority voting is applied to obtain a decision at subject level.

2.3.2. Input representations

We further investigate the expressiveness and suitability of different input representations for gait classification. Two different types of input representations are distinguished here: original input signals and derived signals. Original input signals comprise the time and body weight normalized waveforms, i.e., FV, FAP, FML, COPAP, and COPML components of the affected (A) and unaffected (U) lower extremity.
The affected and unaffected body sides were defined by the physical therapist during data annotation. In the case of healthy controls or bilaterally affected patients, the affected side was chosen randomly to avoid a bias.

The derived signal representations are calculated based on the original input signals. Two types of derived signals are investigated: the approximate first derivative (DA, DU) of each original input signal and the absolute difference between the input signals of the affected and unaffected lower extremity (Δ). Furthermore, the expressive power of different combinations of the individual signal representations is examined, i.e., the combination of the original input signals and the derived representations of the affected and unaffected sides.

2.4. Experimental setup

Prior to the experiments, the dataset was randomly divided into a training set (65%) and an independent test set (35%), see Fig. 1. This split remained unchanged for all experiments. The classification experiments utilized a probabilistic SVM with a linear kernel (provided by the LIBSVM library [20]). For hyper-parameter selection, a grid search over the regularization parameter C ∈ [2^−5; 2^10] was employed. During the grid search, a five-fold cross-validation was performed on the training set. After hyper-parameter selection, an SVM with the best parameters was trained on the entire training set. To assess the generalizability of the methods, the test set was divided into three equally large and balanced test splits, on which we evaluated the SVM. By using multiple splits, it was possible to estimate not only the generalization ability but also the expected variation in performance for different subsets of test samples. The evaluation was conducted by calculating four performance measures, i.e.
classification accuracy (Acc), precision (P), recall (R), and F1-score (F1), defined in terms of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

Furthermore, a sequential setup was followed: first, different aggregation methods were examined, and second, based on the best aggregation method, the expressiveness of different (combinations of) signal representations was investigated. All results are reported as mean (SD), unless otherwise stated.

3. Results

The results of the first experiment investigating different aggregation methods over several trials are summarized in Table 2. The performances of the five-fold cross-validation and the evaluation on the independent test set showed similar trends. This demonstrates the generalization ability of our method. In the following, we discuss the results of the independent test set, which are more objective than the results on the training set. The baseline approach, where all available trials are employed (without aggregation), yielded an accuracy of 56.5% (2.3) (RB: 20%; see footnote 2). The use of the median waveform and MRT did not outperform the baseline performance. Within the group of early fusion approaches, the mean waveform approach showed the greatest improvement with an accuracy of 58.9% (2.8). The late fusion approach, i.e., majority voting, achieved the highest absolute scores in all performance measures (although not statistically significant in this experiment).

The results of the second experiment, in which we investigated the expressiveness of different signal representations on the independent test set, are presented in Table 3 (further performance measures can be found in the supplementary material). The results obtained during the five-fold cross-validation follow a trend similar to that of the independent test set and are presented in the supplementary material.
The first column in Table 3 indicates which components were used in each experiment: (1) each input signal separately (first five rows), (2) the combination of all three GRF components (row 6), (3) the combination of both COP components (row 7), and (4) the combination of all signals (GRF + COP, last row). For each of these selections, the columns show which (combinations of) derived representations were employed for both affected (A) and unaffected (U) sides. Most notably, column {A, Δ, DA} shows the highest performance for most input configurations (in five of the eight rows), including the overall best result with a classification accuracy of 62% (GRF + COP). For the FML component (row three), the combination {A, DA} provides the best result. The combination {A, DA, U, DU} provides the best results for the COPML component (row four) and the combination of both COP components (row seven). The comparison between the individual GRF and COP components (first five rows) and the three combinations, GRF, COP, and GRF + COP (last three rows), indicates that the combination of all components (last row) performed best. Furthermore, incorporating the information from both legs (via Δ) as well as using the first derivative (in particular of the affected leg) proves beneficial.

4. Discussion

From the first experiment (see Table 2) we observe that the performances achieved by all approaches are higher for the test set than for the training set. The reason for this is that for experiments on the test set the SVM was trained on the entire training data with the optimal parameters determined during cross-validation and grid search. The improved results on the test set show that additional training data are beneficial for the classifier. Furthermore, the first experiment indicates that the inclusion of membership information can be beneficial. Two aggregation methods, i.e.
the mean waveform approach and majority voting, achieved an improvement compared to the baseline where no aggregation was performed. Specifically, the late fusion approach, i.e., majority voting, achieved better results in absolute scores than the early fusion approaches.

Table 2: Classification results (%) of the experiment investigating different aggregation methods over several trials (RB: 20%). Columns 2 to 5 report the five-fold cross-validation on the training set (CV), columns 6 to 9 the independent test set (Test). The highest result in each column is marked with *.

| Trial selection | CV Acc | CV P | CV R | CV F1 | Test Acc | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|---|---|
| Baseline without aggregation | 52.0 (2.0) | 51.8 (1.8) | 52.7 (2.6) | 51.4 (1.7) | 56.5 (2.3) | 56.2 (2.7) | 56.6 (2.3) | 56.0 (2.6) |
| Mean waveform approach | 53.9 (5.2) | 53.5 (6.5) | 54.4 (5.9) | 52.7 (5.9) | 58.9 (2.8) | 59.5 (1.0) | 58.6 (2.2) | 58.5 (1.8) |
| Median waveform approach | 51.4 (3.3) | 51.1 (3.7) | 52.0 (3.7) | 50.3 (3.8) | 56.7 (3.5) | 57.9 (2.7) | 56.9 (2.9) | 56.4 (2.6) |
| Most representative trial (MRT) | 50.1 (2.0) | 50.4 (2.1) | 50.7 (2.4) | 49.0 (2.0) | 56.9 (5.3) | 57.7 (4.6) | 57.0 (4.3) | 56.5 (4.8) |
| Majority voting | 55.5 (2.4)* | 54.4 (2.5)* | 56.6 (2.3)* | 54.3 (2.5)* | 61.0 (2.4)* | 60.9 (2.9)* | 61.1 (2.4)* | 60.1 (2.7)* |

Footnote 2: RB refers to the analytical “random baseline” and represents the theoretical accuracy obtained when assigning class labels randomly, i.e., the case where nothing is learned from the data. For a balanced dataset the analytical RB is the reciprocal of the number of classes, i.e., 20% in our case. The empirically estimated RB according to [21], which further takes the sample size into account, is approximately 26% in our case. Every increase over the RB means that the underlying model has learned something from the data.

To evaluate the robustness of our approach in more detail, we repeated the experiment with 10 different (randomly selected) train-test splits. The results are presented in the supplementary material in Table S1 due to space limitations.
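As a rough illustration of such a repetition, the following Python sketch draws several random 65%/35% splits. It assumes that the split is made at subject level and stratified by class so that every split stays balanced; the text only states that the splits were selected randomly, so both choices, as well as the function names and the use of the repetition index as random seed, are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.65, seed=0):
    """Split subjects into train and test sets, class by class.

    labels: dict mapping subject id -> class label.
    Returns (train_ids, test_ids); all trials of a subject stay together.
    """
    rng = random.Random(seed)
    subjects_by_class = defaultdict(list)
    for subject, label in sorted(labels.items()):
        subjects_by_class[label].append(subject)

    train_ids, test_ids = [], []
    for label in sorted(subjects_by_class):
        subjects = subjects_by_class[label]
        rng.shuffle(subjects)
        cut = round(train_fraction * len(subjects))
        train_ids.extend(subjects[:cut])
        test_ids.extend(subjects[cut:])
    return train_ids, test_ids


def repeated_splits(labels, repetitions=10, train_fraction=0.65):
    """Generate one stratified split per repetition, seeded by its index."""
    return [stratified_split(labels, train_fraction, seed)
            for seed in range(repetitions)]
```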
For all 10 repetitions, the previously determined optimal SVM parameters (obtained from grid search on the original train-test split) remained unchanged to avoid overfitting. Additional statistical comparisons on the F1-scores from Table S1 in the supplementary material revealed that majority voting and the mean waveform approach significantly outperformed all other methods (see supplementary material for details). In absolute numbers and on average, majority voting showed the best performance results. We assume that this is because in early fusion large parts of the available input information are removed at an early stage and are not available during the training process. For late fusion, this is not the case. Furthermore, a comparison of the baseline method and the late fusion approach revealed that the aggregation of weak predictions by majority voting allows for a more accurate prediction at subject level. Majority voting adds a layer of abstraction to the outputs of the classifier, which seems to increase robustness. The performance level of the results in Table S1 (supplementary material) is equivalent to that of Table 2. This shows that the employed training-test split does not bias the test result in Table 2.

The conclusion from the first experiment is that as much information as possible should be retained during the classification process and thus late fusion is recommended. Aggregation of information at later stages of the process seems to be superior to aggregation at an early stage, where relevant information of the individual trials is lost.

The second experiment suggests that using only the original input signals might not always be the best choice. In most of our experiments, a combined representation of input signals and derived representations was advantageous, especially the combinations {A, Δ, DA} and {A, DA, U, DU} in Table 3. Considerably lower accuracy was achieved when only the individual signals (first five rows in Table 3) were used.
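How such a combined representation might be assembled is sketched below in Python. The text speaks only of an "approximate first derivative", so the forward difference used here is an assumption, as are the function names and the key "Delta" for Δ. The sketch also omits the PCA step that the pipeline applies to each representation separately before concatenation.

```python
def first_derivative(signal):
    """Approximate the first derivative by forward differences (assumed scheme)."""
    return [b - a for a, b in zip(signal, signal[1:])]


def absolute_difference(affected, unaffected):
    """Delta waveform: point-wise absolute difference between both sides."""
    return [abs(a - u) for a, u in zip(affected, unaffected)]


def build_representation(affected, unaffected, parts=("A", "Delta", "DA")):
    """Concatenate the selected representations into one feature vector."""
    available = {
        "A": affected,
        "U": unaffected,
        "DA": first_derivative(affected),
        "DU": first_derivative(unaffected),
        "Delta": absolute_difference(affected, unaffected),
    }
    features = []
    for name in parts:
        features.extend(available[name])
    return features
```

Calling `build_representation(a, u, parts=("A", "DA", "U", "DU"))` would yield the second combination named above.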
The use of a single COP signal (rows 4 and 5) led to degeneration of the classifier in some cases, i.e., one class could not be modeled at all by the classifier. The combination of the three GRF components is considerably more expressive than the combination of the COP components. The best choice seems to be a combination of all signals (GRF + COP). This also supports our previous findings [16,18,22].

We further observed that the signals of the affected side are more expressive than those of the unaffected side ({A} vs. {U} in Table 3). This observation contradicts the findings of Williams et al. [23]. The combination of affected and unaffected input signals improved the results in five out of eight cases.

The Δ-waveform represents the difference between the affected and the unaffected side and thus explicitly captures the symmetry between both sides. When combined with the signals of the affected side, a moderate increase in accuracy was present in three of eight cases ({A} vs. {A, Δ}). This result suggests that the classifier is able to derive symmetry-related information also from the raw input signals and does not necessarily need it to be explicitly provided. For the unaffected side, the Δ-waveform provides an improvement in seven of eight cases ({U} vs. {U, Δ}). Therefore, the Δ-waveform seems to carry important information. Adding the first derivative as an additional input representation to the signals of the affected or unaffected side showed improvements in 30 out of 40 cases (evident by comparing the first five columns with the last five columns in Table 3).

To obtain additional indicators for the usefulness of the representations, we conducted further experiments with the overall best input representation (GRF + COP, last row in Table 3). We calculated all (2^5 − 1 = 31) possible combinations of A, DA, U, DU and Δ for the case GRF + COP and examined how often each representation occurs within the best 10 results.
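The enumeration and the subsequent count can be sketched in Python as follows. The function names and the key "Delta" for Δ are assumptions, and the accuracies passed to `count_in_top` are expected to come from the actual experiments; none are reproduced here.

```python
from collections import Counter
from itertools import combinations

REPRESENTATIONS = ("A", "DA", "U", "DU", "Delta")


def all_combinations(items=REPRESENTATIONS):
    """Enumerate every non-empty subset of the given representations."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


def count_in_top(results, top_n=10):
    """Count how often each representation occurs among the best results.

    results: dict mapping a combination (tuple of names) -> accuracy.
    """
    best = sorted(results, key=results.get, reverse=True)[:top_n]
    counts = Counter()
    for combo in best:
        counts.update(combo)
    return counts
```

For five representations, `all_combinations()` returns the 31 non-empty subsets mentioned in the text.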
The most useful representations seem to be DA (contained in 8 of the 10 best results) as well as Δ and A (each contained in 6 of the 10 best results). DU (5/10 results) and U (4/10 results) seem less important.

The overall recommendation that can be derived from these experiments is that the combination of more input signals and input representations (even when they contain redundant information) can lead to better results. This is especially true for combining GRF and COP components but also for using the derivatives of the affected and unaffected sides. Even though the derivatives represent redundant information to the original signals, they might still help the classifier to better grasp class differences. Furthermore, the combination of the affected and unaffected side (either explicitly or implicitly through Δ) seems to be beneficial as well. The results of our study provide a first indication of which signals to use and how to fuse them. Further investigations with alternative datasets are required to corroborate these findings.

5. Conclusions

The presented work aims at clarifying which aggregation method and which signal representations are best suited for the classification of data obtained from gait analysis (based on GRF assessment). The results show that the aggregation of several trials of one subject is beneficial, especially when late fusion or the mean waveform is used. Furthermore, the results indicate that the combination of the original signals with derived representations increases the expressive power of the data during feature extraction and classification. The combination of GRF and COP components with derived representations, even though they may be partially redundant, improved classification performance on our data. Future research will investigate adaptively learned feature representations as well as the modeling of relationships within a gait cycle to derive more expressive representations.
Acknowledgments

This work was partly funded by the NFB – Lower Austrian Research and Education Company and the Provincial Government of Lower Austria, Department of Science and Research (LSC14-005 and FTI17-014), and by the Austrian Research Promotion Agency (FFG) and the BMDW within the COIN program (866855). We want to thank Marianne Worisch and Szava Zoltán for their great assistance in data preparation and their great support in clinical and technical questions.

Table 3: Classification accuracies (%) for different combinations of input signals and derived representations (RB: 20%). The highest result in each row is marked with *.

| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 42.6 | 38.7 | 47.2 | 44.9 | 47.5 | 47.5 | 37.1 | 46.6 | 48.9* | 44.3 |
| FAP | 44.3 | 40.7 | 45.6 | 42.0 | 42.3 | 42.6 | 40.3 | 44.3 | 46.6* | 45.3 |
| FML | 44.3 | 32.5 | 44.6 | 43.3 | 34.8 | 45.6* | 37.7 | 44.6 | 44.3 | 38.4 |
| COPML | 28.2 | 26.9 | 31.2 | 26.6 | 25.3 | 43.6 | 34.1 | 44.9* | 44.6 | 35.4 |
| COPAP | 36.4 | 26.9 | 35.1 | 40.0 | 33.1 | 45.3 | 30.8 | 45.3 | 46.2* | 35.1 |
| GRF | 56.7 | 45.6 | 54.4 | 55.7 | 46.9 | 55.1 | 45.6 | 55.4 | 60.0* | 48.2 |
| COP | 37.1 | 30.8 | 43.0 | 41.6 | 32.1 | 48.2 | 34.1 | 52.8* | 51.8 | 36.1 |
| GRF + COP | 61.0 | 47.2 | 58.7 | 60.3 | 49.8 | 59.3 | 49.8 | 61.3 | 62.0* | 51.2 |

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.gaitpost.2019.10.021.

References

[1] W. Pirker, R. Katzenschlager, Gait disorders in adults and the elderly, Wiener Klinische Wochenschrift 129 (3–4) (2017) 81–95.
[2] P. Mahlknecht, S. Kiechl, B.R. Bloem, J. Willeit, C. Scherfler, A. Gasperi, G. Rungger, W. Poewe, K. Seppi, Prevalence and burden of gait disorders in elderly men and women aged 60–97 years: a population-based study, PLoS ONE 8 (7) (2013) e69627.
[3] R. Baker, Measuring Walking: A Handbook of Clinical Gait Analysis, Mac Keith Press, London, 2013.
[4] T.
Chau, A review of analytical techniques for gait data. Part 1: Fuzzy, statistical and fractal methods, Gait Posture 13 (1) (2001) 49–66, https://doi.org/10.1016/S0966-6362(00)00094-1.
[5] J. Figueiredo, C.P. Santos, J.C. Moreno, Automatic recognition of gait patterns in human motor disorders using machine learning: a review, Med. Eng. Phys. (2018), https://doi.org/10.1016/j.medengphy.2017.12.006.
[6] H. Lau, K. Tong, H. Zhu, Support vector machine for classification of walking conditions of persons after stroke with dropped foot, Hum. Mov. Sci. 28 (4) (2009) 504–514, https://doi.org/10.1016/j.humov.2008.12.003.
[7] F. Wahid, R.K. Begg, C.J. Hass, S. Halgamuge, D.C. Ackland, Classification of Parkinson’s disease gait using spatial-temporal gait features, IEEE J. Biomed. Health Inform. 19 (6) (2015) 1794–1802, https://doi.org/10.1109/JBHI.2015.2450232.
[8] L. Van Gestel, T. De Laet, E. Di Lello, H. Bruyninckx, G. Molenaers, A. Van Campenhout, E. Aertbeliën, M. Schwartz, H. Wambacq, P. De Cock, K. Desloovere, Probabilistic gait classification in children with cerebral palsy: a Bayesian approach, Res. Dev. Disabilit. 32 (6) (2011) 2542–2552, https://doi.org/10.1016/j.ridd.2011.07.004.
[9] M. Alaqtash, T. Sarkodie-Gyan, H. Yu, O. Fuentes, R. Brower, A. Abdelgawad, Automatic classification of pathological gait patterns using ground reaction forces and machine learning algorithms, 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2011, pp. 453–457.
[10] C. Nüesch, V. Valderrabano, C. Huber, V. von Tscharner, G. Pagenstert, Gait patterns of asymmetric ankle osteoarthritis patients, Clin. Biomech. 27 (6) (2012) 613–618, https://doi.org/10.1016/j.clinbiomech.2011.12.016.
[11] J. Wu, J. Wang, PCA-based SVM for automatic recognition of gait patterns, J. Appl. Biomech. 24 (1) (2008) 83–87.
[12] D. Soares, M. de Castro, E. Mendes, L.
Machado, Principal component analysis in ground reaction forces and center of pressure gait waveforms of people with transfemoral amputation, Prosthet. Orthot. Int. 40 (6) (2016) 729–738.
[13] J. Christian, J. Kröll, G. Strutzenberger, N. Alexander, M. Ofner, H. Schwameder, Computer aided analysis of gait patterns in patients with acute anterior cruciate ligament injury, Clin. Biomech. 33 (2016) 55–60, https://doi.org/10.1016/j.clinbiomech.2016.02.008.
[14] B.M. Eskofier, P. Federolf, P.F. Kugler, B.M. Nigg, Marker-based classification of young-elderly gait pattern differences via direct PCA feature extraction and SVMs, Comput. Methods Biomech. Biomed. Eng. 16 (4) (2011) 435–442.
[15] P. Levinger, D. Lai, R. Begg, K. Webster, J. Feller, The application of support vector machines for detecting recovery from knee replacement surgery using spatio-temporal gait parameters, Gait Posture 29 (1) (2009) 91–96.
[16] D. Slijepcevic, B. Horsak, C. Schwab, A. Raberger, M. Schüller, A. Baca, C. Breiteneder, M. Zeppelzauer, Ground reaction force measurements for gait classification tasks: effects of different PCA-based representations, Gait Posture 57 (2017) 4–5.
[17] M. Sangeux, J. Polak, A simple method to choose the most representative stride and detect outliers, ResearchGate 41 (2) (2014), https://doi.org/10.1016/j.gaitpost.2014.12.004.
[18] D. Slijepcevic, M. Zeppelzauer, A.-M. Gorgas, C. Schwab, M. Schüller, A. Baca, C. Breiteneder, B. Horsak, Automatic classification of functional gait disorders, IEEE J. Biomed. Health Inform. 22 (5) (2018) 1653–1661.
[19] T. Chau, S. Young, S. Redekop, Managing variability in the summary and comparison of gait data, J. NeuroEng. Rehabil. 2 (1) (2005) 22, https://doi.org/10.1186/1743-0003-2-22.
[20] C. Chang, C. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27:1–27:27.
[21] E. Combrisson, K.
Jerbi, Exceeding chance level by chance: the caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy, J. Neurosci. Methods 250 (2015) 126–136.
[22] D. Slijepcevic, M. Zeppelzauer, C. Schwab, A. Raberger, B. Dumphart, A. Baca, C. Breiteneder, B. Horsak, P 011 - Towards an optimal combination of input signals and derived representations for gait classification based on ground reaction force measurements, Gait Posture 65 (2018) 249.
[23] G. Williams, D. Lai, A. Schache, M. Morris, Classification of gait disorders following traumatic brain injury, J. Head Trauma Rehabil. 30 (2) (2015) E13–E23.

Supplementary Material for: Input representations and classification strategies for automated human gait analysis

Djordje Slijepcevic (a,∗), Matthias Zeppelzauer (a), Caterine Schwab (b), Anna-Maria Raberger (b), Christian Breiteneder (c), Brian Horsak (b)

(a) St. Pölten University of Applied Sciences, Institute for Creative Media Technologies, St. Pölten, Austria
(b) St. Pölten University of Applied Sciences, Institute of Health Sciences, St. Pölten, Austria
(c) TU Wien, Institute of Visual Computing and Human-Centered Technology, Vienna, Austria

Abstract

Background: Quantitative gait analysis produces a vast amount of data, which can be difficult to analyze. Automated gait classification based on machine learning techniques bears the potential to support clinicians in comprehending these complex data. Even though these techniques are already frequently used in the scientific community, there is no clear consensus on how the data need to be preprocessed and arranged to assure optimal classification accuracy outcomes.

Research question: Is there an optimal data aggregation and preprocessing workflow to optimize classification accuracy outcomes?
Methods: Based on our previous work on automated classification of ground reaction force (GRF) data, a sequential setup was followed: firstly, several aggregation methods (early fusion and late fusion) were compared, and secondly, based on the best aggregation method identified, the expressiveness of different combinations of signal representations was investigated. The employed dataset included data from 910 subjects, with four gait disorder classes and one healthy control group. The machine learning pipeline comprised principal component analysis (PCA), z-standardization and a support vector machine (SVM).

Results: The late fusion aggregation, i.e., utilizing majority voting on the classifier’s predictions, performed best. In addition, the use of derived signal representations (relative changes and signal differences) seems to be advantageous as well.

Significance: Our results indicate that great caution is needed when data preprocessing and aggregation methods are selected, as these can have an impact on classification accuracies. Our results shall serve future studies as a guideline for the choice of data aggregation and preprocessing techniques to be employed.

∗ Corresponding author. Email address: djordje.slijepcevic@fhstp.ac.at (Djordje Slijepcevic)

Preprint submitted to Journal of Gait & Posture, February 28, 2024

1. Methods

Figure S1 shows how the dataset is split for evaluation purposes during the conducted experiments. We randomly divided the dataset into a training set (65%) and a test set (35%). 5-fold cross-validation is performed on the training set to determine the best parameters for the employed classifier, i.e., a probabilistic linear support vector machine (SVM), and to evaluate its dependency on the training data. Once the best configuration for the SVM was determined, it was trained on the entire training data.
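The hyper-parameter selection described above can be outlined with a short, library-free Python sketch. The callback `train_and_score` is a hypothetical stand-in for training the SVM with a given C on the training indices and returning its validation score; the study used LIBSVM from Matlab, which is not reproduced here. Integer exponents from −5 to 10 are assumed for the grid, since only the range of C is given, and the samples are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def grid_search(n_samples, train_and_score, exponents=range(-5, 11), k=5):
    """Return the C = 2**e with the best mean cross-validation score.

    train_and_score(c, train_indices, validation_indices) -> float is a
    hypothetical callback supplied by the caller.
    """
    best_c, best_score = None, float("-inf")
    for exponent in exponents:
        c = 2.0 ** exponent
        scores = [train_and_score(c, train, validation)
                  for train, validation in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_c, best_score = c, mean_score
    return best_c, best_score
```

The value of C returned by `grid_search` would then be used to train the final model on the entire training set.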
The test set was divided into three equal and balanced splits, which are used to evaluate the generalization ability of the best obtained model on previously unseen data.

[Figure S1 (schematic): the dataset is divided into a train set and a test set; 5-fold cross-validation (folds 1 to 5, each with train and val parts) is run on the train set; the model with the best parameters is trained on the train set and evaluated on the independent test set splits t1, t2 and t3.]

Figure S1: The dataset was randomly divided into a training set and an independent test set. For hyper-parameter selection, a 5-fold cross-validation grid search was employed on the training set. The SVM with the best parameters was trained on the entire training set and evaluated on the three test set splits.

2. Results

To further assess the generalizability of the first experiment, i.e., investigating different aggregation methods over several trials, we repeatedly evaluated the experiment with 10 different train-test splits. Results of these experiments are presented in Table S1. A repeated measures ANOVA with a Greenhouse-Geisser correction determined that the F1-score differed statistically significantly between the five methods (F(2.157, 19.411) = 15.969, p < 0.001). A Shapiro-Wilk test confirmed the normal distribution of all variables. Bonferroni-Holm corrected post hoc tests revealed that majority voting was superior to all other methods (p < 0.01), except for the mean waveform approach. Here the difference between majority voting and the mean waveform approach slightly missed significance (p = 0.102). The mean waveform approach was superior to the baseline and median waveform approach, but slightly missed significance for the most representative trial approach (p = 0.053).

Table S1: Classification results (%) of the experiment investigating different aggregation methods over several trials evaluated on 10 different train-test splits (RB: 20%).
Columns 1 to 10: independent test set (10 different train-test splits).

| Trial selection | Measure | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline without aggregation | Acc | 56.5 | 53.2 | 53.2 | 53.0 | 56.1 | 53.7 | 52.9 | 55.4 | 57.2 | 59.1 | 55.0 (2.2) |
| | P | 55.9 | 51.7 | 53.1 | 52.7 | 55.8 | 53.7 | 52.8 | 54.3 | 57.0 | 58.3 | 54.5 (2.1) |
| | R | 56.5 | 53.2 | 53.2 | 53.1 | 56.1 | 53.7 | 53.0 | 55.5 | 57.2 | 59.1 | 55.1 (2.1) |
| | F1 | 56.0 | 52.1 | 53.1 | 52.8 | 55.8 | 53.7 | 52.9 | 54.6 | 57.1 | 58.5 | 54.7 (2.1) |
| Mean waveform approach | Acc | 59.0 | 55.9 | 54.6 | 56.2 | 56.4 | 55.4 | 52.6 | 55.6 | 58.4 | 61.6 | 56.6 (2.5) |
| | P | 58.8 | 55.0 | 54.9 | 56.2 | 56.3 | 55.6 | 52.0 | 54.9 | 58.6 | 60.9 | 56.3 (2.5) |
| | R | 59.0 | 56.0 | 54.6 | 56.3 | 56.4 | 55.4 | 52.7 | 55.6 | 58.4 | 61.6 | 56.6 (2.5) |
| | F1 | 58.7 | 55.2 | 54.5 | 55.9 | 56.2 | 55.4 | 52.0 | 55.1 | 58.4 | 61.1 | 56.3 (2.6) |
| Median waveform approach | Acc | 56.7 | 53.6 | 52.9 | 52.9 | 57.1 | 53.4 | 51.0 | 54.3 | 59.3 | 57.4 | 54.9 (2.6) |
| | P | 56.5 | 52.7 | 52.9 | 52.3 | 57.1 | 53.7 | 50.6 | 53.2 | 59.0 | 56.6 | 54.5 (2.7) |
| | R | 56.7 | 53.7 | 53.0 | 53.0 | 57.1 | 53.4 | 51.0 | 54.3 | 59.3 | 57.4 | 54.9 (2.6) |
| | F1 | 56.2 | 52.8 | 52.7 | 52.3 | 56.7 | 53.5 | 50.6 | 53.5 | 59.1 | 56.6 | 54.4 (2.6) |
| Most representative trial (MRT) | Acc | 57.1 | 48.0 | 50.7 | 51.3 | 55.4 | 56.1 | 54.3 | 52.0 | 53.4 | 52.1 | 53.0 (2.8) |
| | P | 57.1 | 47.1 | 50.7 | 51.1 | 55.3 | 55.3 | 53.6 | 51.0 | 52.7 | 51.4 | 52.5 (2.9) |
| | R | 57.1 | 48.1 | 50.7 | 51.4 | 55.4 | 56.1 | 54.3 | 52.1 | 53.4 | 52.1 | 53.1 (2.7) |
| | F1 | 56.7 | 47.3 | 50.7 | 51.2 | 55.3 | 55.6 | 53.8 | 51.4 | 53.0 | 51.4 | 52.6 (2.8) |
| Majority voting | Acc | 62.6 | 54.9 | 55.9 | 56.2 | 60.7 | 56.7 | 59.5 | 59.5 | 62.3 | 63.6 | 59.2 (3.1) |
| | P | 62.2 | 52.8 | 55.5 | 55.7 | 60.1 | 56.5 | 58.9 | 58.6 | 62.0 | 62.9 | 58.5 (3.4) |
| | R | 62.6 | 55.0 | 55.9 | 56.3 | 60.7 | 56.7 | 59.6 | 59.6 | 62.3 | 63.6 | 59.2 (3.1) |
| | F1 | 62.0 | 52.9 | 55.5 | 55.3 | 60.1 | 56.5 | 59.1 | 58.5 | 61.9 | 62.9 | 58.5 (3.3) |

The following tables contain additional results from our second experiment, where effects of different combinations of input signals and derived representations were examined. Table S2, Table S3, Table S4, and Table S5 show the classification accuracy, precision, recall and F1-score results obtained during 5-fold cross-validation.
Table S6, Table S7, and Table S8 show precision, recall and F1-score results obtained during the evaluation on the independent test set.

Table S2: Mean (SD) of classification accuracy (%) obtained within 5-fold cross-validation for different combinations of input signals and derived representations (RB: 20%).

| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 44.8 (2.7) | 35.9 (1.5) | 45.3 (1.8) | 47.6 (2.2) | 47.1 (2.2) | 46.1 (2.7) | 35.7 (3.5) | 47.6 (1.9) | 49.8 (3.1) | 47.3 (2.2) |
| FAP | 44.5 (3.8) | 38.0 (4.3) | 48.3 (1.1) | 46.9 (2.1) | 44.0 (4.5) | 42.5 (2.7) | 37.2 (2.2) | 45.5 (2.1) | 45.6 (2.1) | 44.0 (3.2) |
| FML | 40.8 (4.1) | 32.2 (4.6) | 43.3 (5.4) | 45.1 (4.5) | 34.9 (2.2) | 41.0 (2.1) | 33.2 (2.8) | 40.7 (4.9) | 44.1 (4.4) | 36.0 (3.5) |
| COPML | 27.1 (4.2) | 25.0 (2.9) | 30.2 (5.6) | 28.1 (4.4) | 25.3 (3.2) | 37.9 (4.8) | 33.6 (4.1) | 40.5 (4.3) | 38.0 (6.1) | 34.2 (4.0) |
| COPAP | 35.9 (6.2) | 26.6 (6.6) | 35.7 (6.8) | 38.0 (4.1) | 31.2 (4.3) | 39.2 (2.8) | 32.9 (2.2) | 43.0 (3.2) | 41.0 (2.4) | 36.9 (3.8) |
| GRF | 55.2 (2.3) | 46.0 (2.1) | 55.0 (4.8) | 55.9 (3.3) | 49.6 (3.3) | 54.9 (2.5) | 45.8 (1.2) | 53.1 (2.5) | 55.7 (1.9) | 51.2 (4.3) |
| COP | 38.0 (4.6) | 29.4 (5.3) | 39.8 (4.0) | 38.8 (2.9) | 31.2 (4.5) | 42.1 (3.9) | 34.4 (3.4) | 46.4 (2.3) | 43.1 (3.3) | 36.2 (4.0) |
| GRF + COP | 55.4 (3.0) | 49.1 (3.6) | 55.9 (3.2) | 57.9 (3.1) | 52.9 (1.9) | 55.9 (4.4) | 48.6 (1.7) | 54.9 (4.1) | 57.9 (3.3) | 52.7 (2.1) |

Table S3: Mean (SD) of precision (%) obtained within 5-fold cross-validation for different combinations of input signals and derived representations (RB: 20%). If the performance measure is specified as 0, one of the five classes is not modeled at all by the classifier.
| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 41.1 (3.5) | 34.0 (1.7) | 43.7 (1.8) | 46.1 (2.0) | 45.6 (2.3) | 44.7 (3.8) | 33.1 (4.0) | 46.8 (2.2) | 48.4 (4.1) | 45.5 (1.9) |
| FAP | 41.5 (3.5) | 37.5 (3.8) | 46.6 (2.4) | 44.9 (2.5) | 41.6 (5.5) | 40.0 (3.2) | 34.7 (1.9) | 43.5 (2.7) | 44.5 (1.3) | 40.9 (3.4) |
| FML | 36.8 (3.5) | 29.3 (7.7) | 41.8 (5.5) | 42.3 (5.5) | 30.5 (3.0) | 38.3 (2.6) | 32.6 (4.4) | 39.2 (6.0) | 42.0 (4.0) | 34.5 (3.1) |
| COPML | 5.1 (10.2) | 0 (0) | 22.1 (12.4) | 0 (0) | 0 (0) | 37.3 (5.1) | 32.5 (9.2) | 39.3 (4.1) | 37.1 (6.7) | 33.3 (8.4) |
| COPAP | 34.3 (4.0) | 0 (0) | 31.0 (7.0) | 29.0 (14.7) | 7.7 (9.6) | 39.2 (2.2) | 33.6 (5.6) | 41.9 (3.8) | 40.3 (1.4) | 36.8 (6.7) |
| GRF | 54.2 (2.7) | 46.9 (2.7) | 54.7 (5.6) | 55.4 (3.5) | 48.7 (4.2) | 54.3 (2.4) | 45.7 (1.1) | 53.1 (2.6) | 54.7 (2.0) | 51.1 (5.2) |
| COP | 38.0 (6.6) | 0 (0) | 38.9 (3.1) | 36.3 (2.7) | 12.3 (10.3) | 41.9 (3.4) | 32.8 (2.1) | 46.5 (4.0) | 43.0 (2.8) | 34.0 (3.8) |
| GRF + COP | 54.3 (2.8) | 49.0 (2.5) | 55.9 (3.3) | 57.2 (3.5) | 53.0 (3.1) | 55.7 (4.5) | 48.7 (2.0) | 54.7 (4.1) | 57.2 (3.1) | 52.4 (1.8) |

Table S4: Mean (SD) of recall (%) obtained within 5-fold cross-validation for different combinations of input signals and derived representations (RB: 20%).
| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 45.2 (1.6) | 37.0 (3.2) | 45.7 (1.8) | 48.3 (0.8) | 48.2 (3.9) | 46.3 (3.1) | 36.7 (3.3) | 48.0 (1.7) | 50.0 (2.6) | 48.9 (1.0) |
| FAP | 44.9 (1.9) | 39.2 (2.7) | 48.9 (2.4) | 47.4 (0.5) | 45.1 (4.7) | 43.0 (2.2) | 37.3 (2.0) | 45.8 (2.5) | 46.3 (2.9) | 44.6 (4.1) |
| FML | 41.3 (2.7) | 34.1 (5.9) | 43.8 (4.5) | 45.6 (2.9) | 36.0 (2.7) | 41.9 (1.5) | 35.0 (5.0) | 41.5 (4.0) | 44.8 (3.0) | 37.6 (2.4) |
| COPML | 28.3 (3.0) | 25.6 (1.9) | 31.0 (2.8) | 28.8 (3.1) | 26.1 (1.3) | 38.6 (4.0) | 34.3 (2.1) | 41.5 (3.6) | 38.6 (4.7) | 35.4 (2.0) |
| COPAP | 36.4 (3.7) | 26.8 (2.2) | 36.0 (4.1) | 38.6 (2.6) | 31.7 (1.4) | 39.5 (2.0) | 33.5 (2.1) | 43.4 (3.7) | 41.5 (2.1) | 37.5 (2.0) |
| GRF | 56.1 (3.4) | 47.5 (3.1) | 56.0 (6.1) | 56.6 (2.2) | 51.0 (3.8) | 55.6 (2.1) | 46.4 (2.3) | 53.6 (2.9) | 56.3 (1.3) | 52.8 (3.2) |
| COP | 39.1 (3.9) | 29.8 (1.9) | 40.6 (3.0) | 39.6 (3.0) | 31.5 (1.4) | 42.3 (4.0) | 35.0 (2.0) | 47.3 (4.3) | 43.6 (3.5) | 36.8 (3.1) |
| GRF + COP | 56.3 (3.0) | 50.0 (2.2) | 56.5 (4.3) | 58.2 (2.1) | 53.9 (2.9) | 56.4 (4.8) | 49.2 (2.7) | 55.6 (5.3) | 58.4 (3.0) | 53.7 (2.1) |

Table S5: Mean (SD) of F1-scores (%) obtained within 5-fold cross-validation for different combinations of input signals and derived representations (RB: 20%). If the performance measure is specified as 0, one of the five classes is not modeled at all by the classifier.
| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 33.9 (17.0) | 32.7 (1.9) | 43.0 (1.9) | 45.4 (2.0) | 44.8 (2.6) | 44.4 (3.5) | 32.6 (4.1) | 46.3 (2.1) | 48.1 (3.6) | 45.4 (2.1) |
| FAP | 41.2 (3.5) | 34.8 (4.9) | 45.8 (1.3) | 44.5 (2.4) | 41.0 (4.5) | 39.7 (2.9) | 27.4 (13.8) | 43.4 (2.4) | 43.7 (1.9) | 32.9 (16.7) |
| FML | 36.5 (4.2) | 5.8 (11.6) | 40.7 (6.2) | 41.8 (5.3) | 18.7 (15.4) | 37.5 (2.8) | 30.8 (2.9) | 38.6 (5.9) | 41.1 (4.8) | 33.3 (3.7) |
| COPML | 0 (0) | 0 (0) | 11.0 (13.7) | 0 (0) | 0 (0) | 35.8 (5.3) | 17.8 (14.8) | 37.8 (4.2) | 35.8 (6.6) | 12.0 (14.9) |
| COPAP | 23.8 (12.9) | 0 (0) | 14.0 (17.1) | 20.6 (16.9) | 0 (0) | 36.9 (2.3) | 28.1 (2.2) | 40.6 (3.1) | 38.3 (2.1) | 26.5 (13.6) |
| GRF | 54.1 (2.6) | 44.8 (2.5) | 54.3 (5.3) | 54.9 (3.2) | 48.4 (3.6) | 53.8 (2.2) | 44.7 (1.4) | 52.6 (2.8) | 54.5 (1.8) | 50.4 (4.8) |
| COP | 27.8 (14.5) | 0 (0) | 36.7 (2.8) | 34.3 (3.0) | 0 (0) | 40.4 (3.1) | 30.6 (3.5) | 45.0 (2.9) | 41.6 (2.8) | 32.8 (4.5) |
| GRF + COP | 54.1 (3.0) | 48.0 (3.2) | 55.3 (3.2) | 56.7 (3.1) | 51.8 (2.2) | 55.0 (4.3) | 47.8 (1.7) | 54.3 (4.3) | 56.9 (2.9) | 51.8 (1.8) |

Table S6: Precision (%) for different combinations of input signals and derived representations obtained on the independent test set. If the performance measure is specified as 0, one of the five classes is not modeled at all by the classifier.

| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 40.3 | 35.9 | 45.4 | 42.8 | 46.0 | 46.1 | 35.4 | 46.7 | 48.9 | 44.3 |
| FAP | 40.6 | 40.3 | 43.8 | 37.6 | 42.2 | 41.6 | 42.1 | 43.6 | 45.2 | 45.5 |
| FML | 40.2 | 29.5 | 42.0 | 38.9 | 32.8 | 43.5 | 37.2 | 42.5 | 41.6 | 35.7 |
| COPML | 0 | 0 | 23.4 | 0 | 0 | 42.9 | 34.2 | 42.6 | 43.7 | 34.3 |
| COPAP | 33.7 | 0 | 30.8 | 35.4 | 27.6 | 44.1 | 25.9 | 43.8 | 44.3 | 31.3 |
| GRF | 55.8 | 44.8 | 54.0 | 54.6 | 45.8 | 54.5 | 44.8 | 55.8 | 59.6 | 47.6 |
| COP | 32.0 | 49.7 | 41.5 | 40.6 | 34.4 | 47.7 | 32.8 | 52.7 | 50.9 | 30.8 |
| GRF + COP | 60.9 | 45.8 | 58.2 | 59.5 | 49.2 | 59.0 | 48.9 | 61.5 | 61.5 | 50.8 |

Table S7: Recall (%) for different combinations of input signals and derived representations obtained on the independent test set (RB: 20%).
| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 42.6 | 38.7 | 47.2 | 44.9 | 47.5 | 47.5 | 37.1 | 46.6 | 48.9 | 44.3 |
| FAP | 44.3 | 40.7 | 45.6 | 42.0 | 42.3 | 42.6 | 40.3 | 44.3 | 46.6 | 45.3 |
| FML | 44.3 | 32.5 | 44.6 | 43.3 | 34.8 | 45.6 | 37.7 | 44.6 | 44.3 | 38.4 |
| COPML | 28.2 | 26.9 | 31.2 | 26.6 | 25.3 | 43.6 | 34.1 | 44.9 | 44.6 | 35.4 |
| COPAP | 36.4 | 26.9 | 35.1 | 40.0 | 33.1 | 45.3 | 30.8 | 45.3 | 46.2 | 35.1 |
| GRF | 56.7 | 45.6 | 54.4 | 55.7 | 46.9 | 55.1 | 45.6 | 55.4 | 60.0 | 48.2 |
| COP | 37.1 | 30.8 | 43.0 | 41.6 | 32.1 | 48.2 | 34.1 | 52.8 | 51.8 | 36.1 |
| GRF + COP | 61.0 | 47.2 | 58.7 | 60.3 | 49.8 | 59.3 | 49.8 | 61.3 | 62.0 | 51.2 |

Table S8: F1-score (%) for different combinations of input signals and derived representations obtained on the independent test set. If the performance measure is specified as 0, one of the five classes is not modeled at all by the classifier.

| Signals | {A} | {U} | {A,U} | {A,Δ} | {U,Δ} | {A,DA} | {U,DU} | {A,DA,U,DU} | {A,Δ,DA} | {U,Δ,DU} |
|---|---|---|---|---|---|---|---|---|---|---|
| FV | 39.3 | 35.6 | 45.8 | 43.5 | 46.0 | 46.4 | 35.5 | 46.3 | 47.9 | 43.0 |
| FAP | 41.2 | 38.6 | 43.5 | 38.8 | 40.4 | 41.5 | 38.8 | 43.7 | 45.2 | 44.2 |
| FML | 38.6 | 27.6 | 41.6 | 38.8 | 30.2 | 43.3 | 35.2 | 42.3 | 41.5 | 35.7 |
| COPML | 0 | 0 | 0 | 0 | 0 | 41.4 | 27.8 | 42.5 | 42.3 | 29.5 |
| COPAP | 28.8 | 0 | 29.7 | 33.5 | 22.8 | 43.2 | 25.2 | 43.4 | 43.8 | 29.7 |
| GRF | 55.4 | 45.0 | 54.1 | 55.0 | 46.1 | 54.6 | 44.9 | 55.4 | 59.6 | 47.7 |
| COP | 30.9 | 21.4 | 39.0 | 38.0 | 23.4 | 46.5 | 31.7 | 51.3 | 50.5 | 31.4 |
| GRF + COP | 60.7 | 46.1 | 58.3 | 59.7 | 49.3 | 59.0 | 49.1 | 61.2 | 61.7 | 50.7 |

2.4 Explaining Machine Learning Models for Clinical Gait Analysis

Djordje Slijepcevic, Fabian Horst, Sebastian Lapuschkin, Brian Horsak, Anna-Maria Raberger, Andreas Kranzl, Wojciech Samek, Christian Breiteneder, Wolfgang Immanuel Schöllhorn, and Matthias Zeppelzauer. Explaining Machine Learning Models for Clinical Gait Analysis. ACM Transactions on Computing for Healthcare (HEALTH), 3(2):1–27, 2021. DOI: 10.1145/3474121

The final version of this publication is available at: https://doi.org/10.1145/3474121.
Permission for reprint granted, © 2021 Slijepcevic

Explaining Machine Learning Models for Clinical Gait Analysis

DJORDJE SLIJEPCEVIC, Institute of Creative Media Technologies, Department of Media & Digital Technologies, St. Pölten University of Applied Sciences, Austria
FABIAN HORST, Department of Training and Movement Science, Institute of Sport Science, Johannes Gutenberg-University Mainz, Germany
SEBASTIAN LAPUSCHKIN, Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Germany
BRIAN HORSAK, Institute of Health Sciences, Department of Health Sciences, St. Pölten University of Applied Sciences, Austria and Center for Digital Health and Social Innovation, St. Pölten University of Applied Sciences, Austria
ANNA-MARIA RABERGER, Institute of Health Sciences, Department of Health Sciences, St. Pölten University of Applied Sciences, Austria
ANDREAS KRANZL, Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Vienna-Speising, Austria
WOJCIECH SAMEK, Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Germany
CHRISTIAN BREITENEDER, Institute of Visual Computing and Human-Centered Technology, TU Wien, Austria
WOLFGANG IMMANUEL SCHÖLLHORN, Department of Training and Movement Science, Institute of Sport Science, Johannes Gutenberg-University Mainz, Germany
MATTHIAS ZEPPELZAUER, Institute of Creative Media Technologies, Department of Media & Digital Technologies, St. Pölten University of Applied Sciences, Austria

Djordje Slijepcevic and Fabian Horst contributed equally to this research.

This work was partly funded by the Austrian Research Promotion Agency (FFG) and the Austrian Federal Ministry for Digital and Economic Affairs (BMDW) within the COIN-program (ReMoCapLab – #866855 and BigDataAnalytics – #866880), the Lower Austrian Research and Education Company (NFB), the Provincial Government of Lower Austria (IntelliGait3D – #FTI17-014).
Further support was received from the German Ministry for Education and Research as BIFOLD (#01IS18025A and #01IS18037A) and TraMeExCo (#01IS18056A), as well as the European Union’s Horizon 2020 research and innovation programme through the iToBoS project (#965221).

Authors’ addresses: D. Slijepcevic and M. Zeppelzauer, Institute of Creative Media Technologies, Department of Media & Digital Technologies, St. Pölten University of Applied Sciences, St. Pölten, Austria; email: Djordje.Slijepcevic@fhstp.ac.at; F. Horst and W. I. Schöllhorn, Department of Training and Movement Science, Institute of Sport Science, Johannes Gutenberg-University Mainz, Mainz, Germany; email: horst@uni-mainz.de; S. Lapuschkin and W. Samek, Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin, Germany; B. Horsak and A.-M. Raberger, Institute of Health Sciences, Department of Health Sciences, St. Pölten University of Applied Sciences, St. Pölten, Austria; A. Kranzl, Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Vienna-Speising, Vienna, Austria; C. Breiteneder, Institute of Visual Computing and Human-Centered Technology, TU Wien, Vienna, Austria.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.
© 2021 Copyright held by the owner/author(s).
2637-8051/2021/12-ART14
https://doi.org/10.1145/3474121

ACM Transactions on Computing for Healthcare, Vol. 3, No. 2, Article 14. Publication date: December 2021.

Machine Learning (ML) is increasingly used to support decision-making in the healthcare sector. While ML approaches provide promising results with regard to their classification performance, most share a central limitation, their black-box character. This article investigates the usefulness of Explainable Artificial Intelligence (XAI) methods to increase transparency in automated clinical gait classification based on time series. For this purpose, predictions of state-of-the-art classification methods are explained with an XAI method called Layer-wise Relevance Propagation (LRP). Our main contribution is an approach that explains class-specific characteristics learned by ML models that are trained for gait classification. We investigate several gait classification tasks and employ different classification methods, i.e., Convolutional Neural Network, Support Vector Machine, and Multi-layer Perceptron. We propose to evaluate the obtained explanations with two complementary approaches: a statistical analysis of the underlying data using Statistical Parametric Mapping and a qualitative evaluation by two clinical experts. A gait dataset comprising ground reaction force measurements from 132 patients with different lower-body gait disorders and 62 healthy controls is utilized. Our experiments show that explanations obtained by LRP exhibit promising statistical properties concerning inter-class discriminativity and are also in line with clinically relevant biomechanical gait characteristics.

CCS Concepts: • Computing methodologies → Neural networks; • Applied computing → Health care information systems;

Additional Key Words and Phrases: Explainable artificial intelligence, clinical gait analysis, human gait classification, layer-wise relevance propagation, statistical parametric mapping, ground reaction forces, convolutional neural networks

ACM Reference format:
Djordje Slijepcevic, Fabian Horst, Sebastian Lapuschkin, Brian Horsak, Anna-Maria Raberger, Andreas Kranzl, Wojciech Samek, Christian Breiteneder, Wolfgang Immanuel Schöllhorn, and Matthias Zeppelzauer. 2021. Explaining Machine Learning Models for Clinical Gait Analysis. ACM Trans. Comput. Healthcare 3, 2, Article 14 (December 2021), 27 pages.
https://doi.org/10.1145/3474121

1 INTRODUCTION

Artificial Intelligence (AI) and Machine Learning (ML) techniques have become almost ubiquitous in our daily lives by supporting or guiding our decisions and providing recommendations. Impressively, there are certain medical tasks, such as the detection of skin or breast cancer, that ML approaches have already been able to solve more efficiently and effectively than humans [16, 21, 42]. Therefore, it is not surprising that ML approaches are currently becoming popular in the healthcare sector [73]. This trend has also been recognized in the field of clinical gait analysis (CGA) [18, 61]. CGA focuses on the quantitative description and analysis of human gait from a kinematic (i.e., joint angles), kinetic (i.e., ground reaction forces and joint moments), and muscular (i.e., electromyographic activity) point of view [9, 79]. Thereby, CGA produces a vast amount of data [22, 54], which are difficult to comprehend due to their multi-dimensional and multi-correlated nature [13, 80].

In recent years, ML methods have been successfully employed in CGA for the classification of patient groups [18, 61], such as patients with stroke [36], Parkinson’s disease [76], cerebral palsy [74], multiple sclerosis [3], osteoarthritis [50], and different functional gait disorders [66]. While ML approaches yield promising results regarding classification performance, most share a central limitation, which is their black-box character [1]. This means that even if the underlying mathematical principles of these methods are understood, it is often unclear why a particular prediction has been made and whether meaningfully grounded patterns have led to this prediction. Additionally, the black-box character hinders ML approaches from providing justifications of their predictions. This is, however, necessary for compliance with legislation such as the General Data Protection Regulation (GDPR, EU 2016/679) [1, 17, 23].
These factors currently limit the application of ML-based decision-support systems in medical practice [26, 59].

For these reasons, the field of Explainable Artificial Intelligence (XAI) has gained increasing attention in recent years. Different approaches have been proposed (see Section 2: Related work). In general, XAI methods intend to illustrate how complex and non-linear ML models operate and how they produced their predictions. However, explanation is understood in the sense of providing more differentiated insights into the behaviour of ML models to fathom the dependence of the results on input variables (without claiming to give causation). Even though research in XAI is still in an early stage, the application of such approaches in medicine has already raised attention [26, 72]. The motivation is to increase the traceability of ML models and trust in them among medical professionals [27]. However, the application of XAI methods to the field of CGA remains to be investigated. A first step in this direction has recently been taken by Horst et al. [29] for explaining predictions in gait-based person recognition.

The primary aim of this article is to investigate and explain which class-specific characteristics ML models learn from CGA data, i.e., time series. For this purpose, we train several classification models for different gait classification tasks and extract prediction explanations from the trained models via Layer-wise Relevance Propagation (LRP). Subsequently, the explanations of the individual predictions are aggregated to obtain class-specific model explanations. The assessment of the resulting explanations is, however, a challenge, since no ground truth exists for automatically generated explanations in CGA.
In contrast to images, which are more frequently subject to explainability studies [2, 19, 57, 58], the evaluation of explanations becomes particularly challenging when the input signals are more abstract and thus not straightforward to interpret, as often is the case with biomedical signals. Recently, it has been shown that XAI approaches do not necessarily refer to the actual prediction of the classification model and sometimes even build upon unrelated information [2]. Thus, a more comprehensive investigation of explanations obtained by XAI methods is necessary to verify whether they are meaningful and justified.

To account for the above-mentioned challenges, we suggest a two-step approach for the evaluation of the obtained explanations. First, we analyze the discriminatory power of the obtained explanations from a statistical perspective. For this purpose, we leverage Statistical Parametric Mapping (SPM) [51], a method building upon random field theory, to derive statistical measures along with the input signals and thereby investigate how statistically justified the obtained explanations are. Second, two experienced clinical experts interpret the explainability results from a clinical perspective to evaluate whether obtained explanations match characteristics from clinical practice.

Our investigation focuses on two leading research questions:
(1) Which input features or signal regions are most relevant for automatic gait classification?
(2) To what extent are input features or signal regions identified as being relevant for a given gait classification task statistically justified and in line with clinical assessment?
In addition to these two leading questions, we investigate several further aspects that may influence classification performance as well as explainability in more detail, including the influence of different classification methods, the impact of data normalization, and the role of different input signal components (i.e., the horizontal forces, measurements of the affected leg, and measurements of the unaffected leg). We perform our investigation on the GaitRec dataset [28], which contains ground reaction force (GRF) measurements from clinical practice. We design prediction models for different gait classification tasks and derive possible explanations from the resulting models that are based on relevance scores. These relevance scores are directly related to specific regions in the input signal. Subsequently, we analyze the explanations from a statistical as well as a clinical perspective. The results show that explanations share promising statistical properties concerning class discriminativity and thus indicate that predictions are grounded on statistically justified information for the task. Further, we show that input features considered as relevant can also be interpreted as meaningful and clinically relevant biomechanical gait characteristics. Overall, our investigation demonstrates the usefulness of XAI in the domain of gait classification, exemplifies how to apply XAI methods to gait measurement data, and suggests approaches to evaluate their quality. The performed study suggests that XAI methods can be useful to better understand and interpret automatic predictions in clinical gait analysis and thus have the potential to yield an added value for clinical practice in the future.

2 RELATED WORK

Methods from XAI can be grouped according to the type of explanation they provide. We distinguish between XAI approaches for (i) data exploration, (ii) prediction explanation, and (iii) model explanation based on an adaptation of the taxonomy introduced by Arya et al. [6]. In the following, we briefly introduce the three different types of approaches and their capabilities.

Data exploration includes methods from the fields of visual analytics, statistics and unsupervised machine learning. As such, the methods are not capable of explaining a model but rather the data on which the model is trained. These methods focus on projecting the data into a space where it is possible to find meaningful structures or clusters and thus understand the data in more detail. A popular approach for data exploration introduced by Maaten and Hinton [39] is t-distributed Stochastic Neighbor Embedding (t-SNE), which projects high-dimensional data into a lower-dimensional and visualizable space. The projection is performed in a way that the cluster structure in the original data space is optimally exposed. Thereby, an understanding of the data and the identification of typical patterns and clusters in the data is facilitated. Other approaches in this category are visual analytics approaches that employ advanced techniques for the interactive visualization of data to support data exploration, i.e., finding characteristic patterns or dependencies within data [75, 77].

Prediction explanation aims at explaining the local behavior of a model, i.e., the prediction for a given input instance. For a classification task, these methods can provide, for example, explanations about which part of the input influenced the classifier’s prediction the most. For classification of gait data, the explanation should highlight all relevant signal regions and characteristic signal shapes in the input data, which are associated with a particular gait disorder.
Two main categories can be distinguished for explaining the local behavior of a machine learning model: (i) self-explaining models and (ii) post-hoc methods.

Self-explaining models integrate components that learn relationships between input data and predictions during training. Simultaneously, they learn how these relationships relate to terms from a predefined dictionary and consequently generate explanations from them. A self-explaining approach that does not visually highlight relevant regions in input data but generates textual explanations was proposed by Hendricks et al. [24]. This self-explaining model combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The CNN learns discriminative features to perform a classification task, while the RNN generates textual explanations of the prediction. This approach cannot be applied to a previously trained model in a post-hoc manner, which limits its practical applicability.

Post-hoc methods provide much greater applicability, as they can be applied to already-trained models. These methods can be further categorized into (i) propagation-based, (ii) perturbation-based, and (iii) Shapley-value-based methods. Propagation-based methods determine the contributions of each input feature by (back-)propagating some quantity of interest from the model’s output layer to the input layer. Sensitivity Analysis [82, 83] has been introduced to Support Vector Machines (SVMs) [8] and CNNs [65] in the form of saliency maps. Layer-wise Relevance Propagation (LRP) [7, 44] and Deep Learning Important FeaTures (DeepLIFT) [63] are methods that propagate importance scores from the output layer back to the input, thereby enabling the identification of positive and negative evidence for a specific prediction. Sensitivity Analysis and the explanations obtained with it generally suffer from the effects of shattered gradients [10], especially in more complex (deeper) networks.
Modern approaches to CNN explainability, such as LRP or DeepLIFT, do not have this problem and work well for a wider range of network architectures and models in general [32, 46]. Perturbation-based methods, such as those introduced by Fong and Vedaldi [19] or Zintgraf et al. [81], treat the model as a black box and estimate the importance of input features by (partially) occluding the input and measuring the effect on the model output. While some methods produce explanations directly from a perturbation process, others employ a learning component, e.g., the Local Interpretable Model-agnostic Explanations (LIME) method [55], to estimate locally interpretable surrogate models mimicking the prediction process of the black-box model. Perturbation-based methods can be considered to be model-agnostic, as they do not require access to internal model parameters or structures to operate. However, this model-agnosticism comes at a considerable computational cost compared to propagation-based approaches. Shapley-value-based methods are rooted in game theory [84] and attempt to approximate the Shapley values of a given prediction. For this purpose, the effect of omitting an input feature is examined, taking into account all possible combinations of other input features, which can be included or excluded [71]. Lundberg and Lee [38] proposed the SHapley Additive exPlanations (SHAP) method, which is a unified approach building upon the theory of Shapley values and existing propagation-based and perturbation-based methods, e.g., LIME, DeepLIFT, and LRP.

Model explanation provides an interpretation of what a trained model has learned, i.e., the most characteristic representations or prototypes for an entire class are visualized (e.g., a class of gait disorders in CGA).
These methods can indicate which classes overlap and point out ambiguous input features. In addition to saliency maps, Simonyan et al. [65] proposed a method for generating a representative visualization for a specific class that was learned by a CNN. For this purpose, they applied activation maximization, i.e., starting with a blank image, each pixel is changed by utilizing back-propagation so that the activity of a neuron is increased. The resulting visualizations give a first impression about the patterns learned but are highly abstract and can only be interpreted to a limited extent. To generate visualizations that are easier to interpret, Nguyen et al. [48] proposed a method to constrain the optimization process by image priors that were learned automatically. Lapuschkin et al. [35] proposed the Spectral Relevance Analysis (SpRAy), which summarizes a model’s learned strategies by analyzing similarities and dissimilarities over large quantities of input relevance maps computed with respect to a category of interest.

For gait classification, prediction explanation is desirable to provide clinical experts with detailed information about which patterns in the input signals are important for a specific prediction. Additionally, based on aggregations of these explanations, differences between patient groups can be assessed, i.e., in terms of class-specific model explanations. In this context, post-hoc methods are preferable, because they provide a classifier-agnostic approach (can be applied to any classification model) and do not require retraining or additional labels. We, therefore, choose an established post-hoc explainability method, i.e., LRP, in our experiments.
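The aggregation of individual prediction explanations into class-specific explanations can be sketched as follows. This is only a minimal illustration, assuming that the aggregation is a plain arithmetic mean of the per-sample relevance vectors within each class; the function name `class_mean_relevance`, the class labels, and the toy values are hypothetical and not taken from the study.

```python
def class_mean_relevance(relevances, labels):
    """Average per-sample relevance vectors within each class.

    relevances: list of equally long lists of floats (one per sample).
    labels:     list of class labels, one per sample.
    Returns a dict mapping each label to its mean relevance vector.
    """
    sums, counts = {}, {}
    for r, y in zip(relevances, labels):
        if y not in sums:
            sums[y] = [0.0] * len(r)
            counts[y] = 0
        # Accumulate the relevance of this sample element by element.
        sums[y] = [s + v for s, v in zip(sums[y], r)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


# Toy data (assumption): three samples with two input values each.
relevances = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = ["knee", "knee", "healthy"]
means = class_mean_relevance(relevances, labels)
print(means)
```

In practice the relevance vectors would have the same length as the classifier input, so each entry of a class mean can be plotted against the corresponding point of the input signal.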
3 APPROACH AND METHODOLOGY The general approach we followed in this study was to design and train classification models for automated gait classification tasks (see Figure 1(B)) based on three-dimensional ground reaction forces (GRFs) of both legs (see Figure 1(A)), to explain the predictions of these models based on relevance scores that are related to the input signal space by using LRP (see Figure 1(C)) and to evaluate these results from a statistical (see Figure 1(D)) and a clinical perspective (see Figure 1(E)). The experimental setup, including a detailed description of the data (pre-) processing and classification pipeline, can be found in Section 4. 3.1 Gait Classification The main task in automated gait classification is to determine whether a person has a healthy or pathological gait pattern based on gait measurements. We employed three-dimensional GRFs of the affected and unaffected sides as input signals and investigated the classification performance of several state-of-the-art classification methods. Furthermore, the input signals were fed directly into the classification models. This ensures that the results of the employed explainability method (LRP) can be directly mapped to the original signals. For easier interpretation of the XAI results, we refrained from using data reduction techniques such as, e.g., Principal Component Analysis (PCA), which is a common practice in automated gait classification [12, 22, 68]. 3.2 Prediction Explanation We employed Layer-wise Relevance Propagation (LRP) for prediction explanation [7] as a propagation-based post-hoc method that provides explanations in the input space, which is the space where the signals are usually interpreted by experts in clinical practice. LRP reversely iterates over the layered structure of an ML model to produce an explanation. Consider a neural network: f (x ) = fL ◦ · · · ◦ f1 (x ). (1) ACM Transactions on Computing for Healthcare, Vol. 3, No. 2, Article 14. 
Publication date: December 2021.

Fig. 1. Overview of our workflow for data acquisition, prediction, and prediction explanation in automated gait classification, showing the data of one participant belonging to the knee disorder class. (A) The clinical gait analysis consists of five recordings of each participant walking barefoot (unassisted) a distance of 10 m at a self-selected walking speed. Two centrally embedded force plates capture the three-dimensional ground reaction forces (GRFs) during the stance phase of the right and left foot. (B) The GRF comprising the medio-lateral (GRFML), anterior-posterior (GRFAP), and vertical (GRFV) force components of the affected and unaffected side are used as a time-normalized and concatenated input vector $x$ (1 × 606-dimensional) for the prediction of the knee disorder class using a classifier (e.g., CNN). (C) Decomposition of input relevance scores is achieved using LRP. The color spectrum for the visualization of input relevance scores of the model predictions is shown in the bottom right corner. Black line segments are irrelevant to the model's prediction. Warm hues identify input segments causing a prediction corresponding to the class label, while cool hues are features contradicting the class label. (D) Statistical and (E) clinical evaluation of class-specific (averaged) relevance scores.

An SVM model can be regarded as a single-layer neural network and thus as a special case of Equation (1). In a forward pass, activations are computed at each layer $f_l$ of the neural network, depending on the learned parameters of the model and the previous layers' activations. The activation score in the output layer $f_L$ forms the prediction $f(x)$, which is then, for a specific class and neuron of interest, back-propagated and redistributed layer by layer until the input is reached.
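The forward pass and the layer-by-layer redistribution described above can be illustrated on a toy example. The following is a minimal sketch, not the implementation used in the study: it assumes a tiny fully connected ReLU network without bias terms, uses made-up weights and inputs, and applies the ε-stabilized redistribution rule that the text states as Equation (2). The function names `forward` and `lrp_epsilon` are hypothetical.

```python
# Minimal LRP-epsilon sketch for a tiny dense ReLU network (illustration only).
# Assumptions: no bias terms, made-up weights; names are hypothetical.

def sign(z):
    # Stabilizer sign convention: +1 for z >= 0, otherwise -1.
    return 1.0 if z >= 0 else -1.0

def forward(x, layers):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for idx, weights in enumerate(layers):
        a = activations[-1]
        # Pre-activation z_k = sum_j a_j * w_jk
        z = [sum(a[j] * weights[j][k] for j in range(len(a)))
             for k in range(len(weights[0]))]
        is_last = idx == len(layers) - 1
        # ReLU on hidden layers, raw score on the output layer.
        activations.append(z if is_last else [max(0.0, v) for v in z])
    return activations

def lrp_epsilon(activations, layers, target, eps=1e-5):
    """Redistribute the output score of neuron `target` back to the input."""
    relevance = [0.0] * len(activations[-1])
    relevance[target] = activations[-1][target]
    # Walk the layers in reverse, pairing each weight matrix with its input activations.
    for weights, a in zip(reversed(layers), reversed(activations[:-1])):
        n_in, n_out = len(a), len(weights[0])
        z = [sum(a[j] * weights[j][k] for j in range(n_in)) for k in range(n_out)]
        relevance = [
            sum(a[j] * weights[j][k] / (z[k] + eps * sign(z[k])) * relevance[k]
                for k in range(n_out))
            for j in range(n_in)
        ]
    return relevance

layers = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9], [0.2, 0.2, 0.2]],  # 4 -> 3
    [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]],                                   # 3 -> 2
]
x = [1.0, 0.5, -0.5, 2.0]
acts = forward(x, layers)
R = lrp_epsilon(acts, layers, target=0)
print("output score:", acts[-1][0], "sum of input relevance:", sum(R))
```

With a small ε and no bias terms, the summed input relevance stays close to the explained output score, which is the conservation behaviour the text refers to.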
The method yields time- and signal-resolved input relevance scores $R_i$ for each individual value $x_i$ of the input vector. The redistribution process follows a conservation principle analogous to Kirchhoff's laws in electrical circuits, i.e., all relevance assigned to any neuron during the back-propagation process is redistributed without loss to its inputs in the underlying layer. The relevance back-propagation flow is illustrated in Figure 2. Various purpose-specific propagation rules have been proposed in the literature [7, 32, 44]. For example, the LRP$_\varepsilon$ rule [7] is defined as:

$$R_{j \leftarrow k} = \frac{z_{jk}}{z_k + \varepsilon \cdot \mathrm{sign}(z_k)} R_k, \tag{2}$$

where $z_{jk} = a_j w_{jk}$ is the quantity propagated from the $j$th input neuron to the $k$th output neuron within a given layer, depending on the input activation $a_j$ and the learned weight parameters $w_{jk}$. The quantity $z_k = \sum_j z_{jk}$ is the pre-activation of the $k$th output neuron, aggregating all forward-propagated $z_{jk}$, which includes any potential bias terms. The variable $\varepsilon \geq 0$ is a free parameter for tuning the decomposition rule, with the intent to suppress noisy forward activations $z_{jk}$ and to avoid divisions by zero (see footnote 1). Equation (2) redistributes $R_k$ to all input components $j$ in proportion to the relative contribution of $z_{jk}$ to $z_k$.

Footnote 1: For this purpose, the sign function is defined as $\mathrm{sign}(x) = 1$ if $x \geq 0$ and $-1$ otherwise [7].

Fig. 2. Illustration of the LRP back-propagation procedure applied to a neural network function $f(x) = f_L \circ \cdots \circ f_1(x)$. The prediction at the output is propagated backward in the network until the input features are reached and relevance scores are obtained for all input features and hidden units as $R_i$, $R_j$, and $R_k$, respectively. The propagation flow is shown in red.

After the step of relevance decomposition, lower layer
neuron relevance is aggregated from incoming relevance messages as $R_j = \sum_k R_{j \leftarrow k}$. Other propagation rules, such as LRP$_\gamma$ [44], LRP$_{\alpha\beta}$, LRP$_{z^B}$, or LRP$_\flat$, are suitable for other application scenarios, layer types, or particularly deeper neural networks [32, 44, 58] and have been shown to work well in practice [57].

LRP makes it possible to explain the prediction of an ML model in terms of the partial contributions of individual input values. LRP indicates which information a model uses to predict in favor of or against an output class. It thereby enables the interpretation of input relevance scores and their dynamics as a representation of a certain class (i.e., healthy controls or functional disorders in ankle, knee, or hip). For the explanation of predictions, we decomposed the input relevance scores of each gait trial with LRP. To analyze patterns learned for a specific class, we used LRP to decompose the model output for the ground-truth label (and not necessarily the predicted class) of the trial. For the visualization of the explanations, we averaged the underlying GRF signals and the resulting input relevance scores over all trials of a class. Given that the models investigated in this study are comparatively shallow and are largely unaffected by detrimental effects such as gradient shattering [10, 44, 45], we performed relevance decomposition according to LRP$_\varepsilon$ with $\varepsilon = 10^{-5}$ in all layers across the different models. The only exception is the CNN, for which we employed the LRP$_\flat$ rule at the input layer; this rule uniformly distributes a neuron's relevance score $R_k$ across its receptive field, disregarding any applied transformations $w_{jk}$ or input activations $a_j$ [32].

3.3 Statistical Evaluation

To evaluate the derived relevance scores of LRP, we employed Statistical Parametric Mapping (SPM) [51, 52], which has recently received increased attention in the gait analysis community [11, 49].
While standard inferential statistical approaches tend to reduce time-continuous signals to single time-discrete values for statistical testing, SPM allows the entire time-continuous signal to be used to draw probabilistic conclusions. It follows the same notion and logic as classical inferential statistics. The main advantages of SPM are that the statistical results are presented in the original sampling space and that there is no need for a (potentially biasing) parameterization technique [51, 52]. Since the LRP explanations and the results of SPM reside in the same space (the input signal space), we can leverage SPM to demonstrate the meaningfulness of LRP explanations from a statistical point of view.

LRP and SPM can both be considered explainability approaches; however, they target different goals. SPM fits linear models (e.g., general linear models) to the data and tries to explain differences in the data (i.e., differences between groups or classes). SPM can thus be considered a data-centric explainability method. LRP tries to explain the inner workings of complex (non-linear) models and can thus be considered a model-centric explainability method. The two methods are therefore complementary. Another difference is that LRP can explain individual model predictions (even without using ground-truth information), while SPM explains data characteristics by taking the ground-truth information (group or class information) into account. As part of Section 6.3, we will discuss the results obtained with both approaches to address the additional value of LRP in CGA.

Table 1. Demographic Details of the Employed Dataset for Each Pre-defined Class

| Class | N | Age (yrs.), Mean (SD) | Body Mass (kg), Mean (SD) | Gender (m/f) | Walking Speed (m/s) | Num. Trials |
|---|---|---|---|---|---|---|
| Healthy Control | 62 | 36.0 (10.8) | 72.3 (15.0) | 28/34 | 4.1 (0.3) | 310 |
| Hip | 37 | 44.2 (12.5) | 81.4 (14.1) | 31/6 | 3.7 (0.3) | 185 |
| Knee | 52 | 43.5 (13.8) | 85.6 (16.4) | 37/15 | 3.5 (0.4) | 260 |
| Ankle | 43 | 42.6 (10.9) | 91.6 (20.4) | 36/7 | 3.4 (0.4) | 215 |
| Total | 194 | 41.1 (12.4) | 81.9 (18.0) | 132/62 | 3.7 (0.5) | 970 |

For the statistical evaluation, we computed independent t-tests using the SPM1D package (see footnote 2) provided by Pataky [52] for Matlab and investigated differences in each GRF signal between two classes (for visualization purposes, we concatenated the results obtained on each GRF component). To take into account the dependence of SPM results on the choice of a distinct alpha level, we performed experiments with three different alpha levels: 0.01, 0.05, and 0.1. The output of SPM provides t-values for each point of the investigated time series and the threshold corresponding to the chosen alpha level. The t-values exceeding this threshold indicate statistically significant differences in the corresponding sections of the time series. For better visibility, we depicted these significant sections as gray-shaded areas in Figure 5 and Figure 6. We used three different shades of gray for the three different alpha levels, i.e., dark gray for 0.01, gray for 0.05, and light gray for 0.1. Additionally, we computed the effect size by transforming the resulting t-values to Pearson's correlation coefficient $r$ using the definition by Rosenthal [56]. The effect size provides an indicator of the discriminativeness of a given signal region independent of the alpha level.
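The conversion from t-values to an effect size can be sketched per time point. The example below is a simplified illustration under stated assumptions, not the SPM1D procedure: it computes an ordinary pooled-variance two-sample t statistic at each time point on made-up data and converts it with the commonly used relation $r = \sqrt{t^2 / (t^2 + df)}$. It does not reproduce the SPM significance threshold, and the function names are hypothetical.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df

def t_to_r(t, df):
    """Effect size r = sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t * t / (t * t + df))

def pointwise_effect_size(group_a, group_b):
    """group_a, group_b: lists of equally long time series, one list per trial."""
    results = []
    for i in range(len(group_a[0])):
        t, df = pooled_t([s[i] for s in group_a], [s[i] for s in group_b])
        results.append((t, t_to_r(t, df)))
    return results

# Made-up toy signals with three time points and three trials per class.
hc = [[0.0, 1.0, 0.5], [0.2, 1.1, 0.5], [0.1, 0.9, 0.6]]
gd = [[0.1, 0.6, 0.5], [0.0, 0.7, 0.6], [0.2, 0.5, 0.4]]
effects = pointwise_effect_size(hc, gd)
for i, (t, r) in enumerate(effects):
    print(f"time point {i}: t = {t:.3f}, r = {r:.3f}")
```

Because $r$ depends only on $t$ and the degrees of freedom, it can be plotted along the time axis without choosing an alpha level.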
3.4 Clinical Evaluation

To evaluate the derived relevance scores of LRP from a clinical perspective, two clinical experts with more than 10 and more than 25 years of experience in human gait analysis, respectively, analyzed the explainability results. The experts evaluated the extent to which regions with the highest input relevance scores correspond to GRF characteristics from clinical practice and assessed the usefulness of explainability approaches for CGA.

4 EXPERIMENTAL SETUP

4.1 Data Recording and Dataset

For the gait classification task, we utilized a subset of the large-scale GaitRec dataset [28]. This dataset is part of an existing clinical gait database maintained by a local Austrian rehabilitation center. Before conducting our experiments, approval was obtained from the local Ethics Committee (#GS1-EK-4/299-2014). The employed dataset contains bilateral three-dimensional GRF recordings of patients and healthy controls walking unassisted at a self-selected walking speed on an approximately 10 m walkway with two centrally embedded force plates (Kistler, Type 9281B12, Winterthur, CH). Data were recorded at 2,000 Hz, filtered with a zero-lag 2nd-order Butterworth filter with a cut-off frequency of 20 Hz, time-normalized to 101 points (100% stance phase), and amplitude-normalized to 100% body weight. During one session, participants walked barefoot or in socks until a minimum of five valid recordings were available. Recordings were defined as valid by an experienced assessor.

Footnote 2: SPM1D v.0.4, http://www.spm1d.org/.

Fig. 3. Visualization of vertical (left panel), anterior-posterior (central panel), and medio-lateral (right panel) force components of the body weight-normalized GRF measurements of the affected side available per participant and class.
For healthy controls, all available measurements are visualized. Mean and standard deviation signals (calculated per class) are highlighted as solid and dashed colored lines.

In total, the dataset comprises GRF measurements from 132 patients with lower-body gait disorders (GD) and data from 62 healthy controls (HC), both of various physical composition and gender. The dataset includes three classes of orthopaedic gait disorders associated with the hip (H, N = 37), knee (K, N = 52), and ankle (A, N = 43). For class-specific demographic details of the data, refer to Table 1. The dataset is balanced regarding the number of recorded sessions per person and the number of trials per person. Figure 3 shows an overview of all GRF measurements of the affected side (except for healthy controls, where each step is visualized) per class and the associated mean and standard deviation. The GD classes (A, H, and K) include patients after joint replacement surgery, fractures, ligament ruptures, and related disorders associated with the above-mentioned anatomical areas. An experienced physical therapist with more than a decade of clinical experience manually labeled the dataset based on the available medical diagnosis of each patient.

4.2 Input Data Preparation

The input data for each classification task is a concatenated version of the three-dimensional GRF signals from both force plates (see Figure 1). The concatenation of all six GRF signals (three force components per force plate) results in a 1 × 606-dimensional input vector for each gait trial. The three-dimensional GRF signals are the medio-lateral horizontal force (GRFML), anterior-posterior horizontal force (GRFAP), and vertical force (GRFV).
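The assembly of the 606-dimensional input vector can be sketched as follows. This is a minimal illustration, assuming the ordering given in this section and in the caption of Figure 5 (affected side first, each side ordered as GRFML, GRFAP, GRFV, with 101 points per signal). The function names and the dictionary layout are hypothetical, and the signal values are placeholders.

```python
N_POINTS = 101                      # time-normalized samples per signal (0% to 100% stance)
COMPONENTS = ("ML", "AP", "V")      # assumed component order within each side
SIDES = ("affected", "unaffected")  # assumed side order in the input vector

def build_input_vector(affected, unaffected):
    """Concatenate six GRF signals into one flat list of 606 values.

    `affected` and `unaffected` map "ML", "AP", "V" to lists of 101 values.
    """
    vector = []
    for side in (affected, unaffected):
        for name in COMPONENTS:
            signal = side[name]
            if len(signal) != N_POINTS:
                raise ValueError(f"{name} must contain {N_POINTS} points")
            vector.extend(signal)
    return vector

def feature_index(side, component, percent):
    """1-based input feature index for a side, component, and stance percentage."""
    offset = SIDES.index(side) * len(COMPONENTS) * N_POINTS
    return offset + COMPONENTS.index(component) * N_POINTS + percent + 1

# Placeholder signals (constant values) just to exercise the functions.
affected = {name: [0.0] * N_POINTS for name in COMPONENTS}
unaffected = {name: [1.0] * N_POINTS for name in COMPONENTS}
x = build_input_vector(affected, unaffected)
print(len(x), feature_index("affected", "V", 20))
```

The helper `feature_index` makes it easy to check which input features belong to the same relative time of the stance phase across components.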
The dataset includes only unilateral gait disorders, i.e., disorders where the main physical limitation can be attributed to one leg (the affected leg/side in the following). The signal components of the affected leg (input features 1 to 303) are concatenated first and are followed by the signal components of the unaffected leg (input features 304 to 606) in the input vector. For the healthy controls, there is no affected and unaffected side (both sides are unaffected). Thus, the order of the signals was randomly assigned, while ensuring an equal distribution, to avoid any bias regarding the side.

4.3 Data Normalization

Normalization of input vectors is applied to ensure an equal contribution of all six GRF signals to the classification models and thus to prevent signals with larger numeric ranges from dominating those with smaller numeric ranges [14, 31]. We applied min-max normalization to the input signals and thereby scaled each signal to the range [0, 1]. The global minimum and maximum values were determined separately for each of the six GRF signals over all trials.

4.4 Classification Tasks

We investigate different classification tasks on the dataset introduced above to provide a more comprehensive picture of the investigated problem and to enable the differentiation between task-specific and general observations. Classification tasks include:

• binary classification between healthy controls and all gait disorders (HC/GD),
• binary classification between healthy controls and each gait disorder separately (i.e., HC/H, HC/K, and HC/A),
• multi-class classification between healthy controls and all gait disorders (HC/H/K/A),
• and multi-class classification between all gait disorders (H/K/A).

4.5 Classification Methods

In our experiments, three representative machine learning approaches, i.e., (linear) SVM, MLP, and CNN, were compared in terms of prediction accuracy and learned input relevance patterns.
The SVM models were trained using a standard quadratic optimization algorithm, with an error penalty parameter $C = 0.1$ and $\ell_2$-constrained regularization of the learned weight vector $w$. The MLP models comprised three consecutive fully connected layers with ReLU non-linearities activating the hidden neurons and a final SoftMax activation in the output layer. The size of both hidden layers is 768, whereas the size of the output layer is $c$, where $c$ is the number of target classes. The CNN models process the given data via three consecutive convolutional layers, with a [filter size]-[stride]-[number of filters] configuration of 8-2-24, 8-2-24, and 6-3-48, and ReLUs for non-linear neuron activation. The resulting 48 × 48 feature map is then unrolled into a 2,304-dimensional vector and fed into a fully connected layer, which directly maps to the model output. This fully connected layer is followed by a SoftMax output activation, which acts as a multi-class predictor over the $c$ target classes. Both the MLP and the CNN models were trained via standard error back-propagation using stochastic gradient descent [37] and a mean absolute ($\ell_1$) loss function. The training procedure was executed for $3 \cdot 10^4$ iterations with mini-batches of five randomly selected training samples and an initial learning rate of $5 \cdot 10^{-3}$. The learning rate was decreased after every $10^4$-th training iteration, first to $10^{-3}$ (by a factor of 0.2) and then to $5 \cdot 10^{-4}$ (by a factor of 0.5). Model weights were initialized with random values drawn from a normal distribution with $\mu = 0$ and $\sigma = m^{-1/2}$, where $m$ is the number of inputs to each output neuron of the layer [37]. Since the CNN receives a 1 × 606-dimensional input vector, its convolution operations can be understood as 1D convolutions, moving over the time axis only.
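The reported size of the CNN feature map can be checked with a short calculation. The sketch below assumes that each configuration triplet denotes filter size, stride, and number of filters, and that the convolutions use no padding; under these assumptions the arithmetic reproduces the 48 × 48 feature map and the 2,304-dimensional vector stated above. The function name is hypothetical.

```python
def conv1d_output_length(n_in, kernel, stride):
    """Output length of an unpadded ("valid") 1D convolution."""
    return (n_in - kernel) // stride + 1

# (filter size, stride, number of filters) per layer, as interpreted above.
config = [(8, 2, 24), (8, 2, 24), (6, 3, 48)]

length = 606  # length of the concatenated GRF input vector
lengths = []
for kernel, stride, n_filters in config:
    length = conv1d_output_length(length, kernel, stride)
    lengths.append(length)
    print(f"kernel={kernel}, stride={stride}, filters={n_filters} -> length {length}")

flattened = lengths[-1] * config[-1][2]
print("flattened size:", flattened)  # 48 positions x 48 filters = 2304
```

The intermediate lengths are 300, 147, and 48, so the last layer yields 48 time positions for each of its 48 filters.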
We used 1D convolutions to maintain comparability with the two other classification methods (MLP and SVM). Preliminary experiments demonstrated negligible differences between 1D and 2D CNNs.

4.6 Performance Evaluation

The prediction accuracies were reported over a stratified 10-fold cross-validation configuration, where eight partitions of the data are used for training, one partition is used as a validation set, and the remaining partition is reserved for testing. The samples from each class were distributed evenly while ensuring that all gait trials from an individual participant were placed in the same partition of the data, to rule out person-related information influencing the measured model performance during testing. All results are reported as mean with standard deviation (SD), unless otherwise stated. Additionally, we calculated the Zero-Rule baseline (ZRB) for each classification task. The ZRB refers to the theoretical accuracy obtained by assigning class labels according to the prior probabilities of the classes, i.e., the target labels are always set to the class with the greatest cardinality in the training dataset.

4.7 Implementation

The three ML methods and the LRP method were implemented in Python 3.7 (Python Software Foundation, USA). Data preprocessing, SPM, and the visualization of the results were performed in Matlab 2017b (MathWorks, USA). Our source code and the utilized dataset are publicly available at: https://github.com/sebastian-lapuschkin/explaining-deep-clinical-gait-classification.

5 RESULTS

We first present the results obtained in our classification experiments as well as from the explainability analysis and then discuss them in detail in Section 6.
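As a reference point for the accuracies reported in this section, the Zero-Rule baseline defined in Section 4.6 can be approximated from the trial counts in Table 1. This is a rough sketch under one simplifying assumption: it uses the counts of the whole dataset rather than those of the individual training partitions, so the values only approximate the per-fold baselines. The function name and the dictionary layout are hypothetical.

```python
def zero_rule_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Number of trials per class, taken from Table 1.
trials = {"HC": 310, "H": 185, "K": 260, "A": 215}
gd_total = trials["H"] + trials["K"] + trials["A"]

tasks = {
    "HC/GD": {"HC": trials["HC"], "GD": gd_total},
    "HC/H": {"HC": trials["HC"], "H": trials["H"]},
    "HC/K": {"HC": trials["HC"], "K": trials["K"]},
    "HC/A": {"HC": trials["HC"], "A": trials["A"]},
    "HC/H/K/A": dict(trials),
    "H/K/A": {name: trials[name] for name in ("H", "K", "A")},
}

baselines = {task: zero_rule_baseline(counts) for task, counts in tasks.items()}
for task, value in baselines.items():
    print(f"{task}: {100 * value:.1f}%")
```

The baseline differs considerably between tasks because the class proportions differ, which is why accuracies are best read relative to the baseline of the respective task.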
We start with a presentation of the classification accuracies achieved for the different classification methods, tasks, and normalization methods (Section 5.1) and continue with a presentation of the explainability results obtained by LRP (Section 5.2).

5.1 Classification Results

The mean prediction accuracy was clearly superior to the ZRB for all three classification methods (CNN, SVM, and MLP) and all classification tasks (see Figure 4 and supplementary Table S1). A 3 × 2 repeated measures analysis of variance (ANOVA) (classification method: CNN, SVM, and MLP; normalization: min-max and non-normalized) conducted for each classification task only indicated a significant difference in classification accuracy between the three classifiers for task HC/GD ($F_{2,18} = 4.038$, $p = 0.036$, $\eta_p^2 = 0.310$). However, the differences were generally not practically relevant (<2%), and additional pairwise Bonferroni-corrected post-hoc tests failed to identify any differences as significant. No other significant differences were found for the classifiers' performances. Regarding normalization, the ANOVA revealed two simple main effects of normalization, for task H/K/A ($F_{1,9} = 7.269$, $p = 0.025$, $\eta_p^2 = 0.447$) and task HC/H/K/A ($F_{1,9} = 9.054$, $p = 0.015$, $\eta_p^2 = 0.502$). Estimated marginal means for normalization during Bonferroni-corrected post-hoc tests showed a performance increase of 6% and 3% for H/K/A and HC/H/K/A, respectively. No further significant effects or differences were found.

Fig. 4. Overview of the prediction accuracy obtained for the three employed classification methods (CNN, SVM, and MLP) and all classification tasks with min-max normalized and non-normalized input signals, reported as boxplots enhanced with the classification accuracies obtained over 10-fold cross-validation (represented as individual dots).

5.2 Explainability Results

In the following, we present in detail the results for classification task HC/GD together with the respective result visualizations. Figure 5 shows an exemplary result for prediction explanation by LRP, i.e., the averaged signals together with the color-coded averaged relevance values for each of the 606 input values for task HC/GD with min-max normalized GRF signals. The input relevance values point out which GRF characteristics were most relevant for (or contradictory to) the classification of a certain class (HC or GD). For visualization, input values neutral to the prediction ($R_i \approx 0$) are shown in black, while warm hues indicate input values supporting the prediction of the analyzed class ($R_i \gg 0$) and cool hues identify contradictory input values ($R_i \ll 0$).

For binary classification tasks (HC/GD, HC/H, HC/K, and HC/A), note that a high input relevance value for one class results in a contradictory input relevance value for the other class. Therefore, the total relevance, which is the absolute sum of the relevance scores of both classes, is a good indicator of the overall relevance of an input value for a respective classification task. The higher the total relevance in a certain signal region, the more discriminative this region is for the two underlying classes.

Fig. 5. Results overview for the classification of healthy controls (HC) and the aggregated class of all three gait disorders (GD) based on min-max normalized GRF signals using a CNN as classifier. (A) Averaged GRF signals for HC and GD. The first three signals represent the three GRF components of the affected side and are followed by the three GRF components of the unaffected side. Note that the data for both sides are composed of three GRF components (e.g., input features of the affected side: 1 to 101 (GRFML), 102 to 202 (GRFAP), and 203 to 303 (GRFV)). This means, for example, that input features 21 (GRFML), 122 (GRFAP), and 223 (GRFV) all correspond to the relative time of 20% of the same stance phase. The areas depicted in three different shades of gray for the three different alpha levels, i.e., dark gray for 0.01, gray for 0.05, and light gray for 0.1, highlight regions in the input signals where SPM indicates statistically significant differences between both classes (i.e., HC and GD). (B) Averaged GRF signals of all test trials as a line plot for the healthy controls class, with a band of one standard deviation, color-coded via input relevance values for the class (HC) obtained by LRP. (C) Averaged GRF signals of all test trials shown as a line plot for the class of all gait disorders (GD), in the same format as in (B). (D) Line plots showing the effect size computed as Pearson's correlation coefficient and the total relevance based on the absolute sum of the LRP relevance values of both classes (HC and GD). The total relevance correlates with the local discriminativity of the input signal for the classification task.

Figure 5 illustrates the signal regions of high input relevance for the classification between the HC and GD classes. These regions are prevalent within all GRF signal components. The most relevant regions for distinguishing between the two classes were found to include the local minima and maxima in the affected GRFV signal. A similar pattern, though less pronounced, appears in the unaffected GRFV. For GRFAP, LRP identified relevant regions in the affected and unaffected signals, with the maximum peak in the affected signal being the most pronounced.
For GRFML, relevant information appears to be predominantly located around the first lateral peak of the affected side and the second lateral peak of the unaffected side. The identified regions of high total relevance according to LRP agree to a large extent with the signal regions assessed as significantly different by SPM (gray-shaded areas in Figure 5).

Fig. 6. Comparison of different classification methods (CNN, SVM, and MLP) for the classification of healthy controls and the class of all three gait disorders (HC/GD) based on min-max normalized GRF signals. The comparison is based on the total relevance of the LRP results as well as statistically significant differences (gray-shaded areas) and effect size computed as Pearson's correlation coefficient. Note that the gray-shaded areas and the effect size (green curve) are the same, while the total relevance varies between the three classification methods.

Figure 6 shows the effect size obtained via SPM and the total relevance according to LRP for the task HC/GD (with min-max normalized GRF signals as in Figure 5) and all three employed classification methods (CNN, SVM, and MLP). The relevance scores agree strongly between the three classification methods. In fact, only some signal regions are prioritized differently, e.g., the affected and unaffected GRFML at the beginning and the end of the signal. These results show that the investigated classification methods rely on the same regions in the input data with only small exceptions. For the sake of brevity, only the results for the classification task HC/GD were presented. For results of the other classification tasks, we refer the reader to the supplementary Figures S4, S7, S10 (CNN), Figures S6, S9, S12 (SVM), and Figures S5, S8, S11 (MLP).
In the following, the discussion will incorporate all binary classification tasks.

6 DISCUSSION

The primary aim of this article is to investigate whether XAI methods can enhance the explainability of ML predictions in clinical gait classification. In this section, the classification results are analyzed, compared, and interpreted in terms of classification accuracy and relevance-based explanations. These explanations are, furthermore, evaluated from a statistical and a clinical viewpoint. Additionally, we discuss dependencies, influences, and interesting observations with respect to different classification methods, tasks, normalization methods, and signal components (horizontal forces and affected/unaffected leg signals).

6.1 Classification Results

The results expressed in terms of classification accuracy (presented in Figure 4 and supplementary Table S1) demonstrate a comparable level of performance between the three different machine learning methods (CNN, SVM, and MLP). The achieved performance level is not only interesting by itself but also important information for further explainability experiments. The reason is that an objective analysis of explainability by a post-hoc method like LRP is only meaningful if the classification model can robustly differentiate between the target classes, i.e., a certain model quality is necessary to draw meaningful conclusions from explainability results. An analysis of unreliable classification models bears the potential risk that unstable patterns, noise, and spurious correlations bias the explainability results. For this reason, we excluded the classification tasks HC/H/K/A and H/K/A from our further investigation, as these tasks could not be solved with sufficient accuracy (i.e., an average classification accuracy above 80%).
For the binary classification tasks, this risk is much lower, because the higher classification accuracies (and deviations from the ZRB) obtained suggest that robust features can be found in the input data.

Another aspect we assessed is the influence of normalization on the input data (see Figure 4 and supplementary Table S1). The normalization of the input data is important for machine learning, since highly differing value ranges can have a negative influence on the classification model, i.e., input variables with a higher value range have a stronger influence on the predictions [14, 31]. The same appears to be the case for gait data, where the normalization of the input data strongly influences the classification models, as can be observed from the relevance scores of the horizontal forces in Figure 5 and supplementary Figure S13. Surprisingly, however, min-max normalization does not significantly improve the classification results (see Figure 4 and supplementary Table S1) for the investigated classification tasks. This raises the question of whether the use of GRFV alone would already be sufficient to solve the classification tasks. We discuss this seemingly contradictory behavior in the following section.

6.2 Explainability Results

In the following, we discuss different related aspects with regard to our first leading research question: Which input features or signal regions are most relevant for automatic gait classification? The visualizations for all classification tasks and classification methods can be found in the supplementary Figures S1–S12.

Which input features are relevant for the classification of functional gait disorders? LRP identified several regions of high relevance in the GRF signals for all classification tasks.
The ML models often used regions (and not single time-discrete values) encompassing peaks and valleys in the GRF signals to distinguish between the different classes, e.g., for task HC/GD using the CNN (see Figure 5) in the affected and unaffected GRFV (all three local maxima and minima), affected GRFAP (both peaks), unaffected GRFAP (first peak), affected GRFML (first lateral peak), and unaffected GRFML (both lateral peaks). The highest total relevance scores are present in the signals of the affected side and most commonly in GRFV for all investigated classification tasks. This is in line with earlier studies, e.g., where the peaks and valley (as time-discrete parameters) of the affected GRFV showed the highest discriminatory power [66].

Are signal regions of the unaffected side important for the classification of functional gait disorders? Across all classification tasks, relevant regions are also pronounced in the GRF signals of the unaffected side, but less so than in those of the affected side. In earlier studies [67, 68], we showed that the omission of the unaffected side during classification negatively affected classification accuracy. The explainability results confirm this observation. The unaffected side seems to capture complementary information relevant to the classification task under consideration. In particular, the identified relevant regions in the GRF signals occur at similar relative (e.g., in both peaks of GRFV) or absolute (e.g., the second peak of the affected GRFAP and the first peak of the unaffected GRFAP) time points of the stance phases of the unaffected and affected side.

Are the anterior-posterior and medio-lateral forces relevant for the task? While the highest total relevance scores can be observed in GRFV in most cases, relevant regions are always also observed in the horizontal GRF signals (GRFAP and GRFML).
However, the locations and degree of relevance within the horizontal signals vary for different classification tasks, e.g., for task HC/A, the highest relevance scores occur in the affected GRFAP (and GRFV) and hardly any relevant regions exist in GRFML (see supplementary Figure S10), while the highest relevance score for the task HC/H appears at the beginning of the affected GRFML (see supplementary Figure S4).

ACM Transactions on Computing for Healthcare, Vol. 3, No. 2, Article 14. Publication date: December 2021.

What is the impact of normalization on explainability results? Normalization of input data is a standard procedure prior to classification with ML models to ensure equal numerical ranges of different signals [14, 31]. XAI methods such as LRP make it possible to visualize the effects of normalization on the predictions of ML models directly at the level of the input signals. To gain a deeper understanding of these effects and the underlying data, we also conducted experiments without normalization of input data (see supplementary Figures S13–S24). For the classification of non-normalized GRF signals, the most relevant input values are located in GRFV, i.e., especially the two peaks and the valley in between are relevant for the tasks. A minimal degree of relevance can be observed in the peaks of the affected and unaffected GRFAP signals. The reason for the absence of relevant regions in the horizontal forces could be their small value range. The rather small range compared to the GRFV component may lead to a smaller influence on the training of the classification models. Explainability results for min-max normalized input data show that highly relevant regions are identified in the horizontal forces of the affected and unaffected side (e.g., Figure 5).
Thus, normalization amplifies the relevance of values in the horizontal forces and thereby makes them similarly important as GRFV. Based on the LRP relevance scores, we conclude that normalization is important to obtain unbiased predictions of ML models (bias introduced by different signal amplitudes).

Are all identified relevant regions necessary for the task? For all classification tasks and classification methods, with min-max normalized input data, many regions of the GRF signals are identified to be relevant for classification according to LRP. The classification performance with and without normalization does, however, not vary significantly for the binary classification tasks (see classification results in Section 5.1). This raises the question of whether all regions identified as relevant are necessary to achieve peak performance in classification or whether some of them are redundant (i.e., not yielding an increase in classification performance when combined). Note that the assumption of redundancy is supported by the fact that the three GRF components represent individual dimensions of the same three-dimensional physical process. Thus, a strong correlation is a priori given in the data.

To answer the question, we conducted additional experiments with occluded parts of the input vector and evaluated the changes in classification performance. Occlusion is realized by replacing the horizontal forces (GRFAP and GRFML) of both sides (affected and unaffected) with zero values. Table 2 shows the classification results for the experiments with occluded input signals as deviation from the mean classification accuracy of the experiments with non-occluded input signals. The results decrease on average when the horizontal forces are occluded (except for tasks HC/GD and HC/A using the CNN). Thus, relevant regions in the horizontal forces cannot be completely redundant to those in GRFV and, therefore, also represent complementary information.
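The two preprocessing variants discussed above, min-max normalization and occlusion of the horizontal forces, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the array layout (samples × 6 signals × 101 time points), the signal names, and the synthetic data are hypothetical, and the normalization range is computed here over all samples, whereas in practice it would be estimated on the training data only.

```python
import numpy as np

# Hypothetical layout: each sample holds 6 GRF signals
# (affected/unaffected x V, AP, ML), each with 101 time-normalized points.
SIGNALS = ["aff_V", "aff_AP", "aff_ML", "unaff_V", "unaff_AP", "unaff_ML"]
HORIZONTAL = [i for i, name in enumerate(SIGNALS) if not name.endswith("_V")]


def min_max_normalize(x):
    """Scale each signal to [0, 1] using its min/max over all samples and time points."""
    lo = x.min(axis=(0, 2), keepdims=True)
    hi = x.max(axis=(0, 2), keepdims=True)
    return (x - lo) / (hi - lo)


def occlude_horizontal(x):
    """Return a copy in which the horizontal components (AP and ML) are set to zero."""
    out = x.copy()
    out[:, HORIZONTAL, :] = 0.0
    return out


rng = np.random.default_rng(0)
# Synthetic stand-in data: the vertical force spans a much larger range
# than the horizontal forces, mimicking the situation described in the text.
scale = np.array([1.0, 0.2, 0.05, 1.0, 0.2, 0.05]).reshape(1, 6, 1)
grf = rng.normal(size=(20, 6, 101)) * scale

grf_norm = min_max_normalize(grf)            # equal value ranges for all signals
grf_occluded = occlude_horizontal(grf_norm)  # only GRFV of both sides remains
```

Training the same classifier once on `grf_norm` and once on `grf_occluded` and comparing the accuracies would correspond to the kind of comparison reported in Table 2.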
The complementary role of the horizontal forces is in line with previous quantitative performance evaluations [67, 68]. However, the classification results of the binary classification tasks are not influenced by the occlusion of horizontal forces in a statistically significant way. This was confirmed by several dependent t-tests (p > 0.05) with Bonferroni-Holm [25] correction. Our results indicate that the relevant regions identified by LRP may represent an over-complete set, which exhibits a certain degree of redundancy, as removing relevant sections does not necessarily lead to reduced classification performance. However, redundancy is not necessarily a negative property, as it may help to achieve higher robustness to noise and possibly also to outliers and missing data [29].

Do different ML methods rely on different patterns? A comparison of the three employed classification methods is depicted in Figure 6. Across all binary classification tasks, relevant signal regions for all three classification methods are largely consistent, especially with respect to their location. Minor differences exist in the amplitude of the relevance scores, e.g., at the beginning of the affected GRFV or the second peak in the affected GRFAP (see Figure 6). The similarities between MLP and SVM are more pronounced. The remaining binary classification tasks, i.e., HC/H (see supplementary Figures S4, S5, and S6), HC/K (see supplementary Figures S7, S8, and S9), and HC/A (see supplementary Figures S10, S11, and S12) confirm these findings. Although LRP clearly shows where the prediction is grounded, it cannot explain why these patterns are important. However, it makes it possible to identify and compare the learning strategies of different classification methods.

Can we derive additional properties of the models from the explanations, e.g., different learning strategies? Explanations provided by local XAI methods, such as LRP, inform about a model’s reasoning on individual samples.
A more general understanding about the model’s learned patterns can be obtained via the evaluation of larger sets of sample-specific explanations [34]. In the previous sections, we achieved this by averaging relevance patterns across all samples of a given class. To perform a more detailed analysis that is able to identify different learning strategies of the ML models, we propose the use of SpRAy [35] as described in [5] for clinical gait data. The basic idea of this approach is to cluster the relevance patterns obtained for different samples and classes and to analyze the resulting clusters and subclusters.

Table 2. Classification Results for the Experiment with Occluded Horizontal Forces (GRFAP, GRFML), in Percent

Task    Normalization   CNN    SVM    MLP
HC/GD   min-max          0.2   –1.4   –1.4
HC/H    min-max         –4.5   –6.5   –4.9
HC/K    min-max         –2.1   –3.7   –4.2
HC/A    min-max          1.5   –0.9   –1.3

The results are reported as mean deviation from the prediction accuracy of the original input signals presented in Figure 4 and supplementary Table S1, i.e., negative values signify a decrease and positive values an improvement in classification performance.

SpRAy is a statistical analysis method for the explorative discovery of a model’s characteristic prediction strategies from XAI-based relevance patterns. With its core in Spectral Clustering [43, 47], the method discovers structure within the set of given relevance patterns and yields, among its outputs, a spectral embedding Φ together with suggested groupings within the embedding in form of k cluster labels. Here, the embedding Φ directly corresponds to the individual relevance patterns, under consideration of their local, global, and potentially non-linear affinity structure.
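The idea of embedding relevance patterns spectrally and grouping them can be illustrated with a small, self-contained sketch. It is not the SpRAy implementation from CoRelAy: the Gaussian affinity with a median-distance bandwidth, the sign-based two-way split, and the synthetic "relevance patterns" are simplifying assumptions chosen only to keep the example short and dependent on NumPy alone.

```python
import numpy as np


def spectral_embedding(patterns, n_components=2):
    """Embed the rows of `patterns` using eigenvectors of the normalized graph Laplacian."""
    # Pairwise Euclidean distances between relevance patterns.
    diff = patterns[:, None, :] - patterns[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Gaussian affinity; the bandwidth is the median off-diagonal distance (an assumption).
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    affinity = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(patterns)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; the first eigenvector is trivial.
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, 1:n_components + 1]


rng = np.random.default_rng(0)
t = np.arange(101)


def bump(center):
    """Synthetic relevance peak centered at `center` percent of the stance phase."""
    return np.exp(-(t - center) ** 2 / (2 * 5.0 ** 2))


# Two hypothetical "strategies": relevance concentrated early vs. late in stance.
group_a = bump(20) + 0.05 * rng.normal(size=(20, 101))
group_b = bump(70) + 0.05 * rng.normal(size=(20, 101))
relevance = np.vstack([group_a, group_b])

phi = spectral_embedding(relevance, n_components=2)
# Two-way split by the sign of the first non-trivial eigenvector.
labels = (phi[:, 0] > 0).astype(int)
print(labels)
```

For more than two groups, the rows of `phi` would typically be passed to a clustering algorithm such as k-means instead of the sign split used here.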
Sets of samples with similar relevance patterns are tightly grouped together in the spectral embedding space, while samples with dissimilar patterns are located far apart. Together with the suggested cluster labels, the analytically derived solution in Φ can then be visualized in ℝ², e.g., via a t-SNE projection [5, 39]. We implemented and evaluated SpRAy using the CoRelAy³ framework [4] for Python.

Figure 7 shows exemplary SpRAy results for task HC/GD (with min-max normalized GRF signals) using the CNN as classification method. Based on the clustering provided in Figure 7(C) and 7(F), we see that the relevance patterns are grouped into clusters. This indicates that the ML model learned different classification strategies. Considering the ground truth class labels (see Figure 7(D)), we see that the model’s explanations for the overall gait disorder (GD) class are grouped into distinct clusters that contain samples from the individual gait disorder classes (H, K, and A), even though the model was never explicitly trained to do so in this classification task. This means that the model learned different strategies for different pathological subclasses in GD. Considering the participant labels (see Figure 7(B) and Figure 7(E)), we can see that the relevance patterns of the five trials of a participant are often clustered together. This means that the model learns similar strategies for the samples belonging to one participant. From a biomechanical perspective, this is plausible because each individual person has unique gait patterns that differ from the gait patterns of other individuals [30]. For clinical experts, it is important to see that the model is able to reflect such patterns. In conclusion, SpRAy demonstrates the ability of ML models to learn patterns and dependencies in the data without explicit label information.
For the clinical domain, this ability is of great value, since pathologies have various manifestations (that are sometimes even beyond the expertise of a clinical expert).

³ https://github.com/virelay/corelay.

Fig. 7. The spectral embedding Φ derived via SpRAy from LRP explanations for the CNN model on test data, visualized via t-SNE for samples labeled as healthy controls (HC; N = 30; subfigures A-C) and the aggregated class of all three gait disorders (GD = {H, K, A}; N = 65; subfigures D-F). Each column of panels marks the embedded sample explanations with respect to different sets of labels as indicated by color: (subfigures A/D) ground truth class labels (HC, H, K, A), (subfigures B/E) ground truth participant labels, and (subfigures C/F) cluster labels inferred via SpRAy for k = 8 clusters on Φ before projecting the spectral embedding into ℝ² via t-SNE. The figure shows that the relevance patterns are grouped into clusters, indicating that the ML model learned different classification strategies.

6.3 Statistical Evaluation

In the following, we investigate the statistical properties of the signal regions found to be relevant by LRP to answer the second leading research question: To what extent are input features or signal regions identified as being relevant for a given gait classification task statistically justified? To answer this question, we leverage SPM, which provides statistical inference estimates for each value of the input vector. We compare the LRP regions with those considered as significantly different by SPM. Results show that in the vast majority of cases, the SPM analysis shows statistically significant differences in regions that are also highly relevant for classification according to LRP.
Thus, for binary classification tasks, it seems that ML models base their predictions primarily on features that are also significantly different between the two classes. This can be observed across all classification tasks (e.g., see Figure 5(D) for task HC/GD). As the total relevance increases, the effect size usually also increases. We performed a cross-correlation to determine the relationship between the effect size and the total relevance. Both curves show highly correlated behavior for the min-max normalized input data for all classification tasks: HC/GD (r = 0.76), HC/H (r = 0.66), HC/K (r = 0.76), and HC/A (r = 0.78). However, minimal differences between the results of LRP and SPM can be detected, e.g., the location of the first relevant signal region in the unaffected GRFV. For all classification tasks, we observed that LRP already considers the slope to the first GRFV peak of the unaffected leg as relevant for the classification, whereas SPM, slightly shifted, emphasizes the region encompassing the peak itself with a high effect size. Future research is needed to address this observation and examine differences between LRP and SPM in more detail.

Fig. 8. Overview of the most relevant gait events during the stance phase. In clinical gait analysis, a gait cycle (100%) is defined from the initial contact of one foot to the subsequent initial contact of the same foot. During the first approximately 60% of the gait cycle, referred to as the stance phase (relevant time range for the present work), the foot is in contact with the ground.
The beginning of the stance phase is defined as initial contact with the ground (typically by the heel), then body weight is shifted to the supporting leg (loading response and mid-stance), followed by terminal stance (forward propulsion), pre-swing (preparation of the swing phase), and toe-off. Adapted from References [9, 62].

Concerning our second research question, we conclude that the relevance estimates according to LRP are to the greatest extent statistically justified. The second part of the research question regarding the validity of the explanations with respect to clinical assessment is investigated in the following section.

6.4 Clinical Evaluation

To what extent are input features or signal regions identified as being relevant for a given gait classification task in line with clinical assessment? This question is answered in the following by two clinical experts in human gait analysis. To assist the reader in following the discussion and to facilitate the interpretation of the input signals, the domain-specific terms and gait cycle definitions are described in Figure 8. For further details on the principles of human gait and its clinical implications, the interested reader is referred to literature such as Perry and Burnfield [53] or Winter [79].

The explainability results for classification of healthy controls (HC) and the aggregated class of all three gait disorders (GD) based on min-max normalized GRF signals illustrate clinically meaningful patterns (see Figure 5). High LRP relevance scores occurred during loading response, terminal stance, and pre-swing in GRFAP and GRFML as well as in loading response, mid-stance, terminal stance, and pre-swing in GRFV. These phases are especially sensitive toward gait anomalies as loading response requires the absorption of body weight and terminal stance plays an essential role for forward propulsion [33]. Both aspects are affected in case of gait impairments due to a diminished walking speed (requiring less absorption or push-off) as well as factors that go along with an injury, such as the presence of pain, a decreased range of motion, and/or lessened muscle strength [64, 78].

When analyzing the explainability results in more detail, one can identify specific gait dynamics that can be traced back to an impairment at a certain joint level. For classification task HC/A (see supplementary Figure S10), we can observe pronounced peaks in the total relevance curves of GRFAP and GRFV caused by alterations in the terminal stance and pre-swing phase of the affected side. This is in agreement with the observations of Son et al. [69], who found a significantly increased propulsive force (GRFAP in terminal stance) for patients with chronic ankle instability. They also identified an increased GRFV during late terminal stance (push-off) compared to healthy controls, which is also in line with the relevance scores obtained in our study. Both our explainability results and the study of Son et al. [69] did not indicate any relevance or difference to healthy controls in the GRFML.

For classification task HC/K, the highest LRP relevance scores are present in GRFV, GRFAP, and GRFML (see supplementary Figure S7). Changes in GRFV may result from lessened knee flexibility that hinders typical knee dynamics over the entire course of the stance phase. More precisely, healthy walking requires a slightly flexed knee joint during initial contact followed by a knee flexion thereafter, by definition called loading response. During the mid-stance phase the walker’s center of gravity is shifted forward and thus demands further knee extension. This is in line with the study of Cook et al.
[15], who analyzed the effects of restricted knee flexion and walking speed on the GRFV. According to their results, the loading rate (slope during loading response), unloading rate (slope during pre-swing), and peak GRFV of the restricted leg showed significant speed-knee flexion restriction interactions.

Highest LRP relevance values for the classification task HC/H are obtained during loading response and terminal stance in GRFV of the affected side (see supplementary Figure S4). McCrory et al. [41] and Martinez-Ramirez et al. [40] identified the GRFV as an objective measure of gait for patients following hip arthroplasty. McCrory et al. [41] found significant differences between patients and healthy controls in several variables of the GRFV such as the first and second local peaks, impulse, and stance time. They also identified that the unaffected side holds relevant information, as significant differences were found in the GRFV either compared to the control group or the affected side. This is also seen in our obtained LRP relevance scores for the classification task HC/H where two distinct relevance peaks are present for GRFV for the first and second GRFV peak of the affected side. These results are also in agreement with Martinez-Ramirez et al. [40], who demonstrated that patients after successful hip arthroplasty still show significantly altered GRFV for both the affected and unaffected leg including a continuing GRFV asymmetry between both sides.

With regard to our second research question, we conclude that signal regions with high relevance according to LRP can be largely associated with clinical gait analysis literature and are plausible from a clinical point of view according to two domain experts.

6.5 On the Usefulness of XAI Methods for Clinical Gait Analysis

XAI methods increase transparency and can make the decision process of ML models more comprehensible for clinical experts.
Transparency of state-of-the-art ML models is crucial to promote the acceptance of such systems in clinical practice, allowing clinicians to benefit from the high classification accuracy that ML models achieve, which in some cases already exceeds human performance [16, 21, 42]. In the previous subsections (i.e., Sections 6.3 and 6.4), we showed that explainability results are consistent from a statistical and domain experts’ point of view. In particular, regions of high relevance according to LRP are highly discriminatory according to SPM, and the clinical experts associated these regions with clinical explanations. Having evaluated the explainability results, we now want to address the question: What is the added value that XAI methods can provide to clinical practice?

Fig. 9. Comparison of explainability results of the original (top) and walking speed-matched (bottom) data for the classification task HC/K based on the min-max normalized GRF signals using CNN.

The two experts reported that they mainly focus on regions in the GRFV signals during the evaluation process of patients in clinical practice. In particular, the evaluation of the unaffected GRFV is very important for the clinicians. The main motivation for this is that many compensatory patterns manifest in this signal, i.e., as patients try to put as little weight on the affected leg as possible, they take shorter steps with the unaffected leg. This is reflected in a reduced slope in the unaffected GRFV during loading response. Our explainability results show that in addition to regions in GRFV, regions in GRFML and GRFAP are also highly relevant for the classification tasks. These signals are less considered in clinical practice.
However, the relevant regions in GRFML and GRFAP indicate additional information about the classification of pathological gait patterns.

Explainability approaches can lead to novel insights and a deeper understanding of the models and the underlying data as illustrated in the following example. In the clinical evaluation of the explainability results, the experts also identified relevant regions for the ML models that are not directly related to the specific functional gait disorders, according to their personal expertise and the literature. The experts assumed that, e.g., the relevant regions in the affected and unaffected GRFV, in particular during mid-stance, terminal stance, and pre-swing, are strongly influenced by differences in walking speed between healthy controls and patients. From this observation the clinical experts derived the hypothesis that the trained ML models might be biased by the walking speed. Using the HC/K classification task as an example, we examined whether there is a significant difference in walking speed between HC and K. An independent samples t-test revealed a statistically significant difference in walking speed between HC and K (p < 0.001). The differences in walking speed affect the shape of the signals (although the signals were time-normalized) and the ML models could have learned these dissimilarities. To assess the influence of walking speed on the ML models, we repeated the experiment for the task HC/K on a subsample of the original data. This subsample does not exhibit statistically significant differences with respect to walking speed (independent samples t-test; p = 0.068).
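The check described above can be sketched as follows. The walking speeds are synthetic, and the matching step (`restrict_to_overlap`) is a hypothetical stand-in, since the text does not specify how the speed-matched subsample was drawn; only the use of an independent samples t-test is taken from the text. The sketch assumes SciPy is available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic walking speeds in m/s (illustrative values, not from the study).
speed_hc = rng.normal(loc=1.30, scale=0.10, size=60)
speed_k = rng.normal(loc=1.10, scale=0.10, size=60)

# Independent samples t-test on the full groups.
t_full, p_full = stats.ttest_ind(speed_hc, speed_k)


def restrict_to_overlap(a, b):
    """Hypothetical matching step: keep only values inside the range shared by both groups."""
    lo = max(a.min(), b.min())
    hi = min(a.max(), b.max())
    return a[(a >= lo) & (a <= hi)], b[(b >= lo) & (b <= hi)]


hc_matched, k_matched = restrict_to_overlap(speed_hc, speed_k)
# Re-run the test on the subsample to see whether a speed difference remains.
t_matched, p_matched = stats.ttest_ind(hc_matched, k_matched)

print(f"full data:    p = {p_full:.3g}")
print(f"matched data: p = {p_matched:.3g} (n = {len(hc_matched)}, {len(k_matched)})")
```

Restricting both groups to a shared range does not guarantee that the difference vanishes, so the second test remains necessary; a stricter procedure, such as pairing participants by speed, may be required in practice.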
A comparison of the explainability results obtained for task HC/K (with min-max normalized GRF signals) using CNNs that were trained on the original and walking speed-matched data is presented in Figure 9. The results for the walking speed-matched data clearly show that most of the relevant regions according to LRP agree with the regions obtained for the original data (with only small changes in amplitude). However, relevant regions in the unaffected GRFV after loading response are less relevant for the model trained on walking speed-matched data. Thus, in contrast to the model trained on the original data, this model barely takes these regions into account. The conclusion that can be drawn is that these regions are related to differences in walking speed. Using our XAI approach, we have been able to show that some degree of walking speed-related bias was learned in the original models, but that this influence was not as strong as assumed by the clinical experts. Another interesting aspect of the experiment concerns the SPM results. While the trend of effect size and the total relevance remain similar, the statistically significant regions are clearly reduced (compare gray-shaded areas for both settings in Figure 9), showing the sensitivity of SPM to the alpha level.

Overall, we showed that our proposed XAI approach exhibits substantial usefulness for the clinical setting, as we were able to demonstrate that: (i) regions in the signals that are less focused on in the literature and clinical evaluation, i.e., GRFAP and GRFML, also contain informative and relevant regions that can be associated with the underlying pathology, (ii) ML models learn different strategies for different samples and patient groups (experiment with SpRAy; see Section 6.2), and (iii) XAI methods allow the identification of biases in ML models, e.g., with respect to normalization or walking speed-related differences between classes.
The increased transparency provides additional insights into the working mechanisms of the trained ML models, enabling clinicians to better understand them and increase their level of trust [70].

6.6 Limitations and Future Work

A fundamental problem in evaluating the explainability results is the absence of a ground truth. A challenge in interpreting the explainability results is that alterations of the input signals can be caused not only by the influence of a pathology, but also by other independent parameters, e.g., a lower walking speed or an increased body mass. To minimize potential biases introduced by independent parameters on prediction explanations, future research should attempt to develop normalization procedures for input signals that compensate for such influencing factors or develop classification models that inherently learn the relationship between influencing factors and input signals.

Another limiting factor is that we solely used GRF signals for classification. This does not perfectly reflect best practice in clinical gait analysis where clinicians usually base medical decisions on a combination of GRF and 3D kinematic data [9]. The additional use of kinematic data is expected to improve the classification accuracy to an appropriate level for clinical application, in particular for multi-class classification tasks. However, 3D kinematic data are prone to several difficulties such as inconsistencies due to inter-assessor and inter-laboratory differences [20, 60]. This makes it more difficult to create a homogeneous, large-scale, and real-world dataset compared to using simple data, such as GRF signals. Thus, the utilized GaitRec data [28] provide a large-scale dataset with an easy to comprehend clinical example, which makes it possible to showcase how XAI methods can support transparency of ML models and their predictions.
Besides visual explanations as presented in this article, a translation into human-understandable textual explanations would be desired for clinical application. An interesting direction for future research is the generation of textual explanations based on biomechanical parameters estimated from the input signals. This would enable approaches that exceed pure explainability and provide deeper interpretations for clinical experts in the form of, e.g., “there is a high probability of a pathology in the knee due to a limited knee extension during the mid-stance phase.”

We will conduct further research to compare different explanation methods and rule-based approaches [32] for different classification tasks and datasets. In addition, we want to point out that quantitative and objective methods are necessary to assess the quality of prediction explanations [57] including datasets with respective ground truth explanations.

7 CONCLUSION

The present findings highlight that the investigated ML models base their predictions on meaningful features of GRF signals in various clinical gait classification tasks. These features are in accordance with a statistical and clinical evaluation. Hence, XAI methods that provide explainability for predictions provided by ML models, such as LRP, can be promising to increase justification of automatic classification predictions in CGA and can help to make the prediction processes comprehensible to clinical experts. Thereby, XAI may facilitate the application of ML-based decision-support systems in clinical practice. Within the scope of our analysis, we were able to show that:

• Highly relevant regions were identified in the signals of the affected and unaffected sides.
Thus, the unaffected side captures additional information that is relevant for automated gait classifications.
• For time-series data such as GRF signals, SPM has proven to be a suitable statistical reference. Highly relevant regions in the input data (according to LRP) are in most cases also significantly different (according to SPM) and in line with clinical evaluation.
• In addition to GRFV, the horizontal forces contain regions of high relevance, which is consistent with clinical gait analysis literature.
• ML models seem to learn an over-complete set of features that may contain redundant information. This might explain why the occlusion of horizontal forces and input normalization in our experiments had negligible influence on the classification accuracies.
• ML models for gait classification are able to learn different strategies for individual persons and patient groups.
• Explainability approaches can help to detect bias in ML models and help to assess their correct working, which is important for clinicians to enable building trust in the predictions of these models.

This article represents a first step towards establishing explainability of ML approaches for time-series classification. Thereby, we want to promote the application of ML in clinical gait analysis to support medical decision-making in the future.

CONFLICT OF INTEREST STATEMENT

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

DS, BH, A-MR prepared the dataset. DS, FH, SL, BH, WS, WIS conceived the presented idea. BH, WS, WIS, MZ raised the funding. DS, FH, SL, BH, A-MR, AK, WS, CB, WIS, MZ participated in the data analysis. DS, FH, SL, BH, A-MR, MZ wrote the manuscript. DS, FH, SL, BH, A-MR, AK, WS, MZ designed the figures. DS, FH, SL, BH, A-MR, AK, WS, CB, WIS, MZ reviewed and approved the final manuscript.
DATA AVAILABILITY STATEMENT For our analyses, we used a subset of the GaitRec dataset [28]. Our source code and the utilized dataset are publicly available at: https://github.com/sebastian-lapuschkin/explaining-deep-clinical-gait-classification. ACM Transactions on Computing for Healthcare, Vol. 3, No. 2, Article 14. Publication date: December 2021. 2.4. Explaining Machine Learning Models for Clinical Gait Analysis 109 14:24 • D. Slijepcevic et al. REFERENCES [1] Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160. DOI: https://doi.org/10.1109/ACCESS.2018.2870052 [2] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18). Curran Associates Inc., Red Hook, NY, 9525–9536. [3] Murad Alaqtash, Thompson Sarkodie-Gyan, Huiying Yu, Olac Fuentes, Richard Brower, and Amr Abdelgawad. 2011. Automatic clas- sification of pathological gait patterns using ground reaction forces and machine learning algorithms. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS). IEEE, 453–457. DOI: https://doi.org/10.1109/ IEMBS.2011.6090063 [4] Christopher J. Anders, David Neumann, Wojciech Samek, Klaus-Robert Müller, and Sebastian Lapuschkin. 2021. Software for Dataset- wide XAI: From local explanations to global insights with Zennit, CoRelAy, and ViRelAy. CoRR abs/2106.13200 (2021). [5] Christopher J. Anders, Leander Weber, David Neumann, Wojciech Samek, Klaus-Robert Müller, and Sebastian Lapuschkin. 2022. Find- ing and removing clever Hans: Using explanation methods to debug and improve deep models. Information Fusion 77 (2022), 261–295. DOI:https://doi.org/10.1016/j.inffus.2021.07.015 [6] Vijay Arya, Rachel K. E. 
Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Q. Vera Liao, Ronny Luss, Aleksandra Mojsilovic, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra, John T. Richards, Prasanna Sattigeri, Karthikeyan Shanmugam, Moninder Singh, Kush R. Varshney, Dennis Wei, and Yunfeng Zhang. 2019. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. CoRR abs/1909.03012 (2019).
[7] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One 10, 7 (2015), e0130140. DOI: https://doi.org/10.1371/journal.pone.0130140
[8] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. J. Mach. Learn. Res. 11 (2010), 1803–1831. Retrieved from http://portal.acm.org/citation.cfm?id=1859912.
[9] Richard Baker. 2013. Measuring Walking: A Handbook of Clinical Gait Analysis. Mac Keith Press, London.
[10] David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 342–350.
[11] Brian G. Booth, Noël L. W. Keijsers, Jan Sijbers, and Toon Huysmans. 2018. STAPP: Spatiotemporal analysis of plantar pressure measurements using statistical parametric mapping. Gait Post. 63 (2018), 268–275.
[12] Johannes Burdack, Fabian Horst, Sven Giesselbach, Ibrahim Hassan, Sabrina Daffner, and Wolfgang I. Schöllhorn. 2020. Systematic comparison of the influence of different data preprocessing methods on the performance of gait classifications using machine learning. Front. Bioeng. Biotechnol. 8 (2020), 260. DOI: https://doi.org/10.3389/fbioe.2020.00260
[13] Tom Chau.
2001. A review of analytical techniques for gait data. Part 1: Fuzzy, statistical and fractal methods. Gait Post. 13, 1 (Feb. 2001), 49–66. DOI: https://doi.org/10.1016/S0966-6362(00)00094-1
[14] François Chollet. 2017. Deep Learning with Python. Manning Publications Company, Shelter Island, NY.
[15] Thomas M. Cook, Kevin P. Farrell, Iva A. Carey, Joan M. Gibbs, and Gregory E. Wiger. 1997. Effects of restricted knee flexion and walking speed on the vertical ground reaction force during gait. J. Orthop. Sports Phys. Therap. 25, 4 (1997), 236–244. DOI: https://doi.org/10.2519/jospt.1997.25.4.236
[16] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 7639 (2017), 115–118. DOI: https://doi.org/10.1038/nature21056
[17] European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Offic. J. Eur. Union L 119 (2016), 1–88. Retrieved from https://eur-lex.europa.eu/eli/reg/2016/679/oj.
[18] Joana Figueiredo, Cristina P. Santos, and Juan C. Moreno. 2018. Automatic recognition of gait patterns in human motor disorders using machine learning: A review. Med. Eng. Phys. 53 (2018), 1–12. DOI: https://doi.org/10.1016/j.medengphy.2017.12.006
[19] Ruth C. Fong and Andrea Vedaldi. 2017. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 3429–3437. DOI: https://doi.org/10.1109/ICCV.2017.371
[20] George E. Gorton, David A. Hebert, and Mary E. Gannotti. 2009. Assessment of the kinematic variability among 12 motion analysis laboratories. Gait Post. 29, 3 (2009), 398–402.
DOI: https://doi.org/10.1016/j.gaitpost.2008.10.060
[21] Holger A. Haenssle, Christine Fink, R. Schneiderbauer, Ferdinand Toberer, Timo Buhl, A. Blum, A. Kalloo, A. Ben Hadj Hassen, Luc Thomas, A. Enk, et al. 2018. Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29, 8 (2018), 1836–1842.
[22] Eni Halilaj, Apoorva Rajagopal, Madalina Fiterau, Jennifer L. Hicks, Trevor J. Hastie, and Scott L. Delp. 2018. Machine learning in human movement biomechanics: Best practices, common pitfalls, and new opportunities. J. Biomech. 81 (2018), 1–11.
[23] Jianxing He, Sally L. Baxter, Jie Xu, Jiming Xu, Xingtao Zhou, and Kang Zhang. 2019. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 1 (2019), 30–36. DOI: https://doi.org/10.1038/s41591-018-0307-0
[24] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 3–19. DOI: https://doi.org/10.1007/978-3-319-46493-0_1
[25] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 2 (1979), 65–70.
[26] Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. 2017. What do we need to build explainable AI systems for the medical domain? CoRR abs/1712.09923 (2017).
[27] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. 2019. Causability and explainability of artificial intelligence in medicine. Data Mining Knowl. Discov. 9, 4 (July 2019), e1312.
DOI: https://doi.org/10.1002/widm.1312
[28] Brian Horsak, Djordje Slijepcevic, Anna-Maria Raberger, Caterine Schwab, Marianne Worisch, and Matthias Zeppelzauer. 2020. GaitRec, a large-scale ground reaction force dataset of healthy and impaired gait. Sci. Data 7, 1 (May 2020), 1–8. DOI: https://doi.org/10.1038/s41597-020-0481-z
[29] Fabian Horst, Sebastian Lapuschkin, Wojciech Samek, Klaus-Robert Müller, and Wolfgang I. Schöllhorn. 2019. Explaining the unique nature of individual gait patterns with deep learning. Sci. Rep. 9, 1 (2019), 2391. DOI: https://doi.org/10.1038/s41598-019-38748-8
[30] Fabian Horst, Markus Mildner, and Wolfgang I. Schöllhorn. 2017. One-year persistence of individual gait patterns identified in a follow-up study – A call for individualised diagnose and therapy. Gait Post. 58 (2017), 476–480. DOI: https://doi.org/10.1016/j.gaitpost.2017.09.003
[31] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2016. A Practical Guide to Support Vector Classification. Technical Report. National Taiwan University. Retrieved from https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[32] Maximilian Kohlbrenner, Alexander Bauer, Shinichi Nakajima, Alexander Binder, Wojciech Samek, and Sebastian Lapuschkin. 2020. Towards best practice in explaining neural network decisions with LRP. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). IEEE, 1–7.
[33] Arthur D. Kuo and J. Maxwell Donelan. 2010. Dynamic principles of gait and their clinical implications. Phys. Ther. 90, 2 (2010), 157–174. DOI: https://doi.org/10.2522/ptj.20090125
[34] Sebastian Lapuschkin, Alexander Binder, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2912–2920.
[35] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking clever Hans predictors and assessing what machines really learn. Nat. Commun. 10 (2019), 1096. DOI: https://doi.org/10.1038/s41467-019-08987-4
[36] Hong-yin Lau, Kai-yu Tong, and Hailong Zhu. 2009. Support vector machine for classification of walking conditions of persons after stroke with dropped foot. Hum. Movem. Sci. 28, 4 (Aug. 2009), 504–514. DOI: https://doi.org/10.1016/j.humov.2008.12.003
[37] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. 2012. Efficient BackProp. In Neural Networks: Tricks of the Trade - Second Edition. Springer, 9–48. DOI: https://doi.org/10.1007/978-3-642-35289-8_3
[38] Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., 4765–4774. Retrieved from http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
[39] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, Nov. (2008), 2579–2605.
[40] Alicia Martínez-Ramírez, Dirk Weenk, Pablo Lecumberri, Nico Verdonschot, Dean Pakvis, and Peter H. Veltink. 2014. Assessment of asymmetric leg loading before and after total hip arthroplasty using instrumented shoes. J. NeuroEng. Rehabil. 11, 1 (2014), 20. DOI: https://doi.org/10.1186/1743-0003-11-20
[41] Jean L. McCrory, Scott C. White, and Robert M. Lifeso. 2001. Vertical ground reaction forces: Objective measures of gait following hip arthroplasty. Gait Post. 14, 2 (2001), 104–109. DOI: https://doi.org/10.1016/S0966-6362(01)00140-0
[42] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg C. Corrado, Ara Darzi, et al. 2020.
International evaluation of an AI system for breast cancer screening. Nature 577, 7788 (2020), 89–94.
[43] Marina Meila and Jianbo Shi. 2001. A random walks view of spectral segmentation. In Proceedings of the International Workshop on Artificial Intelligence and Statistics (AISTATS).
[44] Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, Wojciech Samek, and Klaus-Robert Müller. 2019. Layer-wise relevance propagation: An overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 193–209. DOI: https://doi.org/10.1007/978-3-030-28954-6_10
[45] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. 2017. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn. 65 (2017), 211–222.
[46] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2018. Methods for interpreting and understanding deep neural networks. Dig. Sig. Process. 73 (2018), 1–15. DOI: https://doi.org/10.1016/j.dsp.2017.10.011
[47] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 849–856.
[48] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. 2016. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. Curran Associates, Inc., 3387–3395. Retrieved from http://papers.nips.cc/paper/6519-synthesizing-the-preferred-inputs-for-neurons-in-neural-networks-via-deep-generator-networks.pdf.
[49] Angela Nieuwenhuys, Eirini Papageorgiou, Kaat Desloovere, Guy Molenaers, and Tinne De Laet. 2017. Statistical parametric mapping to identify differences between consensus-based joint patterns during gait in children with cerebral palsy. PLoS One 12, 1 (2017).
[50] Corina Nüesch, Victor Valderrabano, Cora Huber, Vinzenz von Tscharner, and Geert Pagenstert. 2012. Gait patterns of asymmetric ankle osteoarthritis patients. Clin. Biomech. 27, 6 (July 2012), 613–618. DOI: https://doi.org/10.1016/j.clinbiomech.2011.12.016
[51] Todd C. Pataky. 2010. Generalized n-dimensional biomechanical field analysis using statistical parametric mapping. J. Biomech. 43, 10 (July 2010), 1976–1982. DOI: https://doi.org/10.1016/j.jbiomech.2010.03.008
[52] Todd C. Pataky. 2012. One-dimensional statistical parametric mapping in Python. Comput. Meth. Biomech. Biomed. Eng. 15, 3 (Mar. 2012), 295–301. DOI: https://doi.org/10.1080/10255842.2010.527837
[53] Jacquelin Perry and Judith M. Burnfield. 2010. Gait Analysis: Normal and Pathological Function (2nd ed.). Slack, Thorofare, NJ.
[54] Angkoon Phinyomark, Giovanni Petri, Esther Ibáñez-Marcelo, Sean T. Osis, and Reed Ferber. 2018. Analysis of big data in gait biomechanics: Current trends and future directions. J. Med. Biol. Eng. 38, 2 (2018), 244–260. DOI: https://doi.org/10.1007/s40846-017-0297-2
[55] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. CoRR abs/1606.05386 (2016).
[56] Robert Rosenthal. 1991. Meta-Analytic Procedures for Social Research. SAGE Publications Inc. DOI: https://doi.org/10.4135/9781412984997
[57] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. 2017. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 28, 11 (Nov. 2017), 2660–2673. DOI: https://doi.org/10.1109/TNNLS.2016.2599820
[58] Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J.
Anders, and Klaus-Robert Müller. 2021. Explaining deep neural networks and beyond: A review of methods and applications. Proc. IEEE 109, 3 (2021), 247–278. DOI: https://doi.org/10.1109/JPROC.2021.3060483
[59] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. ITU J.: ICT Discov. 1, 1 (2017), 39–48.
[60] Emilia Scalona, Roberto Di Marco, Enrico Castelli, Kaat Desloovere, Marjolein Van Der Krogt, Paolo Cappa, and Stefano Rossi. 2019. Inter-laboratory and inter-operator reproducibility in gait analysis measurements in pediatric subjects. Int. Biomech. 6, 1 (2019), 19–33. DOI: https://doi.org/10.1080/23335432.2019.1621205
[61] Wolfgang I. Schöllhorn. 2004. Applications of artificial neural nets in clinical biomechanics. Clin. Biomech. 19, 9 (2004), 876–898. DOI: https://doi.org/10.1016/j.clinbiomech.2004.04.005
[62] Huijuan Shi, Hongshi Huang, Yuanyuan Yu, Zixuan Liang, Si Zhang, Bing Yu, Hui Liu, and Yingfang Ao. 2018. Effect of dual task on gait asymmetry in patients after anterior cruciate ligament reconstruction. Sci. Rep. 8, 1 (2018), 1–10.
[63] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 3145–3153.
[64] Maureen J. Simmonds, C. Ellen Lee, Bruce R. Etnyre, and G. Stephen Morris. 2012. The influence of pain distribution on walking velocity and horizontal ground reaction forces in patients with low back pain. Pain Res. Treatm. (2012), 11. DOI: https://doi.org/10.1155/2012/214980
[65] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR). Retrieved from http://arxiv.org/abs/1312.6034.
[66] Djordje Slijepcevic, Matthias Zeppelzauer, Anna-Maria Gorgas, Caterine Schwab, Michael Schüller, Arnold Baca, Christian Breiteneder, and Brian Horsak. 2017. Automatic classification of functional gait disorders. IEEE J. Biomed. Health Inf. 22, 5 (2017), 1653–1661. DOI: https://doi.org/10.1109/JBHI.2017.2785682
[67] Djordje Slijepcevic, Matthias Zeppelzauer, Caterine Schwab, Anna-Maria Raberger, Christian Breiteneder, and Brian Horsak. 2020. Input representations and classification strategies for automated human gait analysis. Gait Post. 76 (2020), 198–203. DOI: https://doi.org/10.1016/j.gaitpost.2019.10.021
[68] Djordje Slijepcevic, Matthias Zeppelzauer, Caterine Schwab, Anna-Maria Raberger, Bernhard Dumphart, Arnold Baca, Christian Breiteneder, and Brian Horsak. 2018. P 011-Towards an optimal combination of input signals and derived representations for gait classification based on ground reaction force measurements. Gait Post. 65 (2018), 249. DOI: https://doi.org/10.1016/j.gaitpost.2018.06.155
[69] S. Jun Son, Hyunsoo Kim, Matthew K. Seeley, and J. Ty Hopkins. 2019. Altered walking neuromechanics in patients with chronic ankle instability. J. Athlet. Train. 54, 6 (2019), 684–697. DOI: https://doi.org/10.4085/1062-6050-478-17
[70] Fabian Sperrle, Mennatallah El-Assady, Grace Guo, Rita Borgo, Duen Horng Chau, Alex Endert, and Daniel Keim. 2021. A survey of human-centered evaluations in human-centered machine learning. Comput. Graph. Forum 40, 3 (2021).
[71] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41, 3 (2014), 647–665. DOI: https://doi.org/10.1007/s10115-013-0679-x
[72] Erico Tjoa and Cuntai Guan. 2019.
A survey on explainable artificial intelligence (XAI): Towards medical XAI. CoRR abs/1907.07374 (2019).
[73] Eric J. Topol. 2019. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 25, 1 (2019), 44–56. DOI: https://doi.org/10.1038/s41591-018-0300-7
[74] Leen Van Gestel, Tinne De Laet, Enrico Di Lello, Herman Bruyninckx, Guy Molenaers, Anja Van Campenhout, Erwin Aertbeliën, Mike Schwartz, Hans Wambacq, Paul De Cock, and Kaat Desloovere. 2011. Probabilistic gait classification in children with cerebral palsy: A Bayesian approach. Res. Devel. Disab. 32, 6 (Nov. 2011), 2542–2552. DOI: https://doi.org/10.1016/j.ridd.2011.07.004
[75] Markus Wagner, Djordje Slijepcevic, Brian Horsak, Alexander Rind, Matthias Zeppelzauer, and Wolfgang Aigner. 2018. KAVAGait: Knowledge-assisted visual analytics for clinical gait analysis. IEEE Trans. Visualiz. Comput. Graph. 25, 3 (2018), 1528–1542.
[76] Ferdous Wahid, Rezaul K. Begg, Chris J. Hass, Saman Halgamuge, and David C. Ackland. 2015. Classification of Parkinson’s disease gait using spatial-temporal gait features. IEEE J. Biomed. Health Inf. 19, 6 (2015), 1794–1802.
[77] Nils Wilhelm, Anna Vögele, Rebeka Zsoldos, Theresia Licka, Björn Krüger, and Jürgen Bernard. 2015. FuryExplorer: Visual-interactive exploration of horse motion capture data. In Visualization and Data Analysis 2015. International Society for Optics and Photonics, 93970F. DOI: https://doi.org/10.1117/12.2080001
[78] Carin Willén, Katarina Stibrant Sunnerhagen, Claes Ekman, and Gunnar Grimby. 2004. How is walking speed related to muscle strength? A study of healthy persons and persons with late effects of polio. Arch. Phys. Med. Rehabil. 85, 12 (2004), 1923–1928. DOI: https://doi.org/10.1016/j.apmr.2003.11.040
[79] David A. Winter. 2009. Biomechanics and Motor Control of Human Movement (4th ed.). Wiley, Hoboken, NJ.
[80] Sebastian Wolf, Tobias Loose, Matthias Schablowski, Leonhard Döderlein, Rüdiger Rupp, Hans Jürgen Gerner, Georg Bretthauer, and Ralf Mikut. 2006. Automated feature assessment in instrumented gait analysis. Gait Post. 23, 3 (2006), 331–338. DOI: https://doi.org/10.1016/j.gaitpost.2005.04.004
[81] Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. 2017. Visualizing deep neural network decisions: Prediction difference analysis. In Proceedings of the International Conference on Learning Representations (ICLR).
[82] Jacek M. Zurada, Aleksander Malinowski, and Ian Cloete. 1994. Sensitivity analysis for minimization of input data dimension for feedforward neural network. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 447–450. DOI: https://doi.org/10.1109/ISCAS.1994.409622
[83] Niels J. S. Morch, Ulrik Kjems, Lars Kai Hansen, Claus Svarer, Ian Law, Benny Lautrup, Steve Strother, and Kelly Rehm. 1995. Visualization of neural networks using saliency maps. In Proceedings of ICNN’95-International Conference on Neural Networks, vol 4. IEEE, 2085–2090. DOI: https://doi.org/10.1109/ICNN.1995.488997
[84] Erik Štrumbelj and Igor Kononenko. 2010. An efficient explanation of individual classifications using game theory. J. Mach. Learn. Res. 11 (2010), 1–18.
Received July 2020; revised June 2021; accepted July 2021
Supplementary Material for: Explaining Machine Learning Models for Clinical Gait Analysis
DJORDJE SLIJEPCEVIC, Institute of Creative Media Technologies, Department of Media and Digital Technologies, St.
Pölten University of Applied Sciences
FABIAN HORST, Department of Training and Movement Science, Institute of Sport Science, Johannes Gutenberg-University Mainz
SEBASTIAN LAPUSCHKIN, Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute
BRIAN HORSAK, Institute of Health Sciences, Department of Health Sciences, St. Pölten University of Applied Sciences and Center for Digital Health and Social Innovation, St. Pölten University of Applied Sciences, Austria
ANNA-MARIA RABERGER, Institute of Health Sciences, Department of Health Sciences, St. Pölten University of Applied Sciences
ANDREAS KRANZL, Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Vienna-Speising
WOJCIECH SAMEK, Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute
CHRISTIAN BREITENEDER, Institute of Visual Computing and Human-Centered Technology, TU Wien
WOLFGANG IMMANUEL SCHÖLLHORN, Department of Training and Movement Science, Institute of Sport Science, Johannes Gutenberg-University Mainz
MATTHIAS ZEPPELZAUER, Institute of Creative Media Technologies, Department of Media and Digital Technologies, St. Pölten University of Applied Sciences
The supplementary material presents additional results we generated for the article “Explaining Machine Learning Models for Clinical Gait Analysis”. The primary aim of this article is to explain which class-specific characteristics Machine Learning (ML) models learn from clinical gait analysis (CGA) data. For this purpose, we investigate different gait classification tasks, employ a representative set of classification methods, i.e., (linear) Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and Convolutional Neural Network (CNN), and an Explainable Artificial Intelligence (XAI) method, i.e., Layer-wise Relevance Propagation (LRP), to explain predictions at the signal (input) level. Subsequently, the explanations of the individual predictions are aggregated to obtain class-specific model explanations.
Since there is no ground truth for automatically generated explanations in this context, we suggest a two-step approach for the evaluation of the obtained explanations. First, we analyze the discriminatory power of the obtained explanations from a statistical perspective. For this purpose, we leverage Statistical Parametric Mapping (SPM) to derive statistical measures along with the input signals and thereby investigate how statistically justified the obtained explanations are. Second, two experienced clinical experts interpret the explainability results from a clinical perspective to evaluate whether the obtained explanations match characteristics from clinical practice.
© 2021 Copyright held by the owner/author(s). 2637-8051/2021/12-ART14 $15.00 https://doi.org/10.1145/3474121
The dataset employed comprises ground reaction force (GRF) measurements from 132 patients with gait disorders (GD) and data from 62 healthy controls (HC). The GD class is furthermore differentiated into three classes of gait disorders associated with the hip (H), knee (K), and ankle (A). The classification tasks that form the basis of the XAI investigation, selected because of the high classification accuracies obtained, include a binary classification between healthy controls and all gait disorders (HC/GD), and a binary classification between healthy controls and each gait disorder separately, i.e., HC/H, HC/K, and HC/A. The classification results obtained for all classification tasks are presented in supplementary Table S1. The following figures visualize the relevance-based explanations obtained with LRP. The input vector for the classifiers comprises concatenated affected and unaffected GRF signals. These GRF signals are time-normalized to 101 points (100% stance phase), thus the input vector contains 606 values.
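The layout of the 606-value input vector described above can be illustrated with a minimal sketch. It assumes three GRF components per side (vertical plus two horizontal components, which is consistent with 2 sides × 3 components × 101 points = 606), linear interpolation for the time normalization, and an arbitrary component ordering; the function names and the synthetic data are hypothetical and are not taken from the published source code.

```python
import numpy as np

N_POINTS = 101  # 0..100 % of stance phase, as stated in the text


def time_normalize(signal, n_points=N_POINTS):
    """Resample a 1-D signal to n_points samples (linear interpolation, an assumption)."""
    signal = np.asarray(signal, dtype=float)
    old_axis = np.linspace(0.0, 1.0, num=signal.size)
    new_axis = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(new_axis, old_axis, signal)


def build_input_vector(affected, unaffected):
    """Concatenate the time-normalized components of both sides into one vector.

    `affected` and `unaffected` are sequences of 1-D arrays, one per GRF
    component (assumed here: vertical, anterior-posterior, medio-lateral).
    """
    parts = [time_normalize(c) for c in list(affected) + list(unaffected)]
    return np.concatenate(parts)


# Synthetic stance phases with different raw lengths (illustrative data only).
rng = np.random.default_rng(0)
affected = [rng.normal(size=230) for _ in range(3)]
unaffected = [rng.normal(size=198) for _ in range(3)]
x = build_input_vector(affected, unaffected)
print(x.shape)  # (606,)
```

With this layout, each block of 101 consecutive values corresponds to one GRF component of one side, so a relevance vector of the same length can be split back into six per-signal curves for plotting.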
For each value, LRP indicates whether it is relevant for the classification. Sub-figure (A) shows mean GRF signals averaged over each class of the classification task. The shaded areas in all sub-figures highlight areas in the input signals where SPM resulted in a statistically significant difference between both classes. Sub-figure (B) shows mean GRF signals (including a band of one standard deviation) for the HC class. The input relevance indicates which GRF characteristics were most relevant for (or contradictory to) the classification of a certain class. For visualization, input values neutral to the prediction (Ri ≈ 0) are shown in black, while warm hues indicate input values supporting the prediction (Ri > 0) of the analyzed class and cool hues identify contradictory input values (Ri < 0). Sub-figure (C) depicts mean GRF signals averaged over a pathological class (H, K, or A) or all gait disorders (GD), in the same format as in sub-figure (B). Sub-figure (D) shows the effect size computed as Pearson’s correlation coefficient and the total relevance, which is calculated as the sum of the absolute input relevance values of both classes. The total relevance indicates the common relevance of the input signal for the classification task.
CLASSIFICATION RESULTS
Table S1. Overview of the Prediction Accuracy Obtained for the Three Employed Classification Methods (CNN, SVM, and MLP) and All Classification Tasks with Min–Max Normalized and Non-Normalized Input Signals, Reported as Mean (Standard Deviation) Over the Ten-Fold Cross Validation in Percent
Task        Normalization   ZRB    CNN           SVM           MLP
HC/GD       no norm.        68.0   87.8 (4.5)    88.6 (4.9)    88.1 (4.8)
HC/GD       min-max         68.0   88.0 (5.0)    88.4 (5.3)    88.8 (5.0)
HC/H        no norm.        62.6   85.1 (8.2)    85.9 (8.4)    86.6 (7.9)
HC/H        min-max         62.6   85.5 (8.0)    87.1 (7.6)    86.7 (8.5)
HC/K        no norm.        54.4   84.8 (9.9)    85.7 (9.0)    86.1 (7.9)
HC/K        min-max         54.4   85.9 (9.3)    88.5 (7.2)    88.5 (7.6)
HC/A        no norm.        59.0   88.7 (5.5)    89.1 (5.9)    88.3 (6.3)
HC/A        min-max         59.0   86.7 (8.3)    87.6 (7.4)    86.5 (8.1)
H/K/A       no norm.        39.4   48.0 (10.1)   46.4 (9.5)    45.9 (11.0)
H/K/A       min-max         39.4   50.7 (9.8)    51.8 (9.6)    47.4 (10.9)
HC/H/K/A    no norm.        32.0   55.0 (8.7)    58.7 (7.5)    55.6 (7.6)
HC/H/K/A    min-max         32.0   57.5 (7.0)    59.5 (8.5)    59.2 (7.6)
Note that the Zero-Rule Baseline (ZRB) is task-specific.
EXPLAINABILITY RESULTS
Classification Task: HC/GD | Classification method: CNN
Fig. S1. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on min–max normalized GRF signals using a CNN as classifier.
Classification Task: HC/GD | Classification method: MLP
Fig. S2. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on min–max normalized GRF signals using an MLP as classifier.
Classification Task: HC/GD | Classification method: SVM
Fig. S3. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on min–max normalized GRF signals using an SVM as classifier.
Classification Task: HC/H | Classification method: CNN
Fig. S4. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on min–max normalized GRF signals using a CNN as classifier.
Classification Task: HC/H | Classification method: MLP
Fig. S5. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on min–max normalized GRF signals using an MLP as classifier.
Classification Task: HC/H | Classification method: SVM
Fig. S6. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on min–max normalized GRF signals using an SVM as classifier.
Classification Task: HC/K | Classification method: CNN
Fig. S7. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on min–max normalized GRF signals using a CNN as classifier.
Classification Task: HC/K | Classification method: MLP
Fig. S8. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on min–max normalized GRF signals using an MLP as classifier.
Classification Task: HC/K | Classification method: SVM
Fig. S9. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on min–max normalized GRF signals using an SVM as classifier.
Classification Task: HC/A | Classification method: CNN
Fig. S10. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on min–max normalized GRF signals using a CNN as classifier.
Classification Task: HC/A | Classification method: MLP
Fig. S11. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on min–max normalized GRF signals using an MLP as classifier.
Classification Task: HC/A | Classification method: SVM
Fig. S12. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on min–max normalized GRF signals using an SVM as classifier.
EXPLAINABILITY RESULTS – NON-NORMALIZED DATA
Classification Task: HC/GD | Classification method: CNN
Fig. S13. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on non-normalized GRF signals using a CNN as classifier.
Classification Task: HC/GD | Classification method: MLP
Fig. S14. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on non-normalized GRF signals using an MLP as classifier.
Classification Task: HC/GD | Classification method: SVM
Fig. S15. Result overview for the classification of healthy controls and the aggregated class of all three gait disorders (HC/GD) based on non-normalized GRF signals using an SVM as classifier.
Classification Task: HC/H | Classification method: CNN
Fig. S16. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on non-normalized GRF signals using a CNN as classifier.
Classification Task: HC/H | Classification method: MLP
Fig. S17. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on non-normalized GRF signals using an MLP as classifier.
Classification Task: HC/H | Classification method: SVM
Fig. S18. Result overview for the classification of healthy controls (HC) and hip injury class (H) based on non-normalized GRF signals using an SVM as classifier.
Classification Task: HC/K | Classification method: CNN
Fig. S19. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on non-normalized GRF signals using a CNN as classifier.
Classification Task: HC/K | Classification method: MLP
Fig. S20. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on non-normalized GRF signals using an MLP as classifier.
Classification Task: HC/K | Classification method: SVM
Fig. S21. Result overview for the classification of healthy controls (HC) and knee injury class (K) based on non-normalized GRF signals using an SVM as classifier.
Classification Task: HC/A | Classification method: CNN
Fig. S22. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on non-normalized GRF signals using a CNN as classifier.
Fig. S23. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on non-normalized GRF signals using an MLP as classifier.
Fig. S24. Result overview for the classification of healthy controls (HC) and ankle injury class (A) based on non-normalized GRF signals using an SVM as classifier.

2.5 Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy

Djordje Slijepcevic, Matthias Zeppelzauer, Fabian Unglaube, Andreas Kranzl, Christian Breiteneder, and Brian Horsak. Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy. IEEE Access, 11:65906–65923, 2023. DOI: 10.1109/ACCESS.2023.3289986

The final version of this publication is available at: https://doi.org/10.1109/ACCESS.2023.3289986. Permission for reprint granted, © 2023 Slijepcevic

IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY SECTION
Received 10 May 2023, accepted 11 June 2023, date of publication 27 June 2023, date of current version 6 July 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3289986

Explainable Machine Learning in Human Gait Analysis: A Study on Children With Cerebral Palsy

DJORDJE SLIJEPCEVIC 1, MATTHIAS ZEPPELZAUER 1, FABIAN UNGLAUBE 2, ANDREAS KRANZL 2, CHRISTIAN BREITENEDER 3, AND BRIAN HORSAK 4,5
1 Institute of Creative Media Technologies, St. Pölten University of Applied Sciences, 3100 Sankt Pölten, Austria
2 Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Speising, 1130 Vienna, Austria
3 Institute of Visual Computing and Human-Centered Technology, TU Wien, 1040 Vienna, Austria
4 Institute of Health Sciences, St. Pölten University of Applied Sciences, 3100 Sankt Pölten, Austria
5 Center for Digital Health and Social Innovation, St. Pölten University of Applied Sciences, 3100 Sankt Pölten, Austria
Corresponding author: Djordje Slijepcevic (djordje.slijepcevic@fhstp.ac.at)

This work was supported in part by the Research Promotion Agency of Lower Austria and the Provincial Government of Lower Austria within IntelliGait3D under Grant FTI17-014, and in part by the Endowed Professorship for Applied Biomechanics and Rehabilitation Research under Grant SP19-004.
This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Ethics Committee of the City of Vienna under Application No. EK 19-083-VK.

ABSTRACT This work investigates the effectiveness of various machine learning (ML) methods in classifying human gait patterns associated with cerebral palsy (CP) and examines the clinical relevance of the learned features using explainability approaches. We trained different ML models, including convolutional neural networks, self-normalizing neural networks, random forests, and decision trees, and generated explanations for the trained models. For the deep neural networks, Grad-CAM explanations were aggregated on different levels to obtain explanations at the decision, class and model level. We investigate which subsets of 3D gait analysis data are particularly suitable for the classification of CP-related gait patterns. The results demonstrate the superiority of kinematic over ground reaction force data for this classification task and show that traditional ML approaches such as random forests and decision trees achieve better results and focus more on clinically relevant regions compared to deep neural networks.
The best configuration, using sagittal knee and ankle angles with a random forest, achieved a classification accuracy of 93.4 % over all four CP classes (crouch gait, apparent equinus, jump gait, and true equinus). Deep neural networks utilized not only clinically relevant features but also additional ones for their predictions, which may provide novel insights into the data and raise new research questions. Overall, the article provides insights into the application of ML in clinical practice and highlights the importance of explainability to promote trust and understanding of ML models.

INDEX TERMS Explainable artificial intelligence, explainability, human gait analysis, biomechanical gait data, kinematics, ground reaction forces, convolutional neural network, self-normalizing neural network, random forest, decision tree.

The associate editor coordinating the review of this manuscript and approving it for publication was Kin Fong Lei.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023

I. INTRODUCTION
Walking impairments can severely affect a person's ability to participate in social activities and work life and negatively impact quality of life. Causes of walking impairments can range from traumatic events to various diseases such as stroke, Parkinson's disease, or cerebral palsy (CP). One of the most common causes of physical disability in children is CP, which occurs in approximately 2.5 out of every 1,000 births in developed countries [1]. This group of neurological disorders can cause tremors, muscle weakness, stiffness, and spasticity, which can affect a child's motor functions and ability to walk [2].
Among the most frequent causes of CP are brain lesions that occur before, during, or shortly after birth, mostly leading to musculoskeletal impairments that can worsen throughout childhood and adolescence [3]. Accurate examination and quantification of underlying movement mechanisms are necessary to ensure the best possible treatment for children with CP. Such information is essential for clinicians to offer targeted treatment plans to their patients. The worldwide established gold standard for this purpose is clinical 3D gait analysis (3DGA). This method makes it possible to describe and analyze the motor function of patients objectively and quantitatively from a kinematic (joint angles) and kinetic (ground reaction forces, joint reaction forces, and joint moments) point of view [4]. The basis for 3DGA is motion capturing and assessment of ground reaction forces (GRF). Both data sources capture complementary information on walking behavior. This work investigates the automated classification of gait patterns associated with CP using data from clinical 3DGA, i.e., kinematic data, GRF data, and a combination of both.

Recordings obtained during 3DGA produce a vast amount of data. A typical report in clinical practice contains up to a few dozen discrete parameters along with more than 20 waveforms describing kinematic and kinetic gait variables across a gait cycle. A gait cycle refers to the interval between the initial contact of a foot and the subsequent initial contact of the same foot and is the standard "time frame" used in clinical practice to describe human gait. Due to their complexity, characterized by high dimensionality, temporal dependence, strong variability, non-linear relationships, and inter-correlation [5], these data are challenging to comprehend and analyze manually (see Figure 1 for an example of a data record obtained by 3DGA).
Hence, data interpretation in clinical practice is highly challenging, and a lot of experience is required to draw valid medical conclusions.

The complexity of the data in 3DGA combined with the need for timely and precise decision-making has motivated research to utilize Machine Learning (ML) to aid decision-making [6]. ML approaches increasingly leverage non-linear classification models, such as multi-layer (deep) neural networks, which have been shown to provide promising results concerning classification accuracy in the field of clinical gait analysis [7], [8]. However, such complex classification models share a major limitation: their black-box nature [9]. This means that it is hard to trace back and understand how a certain model has reached a specific decision, how it is grounded in the input data, and what kind of patterns and rules it actually learned from the data. Consequently, even well-performing ML models are rarely used in clinical practice [10].

Given an ML model trained on 3DGA data, it is a non-trivial task to trace back which patterns in the signal are responsible for its predictions. Furthermore, it is unclear whether predictions are based on clinically relevant patterns or rather on signals that relate to the targeted pathologies due to a spurious correlation or a bias in the data but are not causally related to them. The experts' skepticism regarding automatically generated predictions and diagnostic suggestions is, therefore, well justified. At the same time, the strong performance obtained by state-of-the-art ML models shows great potential to significantly support the diagnostic process and, thus, to save costs and time in everyday clinical practice. Therefore, their application in clinical practice would be of great value. This, however, requires ML approaches to become more transparent and traceable, e.g., via explainability mechanisms [8].
In addition, this would further help to fulfill legal requirements, such as the EU General Data Protection Regulation (GDPR) [11], that require the traceability of ML predictions.

The primary aim of this work is to investigate the effectiveness of various ML methods in automatically classifying gait patterns associated with CP and to employ explainability approaches to examine whether the features learned by these models are clinically relevant. To this end, we trained different ML models, including convolutional neural networks (CNNs), self-normalizing neural networks (SNNs) [12], random forests (RFs), and decision trees (DTs), and generated and compared explanations for the trained models. We utilized model-specific explanations for DTs and RFs in terms of Gini impurity-based feature importance. For the investigated deep neural networks (DNNs), we adapted the well-known Grad-CAM algorithm [13] to be applicable to one-dimensional time series input data. This explainability method has been shown to be robust [14] in explaining the internal workings of DNNs. Our investigation focuses on the following leading research questions:
1) How advantageous is the use of kinematic data over GRF data in the automated classification of gait patterns associated with CP, and are the two inputs more effective in combination than used individually?
2) How do traditional ML models compare to state-of-the-art DNNs for the automated classification of clinical 3DGA data in terms of performance and explainability?
3) To what extent do the investigated ML models base their decisions on clinically meaningful features when classifying CP-related gait patterns?
4) To what extent are the explanations obtained from DNNs robust to variations in architecture?

We performed experiments on a dataset of 302 patients with CP (375 limbs) and four different gait patterns related to this condition. Our results show an unexpected outcome.
DNNs have not met initial expectations and fall behind traditional methods in classification performance. Compared to traditional methods that provide concise explanations and identify and utilize clinically relevant regions in the input data for the classification task, DNNs are less informative in their explanations.

FIGURE 1. Visualization of a 3DGA data record as used in clinical practice. Several retro-reflective markers (pink spheres) are attached to specific anatomical landmarks of the human body and allow quantifying human locomotion using 3D motion-capturing techniques. The 3D trajectories of these markers combined with geometrical biomechanical models are used to calculate, e.g., joint angles. In clinical practice, this information is used to inform medical decision-making. The data from clinical 3DGA are typically reported in simple line plots. Blue and red colors encode the right and left body sides, respectively. Deriving a diagnosis from these abstract line plots is a challenging task that requires trained medical personnel. Thus, machine learning models are highly desirable to assist decision-making.

II. RELATED WORK
The research in this paper combines methodology from multiple disciplines, namely automated classification in clinical gait analysis and explainable machine learning, a branch of explainable artificial intelligence (XAI). For this reason, the related work is structured in two subsections, one for each field.

A. AUTOMATED CLASSIFICATION OF PATIENTS WITH CP
There is a growing interest in using ML in the field of clinical gait analysis due to its ability to analyze large amounts of gait data in a cost-effective, fast, and objective manner [6], [15], [16].
ML methods have been successfully applied to analyze gait patterns of patients with different conditions, such as stroke [17], Parkinson's disease [18], multiple sclerosis [19], osteoarthritis [20], and various functional gait disorders [21], [22]. One area that has received particular attention in the literature is the use of ML for automated classification of gait patterns associated with CP [23].

Several studies have compared the performance of different ML approaches for this task. Ferrari et al. [24] compared the use of multi-layer perceptrons (MLPs), support vector machines (SVMs), and long short-term memory networks (LSTMs) for the classification of four CP-related gait patterns defined by Ferrari et al. [25]. According to their results, the LSTMs achieved the highest classification accuracy of 67.4 %. The authors utilized kinematic data and frequency information (obtained via fast Fourier transform) from 174 patients. Zhang and Ma [26] compared seven ML methods, including MLP, SVM, DT, and RF, for the classification of CP-related gait patterns as defined by Rodda and Graham [27], i.e., crouch gait, apparent equinus, jump gait, and true equinus. The MLPs performed best with a classification accuracy of 93.5 %. The DT, RF, and SVM had considerably lower accuracy rates of 84.3 %, 83.6 %, and 85.0 %, respectively. The dataset comprised discrete parameters from kinematic waveforms of 200 children. Darbandi et al. [28] also used discrete parameters of kinematic data of 66 children in a stochastic approach to translate expert knowledge into rules and to perform fuzzy clustering. For the classification of gait patterns as defined by Rodda and Graham [27], their approach achieved a classification accuracy of 94.0 %. Chia et al.
[29] developed a decision support system that used discrete parameters from kinematic waveforms, physical examinations, and anthropometric data to identify 14 different CP-related impairments (e.g., hamstring spasticity, gastrocnemius spasticity, and gluteal weakness) and provide surgical recommendations. The dataset comprised 689 3DGA recordings of 423 children. The authors evaluated the performance of a stratified and a standard RF, with the latter achieving better results, i.e., a misclassification rate of 0.13 (corresponding to a classification accuracy of 87.0 %). Furthermore, feature importance served as decision explanation and partial dependence plots as model explanation.

B. EXPLAINABLE MACHINE LEARNING
The inherent non-transparency of modern ML models, in particular DNNs, has greatly advanced research on explainability methods in the field of XAI in recent years. These methods are designed to provide explanations for automated predictions and to help clinical experts understand how and why a particular prediction was made. XAI methods can be categorized according to the type of explanation they provide. Following the taxonomy of Arya et al. [30], we distinguish between XAI approaches for (i) data exploration, (ii) decision explanation, and (iii) model explanation.

Data exploration methods cannot explain an ML model, but rather the data on which the model was trained. These methods include techniques from the field of visual analytics [31], statistics (e.g., statistical parametric mapping) [8], and unsupervised machine learning [32], [33]. The goal is to visualize and adequately transform the data, thereby enabling domain experts to find meaningful structures and patterns that will allow them to better understand the data, their distribution, and cluster structures. This process should result in novel insights from the data. Data exploration is generally recommended before an ML model is trained.
Decision explanation methods explain the local behavior of an ML model, i.e., providing an explanation for the prediction of an individual data sample. For a classification task, such an explanation can, for example, indicate which parts of the input are responsible for the prediction. In the case of gait classification, such methods can identify characteristic sections in the input time series related to a specific gait disorder [8]. The majority of decision explanation methods are post-hoc methods, which offer great flexibility as they can be directly applied to previously trained classification models [30]. Typical results of post-hoc methods are saliency maps that highlight which input features are most relevant to a particular prediction [13]. Post-hoc methods can be divided into propagation-based and perturbation-based approaches. Propagation-based methods determine the effect of input features on the model's prediction by (partially) back-propagating an entity of interest (e.g., gradients) from the output to the input of the model. Popular examples for such approaches are SmoothGrad [34], Grad-CAM [13], and Layer-wise Relevance Propagation (LRP) [35]. Perturbation-based methods, e.g., Local Interpretable Model-Agnostic Explanations (LIME) [36] and SHapley Additive exPlanations (SHAP) [37], estimate the importance of input features by partially masking the input and measuring the effect on the model output. Perturbation-based methods are model-agnostic since no access to the internal architecture of the models is necessary. Compared to propagation-based methods, however, they require a significantly higher computational effort. Propagation-based methods are computationally more efficient and allow the explanation of classifier-specific characteristics, thus enabling a more profound analysis.
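The perturbation-based idea described above can be illustrated with a minimal, dependency-free sketch. It is not LIME or SHAP, but a simpler occlusion scheme that masks one window of a 1D signal at a time and records how much the score of the target class drops. The names `occlusion_relevance` and `toy_predict`, the window size, and the zero baseline are assumptions chosen for illustration only.

```python
import math

def occlusion_relevance(predict, signal, target, window=10, baseline=0.0):
    """Mask one window at a time and record the drop in the target-class score.

    predict is any callable mapping a list of samples to class probabilities,
    so the model is treated as a black box (model-agnostic).
    """
    reference = predict(signal)[target]
    relevance = [0.0] * len(signal)
    for start in range(0, len(signal), window):
        stop = min(start + window, len(signal))
        masked = list(signal)
        masked[start:stop] = [baseline] * (stop - start)
        drop = reference - predict(masked)[target]
        relevance[start:stop] = [drop] * (stop - start)
    return relevance

def toy_predict(signal):
    """Hypothetical two-class model that only looks at samples 20..29."""
    p = 1.0 / (1.0 + math.exp(-(sum(signal[20:30]) - 5.0)))
    return [1.0 - p, p]

# One forward pass per window: this is why perturbation methods are costly.
relevance = occlusion_relevance(toy_predict, [1.0] * 101, target=1)
```

In this toy setting only the window covering samples 20 to 29 receives non-zero relevance, because it is the only region the hypothetical model uses. The loop also makes the computational cost visible: the number of model evaluations grows with the number of masked regions.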
Besides decision explanations, there are model explanation methods that aim to explain what a trained model has learned at a global level, e.g., by providing class-specific prototypes [38] or synthesized samples reflecting the characteristic patterns learned for a certain class [39]. Consequently, ambiguous features that the model learned can be identified and overlaps between classes can be detected. Model explanation makes it possible to check whether a model has been trained correctly and whether the predicted classes are based on meaningful patterns. Decision and model explanation, thus, complement each other.

In clinical gait analysis, only a few studies have used XAI to shed light on the underlying black-box models and promote their use within the clinical setting. We have recently proposed several approaches for model and decision explanation based on LRP to explain the functioning of different ML models (i.e., linear SVMs, MLPs, and CNNs) for the classification of GRF data into different functional gait disorders [8]. The investigated ML models utilized GRF waveforms as input. Consequently, to obtain class-specific explanations, the averaged relevance scores were superimposed over the averaged GRF waveforms. Furthermore, we proposed the use of a model explanation based on SpRAy [40], which is a method for identifying clusters within the explanations. These approaches have been further investigated to explain sex- and age-dependent gait patterns learned by ML models [41], [42]. Dindorf et al. [43] used LIME to explain a linear SVM that was trained to distinguish between healthy controls and patients after total hip arthroplasty. In their study, two input scenarios were examined, one with kinematic and kinetic waveforms, while the other employed discrete parameters derived from these waveforms.
The explainability results showed that the SVMs were highly sensitive to the input representation employed and that each of the models often focused on different biomechanical features. Kokkotis et al. [44] leveraged SHAP to explain which discrete kinematic and kinetic parameters contributed most to the decisions of an SVM for the classification of patients with anterior cruciate ligament injury (with and without reconstruction surgery) and healthy controls. The authors noted that SHAP highlighted several discrete parameters that were consistent with biomechanical findings reported in the literature. However, a discrepancy was observed between the explainability results and the results of conventional statistical analysis. Recently, we proposed gaitXplorer [45], a visual analytics approach for the classification of CP-related gait patterns that employed Grad-CAM [13] to explain predictions of CNNs. This work employs the same dataset and explainability method as the present paper, but focuses solely on decision explanations. Figure 2 shows the interactive visual interface of the gaitXplorer.

The present work investigates for the first time the suitability of different DNN architectures and their explainability in terms of decision and model explanations for the classification of CP-related gait patterns as defined by Rodda et al. [47]. Moreover, our research addresses the ability of DNNs and conventional ML approaches to capture clinically significant features from kinematic and kinetic waveforms.

III. METHODS
A. CLINICAL USE CASE AND TARGET CLASSES
The clinical use case of this paper is the identification of clinically well-defined gait patterns in children with CP and neuro-muscular disorders. This patient group is associated with varying symptoms such as muscle weakness and stiffness, tremors, and limited joint range of motion, among other impairments [2]. All of these can strongly affect motor function and the ability to walk.
The present study uses 3DGA data from patients who can walk independently and have well-recognizable gait deviations such as toe-walking, flexed-stiff knees, flexed hips, and an anteriorly tilted pelvis [47]. The correct classification and identification of these underlying impairments are essential as clinicians base their decisions about optimal treatment interventions on this information.

All patients in our study were categorized into four pathological gait patterns by a clinically established procedure, the so-called ankle plantarflexor-knee extension couple (PFKE) index [48]. This categorization served as the ground truth during the training and evaluation of the ML models. The method compares the sagittal knee and ankle angles of patients with those of a speed-matched healthy control cohort and automatically determines the four classes using a set of rules. These rules provide a well-suited reference for the evaluation of the appropriateness of the explainability results. In our experiments, we expect that a trustworthy classification model for CP would base its decisions on the same signal regions as the PFKE method.

The gait patterns associated with CP are illustrated in Figure 3 and briefly described in the following [47]:
• True equinus: The ankle is in plantar flexion throughout the stance phase ("toe-walking").
• Jump gait: Equinus at the ankle (partly in late stance), flexion at knee and hip (especially in early stance), anterior pelvis tilt, and increased lumbar lordosis.
• Apparent equinus: The ankle has a normal range, but the knee and hip are excessively flexed throughout the stance, and the heel is off the ground during walking.
• Crouch gait: The ankle is excessively dorsiflexed throughout the stance, and the knee and hip are excessively flexed.

B. DATASET
The data used for this study are retrospective gait analysis data from an existing clinical database maintained by the Laboratory for Gait and Movement Analysis at the Orthopaedic Hospital Vienna-Speising. In brief, gait analysis data are obtained by motion capture techniques in which spherical retroreflective markers with a diameter of approximately 1 cm are placed directly on the patient's skin above anatomical landmarks. The patient is then asked to walk freely up and down a walkway of roughly ten meters in a gait laboratory. A motion capture system comprising several infrared-based cameras records the 2D trajectories of each reflective marker for each camera. These redundant 2D coordinates are then triangulated to derive the 3D coordinates in space for each marker [49] at any instant of time. The obtained marker positions are then used to fit a multibody biomechanical model to these 3D trajectories by a least-squares algorithm. The model then allows describing kinematic and kinetic variables of human locomotion in detail [4].

The local ethics committee approved this retrospective study (EK 19-083-VK). The dataset comprises anonymized data from 302 patients with CP (375 affected legs) and includes the aforementioned four gait patterns: true equinus (N = 129), jump gait (N = 72), apparent equinus (N = 92), and crouch gait (N = 82). Table 1 presents class-specific demographic details. The 3D clinical gait analysis was performed on a 12 m walkway using a motion capture system (150 Hz, Vicon, Oxford, United Kingdom) comprising at least 14 infrared cameras and three force plates (1500 Hz, Advanced Mechanical Technology Inc., MA, USA). The force plates were embedded in the ground flush with the walkway and covered with the same surface material as the floor. Patients walked unassisted (without a walking aid) and at self-selected walking speed until at least five valid recordings had been obtained. A record was considered valid if the patient walked naturally and had a clean foot strike on one of the force plates.

FIGURE 2. Visual interface of the gaitXplorer [45] showing the classification prediction and corresponding explanations for both legs of a patient. The top right corner (a) features a compact overview of the Grad-CAM-based explanations. The main panel (b) illustrates the patient's 3D gait analysis data as line plots, with color intensity indicating the relevance for the predictions (i.e., blue for the left and red for the right leg). Figure adapted from [46].

FIGURE 3. Four motion patterns in patients with cerebral palsy, i.e., true equinus (toe-walking), jump gait, apparent equinus, and crouch gait [47]. One indicator for these four gait patterns is the sagittal ankle angle (dorsi-plantarflexion). The typical value range of this angle is displayed for each gait pattern.

The raw data were preprocessed with Vicon Nexus (Vicon, Oxford, United Kingdom) and custom-made Matlab routines (The MathWorks, Inc., Natick, MA, USA). Marker trajectories were filtered with a Woltring filter (mean square error of 15 mm²) and GRF data with a third-order Savitzky-Golay filter. Joint angles and moments were calculated according to the modified Cleveland clinical marker set. Based on distinct gait events, i.e., initial contact and foot off, data of all valid gait cycles were linearly time-normalized to 100 % of the respective gait cycle. Subsequently, the average curve was computed for each joint angle, joint moment, and GRF component by aggregating data from all gait cycles within one recording session. The dataset includes information about the joint angles (kinematics) of the pelvis, hip, knee, and ankle and GRFs in all three planes of motion.
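The linear time normalization and per-session averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab routines: it assumes each gait cycle has already been cut between gait events, uses plain linear interpolation onto 101 equally spaced points, and the function names `time_normalize` and `average_waveform` are hypothetical.

```python
def time_normalize(cycle, n_points=101):
    """Linearly resample one gait cycle to n_points samples (0 to 100 %)."""
    last = len(cycle) - 1
    resampled = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append((1.0 - frac) * cycle[lo] + frac * cycle[hi])
    return resampled

def average_waveform(cycles, n_points=101):
    """Average several cycles of different length after time normalization."""
    normalized = [time_normalize(c, n_points) for c in cycles]
    return [sum(values) / len(values) for values in zip(*normalized)]

# Two synthetic cycles of different duration (137 and 90 frames).
cycles = [[i / 136 for i in range(137)], [i / 89 for i in range(90)]]
mean_curve = average_waveform(cycles)
```

Resampling every cycle to the same number of points is what makes sample-wise averaging across cycles of different duration meaningful.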
For the gait kinematics, the sagittal, frontal, and transverse planes of motion correspond to the flexion/extension, abduction/adduction, and internal/external rotation of a joint, respectively. Consistent with standard practice in this domain, data were time-normalized to one gait cycle. As a result, each signal has 101 data samples after time-normalization, i.e., corresponding to 0–100 % of the gait cycle for joint angles and of the stance phase for GRFs. The data are multi-dimensional and consist of 13 signals in total, i.e., vertical (GRF_V), anterior-posterior (GRF_AP), and medio-lateral (GRF_ML) GRFs as well as sagittal (Pelvis_S), frontal (Pelvis_F), transversal (Pelvis_T) pelvis angles, sagittal (Hip_S), frontal (Hip_F), transversal (Hip_T) hip angles, sagittal (Knee_S), frontal (Knee_F), transversal (Knee_T) knee angles, and sagittal ankle angle (Ankle_S). Each signal represents either one of the GRF components or the kinematic profile at a particular joint and anatomical plane during one gait cycle. Several gait cycles were available for each patient. The signals from these gait cycles were averaged to one waveform per body side to account for intra-subject gait variability. The classification was conducted at the level of individual legs, i.e., only the affected legs were classified.

TABLE 1. Demographic information for each class within the employed dataset.

For our experiments we employed different subsets of the captured data to investigate the influence of different signals on the classification performance and on the obtained explanations. For each subset we concatenated the respective signals into a one-dimensional vector.
We defined the following four subsets for our experiments:
• the lower body kinematic and GRF data (i.e., a 1×1313-dimensional input vector for each patient, consisting of 13 signals, with 101 samples each);
• only lower body kinematic data (a 1×1010-dimensional input vector for each patient);
• only GRF data (a 1×303-dimensional input vector for each patient);
• the sagittal knee and ankle joint angles, which are the signals that are actually used to determine the ground truth (a 1×202-dimensional input vector for each patient).

Since the signal amplitudes differ in their dynamic ranges, we normalized the input features component-wise to the range [0,1]. This ensures that each signal can contribute equally to the decision process and that signals with a smaller amplitude range are not disadvantaged.

C. CLASSIFICATION METHODS

For the classification task, we examined various ML models, including CNNs, SNNs, RFs, and DTs, and generated and compared explanations for the trained models. We selected CNNs because they have not been previously employed in the literature for this purpose, despite their success in other gait analysis tasks [8]. SNNs utilize scaled exponential linear units (SELUs) as activation function, which exhibit self-normalizing properties, causing the output of the layers to converge to zero mean and unit variance [12]. Klambauer et al. [12] demonstrated that SNNs exhibit great robustness due to these properties, since vanishing and exploding gradients are eliminated by construction. As traditional ML methods, RFs and DTs performed well for classifying CP-related gait patterns, and thus we utilized them as baseline approaches. We evaluated the ML models in a stratified five-fold cross-validation approach. Hence, three folds served as training data, one fold served as a validation set on which the optimal architecture and hyperparameters were determined, and the remaining fold served as a test set.
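The three/one/one fold assignment described above can be illustrated with a small dependency-free sketch. The text does not state which fold served as the validation set in each round, so the rotation used here (validation fold = fold following the test fold) is an assumption, as are the function names `stratified_folds` and `rotations`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Deal sample indices into folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def rotations(folds):
    """Yield (train, validation, test) index lists, one per test fold."""
    n = len(folds)
    for k in range(n):
        val_k = (k + 1) % n  # assumed rotation scheme, not from the paper
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, val_k) for i in fold]
        yield train, folds[val_k], folds[k]
```

Each of the five rounds yields one trained model, whose hyperparameters are chosen on the validation fold and whose accuracy is reported on the held-out test fold.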
1) NEURAL NETWORKS

CNNs and SNNs learn abstract feature representations for the provided data via several consecutive 1D convolutional layers. The filter size and stride¹ remained fixed for all convolutional layers. We investigated whether compressing information across the convolutional stack provides an advantage in both performance and explainability. To this end, we examined a stride of one (baseline without compression) and two (compression by half). For both model types, CNNs and SNNs, the filter size was set to three. Simonyan and Zisserman [50] showed the advantage of using a stack of 3 × 3 convolutional layers over filters with larger receptive fields in image classifiers. This approach employs multiple non-linearities, resulting in a more discriminative decision function as well as a reduction in the number of parameters [50]. Non-linear activations (ReLUs for CNNs and SELUs for SNNs) were applied in each convolutional layer. The feature maps in the last convolutional layer were flattened and linked to a fully-connected (dense) layer stack. This stack consists of one dense layer (with ReLU for CNNs and SELU for SNNs as an activation function) and an output layer situated on top that has four output neurons. To promote generalizability and counteract potential overfitting during training, dropout was applied to the last two dense layers (including the output layer). The output layer has a softmax activation function attached to scale the outputs to class likelihoods. The fully-connected layers (including the output layer) can be considered a non-linear multi-class predictor, which operates on top of a hierarchically learned stack of 1D filters. The convolutional layer stack is strongly non-linear, which makes this part of the architecture very flexible in modeling but at the same time non-transparent.
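To make the effect of the stride concrete, the following sketch computes the feature-map length and the receptive field of a stack of 1D convolutions with filter size three. It assumes unpadded ("valid") convolutions, which the text does not specify; the helper names are hypothetical.

```python
def conv_output_length(length, kernel=3, stride=1):
    """Output length of one unpadded 1D convolution (assumed 'valid')."""
    return (length - kernel) // stride + 1


def stack_summary(input_length, n_layers, kernel=3, stride=1):
    """Return (feature-map length, receptive field) after n_layers."""
    length = input_length
    receptive_field = 1
    jump = 1  # distance in input samples between neighbouring outputs
    for _ in range(n_layers):
        length = conv_output_length(length, kernel, stride)
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return length, receptive_field


if __name__ == "__main__":
    for stride in (1, 2):
        print(stride, stack_summary(1313, n_layers=4, stride=stride))
```

Under these assumptions, four layers with a stride of two reduce the 1313-sample input to 81 positions, each covering 31 input samples, whereas a stride of one keeps 1305 positions that each cover only 9 samples. This is one way to see why the last convolutional layer of the strided models describes broader signal regions.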
During the training process, the weights were updated via back-propagation using the Adam optimizer (1000 training epochs with early stopping using the validation loss as the monitored metric and a patience of 100 epochs) and a categorical cross-entropy loss function. For each input setting, we determined the optimal hyperparameters via a grid search, i.e., stride {1, 2}, number of convolutional layers and number of filters {{32, 32}, {32, 32, 32, 32}, {32, 32, 32, 32, 32, 32}, {32, 64}, {32, 32, 64, 64}}, size of the dense layer {64, 128}, dropout rate {0.1, 0.25, 0.5}, batch size {32, 64}, and learning rate {10⁻⁴, 10⁻³}. The optimal hyperparameters for the CNNs are presented in Table 2 and for the SNNs in Table 3.

¹ The number of input features that the convolution filter moves across the input of the convolutional layer.

TABLE 2. Optimal hyperparameters, i.e., stride (S), number of layers and filters in the convolution stack, size of dense layer, dropout rate (DR), batch size (BS), and learning rate (LR), for the CNN architectures.

TABLE 3. Optimal hyperparameters, i.e., stride (S), number of layers and filters in the convolution stack, size of dense layer, dropout rate (DR), batch size (BS), and learning rate (LR), for the SNN architectures.

2) TREE-BASED MODELS

In addition to DNNs, we explored traditional ML methods such as DTs and RFs. The non-parametric, non-linear, and intrinsically interpretable nature of DTs makes them popular for gait classification [6]. DTs have a tree structure containing decision nodes and leaf nodes. A decision node uses a suitable input feature to try to split the data into two homogeneous subsets. To determine which feature is suitable for a decision node, different metrics can be used.
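One widely used split criterion of this kind is Gini impurity, which is also the criterion adopted for the experiments in this section. The sketch below shows how the impurity of a node and the impurity decrease of a candidate threshold split can be computed; it is a simplified illustration with hypothetical function names, not the Scikit-learn implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Reduction in Gini impurity when splitting one feature at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children
```

A decision node would pick the feature and threshold with the largest decrease. Summing these decreases per feature over all nodes in which the feature is used gives an impurity-based feature importance of the kind reported for DTs and RFs.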
The most commonly used metrics are Gini impurity and information gain based on entropy. These metrics can also be used to calculate the importance of each input feature for the entire model, which can be used as a model explanation. Since a feature can be used at different levels of the tree, the importance of a feature is determined as its total contribution to reducing the impurity. Different algorithms exist for the construction of a DT, e.g., ID3 [51], C4.5 [52], and CART [53]. For our experiments, we utilized the DT implementation of Scikit-learn [54], which is based on the CART algorithm. To construct the DT, we used Gini impurity, and the feature importance based on this metric serves as the explanation method.

Since individual DTs can be sensitive to even small changes in the input data, we also investigated RFs. RFs are also supervised non-parametric ML methods built on a set of simple DTs. To generate an RF, a predetermined number of DTs are first trained on different subsets of the training data, and then the predictions of these simpler models are combined. As RFs are sensitive to the number of individual DTs, we performed a grid search over this hyperparameter, N_DT ∈ {100, 200, 300}. The number of individual DTs that performed best in all settings was 100. For our experiments, we utilized the RF implementation of Scikit-learn [54] with Gini impurity as metric. Similar to an individual DT, the feature importance can be calculated for the entire RF using Gini impurity. This can directly serve as an explanation for the trained model.

D. EXPLAINABILITY METHODS FOR NEURAL NETWORKS

Once the networks were trained, a central question was how to explain these models to examine their internal functioning and plausibility. With regard to DNNs, the ever-growing ecosystem of XAI methods offers many choices. However, as demonstrated by Adebayo et al. [14], not all of the proposed XAI methods are robust and the validity of obtained explanations should be questioned.
Unfortunately, most of the popular gradient-based methods, which are particularly well-suited for DNNs, are heavily exposed to artifacts caused by the problem of gradient shattering [55]. Thus, the explanations in the input space are not continuous, and single input features show considerably different or even opposite importance values (regarding a given prediction) compared to input features in their immediate neighborhood [34]. Furthermore, it is questionable whether the information obtained for individual input features represents an adequate abstraction level to explain a decision, especially when the input signal is a continuous time series. Justifications based on local and consecutive signal features in the time series may lead to more comprehensible and intuitive explanations.

For this reason, we decided to perform the explanation at a higher semantic level. A suitable level is the last layer of the convolutional filter stack. This layer represents higher-level signal filters with a larger receptive field and, thus, potentially captures more meaningful signal features for human observers. A method that provides explanations at this level is Grad-CAM [13]. This method does not propagate the gradients back to the input space, but the final prediction of the network is directly explained in terms of the abstract features learned in the last convolutional layer. Grad-CAM weighs the activation map of the last convolutional layer with the gradients (which flow into this layer) with respect to the target class to be explained. The weighted activation map is averaged over all channels of the layer. This results in an activation pattern that reflects higher-level signal patterns and captures contextual information. For easier interpretation of the results, the activation pattern can be upscaled (via interpolation) and mapped (e.g., via color coding) to the input signal.
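The Grad-CAM steps for 1D data can be sketched as follows. The sketch follows the standard Grad-CAM recipe as summarized in the caption of Figure 4 (channel-wise averaged gradients as weights, a weighted sum over channels, ReLU, upscaling). It assumes that the activations and gradients of the last convolutional layer have already been extracted from the trained network by an automatic differentiation framework; the function names and the list-of-lists layout are illustrative choices, not the authors' implementation.

```python
def upsample_linear(values, length):
    """Linearly interpolate a short sequence to the given length."""
    if len(values) == 1:
        return [values[0]] * length
    last = len(values) - 1
    out = []
    for i in range(length):
        pos = i * last / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def grad_cam_1d(activations, gradients, input_length):
    """Grad-CAM relevance for one sample.

    activations, gradients: one list per channel, each over time steps,
    where gradients are d(target class score) / d(activation).
    """
    n_steps = len(activations[0])
    # One weight per channel: the gradient averaged over time.
    weights = [sum(channel) / len(channel) for channel in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        max(0.0, sum(w * channel[t] for w, channel in zip(weights, activations)))
        for t in range(n_steps)
    ]
    return upsample_linear(cam, input_length)
```

The returned list has one relevance value per input sample, so it can be color-coded directly onto the concatenated input signal.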
The upscaled activation pattern highlights continuous but local sections in the input signal that have a strong relation to the target class under investigation. An overview of how Grad-CAM functions for 1D gait analysis data is provided in Figure 4.

FIGURE 4. A schematic representation of the Grad-CAM method adapted for deep neural networks trained on 1D gait data. This example illustrates the process of generating a decision explanation for the true equinus class. To generate a Grad-CAM explanation, the gradients of the feature maps of the last convolutional layer are averaged channel-wise (x̄) and used as weights to calculate a weighted sum of the activations of this layer. Then, a ReLU function is applied to obtain only positive values, because these contribute to the prediction of the specific target class. In a final step, the Grad-CAM explanation is scaled up to match the size of the input.

In our experimental setup, we employed a five-fold cross-validation, which results in five distinct models. We decided to explain the model that performed closest to the median among the five models because we intended to emulate real-life scenarios where usually a single model is used in practice rather than multiple models. For this model, we generated explanations at different levels, i.e., at the decision level and subsequently by aggregating these explanations at the class and model level. The motivation for this approach is to gain a comprehensive understanding of the model by explaining not only a single decision, but also which features are important for each class and the overall model. First, we computed decision explanations for each record in the dataset (training, validation, and test samples) using the ground truth label as target for the explanation.
Next, we calculated the median over the Grad-CAM activations of all records assigned to a particular class to obtain explanations on the class level. The obtained activation pattern highlights class-specific patterns for an entire target class. The calculation of the median, however, can cause interpretation difficulties by obscuring the existence of different decision strategies that the model may have learned for different patient subgroups within a class. As shown in our previous research [8], CNNs have the ability to learn different strategies for distinct patient subgroups. Consequently, if a model learns complementary strategies to distinguish different patient subgroups of a particular class, the median plot of that class may not provide sufficient information to understand the model's functioning. Therefore, similar to individual conditional expectation (ICE) [56] plots, we propose an additional subplot that also visualizes the individual Grad-CAM activations.

Furthermore, we propose an explanation at the model level that goes beyond the interpretation of individual classes. For this type of model explanation, we calculated the total relevance as the sum of Grad-CAM activations over all samples for all four classes. This model explanation should serve as an informative indicator for the overall relevance of an input feature for the underlying classification task.

The implementation of all classification and explainability methods was conducted within the software framework Python 3.7.10 (Python Software Foundation, USA), TensorFlow 2.3.0 (Google Brain Team, Google LLC, USA), and Scikit-learn 1.0.2 [54].

IV. RESULTS

Subsection IV-A presents the quantitative results in terms of classification accuracy for all investigated ML models which were used to classify the four CP-related patterns presented in Subsection III-A.
The explainability results for the examined ML models, which aim to explain the functioning of the models on the class and model level, are presented in Subsection IV-B.

A. CLASSIFICATION RESULTS

Classification results are provided for the four classification methods from Section III-C, i.e., CNN (with stride of one and two), SNN (with stride of one and two), RF, and DT. Each model has been trained and evaluated on the four different input configurations (signal sub-selections) defined in Section III-B, i.e., i) all 3DGA signals, ii) only kinematic signals, iii) only sagittal knee and ankle angles, and iv) only GRF signals. The zero rule baseline (ZRB), which refers to the theoretical accuracy obtained by always assigning the class label with the highest prior probability, is 34.4 % for this classification task. Since we evaluated the ML models via stratified five-fold cross-validation, we report the averaged classification accuracy over all five folds for the training, validation, and test set. Table 4 summarizes all quantitative results.

In addition to the presented methods, we also conducted experiments with linear support vector machines (SVMs) and gradient boosting classifiers to evaluate their potential. For the SVMs, we performed a grid search on the hyperparameter C ∈ {10⁻⁴, 10⁻³, …, 10³, 10⁴}. The results of the SVMs (All: 75.5 %, Kinematics: 72.3 %, and AnkleS & KneeS: 78.8 %) showed suboptimal performance for the employed dataset and classification task and thus the classifier was not further investigated. We also evaluated the performance of a gradient boosting classifier using a grid search on the hyperparameters: learning rate {10⁻⁴, 10⁻³, 10⁻², 10⁻¹} and number of trees {100, 200, 300}.
Since RF outperformed gradient boosting (All: 91.4 %, Kinematics: 91.2 %, and AnkleS & KneeS: 92.0 %), we focus on the RF results in the following.

The results in Table 4 show that all models overfitted on the training data, as evidenced by their significantly higher performance on the training set compared to the test set, even though the optimal parameters were selected using a validation split. RF outperformed all other models, with CNNs and SNNs showing the lowest test accuracies. The performance of DTs lies between that of the DNNs and RFs. RFs consistently achieved peak performance for all three input configurations where kinematic data were present. RFs are outperformed by CNNs and SNNs only when exclusively GRF data are used. CNNs performed slightly better than SNNs across all input configurations. For CNNs and SNNs, there was little difference in performance between using a stride of one and two. There is no consistent trend in performance with respect to stride. Only in the first input configuration where all 3DGA signals were used, a stride of one performed slightly better.

For CNNs and SNNs, reducing the data to signals that are most relevant for the classification task, i.e., sagittal ankle (AnkleS) and knee (KneeS) angles, showed a significant advantage. The pre-selection of input signals seems to help the networks to find the most relevant information for solving the task. This effect can also be observed with DTs, but it is less pronounced. Remarkably, RFs demonstrate a high degree of invariance towards the pre-selection of input signals. For all input configurations where sagittal ankle and knee angles are included, RFs achieved a similarly high performance level independent of input dimensionality. This shows that RFs are very good at identifying the most relevant information and are hardly distracted by information in unrelated signals.
Furthermore, the results in Table 4 allow a comparison of the classification performance achievable with GRF and kinematic data. This is an important question for clinical practice, as the acquisition of 3D data is much more demanding than capturing GRF data via force plates. Our results show that 3DGA data (i.e., kinematic data) are essential for automated gait classification. We observed a significant drop in performance when classification is restricted to the use of GRF data. We further discuss the results as well as their relevance in the context of the driving research questions of our study in Section V.

TABLE 4. Classification accuracy (averaged over all five folds, in %) for the four classification methods (CNN, SNN, RF, and DT) and two variations of CNN and SNN (each with different stride S) and different input signal selections.

B. EXPLAINABILITY RESULTS

In the following, we present the explainability results of the investigated classification models. We start with CNNs and show their explanations at the class level (Figure 5) and model level (Figure 6). Subsequently, we compare them with the explanations of DTs and RFs at the model level.

1) EXPLANATIONS OF NEURAL NETWORKS

Here we focus on the CNN with a stride of two, since the explanations generated by this model were found to be the most satisfactory compared to the other DNNs. In addition, we focus on the scenario in which the kinematic signals were utilized as input data, as these are the signals most commonly considered for CP. Figure 5 shows the respective explainability results on the class level. The results are presented in four panels, each showing the results of a particular gait class. The top of each panel shows the averaged input signals (in this case 10 concatenated kinematic signals) per class. These are colored with a sequential red palette based on the median of the Grad-CAM relevances of all samples assigned to the particular class.
The higher the degree of red coloration of a region in the averaged input signal, the greater its relevance is to the corresponding class. The bottom part of each panel shows the Grad-CAM explanations for individual input samples, with the median visualized in blue. The lower part of each panel provides a more comprehensive overview by showing the distribution of individual decision explanations, from which we can derive additional information. For instance, considering the HipS signal in the jump gait class in Figure 5C, the lower visualization reveals at least two strategies the model learned for classifying this class, i.e., one focuses on the central regions of HipS, while the other focuses on the regions at the beginning and the end of HipS.

In general, the explanations on the class level show different relevance patterns for the different classes. Very similar regions were considered relevant for crouch gait and apparent equinus. The relevant regions for these two classes primarily reside during the stance phase (approximately the first 60 % of the signal) of AnkleS (markers b and d in Figure 5), the swing phase (approximately the last 40 % of the signal) of KneeS (markers a and c in Figure 5), and the beginning and the end of HipS. In addition to these regions, for the apparent equinus class, the other signals also exhibit moderate relevance (e.g., most prominent in HipF, PelvisF, and KneeF). The jump gait class shares some of the relevant regions (i.e., the swing phase of KneeS (marker e in Figure 5), as well as the start and end of HipS) with the two classes described previously. However, there are certain differences, especially to crouch gait, such as the moderate relevance in PelvisS, HipS, HipT, KneeF, and KneeT. In addition, the stance phase in HipS is highly relevant for some samples, whereas for others it is not.
The true equinus class is clearly dominated by relevant regions during the stance phase in HipS and KneeS (markers f and g in Figure 5). This relevance pattern is clearly different from all other classes.

The highlighted regions in Figure 5 represent class-specific activations and do not have to be discriminative for the classification task per se. To identify which regions are most relevant for the overall classification task, we calculate the total relevance as the sum of Grad-CAM activations of all samples over all four classes. The higher this overall activation, the more relevant a given signal portion is for the classification task. Figure 6 shows the min-max normalized overall activation as blue lines. Figure 6A shows the mean and standard deviation of the raw input signals (calculated per class) as solid and dashed lines, respectively. Figure 6B-E shows the overall activation for models trained using the four different input configurations (Subsection IV-A). Figure 6B-E also shows the Gini impurity-based feature importance for the RFs and DTs in orange and red, respectively.

As with the class level explanations, we observed very high relevances for the CNN in the signals HipS, KneeS, and AnkleS in the corresponding model explanation in Figure 6C. This is independent of whether GRF signals are used (Figure 6B) or not (Figure 6C). If GRF data are used in addition, the propulsion peak in GRFAP (marker d in Figure 6) is very relevant for the classification task. When using only the signals that are most relevant to clinicians for classification (Figure 6D), we can see that i) for KneeS the relevance shifts more to the stance phase (marker h in Figure 6D) while decreasing for the swing phase (marker i in Figure 6D) and ii) for AnkleS the relevance shifts to the swing phase (marker j in Figure 6D) while decreasing overall. Similar regions exhibit high relevance when only GRF data are utilized (Figure 6E) as in the case when all 3DGA signals are used (Figure 6B).
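The two aggregation levels used for Figures 5 and 6 (the class-wise median and the min-max normalized sum over all samples) can be sketched as follows. The input is assumed to be one Grad-CAM relevance vector per sample, all of equal length; the function names are illustrative and the toy numbers in the usage note are not taken from the study.

```python
from statistics import median


def class_level(cams, labels):
    """Median Grad-CAM relevance per input position, separately per class."""
    by_class = {}
    for cam, label in zip(cams, labels):
        by_class.setdefault(label, []).append(cam)
    return {
        label: [median(column) for column in zip(*rows)]
        for label, rows in by_class.items()
    }


def model_level(cams):
    """Sum of Grad-CAM relevances over all samples, min-max normalized."""
    total = [sum(column) for column in zip(*cams)]
    lowest, highest = min(total), max(total)
    if highest == lowest:
        return [0.0] * len(total)
    return [(value - lowest) / (highest - lowest) for value in total]
```

For example, for three toy samples of one class with relevances `[0, 1, 2]`, `[0, 3, 2]`, and `[0, 5, 8]`, the class-level explanation is `[0, 3, 2]`. The individual vectors remain available for the ICE-like subplot that reveals differing strategies hidden by the median.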
2) EXPLANATIONS OF TREE-BASED MODELS

Figure 6 shows relevance scores (Gini impurity-based feature relevances) for both DTs (red) and RFs (orange) at the model level. RFs and DTs both exhibit locally similar regions that are considered highly relevant. The DT places a high emphasis on a single input feature, while the RF distributes the relevance to a more widespread area, which is strongly related to the relevant features of the DT. For the first three input configurations (Figure 6B-D) there is a strong correspondence between these regions, showing that both DT and RF find the most relevant information for the classification task in a highly targeted manner. The identified regions further correlate with those of the CNN (except for AnkleS in Figure 6D where the CNN activation is shifted towards the swing phase). Interestingly, RF and DT provide explanations which are much more focused on the clinically relevant signals, compared to the CNNs, for which the activations are distributed across a broad range of input signals. Finally, we want to point out that for the case where only GRF data are used (Figure 6E), a very noisy pattern is observed in the relevances of RF and DT, focusing mainly on the beginnings of the signals, which is in contrast to the regions relevant to the CNN. We assume that the low expressiveness of the GRF signals for the CP-related gait patterns is the reason for the unfocused activation patterns.

V. DISCUSSION

In the following, we analyze and interpret the classification and explainability results from a technical and clinical perspective. Additionally, we discuss the influence of different input configurations on the performance of the investigated classification methods. We structurally organize this section according to the research questions and provide answers to each of them.
1) HOW ADVANTAGEOUS IS THE USE OF KINEMATIC DATA OVER GRF DATA IN THE AUTOMATED CLASSIFICATION OF GAIT PATTERNS ASSOCIATED WITH CP, AND ARE THE TWO INPUTS MORE EFFECTIVE IN COMBINATION THAN USED INDIVIDUALLY?

The classification results in Table 4 demonstrate a significant difference in classification performance between the use of kinematic data and GRF data as input in all experiments. The absolute differences range from 32.6 % for the SNN (stride of two) to 51.4 % for the RF. The results on GRF data also show that DNNs are more effective than traditional ML approaches in modeling this type of data (the absolute difference between CNN and RF is 5.2 %). The model level explanation provides a possible rationale for this observation: The CNNs use very similar regions in the GRF signals for both input conditions, which is not the case for traditional ML approaches. For illustration, refer to Figure 6B and E, where the relevance of the CNNs (blue curves) is similar in both subfigures (apart from the high relevance at the end of GRFV, which is not particularly meaningful from a clinical perspective). The distribution of feature importance for RF and DT shown in Figure 6E is highly scattered and noisy, and there is also a lack of agreement between the two models. This suggests that RF and DT exhibit difficulties in learning the most important input features from GRF data.

FIGURE 5. Explainability results on the class level for the classification of gait patterns associated with cerebral palsy based on min-max normalized kinematic data using a CNN (with stride of two). The results are shown in four panels, each showing the results for a pathological gait pattern. The top of each panel displays the class-averaged kinematic signals that are colored with a sequential red palette, based on the median Grad-CAM relevance for that class (i.e., the redder, the more relevant). The bottom of each panel shows the Grad-CAM relevances for individual samples, with the median plotted in blue.

FIGURE 6. Explainability results on the model level for the classification of gait patterns associated with cerebral palsy based on min-max normalized kinematic data using a CNN (stride of two), a DT, and an RF. The results are presented in five subfigures (A-E). Subfigure A) displays the mean and standard deviation of the raw input signals as solid and dashed lines, respectively. The following four subfigures show the model explanations in terms of relevance for models trained on four different input configurations: B) all signals from 3DGA, C) only kinematic signals, D) only sagittal knee and ankle angles, and E) only ground reaction forces. The model explanations (relevances) for the CNN (blue) are calculated by adding the Grad-CAM relevances for all input samples and then applying min-max normalization. For DT and RF we provide the Gini impurity-based feature relevances in orange (RF) and red (DT).

When considering kinematic data, using only AnkleS and KneeS for the classification task results in a significant improvement in performance. Given that these signals are considered by the clinicians to be most relevant to the classification task, these improvements in performance are not surprising. Our experiments show that these two signals are also the most important and useful ones for the ML models.
Using all kinematic signals as input slightly decreases performance (except for RF), which indicates that (i) the models are distracted to a certain degree by the additional input and (ii) the other signals do not contribute additional information to the task.

Combining kinematic and GRF data does not provide any advantage and leads to a slight degradation in performance in the majority of cases. This suggests that there is no complementary information in the GRF signals compared to the kinematic data for the evaluated task. In the literature, GRF data have been utilized in multiple studies examining pathological gait patterns associated with Parkinson's disease [18], [57], [58], cerebral palsy [19], multiple sclerosis [19], osteoarthritis [59], transfemoral amputation [60], and lower limb fracture [61]. However, notable success in classification has been achieved primarily for the relatively simple classification tasks of distinguishing between one or two pathological gait patterns and healthy controls. Furthermore, the majority of previous research employed relatively small datasets. For the few studies that addressed more complex research questions, such as the classification of various functional gait disorders [8], [21], [22], the exclusive use of GRF data has yielded less promising results. This tendency is also evident in our results. The lower body motion information aggregated in GRF signals (i) is insufficient when used alone and (ii) does not contribute complementary information for the classification of gait patterns associated with CP.

2) HOW DO TRADITIONAL ML MODELS COMPARE TO STATE-OF-THE-ART DNNS FOR THE AUTOMATED CLASSIFICATION OF CLINICAL 3DGA DATA IN TERMS OF PERFORMANCE AND EXPLAINABILITY?
For the given task and data, RFs performed significantly better compared to the other ML methods including DNNs. When analyzing all input scenarios, RFs have consistently shown the best performance ranging from 92.9 % to 93.4 %, except for the scenarios involving only GRF signals. DTs demonstrated the second-best performance in almost all input scenarios, with the exception of scenarios involving only GRF signals. One explanation for the superior performance of the traditional ML methods is their good generalization ability when only a small number of training samples is available. This is not the case for CNNs and SNNs, which have a strong tendency to overfit on smaller datasets.

In comparison with related work focusing on the classification of the four CP-related gait patterns as defined by Rodda et al. [27], we achieved similar classification performance with the traditional ML models. Zhang and Ma [26] reported that MLPs achieved the highest classification accuracy of 93.5 %, while DTs and RFs achieved lower accuracy rates of 84.3 % and 83.6 %, respectively, using a dataset of 200 children and the four classes. Similarly, Darbandi et al. [28] achieved a classification accuracy of 94.0 % with their stochastic approach, using a dataset of 66 children and the four classes. In our study, utilizing a significantly larger dataset of 302 children, RFs demonstrated the highest performance.

The explainability results show that DNNs attempt to learn features from a broad range of signals for classification (e.g., Figure 6B shows high relevances for PelvisS, KneeS, AnkleS, and GRFAP, while other signals are also considered relevant to a certain degree). One possible cause of this behaviour is the limited dataset size. In contrast, RFs and DTs focus much more on the regions in KneeS and AnkleS that are actually relevant. We assume that their lower complexity in terms of numbers of parameters compared to DNNs is beneficial for the task and dataset.
Interestingly, the feature importance for DT and RF is consistent for all input configurations (Figure 6B-D), which indicates that neither model is distracted by additional (obviously mostly unrelated) input signals. The feature relevance for the DNNs is more sensitive and varies more strongly between the different input configurations.

3) TO WHAT EXTENT DO THE INVESTIGATED ML MODELS BASE THEIR DECISIONS ON CLINICALLY MEANINGFUL FEATURES WHEN CLASSIFYING CP-RELATED GAIT PATTERNS?

In clinical practice, the four investigated CP-related gait patterns mainly differ in the sagittal knee (KneeS) and sagittal ankle (AnkleS) angles during the stance phase. The model explanations for the three input configurations with kinematic data show the highest relevance in these signals (markers a/e/h and c/g/j in Figure 6). This matches expectations from clinical practice and is in agreement with other studies [48], which identified both signals as the most promising for distinguishing crouch gait, apparent equinus, jump gait, and true equinus. As previously discussed, DNNs tend to learn patterns from a broad range of input signals, in contrast to RFs and DTs, which focus only on the most clinically relevant signals, i.e., KneeS and AnkleS. Considering the case where all kinematic data are used (Figure 6C), the CNN shows the highest activations in KneeS and AnkleS as well, but also in HipS, which is clinically reasonable. Interestingly, although clinically relevant, HipS does not contribute to the classification performance in our experiments.

From a clinical point of view, the main characteristic of the gait pattern true equinus is an increased plantarflexion during stance. One could expect a strong activation in AnkleS for true equinus. However, true equinus has a plantarflexed pattern in the ankle, which is similar to jump gait. Apparent equinus (neutral ankle angle) and crouch gait (dorsiflexed ankle angle) are more similar in AnkleS compared to jump gait and true equinus.
However, they also differ from each other in AnkleS. Therefore, the high activation in AnkleS for crouch gait and apparent equinus (markers b and d in Figure 5) and the lack of activation in this signal for jump gait and true equinus seem plausible from a clinical point of view. In KneeS we see less activation for crouch gait and apparent equinus during stance (Figure 5A and B). This seems plausible from a clinical perspective because both gait patterns are associated with increased knee flexion during stance compared to normative data. Both jump gait and especially true equinus are gait patterns associated with a decrease in knee flexion or even hyperextension during stance. The classification algorithm clearly picked up this pattern associated with knee hyperextension, as we observed increased relevance scores during stance in KneeS for both classes, i.e., for some jump gait samples, but especially for true equinus (marker g in Figure 5).

Our experiments have further revealed that the explainability approaches highlight certain signal regions as highly relevant that may not be considered important from a clinical perspective. We observed a high activation during early to mid-swing in KneeS (markers b, f, and i in Figure 6), which is not expected from a clinical point of view. There are two possible reasons: either this attribution is due to a bias in the data (e.g., a spurious correlation with the respective classes), or it indicates a potentially useful signal region that the ML model has discovered during learning and that is not considered in clinical practice (either because it shows differences too subtle to have been considered clinically relevant so far, or because it has not yet been observed in practice).
These results demonstrate that explainability approaches have the potential not only to assess the correctness of the trained models but also to provide new clinical insights about the data and the investigated task.

Overall, we conclude that the ML models successfully learn clinically relevant patterns for the distinction of different CP-related gait patterns. All three model explanations have a high activation in the clinically most relevant signal regions, i.e., KneeS (markers a, e, and h in Figure 6) and AnkleS (markers c and g in Figure 6) during stance. In addition, the CNN also considered regions that were not expected from a clinical perspective, e.g., KneeS during swing (markers b, f, and i in Figure 6) and AnkleS during swing (marker j in Figure 6D). Explainability approaches can reveal such unexpected patterns and are essential for clinicians to verify that the model works correctly, to gain trust in its decisions, and to support gaining new insights into the data. We can conclude that all employed methods primarily focus on clinically relevant input signals, although this pattern is much more distinct for DT and RF than for the DNN models.

4) TO WHAT EXTENT ARE THE EXPLANATIONS OBTAINED FROM DNNS ROBUST TO VARIATIONS IN ARCHITECTURE?

CNNs and SNNs show significant differences in their explanations. A visualization of the explanations on the class level for all input configurations can be found in the supplementary material in Figure S1. A direct comparison shows that SNNs lack activations for entire classes in the scenario where all signals (kinematic and GRF data) are used. This means that in some situations the Grad-CAM explanation does not highlight any signal as important, which is counter-intuitive and not credible. However, when we reduce the number of signals, we obtain more reasonable explanations.
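For readers unfamiliar with how a Grad-CAM map can end up empty, the sketch below shows the generic Grad-CAM computation transferred to a one-dimensional signal. It is an illustration under stated assumptions, not the adaptation used in the study: the function name `grad_cam_1d`, the array shapes, and the choice to average gradients over the time axis are hypothetical. It only demonstrates one mechanical route to an all-zero map, namely that the final ReLU removes every position whose weighted activation sum is not positive.

```python
import numpy as np


def grad_cam_1d(activations, gradients):
    """Generic Grad-CAM for a 1D feature map (illustrative sketch).

    activations: array of shape (channels, time), output of a conv layer.
    gradients:   array of the same shape, gradient of the class score
                 with respect to those activations.
    """
    # Channel weights: average the gradients over the time axis.
    weights = gradients.mean(axis=1)
    # Weighted sum over channels gives one value per time step.
    cam = np.tensordot(weights, activations, axes=1)
    # The ReLU keeps only positions with positive evidence for the class.
    return np.maximum(cam, 0.0)


rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(4, 10)))  # non-negative activations

# Positive gradients: the map highlights regions of the signal.
print(grad_cam_1d(acts, np.ones_like(acts)))

# Negative gradients for every channel: the ReLU zeroes the entire map.
print(grad_cam_1d(acts, -np.ones_like(acts)))
```

Whether this mechanism plays a role in the empty explanations reported here cannot be determined from the sketch alone.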
This behavior may be attributed to the high dimensionality of the input data, which potentially leads to an over-parameterization of the model and in turn prevents Grad-CAM from identifying distinct features. For CNNs (with a stride of one) we also observe this problem, but only for the apparent equinus class in the configuration where all signals are used. This indicates that the normalization introduced by SNNs is not the (only) reason for this behavior. It is more likely that the high input dimensionality is responsible for the partly meaningless explanations. Further experiments with stronger regularization are needed to investigate this problem in more detail.

For the other input configurations (except for SNN with a stride of one and the kinematic data as input), there are more similarities in the explanations of CNNs and SNNs, especially for the crouch gait, apparent equinus, and true equinus classes. In general, there is also more similarity between the two CNNs with different strides, except for GRF signals, where the CNN with a stride of one learns patterns that are more similar to those of SNNs. The two SNNs with different strides show very high similarities for two input configurations, i.e., GRF and AnkleS & KneeS.

We conclude that small changes in the model architecture may lead to larger variations in the explanations, which should not be the case if the models are able to robustly model the given task. The input dimensionality seems to be one factor that impedes the robustness of explainability, but not the only one. Different network architectures (components, layers, and connection schemes between layers) may further either impede or facilitate the explainability of the model.

VI. FUTURE WORK & LIMITATIONS

An important prerequisite for reliable and well-functioning ML models, particularly DNNs, is a sufficient amount of data, but this is often a limitation in practice.
The more data are available, the more robust the patterns learned by the classifiers may be, leading to more intuitive explanations. We are optimistic about the use of DNNs in the domain of human gait analysis for two reasons. First, depending on the model type and architecture, it is possible to train models that provide meaningful explanations for clinical experts even with the currently available data. Second, it is very likely that further advances in performance and generalizability will be made in the future, as new data are constantly being recorded (just as in other domains where DNNs have proven their superiority after being trained on large datasets).

The limitation of training data underlines the need to analyze and combine 3DGA data from multiple laboratories in the future. Merging 3DGA data from different laboratories would provide a much larger and more heterogeneous dataset that could improve the generalizability of the models. Subsequently, this could lead to the development of more robust and diverse models that can be employed across multiple laboratories.

During our study and in previous research [45], we observed discrepancies between clinicians’ expectations and the explanations of the trained models. Clinicians expected ML models to use all regions that are characteristic (from a clinical perspective) for a particular class (independent of which other classes are modeled). However, the models often used only a subset of these regions. This discrepancy is exemplified by the true equinus class in our experiments, in which the model used regions in HipS and KneeS, whereas the clinicians expected the model to also use regions in AnkleS. From an ML perspective, it is logical that ML models mainly use features that exhibit large differences between classes. The high similarity of AnkleS between the true equinus and jump gait class (Figure 5A) seems to be the reason why its features are not used for the classification of true equinus.
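The effect described above, where a discriminative model ignores a feature that two classes share, can be reproduced with a toy example. The sketch below is purely illustrative: the two-feature data, the group means, and the names `sample`, `stump_importance`, `reference`, `pattern_a`, and `pattern_b` are hypothetical and do not correspond to the study's data or to real clinical values. A single-split decision tree is fitted once against a reference group and once against a second pattern with the same ankle values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

SIGNALS = ["AnkleS", "KneeS"]


def sample(rng, ankle_mean, knee_mean, n=200):
    """Draw n toy samples with one summary value per signal (unit variance)."""
    return np.column_stack([
        rng.normal(ankle_mean, 1.0, n),
        rng.normal(knee_mean, 1.0, n),
    ])


def stump_importance(class_0, class_1):
    """Fit a one-split tree on two groups and report which signal it used."""
    X = np.vstack([class_0, class_1])
    y = np.concatenate([np.zeros(len(class_0)), np.ones(len(class_1))])
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    return dict(zip(SIGNALS, stump.feature_importances_))


rng = np.random.default_rng(0)
reference = sample(rng, ankle_mean=0.0, knee_mean=0.0)  # stand-in for a reference group
pattern_a = sample(rng, ankle_mean=4.0, knee_mean=0.0)  # deviates in the ankle only
pattern_b = sample(rng, ankle_mean=4.0, knee_mean=4.0)  # same ankle, different knee

print("pattern A vs reference:", stump_importance(reference, pattern_a))
print("pattern A vs pattern B:", stump_importance(pattern_a, pattern_b))
```

In this toy setting the ankle feature separates pattern A from the reference group, yet the same feature is useless for separating pattern A from pattern B, so the tree relies on the knee feature instead.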
On the other hand, there are significant differences in HipS and KneeS that are relevant to true equinus in contrast to the other classes. Therefore, the high relevance in these two signals (and missing relevance in AnkleS) is reasonable from an ML perspective.

A further reason for this discrepancy may lie in the diagnostic approach of clinicians, which often compares a patient’s walking pattern to the walking pattern of healthy controls. To this end, clinicians typically employ methods such as statistical parametric mapping (SPM) [62], which allows the identification of statistically significant differences in the 3DGA data between a patient group and healthy controls. The experience with such methods may contribute to the expectation that an ML model’s relevant regions should include all regions that are considered different between a pathological gait pattern and healthy controls. However, in our case, the ML model learns to differentiate the four different pathological gait patterns in a discriminative manner. The ML model does not use a reference to physiological gait and mainly focuses on discriminative patterns that effectively separate two or more pathological classes.

A future direction may be to develop novel approaches that mimic the diagnostic approach of clinicians, while still being explainable and trustworthy. One possible approach is to regularize the ML model with input from clinicians, which would force the network to use specific regions in the data during the training process. Still, there is a trade-off between sacrificing potential insights into the data and building trust in the ML model that should be explored in future work.

VII. CONCLUSION

Building trust in ML models is essential in the medical field to facilitate their use in clinical practice.
Explainability approaches are a useful tool for revealing the information on which a model bases its predictions. Building upon the post-hoc explainability method Grad-CAM (initially introduced for images and adapted by us to time series), we generated explanations for DNNs trained to differentiate CP-related gait patterns on several levels, i.e., on the decision, class, and model level. Furthermore, we trained traditional models (DTs and RFs) for the given problem and explained them via feature importance.

We investigated which subsets of 3DGA data are particularly suitable for the classification of gait patterns associated with CP. Our results confirm the superiority of kinematic over GRF data for this complex classification task, with the former achieving a classification accuracy of up to 93.4 % compared to 47.2 % with GRFs. Our results further demonstrate that the employed ML models base their predictions on clinically relevant features. Traditional ML approaches such as RFs and DTs not only achieve better results in classifying CP-related gait patterns, but also focus more on the clinically relevant regions in the 3DGA data compared to DNNs. An interesting point from the clinical perspective is that DNNs use additional (initially unexpected) features for their predictions. This may provide novel insights into the data and thereby raise novel questions in the field.

REFERENCES

[1] S. McIntyre, “The continually changing epidemiology of cerebral palsy,” Acta Paediatrica, vol. 107, no. 3, pp. 374–375, Mar. 2018.
[2] N. Pérez and A. Rodríguez, “Cerebral palsy: Hope through research,” NIH NINDS, Bethesda, MD, USA, 2013.
[3] H. K. Graham, P. Rosenbaum, N. Paneth, B. Dan, J.-P. Lin, L. D. Damiano, G. J. Becher, D. Gaebler-Spira, A. Colver, D. S. Reddihough, K. E. Crompton, and R. L. Lieber, “Cerebral palsy,” Nature Rev. Disease Primers, vol. 2, no. 1, pp. 1–25, 2016.
[4] R. Baker, Measuring Walking: A Handbook of Clinical Gait Analysis.
London, U.K.: Mac Keith Press, 2013.
[5] T. Chau, “A review of analytical techniques for gait data. Part 1: Fuzzy, statistical and fractal methods,” Gait Posture, vol. 13, no. 1, pp. 49–66, Feb. 2001.
[6] J. Figueiredo, C. P. Santos, and J. C. Moreno, “Automatic recognition of gait patterns in human motor disorders using machine learning: A review,” Med. Eng. Phys., vol. 53, pp. 1–12, Mar. 2018.
[7] F. Horst, S. Lapuschkin, W. Samek, K.-R. Müller, and W. I. Schöllhorn, “Explaining the unique nature of individual gait patterns with deep learning,” Sci. Rep., vol. 9, no. 1, pp. 1–13, Feb. 2019.
[8] D. Slijepcevic, F. Horst, S. Lapuschkin, B. Horsak, A.-M. Raberger, A. Kranzl, W. Samek, C. Breiteneder, W. I. Schöllhorn, and M. Zeppelzauer, “Explaining machine learning models for clinical gait analysis,” ACM Trans. Comput. Healthcare, vol. 3, no. 2, pp. 1–27, Apr. 2022.
[9] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, pp. 52138–52160, 2018.
[10] A. Holzinger, C. Biemann, C. S. Pattichis, and D. B. Kell, “What do we need to build explainable AI systems for the medical domain?” 2017, arXiv:1712.09923.
[11] European Union, “Regulation (EU) 2016/679 of the European parliament and of the council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/EC (general data protection regulation),” Off. J. Eur. Union, vol. 119, pp. 1–88, May 2016. [Online]. Available: https://eur-lex.europa.eu/eli/reg/2016/679/oj
[12] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE ICCV, Oct.
2017, pp. 618–626.
[14] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, “Sanity checks for saliency maps,” in Proc. Adv. NIPS, 2018, pp. 9505–9515.
[15] E. Halilaj, A. Rajagopal, M. Fiterau, J. L. Hicks, T. J. Hastie, and S. L. Delp, “Machine learning in human movement biomechanics: Best practices, common pitfalls, and new opportunities,” J. Biomech., vol. 81, pp. 1–11, Nov. 2018.
[16] P. Khera and N. Kumar, “Role of machine learning in gait analysis: A review,” J. Med. Eng. Technol., vol. 44, no. 8, pp. 441–467, Nov. 2020.
[17] C. Cui, G. Bian, Z. Hou, J. Zhao, G. Su, H. Zhou, L. Peng, and W. Wang, “Simultaneous recognition and assessment of post-stroke hemiparetic gait by fusing kinematic, kinetic, and electrophysiological data,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 4, pp. 856–864, Apr. 2018.
[18] F. Wahid, R. K. Begg, C. J. Hass, S. Halgamuge, and D. C. Ackland, “Classification of Parkinson’s disease gait using spatial-temporal gait features,” IEEE J. Biomed. Health Informat., vol. 19, no. 6, pp. 1794–1802, Nov. 2015.
[19] M. Alaqtash, T. Sarkodie-Gyan, H. Yu, O. Fuentes, R. Brower, and A. Abdelgawad, “Automatic classification of pathological gait patterns using ground reaction forces and machine learning algorithms,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Aug. 2011, pp. 453–457.
[20] M. J. Long, E. Papi, L. D. Duffell, and A. H. McGregor, “Predicting knee osteoarthritis risk in injured populations,” Clin. Biomech., vol. 47, pp. 87–95, Aug. 2017.
[21] D. Slijepcevic, M. Zeppelzauer, A. Gorgas, C. Schwab, M. Schüller, A. Baca, C. Breiteneder, and B. Horsak, “Automatic classification of functional gait disorders,” IEEE J. Biomed. Health Informat., vol. 22, no. 5, pp. 1653–1661, Sep. 2018.
[22] D. Slijepcevic, M. Zeppelzauer, C. Schwab, A.-M. Raberger, C. Breiteneder, and B.
Horsak, “Input representations and classification strategies for automated human gait analysis,” Gait Posture, vol. 76, pp. 198–203, Feb. 2020.
[23] E. Papageorgiou, A. Nieuwenhuys, I. Vandekerckhove, A. Van Campenhout, E. Ortibus, and K. Desloovere, “Systematic review on gait classifications in children with cerebral palsy: An update,” Gait Posture, vol. 69, pp. 209–223, Mar. 2019.
[24] A. Ferrari, L. Bergamini, G. Guerzoni, S. Calderara, N. Bicocchi, G. Vitetta, C. Borghi, R. Neviani, and A. Ferrari, “Gait-based diplegia classification using LSMT networks,” J. Healthcare Eng., vol. 2019, pp. 1–8, Jan. 2019.
[25] A. Ferrari, S. Alboresi, S. Muzzini, R. Pascale, S. Perazza, and G. Cioni, “The term diplegia should be enhanced. Part I: A new rehabilitation oriented classification of cerebral palsy,” Eur. J. Phys. Rehabil. Med., vol. 44, no. 2, p. 195, 2008.
[26] Y. Zhang and Y. Ma, “Application of supervised machine learning algorithms in the classification of sagittal gait patterns of cerebral palsy children with spastic diplegia,” Comput. Biol. Med., vol. 106, pp. 33–39, Mar. 2019.
[27] J. Rodda and H. K. Graham, “Classification of gait patterns in spastic hemiplegia and spastic diplegia: A basis for a management algorithm,” Eur. J. Neurol., vol. 8, no. 5, pp. 98–108, Nov. 2001.
[28] H. Darbandi, M. Baniasad, S. Baghdadi, A. Khandan, A. Vafaee, and F. Farahmand, “Automatic classification of gait patterns in children with cerebral palsy using fuzzy clustering method,” Clin. Biomech., vol. 73, pp. 189–194, Mar. 2020.
[29] K. Chia, I. Fischer, P. Thomason, H. K. Graham, and M. Sangeux, “A decision support system to facilitate identification of musculoskeletal impairments and propose recommendations using gait analysis in children with cerebral palsy,” Frontiers Bioeng. Biotechnol., vol. 8, Nov. 2020, Art. no. 529415.
[30] V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. Vera Liao, R.
Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, and Y. Zhang, “One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques,” 2019, arXiv:1909.03012.
[31] M. Wagner, D. Slijepcevic, B. Horsak, A. Rind, M. Zeppelzauer, and W. Aigner, “KAVAGait: Knowledge-assisted visual analytics for clinical gait analysis,” IEEE Trans. Vis. Comput. Graphics, vol. 25, no. 3, pp. 1528–1542, Mar. 2019.
[32] A. Bois, B. Tervil, A. Moreau, A. Vienne-Jumeau, D. Ricard, and L. Oudre, “A topological data analysis-based method for gait signals with an application to the study of multiple sclerosis,” PLoS ONE, vol. 17, no. 5, May 2022, Art. no. e0268475.
[33] N. Roche, D. Pradon, J. Cosson, J. Robertson, C. Marchiori, and R. Zory, “Categorization of gait patterns in adults with cerebral palsy: A clustering approach,” Gait Posture, vol. 39, no. 1, pp. 235–240, Jan. 2014.
[34] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “SmoothGrad: Removing noise by adding noise,” 2017, arXiv:1706.03825.
[35] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7, Jul. 2015, Art. no. e0130140.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 1135–1144.
[37] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[38] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–12.
[39] K. Simonyan, A. Vedaldi, and A.
Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” 2013, arXiv:1312.6034.
[40] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, “Unmasking clever Hans predictors and assessing what machines really learn,” Nature Commun., vol. 10, no. 1, pp. 1–8, Mar. 2019.
[41] F. Horst, D. Slijepcevic, M. Zeppelzauer, A. M. Raberger, S. Lapuschkin, W. Samek, W. I. Schöllhorn, C. Breiteneder, and B. Horsak, “Explaining automated gender classification of human gait,” Gait Posture, vol. 81, pp. 159–160, Sep. 2020.
[42] D. Slijepcevic, F. Horst, M. Simak, S. Lapuschkin, A. M. Raberger, W. Samek, C. Breiteneder, W. I. Schöllhorn, M. Zeppelzauer, and B. Horsak, “Explaining machine learning models for age classification in human gait analysis,” Gait Posture, vol. 97, pp. S252–S253, Sep. 2022.
[43] C. Dindorf, W. Teufl, B. Taetz, G. Bleser, and M. Fröhlich, “Interpretability of input representations for gait classification in patients after total hip arthroplasty,” Sensors, vol. 20, no. 16, p. 4385, Aug. 2020.
[44] C. Kokkotis, S. Moustakidis, T. Tsatalas, C. Ntakolia, G. Chalatsis, S. Konstadakos, M. E. Hantes, G. Giakas, and D. Tsaopoulos, “Leveraging explainable machine learning to identify gait biomechanical parameters associated with anterior cruciate ligament injury,” Sci. Rep., vol. 12, no. 1, pp. 1–12, Apr. 2022.
[45] A. Rind, D. Slijepčević, M. Zeppelzauer, F. Unglaube, A. Kranzl, and B. Horsak, “Trustworthy visual analytics in clinical gait analysis: A case study for patients with cerebral palsy,” in Proc. IEEE Workshop Trust Expertise Vis. Anal. (TREX), Oct. 2022, pp. 8–15.
[46] A. Rind and D. Slijepcevic, “gaitXplorer screenshot,” Zenodo, Pölten Univ. Appl. Sci., St. Pölten, Austria, Dec. 2022, doi: 10.5281/zenodo.7442945.
[47] J. M. Rodda, H. K. Graham, L. Carson, M. P. Galea, and R. Wolfe, “Sagittal gait patterns in spastic diplegia,” J. Bone Joint Surg., vol. 86-B, no.
2, pp. 251–258, Mar. 2004.
[48] M. Sangeux, J. Rodda, and H. K. Graham, “Sagittal gait patterns in cerebral palsy: The plantarflexor–knee extension couple index,” Gait Posture, vol. 41, no. 2, pp. 586–591, Feb. 2015.
[49] D. A. Winter, Biomechanics and Motor Control of Human Movement, 3rd ed. Hoboken, NJ, USA: Wiley, 2005.
[50] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[51] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106, Mar. 1986.
[52] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). San Mateo, CA, USA: Morgan Kaufmann, 1993.
[53] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. New York, NY, USA: Taylor & Francis, 1984.
[54] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[55] A. Mamalakis, E. A. Barnes, and I. Ebert-Uphoff, “Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience,” Artif. Intell. Earth Syst., vol. 1, no. 4, Oct. 2022, Art. no. e220012.
[56] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin, “Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation,” J. Comput. Graph. Statist., vol. 24, no. 1, pp. 44–65, Jan. 2015.
[57] I. El Maachi, G.-A. Bilodeau, and W. Bouachir, “Deep 1D-convnet for accurate Parkinson disease detection and severity prediction from gait,” Expert Syst. Appl., vol. 143, Apr. 2020, Art. no. 113075.
[58] W. Zeng, F. Liu, Q. Wang, Y. Wang, L. Ma, and Y.
Zhang, “Parkinson’s disease classification using gait analysis via deterministic learning,” Neurosci. Lett., vol. 633, pp. 268–278, Oct. 2016.
[59] C. Nüesch, V. Valderrabano, C. Huber, V. von Tscharner, and G. Pagenstert, “Gait patterns of asymmetric ankle osteoarthritis patients,” Clin. Biomech., vol. 27, no. 6, pp. 613–618, Jul. 2012.
[60] D. P. Soares, M. P. de Castro, E. A. Mendes, and L. Machado, “Principal component analysis in ground reaction forces and center of pressure gait waveforms of people with transfemoral amputation,” Prosthetics Orthotics Int., vol. 40, no. 6, pp. 729–738, 2016.
[61] A. M. S. Muniz and J. Nadal, “Application of principal component analysis in vertical ground reaction force to discriminate normal and abnormal gait,” Gait Posture, vol. 29, no. 1, pp. 31–35, Jan. 2009.
[62] T. C. Pataky, “Generalized n-dimensional biomechanical field analysis using statistical parametric mapping,” J. Biomech., vol. 43, no. 10, pp. 1976–1982, Jul. 2010.

DJORDJE SLIJEPCEVIC received the M.Sc. degree in computer engineering from TU Wien, Austria, where he is currently pursuing the Ph.D. degree in technical sciences, with a focus on the development of machine learning methods in the field of clinical human gait analysis. He is a Researcher with the Institute of Creative Media Technologies (ICMT), St. Pölten University of Applied Sciences, Austria. He has extensive experience in research in the domain of automated human gait analysis, with particular focus on the topics gait recognition, gait pattern classification, gait event detection, and similarity retrieval of gait patterns. His research interests include machine learning, explainable artificial intelligence, computer vision, and time series analysis.

MATTHIAS ZEPPELZAUER received the Ph.D.
and Habilitation degrees in computer science from the Vienna University of Technology. He is currently the Head of the Media Computing Research Group and a Coordinator of the Center for Artificial Intelligence, St. Pölten University of Applied Sciences, Austria. His research interests include computer vision, machine learning, and multimedia analysis. He has a long track record of research on automated human gait analysis focusing on machine learning architectures and features for the extraction of information in gait signals and the prediction of pathological gait patterns.

FABIAN UNGLAUBE received the master’s degree in sports science. Since 2016, he has been a Research Assistant with the Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Speising, Vienna, Austria. He has participated in several national and international research projects in the field of clinical gait and movement analysis. On a daily basis, he works with orthopaedic and cerebral palsy related patients in the gait laboratory. He has been an active member in several professional societies, such as the European Society for Movement Analysis in Adults and Children (ESMAC).

ANDREAS KRANZL received the Ph.D. degree in sports science. Since 1996, he has been the Head of the Laboratory for Gait and Movement Analysis, Orthopaedic Hospital Speising, Vienna, Austria. He has more than 30 years of experience in clinical gait and movement analysis with a focus on orthopaedic and cerebral palsy related patients. He has held active membership in several professional societies, such as the European Society for Movement Analysis in Adults and Children (ESMAC). He has been active as a reviewer for several internationally renowned journals in the field of gait analysis.

CHRISTIAN BREITENEDER received the Diploma in Engineering degree in computer science from Johannes Kepler University Linz, in 1978, and the Ph.D. degree in computer science from TU Wien, in 1991.
He studied history of art with the University of Vienna, from 1977 to 1981, and theatre directing with Max Reinhardt Seminar, Vienna, from 1981 to 1984. He was a Postdoctoral Researcher with CUI, University of Geneva, Switzerland, from 1991 to 1993, and GMD (now Fraunhofer), Birlinghoven, Germany, from 1995 to 1996. He was an Associate Professor with the University of Vienna, from 1997 to 2000. He is currently a Retired Professor with the Institute of Visual Computing and Human-Centered Technology, TU Wien. His current research interests include interactive media systems, media processing systems, augmented and virtual reality, content-based multimodal information retrieval, and the analysis of high-dimensional data.

BRIAN HORSAK received the Ph.D. and Habilitation degrees in sport science from the University of Vienna. He is currently the Head of the Motor Rehabilitation Research Group and the Scientific Director of the Center for Digital Health and Social Innovation, St. Pölten University of Applied Sciences, Austria. He is an accomplished researcher whose vision is to combine technology and healthcare to provide advanced medical solutions in the field of gait analysis and rehabilitation. His research is geared toward enhancing clinical practice and facilitating medical decision-making through the utilization of cutting-edge technologies, including motion capturing, wearables, musculoskeletal simulations, machine learning, and augmented and virtual reality.