Pole-arina: Deep Learning–Based Coaching System for Pole Dancing Technique DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieurin im Rahmen des Studiums Visual Computing eingereicht von Katharina Scheucher, BSc Matrikelnummer 11809620 an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Dr. Peter Kán Mitwirkung: Dr. Diana Marin Wien, 8. September 2025 Katharina Scheucher Peter Kán Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Pole-arina: Deep Learning–Based Coaching System for Pole Dancing Technique DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieurin in Visual Computing by Katharina Scheucher, BSc Registration Number 11809620 to the Faculty of Informatics at the TU Wien Advisor: Dr. Peter Kán Assistance: Dr. Diana Marin Vienna, September 8, 2025 Katharina Scheucher Peter Kán Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassung der Arbeit Katharina Scheucher, BSc Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Ich erkläre weiters, dass ich mich generativer KI-Tools lediglich als Hilfsmittel bedient habe und in der vorliegenden Arbeit mein gestalterischer Einfluss überwiegt. Im Anhang „Übersicht verwendeter Hilfsmittel“ habe ich alle generativen KI-Tools gelistet, die verwendet wurden, und angegeben, wo und wie sie verwendet wurden. Für Textpassagen, die ohne substantielle Änderungen übernommen wurden, haben ich jeweils die von mir formulierten Eingaben (Prompts) und die verwendete IT- Anwendung mit ihrem Produktnamen und Versionsnummer/Datum angegeben. Wien, 8. September 2025 Katharina Scheucher v Danksagung Ich danke meinen Betreuern Peter Kán und Diana Marin von Herzen. Peter, mein Hauptbetreuer, hat mich mit schneller und durchdachter Beratung begleitet und seine Erfahrung großzügig eingebracht. Diana war eine außergewöhnliche Unterstützerin: Unsere wöchentlichen Besprechungen, ihr detailliertes Feedback und ihre Bereitschaft, jederzeit zu helfen (insbesondere in der Schlussphase) waren von unschätzbarem Wert. Mein Dank gilt zudem allen Teilnehmenden, die zur Datensammlung und zur Nutzerstudie beigetragen haben, sowie der Studioleitung von PoleDanceVienna für die großzügige Bereitstellung der Studioräume für die Datenerhebung und Evaluation. Abschließend gilt mein tiefster Dank meinem Partner, meinen Eltern und meinen Schwes- tern für ihre unerschütterliche Unterstützung, Geduld und Ermutigung. vii Acknowledgements I am sincerely grateful to my supervisors, Peter Kán and Diana Marin. Peter, as my primary supervisor, provided swift and thoughtful guidance, sharing his experience generously throughout this project. Diana was an exceptional supporter: our weekly meetings, her detailed feedback, and her readiness to help at any time (especially during the final stretch) were invaluable. My thanks extend to all participants who contributed to the dataset and the user study, as well as to the owner of PoleDanceVienna for generously providing studio space for data collection and evaluation. Finally, my deepest gratitude goes to my partner, my parents, and my sisters for their unwavering support, patience, and encouragement, always and without exception. ix Kurzfassung Diese Arbeit stellt Pole-Arina vor, ein markerloses Trainingssystem für statische Pole- Dance-Tricks, das Trainingsvideos analysiert, um den ausgeführten Trick zu erkennen und die Endpose mit transparentem, geometriebasiertem Feedback zu bewerten. Zu diesem Zweck wurde ein domänenspezifischer Datensatz kuratiert und annotiert. Er umfasst 836 Clips von 58 TeilnehmerInnen, die mit entsprechenden Labeln gekennzeichnet sind, welche die Erkennung mehrerer Tricks sowie passiver Zustände ermöglichen. Pole-Arina kombiniert ein leichtgewichtiges bidirektionales LSTM für die frameweise Erkennung mit einer regelbasierten Engine, die trickspezifische Ausrichtungen im Raum, Gelenkausrich- tungen und Abstände bewertet. Diese Daten werden für NutzerInnen durch intuitive Visualisierungen und verständliche Tipps zugänglich gemacht. Das Modell erreichte eine accuracy von 93,82% pro Frame über alle Klassen hinweg und eine accuracy von 98,74% für trickspezifische Klassen. In einer kontrollierten between-groups Anwenderstudie wurde Pole-Arina mit der traditionellen Video-Selbstbewertung verglichen. Die TeilnehmerInnen, die Pole-Arina verwendeten, gaben an, dass sie dem Feedback deutlich mehr Vertrauen schenkten und mehr Klarheit darüber hatten, wie sie sich verbessern konnten. Außer- dem wurde die Benutzerfreundlichkeit in der Pole-Arina Gruppe höher bewertet. Diese Ergebnisse zeigen, dass Pole-Arina eine genaue Erkennung und umsetzbares Feedback liefern kann, dem die Benutzer vertrauen und das sie verstehen, wodurch strukturiertes Coaching auch außerhalb des Studios zugänglich wird. Diese Arbeit schafft eine praktische Grundlage für KI-Coaching im Pole-Sport. xi Abstract This thesis presents Pole-Arina, a marker-less coaching system for static pole dancing tricks that analyzes training videos to recognize the performed trick and grade the final pose with transparent, geometry-based feedback. A domain-specific dataset was curated and annotated for this purpose. It includes 836 clips from 58 participants, labeled with a state scheme that supports multi-trick recognition and explicit background modeling. Pole-Arina combines a lightweight bidirectional LSTM for frame-wise recognition with a rule engine that evaluates trick-specific orientations, joint alignments, and proximities, rendering interpretable overlays and concise tips. The model achieved 93.82% per- frame accuracy across all classes and 98.74% trick-only accuracy on end-pose frames. A controlled between-groups user study compared Pole-Arina against traditional video self-review. Participants using Pole-Arina reported significantly higher trust in the feedback and greater clarity for how to improve, and rated usability higher. These results indicate that Pole-Arina can deliver accurate recognition and actionable feedback that users trust and understand, making structured coaching accessible outside the studio. This work establishes a practical baseline for AI coaching in pole sports. xiii Contents Kurzfassung xi Abstract xiii Contents xv 1 Introduction 1 1.1 Motivation & Problem Statement . . . . . . . . . . . . . . . . . . . . . 2 1.2 Aim of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Structure of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Related Work 7 2.1 Marker-Based vs. Marker-less Motion Capture . . . . . . . . . . . . . 7 2.2 Pose Estimation & Sequence Models . . . . . . . . . . . . . . . . . . . 9 2.3 Automated Coaching & Feedback Systems in Sports . . . . . . . . . . 10 2.4 Applications of Pose Analysis in Dance . . . . . . . . . . . . . . . . . . 12 3 Pole-Arina: Dataset 15 3.1 Data Collection & Labeling . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Feature Extraction & Preprocessing . . . . . . . . . . . . . . . . . . . 22 3.3 Final Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4 A Coaching System for Pole Dancing Technique 33 4.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Trick Recognition & Pose Analysis . . . . . . . . . . . . . . . . . . . . 34 4.3 Pole-Arina Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5 Pole-Arina: Implementation 39 5.1 Data Preprocessing & Augmentation . . . . . . . . . . . . . . . . . . . 39 5.2 Recognition Model (LSTM) . . . . . . . . . . . . . . . . . . . . . . . . 44 5.3 Geometric Scoring System . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.4 Pole-Arina: Application . . . . . . . . . . . . . . . . . . . . . . . . . . 53 xv 6 Evaluation & Results 61 6.1 Quantitative Model Performance . . . . . . . . . . . . . . . . . . . . . 61 6.2 User Study Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.3 User Study Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 7 Conclusion 75 Overview of Generative AI Tools Used 77 Übersicht verwendeter Hilfsmittel 79 List of Figures 81 List of Tables 83 List of Algorithms 85 Bibliography 87 CHAPTER 1 Introduction Advancements in artificial intelligence (AI) and computer vision are reshaping how athletes and dancers train, offering new possibilities for personalized coaching and performance analysis. From fitness apps providing real-time form corrections [DWDW25] to AI referees and analytics in professional sports [PPW+24, GRRCR23], virtual coaches are increasingly emerging to enhance efficiency and reduce injury risks [MMN+24]. Such systems also aim to enhance exercise enjoyment and to incorporate social aspects [DWDW25]. However, as AI-driven coaching becomes more prevalent, it is crucial to address ethical considerations such as privacy, informed consent, and user trust. Ensuring transparent algorithms and respectful data handling not only meets ethical standards but also improves user acceptance of AI systems [LWHL24]. With thoughtful implementation, deep and machine learning-based tools hold powerful enhancements in complementing a human instructor by processing complex movement data and delivering instant feedback for all types of training. AI-powered coaching systems have already proven their potential in sports and dance evaluation. For instance, computer vision models can evaluate Olympic sport perfor- mances and predict judges’ scores, providing an objective measure of technique and consistency [PTM17]. In the fitness domain, vision-based assistant applications guide users through exercises such as yoga, weight training, or martial arts [TP23]. Furthermore, marker-less pose estimation can handle complex movements like full-body rotations and self-occlusions to evaluate dance performances [KK18]. Across various domains, pose estimation and deep learning enable the automatic analysis of motion and the generation of real-time feedback. Crucially, unlike traditional marker-based motion capture systems, computer vision approaches enable free movements, making suits or sensors unnecessary [MK23]. This is especially valuable for contact sports, where wearables are impractical to integrate. 1 1. Introduction 1.1 Motivation & Problem Statement One emerging domain that has received little attention from such AI coaching technologies is pole dancing. In recent years, pole dancing has shed much of its past stigma and evolved into a recognized athletic practice, with a growing community and global competitions. The International Pole Sports Federation (IPSF) was founded in 2009 and subsequently introduced a standardized set of rules and regulations for competitions, marking a fundamental step in establishing pole dance as a serious sport [Fed25]. Although the Global Association of International Sport Federations (GAISF) classified pole dancing as a professional sport, critics argue that hypersexualization is unavoidable [Wea20]. Treating pole dancing explicitly as a sport helps shift expectations toward an inclusive, technique-centered evaluation. This thesis contributes to that shift by developing and studying an objective, transparent coaching system for pole technique. Pole dancing combines artistry with acrobatics, requiring strength, flexibility, and precise technique. Each pole trick involves complex, full-body movements that often include inversion and intricate transitions. To review their form independently, dancers most often record themselves, both inside and outside formal classes. Nonetheless, subtle body misalignments or incorrect technique often go unnoticed without proper assessment by expert instructors, which can slow progress and increase the risk of injury. A lack of feedback is not only frustrating but also causes falls or chronic strain. While instructors provide immediate guidance in the studio, they are costly and not always available, especially when training at home. Wearable-sensor systems overcome some visibility issues, but restrict movement and interfere with contact between the dancer’s skin and the pole, making them unsuited for pole dancing. Recent advances in marker-less pose estimation and sequence models make purely video-based coaching feasible [Qu24]. However, building such a system for pole dancing comes with its own challenges. Pole tricks are highly dynamic and can appear very similar in their early stages, making it challenging to recognize which trick is being performed until key characteristics emerge. Furthermore, the naming conventions of pole tricks are not globally standardized or as established as in other sports. The same trick might have different names across studios or regions, causing dancers to struggle in identifying the trick or searching for a related tutorial. These factors motivate the need for an automated assistant that can accurately classify the performed pole trick, pinpoint technical mistakes, and provide interpretable feedback to improve the dancer’s form. Another motivation stems from the author’s direct experience teaching pole-dancing classes. Observations of numerous students revealed common struggles with self-review and clarified the value of consistent feedback. In summary, an accessible system outside the studio that does not require wearables and is tailored to the unique demands of pole athletes offers a clear value. It represents a necessary step towards making pole dancing training more accessible. 2 1.2. Aim of the Work 1.2 Aim of the Work This thesis addresses the above-mentioned challenges by developing Pole-Arina, a marker-less, deep learning-based coaching system for static pole tricks. The main goals of Pole-Arina include: automatic pole trick classification, quality evaluation of the execution, and the delivery of corrective feedback in an intuitive visual format. To achieve this, the system leverages state-of-the-art pose estimation to extract the dancer’s skeleton keypoints from ordinary training videos. It analyzes the motion sequence using a domain-specific model and rules. Therefore, Pole-Arina aims to act as an AI-powered pole coach, identifying the trick, evaluating it, and highlighting form deviations. It is not a replacement for real-world instruction when learning new tricks, but it provides guidance in training settings. The focus is on fundamental static pole tricks to provide educational value for beginners and intermediate dancers, to build correct technique. This work aims to (1) design a privacy-preserving, interpretable coaching pipeline for pole dancing technique, (2) develop a lightweight temporal recognizer for trick and phase detection, and (3) evaluate the system’s effectiveness and usability in realistic training scenarios. Building on these objectives, the research questions can be summarized as follows: 1. How accurately can deep learning models classify pole dancing tricks? 2. To what extent can geometric-statistical scoring identify and quantify deviations from ideal execution of pole dancing tricks? 3. Which type of feedback from a marker-less coaching system can effectively improve a dancer’s technique, and how can this improvement be quantified? By answering these questions, the thesis aims to validate the feasibility of an AI-driven coaching tool in the context of pole sports. It further provides insights into designing effective automated feedback for complex athletic movements. The evidence is collected along the three research questions: • RQ1 via quantitative recognition metrics on held-out data. • RQ2 via rule statistics and pose scores for detected end-poses, to identify and quantify deviations from ideal execution. • RQ3 via a controlled user study measuring trust & adoption, efficiency, under- standability, and usability. 1.3 Contribution The key contributions of this work include: 3 1. Introduction • Pole Dancing Skeleton Dataset: A novel, domain-specific dataset of pole dancing performances was created to support model training and evaluation. This dataset encompasses six fundamental static pole tricks, each performed by multiple volunteers of varying skill levels. The data collection process was carried out with strict attention to ethics and privacy. All participants provided informed consent, and the raw video data were handled confidentially. To secure privacy, only the extracted 3D skeleton joint coordinates (with a reconstructed depth value) are included in the released dataset and utilized for analysis. The final dataset consists of 836 video samples, covering both successful and failed attempts across varying experience levels. Each clip is annotated with the trick label and temporal progression. • Trick & Phase Recognition Model: A bidirectional Long Short-Term Mem- ory (LSTM) neural network was developed to classify pole tricks and detect their phase progression automatically. During training and inference, the model ingests the entire landmark sequence for a clip and produces a per-frame prediction. The trained model provides an accurate classification pipeline that serves as the backbone of Pole-Arina. • Pose-Quality Scoring and Feedback System: Based on domain knowledge of proper pole technique, a geometric rule-based scoring system was designed to evaluate a dancer’s performance. Each trick holds unique characteristics that specify body alignment and present opportunities for common instructor critiques. Based on these attributes, custom rules were derived to define a trick and apply geometric computations on the 3D joint coordinates. The output is a set of quantitative scores indicating the performer’s deviation from the optimal pose. To aid understanding, the system visualizes the feedback directly on the dancer’s video frames as an overlay. It highlights angles, alignments, and distances. The scoring system and its visualization are implemented as part of an interactive feedback interface, enabling users to review their performance frame by frame with guidance on how to improve. Together, these contributions realize a lightweight, marker-less AI coaching prototype tailored to basic pole dancing tricks. Furthermore, this system offers potential for scalability as it is adaptable to other physical activities such as physiotherapy, gymnastics, or fitness routines. 1.4 Structure of this Thesis The remainder of this thesis is organized as follows: first, Chapter 2 reviews the state- of-the-art in human motion capture and automated coaching systems. It contrasts marker-based motion capture with marker-less pose estimation approaches, and discusses relevant deep learning models for pose estimation and sequence analysis. Furthermore, it surveys prior work on automated feedback in sports and dance, highlighting the implementation of similar concepts in applied domains like gymnastics, yoga, and general 4 1.4. Structure of this Thesis dance performance. Next, Chapter 3 presents a detailed description of the composition of the Pole-Arina dataset. It describes the selection of tricks, the data acquisition process, the annotation protocol for labeling phases, and the feature extraction pipeline using MediaPipe for skeletal data. This chapter also presents ethical considerations and dataset statistics, such as participant demographics and data characteristics. Chapter 4 provides a high-level overview of the Pole-Arina system’s design and evaluation methodology. It formally states the problem and introduces the system’s overall architecture. The chapter then describes the core methods for trick classification and pose-quality scoring. Chapter 5 dives into the technical implementation of each component. It covers data preprocessing steps, iterative model development, and parameter tuning of the LSTM model, specific rule calculations, and integration into a full-stack prototype. Chapter 6 evaluates the performance of the developed system. First, it presents the quantitative results of the trick recognition model, including accuracy, confusion matrices, and analysis of misclassifications. It further introduces the evaluation strategy by presenting a user study concept. It continues to report on the results, including both objective measurements and subjective feedback from questionnaires. Hypothesis tests and statistical analysis are used to determine the significance of observed improvements. Qualitative observations from the study are also discussed to gain insights beyond the tests. It further outlines potential directions for extending this research. Finally, Chapter 7 summarizes the thesis’s findings, reflecting on the extent to which the research questions were answered and the overall success of Pole-Arina in addressing the initial problem. 5 CHAPTER 2 Related Work This chapter surveys prior work to set the stage for Pole-Arina. First, it contrasts marker-based, wearable, and marker-less motion capture approaches and summarizes their trade-offs for athletic and dance contexts. Next, it reviews human pose estimation and temporal sequence models, before covering automated coaching systems in sports and applications in dance. This synthesis motivates the choice of marker-less pose estimation with lightweight temporal modeling for pole-dance technique analysis. 2.1 Marker-Based vs. Marker-less Motion Capture (a) Marker-based pose estima- tion example by [CZDK21] (b) Wearable motion capture example, taken from [LX13] (c) Marker-less motion capture using MediaPipe [Goo25] Figure 2.1: Visual comparison of different motion capture technologies. Marker-based optical systems. Optical motion capture (mocap) systems represent the gold standard for capturing human motion with high precision [STL24]. Such marker- based systems can achieve millimeter-level accuracy under controlled conditions, making 7 2. Related Work them well suited for detailed biomechanics analysis [CECS18]. They can be divided into active and passive marker systems [STL24]. While traditional motion capture systems like Vicon [Vic25] use passive markers that reflect light and are tracked by multiple cameras, active markers used by systems like Optotrak3020 [Nor25] emit light. Although active markers are generally more stable, they rely on additional power supplies and cables, which restrict movement compared to passive solutions. However, both solutions have limited practicality outside of the lab. Setting up markers on a person is time-consuming and can interfere with natural movement. Markers might slip, require readjustment, recalibration during dynamic movements, or suffer from occlusion and constrained space [STL24]. Even if no cables are involved, attached markers and suits impose physical and psychological constraints on the performer, altering how they move [CECS18]. Therefore, when covered with sensors, dancers and athletes might not perform as usual [MK23]. Markers or wearable systems are particularly problematic in sports like pole dancing, where minimal sportswear is required for grip on the pole. Attached sensors or markers on the dancer’s body interfere with the needed contact between skin and pole, while the pole can introduce magnetic interference. Thus, while marker-based systems are often unmatched regarding accuracy, their setup complexity and spatial limitations make them ill-suited to pole dancing. Wearable inertial systems. IMU suits, or inertial measurement unit suits, are wearable systems that track body movements using sensors to capture data such as acceleration, rotation, and magnetic fields [STL24]. Key advantages are portability and robustness to occlusion. Since no cameras are needed, athletes can move freely in any environment without a direct line of sight. Wearable sensors allow large capture volumes and real-time tracking, making them well-suited for in-situ training like skiing or running. While they impose less interference on athletes, compared to markers, the accuracy is lower. Inertial sensors drift over time, require recalibration, and can be sensitive to magnetic distortions [STL24]. In pole dancing, any device must be tightly secured to withstand inversion and spins, while high-impact or aerial motions (e.g., tumbles, jumps, flips) might still cause a shift or data integration errors. Therefore, while IMU suits are effective for specific use cases, they are ineffective for free-form arts like dance, which require high fidelity and unrestricted movement. Marker-less computer vision approaches. Advances in computer vision and deep learning have enabled marker-less motion capture using regular camera footage. Frame- works like OpenPose [CHS+19] and DeepPose [TS14] estimate human body positions from video frames without markers. This non-invasive method enables performers to move naturally while being recorded with ordinary cameras or smartphones [CECS18], using algorithms to reconstruct their skeleton motions. This is especially valuable in dance and other contexts, where attached equipment is undesirable. Fueled by deep learning, modern pose estimation models have dramatically improved accuracy and reliability. They achieve high recognition accuracy in various sports scenarios, from single-athlete skill analysis to multi-player game tactics [STL24]. However, limitations remain even 8 2.2. Pose Estimation & Sequence Models if marker-less methods are more flexible and scalable. Computer vision models require a direct line of view, struggle with occlusions, and are limited to the camera’s field of view. Nonetheless, in contexts like pole dancing, marker-less methods represent the most viable approach, as it does not restrain the dancers’ movements and leverages the use of simple video recording. Table 2.1: Marker-based vs. marker-less comparison overview. Method Advantages Drawbacks Marker- Based - High accuracy (mm-level) - Compromise natural movement - Detailed 3D data - Occlusion & limited capture volume - Well-validated for biomechanics - Expensive & requires lab setup Wearable Sensors (IMU) - No line-of-sight - Drift & noise issues → recalibration - Possible real-time feedback - Can be uncomfortable & restrictive - Minimal setup (small sensors) - Sensitive to magnetic interference Marker- Less - Enables free movement - Needs direct line-of-sight - Flexible setup - Accuracy depends on algorithm - Scales to multiple people & large scenes - Real-time bottlenecks 2.2 Pose Estimation & Sequence Models Human pose estimation. Early advances in computer vision enabled the automatic detection of human joint positions from images or videos. Toshev and Szegedy introduced DeepPose [TS14], a landmark work in formulating pose estimation as a deep neural network regression problem. Their work achieved state-of-the-art accuracy on benchmarks [TS14]. Furthermore, frameworks like OpenPose [CHS+19] have made real-time 2D multi- person pose tracking feasible by utilizing Part Affinity Fields to localize multiple people simultaneously. This and similar CNN-based methods became popular due to their robustness in various environments [ZWC+23]. Additionally, open-source libraries like Facebook AI’s Detectron2 [WKM+19] provide keypoint detection models as part of its object detection toolkit. Google’s MediaPipe Pose [Goo25] (based on BlazePose [BGR+20]) estimator further optimized pose estimation for mobile and edge devices, outputting 33 body landmarks over 30 FPS on a phone. It further reconstructs 3D pose coordinates (with a relative depth component) from a single RGB camera, enabling real-time posture analysis with only a smartphone camera. These advancements enable the calculation of highly accurate human skeleton data, in either 2D or estimated 3D, which can be obtained by marker-less pose estimation from ordinary video data. Sequence models. After the computation of landmark sequences, the data can be fed to sequence models to solve classification problems. Traditional Recurrent Neural Networks (RNNs) suffered from vanishing or exploding gradients when learning long sequences [Zar21]. Hochreiter and Schmidhuber introduced the Long Short-Term Memory 9 2. Related Work (LSTM) network as a solution to overcome this limitation by enforcing constant error flow over time through gating mechanisms [HS97]. LSTMs can maintain long-range dependencies (over 100 time steps in the original work), making them well-suited for complex motion sequences or time-series related data. For instance, an LSTM can aggregate frame-by-frame pose data to recognize a tennis swing or a dance sequence as a whole. However, recent developments introduced an architecture that revolutionized sequence modeling. Vaswani et al. presented the Transformer model [VSP+17], which removes recurrence entirely and relies solely on self-attention mechanisms to capture temporal relationships. Transformers enable greater parallelization during training and achieve superior results in tasks like machine translation as well as motion analysis tasks [Qu24]. In human pose analysis, such models can attend to all timesteps simultaneously, capturing subtle movement patterns that RNNs might miss. However, as expected with deep learning models, they typically require a large dataset to generalize well. On smaller motion datasets, simpler recurrent models can sometimes rival or outperform large pretrained Transformers. For instance, Ezen-Can [EC20] found that a tuned bi- directional LSTM outperformed a fine-tuned BERT (a transformer-based model) on a small action classification task, while also being much faster to train. Thus, model choice should not only be decided on accuracy and performance but also on data size and context. Especially for sports and rehab applications, data is limited and might favor an LSTM-based approach over a data-hungry Transformer model. Overall, modern pipelines often pair a CNN-based pose estimator with a temporal model to analyze a pose sequence [TP23]. Some systems utilize simple rule-based algorithms on landmark coordinates, while others train models directly on the data [TP23]. Furthermore, hybrid approaches are emerging, combining aspects of different models, to tailor the model architecture directly to the application domain. Qu et al. proposed a TransCNN-DSSS model to analyze dance movements with body dynamic and static streams [Qu24]. By decoupling quality dimensions via an attention mechanism, their model achieved about 90% accuracy in automatically scoring dance performances. This further highlights how different models can complement each other to form the technical foundation for modern motion analysis in various domains. 2.3 Automated Coaching & Feedback Systems in Sports Towards coaching systems. Advances in pose estimation and sequence modeling have enabled a new generation of automated coaching and feedback systems across a wide range of sports. In traditional coaching, detailed movement analysis was limited to expert eyes or expensive motion-capture setups. With affordable hardware and recent advancements in AI, athletes can now receive real-time technique feedback on demand. As a result, deep-learning-enhanced fitness applications are growing rapidly in popularity, offering highly engaging and personalized coaching experiences to users [DWDW25]. Such apps leverage computer vision methods to monitor the user’s movements and provide corrective feedback on form, count repetitions, and suggest workout adjustments. According to a recent user study, the appeal of such AI fitness systems lies in their 10 2.3. Automated Coaching & Feedback Systems in Sports interactivity and tailored guidance, which can boost motivation and other aspects of gratification [DWDW25]. Technically, most automated coaching systems share a standard pipeline: pose estimation from video, followed by movement assessment and feedback generation [TP23]. Tharatipyakul and Pongnumkul’s [TP23] review revealed that many systems rely on open-source frameworks, such as OpenPose, for human pose estimation to obtain skeleton data. The movement assessment can employ simple rule-based checks or a more complex comparison of an athlete’s motion trajectory against an optimal model. Notably, researchers emphasize the importance of feedback clarity, correctness, and ethical consideration as users must trust and understand the AI coach for it to be effective [LWHL24]. Sports applications. In individual sports like weightlifting, yoga, or golf, computer vision systems guide users to refine their form by detecting asymmetries in a yoga pose [BNKB23] or the swing plane of a golf club [LHK22]. Comparatively, in team sports, analysis often goes beyond the single-athlete technique towards tactical insights. With abundant video data for popular sports, an AI-driven system can evaluate how a player’s body orientation affects their passing options, or recommend tactical adjustments based on pattern recognition in movement data [PPW+24]. A systematic review by Pu et al. highlights that the explosion of data and the advancement of deep learning methods are transforming soccer analysis and training decisions [PPW+24]. Coaches can receive automated reports on metrics such as distance covered, joint load, or alignment during plays, thereby augmenting their expertise with objective data. Evaluation & scoring. Another important application is performance evaluation and scoring. AI systems have been developed to score performance by analyzing pose sequences. Parmar and Morris [PTM17] implemented this by training models on Olympic events. Their system learned to predict judges’ scores for diving, vault, and figure skating routines from video, using spatiotemporal features and regression models. A comparison between a Support Vector Regression (SVR) and an LSTM framework showed that while the SVR gave slightly better numeric scores, the LSTM was more natural for describing an action and thus better suited for giving qualitative feedback for improvement [PTM17]. Automated scoring is still an active research area. However, results so far show strong correlation with expert evaluations, suggesting AI can objectively standardize aspects of judging that are prone to human bias or error [KK18]. Injury prevention & rehabilitation. Motion analysis further enables the possibility of injury prevention and rehabilitation. By analyzing an athlete’s movement pattern over time, models can detect risky mechanics or deterioration that coaches might miss. Recent reviews conclude that machine learning models can significantly improve the accuracy of injury risk assessments by processing complex biomechanical and workload data beyond human capacity [MMN+24]. For instance, such systems might learn that a certain gait asymmetry and jump landing force profile often precede Anterior Cruciate Ligament (ACL) injuries. In rehabilitation, pose estimation systems monitor patients doing therapy 11 2. Related Work exercises at home, ensuring compliance and correctness. This kind of augmented feedback loop can personalize training loads and prevent injuries by continuously adjusting to the athlete’s posture [MMN+24]. In summary, automated coaching systems leveraging pose estimation are increasingly prevalent. They provide immediate, data-driven feedback on technique, reduce depen- dence on constant human supervision, and can enhance training efficiency and safety. While never aiming to replace human coaches, deep learning-based coaches fully act as intelligent assistants to reinforce proper form and measure performance. 2.4 Applications of Pose Analysis in Dance Scoring dance performances. Applying pose estimation and automated feedback to dance presents unique challenges and opportunities. Unlike many sports, dance is an artistic performance where quality is judged not only by objective technique but also by expressiveness, musicality, and style. Despite this subjectivity, researchers have shown that computational pose analysis can effectively evaluate and even enhance dance training. A study by Kim and Kim [KK18] introduced a real-time dance evaluation system that utilizes marker-less pose estimation. They developed a camera-based pose tracker that is robust to fast rotations and self-occlusions. For evaluation, they defined a metric to compare a student’s motion sequence to a reference sequence, ideal in terms of timing and accuracy. Remarkably, their system’s scores had a 98% correspondence with professional judges’ evaluations of the same performance [KK18]. Recent advances by Qu [Qu24] proposed a novel Transformer-CNN model that evaluated dance movement quality across multiple dimensions. Their approach breaks down dance quality into factors like accuracy of execution, fluidity of motion, and emotional expressiveness, using an attention-based mechanism to weight each factor. The combined model captures per-frame posture details and temporal dynamics to output an overall performance score. Again, tested against expert ratings, the system achieved an accuracy of 90% in predicting quality rating. While dancers could use such a system to get immediate feedback on form deviations, it also offers potential to provide rich recommendations targeting specific aspects of technique and performance quality. Dance movement recognition & classification. Beyond scoring entire performances, pose analysis has been used for classifying and recognizing dance movements. Bera et al. [BNKB23] addressed the problem of fine-grained posture recognition in sports, yoga, and dance. They also highlight the scarcity of large public datasets in this domain. Therefore, they introduce a new image dataset for 102 sports actions and 12 dance styles. To solve the classification problem, they implemented a deep CNN with patch-based self-attention to classify poses and styles. Similarly, Agarwal et al. introduced POA-Net [AJJB24], a CNN model for classifying dance poses and activities, with a focus on ballroom dance forms. These classification models enable applications, such as an automatic dance coach, 12 2.4. Applications of Pose Analysis in Dance to recognize the move a student is performing and evaluate it against a domain-specific syllabus. Real-world employment. Notably, the entertainment industry has already embraced rudimentary pose-based dance feedback in the form of video games. Ubisoft’s Just Dance [Ubi09] and similar rhythm games invite players to mimic on-screen choreography and score their performance based on the similarity of movements. Earlier versions relied on handheld motion controllers, but newer systems use camera-based full-body tracking (e.g., Kinect sensor) to evaluate dance moves. While these games are not as precise as state-of-the-art research pose estimators, they demonstrate a mass-market use case of marker-less motion analysis. Therefore, it is a short stretch to imagine a more serious dance training tool that provides dancers with real-time corrections during practice. Application: Pole dance Despite this growing trend of AI coaching systems, pole dancing remains an area with relatively sparse technological assistance or other related work. In contrast to the evaluation and scoring system, PoeSpin [LCLX25] represents a human-AI collaborative system that turns pole dance movements into poetry. In Li et al.’s [LCLX25] work, they mapped the dancer’s poses and motions to poetic verses by using movement as input for a generative art model. As a demonstration, the system captured live performances from a pole dancer to compose verses in real-time, blending physical expression with literary expression. This artistic application underscores that pose estimation is not limited to quantitative evaluation. At least one attempt placed pole dancing into the context of automatic evaluation. Yu [Yu20] proposed a virtual reality-based training system to help students learn pole routines by imitating a virtual instructor in an immersive environment. The system synchronized music and movement in a virtual scene to improve students’ sense of rhythm and performance quality. While offering a new teaching medium, it does not provide automated pose feedback under commodity hardware constraints, which is precisely the advantage that Pole-Arina aims to provide. In conclusion, marker-less pose estimation and AI analysis have shown great success in domains ranging from sports training to dance education. These technologies respect the performer’s freedom of movement and capture rich data that can be translated into meaningful feedback. By building on existing methods and addressing pole-specific challenges, the related work sets the stage for Pole-Arina: a deep learning-based coaching system for pole dancing technique. 13 CHAPTER 3 Pole-Arina: Dataset This chapter introduces the dataset for the Pole-Arina system. It motivates the selected set of static pole tricks and the recognition task. Furthermore, it outlines two acquisition paths, guided studio classes and open online submissions, with explicit consent and privacy safeguards, and details phase-aware annotation schemes. Next, it describes the pose-extraction pipeline that yields 3D skeleton sequences, along with lightweight preprocessing for temporal stability. The closing sections report key statistics, such as class balance, label integrity, pose coverage, and demographics. Last, it highlights practical biases that inform subsequent modeling and evaluation. The final dataset can be found here: Dataset. 3.1 Data Collection & Labeling This section establishes the data foundation for Pole-Arina. The collection goal includes an ethical, realistic representation of fundamental static pole tricks. To that end, the dataset targets six foundational tricks and balances feasibility for novices with sufficient discriminability for modeling. Overall, the data collection follows two complementary paths: guided in-class recordings and open online submissions. Both routes use the same filming guidance and consent protocol and prioritize privacy by limiting released data to 2D skeleton keypoints. Annotations structure each clip into semantically meaningful temporal states. The remainder of the section motivates the trick set and task definition, presents data from both acquisition procedures, and discusses ethical concerns. 3.1.1 Trick Selection & Task Definition Pole dance contains a broad repertoire of movement types, including dance moves (transitions where at least one foot stays on the floor), spins (rotational movements around the pole, while both feet are in the air), floor work (movements performed close 15 3. Pole-Arina: Dataset to the ground), and tricks (static or semi-static shapes on the pole). Poles themselves are either configured in static (no rotation) or spinning mode (bearing-mounted rotation). Since naming conventions and grouping of elements vary across federations and studios, this work uses the terminology presented by Spin City’s Pole Bible [Cit25] and the International Pole and Aerial Sports Federation (IPSF) [Fed25]. This work focuses exclusively on static pole tricks. Static execution allows the dancer to present the end position at a deliberate, consistent yaw to the camera/viewer. This minimizes foreshortening of limbs and self-occlusion and maximizes the aesthetic of the silhouette. Compared to spinning elements, this requirement makes static tricks more suited to consistent pose estimation and geometric measurement, as the reduced motion stabilizes the keypoint detection and limits motion blur on smartphones. Task Definition. Each trick is decomposed into a three-phase movement sequence: 1. Entry: the performer is off the pole or in initial contact, preparing the entry. 2. In transition: the dancer is on the pole and moving toward the target position. 3. End pose: the final pose is established and held for a short interval. For feedback generation, this thesis utilizes a geometric scoring system on the end pose phase only. They present the benefit of being not only time-stable but also reflecting other errors that manifest during the transition, making them suitable for precise, rule-based evaluation. Evaluation of the other phases is left as future work due to time constraints and the greater variability within the movement. Selection criteria. The goal for the trick selection was to balance feasibility, safety for novice participants, and discriminability for the model. Therefore, the choice fell on six foundational tricks spanning two posture types: three upright beginner-level tricks and three intermediate inverted tricks. The upright set holds Layout, Pin-Up, and Wrist Seat. These tricks share similar entries (standing on the right side of the pole and pulling up into a sit) yet result in distinct end shapes. While the Wrist Seat is more distinguishable, the Layout and Pin-Up provide intentional “near-neighbor” classes to test the classifier’s ability to separate subtle visual differences (see Figure 3.1a). The inverted set comprises Straddle Invert, Gemini, and Crucifix. Again sharing similar entries, these tricks begin from a basic invert grip, leading into a basic inverted position and transition to clearly differentiated end poses (see Figure 3.1b). Compared to the upright tricks, the inverted poses introduce greater biomechanical complexity, while remaining achievable for athletic beginners to lower-intermediate dancers. 16 3.1. Data Collection & Labeling (a) 1: Layout, 2: Pin-Up, 3: Wrist Seat (b) 4: Straddle Invert, 5: Gemini, 6: Crucifix Figure 3.1: Progression of each trick, highlighting similar entries and transitions before the final pose. 3.1.2 Data Acquisition Collecting high-quality and representative data was a crucial step in this thesis, as together with the annotations, it provides the ground truth for training and evaluating the Pole-Arina system. The aim was to assemble a dataset of approximately 600 clips, covering a balanced distribution of the six selected tricks and including both successful and failed attempts. Beyond the scale and diversity of the data, particular attention was given to the ethical aspects of data collection, since responsible handling of human-centered motion data is essential for trustworthy AI research [HZMY22]. Ethical considerations. Data-driven approaches such as Pole-Arina and other deep or machine learning-based systems offer clear benefits but also raise well-documented ethical concerns [HZMY22]. This paragraph briefly highlights possible issues and how they are addressed. In particular, recent publications discuss controversies around training on copyrighted or scraped content without consent (especially regarding generative AI). This further reinforces the need for explicit permission and transparent documentation [Lem24, Luc24, BP21]. With these concerns in mind, our data collection followed four principles: 1. Consent: A detailed protocol informed participants about the project goals, data handling, privacy measures, and withdrawal rights. Contributions were voluntary and limited to the intended purpose, consistent with GDPR (General Data Protection Regulation) requirements [PC16]. TU Vienna’s data protection policy also enforces these regulations for lawful processing and data minimization [Wie25]. 2. Privacy: To secure the participants’ privacy, the released dataset only includes skeleton 17 3. Pole-Arina: Dataset joint coordinates. The raw video files remain private and are used solely for processing and quality control. Such privacy-enhancing approaches aim to minimize exposure of facial and background detail, while preserving the system’s performance [HNR+25]. Although numerous participants gave their consent to publish their images in this thesis, the only human depicted in this work is the author. 3. Transparency: Standardized trick terminology is used throughout the execution of the data collection protocol. It further provides detailed information about the motivation, collection process, intended use, and limitations. 4. Discrimination and bias: While paying attention to balancing the data across tricks, experience levels, age, and gender, bias may arise from the used pose-estimation framework and demographic imbalance. Prior research highlights that fairness evaluation for human pose estimation is challenging due to missing demographic labels and data imbalance [LTNX23]. Online submissions. The first option to participate in the data collection was via an online form. Considering the ethical and project requirements, the form contained: project overview, participation and data-use terms, contact details, data collection consent, and detailed filming instructions for each trick, including a short tutorial playlist. General recording guidelines included: • Use a smartphone: Participants record each trick with a smartphone, following the provided instructions. • Angle & framing. The camera should capture the full body at all times, position it straight-on rather than from above or below. • Multiple attempts/videos per trick encouraged: Multiple attempts per trick (including incomplete or failed tries) are encouraged to capture natural variation for the learning system. • Avoid background distractions: Record in a space without bystanders or distracting movement visible in the frame. • Lighting. Provide even, front-facing illumination so the body and movements are clearly visible, avoid strong backlight. • Clothing. Wear form-fitting athletic wear (e.g., shorts and a sports bra/tank top) to keep key body positions visible. The whole form was available in English and German. Once the video recording was finished, participants were able to upload the results directly through the online form. 18 3.1. Data Collection & Labeling In-class recordings. Because online recruitment alone did not reach the target goal, additional data were collected through dedicated pole classes. Similar to the online form, participants got all the relevant information beforehand and gave their consent with the right to withdraw participation at any time. An experienced instructor (myself) demonstrated each trick and supervised filming with a fixed camera setup that mirrored the online instructions. This setup ensured consistent angles, safer spotting for beginners, and more control to balance the data across the six target tricks. Results. The final dataset comprises 836 clips from N = 58 participants. The protocol deliberately encouraged multiple attempts per participant to capture natural variability in the progression of a trick. Section 3.3 presents a detailed breakdown of the final dataset. Table 3.3 compiles concise, syllabus-aligned instructions and categories for the six selected tricks. 3.1.3 Annotation Protocol To get into the final shape of a specific pole dancing trick, a sequence of movements is required. Therefore, the goal of the annotations was to create a ground truth that represents not only the final position but also identifies the trick’s temporal progression. In parallel to the annotation process, the recognition model was already trained and tested on a subset, resulting in two different annotation schemes. The initial multi-task protocol (Protocol A) separates trick and phase labels, while the revised single-task scheme (Protocol B) merges them into a single per-frame label. How the labeling was conducted. Hand-labeling data is often a slow, tedious, and repetitive process. A brief initial review of a small batch of clips resulted in a concise labeling guide and definitions. The aim was to keep labels measurable (frame-accurate start/end marks for phases) while avoiding subjective constructs. In particular, an early idea to label a per-phase “score” proved too subjective for consistent ground truth and was therefore replaced by a post hoc, rule-based scoring system. A custom Python tool accelerates annotation with keyboard shortcuts for frame stepping, instant annotation, and one-key phase assignment. This workflow kept labels consistent, adoptable, and scalable across hundreds of clips. Why phase awareness matters. As described in section 3.1.1, each trick progresses through: entry (lifting off the floor) → transition (moving on the pole) → end pose (holding the final position). Phase awareness is crucial for three reasons. First, many tricks share similar entries, while discriminative cues often emerge only near the end of the transition. Therefore, phase context aids trick identification. Second, each phase holds different challenges (e.g., jump vs. pull-up during entry; loss of engagement during transition; angle deviations in the end pose). Third, end poses are comparatively more stable due to their static nature, enabling reliable geometric scoring methods (see Section 4.2.2 and 5.3). 19 3. Pole-Arina: Dataset Protocol A: multi-task labels. In the first scheme, each frame received two targets: a trick label from {Layout,Pin-Up,Wrist Seat,Straddle Invert,Gemini,Crucifix} and a phase label from {Start,Transition,End}. The definition of each label is as follows: • Start: the dancer is still on the floor, touching the pole and preparing for the trick. • Transition: both feet have left the floor, the dancer is on the pole, transitioning into a target position. • End: the dancer has reached the final shape and holds it. The phases match directly with the previously defined temporal progressions of a trick. A custom Python script was implemented to efficiently navigate through the video files and apply labels using keyboard shortcuts. The data were trimmed to the annotated interval (first Start phase frame to last End phase frame), and the labels were saved as a CSV with the following format: filename,trick_name,start_frame,end_frame,phase l1.mov,Layout,0,6,Start l1.mov,Layout,7,112,Transition l1.mov,Layout,113,162,End ... Frame indices are zero-based, the start and end frames are inclusive, and segments are contiguous and non-overlapping within each file. This scheme aligns with a multi-task bidirectional LSTM, which will be discussed in section 5.2. As a result of this protocol, each video was separated into these exact same phases in the same order. This sequence prior helps suppress spurious fragments during trick detection. However, in real practice videos, idle time before starting, failed attempts, and immediate retries are very common. This rigid order suppressed more flexible patterns and restricted training signals for background frames. As illustrated in the top row of Figure 3.2 (A), frames outside the strictly defined window are trimmed, so idle time, retries, and dismounts are not represented in the labels. Protocol B: single-task labels. To better support realistic training settings, where dancers might perform multiple tricks in one recording, a revision of the first scheme transformed the labels into single per-frame targets. The label set comprised two generic states and end pose labels for each of the six tricks: {floor, on_pole} ∪ {L_pose, P_pose, W_pose, V_pose, G_pose, C_pose}. The exact definition of each label is as follows: 20 3.1. Data Collection & Labeling • floor: the dancer is on the floor. • on_pole: both feet have left the floor, the dancer is on the pole. • *_pose: the dancer has reached the final shape and holds it. Again, the phases match the defined temporal progressions of a trick. This subtle change merged the former trick and phase targets into one label space, enabling arbitrary sequences such as floor → on_pole → floor → on_pole → L_pose → on_pole → floor, thereby capturing idle segments and retries within a single clip. A semi- automatic Python script supported an efficient adaptation from the old to the new annotation scheme. While the floor label extends the previously used Start label to also cover idle time before and after the dancer is on the pole, the remaining labels are directly mapped to: Transition → on_pole and End → {L,P,W,V,G,C}_pose. Again, stored in a CSV, the labels were transformed into: filename,state,start_frame,end_frame l98.MOV,floor,0,20 l98.MOV,on_pole,21,140 l98.MOV,L_pose,141,154 l98.MOV,on_pole,155,308 l99.MOV,floor,0,83 l99.MOV,on_pole,84,151 l99.MOV,L_pose,152,158 p1.mov,floor,0,10 p1.mov,on_pole,11,97 p1.mov,P_pose,98,137 ... The bottom row of Figure 3.2 (B) shows how the state labels capture idle (floor), generic interaction (on_pole), explicit end poses (e.g., L_pose), and the dismount (on_pole → floor) within a single recording, enabling multi-trick detection and robust background modeling. Model-wise, this simplified the objective from a two-head (trick, phase) to a single-head problem over eight states. Significantly, the model trained on Protocol-B could detect multiple tricks per video and distinguish background and generic interaction from explicit poses. However, the introduction of generic labels by protocol B enhanced class imbalance, which was addressed by using a weighted cross-entropy. Conclusion. Protocol A provided clean, phase-aware supervision but constrained clips to a single trick and rigid phase order, limiting realism and background coverage. Protocol B preserved phase awareness implicitly, scaled labeling to real practice behavior, and enabled multi-trick detection within a single video. A side-by-side example on the same video is shown in Figure 3.2, contrasting the rigid Start→Transition→End segmentation of Protocol A with the richer state timeline of Protocol B. 21 3. Pole-Arina: Dataset Figure 3.2: Side-by-side comparison of applying both protocols to the same video. 3.2 Feature Extraction & Preprocessing This section explains how the pipeline converts raw training videos into stable, privacy- preserving inputs for learning and evaluation. Instead of operating on pixels, the system extracts 2D body skeletons per frame and stores them as a compact tensor. A lightweight preprocessing stage then mitigates artifacts, such as high jitter, brief occlusions, and occasional dropped detections. These steps, together with normalization and optional data augmentation, produce temporally coherent landmark sequences that match the training distribution. 3.2.1 Skeleton Extraction Pose-estimator selection. Three marker-less pose estimators were considered with a focus on real-time application and mobile feasibility: OpenPose [CHS+19], Medi- aPipe Pose [Goo25], and Detectron2 [WKM+19]. MediaPipe is based on the BlazePose [BGR+20] architecture and offers a lightweight, on-device pipeline with a 33-landmark body topology. OpenPose provides robust multi-person parsing but carries a heavier runtime cost with only 18 joints. Detectron2’s Keypoint R-CNN achieves high accuracy on COCO’s 17-keypoint scheme, but requires GPU resources and offers 17 joints. A qualitative comparison of these estimators on four representative pole tricks pro- vides insight into their limitations. MediaPipe produced stable and complete landmark predictions even in inverted positions, while Detectron2 and OpenPose frequently lost joints or misaligned the skeleton. Simple poses, such as the Pin-Up, were estimated consistently across all frameworks, while inverted tricks, like the Straddle Invert, proved more challenging, especially for Detectron2 and OpenPose (see Figure 3.3). A quantitative runtime analysis on Google Colab further highlights their performance differences. As shown in Table 3.1, MediaPipe achieves real-time performance, even on a CPU, while Detectron2 and OpenPose are far slower and therefore impractical for real-time analysis. 22 3.2. Feature Extraction & Preprocessing (a) MediaPipe (b) OpenPose (DNN/COCO) (c) Detectron2 Figure 3.3: Qualitative comparison of skeleton overlays across four pole tricks. Besides runtime, deployment support also played a role. MediaPipe offers pip-installable packages with stable CPU/GPU support. Detectron2 requires GPU resources and a heavier installation. In contrast, OpenPose suffers from compatibility issues with current CUDA/Caffe toolchains and only supports Ubuntu and Windows systems. Considering runtime performance, number of joints, robustness in complex poses, and ease of integration, MediaPipe Pose was selected as the estimator for this thesis. Table 3.1: Runtime benchmark on a five-second, 360×640 video (164 frames). CPU = Colab CPU runtime; GPU = Colab T4. OpenPose results use the COCO-18 model via OpenCV DNN. Framework CPU (Colab) T4 GPU (Colab) Joints Supports Time [s] FPS Time [s] FPS MediaPipe Pose[Goo25] 7.7 21.3 5.5 30.0 33 Windows, Linux, Android, iOS, macOS Detectron2 (KPRCNN)[WKM+19] 1382.9 0.12 23.4 7.00 17 Linux, macOS, Windows (limited) OpenPose (COCO)[CHS+19] 691.6 0.24 618.9 0.26 18 Ubuntu, Windows MediaPipe configuration. The extraction script implements MediaPipe Pose with the following specifications: • MEDIAPIPE_DISABLE_GPU=1 • static_image_mode=false 23 3. Pole-Arina: Dataset • min_detection_confidence=0.5 • min_tracking_confidence=0.5 For each frame, MediaPipe returns 33 landmarks as an array (x, y, z, visibility). Coordi- nates x, y are normalized to [0, 1] (origin at the top-left), where z is a relative depth value in the same normalized scale (negative towards the camera). Visibility is a per-landmark confidence in [0, 1]. Output. For each input video of length T frames, the extractor produces a NumPy array of shape T × 33 × 4 (float32) saved as .npy. Optionally, for quality checks, the script saves a video with a skeleton overlay for each frame. Script notes. Extraction is implemented in extract_skeleton.py (batch process- ing over folders). The tool also supports optional data augmentations (add noise, in-plane rotation, sequence time-warp) to create additional samples. 3.2.2 Preprocessing Techniques Temporal smoothing. Pose estimation might suffer from jitter caused by detecting noise and small motions. To address this issue, the skeleton extractor applies an Exponen- tial Moving Average (EMA) to each landmark. Let ℓi,j ∈ R3 be the coordinates (x, y, z) of joint j at frame i and ℓ̂i,j the smoothed value. With smoothing factor α ∈ (0, 1], the smoothing process is governed by: ℓ̂i,j = α ℓi,j + (1 − α) ℓ̂i−1,j , with α = 0.3 in our case. EMA is a first-order infinite impulse response (IIR) low-pass filter. Compared to the Simple Moving Average (SMA), EMA is more efficient as it does not require a buffer to store previous data, and the weight is not distributed equally. Instead, it emphasizes the most recent data samples while the previous data decays exponentially but never reaches zero [FHC19]. Low-visibility handling. MediaPipe provides a confidence for the visibility of each landmark. If the value falls below a threshold τ = 0.3, the landmark keeps the previous frame value rather than updating, to reduce jitter during brief occlusions. This usually occurs when the pole covers certain body parts. If no pose is detected for an entire frame, the extractor saves a zero array placeholder. In the later introduced preprocessing pipeline, such frames will be linearly interpolated. These additions stabilize joint trajectories while keeping the computation lightweight. Savitzky-Golay filters, or Kalman filters [FHC19], are valid alternatives but introduce either additional latency or extra state assumptions [Sch11, FHC19]. 24 3.3. Final Dataset Statistics 3.3 Final Dataset Statistics This section summarizes the composition and quality of the dataset. It documents the number of recordings per trick, the distribution of frames across labels, and the reliability of the pose extractor in tracking joints. Beyond simple counts, the statistics highlight practical biases that arise in real practice footage and explain the compensating measures used during learning. The reporting proceeds as follows: • Label & class balance: per-video counts by trick and frame-level distributions under Protocol B, with a side-by-side comparison to the legacy Protocol A to expose phase proportions. • Label integrity: coverage ratios (#labeled / #total frames) and checks for overlaps or gaps to ensure consistent alignment between labels and skeletons. • Video & skeleton properties: capture characteristics (FPS, portrait resolu- tions) and pose-tracking reliability (per-clip coverage, per-joint visibility patterns), confirming suitability for temporal modeling and geometric rules. Together, these statistics provide a transparent view of the dataset’s strengths and limitations. 3.3.1 Label & Class Balance This section provides a summary of the chosen label properties to provide insight and validate the quality. Per-video class balance. Figure 3.4 shows the number of videos that contain at least one end-pose for each of the six target tricks. Upright tricks like the Layout and Pin-Up are more common compared to the inverted ones. These results were expected given the increasing difficulty of the intermediate tricks. Especially for beginners, sitting on the pole in a Layout or Pin-Up is easier than hanging upside down from a Gemini. This class imbalance was addressed by applying targeted data augmentation and a weighted loss for the model. Notably, a few video files include failed and multiple attempts of a trick, but the visualization contains the number of videos, not segments. The actual number of frames per phase will be discussed in the section below. Across the complete list, 812/836 videos contain one end-pose, 9/836 include a second try, and 24/836 are failed attempts. Phase distribution. The following plots show the frame distribution for each label and include a comparison against the legacy Protocol A. Figure 3.5 presents the actual phase per frame distribution used for training the final recognition model. Phase on_pole immediately stands out, as it holds the majority share of 60.0% of all frames. Followed 25 3. Pole-Arina: Dataset Figure 3.4: Per-trick class balance. Number of videos containing at least one end-pose for each target trick. by floor with 20.5%, while the trick-specific labels contribute only a small fraction individually, with around 2-4% and a combined value of 19.5%. This shows that the label imbalance is stronger than the number of tricks per video imbalance seen in Figure 3.4. By design, each recording typically shows a single target trick but always includes background, idle, and transition segments. As a result, on_pole and floor dominate frame counts. For context and comparison, Figure 3.6 shows the label distribution from the legacy Protocol A (see Section 3.1.2 for more details on this labeling scheme). Since this scheme yielded two separate sets of labels (trick name, phase), the results are represented as a stacked bar chart. The stacks show the absolute frames per trick label, while the percentages inside indicate the share of each phase label (Start, Transition, End). Aggregated over all tricks, Transition accounts for 63.2% of labeled frames, Start for 16.4%, and End for 20.4%. While not being as noticeable at first glance, the results present a similar class imbalance to the chosen labeling method. Figure 3.6 also reveals how long each trick tends to last in the transition phase. Measured as the share of labeled frames within each trick, Straddle Invert shows the shortest transition (40.4%), whereas Crucifix shows the longest (74.8%). In the end, Protocol B was chosen for training the final model, because it keeps the background explicit and can capture retries and idle time without data trimming, while preserving trick-specific end states for evaluation. 26 3.3. Final Dataset Statistics Figure 3.5: Protocol B. Percentages above bars show the relative contribution of each label to the total of 212,574 labeled frames. At a glance, the label distribution is: • Protocol B: 20.5% floor, 60.0% on_pole, 19.5% *_pose (combined) • Protocol A: 16.4% Start, 63.2% Transition, 20.4% End Label integrity. All labels underwent basic quality checks. First, the label coverage ratio is calculated as the labeled frames divided by the total number of frames. It is important to note that not all frames are labeled, especially long idle segments at the beginning or end of a clip. The ratios are overall high, with 54.9% having perfect coverage (1.0) and 75% over 0.957 (see Figure 3.7). No overlapping segments or internal gaps were detected after final curation. Quality control and label integrity enable consistent alignment between labels and skeleton data. 3.3.2 Video Data & Skeleton Data To preserve privacy, the public dataset contains only skeleton sequences rather than raw videos. The extraction script stores the data as NumPy arrays of shape N × 33 × 4 with MediaPipe’s 33 joints and the channels x, y, z, visibility in normalized image coordinates. Furthermore, an EMA filter slightly smoothed the data, and joints with a visibility lower than 0.3 were treated as missing (see Section 3.2.2). 27 3. Pole-Arina: Dataset Figure 3.6: Protocol A. Bar height shows the absolute number of labeled frames per trick. The phase labels occur in the following stack order: bottom=Start, middle=Transition, top=end (highlighted in a trick-specific color). Percentages inside the bars indicate the relative share of each phase. FPS & resolution. Participants of the data collection process used their smartphones for recording. The videos are predominantly at 30 FPS (∼96%) and a small minority at 60 FPS (∼4%). Resolutions cluster around portrait format with a 16:9 aspect ratio. After normalizing orientation, 29.3% of the clips are exactly 1080x1920, and 69.3% are close to Full High Definition (FHD). Table 3.2 summarizes the most common resolutions in more detail. Generally, these characteristics match a typical home and studio setup and align with the intended use case. Skeleton detection quality. Skeleton reliability was evaluated to confirm that Me- diaPipe consistently tracks across frames. Pose coverage is defined as the fraction of frames with a valid skeleton. Out of 836 clips, 771 (92.2%) achieve a perfect score of 1.0. Only seven videos fall below 0.95, with the minimum at 0.767 for a recording of a Pin-Up. In total, non-perfect coverage occurs in 65 files, most often in more complex tricks such as the Straddle Invert (20 files) and the Wrist Seat (15 files). Due to their simplicity, Layout and Pin-Up remain the most stable, with only a handful of affected videos. Figure 3.8a highlights all non-perfect coverage clips. The individual joint visibility was also evaluated by counting frames where the landmark confidence fell below 0.3. The results show a generally high visibility, with a median of 0.0% for frames below this threshold. Looking at a per-joint visibility heatmap across all frames, Figure 3.8b indicates uniformly high ratios across the body, with slightly lower 28 3.3. Final Dataset Statistics Figure 3.7: Box plot of coverage ratio with most labels achieving near-perfect coverage. Table 3.2: Most common portrait resolutions within the dataset. Resolution Count % of 836 1080×1920 245 29.31% 720×1280 22 2.63% 1010×1796 20 2.39% 1028×1828 19 2.27% 1020×1814 18 2.15% 1018×1810 18 2.15% 478×850 16 1.91% 1034×1838 14 1.67% 1016×1806 14 1.67% 1014×1804 13 1.56% values (0.99-0.95) for the right elbow and hand cluster. This pattern matches recording issues for upright tricks such as Layout and Pin-Up, where the arm extends above the head and may leave the camera view. Overall, skeleton detection offers near-perfect coverage and stable joint visibility, ensuring reliable alignment between labels and skeleton data. 3.3.3 Demographics The data collection aimed to strike a balance that still accurately reflects the student demographics in real-world pole classes. Another priority was to capture broad variations in execution quality so the model learns every deviation from failed to perfect attempts. 29 3. Pole-Arina: Dataset (a) Non-perfect pose coverage per video, grouped by trick. Dotted lines mark thresholds at 0.95 and 0.99. (b) MediaPipe skeletal schematic col- ored by landmark visibility ratio. Figure 3.8: MediaPipe coverage information. Experience balance. The dancers’ experience has a greater impact on variability than age or height. Because the selected tricks range from beginner to intermediate, the dataset deliberately emphasized novice attempts. Ideally, half of all contributors should have limited to no prior pole experience. This selection enhances the diversity of entries, transitions, and final shapes, thereby improving the robustness and generalizability of the recognition model. Therefore, as shown in Figure 3.9, the total number of 58 participants comprised 34 non-dancers (58.6%) and 24 dancers (41.4%). Figure 3.9: Experience balance bars (Non-dancer vs. Dancer). Gender context. Adult female students mostly visit local studios in Austria and likely around the world. Many classes are restricted to women only to maintain a comfortable environment, especially given the low-coverage attire required for a secure grip on the pole. At the same time, pole is a gender-inclusive sport with a growing number of men. The collected data reflects this reality while remaining open to all participants. Gender counts are: Female n = 43(74.1%), Male n = 15(25.9%) (see Figure 3.10). Figure 3.10: Gender balance bars (Female vs. Male). 30 3.3. Final Dataset Statistics Age distribution. Although not as crucial for data diversity, the age distribution, as displayed in Figure 3.11, ranges from 20 to 58 years old, with a median of 28 years. Again, this accurately reflects real-class distributions of pole dancing students. Figure 3.11: Age distribution. 31 3. Pole-Arina: Dataset Table 3.3: Compact summary of the selected tricks; terminology aligned with IPSF and Spin City [Fed25, Cit25]. Thumbnail Trick Description Category / Level Layout 1. Stand on the right side of the pole 2. Pull up, cross legs at ankles 3. Lean back, arch, push hips up Upright Beginner Pin-Up 1. Stand on the right side of the pole 2. Pull up, right toes to left knee 3. Lean slightly back and arch Upright Beginner Wrist Seat 1. Stand on the right side of the pole 2. Pull up, right toes to left knee 3. Place left hand underneath the thigh 4. Lean back, open legs into a V shape Upright Beginner Straddle Invert 1. Stand behind the pole, facing left 2. Stronghold grip, pull up, lean back 3. Open legs into a V shape Inverted Inter. Gemini 1. Start with Straddle Invert 2. Hook the outside leg, other leg down 3. Chest up, arch and release hands Inverted Inter. Crucifix 1. Start with Straddle Invert 2. Place legs into crucifix hold 3. Upper body low, release arms Inverted Inter. 32 CHAPTER 4 A Coaching System for Pole Dancing Technique Pole-Arina addresses a common challenge in pole training: subtle misalignments and unsafe form, which often remain undetected outside guided classes. The thesis delivers a marker-less, video-based coaching application that recognizes the performed trick, identifies its temporal progression, and grades the final pose with transparent, geometry- based feedback. The design prioritizes privacy, interpretability, and practical deployment on user hardware. 4.1 System Overview Pipeline overview. Figure 4.1 summarizes the end-to-end process of the Pole-Arina system. First, the user records themselves performing one pose or a combination of tricks. From a single uploaded RGB clip, the pipeline extracts a sequence of 33×4 MediaPipe landmarks per frame. Next, it applies preprocessing (e.g., smoothing and normalization) before feeding the sequence to the LSTM-based recognition model. The model produces a frame-by-frame timeline of semantic states, such as generic phases or specific trick poses. Detected end-pose frames are then passed to a rule-based engine, which evaluates trick-specific geometric checks and renders visual overlays along with textual cues for feedback. In summary, the model identifies the performed trick, and the rule-based module provides feedback on the pose correctness, indicating where adjustments are needed. The following sections formalize the recognition and scoring components at a high level, while Chapter 5 provides implementation details and Chapter 6 presents the evaluation and results. 33 4. A Coaching System for Pole Dancing Technique Figure 4.1: Pole-Arina end-to-end pipeline. 4.2 Trick Recognition & Pose Analysis Pole-Arina requires a temporal model that turns a video into a sequence of semantic states interpretable for coaching. Given per-frame landmarks, the task is defined as a multi-class classification problem that assigns one label to every frame. In computer vision, this falls under the broad field of action recognition or action segmentation. This section formalizes the frame-wise label space and input representation, then presents a bidirectional LSTM for sequence labeling, including architecture, class-imbalance handling, and the training objective. Next, it describes decoding and post-processing to obtain stable, contiguous end-pose segments. Finally, it introduces the rule-based pose-quality analysis that converts the classifier output into interpretable coaching feedback. 4.2.1 LSTM Model The primary task of the recognition model is to map a time sequence of skeletons to a time sequence of semantic states that support coaching feedback. Each input frame provides a 33 ×4 feature tensor (33 landmarks × (x, y, z, visibility)). The model outputs a label for every frame, covering both generic movement phases and trick-specific end poses. Model decision. A lightweight, bidirectional Long Short-Term Memory (LSTM) model efficiently processes temporal context on modest datasets and hardware [HS97]. On small to medium datasets, enhanced recurrent models can rival or outperform heavier Transformer variants while offering lower latency [EC20]. Due to latency and deployment constraints, the implementation of Pole-Arina favored a Bi-LSTM over larger models. As illustrated in Figure 4.2, one or more LSTM layers, with a moderate number of hidden units, take the normalized skeleton coordinates as input. The following time-distributed fully connected layer produces a classification score for each frame. The final layer is a softmax over the defined label classes, so the network outputs a probability distribution for the frame’s state at each time step. Compared to the original LSTM formulation [HS97] and the classical Bi-LSTM concept [GFS05], the deployed model is explicitly bidirectional with concatenated directions, uses a time-distributed fully connected head rather than a CRF/CTC decoding layer, and includes dropout between recurrent output 34 4.2. Trick Recognition & Pose Analysis and the classification head for regularization. This compact design preserves temporal context while remaining efficient on consumer hardware. Figure 4.2: Bidirectional LSTM architecture, taken from [NJ22]. Label design. The final label set combines generic phases with trick end poses: {floor, on_pole} ∪ {L_pose, P_pose, W_pose, S_pose, G_pose, I_pose}. These end pose labels indicate the frames where the performer holds the final position, which presents the most trick characteristics and will be used for evaluation. To smooth out momentary prediction errors, a short median filter is applied to the model’s per- frame outputs. This removes brief misclassifications before decoding the sequence into contiguous segments for each recognized phase or pose. Handling imbalance. As presented in Chapter 3.3, end pose classes occur far less frequently than the generic phase classes, and some tricks are less represented than others. To address this imbalance, weights are added to the training loss, and data augmentation balances rare tricks. This encourages the network to learn the infrequent classes despite skewed data distribution. The exact weighting and optimization settings are described in Section 5.1.2 and 5.2.2. The network was optimized with standard procedures (using the Adam optimizer and early stopping on a validation set) to ensure good generalization. Decoding & post-processing. The raw output of the LSTM is a per-frame label sequence, which may still contain occasional flicker or brief frame misclassifications within a phase. To obtain stable, meaningful segments, two post-processing steps are applied. First, using the model’s confidence, only results above a specified threshold are accepted. 35 4. A Coaching System for Pole Dancing Technique Second, a median filter over a short window merges isolated misclassified frames into the surrounding classes. After these steps, the timeline is segmented and labeled by phase or end pose. In particular, only sufficiently long end pose segments are kept for analysis to avoid fleeting motions. This decoding process yields a cleaner interpretation for the dancer’s entry into the final pose of a trick. 4.2.2 Pose-Quality Rule Design Once the system has identified an end pose, the next step is to grade the quality of that pose using explicit geometric rules on the skeleton. The module turns joint coordinates into transparent feedback, including pass/fail values, an overall score, and textual suggestions. The design focuses on interpretable checks, like body orientation, limb alignment, and joint proximity. It measures angles or normalizes distances with tolerances specific to the trick. The pipeline uses normalized image coordinates, enforces a visibility threshold, and evaluates all frames of the end phase. This subsection explains the concept of the feedback mapping in support of RQ2. The full implementation and rule design specifics are located in Section 5.3. Post-processing vs. direct pose evaluation. An early design choice was whether to have the model directly learn pose correctness or to evaluate in a post-processing step. One initial consideration included the first LSTM iteration to output a pose-quality score. The implementation included scoring each phase using a predefined error list and matching it with the viewed performance. For each identified error, the score would be reduced by one, going from: perfect → good → ok → fail. However, this approach proved impractical due to the subjective nature of labeling and the limited time for expanding and annotating the training dataset. Instead, the LSTM focuses solely on trick and phase recognition, while the pose quality evaluation was left for the post-processing stage. This separation simplifies model training and makes evaluation criteria more transparent and adjustable, without requiring retraining of the model. Focus on the final pose. As a reminder, each pole trick can be broken down into individual phases. For instance, the mount or entry into the trick, the transition where the dancer is in motion, and the final pose. In practice, the execution quality can be evaluated at each of these steps. Some general pointers include: • Entry: Controlled entry on the pole, including the right muscle engagement without unnecessary jumps and swings. • Transition: Correctness of contact points, technique, and body alignment while moving towards the final position. • End position: Present common trick characteristics, often including pointed toes, specific angles, and body orientation. 36 4.2. Trick Recognition & Pose Analysis Initially, all phases were considered for evaluation, but due to time and data constraints, the solution was optimized based on the end position. This phase holds the richest information about the trick’s execution because errors during transition are usually reflected in the final state. Moreover, the dancer holds the trick statically at the end, which further has a specific presentation angle to the viewers. This focus enables a clear view, less motion blur, and consistent positioning. Defining correctness. A key challenge in designing pose-quality metrics is the absence of a strict “rulebook” for executing each pole trick. Unlike gymnastics or ballet, where technique is codified in detail, pole dancing is guided by general best practices rather than exact prescriptions. Judges and instructors look for proper form, body alignment, and precision, but there is room for personal style. For instance, having fully extended legs and pointed toes is universal for clean lines, but exact angles and distances between joints can vary with individual anatomy. Therefore, the evaluation criteria should be specific, yet flexible to account for differences in body proportions or styles. Additionally, factors such as height, limb length, age, or gender should not result in penalties. Using the normalized skeleton data provided by MediaPipe accounts for scale differences and focuses on the geometry of the pose. Each rule is specified in terms of an angle or a distance between joints, along with a tolerance range that accounts for minor variations. The tolerances ensure that performances are not unfairly penalized, when deviating slightly due to personal style. In summary, correctness is defined by a set of geometrical conditions that reflect good form, with allowances made for normal variability between different performers. Rule families & scoring concept. Each trick holds defining characteristics repre- sented as rules with targets and tolerances. During evaluation, each rule returns a boolean pass/fail for each frame of the end pose segment. The overall pose score is calculated as the fraction of rules passed out of the total rules for that trick. For example, if a particular trick has 5 rules and the dancer satisfies 4 of them, the pose would score 4/5 or 80%. The scoring system defined three geometric rule families to cover a broad range of possible checks: • Body Orientation: orientation of a segment relative to ground. • Limb Alignment: internal joint angles. • Joint Proximity: normalized distances between landmarks. Exact formulas, thresholds, and per-trick targets are specified in Section 5.3. 4.2.3 Feedback & Visualization In addition to the written feedback, Pole-Arina also provides visual feedback in the form of overlays, drawn on top of the evaluated frame. Each rule type maps to a distinct visual primitive: 37 4. A Coaching System for Pole Dancing Technique • Orientation → Arcs: appears as circular arcs around a reference joint to highlight the accepted angle sector. • Alignment → Angles: traditionally represented as the angle between two vectors, connecting the affected joints. • Proximity → Circles: shows a circle centered between two landmarks with a radius equal to the measured distance. This mapping separates the rule concepts effectively, while maintaining consistent feedback across the tricks. Regarding the implementation, the same primitives port cleanly to SVG overlays to enable exploration through interactivity. Figure 4.3 illustrates the three overlays in a simple style on a Pin-Up pose. Together, the rule configuration and its overlays form a transparent scoring system on top of the recognizer. In short, the model identifies what was performed, the scoring system explains why it received this value, and the visuals show where to adjust. The how is directly given by the evaluation message, explaining the needed adjustments. Figure 4.3: Rules mapped to visual overlays on a Pin-Up pose. 4.3 Pole-Arina Evaluation Pole-Arina is evaluated along two paths. First, quantitative tests on a held-out split assess recognizer accuracy and the stability of end-pose detection (RQ1) and validate geometric scoring against trick definitions (RQ2). Second, a controlled user study examines effectiveness and usability in practice, combining objective improvement with subjective measures of trust, clarity, and SUS usability (RQ3). The full methodology and results are presented in Chapter 6. 38 CHAPTER 5 Pole-Arina: Implementation This chapter provides detailed information on the end-to-end implementation of Pole- Arina. Following standard practice in motion pipelines [TP23], data preprocessing stabilizes pose trajectories and handles missing detections. For that, it employs temporal smoothing and gap interpolation before preparing the data into a train/validation/test split. Next, data augmentation balances the dataset and increases the sample size to improve generalization. A bidirectional LSTM is trained to produce per-frame predictions. Section 5.2 documents the architectural evolution and the final formulation with class- weighted cross-entropy. Three distinct rule families turn landmark geometry into pass/fail checks with specified tolerances for each check. Finally, the application prototype wraps the pipeline behind a single analysis endpoint and renders interactive overlays, including scores and improvement tips. 5.1 Data Preprocessing & Augmentation Effective preprocessing is essential to convert raw pose data into reliable inputs. Real- world motion-capture data often contains noise, missing values, and other inconsistencies. If left unaddressed, these issues can lead to training malfunctions and degrade the model’s performance. Preprocessing techniques clean and refine the data by transforming noisy and inconsistent inputs into a clean format suitable for machine/deep learning [OGK+24]. Typical preprocessing steps include: noise reduction, outlier removal, and handling of missing data. Once cleaned, the labels are aligned with the skeleton data and split into train, test, and validation sets. Figure 5.1 presents a visual overview of this pipeline. 39 5. Pole-Arina: Implementation Figure 5.1: Preprocessing pipeline overview. 5.1.1 Data Preprocessing Clean-up. As a reminder, the video data was captured through single RGB cameras, more specifically, the ones integrated in smartphones. MediaPipe Pose [Goo25], a pose estimation framework, extracted the skeleton data, which can introduce jitter and minor errors in the detected joint coordinates. Noisy pose data causes visible flicker effects in skeleton visualization, which may complicate learning but also appear less trustworthy. To address this, the pre-processing pipeline applied temporal smoothing to the sequence of detected keypoints. The result is a calmer, more realistic motion sequence where high-frequency noise is suppressed. Along with noise, missing detections are another issue, which yield zero or null coordinates for all joints. Instead of leaving blank frames, the pipeline performs a simple interpolation to fill the gaps. Therefore, short occlusions or tracking failures do not result in inaccurate labels or fragmented input data. Both of these techniques are thoroughly discussed in Section 3.2.2. However, to make the impact visible and measurable, Figure 5.2 overlays per-frame jitter before and after temporal smoothing on a representative Straddle Invert. Figure 5.2: EMA reduces high-frequency jitter on a representative Straddle Invert clip. Measuring jitter. Per-frame jitter quantifies high-frequency motion in the 2D land- marks. For frame t, we compute the mean Euclidean displacement across N tracked landmarks in normalized image coordinates. The calculation only uses landmarks above a visibility threshold (vi ≥ τ, τ = 0.3). The first frame is excluded since J1 is undefined. Let T be the number of frames in a clip and let pi,t = (xi,t, yi,t) denote the 2D (normalized) image coordinates of landmark i at frame t. 40 5.1. Data Preprocessing & Augmentation Per-frame jitter is computed as: Jt = 1 N N∑︂ i=1 ⃦⃦ pi,t − pi,t−1 ⃦⃦ 2, pi,t = (xi,t, yi,t). The series {Jt} is collapsed to a single, comparable number. The arithmetic mean is calculated over frames for both raw and EMA-smoothed landmarks using the sequence- level averages: J raw = 1 T − 1 T∑︂ t=2 J raw t , J smooth = 1 T − 1 T∑︂ t=2 J smooth t . The percentage reduction reports how much the EMA suppresses jitter: Reduction = (︄ 1 − J smooth J raw )︄ × 100. At the dataset level, Figure 5.3 summarizes the percentage reduction in jitter achieved by the EMA filter and grouped by trick. As expected, more complex or inverted poses exhibit higher raw jitter and therefore show larger reductions in jitter, compared to simpler tricks such as the Layout. Together, the plots illustrate that the smoother input sequences are produced by the preprocessing pipeline by removing high-frequency noise. Figure 5.3: Dataset-level jitter reduction by trick. 41 5. Pole-Arina: Implementation Label alignment & segmentation. Once the skeleton data is verified and cleaned, the data gets aligned with the ground-truth phase annotations from the dataset. Each video was labeled in segments, providing the start and end frames of each trick phase (e.g., floor, on_pole, L_pose, etc.). The preprocessing script takes those segments to match them with the continuous skeleton sequence and assign a class label to each frame. First, the label CSV provides all annotations for a given video and is sorted by their start frame. Next, the skeleton data is sliced to span from the start to the end frame of each labeled phase. Previously, the data was tested for any gaps or overlapping labels. This alignment process ensures each skeleton frame is paired with the correct target label. The resulting collection of synchronized data-label pairs is a matrix of shape frames×joints×coordinates, accompanied by a matching sequence of class labels for each frame. Dataset split. The final step divides the processed dataset into separate subsets for training, validation, and testing. After aggregating all labeled sequences, the set is randomly split into three parts: • Train: ∼70% • Validation: ∼15% • Test: ∼15% Notably, the dataset separation does not split per-frame but per-video. Therefore, the model will always see a full sequence without interruption. The test set provides hold-out data for final evaluation, while the validation split finetunes the model. Finally, the cleaned, aligned, and augmented (as discussed next) data is prepared and ready for the model. In summary, smoothing and interpolation enhance the data quality by reducing noise and filling gaps. It aligns each sequence with the ground-truth labels and splits the data for the training of the model. This enables the model to learn from high-quality inputs and correct targets, which is foundational to achieving proper output. 5.1.2 Data Augmentation After preprocessing, the next step deliberately expands the dataset by introducing diversity through data augmentation. It is a standard concept where new synthetic training examples are derived from existing ones by applying various transformations. The goal is to introduce controlled variability by increasing the dataset size, thereby reducing overfitting and improving the model’s ability to generalize, especially when the available data is limited. Data augmentation techniques can range from basic image manipulation to deep learning approaches by generating new data samples through Generative Adversarial Networks (GANs) [SK19] or other models. In this thesis, the applied techniques are categorized into spatial augmentations and temporal augmentations. 42 5.1. Data Preprocessing & Augmentation The first modifies the geometric attributes of the pose, while the latter alters the time dimension of the sequence. Both types aim to simulate realistic variations, including diverse viewing angles, varying execution speeds, and minor sensor noise. Augmentation techniques. Three methods were implemented and applied to the skeleton sequence in random combinations: 1. Gaussian noise : adding small Gaussian noise to joint coordinates. 2. Random rotation: rotating the 2D pose trajectory by a few degrees. 3. Time warping: randomly stretching or compressing the temporal duration. These methods produce realistic variations for human pose sequences and align with known related work methods [XKCP24]. Gaussian noise. A slight Gaussian noise is added independently to each joint’s coordinates. Therefore, each coordinate is perturbed by a tiny random offset. This effect simulates minor positional errors or measurement noise that naturally occurs in pose estimation. By training the model on noisy skeletons, the model becomes more robust to jitter while maintaining focus on the overall pose, rather than on landmark positions ([XKCP24]). Random Rotation. A rotation of between -5◦ and +5◦ is applied to the whole skeleton in 2D. This spatial transformation changes the global orientation of the skeleton but preserves the relative pose. The aim is to make the model invariant to viewpoint or orientation changes. In practice, this happens frequently, as dancers might not pay attention to the horizontal alignment of the camera view. The rotation range is limited to small angles, as tricks are often characterized by body orientation which is utilized for evaluation. Figure 5.4 displays such constraints for split-style shapes. The different presentations impose distinct geometric targets. Augmentation, therefore, restricts rotation to small angles to preserve trick-defining orientation cues. Figure 5.4: Split-style shapes: horizontal, diagonal, and vertical presentations. 43 5. Pole-Arina: Implementation Time Warping. This augmentation alters the speed of the motion sequence. The data can either be randomly stretched or squeezed (between a factor of 0.8 and 1.2) before getting resampled to the original number of frames. Therefore, some frames are either skipped or interpolated while the original sequence length remains to match the existing labels. Training on time-warped data enables the model to learn variations in execution speed and temporal dynamics. The augmentation can be applied individually or in combination. For the final data set, multiple augmented versions were generated by randomly selecting one, two or all three of the transformations. By augmenting each original sequence into several variants, the size of the training set was effectively multiplied, while introducing a rich variety of poses. Each clip was multiplied by three random augmentations, except for the Gemini trick, which received five per video. Additionally, it addressed class imbalance by generating more samples for underrepresented tricks such as the Gemini. 5.2 Recognition Model (LSTM) This section covers the recognition model, which takes skeleton data as input and outputs semantic states for feedback and scoring. First, Subsection 5.2.1 LSTM Iterations outlines the architectural progression, from a real-time system to a multitask prediction model, and finally to an eight-class single-head solution. Each step reveals different strengths and limitations through experiments, which lead to the development of the final model. Subsection 5.2.2 Final Single-Output LSTM specifies the LSTM architecture used for the Pole-Arina implementation. Task Definition The output evolved with the project goals: 1. {Start, Transition, End}×{fail, ok, good, perfect} for real-time grading; 2. {Start, Transition, End}×{L, P, W, S, G, I} for automatic trick identification; 3. {floor, on_pole} ∪ {L_pose, P_pose, W_pose, S_pose, G_pose, I_pose} to enable multi-trick detection. Several constraints influence the model choice and labels: privacy (skeletons only), data scale, single-camera capture, and class imbalance. Success criteria focus on accurate end-pose detection, reliable segmentation across multiple tricks in one clip, and controlled errors concentrated in transitional frames rather than in the final poses. 5.2.1 LSTM Iterations This section presents the explored model iterations before settling on the final design. All iterations run in PyTorch and consume skeleton vectors of size 33 ×4 (33 landmarks × (x, y, z, visibility)). They also share the same backbone: a bidirectional LSTM with two 44 5.2. Recognition Model (LSTM) layers, a hidden size of 64, and a dropout rate of 0.2. Furthermore, they output per-frame logits and employ an Adam optimizer with a learning-rate scheduler. All models were trained with Google Colab on an NVIDIA T4 GPU. Initial Multi-Task LSTM (phase + score) The initial model was an LSTM-based network with several training enhancement configurations and a double-head output layer. Key features of this model included: • Learning-rate scheduler: A dynamic learning-rate scheduler adjusted the step size during training, to sustain steady convergence as training progressed. • Dropout regularization: A dropout layer tackles overfitting by randomly omitting units during training, encouraging the model to generalize better. • Bidirectional: The activated bidirectional LSTM layer enables the model to process the sequence in both forward and reverse time directions. Considering past and future context, the model can capture richer dependencies in the sequence. • Multi-task outputs: The LSTM was expanded into a multi-task model that computes predictions for two separate output heads: first, a performance score {perfect, good, ok, fail} and second, the current temporal phase of the trick {Start, Transition, End}. This design allowed the network to learn both what the dancer was doing and how well they did it in parallel. Multi-task learning can improve the generalization by leveraging a shared representation for related tasks. The labeled dataset is closely related to the one introduced by Protocol A (see Section 3.1.3) but holds an additional label for the score. Each phase segment was annotated with a grade (perfect, good, ok, fail) based on the number of execution mistakes. Standard errors were identified in advance, and each match would result in a one-level downgrade of the score. For instance, if two mistakes were identified during the transition phase, the score would be labeled as ok instead of perfect. This first concept was inspired by dance video games (e.g. Just Dance [Ubi09]), which provide immediate and similar feedback to players during performance. The initial multi-task LSTM performed well, achieving a test accuracy of 91.58% on the score prediction task and 96.08% on the phase classification. While the real-time model was conceptually appealing for regular dancing, it had some practical issues, especially for pole dancing. First, it required the user to pre-select which trick they were going to perform, limiting the system’s autonomy. Second, delivering real-time feedback to a pole dancer has some usability issues, as dancers often hang upside-down on the pole without the ability to look at their phones. An augmented-reality mirror could overlay feedback visually, but such hardware is not commonly available to users and would complicate the system. Considering these drawbacks, instead of a "real-time coach", the system would focus on recognizing and identifying the trick and current progression (phase) and transfer the grading to a post-analysis system. 45 5. Pole-Arina: Implementation Phase and trick detection LSTM. The next iteration removed the scoring component to focus on the recognition task. The two prediction heads were reconfigured first to classify the current phase {Start, Transition, End} as before, and second, to identify the performed trick at each frame. The user no longer needed to tell the system which trick would be performed, as the LSTM would automatically detect it from the motion sequence. Furthermore, this opened the possibility to handle multiple different tricks in one session, which is standard practice in training where dancers train combinations of moves in succession. This implementation fully realized the concept of the labeling Protocol A, with each frame carrying a phase and trick label. Again, the model learned this in parallel and continued to utilize the previously introduced features (bidirectional, dropout, learning-rate scheduler, etc.) to maintain the already satisfying performance. Trained on the full dataset, this model achieved a test accuracy of 96.83% for phase classification and 99.95% for trick classification. These results were auspicious for building the final Pole-Arina system. Despite its excellent performance on single-trick videos, this architecture revealed some limitations when tested for multi-trick detection within a single recording. The issue was caused by the labeling scheme and how the LSTM learned temporal patterns. First, the initial training data assumed a strict phase order per video: start on the floor, transition on the pole, and end in the final pose. Next, if the dancer performed a second trick in the same recording, the sequence of labels might repeat, but the dancer would first need to get down from the pole. Effectively, this approach did not provide the required flexibility to detect multiple tricks in a single video. To address this, a final strategy was implemented to enhance the labeling strategy and increase the model’s flexibility, which is discussed in section 5.2.2. Evaluation protocol. To verify that each model learned a meaningful representation and generalizes beyond its training split, every iteration underwent the same evaluation: • Training-Validation plots: to compare training and validation curves. If the training performance drastically exceeds validation performance, the model might overfit, resulting in poor generalization. • Confusion matrices: to reveal if certain classes are more frequently confused. • Per-frame timelines: to inspect temporal consistency on held-out clips. Figure 5.5 illustrates a concise panel summarizing these checks for the second model iteration. This protocol documents the training progression and guides changes between iterations. 46 5.2. Recognition Model (LSTM) Figure 5.5: Evaluation diagnostics for the second model iteration, serving as an overview. 5.2.2 Final Single-Output LSTM For the final refinement, the previous problem was reformulated into a single-head classification task with a mix of trick-specific and generic labels. A semi-automated re-labeling script, enforced by Protocol B (see Section 3.1.3), transformed the data to support multiple trick detection per video. Instead of using two separate outputs for phase and trick, the labels were combined into a single label set that encoded both the phase and trick. Generic phase labels covered the common positions at the start, middle, and idle time of any trick, while specific end-pose labels represented the final trick progression. Improvements & challenges. This combination simplified the model output to a single classification at each frame, realistically capturing the structure of pole dance training videos. It further leverages the insight that early phases are similar across tricks, with no immediate need to classify the trick until a distinctive position is reached. However, this adaptation introduced significant class imbalance in the training data. While every attempt at a trick will contain generic labels, end-pose labels appear only when the particular trick is performed. Additionally, more challenging tricks, such as the Gemini, are underrepresented compared to beginner tricks. This imbalance is prone to introducing a data bias towards frequent classes to the classifier, while struggling to recognize rare poses. A countermeasure introduced weights to the model’s loss function. The model training applied weighted cross-entropy by assigning higher weights to the 47 5. Pole-Arina: Implementation minority classes, to enforce penalties for errors on the rare classes. This is a standard technique to improve learning for imbalanced data, as it encourages the model to devote more capacity to learning those classes [ADAdCB19]. Class-weighted cross-entropy. Let K be the number of classes, zt ∈ RK the logits at frame t, and yt ∈ {0, . . . , K − 1} the ground-truth label. Let T = { t | yt ̸= −100 } be the set of valid frames (ignore_index= −100 in PyTorch). The loss is L = 1 |T | ∑︂ t∈T wyt (︄ − log exp(zt,yt)∑︁K−1 c=0 exp(zt,c) )︄ . (5.1) Class weights are inversely proportional to the square root of the empirical frame counts nc, normalized to keep the average weight at 1: wc = (nc + ε)−1/2 1 K ∑︁K−1 j=0 (nj + ε)−1/2 , ε = 10−8. (5.2) This schedule reduces the dominance of frequent classes without the instability of strict inverse-frequency weighting. Rare end poses receive a stronger and more stable learning signal while the overall loss scale remains comparable. Evaluation & fine-tuning. The following protocol and visualizations aim to verify the model’s accuracy, frame-wise phase/trick recognition, reliable end pose identifica- tion, and strong generalization while handling class imbalance. The training kept the checkpoint with the lowest validation loss, while the best score was at epoch 29 with val_loss=0.121 and val_acc=93.83%. The test accuracy yields 93.82%, closely matching the validation accuracy and indicating good generalization ability. While fine-tuning the hyperparameters, different learning-rates were tested, as shown in Figure 5.6. This comparison identifies 5×10−4 as the most reliable setting as it achieves the highest validation accuracy with stable late-epoch behavior. Figure 5.6: Learning rate sweep: over 30 epochs for different learning rates. A confusion matrix helps to identify common misclassifications for each class. It counts how often each true label appears on the diagonal (correct) and how often it is predicted 48 5.2. Recognition Model (LSTM) as another label off the diagonal (errors). Therefore, a deeply colored diagonal suggests accurate classifications. The coloring in Figure 5.7 suggests frequent errors for all classes except for floor and on_pole. However, this is mainly due to its high representation in the dataset, as on_pole holds the most mislabeled entries. This aligns with the data, as transitions morph directly into the final position, so border frames near the end can already resemble the end pose. Per-class recall (see Figure 5.7) confirms this pattern: floor 97.8%, on_pole 91.1%, and all end poses between 98.1-99.4%. Trick-only accuracy reaches 98.74%, confirming dependable recognition once a trick sequence begins. Overall, the errors concentrate in transitional on_pole frames, while the crucial final poses are detected reliably. Figure 5.7: Left: per-class recall on the test set. Right: confusion matrix on the test set. Multi-trick detection. A comparison on a challenging test case reveals the effective- ness of this solution. The test includes a single video in which a pole dancer performs nine tricks in succession. Before the weights were applied, the LSTM correctly detected 6 out of 8 tricks, with both missed tricks being instances of the Gemini. The improved version correctly recognized all 8 tricks in the sequence. As a result, the final model successfully overcame this challenge, reliably detecting all tricks, including those with fewer samples. Figure 5.8: Successful detection of 8/8 tricks in one video. 49 5. Pole-Arina: Implementation Together, these results indicate that the single-head formulation, the chosen hyperparam- eters, and the class-weighting scheme deliver accurate frame-wise phase recognition and robust trick identification without resorting to a more complex architecture. Table 5.1 provides a final overview of the LSTM iterations. Table 5.1: Bidirectional LSTM, 2 layers, hidden size 64, and dropout rate 0.2. Iteration Label set Settings Data Test acc. Two-head: phase+score {Start, Transition, End}/ {fail, ok, good, perfect} 2× cross-entropy; no class weights; no decoding 50 (Pin- Up) P: 96.1% S: 91.6% Two-head: phase+trick (Protocol A) {Start, Transition, End}/ {L, P, W, S, G, I} 2× cross-entropy; no class weights; no decoding 510 (Mixed) P: 96.8% T: 99.9% Single-head: (Protocol B) {floor, on_pole} ∪ {L_pose, P_pose, W_pose, S_pose, G_pose, I_pose} cross-entropy with class weights; median filter 836 (Final) 93.82% multi- trick Training data across iterations. The metrics between model iterations are not strictly comparable, as the training samples grew during data collection. To provide a brief overview, the introduced LSTM models were trained on the following subsets: 1. Phase/Score LSTM: trained on 50 Pin-Up data samples only. 2. Phase/Trick LSTM: trained on 510 mixed videos covering all tricks, with ∼3 random data augmentations per video. 3. Final single-head LSTM: trained on complete dataset of 836 videos with ∼3-5 augmentations per clip, weighted by trick frequency (e.g., more augmentation for Gemini, fewer for Layout). 5.3 Geometric Scoring System An initial idea was to use a data-driven approach, such as an autoencoder [TBL18], to learn ideal poses and quantify deviations from the ideal pose. Out-of-place body parts could be identified by analyzing the reconstruction error. However, this implementation requires a large set of “perfect” trick examples, while still being susceptible to bias. Instead, a geometric rule-based approach is realized, as this thesis already provides a working recognition model with a well-labeled dataset. This direct and interpretable solution circumvents the black-box nature of Deep Learning models and the potential bias inherent in subjective labeling. This approach aligns with how instructors critique 50 5.3. Geometric Scoring System form via visual checks of angles and body alignment. Explicit checks ensure consistency and address the known issue that human judgment can be biased and inconsistent in dance evaluation, underscoring the need for objective measures [Qu24]. Rule design. Specific characteristics enable viewers to identify a performed trick. Instructors, therefore, establish those features as guidance to achieve this trick. Based on this observation, the evaluation set consists of five to seven geometric rules capturing its most characteristic requirements. The following categories group the rules as follows: • Body Orientation: requires the dancer to orient the body at a particular angle relative to the pole or ground. For example, in a Layout, the upper body should be angled towards the floor, whereas in the Crucifix, the entire body should be upside-down. • Limb Alignment: defines how straight or aligned a specific body part is. Joint angle calculations at elbows, knees, and other joints verify full extension or specific angles. For instance, straight legs are required in the Layout, Wrist Seat, or straight arms for the Straddle Invert and Crucifix. • Joint Proximity: ensures correct distances between parts of the body. For example, in a Layout, the ankles should be closed, or in a Pin-Up, the right toes should be close to the left knee. Each rule is defined by a target value and an accepted tolerance. During evaluation, the rules are applied to each frame of the end position to compute all the defined angles and distances. If a required joint holds a low confidence and is therefore not visible, the rule is marked and does not pass due to insufficient data. The evaluation script outputs a list of results, including a boolean for pass/fail, the measure value, and a short description. Geometric rule definitions. Let pi = (xi, yi, zi, vi) denote MediaPipe landmarks in normalized image coordinates, where x, y ∈ [0, 1] and v is the visibility score. A visibility threshold min(vi) ≥ τ (with τ=0.75) must pass for the joints to be evaluated. Body orientation. For an oriented segment B → A, define θorient(A, B) = atan2(yB − yA, xA − xB) · 180 π , A rule passes if |θorient − θ⋆| ≤ Δ, where θ⋆ is the target (e.g., 180◦ for straight) and Δ is the tolerance. 51 5. Pole-Arina: Implementation Limb alignment. Given points A, B, C, define u = a − b, w = c − b (2D). The internal angle is θalign(A, B, C) = arccos (︃ u · w ∥u∥ ∥w∥ + ε )︃ · 180 π , ε = 10−8. Again, a rule passes if |θalign − θ⋆| ≤ Δ. Joint Proximity. For points A, B, the normalized 2D distance is d(A, B) = √︂ (xA − xB)2 + (yA − yB)2. A rule passes if d(A, B) ≤ Δ (e.g., 0.05). The end pose score is computed as the fraction of rules passed, score = #passed #evaluable . The system reports per-rule messages with measured values and targets for interpretability. Figure 5.9: Examples of passed and failed rules displayed through interactive overlays. Top: failed, bottom: passed. 52 5.4. Pole-Arina: Application 5.4 Pole-Arina: Application This section presents the final prototype to realize Pole-Arina and enable user interaction. It is implemented as a web application to support heterogeneous hardware, enable fast interactive overlays, and centralize computational workloads on a server, while still running on consumer laptops and smartphones. The frontend (React/Next.js) runs in the browser, manages uploads, and renders interactive overlays. The backend (FastAPI + PyTorch) exposes a single analysis endpoint that extracts skeletal data, performs sequence classification with the Bi-LSTM, evaluates end poses with geometric rules, and returns a structured result. Videos are uploaded through the website’s file picker (drag-and-drop supported). In the user study, recordings were made on a smartphone and then uploaded via the laptop browser. The full source code can be found here: Pole-Arina. 5.4.1 Backend This section describes the server-side components to transform an uploaded video into trick phases and interpretable pose-quality feedback. The backend exposes a single analysis endpoint, runs the recognition pipeline, evaluates all end-pose frames, and returns a structured result. API endpoint. The service employs FastAPI (api.py) with one primary route: • POST /analyze • Input: a single .mp4/.mov uploaded via the web interface (desktop or mobile browser).. • Optional query field: confidence threshold for phase recognition and median filter kernel size. • Output: JSON structure with detected tricks, per-trick feedback items, per-trick end-pose frames, and processing metadata. At a high level, the route: 1. stores the upload in a temporary path, 2. calls evaluate_video() from pole_arina.py, 3. collects end-pose sequences, feedback, end-pose frames, and normalized skeleton sequences, 4. and serializes results to JSON. This compact interface keeps the client simple and reduces integration effort in the frontend. 53 5. Pole-Arina: Implementation Trick & phase recognition module. The recognition pipeline is implemented in pole_arina.py and follows a six-step formula designed for clarity and robustness: 1. Skeleton extraction. The system reads frames via OpenCV and extracts 33x4 landmarks per frame using MediaPipe. It keeps both normalized image coordinates for model input and world/absolute coordinates available for display. To maintain consistency with the training distribution, the backend applies the same Exponential Moving Average (EMA) and interpolation to MediaPipe landmarks prior to classification. 2. Input normalization. Landmark sequences are stored as a tensor of shape (T,33,4), where T is the number of frames in the clip, and normalized consistently with the training setup. This preserves privacy and yields a compact representation for inference. 3. Model loading. The backend loads the trained lightweight bidirectional LSTM from a checkpoint and either selects a GPU if available or CPU otherwise. 4. Per-frame inference. The model outputs per-frame logits over the single-head output layer, while a softmax converts them to probabilities. 5. Temporal smoothing. A median filter with a configurable kernel (default 7) suppresses brief misclassifica- tions from being detected as a fully realized end pose. 6. Decoding. The sequence decodes into contiguous runs with labels, start/end frame, and a confidence summary. The pipeline only passes stable end poses (confidence over 0.75) to the evaluation module. The model returns the full timeline and skeleton sequences to the frontend to avoid re-running inference. Feedback module. Pose scoring is implemented in trick_evaluator.py as a transparent rule engine over MediaPipe joints. The rules mirror the specification in Section 5.3 and cover three geometric evaluations, each mapping directly to a distinct overlay technique: • Body Orientation→Arc: body angle relative to a horizontal line. • Limb Alignment→Angle: internal angle at a joint, e.g., knee or elbow. • Joint Proximity→Proximity: normalized 2D distance between landmarks. 54 5.4. Pole-Arina: Application For each detected end-pose frame, the evaluator: • checks trick-specific rule configurations, • applies a visibility threshold, • computes pass/fail per visible rule given a target and tolerance, • aggregates a pose score as the fraction of passed rules over all evaluable rules. This ensures interpretability, as every score is derived from a readable checklist. By combining these modules, the backend provides a compact and reliable analysis service for trick recognition and evaluation. A single /analyze call turns a raw clip into a trick timeline and an interpretable feedback list. 5.4.2 Frontend Design & technology. The interface aims to be intuitive, straightforward, and coach- like. Guided interaction minimizes friction for non-technical users and keeps the focus on the trick analysis. An early Gradio prototype validated the pipeline workflow, but limited interactivity led to a full React/Next.js implementation. The fully realized prototype is a lightweight single-page web application with Material UI (MUI) for accessible, consistent components and D3.js for interactive overlays. User interface. A stepper UI organizes the workflow into four screens: 1. Upload: initial video upload and analysis. 2. Summary: overview of all detected tricks. 3. Detail: interactive single-trick view with detailed feedback and overlays. 4. Dashboard: session summary for all detected tricks and scores. 55 5. Pole-Arina: Implementation Upload. A single video upload triggers model analysis, starts a new training session, and resets previous results. First, the user selects a single trick video and hits "analyze" to send it to the backend. While waiting for a response, the UI displays a loading progress bar. After a few seconds, the frontend receives the results and the user advances to the summary view. Each initial upload resets the frontend and starts a new training session. Figure 5.10: Upload & analyze: single-video upload starts a new session. 56 5.4. Pole-Arina: Application Summary. This tab provides an overview of all detected tricks in the current training session. Analyzed tricks appear as cards grouped by trick. Each card shows the trick name, the middle frame of the detected end sequence (thumbnail), and the performance score. A floating “+” button at the bottom right corner allows users to add further attempts to the current session without clearing prior results. Videos of the same trick are displayed in chronological order with a simple indicator for improvement, no change, or decline between the cards. Clicking a card lets the user advance to the detailed view of the selected trick. Figure 5.11: Summary: trick cards with thumbnails, scores, and an add-video button. 57 5. Pole-Arina: Implementation Detail. The selected trick opens a control panel. First, the trick evaluation returns two scores: the visibility score and the performance score. The first score is defined as the percentage of evaluable rules, while the second calculates the percentage of passed rules out of all evaluable rules. The left pane displays the current frame, initially chosen based on highest visibility and then highest performance. The right pane offers overlay toggles by rule family (orientation, alignment, proximity) and a rule list. Each rule item includes a status icon (pass, fail, or not visible), a rule description, an improvement tip (if failed), a score with a target range, and an overlay toggle button. By default, the panel activates all overlays for failed rules, but provides global control through the top buttons or single control for each rule item. Two MUI rating components display each score, including the exact value. A frame slider at the bottom enables manual frame selection to explore changes over time. Figure 5.12: Detail: frame viewer with overlay controls and per-rule feedback. 58 5.4. Pole-Arina: Application Dashboard. The session dashboard summarizes the training progress across all tricks. It displays the average performance and visibility scores, and the number of performed tricks. An interactive line chart plots score over attempts per trick, with a thumbnail display at the side. The display (see Figure 5.14) can either show the best, worst, or a selected attempt through the line chart, for direct visual comparison. Figure 5.13: Dashboard: session-level statistics. Figure 5.14: Best/Worst display example. 59 5. Pole-Arina: Implementation To summarize, the frontend guides the user through a simple, coach-like flow: upload a clip, review a trick gallery, inspect per-rule feedback with overlays, and reflect on progress in the dashboard. Together with the backend, the interface turns model outputs into actionable tips and session-level insights, enabling the user-study assessment that answers RQ3. Table 5.2 lists the complete rule catalog used by the evaluator across all six tricks, including targets and tolerances. Table 5.2: Complete rule catalog across all tricks. Trick Rule Type Joints (idx) Target Tol. Layout Lean back Orientation (11, 23) 22.5◦ ±22.5◦ Legs down, hips up Orientation (27, 23) −155◦ ±20◦ Straight bottom leg Alignment (23, 25, 27) 180◦ ±20◦ Straight top leg Alignment (24, 26, 28) 180◦ ±20◦ Cross at ankles Proximity (28, 27) 0 ≤ 0.05 Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ Wrist Seat Straight left leg Alignment (23, 25, 27) 180◦ ±20◦ Straight right leg Alignment (24, 26, 28) 180◦ ±20◦ Lean back Orientation (11, 23) 20◦ ±20◦ Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ Pin-Up Lean slightly back Orientation (11, 23) 45◦ ±15◦ Straight leg down Orientation (27, 23) −135◦ ±10◦ Toe to knee Proximity (32, 25) 0 ≤ 0.10 Top leg into passé Alignment (23, 25, 27) 180◦ ±20◦ Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ Straddle Invert Push hips up Orientation (23, 11) 112.5◦ ±22.5◦ Lean back, straight arms Alignment (11, 13, 15) 160◦ ±20◦ Straight left leg Alignment (23, 25, 27) 180◦ ±15◦ Straight right leg Alignment (24, 26, 28) 180◦ ±15◦ Head-back tilt Alignment (7, 11, 23) 160◦ ±20◦ Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ Gemini Push hips up Orientation (23, 11) 112.5◦ ±22.5◦ Back-leg straight Alignment (23, 25, 27) 160◦ ±20◦ Back-leg horizontal Orientation (27, 23) 180◦ ±15◦ Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ Crucifix Left-arm straight Alignment (11, 13, 15) 180◦ ±20◦ Right-arm straight Alignment (12, 14, 16) 180◦ ±20◦ Cross at ankles Proximity (28, 27) 0 ≤ 0.05 Body upside-down Orientation (0, 27) −90◦ ±15◦ Right-foot point Alignment (26, 28, 32) 165◦ ±15◦ Left-foot point Alignment (25, 27, 31) 165◦ ±15◦ 60 CHAPTER 6 Evaluation & Results This chapter presents the evaluation strategy and outcomes for the Pole-Arina system. The evaluation encompasses both a quantitative assessment of the pose recognizer and scoring model, as well as a controlled user study to examine the system’s effectiveness and usability in practice. Quantitative tests on a held-out dataset validate the recognizer (RQ1) and the geometric scoring (RQ2). A controlled user study assesses effectiveness and usability in practice (RQ3). The study evaluates the system in real-world training scenarios, comparing AI-assisted feedback with traditional video self-review in terms of user trust, improvement efficiency, feedback understandability, and overall usability. 6.1 Quantitative Model Performance The recognizer was evaluated on a held-out test split to estimate generalization. The protocol reported: • Per-frame accuracy on the full label set and trick-only accuracy on end-pose classes (RQ1). • Per-class precision/recall and confusion matrices to reveal systematic mis- classifications. • Temporal stability via post-processing with a fixed confidence detection threshold and median kernel chosen on validation data. • Multi-trick robustness on sequences containing several tricks in succession. The classifier achieved a per-frame accuracy of 93.82% across all classes and a trick-only accuracy of 98.74% when considering only the final trick poses. Per-class precision and recall were analyzed, and a confusion matrix revealed that the most common trick 61 6. Evaluation & Results misclassification occurred for the on_pole label. Between tricks, the two visually similar tricks Layout and Pin-Up proved most confusing. The system was also tested on video sequences containing multiple tricks in succession, and it demonstrated robust performance by correctly segmenting and recognizing each trick in order. Overall, these results indicate that the recognizer provides accurate and stable identification of pole tricks, forming a solid foundation for the feedback mechanism. 6.2 User Study Design This study’s hypotheses evaluate whether Pole-Arina’s feedback would be trusted and understood, whether it would improve form efficiently, and whether it would be rated as usable. To answer RQ3, a controlled between-groups experiment compared Pole-Arina feedback against traditional self-review. Participants practiced a single preselected pole trick for five trials. The Experimental condition used Pole-Arina, while the Control condition used a standard video with self-assessment. Measurement combined per- trial Likert items, a post-session questionnaire, the System Usability Scale (SUS), and open-ended questions for qualitative insights. 6.2.1 User Study Methodology Evaluation goals. To evaluate the system’s performance in real training scenarios, the user study verifies the following hypotheses: • H1: Trust & Adoption: Participants in the Experimental condition will report higher trust in the accuracy of feedback and greater confidence about what to improve next than those in the Control condition. • H2: Efficiency: Participants in the Experimental condition will show greater improvement across five trials than those in the Control condition. • H3: Understandability: Participants in the Experimental condition will rate the clarity and helpfulness of feedback higher than those in the Control condition. • H4: Usability: The Experimental condition will receive a higher System Usability Scale (SUS) score than the Control condition. The chosen criteria align with the project’s research questions while testing the technol- ogy’s acceptance. Factors related to trust also include: demographic variables, privacy protection, robustness, transparency, and performance [LWHL24], which are all consid- ered throughout this thesis. In particular, fostering user trust is crucial, as prior studies have indicated the importance of this criterion for the effective utilization of AI systems [LWHL24]. This study’s hypotheses evaluate if Pole-Arina’s feedback would be trusted, understood, improve their form efficiently, and be rated as a usable system. 62 6.2. User Study Design Study design. The experiment targeted N≈20-30 participants, ranging from non- dancers to advanced pole dancers. The user study followed a between-groups design. The study randomly assigned one of two conditions to each participant and completed all trials under that feedback method. The options are: • Control: using traditional self-review through video recording and post hoc replay. • Experimental: using the Pole-Arina application for deep-learning-based evaluation and feedback about pose correctness. To minimize expectancy effects, participants remained unaware of the alternative condition and study aims until the debriefing. The study assigned a trick based on the dancer’s experience. Beginners would practice the Layout while intermediate and advanced levels performed the Pin-Up. The two tricks share similar geometric structure and hold the same difficulty, yet the Pin-Up enforces stricter pose rules. They also formed the most frequent confusion trick pair in the recognition model (see Figure 5.7). Therefore, the feedback rules would be tested on similar form requirements, additionally challenging the model’s recognition abilities. User study protocol. All sessions took place in the same pole studio to maintain consistency. A smartphone camera recorded the participants at a fixed position and angle. For better accessibility, a two-screened laptop station hosted Pole-Arina and served as a reviewing platform for each trick attempt. Therefore, between each review, the recording was transferred to the laptop via Apple AirDrop. Each participant booked a one-hour time slot, which was typically used for 45 minutes. An online form guided the user through each step with follow-up questions. The study protocol was structured as follows, with identical flow for both the Control and Experimental conditions: 1. Orientation & Consent: Participants were briefed on the study goals and what to expect, including a caution about physical strain and possible bruises (also known as pole kisses). After participants signed the informed consent, the first section of the online form gathered demographic data. This established the context of each user and ensured a mix of backgrounds. 2. Task Assignment & Demonstration: The condition (Control or Experimental) was assigned at random while keeping it balanced throughout the study and the dancers’ pole experience. As the setting simulated pole practice at home (or at least without an instructor), the participant was shown a demonstration video, instead of a real-life tutorial. Additionally, they received a walkthrough of the selected reviewing method. Both groups were allowed to watch the demonstration video unrestricted. A free attempt was allowed to get familiar with the trick and setup. 3. 5-Trial Loops with Feedback: The core of the study is executed through a 5-trial practice loop. The participant attempted the assigned trick five times, aiming 63 6. Evaluation & Results to improve with each repetition. After each attempt, the participant reviewed their performance. Either they watched their video playback or examined the feedback provided by Pole-Arina. After each review, the participant filled out a per-trial survey. The form encouraged them to reflect on their performance, identify mistakes, and adjust the trick accordingly. 4. Post-Session Questionnaire: After completing all trials, the participant filled out a post-study survey about the employed feedback method. It includes Likert-scale statements that directly address the evaluation hypotheses. Additional open-ended questions allowed for qualitative feedback, and a standard ten-item System Usability Scale (SUS) was used to evaluate overall usability. The SUS is a widely used survey that provides a global measure for usability, through ten Likert items [B+96]. All questions were answered with respect to the assigned condition, to allow baseline comparison of traditional versus AI-assisted review methods. 5. Pole-Arina specific Feedback: After the main evaluation, all participants, regardless of their group, were offered a chance to try out Pole-Arina. Therefore, everyone had the opportunity to experience the AI-coaching system and fill out a final questionnaire to provide qualitative feedback. 6. Debrief: The study concluded with a debrief, during which the purpose and conditions were thoroughly explained, and participants were allowed to ask questions and share additional remarks. Evaluating the results. By design, the user study produced both quantitative and qualitative data. Quantitatively, each trial provided an objective performance score from the implemented system and a subjective self-assessment from the participant. This allowed for a comparison of self-perception vs. actual performance. Using the evaluation scores, an improvement metric estimates the participant’s trick execution across the trials. The primary hypothesis for efficiency was to demonstrate that the experimental group showed greater improvement than the control group, indicating faster and more effective learning. For subjective responses, such as clarity of feedback and confidence after each trial, trends were analyzed across the five attempts. All Likert scales were matched and summarized according to the hypotheses. The SUS responses were both statistically compared and converted to a 0-100 score per standard guidelines [B+96]. Finally, the qualitative open-ended answers were analyzed by open coding methodology to identify important codes. Overall, this evaluation approach combines a quantitative check on the model’s per- formance with a human-centered assessment of the system’s effectiveness and user experience. 64 6.2. User Study Design Question-hypothesis mapping & aggregates. The underlying hypotheses grouped each questionnaire item and appropriately aggregated it into simple composites. Each composite is identified by a short code for cross-referencing in Chapter 6. • H1 (Trust & Adoption). – H1A1 (per-trial): Q: How confident are you that you know what to improve next? → Aggregate: mean across trials. – H1A2 (post-session): Q: The feedback I received was accurate. Q: I felt confident that this digital review method correctly reflected my perfor- mance. → Aggregate: mean over both items. • H2 (Efficiency). – H2A1 (performance slope): Q: How would you rate your performance of this trial? → per-participant least-squares slope of the five self-ratings. – H2A2 (performance delta): Q: How would you rate your performance of this trial? → per-participant change of the same self-ratings. – H2A3 (self vs. system agreement): Q: How would you rate your performance of this trial? → per-participant averages of self-ratings and Pole-Arina scores for correlation and paired comparison. • H3 (Understandability). – H3A1 (per-trial): Q: How clear was the feedback you received from the digital review method? → Aggregate: mean across trials. – H3A2 (post-session): Q: I understood how to interpret this feedback to improve my form Q: The digital review method helped me identify mistakes. → Aggregate: mean over both items. • H4 (Usability). – H4A1 (SUS): System Usability Scale score (0-100) computed per standard rules [B+96]. 65 6. Evaluation & Results 6.3 User Study Results This section reports the outcomes of the between-groups user study. It first profiles participants’ demographics to contextualize subsequent analyses. It then presents the hy- pothesis tests for H1-H4 with corresponding visualizations, highlighting where differences between conditions are statistically reliable and where effects were not detected. The section closes with qualitative feedback that complements the quantitative findings and surfaces design implications for Pole-Arina. 6.3.1 Demographics. By design, the user study proposed two different groups (dancers or non-dancers) with two distinct conditions (Experimental or Control). To ensure meaningful results, the protocol required a sample size of between 20 and 30 participants. In the end, a total of 33 participants completed the study: 17 in the Experimental group and 16 in the Control group (see Figure 6.1). Similar to the dataset demographics, the final set resembles the composition of regular pole classes while covering a broad range of execution quality. Figure 6.1: Condition balance bars. Bars show absolute counts. Experience balance. Depending on the prior pole experience, a matching practice trick was assigned for the trials. Balancing experience across conditions was therefore essential to avoid confounding when comparing feedback methods. The stacked balance bar in Figure 6.2 shows a near-equal distribution of non-dancers and dancers within both the Experimental and Control groups, enabling fair analyses of improvement and feedback quality across skill levels. Figure 6.2: Experience balance bars. Bars show absolute counts. Gender context. Participation reflected typical studio demographics, with most respondents identifying as female, fewer male participants, and no respondents selecting the alternative option. The distribution is shown in Figure 6.3. Figure 6.3: Gender balance bars. Bars show absolute counts. 66 6.3. User Study Results Age distribution. Figure 6.4 presents the age spread using a horizontal boxplot. The ages span between 19 and 56 years old, with a median of 30, which aligns with the local studio’s age distributions. Figure 6.4: Age distribution boxplot. Training frequency & technology comfort. To contextualize prior exposure and likely learning dynamics, Figure 6.5a presents the weekly pole-training frequencies, and Figure 6.5b shows a self-estimated technology comfort rating on a 1-5 Likert scale. (a) Weekly pole training frequency: 0-1, 2-3, 4-5, or 6+. (b) Technology comfort: higher equals more comfortable with apps/websites. Figure 6.5: Participant distributions for (a) weekly training frequency and (b) technology comfort. Category labels match the questionnaire options. Bars show absolute counts. 67 6. Evaluation & Results 6.3.2 Hypothesis Tests All analyses were run in IBM SPSS Statistics Version 31.0.0.0. For each composite, normality was assessed with Kolmogorov-Smirnov and Shapiro-Wilk tests. As all cases vi- olated normality, the Mann-Whitney U was chosen to perform all conditional comparisons (Experimental N=17 vs. Control N=16). H1: Trust & Adoption (H1A1, H1A2). • H1A1 - Results: There was a statistically significant difference in H1A1 between Experimental (mean rank = 20.88) and Control (mean rank = 12.88) condition, U=70.00, Z=−2.430, p=.015. • H1A2 - Results: There was a statistically significant difference in H1A2 between Experimental (mean rank = 20.79) and Control (mean rank = 12.97) condition, U=71.50, Z=−2.702, p=.007. Interpretation: Participants using Pole-Arina reported higher accuracy for the feedback and greater confidence that the method reflected their performance, and they felt it was clearer what to improve next. (a) H1A1: per-trial confidence, 1-5. (b) H1A2: post-session trust, 1-5. Figure 6.6: Value distributions by condition, H1. H2: Efficiency • H2A1 - Results: There was no statistically significant difference in H2A1 (slope- based improvement) between Experimental (mean rank = 15.41) and Control (mean rank = 18.69) condition, U=117.50, Z=−.698, p=.485. • H2A2 - Results: There was no statistically significant difference in H2A2 (delta- based improvement) between Experimental (mean rank = 15.91) and Control (mean rank = 18.16) condition, U=117.50, Z=−.698, p=.485. 68 6.3. User Study Results Interpretation: Both groups improved over five trials to a similar extent. On average, participants’ self-ratings systematically differed from the system’s ratings (inspection of per-participant differences indicated a tendency to rate themselves lower than the tool). However, the rank-order association between the two was weak. (a) H2A1: improvement slope per participant. (b) H2A2: improvement delta per participant. Figure 6.7: Value distributions by condition, H2. H3: Understandability • H3A1 - Results: There was a statistically significant difference in H3A1 between Experimental (mean rank = 21.94) and Control (mean rank = 11.75) condition, U=52.00, Z=−3.100, p=.002. • H3A2 - Results: There was a statistically significant difference in H3A2 between Experimental (mean rank = 21.59) and Control (mean rank = 12.13) condition, U=58.00, Z=−3.266, p=.001. Interpretation: Pole-Arina’s overlays and explanations were rated as significantly clearer and more helpful for understanding how to fix mistakes, compared to traditional video analysis. H4: Usability • H4A1 - Results: There was a statistically significant difference in H4A1 between Experimental (mean rank = 20.76) and Control (mean rank = 13.00) condition, U=72.00, Z=−2.335, p=.020. • H4A1 - Descriptives: Experimental M=95.44, SD=4.07, Median = 95.0; Control M=86.41, SD=11.14, Median = 88.75. Interpretation: Both reviewing methods achieved high SUS scores, with Pole-Arina rated significantly higher (“Best imaginable” usability) by benchmark guidelines (see Table 6.1). 69 6. Evaluation & Results (a) H3A1: per-trial clarity, 1-5. (b) H3A2: post-session understandability, 1-5. Figure 6.8: Value distributions by condition, H3. Figure 6.9: Value distributions by condition, H4A1: SUS score, 0-100. Table 6.1: SUS benchmark guidelines (after Bangor et al.[BKM09]). SUS score Adjective rating Grade ≥ 90 Best imaginable A+ 80.3–89 Excellent A 74–80 Good B 68–73 OK–Good C 51–67 OK / Marginal D < 51 Poor / Not acceptable F 6.3.3 Qualitative Results A Pole-Arina-specific post-study questionnaire was used to collect qualitative feedback from all 33 participants. The responses provide rich insight into how users perceived the system’s usefulness, accuracy, and areas for improvement, which helps in evaluating our hypotheses. 70 6.3. User Study Results Would you use Pole-Arina regularly in your training? Overall, the feedback was overwhelmingly positive: all participants indicated that they would use the application in their training routine. Many said they would use it definitely or regularly to improve their form, with a few noting they might not use it every single attempt, but certainly every training session or for challenging moves. For example, one participant wrote, “I would definitely use it, it’s a wonderful and invaluable tool to correct even moves that you thought you knew how to do”. This strong usage intention demonstrates a high acceptance of the system. Participants were enthusiastic about incorporating the app into solo practice at home, and even as a complement during classes. Did the overlays help you understand how to improve your form? All par- ticipants found the feedback rules and visual overlays helpful in understanding and correcting their technique. In particular, dancers reported that the system’s precise, objective feedback helped them notice details they would otherwise overlook. “The graphical representation of the lines is really helpful to see where and how you can improve the movement” noted one user. Do you think Pole-Arina would be useful in beginner and/or advanced classes? Many highlighted that the tool was especially beneficial for beginners. Simultaneously, most participants agreed it would “be useful to all levels”, with advanced dancers using it to “give the pose the final touch” and perfect their form. A few novices did caution that a complete beginner might feel overwhelmed using the app without any prior instruction. However, after learning the basics in class, they felt the app would be handy for independent practice. How accurate did you find Pole-Arina’s performance analysis? Participants generally reported that the system’s analysis was accurate and reliable, giving them confidence in the feedback. Many described the pose analysis as “very accurate” noting that the app correctly identified their form errors. However, a few minor inaccuracies were observed. For instance, a few users mentioned the system occasionally had trouble recognizing a fully pointed foot or confused two very similar tricks (Layout vs. Pin-Up) on the first try. One participant wrote: “Some accuracy problems in identifying similar poses, but other than that it seemed quite accurate.” Others noted that if a body part was hidden from the camera, the system sometimes missed that joint, leading to a less optimal frame selection or an incomplete evaluation. Nonetheless, participants mainly understood these issues as minor limitations of the current prototype (often related to camera angle or body positioning) rather than fundamental flaws. This overall trust in the accuracy of the system is critical for its validity and was reflected in comments like “Absolut akurat! Ich stimm der App vollkommen zu, was die Bewertung meiner Posen betrifft.” (Absolutely accurate! I fully agree with the app regarding the evaluation of my poses.). 71 6. Evaluation & Results What did you like most about using Pole-Arina? Participants highlighted several aspects of the application that they liked best. A dominant theme was the visual and detailed nature of the feedback. Nearly all users praised the overlay of angles, lines, and highlighted body segments on their pose images, which made it “immediately clear what to improve and how.” They also appreciated the simple, intuitive interface and workflow. Several described the tool as “easy to use” and the feedback presentation as “clear and specific.” The ability to scrub through recorded video frames and see feedback for each attempt in a summary was also frequently mentioned. Other answers mentioned that the tool introduced a game-like or self-competitive element to practice: “it was a fun, gamified experience . . . it motivated me to improve each time to look better,” wrote one participant. Such comments suggest the system can increase engagement and enjoyment in training, potentially improving adherence. What would you change or improve about Pole-Arina? While the overall feedback was positive, participants also provided valuable suggestions and pointed out current limitations, which helped identify avenues for future work. For example, many participants wanted to see a reference of the “perfect” pose for comparison. Similarly, participants asked for integrated tutorials or tip videos. Another highly requested feature was real-time feedback. For instance, the app could give an audio cue whenever the dancer achieves the correct form or if a major mistake occurs. Users found the idea of live feedback exciting, as it could help them adjust their pose immediately rather than only correcting on the next attempt. In summary, the qualitative results show that participants overwhelmingly found the system beneficial, easy to use, and effective. At the same time, users provided constructive feedback highlighting the current limitations. The current system has a limited set of moves with offline feedback and minor detection and scoring quirks. These limitations, however, directly point to concrete improvements that form the basis of future work. 6.4 Discussion Summary of findings. The study examined whether a deep learning-based coaching system improves trust, learning efficiency, understandability, and usability compared with traditional video self-review. Overall, three out of four hypotheses were supported: • H1 (Trust & Adoption) • H3 (Understandability) • H4 (Usability) H2 (Efficiency) was not supported within the five-trial protocol: improvement slopes and deltas did not differ significantly between conditions. Together, these results indicate that RQ3 is answered positively for trust, understandability, and usability, while efficiency advantages were not detected under the present design. 72 6.4. Discussion Interpretation & likely causes. Two factors explain the absence of a measurable efficiency difference over five trials. First, traditional video replay represents a strong baseline for immediate improvement, especially if one is familiar with the tricks. Second, efficiency advantages from structured cues often manifest over longer practice horizons and across various mistakes. Furthermore, participants described overlays as precise and helpful for understanding how to adjust form, and adoption intent was uniformly high, suggesting that benefits may accumulate with continued use. Alignment with model results. The quantitative model’s performance established a reliable basis for the user experience. Per-frame accuracy and trick-only accuracy were high, and end pose detection remained robust across multi-trick sequences. The rule-based scoring provided transparent, geometric justifications, while the qualitative remarks are consistent with these properties and help explain the observed advantages in trust and understandability. In summary, the study validates Pole-Arina’s primary goal: pairing accurate recognition with transparent, geometry-based explanations yields feedback that users trust and understand, laying the groundwork for measurable skill gains as practice extends beyond a single session. 73 CHAPTER 7 Conclusion This thesis introduced Pole-Arina, a novel marker-less deep learning-based coaching system designed for static pole dancing tricks. The development and evaluation of Pole-Arina addressed a clear gap in technology for dismissed sports, providing feedback and analysis without the need for wearable sensors. The system recognizes the performed trick, isolates end poses, and grades form using transparent geometric rules rendered as visual overlays. A primary contribution is a domain-specific dataset tailored to pole dancing technique. It includes: 836 clips from 58 participants, annotated for phases and end poses, and released as 3D skeleton (with an estimated depth value) sequences to protect privacy. A revised single-head label scheme supports multi-trick recordings and explicit background modeling. Building on this foundation, a lightweight bidirectional LSTM performs frame- wise recognition over six static tricks, and a rule engine converts landmark geometry into pass/fail checks and an overall pose score. A full-stack prototype implements the pipeline and surfaces interpretable feedback through interactive overlays. The research questions were addressed as follows: • RQ1: was met with strong results. 93.82% per-frame accuracy across all classes and 98.74% trick-only accuracy on end-pose frames, with robust behavior on multi-trick sequences. • RQ2: was realized via explicit, trick-specific geometric rules that map angles, ori- entations, and proximities. Feedback and visual cues enable consistent, transparent grading. • RQ3: was examined in a controlled user study (N=33): compared to traditional video self-review, Pole-Arina achieved significantly higher ratings for trust/accuracy and understandability, and a higher SUS usability score (best imaginable by bench- mark guidelines). 75 7. Conclusion These findings suggest that Pole-Arina can provide meaningful support for independent practice in pole sports. The approach generalizes to other domains where end poses encode most of the instructional signal and where transparent rule checks promote trust. Current implementation limitations & improvements. The present prototype is optimized for single-camera, offline analysis of static tricks. This choice simplifies deployment but introduces practical constraints. First, robustness can degrade under strong occlusions, extreme viewpoints, or low light. Next, geometric rules emphasize alignment and orientation but do not yet cover fine stylistic criteria. Furthermore, the workflow involves smartphone capture and laptop-based processing, so near–real-time feedback on-device is not evaluated. Immediate improvements include on-device/mobile inference for faster computation, camera-guidance prompts (viewing angle, distance, lighting) to reduce failure modes, confidence/uncertainty indicators in the UI, and incremental rule catalogs with editable tolerances. User study limitations & future work. It is important to note the following limitations: • Sample and setting: The study took place in a single studio with a modest sample size (N=33), which bounds external validity. • Protocol length: Five attempts in one session may be insufficient to expose efficiency differences that require consolidation over longer practice intervals. • Set-up: The user study employed a laptop-based workflow with smartphone recordings and post-hoc transfer. Therefore, mobile deployment or real-time evaluation was not tested. • Task scope: Only a fixed set of tricks and rule configurations was evaluated. Generalization to a broader repertoire remains to be demonstrated. Finally, qualitative reasoning followed a lightweight thematic approach. While themes were consistent with quantitative outcomes, future work could include multi-coder reliability checks to strengthen interpretive claims. Altogether, Pole-Arina demonstrates that a compact Bi-LSTM paired with interpretable geometric scoring and a privacy-first dataset can deliver accurate recognition, clear feedback, and high user trust. This establishes a practical baseline for AI coaching in pole dance and points to an expandable pathway for accessible, marker-less coaching across movement disciplines. 76 Overview of Generative AI Tools Used Generative AI tools (OpenAI ChatGPT and Grammarly) were used solely as writing aids for surface-level editing (grammar, punctuation, and minor rephrasing) and for translating the abstract/acknowledgements. No AI tools were used to generate research ideas, design the study, create or analyze data, write technical content, or produce figures or results. All scientific claims, methods, datasets, and conclusions are my own. 77 Übersicht verwendeter Hilfsmittel Generative KI programme (OpenAI ChatGPT und Grammarly) wurden im Schreibpro- zess zur Kontrolle der Grammatik und Zeichensetzung, sowie in geringem Ausmaß als Formulierungshilfe eingesetzt. KI wurde nicht zur Entwicklung wissenschaftlicher Ideen, Studienplanung, der Verarbeitung von Daten, Generierung von Tabellen, Grafiken oder technischer Texte genutzt. Alle wissenschaftlichen Behauptungen, Methoden, Datensätze und Schlussfolgerungen sind meine eigenen. 79 List of Figures 2.1 Visual comparison of different motion capture technologies. . . . . . . . . 7 3.1 Progression of each trick, highlighting similar entries and transitions before the final pose. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Side-by-side comparison of applying both protocols to the same video. . 22 3.3 Qualitative comparison of skeleton overlays across four pole tricks. . . . . 23 3.4 Per-trick class balance. Number of videos containing at least one end-pose for each target trick. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5 Protocol B. Percentages above bars show the relative contribution of each label to the total of 212,574 labeled frames. . . . . . . . . . . . . . . . . . 27 3.6 Protocol A. Bar height shows the absolute number of labeled frames per trick. The phase labels occur in the following stack order: bottom=Start, mid- dle=Transition, top=end (highlighted in a trick-specific color). Percentages inside the bars indicate the relative share of each phase. . . . . . . . . . . 28 3.7 Box plot of coverage ratio with most labels achieving near-perfect coverage. 29 3.8 MediaPipe coverage information. . . . . . . . . . . . . . . . . . . . . . . . 30 3.9 Experience balance bars (Non-dancer vs. Dancer). . . . . . . . . . . . . . 30 3.10 Gender balance bars (Female vs. Male). . . . . . . . . . . . . . . . . . . . 30 3.11 Age distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Pole-Arina end-to-end pipeline. . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Bidirectional LSTM architecture, taken from [NJ22]. . . . . . . . . . . . . 35 4.3 Rules mapped to visual overlays on a Pin-Up pose. . . . . . . . . . . . . . 38 5.1 Preprocessing pipeline overview. . . . . . . . . . . . . . . . . . . . . . . . 40 5.2 EMA reduces high-frequency jitter on a representative Straddle Invert clip. 40 5.3 Dataset-level jitter reduction by trick. . . . . . . . . . . . . . . . . . . . . 41 5.4 Split-style shapes: horizontal, diagonal, and vertical presentations. . . . . 43 5.5 Evaluation diagnostics for the second model iteration, serving as an overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.6 Learning rate sweep: over 30 epochs for different learning rates. . . . . . . 48 5.7 Left: per-class recall on the test set. Right: confusion matrix on the test set. 49 5.8 Successful detection of 8/8 tricks in one video. . . . . . . . . . . . . . . . 49 81 5.9 Examples of passed and failed rules displayed through interactive overlays. Top: failed, bottom: passed. . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.10 Upload & analyze: single-video upload starts a new session. . . . . . . . . 56 5.11 Summary: trick cards with thumbnails, scores, and an add-video button. . 57 5.12 Detail: frame viewer with overlay controls and per-rule feedback. . . . . . 58 5.13 Dashboard: session-level statistics. . . . . . . . . . . . . . . . . . . . . . . 59 5.14 Best/Worst display example. . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.1 Condition balance bars. Bars show absolute counts. . . . . . . . . . . . . 66 6.2 Experience balance bars. Bars show absolute counts. . . . . . . . . . . . . 66 6.3 Gender balance bars. Bars show absolute counts. . . . . . . . . . . . . . . 66 6.4 Age distribution boxplot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.5 Participant distributions for (a) weekly training frequency and (b) technology comfort. Category labels match the questionnaire options. Bars show absolute counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.6 Value distributions by condition, H1. . . . . . . . . . . . . . . . . . . . . . 68 6.7 Value distributions by condition, H2. . . . . . . . . . . . . . . . . . . . . . 69 6.8 Value distributions by condition, H3. . . . . . . . . . . . . . . . . . . . . . 70 6.9 Value distributions by condition, H4A1: SUS score, 0-100. . . . . . . . . . 70 82 List of Tables 2.1 Marker-based vs. marker-less comparison overview. . . . . . . . . . . . . . 9 3.1 Runtime benchmark on a five-second, 360×640 video (164 frames). CPU = Colab CPU runtime; GPU = Colab T4. OpenPose results use the COCO-18 model via OpenCV DNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Most common portrait resolutions within the dataset. . . . . . . . . . . . 29 3.3 Compact summary of the selected tricks; terminology aligned with IPSF and Spin City [Fed25, Cit25]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.1 Bidirectional LSTM, 2 layers, hidden size 64, and dropout rate 0.2. . . . . 50 5.2 Complete rule catalog across all tricks. . . . . . . . . . . . . . . . . . . . . 60 6.1 SUS benchmark guidelines (after Bangor et al.[BKM09]). . . . . . . . . . 70 83 List of Algorithms 85 Bibliography [ADAdCB19] Yuri Sousa Aurelio, Gustavo Matheus De Almeida, Cristiano Leite de Cas- tro, and Antonio Padua Braga. Learning from imbalanced data sets with weighted cross-entropy function. Neural processing letters, 50(2):1937–1949, 2019. [AJJB24] Aditya Agarwal, Parth Jha, Ojas Jain, and Asish Bera. Poa-net: Dance poses and activity classification using convolutional neural networks. In 2024 IEEE Region 10 Symposium (TENSYMP), pages 1–6. IEEE, 2024. [B+96] John Brooke et al. Sus-a quick and dirty usability scale. Usability evaluation in industry, 189(194):4–7, 1996. [BGR+20] Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. Blazepose: On-device real-time body pose tracking. arXiv preprint arXiv:2006.10204, 2020. [BKM09] Aaron Bangor, Philip Kortum, and James Miller. Determining what individual sus scores mean: Adding an adjective rating scale. Journal of usability studies, 4(3):114–123, 2009. [BNKB23] Asish Bera, Mita Nasipuri, Ondrej Krejcar, and Debotosh Bhattacharjee. Fine-grained sports, yoga, and dance postures recognition: A benchmark analysis. IEEE Transactions on Instrumentation and Measurement, 72:1– 13, 2023. [BP21] Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021. [CECS18] Steffi L Colyer, Murray Evans, Darren P Cosker, and Aki IT Salo. A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system. Sports medicine-open, 4:1–15, 2018. [CHS+19] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity 87 fields. IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186, 2019. [Cit25] Spin City. The Ultimate Pole Bible. Spin City Aerial Fitness Ltd, 2025. [CZDK21] Anargyros Chatzitofis, Dimitrios Zarpalas, Petros Daras, and Stefanos Kollias. Democap: Low-cost marker-based motion capture. International Journal of Computer Vision, 129(12):3338–3366, 2021. [DWDW25] Zhao Du, Shan Wang, Ziyan Deng, and Fang Wang. Unveiling the power of ai fitness apps: a uses and gratifications perspective. Journal of Global Information Management (JGIM), 33(1):1–28, 2025. [EC20] Aysu Ezen-Can. A comparison of lstm and bert for small corpus. arXiv preprint arXiv:2009.05451, 2020. [Fed25] International Pole Sports Federation. Code of points 2025 – 2027. https: //ipsfsports.org/downloads/Uncategorised/ipsf_pole_ sports_code_of_points_2025-2027_final_070120240.pdf, 2025. Accessed: 2025-08-16. [FHC19] Muhammad Fikri, Samiadji Herdjunanto, and Adha Cahyadi. On the performance similarity between exponential moving average and discrete linear kalman filter. In 2019 Asia Pacific Conference on Research in Industrial and Systems Engineering (APCoRISE), pages 1–5. IEEE, 2019. [GFS05] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Bidirec- tional lstm networks for improved phoneme classification and recognition. In International conference on artificial neural networks, pages 799–804. Springer, 2005. [Goo25] Google. Mediapipe pose landmarker. https://ai.google.dev/edge/ mediapipe/solutions/vision/pose_landmarker, 2025. Medi- aPipe Solutions, Google AI Edge. Accessed: 2025-08-22. [GRRCR23] Indrajeet Ghosh, Sreenivasan Ramasamy Ramamurthy, Avijoy Chakma, and Nirmalya Roy. Sports analytics review: Artificial intelligence appli- cations, emerging technologies, and algorithmic perspective. Wiley Inter- disciplinary Reviews: Data Mining and Knowledge Discovery, 13(5):e1496, 2023. [HNR+25] Wenjun Huang, Yang Ni, Arghavan Rezvani, SungHeon Jeong, Hanning Chen, Yezi Liu, Fei Wen, and Mohsen Imani. Recoverable anonymization for pose estimation: A privacy-enhancing approach. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5239–5249. IEEE, 2025. 88 [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [HZMY22] Changwu Huang, Zeqi Zhang, Bifei Mao, and Xin Yao. An overview of artificial intelligence ethics. IEEE Transactions on Artificial Intelligence, 4(4):799–819, 2022. [KK18] Yeonho Kim and Daijin Kim. Real-time dance evaluation by markerless human pose estimation. Multimedia Tools and Applications, 77:31199– 31220, 2018. [LCLX25] Yihua Li, Hongyue Chen, Yiqing Li, and Yetong Xin. Poespin: A human-ai dance to poetry system for movement-based verse generation. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 8(3):1–13, 2025. [Lem24] Mark A. Lemley. How generative ai turns copyright upside down. Stanford Technology Law Review, 25(1):21–48, 2024. [LHK22] Chen-Chieh Liao, Dong-Hyun Hwang, and Hideki Koike. Ai golf: Golf swing analysis tool for self-training. IEEE Access, 10:106286–106295, 2022. [LTNX23] Julienne LaChance, William Thong, Shruti Nagpal, and Alice Xiang. A case study in fairness evaluation: Current limitations and challenges for human pose estimation. In Association for the Advancement of Artificial Intelligence 2023 Workshop on Representation Learning for Responsible Humancentric AI (R2HCAI), Washington, DC, volume 1, 2023. [Luc24] Nicola Lucchi. Chatgpt: a case study on copyright challenges for genera- tive artificial intelligence systems. European Journal of Risk Regulation, 15(3):602–624, 2024. [LWHL24] Yugang Li, Baizhou Wu, Yuqi Huang, and Shenghua Luan. Develop- ing trustworthy artificial intelligence: insights from research on interper- sonal, human-automation, and human-ai trust. Frontiers in psychology, 15:1382693, 2024. [LX13] Jianchao Lv and Shuangjiu Xiao. Real-time 3d motion recognition of skeleton animation data stream. International Journal of Machine Learning and Computing, 3(5):430, 2013. [MK23] Marina Mikami and Noriyuki Kida. Categorizing rhythmic jumping motion using motion capture without markers. Advances in Physical Education, 13(2):93–105, 2023. [MMN+24] Carmina Liana Musat, Claudiu Mereuta, Aurel Nechita, Dana Tutunaru, Andreea Elena Voipan, Daniel Voipan, Elena Mereuta, Tudor Vladimir 89 Gurau, Gabriela Gurău, and Luiza Camelia Nechita. Diagnostic appli- cations of ai in sports: a comprehensive review of injury risk prediction methods. Diagnostics, 14(22):2516, 2024. [NJ22] Dinesh Naik and CD Jaidhar. A novel multi-layer attention framework for visual description prediction using bidirectional lstm. Journal of Big Data, 9(1):104, 2022. [Nor25] Northern Digital Inc. Optotrak3020. https://tsgdoc.socsci. ru.nl/images/e/eb/Optotrak_Certus_User_Guide_rev_6% 28IL-1070106%29.pdf, 2025. Optotrak3020. Accessed: 2025-09-08. [OGK+24] Bengie L Ortiz, Vibhuti Gupta, Rajnish Kumar, Aditya Jalin, Xiao Cao, Charles Ziegenbein, Ashutosh Singhal, Muneesh Tewari, and Sung Won Choi. Data preprocessing techniques for ai and machine learning readiness: Scoping review of wearable sensor data in cancer care. JMIR mHealth and uHealth, 12(1):e59587, 2024. [PC16] European Parliament and Council. Regulation (eu) 2016/679 (gen- eral data protection regulation). https://eur-lex.europa.eu/eli/ reg/2016/679/oj/eng, 2016. Accessed: 2025-08-17. [PPW+24] Zhiqiang Pu, Yi Pan, Shijie Wang, Boyin Liu, Min Chen, Hao Ma, and Yixiong Cui. Orientation and decision-making for soccer based on sports analytics and ai: A systematic review. IEEE/CAA Journal of Automatica Sinica, 11(1):37–57, 2024. [PTM17] Paritosh Parmar and Brendan Tran Morris. Learning to score olympic events. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 20–28, 2017. [Qu24] Jiping Qu. A dance movement quality evaluation model using transformer encoder and convolutional neural network. Scientific Reports, 14(1):32058, 2024. [Sch11] Ronald W Schafer. What is a savitzky-golay filter?[lecture notes]. IEEE Signal processing magazine, 28(4):111–117, 2011. [SK19] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019. [STL24] Xiang Suo, Weidi Tang, and Zhen Li. Motion capture technology in sports scenarios: a survey. Sensors, 24(9):2947, 2024. [TBL18] Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent ad- vances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018. 90 [TP23] Atima Tharatipyakul and Suporn Pongnumkul. Deep learning-based pose estimation in providing feedback for physical movement: A review. 2023. [TS14] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estima- tion via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014. [Ubi09] Ubisoft Paris. Just dance, 2009. Video game. [Vic25] Vicon Motion Systems Ltd UK. Vicon. https://www.vicon.com/, 2025. Vicon. Accessed: 2025-09-08. [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [Wea20] Charlene Weaving. Sliding up and down a golden glory pole: Pole dancing and the olympic games. Sport, Ethics and Philosophy, 14(4):525–536, 2020. [Wie25] TU Wien. Data protection at tu wien. https://www. tuwien.at/en/tu-wien/organisation/central-divisions/ data-protection-and-document-management/ data-protection-at-tu-wien, 2025. Accessed: 2025-08-17. [WKM+19] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/ detectron2, 2019. [XKCP24] Chu Xin, Seokhwan Kim, Yongjoo Cho, and Kyoung Shin Park. Enhancing human action recognition with 3d skeleton data: A comprehensive study of deep learning and data augmentation. Electronics, 13(4):747, 2024. [Yu20] Hongbo Yu. Application research and analysis of college pole dance teaching based on virtual reality technology. In International Conference on Application of Intelligent Systems in Multi-modal Information Analytics, pages 602–610. Springer, 2020. [Zar21] S Zargar. Introduction to sequence learning models: Rnn, lstm, gru. Department of Mechanical and Aerospace Engineering, North Carolina State University, 37988518, 2021. [ZWC+23] Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Sijie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey. ACM computing surveys, 56(1):1–37, 2023. 91