Proceedings of the 6th International Workshop on Reading Music Systems
22nd November, 2024

Organization

General Chairs
Jorge Calvo-Zaragoza, University of Alicante, Spain
Alexander Pacha, TU Wien, Austria
Elona Shatri, Queen Mary University of London, United Kingdom

Proceedings of the 6th International Workshop on Reading Music Systems, 2024
Edited by Jorge Calvo-Zaragoza, Alexander Pacha, and Elona Shatri

© The respective authors. Licensed under a Creative Commons Attribution 4.0 International License (CC-BY-4.0).
Logo made by Freepik from www.flaticon.com. Adapted by Alexander Pacha.

Preface

Dear colleagues,

We are proud to present the proceedings of the 6th International Workshop on Reading Music Systems (WoRMS). Over the past few years, interest in Music Reading Systems has continued to grow. This year marks a new record, with a total of 22 submissions, 15 of which have been accepted to the workshop. A few papers are omitted from the proceedings by request of the authors. We took great care to provide comprehensive feedback to authors whose works were not accepted, highlighting areas for improvement to meet the quality standards of WoRMS. We hope to see these authors submit their revised works next year.

Due to logistical reasons, we have decided to host this year’s edition online again. This format allows participants from all over the world to join easily and learn about the latest developments without the need for extensive travel. However, we acknowledge that an online format cannot fully replace the experience of face-to-face interactions, and we aim to make future editions on-site events once more.

We would like to take this opportunity to promote the GitHub organization https://github.com/omr-research once more, which welcomes contributions from everyone and serves as a central hub for publishing and discovering research-related repositories.
Additionally, we encourage you to explore our public YouTube channel, https://www.youtube.com/OpticalMusicRecognition, which has nearly 250 subscribers and hosts recordings of previous years’ sessions. This year’s presentations will also be uploaded there. If you have additional content, beyond your WoRMS submission, that you would like to share on this channel, please get in touch with us.

We look forward to engaging presentations and discussions and hope to see many of you again next year.

Jorge Calvo-Zaragoza, Alexander Pacha, and Elona Shatri

Contents

Jorge Calvo-Zaragoza, Eliseo Fuentes-Martínez, Noelia Luna-Barahona, Antonio Ríos-Vila
Can multimodal large language models read music score images?

Antonio Ríos-Vila, Eliseo Fuentes-Martinez, Jorge Calvo-Zaragoza
Towards Sheet Music Information Retrieval: A Unified Approach Using Multitask Transformers

Grégoire de Lambertye, Alexander Pacha
Semantic Reconstruction of Sheet Music with Graph-Neural Networks

Vojtěch Dvořák, Jan Hajič jr., Jiří Mayer
Staff Layout Analysis Using the YOLO Platform

Pau Torras, Sanket Biswas, Alicia Fornés
On Designing a Representation for the Evaluation of Optical Music Recognition Systems

Aitana Menárguez-Box, Alejandro H. Tosselli, Enrique Vidal
Enhanced User-Machine Interaction for Historical Sheet Music Retrieval: a Musical Notation Approach

Bertrand Coüasnon, Mathieu Giraud, Christophe Guillotel Nothmann, Aurélie Lemaitre, Philippe Rigaux
The CollabScore project – From Optical Recognition to Multimodal Music Sources

Tristan Repolusk, Eduardo Veas
Semi-Automatic Annotation of Chinese Suzipu Notation Using a Component-Based Prediction and Similarity Approach

Janosch Umbreit, Silvana Schumann
OMR on Early Music Sources at the Bavarian State Library with MuRET – Prototyping, Automating, Scaling

Alexander Hartelt, Frank Puppe
OMMR4all revisited – a Semiautomatic Online Editor for Medieval Music Notations

Nivesara Tirupati, Elona Shatri, György Fazekas
Crafting Handwritten Notations: Towards Sheet Music Generation

Can multimodal large language models read music score images?

Jorge Calvo-Zaragoza, Eliseo Fuentes-Martínez, Noelia Luna-Barahona, Antonio Ríos-Vila
Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain

Abstract: This paper investigates whether multimodal large language models (MLLMs), which combine visual and textual understanding, can effectively read and interpret music score images. Given their ability to process and integrate information from multiple modalities, MLLMs present a promising approach for Optical Music Recognition (OMR). Through empirical evaluation, we demonstrate that while MLLMs exhibit potential in recognizing musical structures, challenges remain in addressing the complexity of music notation. This work highlights the need for further refinements in MLLM architectures to improve their effectiveness in OMR tasks.

Index Terms: Multimodal Large Language Models, Optical Music Recognition, Music Information Retrieval.

I. INTRODUCTION

Optical Music Recognition (OMR) is a challenging area of research that studies how to computationally read music notation in documents [1].
Traditional OMR systems rely on specific computer vision and machine learning techniques to identify musical symbols [2], but recent advances in deep learning, particularly the development of multimodal large language models (MLLMs), have opened up new possibilities for interpreting music scores. MLLMs integrate information from both visual and textual inputs and have shown remarkable success in tasks that require an understanding of multiple modalities, such as image captioning and visual question answering [3], [4].

This paper explores whether MLLMs can be leveraged to interpret music score images by processing both the visual aspects of the score and the symbolic structure of the music. The question we seek to answer is: Can MLLMs be used for the task of reading music score images? We hypothesize that while MLLMs have the potential to recognize some elements of music notation, the unique challenges posed by the structure and complexity of music require further adaptation of existing architectures. While this might be no surprise, no previous work has evaluated this scenario.

The authors appear in alphabetical order.

II. METHODOLOGY

This is a preliminary study that evaluates the capabilities of general-purpose MLLMs for reading music scores. Each model was tested with the same set of cropped music score images, and its outputs were analyzed to determine the extent to which it can interpret and describe music notation. For this purpose, we selected a tiny sample of music score image crops to present to these general-purpose MLLMs. We informally built specific prompts to assess different capabilities regarding sheet music reading. All the components of our study are described below.

A. General models

We aim to assess how general-purpose MLLMs, which have been successful in disparate fields, perform when faced with the task of reading sheet music or retrieving some specific information from music score images.
Below, we provide a brief overview of the general-purpose MLLMs tested in our experiments: ChatGPT (GPT-4V) [5], Gemini [6], Llama 3.2 [7], Mistral 7B, and Claude 3.5 [8].

B. Sample

The (tiny) set of samples selected to evaluate the MLLMs is depicted in Fig. 1, including different textures such as monophonic (mono), pianoform (piano), and vocal textures.¹ As can be observed, apart from the variability in textures, the images are relatively simple in terms of graphic complexity. In addition, they are rather well-known culturally and socially.

¹ The images were taken from IMSLP Petrucci Music Library. Accessed September 30, 2024. International Music Score Library Project. https://imslp.org/.

(a) Mono: Excerpt of Symphony No. 9 in D minor, Op. 125 by L.V. Beethoven.
(b) Piano: Excerpt of Piano Sonata No. 11 in A major, K. 331 by W.A. Mozart.
(c) Vocal: Excerpt of My Way, lyrics by Paul Anka, music by Claude François and Jacques Revaux.
Fig. 1: Sample of images used for evaluating the capabilities of the MLLMs, involving different music textures (monophonic, pianoform, and vocal).

C. Capabilities

We identify four interesting capabilities to assess the MLLMs. These were translated into four questions (prompts) that are outlined below:
• Q1: Piece recognition: Evaluates whether the model can identify the composition from a cropped score. This capability tests the model’s broader cultural understanding and whether it can associate visual notation with specific compositions or composers.
• Q2: Transcription: Assesses the model’s ability to convert music notation to a symbolic format. This is a core OMR task, requiring the model to interpret the visual layout of the music notation.
• Q3: Tonality identification: Tests if the model can infer the score’s tonality. It requires both graphical recognition and some understanding of basic music notation.
• Q4: Texture classification: Examines if the model can recognize the type of musical texture. This is a simple graphical task, but requires understanding of the layout of sheet music.

Each task was designed to capture a specific facet of music reading, from broad cultural knowledge (Q1) to technical transcription skills (Q2), as well as simpler graphic recognitions (Q3 and Q4). The specific prompts for answering these questions were carefully formulated with the help of the MLLM itself to ensure that they were worded in the best possible way.

III. RESULTS

The evaluation of the models across the four questions reveals significant differences in their capabilities. A summary of our evaluation is given in Table I.

MLLM Model    Input    Q1    Q2    Q3    Q4
GPT-4V        Vocal    ✓     ∼     ✓     ✓
              Mono     ×     ×     ✓     ✓
              Piano    ×     ×     ×     ✓
Gemini        Vocal    ∼     ×     ✓     ∼
              Mono     ×     ×     ✓     ✓
              Piano    ×     ×     ×     ∼
Llama 3.2     Vocal    ×     ∼     ✓     ×
              Mono     ∼     ×     ✓     ∼
              Piano    ×     ×     ×     ✓
Mistral 7B    Vocal    ✓     ×     ✓     ✓
              Mono     ×     ×     ×     ∼
              Piano    ×     ×     ∼     ∼
Claude 3.5    Vocal    ✓     ∼     ✓     ✓
              Mono     ×     ×     ✓     ✓
              Piano    ×     ×     ✓     ✓

TABLE I: Observed performance of the MLLMs. ✓ denotes the cases where the model is able to provide a reasonable or accurate answer; ∼ means that the model does not provide a correct answer but exhibits some knowledge about the task; × indicates the cases where the model clearly fails.

For Q1, the models generally failed to identify the musical piece from the score, as they could not interpret enough musical information. The exception was the vocal example, where some models successfully identified the song “My Way” due to their ability to recognize and process the lyrics, highlighting their reliance on textual rather than musical data for recognition.

In Q2, all models performed poorly, unable to transcribe the music notation into any symbolic format.² A minor exception was observed in vocal music, where the models managed to transcribe the lyrics accurately, but their transcription of music notation remained inaccurate.

² As mentioned above, the prompts were built using the model itself. In this sense, each model was asked for the output format it claimed to know (MusicXML or ABC, mainly).

In contrast, the models performed better in Q3 and Q4. For Q3, most models could infer the tonality with reasonable accuracy, suggesting that they could identify key signatures based on visual cues, even without detailed transcription of the music. For Q4, the models were generally accurate, particularly GPT-4V and Claude 3.5, demonstrating that they can detect visual patterns related to musical structure, even while struggling with specific notational details.

Overall, the results indicate that while the models are not equipped to read music, they are capable of extracting some visual information. This highlights the potential of MLLMs as a foundation for music reading systems, although significant improvements are required for their use in detailed OMR tasks.

IV. CONCLUSIONS

In this paper, we explored the potential of multimodal large language models (MLLMs) for understanding and interpreting music score images, a task traditionally handled by Optical Music Recognition (OMR) systems. Our preliminary experiments demonstrated that while MLLMs exhibit certain capabilities, such as recognizing lyrics in vocal music and identifying musical features like tonality and texture, they still struggle significantly with tasks that require detailed interpretation of musical notation.

Future work could focus on fine-tuning these MLLMs specifically for music score reading tasks, using techniques such as low-rank adaptation (LoRA) to adjust their weights for OMR tasks, or retrieval-augmented generation (RAG) approaches to enhance their ability to reference symbolic music knowledge.

ACKNOWLEDGEMENTS

This paper is supported by grant CISEJI/2023/9 from “Programa para el apoyo a personas investigadoras con talento (Plan GenT) de la Generalitat Valenciana”.
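The qualitative marks of Table I can also be tallied programmatically, which makes the per-question trends discussed in the Results section easy to check. The following is a minimal sketch, not part of the original study: the dictionary simply transcribes Table I, and the encoding ("Y" for ✓, "~" for ∼, "N" for ×) and the function names are assumptions chosen for illustration.

```python
from collections import Counter

# Table I transcribed: model -> input texture -> marks for (Q1, Q2, Q3, Q4).
# Assumed encoding: "Y" = reasonable answer, "~" = partial knowledge, "N" = clear failure.
TABLE_I = {
    "GPT-4V":     {"Vocal": "Y~YY", "Mono": "NNYY", "Piano": "NNNY"},
    "Gemini":     {"Vocal": "~NY~", "Mono": "NNYY", "Piano": "NNN~"},
    "Llama 3.2":  {"Vocal": "N~YN", "Mono": "~NY~", "Piano": "NNNY"},
    "Mistral 7B": {"Vocal": "YNYY", "Mono": "NNN~", "Piano": "NN~~"},
    "Claude 3.5": {"Vocal": "Y~YY", "Mono": "NNYY", "Piano": "NNYY"},
}
QUESTIONS = ("Q1", "Q2", "Q3", "Q4")


def tally_by_question(table):
    """Count how often each mark occurs per question, over all models and inputs."""
    counts = {q: Counter() for q in QUESTIONS}
    for rows in table.values():
        for marks in rows.values():
            for question, mark in zip(QUESTIONS, marks):
                counts[question][mark] += 1
    return counts


def successes_by_model(table):
    """Count the fully successful answers ("Y") obtained by each model."""
    return {
        model: sum(marks.count("Y") for marks in rows.values())
        for model, rows in table.items()
    }


if __name__ == "__main__":
    for question, counter in tally_by_question(TABLE_I).items():
        print(question, dict(counter))
    print(successes_by_model(TABLE_I))
```

Under this encoding, no model obtains a full success on Q2, whereas Q3 and Q4 account for most of the successful answers, in line with the discussion above.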
REFERENCES

[1] Jorge Calvo-Zaragoza, Jan Hajič Jr, and Alexander Pacha. Understanding optical music recognition. ACM Computing Surveys (CSUR), 53(4):1–35, 2020.
[2] Elona Shatri and György Fazekas. Optical music recognition: State of the art and major challenges. arXiv preprint arXiv:2006.07885, 2020.
[3] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
[4] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024.
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[6] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[7] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[8] Anthropic. Claude 3.5. https://www.anthropic.com, 2024. Language model developed by Anthropic.
Towards Sheet Music Information Retrieval: A Unified Approach Using Multitask Transformers

Antonio Ríos-Vila¹, Eliseo Fuentes-Martinez¹, Jorge Calvo-Zaragoza¹
¹ Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain

Abstract: Sheet Music Information Retrieval (SMIR) is a novel and rapidly evolving field within Music Information Retrieval that aims to extract, analyze, and retrieve information from sheet music documents. This discipline encompasses a wide range of tasks, including Optical Music Recognition (OMR), Optical Character Recognition (OCR), layout analysis, and content-based retrieval. SMIR has significant applications in musicology, digital libraries, and music education, enabling researchers and musicians to interact with and analyze large collections of sheet music more efficiently. Recent advancements in SMIR have been largely driven by Deep Learning-based approaches dedicated to specific tasks, which currently show remarkable improvements in accuracy and robustness compared to traditional methods. However, these approaches are isolated for their specific tasks, leading to a fragmented landscape of solutions and increased complexity in developing comprehensive SMIR applications. In this paper, we briefly define SMIR and address its challenges through an end-to-end approach using multitask learning and language modeling techniques. We present the Sheet Music Information Retrieval Transformer (SMIReT), a Transformer-based deep learning model that unifies multiple SMIR tasks within a single framework. Built upon the Sheet Music Transformer architecture, SMIReT leverages task-specific prompting and a unified vocabulary to handle diverse SMIR tasks seamlessly. We evaluate our model on the MOTTECTA corpus, a collection of early notation documents from the 17th century.
Results demonstrate the ability of SMIReT to perform multiple SMIR tasks within a single framework, showing promising results and challenges for the future of SMIR.

Index Terms: SMIR, Transformer, Mensural notation, Multitask learning

I. INTRODUCTION

The field of Optical Music Recognition (OMR) has evolved significantly since its conception [1], moving from multi-stage statistical learning pipelines [2], [3] to end-to-end deep learning-based approaches, where the detection and assembly of notation primitives [4]–[8] and sequence generation-based transcription [9]–[12] dominate the state of the art. This progress has led to advanced systems capable of extracting more than just the content of music scores, such as Layout Analysis for detecting regions of interest [13]–[15] or search systems based on transcriptions [16]. It has also led to multiple practical applications in the musicology field, where users are able to work hands-on with these technologies to process music scores [17], [18].

Despite these significant advances, practical OMR applications often still require task-specific models to extract all the information from a music score. This is inconvenient in terms of computing resources and maintainability. The main reason is that these systems have never been developed with the perspective of recognizing music as a whole. In the same way that analogous fields, such as Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR), are shifting towards the unification of seemingly isolated tasks through end-to-end models under the umbrella of Document Understanding [19], [20], music recognition should move in this direction. This end-to-end paradigm represents a promising direction for automatic information extraction from documents, as not only are all tasks resolved through the same method, but the information from each task helps to produce more accurate results in the others.
In this paper, we briefly define how this evolution can be formulated through the Sheet Music Information Retrieval (SMIR) challenge. Then, we explore whether end-to-end state-of-the-art OMR can also take this step. To do so, we introduce a first solution based on autoregressive Transformers, curriculum learning and task-specific prompting [21]. We test this approach with an early notation corpus, which is one of the application targets of musicological tools [17]. Results indicate that the approach is viable with promising performance, although several improvements are still required.

II. SHEET MUSIC INFORMATION RETRIEVAL

This paper addresses the challenge of SMIR. Although there is no formal definition, SMIR is a specialized field within Music Information Retrieval (MIR) that focuses on extracting, analyzing, and retrieving information from sheet music documents. That is, SMIR serves as an umbrella term for several tasks, such as OMR, Layout Analysis or content-based search.

Since this is the first time the term is defined and proposed, there are no specific tasks that settle the challenge. In this paper, we propose which tasks, based on the state of the art, should be considered to compose the SMIR challenge. These tasks are grouped in three families: parsing, layout recognition and queries.

A. Parsing tasks

The first group is composed of the tasks that involve end-to-end content extraction from the music scores. This mainly involves OMR, as it is the field that primarily studies the extraction of music content from score documents. Note, however, that text extraction-related tasks should also be considered, as some score documents may contain text paragraphs or lyrics.
Given this, we propose three parsing tasks: Full Parsing, where all the content of the document is extracted; OMR, where only music is recognized; and OCR, where the model should output only the text sections of the document.

B. Layout recognition tasks

The second group is related to the detection and extraction of the graphical elements of the music score, which is mainly addressed by the Layout Analysis field. We, therefore, formulate this group as an object detection task, where we benchmark both region of interest extraction and classification.

C. Query-based tasks

The queries group refers to all the tasks where the system must give an answer based on the input document and specific instructions. This group serves as a proxy for user interaction with the system. In this case, we propose two main tasks. The first one, named selective OMR, refers to the partial transcription of the music score given a specific set of bounding boxes. This way, we assess the awareness of the model of the structure of the music score, as well as its capability to guide the reading in a non-hierarchical reading order.¹ Then, we also propose pattern search queries, where the user inserts a specific music sequence of a varying number of notes and the model outputs the bounding boxes of the regions in the score that contain the pattern, or none if not found.

III. SHEET MUSIC INFORMATION RETRIEVAL TRANSFORMER

In this paper, we present the Sheet Music Information Retrieval Transformer (SMIReT) model, which is the next step of the Sheet Music Transformer (SMT) transcription architecture to address SMIR.

A. Sheet Music Transformer

The SMT is an autoregressive neural network designed for music transcription [22], [23]. It is composed of two key components: an encoder and a decoder. The encoder functions as a feature extractor, taking an input image $x$ and producing a feature map $x'_e$.
The decoder, built upon an autoregressive conditioned language model, predicts the probability of each symbol in the vocabulary at a given timestep. This prediction is based on both the output of the encoder and the sequence of previously generated symbols, formalized as:

$$\hat{y} = \arg\max_{\hat{y} \in \Sigma} P\left(\hat{y}_t \mid x'_e, (\hat{y}_0, \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{t-1})\right) \tag{1}$$

Here, $\Sigma$ represents the comprehensive symbol vocabulary encoding musical content, $x'_e$ is the encoded feature map, and $t$ denotes the current timestep.

1) Encoder: The encoder processes an input image $x \in \mathbb{R}^{c \times h \times w}$, where $h$, $w$, and $c$ represent height, width, and number of channels, respectively. Leveraging Convolutional Neural Networks, the encoder transforms this input into a set of $c_e$ two-dimensional feature maps, denoted as $x_e \in \mathbb{R}^{h_e \times w_e \times c_e}$. The dimensions $h_e$ and $w_e$ are related to the original image dimensions by factors $r_h$ and $r_w$, representing the downscaling effect of the network.

¹ Bear in mind that full parsing always follows a specific reading order given by the layout of the page.

2) Decoder: The decoder is built upon the Transformer architecture, currently the state-of-the-art approach for conditional sequence generation tasks. At each timestep $t$, the decoder generates a probability distribution $p_t \in \mathbb{R}^{|\Sigma|}$ over the symbol vocabulary $\Sigma$. This distribution is conditioned on both the output of the encoder $x'_e$ and the previously predicted tokens $(\hat{y}_0, \ldots, \hat{y}_{t-1})$. The prediction process begins with a special start-of-transcription symbol and concludes upon generating an end-of-transcription token. To bridge the dimensional gap between the 2D output of the encoder and the sequential nature of the decoder, the feature map is flattened. To preserve the spatial intricacies of full-page music scores, a two-dimensional positional encoding is integrated into the feature maps before flattening [24], [25].

B. SMIReT: a multitask SMT for SMIR

To achieve multitask processing capabilities, the SMIReT model adapts the SMT through task prompting, as in other Document Understanding approaches [19]. These prompts act as task-specific cues, allowing the model to adapt its behavior based on the desired output. Referring to Equation 1, we modify the input of the decoder as:

$$d = p \cup (\hat{y}_0, \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{t-1}) \tag{2}$$

where $d$ is the decoder input and $p$ is the prompt sequence, tokenized through the prompt vocabulary $\Sigma_p$. Therefore, Equation 1 is expressed as:

$$\hat{y} = \arg\max_{\hat{y} \in \Sigma} P\left(\hat{y}_t \mid x'_e, d\right) \tag{3}$$

By incorporating different tokens into the prompts and unifying the input and output vocabulary, the SMT can perform all the tasks that are described in Section II end-to-end. An example of this mechanism is shown in Figure 1.

One of the challenges of following this specific formulation is the unification of the SMIR vocabulary, which is multimodal due to the diversity of tasks, through a single language model. To approach this, the SMIReT output vocabulary is composed of music tokens in agnostic encoding, where notes are depicted by their shape and position; characters for text; absolute positions for bounding boxes, following the Pix2Seq methodology [26]; and special tokens for music region categories.

C. Training procedure

Since we are dealing with an autoregressive Transformer, we perform curriculum-based learning with synthetic generation, which is composed of two main processes:
• Full parsing training: The model is trained in full-page parsing with synthetic samples. This training is done incrementally, feeding the model with pages with an increasing amount of music staves with text regions to transcribe.
• Target fine-tuning: The pretrained model is fine-tuned in all the SMIR tasks at the same time. In this case, we also follow an incremental curriculum learning, where synthetic samples are interleaved with target ones.
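The task-prompting mechanism of Section III-B can be illustrated with a small decoding loop. The sketch below is a toy illustration under stated assumptions rather than the SMIReT implementation: the prompt tokens `<omr>` and `<ocr>`, the marker strings `<sot>` and `<eot>`, and the scripted stand-in model `toy_next_token_probs` are all hypothetical, and greedy selection of the most probable symbol at each step is assumed. Only the structure (prompt prepended to the previously generated tokens, then one prediction per timestep until the end-of-transcription token) follows the text.

```python
# Hypothetical marker tokens for start and end of transcription.
SOT, EOT = "<sot>", "<eot>"

# Scripted outputs for two hypothetical task prompts; a real model would
# compute these distributions from the encoder features instead.
SCRIPTS = {
    "<omr>": ["clef.C:L2", "metersign.Ccut:L3", "note.whole:L4", EOT],
    "<ocr>": ["E", "l", EOT],
}
VOCAB = sorted({token for seq in SCRIPTS.values() for token in seq})


def toy_next_token_probs(features, d):
    """Stand-in for the decoder: a distribution over VOCAB given the input d."""
    task = d[0]                        # the prompt token selects the task
    step = len(d) - d.index(SOT) - 1   # number of symbols generated so far
    target = SCRIPTS[task][step]
    return {token: (1.0 if token == target else 0.0) for token in VOCAB}


def greedy_decode(next_token_probs, features, prompt, max_len=32):
    """Prompt-conditioned autoregressive decoding with greedy selection."""
    generated = [SOT]
    while len(generated) - 1 < max_len:
        d = list(prompt) + generated            # decoder input: prompt + history
        probs = next_token_probs(features, d)   # distribution over the vocabulary
        token = max(probs, key=probs.get)       # most probable symbol
        generated.append(token)
        if token == EOT:
            break
    return generated[1:]                        # drop the start symbol


if __name__ == "__main__":
    print(greedy_decode(toy_next_token_probs, None, ["<omr>"]))
    print(greedy_decode(toy_next_token_probs, None, ["<ocr>"]))
```

The same image features yield a music token sequence or a character sequence depending only on the prompt, which is the behavior the unified vocabulary is meant to enable.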
Fig. 1: Architecture of the SMIReT model with some examples of task prompting.

IV. EXPERIMENT

A. Data

We evaluated the SMIReT model in the application of information extraction from early notation documents. These datasets are, to date, the ones that contain most of the information required to target SMIR, due to the effort that has been put into their digital preservation [17]. We experimented with the MOTTECTA corpus [27], which is a set of 297 printed pages from a collection of Mensural books of the “Biblioteca Digital Hispánica” dated from the 17th century. The pages are completely labeled, both regions and text, covering both parsing and layout analysis tasks. In the case of query-based approaches, we randomly generate the queries from the information given in the dataset, by selecting a specific set of regions in the case of selective OMR and by picking up chunks of the ground truth from pages in the case of pattern-matching queries. The dataset has been split into fixed partitions, where 60% of the samples have been used for training, 20% for validation, and 20% for testing.

For synthetic generation, we construct samples through the PRIMENS dataset, which is a large collection of synthetically rendered mensural music incipits [27].

B. Results

Table I reports the performance of the SMIReT model on the test set of the MOTTECTA dataset.

TABLE I: Results of the performance obtained by the SMIReT model on the different tasks proposed for SMIR for the MOTTECTA dataset.

Task                        Metric       SMIReT
Parsing tasks
  Full parsing              Music SER     6.05
                            Text CER     15.30
  OMR                       SER           5.92
  OCR                       CER          10.08
Layout recognition tasks
  Region detection          IoU          70.23
  Classification            F1           97.00
Query-based tasks
  Selective OMR             SER          41.55
  Pattern match             Accuracy     73.80
                            IoU          75.03

First of all, we observe that the SMIReT model is capable of learning all the SMIR tasks successfully and shows acceptable performance.

Results on parsing tasks reveal intriguing dynamics in the SMIReT multitask learning capabilities. In isolated tasks, the model performs best, with 5.92% SER in OMR and 10.08% CER in OCR. However, when faced with the full parsing task that combines both music and text recognition, we observe a slight degradation (2.9%) in music recognition and a more substantial decline in text recognition (51.78%). This disparity suggests a potential bias in the attention mechanism of the model towards musical elements when processing mixed content. The relative stability of music recognition performance in the presence of text, contrasted with the more significant deterioration of text recognition in the presence of musical notation, indicates that the features learned for music recognition are more robust and less susceptible to interference. This phenomenon may be attributed to the more structured and standardized nature of musical notation compared to the variability inherent in textual elements found in sheet music.

When analyzing the layout recognition tasks, a paradox emerges, where the model demonstrates high classification accuracy, with a 97.00% F1 score, but moderate performance in region detection, with 70.23% IoU, notably below state-of-the-art Layout Analysis techniques [15].
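For reference, the IoU reported for region detection is commonly computed as the intersection-over-union between a predicted and a ground-truth region. The sketch below is a minimal illustration under assumptions not stated in the paper: boxes are taken to be axis-aligned and given as `(x_min, y_min, x_max, y_max)`, and the function name `box_iou` is hypothetical. How the per-box values are aggregated into the reported figure is not specified in the source.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    ground_truth = (0, 0, 100, 40)   # e.g. a staff region
    prediction = (10, 0, 110, 40)    # same size, shifted 10 pixels to the right
    print(round(box_iou(ground_truth, prediction), 4))
```

A prediction with the correct category but slightly misplaced borders is still penalized by this measure, which is why classification and region detection scores can diverge.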
This discrepancy suggests that while the model excels at recognizing the nature of different regions within a sheet music image, it struggles with precisely localizing these regions or defining their boundaries. A possible explanation is that the model learns the structure of the music document through the language model, but this structure does not necessarily correspond to precise spatial information, so that some regions, as Figure 2 shows, may be missed by the network.

Fig. 2: Visualization of the SMIReT performance on a test sample of the MOTTECTA dataset in the tasks of layout recognition.

The analysis of the query-based tasks reveals that the model is able to detect patterns and correlate them to the general features of the image, as shown by the 73.80% accuracy on the pattern matching task. However, when pixel-wise contextual information is involved, both in the input, through selective OMR, and in the output, with a reported 75.03% IoU, the model struggles in the same way as in the layout recognition tasks. This points to a potential shortcoming in the integration between the visual processing capabilities of the model and its natural language understanding or instruction-following modules. Addressing this limitation could involve developing more sophisticated attention mechanisms that can dynamically focus on relevant parts of the input based on query requirements, as well as improving the ability of the model to ground natural language queries in the visual domain of sheet music.

V. CONCLUSION

In this paper, we present the Sheet Music Information Retrieval (SMIR) challenge, a novel research field in Music Information Retrieval that seeks to extract the information from music score documents. We study the capability of deep learning models to address the challenge in an end-to-end fashion. To do so, we propose the Sheet Music Information Retrieval Transformer (SMIReT) model, a Transformer-based model that adapts state-of-the-art OMR to address multitask learning.
Our model demonstrates the feasibility of addressing SMIR tasks (full parsing, OMR, OCR, layout recognition, and query-based retrieval) within a single, unified framework. The evaluation on the MOTTECTA corpus reveals promising results.

However, our study also uncovers several challenges that warrant further investigation. The performance disparity between music and text recognition in mixed-content scenarios suggests a need for more balanced feature learning. The discrepancy between high classification accuracy and moderate region detection performance in layout analysis tasks indicates room for improvement in spatial understanding. Additionally, the model's struggles with pixel-wise contextual information in query-based tasks highlight the need for enhanced integration between the visual processing and language understanding components.

These findings open up several avenues for future research. Developing more sophisticated attention mechanisms could improve the ability of the model to focus on relevant parts of the input based on task requirements. Furthermore, exploring ways to balance the learning of features for different modalities (music notation, text, spatial information) could lead to more robust performance across all SMIR tasks.
Semantic Reconstruction of Sheet Music with Graph-Neural Networks

Grégoire de Lambertye
TU Wien, Austria
e12202211@student.tuwien.ac.at

Alexander Pacha
Institute of Visual Computing and Human-Centered Technology
TU Wien, Austria
alexander.pacha@tuwien.ac.at

Abstract: Optical Music Recognition (OMR) is a field of research that investigates how to computationally read music notation. Many OMR systems operate by first detecting all objects in an image, and then using heuristics to recover the relationships between the musical primitives to reconstruct their semantics. These heuristics are inherently limited, and there is a significant lack of research on performing the semantic reconstruction more adequately. This paper investigates how Graph Neural Networks (GNNs) can be used to perform the semantic reconstruction of notated music. We developed a versatile pipeline and demonstrated the capacity of GNNs to effectively recover the relations between the musical primitives. However, challenges related to the instability and sensitivity of the GNNs indicate that, despite their potential, these models may not be the optimal solution for this task either.

Index Terms: Optical Music Recognition, Graph Neural Network, Link Prediction, Semantic Reconstruction

I. INTRODUCTION

Many OMR systems divide the task of reading music into a 4-stage pipeline. This pipeline starts with image preprocessing, followed by the music object detection stage, which retrieves the locations of all musical primitives and assigns each element a class label. The third stage is the semantic reconstruction, which attempts to recover the relationships between the primitives. The last stage is called encoding and converts the internal representation into a standardized format such as MusicXML.
A useful representation for recovering the semantics of music notation is the Music Notation Graph (MuNG), illustrated in Figure 1. The notion of a MuNG has been used before [1], [2], [3], but there is no commonly accepted definition; the shared understanding is that musical primitives (e.g., noteheads, accidentals, or clefs) are the nodes of the graph, and an edge represents a relationship between two primitives. Definitions of MuNGs vary primarily in how these edges are constructed: for instance, some MuNGs include edges between accidentals and noteheads, while others exclude them.

This paper investigates how to construct MuNGs from the output of a music object detector with GNNs, and more specifically how to predict the existence of syntactic edges between the primitives that form notes.

GNNs are a class of machine learning models introduced by Gori et al. [4]. Unlike traditional neural networks that operate on regular grid-like structures, GNNs directly process graph-structured data. They can learn node embeddings from information aggregated over a neighborhood, and they demonstrate state-of-the-art capabilities in link prediction.

Fig. 1: Music Notation Graph

II. RELATED WORK

The term Music Notation Graph (MuNG) was first used by Hajič et al. [5] to build the MUSCIMA++ dataset. This format has since been adopted by other datasets such as MusiGraph [1] and DoReMi [6]. Pacha et al. [3] formulate the link prediction task as a binary classification problem and apply a Convolutional Neural Network to construct a MuNG. Baró et al. [1] construct MuNGs by leveraging GNNs. While most of their architecture is kept private, they claim very good results and a Music Error Rate of 5%.

III. SEMANTIC RECONSTRUCTION WITH GRAPH NEURAL NETWORKS

GNNs require graph-structured data as input. To make use of the output of a music object detector, we can transform the list of detected objects into a feature matrix.
Instead of directly constructing the adjacency matrix, we propose to build an over-complete graph called the Candidate Graph (CG), which is then pruned using the GNN to obtain the final graph. Figure 2 illustrates the proposed pipeline.

A. Building the Feature Matrix

The feature matrix contains 1 row for each detected musical primitive and stores its size and position on the page, as well as a one-hot encoding of its class label.

The first challenge is the set of classes, i.e., the vocabulary that is used to encode the musical primitives, which can be different for each encoding and each dataset. In the simplest case, the same concept just has a different name, e.g., noteheadFull vs. noteheadBlack. However, it gets more complicated when datasets have a different granularity, e.g., a flag being split into the classes flagUp and flagDown.

Fig. 2: Training Pipeline (diagram: the input labels and positions with their relations pass through label mapping into the feature matrix; a kNN graph with a chosen k is semantically pruned and normalised into the candidate graph; the GNN and classifier produce the predicted graph, with the ground-truth graph used for training)

Generally, it is beneficial to have the most detailed granularity, as it can be simplified to a reduced set of classes while maintaining potentially relevant information (e.g., the direction of a stem might help to determine the voice in a polyphonic staff). We propose to map the set of classes of a given dataset to a reduced set of classes that is optimized for reconstructing the relationships between primitives. Input classes that never form relations can be filtered out and removed. The mapping also does not need to be bijective, because we can use the ID that the object detector assigns to each object to retrieve its original class.
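The feature-matrix construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the mapping `CLASS_MAP`, the reduced class list, the tuple layout of a detection, and the function name `build_feature_matrix` are all assumptions made for the example.

```python
# Hypothetical mapping from dataset-specific labels to a reduced class set.
# Labels mapped to None stand for classes that never form relations and are
# therefore dropped (which classes those are is an assumption here).
CLASS_MAP = {
    "noteheadFull": "notehead",
    "noteheadBlack": "notehead",
    "stem": "stem",
    "flag8thUp": "flag",
    "flag8thDown": "flag",
    "clef": None,
}
REDUCED_CLASSES = ["notehead", "stem", "flag"]


def build_feature_matrix(detections):
    """Build one feature row per kept primitive.

    detections: list of (object_id, label, left, top, width, height).
    Returns (ids, rows); each row is
    [left, top, width, height, one-hot over REDUCED_CLASSES].
    """
    ids, rows = [], []
    for obj_id, label, left, top, width, height in detections:
        reduced = CLASS_MAP.get(label)
        if reduced is None:  # unknown or relation-free class: filter out
            continue
        one_hot = [0.0] * len(REDUCED_CLASSES)
        one_hot[REDUCED_CLASSES.index(reduced)] = 1.0
        rows.append([float(left), float(top), float(width), float(height), *one_hot])
        # The detector ID lets us recover the original (finer) class later,
        # which is why the mapping does not have to be bijective.
        ids.append(obj_id)
    return ids, rows
```

Because the IDs are returned alongside the rows, a coarse class such as `flag` can still be traced back to `flag8thUp` or `flag8thDown` after link prediction.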
In our experiments, the models learned more efficiently with smaller, coarser sets of classes, e.g., when using a single class for all flags instead of multiple classes (8thFlagUp, 8thFlagDown, 16thFlagUp, ...). The two final sets of classes in our experiments have only 6 and 10 classes, respectively. The details are given in the Appendix.

B. Building the Adjacency Matrix

Constructing the adjacency matrix directly with a GNN would be ideal; however, GNNs require an input graph to operate on. Link prediction is therefore performed by starting from an initial graph and removing its undesired edges after processing. The simplest approach to obtain an initial graph would be to construct a fully-connected graph. However, this is computationally prohibitive for large graphs. If we used GNNs without any connections, they would not work efficiently either, as they use the edges to let information flow from one node to another. Given that related music primitives are spatially close to one another, a k-Nearest Neighbors (kNN) approach seems suitable. An exploration of the MusiGraph dataset shows that, after removing primitives that never form relations, k=13 is sufficient to include every relation from the ground truth. It is important to choose k sufficiently high, because an edge that is missing from the CG will also be missing from the final adjacency matrix, given that we only cut edges.

To further optimize our pipeline, grammatical rules are applied to semantically prune the initial kNN graph, removing edges that would not exist in an errorless MuNG; for example, links between 2 noteheads are pruned.

We also normalize the scores: the top-most, bottom-most, left-most, and right-most primitive bounding box edges are used to perform a min-max normalization. This normalization ensures that the pipeline can work with different fonts or handwritten notation as well as images of any size.
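The three steps above (kNN candidate edges, semantic pruning, and min-max normalization) can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline used in the paper: the grammar in `ALLOWED_PAIRS`, the use of bounding-box centers for the distance, and both function names are hypothetical.

```python
import math

# Hypothetical grammar: unordered class pairs that may be linked in a valid
# MuNG. A notehead-notehead pair is absent, so such edges get pruned.
ALLOWED_PAIRS = {
    frozenset({"notehead", "stem"}),
    frozenset({"notehead", "flag"}),
    frozenset({"notehead", "beam"}),
    frozenset({"notehead", "accidental"}),
}


def candidate_graph(centers, labels, k):
    """Return undirected candidate edges (i, j) with i < j.

    centers: list of (x, y) points, assumed here to be bounding-box centers.
    labels:  reduced class label of each primitive.
    """
    edges = set()
    for i, center in enumerate(centers):
        # The k nearest other primitives by Euclidean distance.
        neighbors = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(center, centers[j]),
        )[:k]
        for j in neighbors:
            # Semantic pruning: keep only pairs the grammar allows.
            if frozenset({labels[i], labels[j]}) in ALLOWED_PAIRS:
                edges.add((min(i, j), max(i, j)))
    return edges


def min_max_normalize(boxes):
    """Scale (left, top, right, bottom) boxes to [0, 1] per axis, using the
    left-most, top-most, right-most, and bottom-most box edges on the page."""
    x_min = min(box[0] for box in boxes)
    y_min = min(box[1] for box in boxes)
    x_max = max(box[2] for box in boxes)
    y_max = max(box[3] for box in boxes)
    width = (x_max - x_min) or 1.0   # avoid division by zero
    height = (y_max - y_min) or 1.0
    return [
        ((l - x_min) / width, (t - y_min) / height,
         (r - x_min) / width, (b - y_min) / height)
        for l, t, r, b in boxes
    ]
```

Note that the per-axis scaling changes absolute distances between primitives, which depends on the page extent rather than on the notation itself.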
While normalization is usually beneficial, we observe a negative impact for GNNs: the distance between certain objects (e.g., notehead and stem) is usually stable due to the typesetting process. With normalization, these scores are distorted and the model cannot learn from these distances.

C. Model architecture

The model consists of a GNN, which learns a node embedding, and a classifier, which decides whether an edge should be kept or pruned. The CG edges are considered undirected to enable the information to flow both ways along an edge in the GNN. The GNN is composed of 3 GraphSAGE layers, separated by ReLU activation functions. The intermediate representation has a size of 2048 and the final output embedding has a size of 1024. The classifier takes those embeddings and calculates the cross-product between the 2 endpoints of an edge. If the value is above 0.5, the edge is kept; otherwise it is pruned.

We use the Binary Cross-Entropy loss function to train our models and a learning rate scheduler which reduces the learning rate if the validation loss stagnates for 15 epochs. While we started with a standard early-stopping mechanism, we observed that the training and validation losses initially decreased and then started to increase again. To force the model to explore promising areas, we implement a novel mechanism that we call jump back on learning rate change, illustrated in Figure 3. It resets the weights to the best-saved configuration when decreasing the learning rate. This approach automates a training restart from the checkpoint that previously had the lowest validation loss.

Fig. 3: Illustration of the jump back on learning rate change mechanism (plot of validation loss against epoch, comparing standard training with jump_back_on_lr_change): Epoch 10 yields the best validation loss with the initial learning rate. After 10 epochs without improvement, the learning rate is reduced and the snapshot from epoch 10 gets loaded.

IV. METRICS

OMR lacks a universally accepted, intrinsic metric that is easily interpretable and corresponds directly to the proportion of recovered musical information. Such a metric would enable reliable comparisons of OMR system performance without the need for expensive user studies. However, some standard metrics remain useful for evaluating OMR systems. In the specific case of MuNGs, we can define 2 types of metrics: those based on the binary classification aspect of the task and those based on its graph structure.

The task of music semantic reconstruction can be formulated as a binary classification problem, where each edge in the CG is classified as either existing or not. This allows the evaluation of a model using standard binary classification metrics such as accuracy. However, the informative value of these metrics is limited: adding trivial or easily removable edges to a CG can increase accuracy without improving the meaningfulness of the predicted graph. Thus, while binary classification metrics reflect how well a model has learned, they do not necessarily capture how valuable its output is.

Some limitations of these metrics disappear if we evaluate the output as a graph. A more robust approach is to use the Graph Edit Distance (GED) as defined in equation 1. It measures the cost of the least-cost sequence of edit operations required to transform one graph into another. This metric is effective when comparing different models on the same score; however, it becomes problematic when comparing performance across different scores, such as monophonic vs. polyphonic music. Differences in the size of the ground truth graphs make GED values incomparable. To address this, the Music Error Rate (MER), defined in equation 2, can be more meaningful.
It normalizes the number of edit operations by the size of the graph (see [7] for more details).

$$\mathrm{GED}(G_1, G_2) = \min_{(o_1, o_2, \ldots, o_k)} \sum_{i=1}^{k} c(o_i) \tag{1}$$

where $(o_1, \ldots, o_k)$ is a sequence of edit operations that transforms $G_1$ into $G_2$, and $c(o_i)$ is the cost of operation $o_i$.

$$\mathrm{MER} = \frac{I + R + S}{T} = \frac{\mathrm{GED}}{T} \tag{2}$$

where $I$, $R$, and $S$ are the numbers of insertions, deletions, and substitutions needed to obtain the ground truth, and $T$ is the number of edges in the ground truth graph.

While the MER provides a more balanced comparison, it is still not flawless, as the size of the ground truth graph does not always correlate with the complexity of the musical notation. Moreover, there is no single standard for constructing edges in a MuNG, meaning that multiple valid MuNGs could represent the same musical score. Consequently, graph-based metrics are only comparable across MuNGs that have been built according to the same rules.

V. INSTABILITY

In the course of our experiments, we encountered significant challenges related to the instability and reproducibility of results. The first barrier to reproducible and stable results is inherent to the library we used: PyTorch Geometric. Although a seed is set to control many sources of randomness, some operations retain non-deterministic behavior during GPU execution [8]. The second barrier is inherent to GNNs, which are known to be unstable [9]. While we aimed for reproducible experiments, we noted that different seeds led to vastly different outcomes.

VI. RESULTS

It is important to acknowledge that achieving a Graph Edit Distance or Music Error Rate of 0 is not a realistic expectation in this study. The datasets employed, such as MusiGraph, inherently contain an unknown number of errors. The scores in the MUSCIMA++ and DoReMi datasets have been divided by measure to align with the characteristics of MusiGraph. This division process certainly introduced some errors as well [7].
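Equations 1 and 2 from Section IV can be illustrated with a small sketch. It rests on simplifying assumptions that are not stated in the paper: both graphs share the same node IDs, edges are unlabeled and undirected, and every edit operation has unit cost. Under these assumptions no substitutions occur, and the GED reduces to the number of edges present in exactly one of the two graphs. The function names are hypothetical.

```python
def edge_edit_counts(predicted_edges, truth_edges):
    """Count unit-cost edge edits that turn the prediction into the truth.

    Both arguments are iterables of (i, j) node-ID pairs over the same nodes.
    Returns (insertions, removals).
    """
    predicted = {frozenset(edge) for edge in predicted_edges}
    truth = {frozenset(edge) for edge in truth_edges}
    insertions = len(truth - predicted)   # edges the model missed
    removals = len(predicted - truth)     # edges the model should have pruned
    return insertions, removals


def graph_edit_distance(predicted_edges, truth_edges):
    """GED under the assumptions above: size of the symmetric difference."""
    insertions, removals = edge_edit_counts(predicted_edges, truth_edges)
    return insertions + removals


def music_error_rate(predicted_edges, truth_edges):
    """MER = (I + R + S) / T with S = 0 for unlabeled edges."""
    truth = {frozenset(edge) for edge in truth_edges}
    if not truth:
        raise ValueError("MER is undefined for an empty ground truth graph")
    return graph_edit_distance(predicted_edges, truth_edges) / len(truth)
```

For example, a prediction that misses 2 of 4 ground truth edges and keeps 1 spurious edge has a GED of 3 and an MER of 0.75, which shows why the MER can exceed the share of wrongly classified candidate edges.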
Table I shows the performance of different models for each dataset using the 10-labels class set, described in the Appendix. To get a better impression of the performance of these models, Figures 4, 5, and 6 visualize the predicted graph for 3 different models.

The algorithm that is used to divide the scores has a couple of drawbacks, including that some one-page scores (notably in MUSCIMA++) are treated as a single measure. In addition, the increased complexity of the scores makes the 13 nearest neighbors insufficient to obtain inclusive CGs. To account for the different types of scores, we set k to 20 for the MUSCIMA++ and DoReMi datasets cut by measure. Setting k to 20 does not guarantee that the CG is inclusive either: for the MUSCIMA measure cut, 80% of the ground-truth edges are included in the CG, and for the DoReMi measure cut, 91%. An improved algorithm for dividing scores by measure should be used to select a more meaningful value for k.

One improvement that we did not implement would be to integrate the grammar of music notation directly into the construction of the graph: each node would be connected only with the k nearest neighbors that it could theoretically connect to, instead of being connected to its k nearest neighbors with the graph pruned afterwards. We hypothesize that with this improvement, k could be even smaller, leading to smaller CGs.

Not all models perform equally across different link types, where a link type is defined by the classes of its 2 endpoints. Based on this observation, we can imagine an ensemble approach in which multiple models are employed together. During the prediction phase, each model generates predictions. For each link, depending on its type, we select the prediction from the model that has demonstrated the best performance for that link type. Table II shows the performance of 4 models across the link types.
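The per-link-type selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data layout (dictionaries keyed by model name and by edge), the function names, and the reuse of the 0.5 decision threshold for the ensemble are assumptions.

```python
def pick_best_models(accuracy_by_model):
    """Map each link type to the model with the highest accuracy for it.

    accuracy_by_model: {model_name: {link_type: accuracy}}, where a link type
    is a frozenset of the two endpoint classes.
    """
    best = {}
    for model, per_type in accuracy_by_model.items():
        for link_type, accuracy in per_type.items():
            current = best.get(link_type)
            if current is None or accuracy > accuracy_by_model[current][link_type]:
                best[link_type] = model
    return best


def ensemble_predict(edges, labels, scores_by_model, best_model, threshold=0.5):
    """Keep each candidate edge according to the model chosen for its type.

    edges:           candidate edges (i, j) from the CG.
    labels:          class label of each node.
    scores_by_model: {model_name: {(i, j): predicted score}}.
    """
    kept = []
    for i, j in edges:
        link_type = frozenset({labels[i], labels[j]})
        model = best_model[link_type]
        if scores_by_model[model][(i, j)] > threshold:
            kept.append((i, j))
    return kept
```

With per-type accuracies like those in Table II, such a selection would route noteheadBlack-beam links to the model with the highest accuracy for that type and the remaining link types to whichever model scores best on them.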
Leveraging this ensemble strategy and combining these 4 models, we obtained the results presented in Table III. One of these models is based on a Geo-GCN layer [10] instead of a GraphSAGE layer. This pipeline also uses a different set of classes, 6-labels, that is more suited to MusiGraph. It encodes fewer primitives but has a greater granularity (see Appendix for details).

TABLE I: Best models obtained for the different datasets with the 10-labels class set

| Models | Dataset | k | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | MER (%) | GED |
|---|---|---|---|---|---|---|---|---|
| model 1 | MusiGraph | 13 | 94.45 | 96.97 | 95.15 | 97.00 | 13.48 | 2.41 |
| model 2 | DoReMi measure cut | 20 | 89.71 | 71.20 | 84.75 | 91.00 | 37.02 | 8.39 |
| model 3 | MUSCIMA measure cut | 20 | 84.37 | 64.71 | 72.00 | 88.00 | 38.08 | 13.87 |

TABLE II: Accuracy (%) obtained for each link type by a selection of models trained and evaluated on MusiGraph for the 6-labels class set

| Models | Layer | noteheadBlack - stem | noteheadBlack - accidental | noteheadBlack - flag | noteheadBlack - beam | noteheadWholeOrHalf - stem | noteheadWholeOrHalf - accidental |
|---|---|---|---|---|---|---|---|
| model 4 | graphSAGE | 98.84 | 99.00 | 99.38 | 70.41 | 98.80 | 90.87 |
| model 5 | graphSAGE | 98.89 | 98.96 | 99.37 | 69.02 | 98.80 | 90.83 |
| model 6 | geoGCN | 91.52 | 90.53 | 90.34 | 89.43 | 96.63 | 89.99 |
| model 7 | graphSAGE | 98.85 | 98.97 | 99.41 | 68.45 | 98.61 | 90.83 |

TABLE III: Metrics obtained with the model ensemble and the 6-labels class set

| Dataset | k | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | MER (%) | GED |
|---|---|---|---|---|---|---|---|
| MusiGraph | 13 | 97.09 | 97.76 | 92.70 | 99.06 | 6.09 | 0.70 |

VII. DISCUSSION AND CONCLUSION

Our pipeline incorporates certain decisions that are open to discussion. The first limitation concerns the set of classes. Our model relies on specific class sets, and the primitives' labels must be one-hot encoded to form the feature vectors. Such a framework makes it impossible to adapt a pre-trained model to accept new classes.
Another critical area is the normalization step. While intended to standardize musical scores for versatility, the approach was excessive. A more moderate strategy, using staff size as a reference for normalization, would have aligned better with the inherent properties of the musical data and potentially improved the model's performance. The large majority of scores use typeset staves, and treating each of them as different from the others is probably excessive.

Despite these criticisms, the testing framework itself is robust and provides a solid foundation for evaluating model performance. It allows for an objective and comprehensible measurement of how good the models are.

We have demonstrated that GNNs can be applied to the semantic reconstruction stage of the OMR pipeline with acceptable performance. The performance could probably be improved with a better combination of parameters. However, the sensitivity and instability of these models may limit their suitability as the optimal solution.

REFERENCES

[1] A. Baró, P. Riba, and A. Fornés, "Musigraph: Optical music recognition through object detection and graph neural network," in Frontiers in Handwriting Recognition - 18th International Conference, ICFHR 2022, Hyderabad, India, December 4-7, 2022, Proceedings, ser. Lecture Notes in Computer Science, U. Porwal, A. Fornés, and F. Shafait, Eds., vol. 13639. Springer, 2022, pp. 171–184. [Online]. Available: https://doi.org/10.1007/978-3-031-21648-0_12
[2] J. Hajič jr., M. Dorfer, G. Widmer, and P. Pecina, "Towards full-pipeline handwritten OMR with musical symbol detection by U-Nets," in International Society for Music Information Retrieval Conference, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:53048053
[3] A. Pacha, J. Calvo-Zaragoza, and J. Hajič jr., "Learning notation graph construction for full-pipeline optical music recognition," in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR 2019), 2019, pp. 75–82. [Online]. Available: https://doi.org/10.5281/zenodo.3527744
[4] M. Gori, G. Monfardini, and F. Scarselli, "A new model for learning in graph domains," in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2, 2005, pp. 729–734.
[5] J. Hajič and P. Pecina, "The MUSCIMA++ dataset for handwritten optical music recognition," in 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017. IEEE, 2017, pp. 39–46.
[6] E. Shatri and G. Fazekas, "Doremi: First glance at a universal OMR dataset," CoRR, vol. abs/2107.07786, 2021. [Online]. Available: https://arxiv.org/abs/2107.07786
[7] G. de Lambertye, "Music semantic reconstruction with deep learning," Master's thesis, Technical University of Vienna, Wien, Austria, Oct. 2024.
[8] PyTorch Contributors, Reproducibility — PyTorch 2.0 documentation, pytorch.org. [Online]. Available: https://pytorch.org/docs/stable/notes/randomness.html
[9] P. Veličković, "Everything is connected: Graph neural networks," CoRR, vol. abs/2301.08210, 2023.
[10] P. Spurek, T. Danel, J. Tabor, M. Smieja, L. Struski, A. Slowik, and L. Maziarka, "Geometric graph convolutional neural networks," CoRR, vol. abs/1909.05310, 2019. [Online]. Available: http://arxiv.org/abs/1909.05310

APPENDIX

Fig. 4: Example of model 2's prediction on the MUSCIMA++ dataset (cut by measure).

Fig. 5: Example of model 3's prediction on the DoReMi dataset (cut by measure). In this example, we see an error in the dataset: the links between the triple flags and their noteheads have been correctly predicted but are classified as false positives.

Fig.
6: Example of the model ensemble prediction on the MusiGraph dataset.

TABLE IV: 6-labels class set

| MusiGraph | MUSCIMA++ | DoReMi | 6-labels |
|---|---|---|---|
| stem | stem | stem | stem |
| notehead-full | noteheadFull | noteheadBlack | noteheadBlack |
| notehead-empty | noteheadHalf, noteheadFullSmall, noteheadWhole | noteheadHalf, noteheadWhole | noteheadWholeOrHalf |
| beam | beam | beam | beam |
| sharp, flat, natural | augmentationDot, accidentalSharp, accidentalFlat, accidentalNatural, accidentalDoubleSharp | augmentationDot, accidentalSharp, accidentalFlat, accidentalNatural, accidentalDoubleFlat, accidentalQuarterToneSharpStein, accidentalQuarterToneFlatStein, accidentalDoubleSharp, accidentalThreeQuarterTonesSharpStein | accidental |
| 16th flag, 8th flag | flag16thUp, flag16thDown, flag8thDown, flag8thUp | flag16thUp, flag16thDown, flag8thDown, flag8thUp, flag32ndUp, flag32ndDown | flag |

TABLE V: 10-labels class set

| MusiGraph | MUSCIMA++ | DoReMi | 10-labels |
|---|---|---|---|
| stem | stem | stem | stem |
| notehead-full, notehead-empty | noteheadFull, noteheadHalf, noteheadFullSmall, noteheadWhole | noteheadBlack, noteheadHalf, noteheadWhole | notehead |
| beam | beam | beam | beam |
| (missing) | augmentationDot | augmentationDot | augmentationDot |
| sharp, flat, natural | accidentalSharp, accidentalFlat, accidentalNatural, accidentalDoubleSharp | accidentalSharp, accidentalFlat, accidentalNatural, accidentalDoubleFlat, accidentalQuarterToneSharpStein, accidentalQuarterToneFlatStein, accidentalDoubleSharp, accidentalThreeQuarterTonesSharpStein | accidental |
| 16th flag, 8th flag | flag16thUp, flag16thDown, flag8thDown, flag8thUp | flag16thUp, flag16thDown, flag8thDown, flag8thUp, flag32ndUp, flag32ndDown | flag |
| (missing) | tie | tie | tie |
| (missing) | legerLine | (missing) | legerLine |
| (missing) | dynamicCrescendoHairpin, dynamicDiminuendoHairpin, slur | dynamicForte, dynamicPiano, dynamicFFF, dynamicPPP, dynamicFF, dynamicText, dynamicMP, dynamicFortePiano, dynamicPP, dynamicSforzato, dynamicMF, dynamicForzando, gradualDynamic, slur | others (slur, dynamics, etc.) |
| 8th rest, 16th rest, quarter rest, half rest | rest8th, rest16th, restQuarter, restHalf, restHBar, restWhole | rest8th, rest16th, rest32nd, restQuarter, restHalf, restWhole | rest |

Staff Layout Analysis Using the YOLO Platform

Vojtěch Dvořák, Jan Hajič jr., Jiří Mayer (B)
Institute of Formal and Applied Linguistics
Charles University, Prague, Czech Republic
Email: v.dvorak@matfyz.cz, hajicj@ufal.mff.cuni.cz, mayer@ufal.mff.cuni.cz
ORCID: 0009-0007-8423-5139, 0000-0002-9207-567X, 0000-0001-6503-3442

Abstract: Detecting staffs, systems, and measures, collectively known as layout analysis, matters for Optical Music Recognition (OMR), both because most systems today expect staff-level inputs, and because even if these are replaced by systems that can process the whole page, the staffs and systems are useful elements of OMR user interfaces and applications. It receives comparatively little attention, which is justified, as it avoids many of the class imbalance, small object, and object assembly phenomena that make OMR difficult and interesting. However, the main publicly available tool for layout analysis, the MeasureDetector, has not been updated for several years, and off-the-shelf object detection has progressed: not just in accuracy, but also in speed. Therefore, in this paper, we bring an update on the performance of OMR layout analysis with the state-of-the-art YOLO platform. Compared to the MeasureDetector, it achieves a similar or better accuracy across both in-domain and out-of-domain tests over three different datasets that we harmonized, it is more than 20x faster, and it requires more than 4 times less memory.

Index Terms: Optical Music Recognition, Layout Analysis, Deep Learning

I.
INTRODUCTION

One of the first steps in many Optical Music Recognition systems is detecting which regions of the music score image correspond to high-level elements of music notation: system, staff, and measure, usually as a staff detection step of the traditional OMR pipeline [4], [5], [17]. Assigning written music to these objects, especially staffs and systems, determines the reading order of the written page. This step can be done before or after individual image pixels are assigned to layers such as background, staff, and foreground [3], [6], [7].

Is staff layout analysis still a relevant task for OMR in the presence of end-to-end methods? Most end-to-end recognition methods to date have also operated on single staffs or systems [1], [15], [18]–[20]. Even though there are attempts to perform full-page recognition that learns to read the whole page without splitting it into these basic elements of music score layout [19], these are still initial experiments (though promising). Therefore, while system, staff and measure detection may not represent a key element of every OMR system, it is currently still a broadly applicable initial step that, while perhaps not as exciting as end-to-end recognition itself, has its place in the ecosystem.

Furthermore, even in the presence of well-performing end-to-end methods for processing the entire page, we believe having an explicit staff layout detected before processing entire pages may be highly useful in practice. Computing resources are not unlimited and transformer-based models and other recurrent models that represent the current state of the art [15], [19] are computationally expensive. Errors in staff layout (such as not assigning staffs to systems correctly) are extremely expensive to fix manually and lead to many compounding errors downstream.
So, in user-facing applications, it may be highly desirable to get staff layout information verified interactively, before running the relatively expensive recognition model itself.

Finally, also in the spirit of lowering computational (and therefore energy) costs, when one is trying to detect music notation in large collections of documents (in the millions of pages or more), the staff is the visually most distinct and clear sign of music notation's presence, in Common Western Music Notation (CWMN) as well as in mensural or medieval notation.¹ Sending an image into full OMR processing only when a staff is detected with a very high probability is thus a reasonable component in a practical library-scale system.

At the same time, detecting systems, staffs and measures is a sub-task of OMR that is not particularly affected by the music notation phenomena that make OMR as a whole so difficult [2], [4]. These notation objects occupy large convex regions of the image, and there are not many of them on any single page. Hence, existing object detection methods are expected to be entirely applicable. This justifies why the task of staff layout detection has not received much scholarly attention in the past few years.

However, significant progress in object detection has been made since [8], [12], [14], [24], most importantly on the YOLO platform [22]. And the most popular publicly available measure detector that the field has produced, the veritable MeasureDetector², was last updated in 2020; it is based on TensorFlow 1.13.1, which is outdated and requires Python 3.7, a version that reached end-of-life in 2023. Therefore, we believe it is time to update the OMR field's collective intuition on how well (and how fast) this auxiliary task of sheet music layout detection can in fact be performed today.

II.
CONTRIBUTIONS

The central contribution of this paper is not surprising: we find that the current state-of-the-art YOLOv8m model [13] reaches performance comparable to or slightly better than that of the older Faster R-CNN [16], but is significantly faster and smaller, and is therefore preferable. Pre-trained models are made available both for the previous Faster R-CNN architecture and for YOLOv8m.³ Furthermore, this work:
• harmonizes, combines and extends already existing datasets (extended with systems and grand staffs), and provides scripts to convert and merge them into COCO and YOLO format;
• adds MZKBlank⁴, a new dataset that contains background images representative of archival collections;
• provides trained object detection models both as Faster R-CNN and as YOLOv8m;
• provides an estimate of out-of-domain generalization for several basic classes of scores.

All of the scripts and models are available on GitHub.⁵ Secondary outputs include hotfixes to A. Pacha's 2019 MeasureDetector⁶, made in order to prepare datasets and train R-CNN models.

Taken together, we believe these contributions are a substantial update, especially in terms of quality-of-life, for putting into practice those OMR systems that rely on layout detection as a preprocessing step.

¹ Aside from early adiastematic chant manuscripts, of which there aren't millions of pages extant.
² https://github.com/OMR-Research/MeasureDetector

III. LAYOUT OBJECTS

We use five classes of layout objects: staff, grand staff, system, staff measure, and system measure.

Staff. Contains (typically) five parallel lines, all of the same length. One staff is one “line of sheet music” for an instrument. Many end-to-end methods assume as their input the image of a single staff and its associated symbols. Staffline spacing is also the basic element of music notation scaling.

Grand staff.
A pair of staffs meant for a single instrument with a large range (typically keyboard instruments, or the harp). It is practical to treat the grand staff as a separate class because it implies the presence of more complex classes of notation (polyphonic and pianoform [4]) that might be better handled by a more complex but more demanding model.

System. A set of staffs (some of which may be grand staffs) that are to be read in parallel. Barlines may be drawn across the whole system to give readers (usually the conductor, or singers) clear information on synchronization. Technically, e.g. in a violin part, each staff is also a system, but systems are most useful to detect when one needs to decide which staffs should be concatenated and which should not (for instance, to correctly assemble the staffs for individual instruments in a string quartet score).

Staff measure. One measure (a region of notation that corresponds, typically, to one metrical cycle of a downbeat and other beats, as denoted by the time signature) on a staff. Measures are useful for instance as units of indexing for fast sheet music retrieval, and they can also be used for “sanity checks” when assembling scores from individual staff components, to catch de-synchronization between parts early (and correct for it).

System measure. All the measures belonging to the same system that should be played in parallel.

These classes are sufficiently generic that they apply across many different CWMN datasets,⁷ as evidenced by the unproblematic harmonization of multiple datasets for this work, and at the same time they are useful objects that someone might want to extract from a score, for instance to establish an unambiguous reading order.

³ https://github.com/v-dvorak/omr-layout-analysis/releases/tag/evaluation-release
⁴ https://github.com/v-dvorak/omr-layout-analysis/blob/main/app/MZKBlank
⁵ https://github.com/v-dvorak/omr-layout-analysis
⁶ https://github.com/v-dvorak/MeasureDetector

IV.
DATASETS

The resulting dataset is a combination of three already existing datasets and a new one; for the numbers of images and annotations, see Tab. I. All of the datasets mentioned have annotations available in COCO format, although the concrete implementations differ.

A. AudioLabs v2

AudioLabs v2 is an extension of the AudioLabs v1 dataset. Its annotations were generated with the help of a neural network and the original dataset [23]; the images are generated from CSV files. Grand staff and system bounding boxes were added manually to the dataset.

B. MUSCIMA++

MUSCIMA++ [11] is a dataset of handwritten music notation for musical symbol detection that is based on the CVC-MUSCIMA dataset [9]. Grand staff and system bounding boxes were added manually to the dataset.

C. OpenScore Lieder – OSLiC

OpenScore Lieder is a collection of digital editions of accompanied songs by 19th century composers transcribed using the MuseScore editor [10]. The annotations were parsed from SVGs that were generated along with PNGs using MuseScore from the dataset's MSCX scores. Because of many inconsistencies, some scores were ruled out of the final dataset; however, the annotations that remain are all pixel-accurate.

D. MZKBlank

For the best training results, 1–10 % of the images in the dataset should be background images (negative samples) [13], but the datasets mentioned above do not contain enough of these examples: only 56 of the 6 007 images contain no annotations. The Moravská Zemská Knihovna (MZK) offers access to more than two thousand public domain sheet music documents with more than nine thousand labeled pages that do not contain any music⁸, which serve as our negative samples. Reducing
Reducing 7For menusral music, measures are (in the vast majority of cases) not applicable, and different configurations than a partitura such as choirbooks or partbooks where the concept of systems is much less trivial (even though at least choirbooks take care so that all parts need turning the page at the same time, in case of longer compositions). 8Blank, front cover, front end sheet, title page, table of contents and more. Proceedings of the 6th International Workshop on Reading Music Systems, 2024 19 TABLE I NUMBERS OF ANNOTATIONS AND IMAGES IN EACH DATASET images system measures staff measures staffs systems grand staffs AudioLabs v2 940 24 186 50 064 11 143 5 376 5 375 MUSCIMA++ 140 2 888 4 616 883 484 94 OSLiC 4 927 72 028 220 868 55 038 17 991 17 959 MZKBlank 1 006 0 0 0 0 0 total 7 013 99 102 275 548 67 064 23 851 23 428 the number of samples while maintaining the relative ratios, we get the new MZKBlank dataset that contains 1 006 semi- randomly chosen images that are related to sheet music but do not contain any sheet music. Fig. 1. Overview of front covers from MZKBlank resized to squares. V. EXPERIMENTS In our evaluation, we compare the YOLOv8m model with the Faster R-CNN model implemented using TensorFlow, previously utilized for a measure detector by A. Pacha. (We train both architectures on the same datasets, we do not re-use the trained MeasureDetector.) All trained models and results are available online.9 One in-domain test was performed using a 90/10 train/test split across the combination of all datasets. Then, three out-of- domain tests (with one non-blank dataset left out as the test set, see Tab. II) were performed. The results were measured with mAP50 and mAP50-9510. We think that specifically for layout analysis, the higher IoU thresholds are more relevant, because the accuracy of the bounding box matters, especially when layout analysis is used as a preprocessing step (as opposed to localization of e.g. 
clefs or stems in a hypothetical downstream object detection step). We used the pycocotools Python library to calculate these metrics.¹¹

⁹ https://github.com/v-dvorak/omr-layout-analysis/releases/tag/evaluation-release
¹⁰ Mean average precision at an IoU threshold of 0.50, and the average of the mAP values calculated at IoU thresholds ranging from 0.50 to 0.95.
¹¹ The YOLO platform provides its own evaluation script, which is not suitable for evaluating the R-CNN models. In fact, it uses less strict parameters, so a custom script is used to evaluate both types of models.

TABLE II
OUT-OF-DOMAIN TEST DATASETS CONTENTS

id   training datasets                     validation dataset
IV   AudioLabs v2, MUSCIMA++, MZKBlank     OSLiC
V    MUSCIMA++, OSLiC, MZKBlank            AudioLabs v2
VI   AudioLabs v2, OSLiC, MZKBlank         MUSCIMA++

A. Test results

In the case of the in-domain test (see Tab. III), the YOLO model outperforms R-CNN: slightly in the mAP50 setting, and significantly in mAP50-95, where on average it gets roughly halfway closer to a perfect score (0.954 vs. 0.901 overall).

In the case of out-of-domain tests on printed music (see Tab. IV and V), both models perform comparably, with R-CNN slightly beating YOLOv8m in mAP50 scores, but YOLO performing better when more precise localization is required.

When MUSCIMA++ is the out-of-domain dataset (see Tab. VI), however, YOLO is better in both metrics, and while nowhere near usable overall, it reaches 0.75 mAP50-95 for grand staffs and 0.72 mAP50 for staffs, compared to R-CNN's 0.164 and 0.061. YOLO can apparently to some extent abstract away the unexpectedly handwritten context in which the staffs exist, while R-CNN has practically no chance.

B. Speed comparison

Using the same hardware and running on a CPU¹², the inference times for both models were measured. Pacha's R-CNN averaged an inference time of 21.33 seconds per image, whereas YOLOv8m averages just 0.83 seconds, nearly 26 times faster.
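The reported CPU speed-up can be checked with a few lines of arithmetic. The sketch below uses only the per-image averages quoted in the text; the one-million-page collection size and the helper names `speedup` and `processing_days` are illustrative assumptions, chosen to show what the difference could mean at library scale on a single CPU worker.

```python
# Average inference times reported in the text, in seconds per image.
RCNN_CPU_SECONDS = 21.33
YOLO_CPU_SECONDS = 0.83

SECONDS_PER_DAY = 86_400


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new model is than the baseline."""
    return baseline_seconds / new_seconds


def processing_days(pages, seconds_per_page):
    """Days needed to process `pages` images sequentially on one worker."""
    return pages * seconds_per_page / SECONDS_PER_DAY


# Hypothetical collection size, used only for illustration.
PAGES = 1_000_000

print(f"speed-up: {speedup(RCNN_CPU_SECONDS, YOLO_CPU_SECONDS):.1f}x")
print(f"R-CNN:    {processing_days(PAGES, RCNN_CPU_SECONDS):.1f} days")
print(f"YOLOv8m:  {processing_days(PAGES, YOLO_CPU_SECONDS):.1f} days")
```

Under these assumptions the ratio comes out at about 25.7, and a sequential pass over one million pages would take roughly 247 days with the R-CNN against under 10 days with YOLOv8m.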
YOLOv8m's speed can be further improved when run on a GPU¹³, with an average of 0.42 seconds per image, where the inference itself (with pre- and post-processing) takes an average of 0.16 seconds.

¹² Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 32GB RAM
¹³ NVIDIA GeForce GTX 1080 Ti, 32GB RAM

C. R-CNN's overlap problem

One of the specifics of staff layout analysis is overlapping bounding boxes: for every grand staff there has to be a system (that may or may not contain other staffs). In our dataset, 98%¹⁴ of grand staff instances are also entire systems. This overlap has an unwanted effect on the R-CNN model. It has no problem identifying systems and grand staffs with confidence > 0.9 when grand staff ⊊ system. But when grand staffs are also entire systems, the confidence drops for both predictions (sometimes even below 0.5) and the bounding boxes are less accurate.

We conjecture that this may be caused by the structure of Faster R-CNNs, specifically the region proposals. YOLOv8 is a one-stage model [21], in contrast to the two-stage R-CNN, and we did not notice any significant performance drops in the same situation (see Fig. 2).

One can rightly argue that this is an artifact of ground truth design. However, we contend that at least for library-scale OMR systems, which are one of the major use-cases, the ability to determine that a page is e.g. purely for the piano (or one instrument), which can be inferred precisely from (grand) staffs and systems occupying the same area, should still be automated, and hence this property of R-CNN is not an irrelevant artifact of our ground truth design.

Fig. 2. Visualization of R-CNN's overlap problem. Each cell contains predictions for a 2- and 3-staff system. R-CNN's confidence and precision drop significantly (red rectangles) when system and grand staff overlap (2-staff system); YOLOv8 does not exhibit this behavior.

¹⁴ 23 428 : 23 851 ≈ 98%, see Tab. I.

VI.
CONCLUSIONS AND FUTURE DIRECTIONS

Given that the MeasureDetector tool¹⁵ has not been updated for several years, we supply a new, up-to-date measure detector for anyone to use in this still-important OMR preprocessing step. This is no revolution of OMR, of course, but we observed clear performance benefits, both in correctness (esp. on the in-domain task, where the switch to YOLO and more extensive training eliminated between a third and a half of the remaining errors across diverse datasets), in speed (by an order of magnitude, a 26x increase, for an average of 0.83 s per page), and in memory requirements (50 vs 220 MB, which matters for instance for running the model in a browser, if one feels so inclined). However, for some out-of-domain settings, YOLO still does significantly worse when one does not need to use higher IoU thresholds, so we do not advocate phasing out the MeasureDetector completely.

We plan to extend the dataset further by adding more handwritten music, possibly synthetic, to further explore the possibilities of the YOLO platform by experimenting with both smaller and larger models available (N, S, L, X), and to provide more pre-trained models that can be easily embedded into complete OMR workflows.

Looking at the progress of object detection [22], [24], layout analysis for OMR should be on its way to becoming a solved problem and a step that practitioners can easily plug into their systems. While in-domain detection results are coming close to this goal, out-of-domain layout analysis still has a long way to go. Overall, we believe that these models are a useful step in that direction.

¹⁵ https://github.com/OMR-Research/MeasureDetector

TABLE III
IN-DOMAIN EVALUATION, 90/10 TRAIN/TEST SPLIT.
                              Pacha's R-CNN      YOLOv8m
class             instances   mAP50   mAP50-95   mAP50   mAP50-95
system measures       9 151   0.989      0.943   0.987      0.975
staff measures       27 294   0.979      0.831   0.989      0.930
staffs                6 816   0.980      0.854   0.989      0.888
systems               2 326   0.990      0.947   0.990      0.986
grand staff           2 285   0.996      0.931   1.000      0.993
all                  47 872   0.987      0.901   0.991      0.954

TABLE IV
OUT OF DOMAIN: EVALUATED ON OSLIC.

                              Pacha's R-CNN      YOLOv8m
class             instances   mAP50   mAP50-95   mAP50   mAP50-95
system measures      72 028   0.727      0.507   0.554      0.571
staff measures      220 868   0.678      0.204   0.580      0.249
staffs               55 038   0.921      0.295   0.829      0.334
systems              17 991   0.945      0.697   0.978      0.949
grand staff          17 959   0.982      0.701   0.901      0.792
all                 383 884   0.851      0.481   0.790      0.579

TABLE V
OUT OF DOMAIN: EVALUATED ON ALV2.

                              Pacha's R-CNN      YOLOv8m
class             instances   mAP50   mAP50-95   mAP50   mAP50-95
system measures      24 186   0.989      0.827   0.934      0.770
staff measures       50 064   0.976      0.494   0.921      0.535
staffs               11 143   0.939      0.511   0.939      0.584
systems               5 376   0.989      0.832   0.960      0.860
grand staff           5 375   0.973      0.699   0.960      0.859
all                  96 144   0.973      0.673   0.943      0.722

TABLE VI
OUT OF DOMAIN: EVALUATED ON MUSCIMA++.

                              Pacha's R-CNN      YOLOv8m
class             instances   mAP50   mAP50-95   mAP50   mAP50-95
system measures       2 888   0.256      0.140   0.153      0.123
staff measures        4 616   0.196      0.026   0.420      0.174
staffs                  883   0.061      0.008   0.723      0.329
systems                 484   0.237      0.111   0.192      0.140
grand staff              94   0.393      0.164   0.758      0.747
all                   8 965   0.229      0.090   0.449      0.303

ACKNOWLEDGMENT

The authors would like to thank Kristýna Harvanová for sharing her code¹⁶ on which the parsing of annotations from SVG files is based. This work has been supported by the Charles University (project GAUK no. 289623 and SVV project number 260698).

¹⁶ https://github.com/Kristyna-Harvanova/Bachelor-Thesis

REFERENCES

[1] María Alfaro-Contreras, José M. Iñesta, and Jorge Calvo-Zaragoza. Optical music recognition for homophonic scores with neural networks and synthetic music generation.
International Journal of Multimedia Information Retrieval, 12(1), May 2023. doi:10.1007/s13735-023-00278-5.
[2] Donald Byrd and Jakob Grue Simonsen. Towards a standard testbed for optical music recognition: Definitions, metrics, and page images. Journal of New Music Research, 44(3):169–195, 2015. doi:10.1080/09298215.2015.1045424.
[3] Jorge Calvo-Zaragoza and Antonio-Javier Gallego. A selectional auto-encoder approach for document image binarization. Pattern Recognition, 86:37–47, 2019. doi:10.1016/j.patcog.2018.08.011.
[4] Jorge Calvo-Zaragoza, Jan Hajič Jr., and Alexander Pacha. Understanding optical music recognition. ACM Comput. Surv., 53(4), 2020. doi:10.1145/3397499.
[5] Jorge Calvo-Zaragoza, Juan C. Martinez-Sevilla, Carlos Penarrubia, and Antonio Rios-Vila. Optical music recognition: Recent advances, current challenges, and future directions. In Mickael Coustaty and Alicia Fornés, editors, Document Analysis and Recognition Workshops, pages 94–104, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-41498-5_7.
[6] Jorge Calvo-Zaragoza, Luisa Mico, and Jose Oncina. Music staff removal with supervised pixel classification. International Journal on Document Analysis and Recognition, 19:211–219, September 2016. doi:10.1007/s10032-016-0266-2.
[7] Francisco J. Castellanos, Antonio Javier Gallego, and Ichiro Fujinaga. A few-shot neural approach for layout analysis of music score images. In 24th International Society for Music Information Retrieval Conference, pages 106–113, Milan, Italy, 2023. URL: https://archives.ismir.net/ismir2023/paper/000011.pdf.
[8] Wei Chen, Jinjin Luo, Fan Zhang, and Zijian Tian. A review of object detection: Datasets, performance evaluation, architecture, applications and current trends. Multimedia Tools and Applications, 83:1–59, January 2024. doi:10.1007/s11042-023-17949-4.
[9] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós.
CVC-MUSCIMA: A ground-truth of handwritten music score images for writer identification and staff removal. International Journal on Document Analysis and Recognition, 15(3):243–251, 2012. doi:10.1007/s10032-011-0168-2.
[10] Mark Robert Haigh Gotham and Peter Jonas. The OpenScore Lieder Corpus. In Stefan Münnich and David Rizo, editors, Music Encoding Conference Proceedings 2021, pages 131–136. Humanities Commons, 2022. doi:10.17613/1my2-dm23.
[11] Jan Hajič jr. and Pavel Pecina. In search of a dataset for handwritten optical music recognition: Introducing MUSCIMA++. Computing Research Repository, abs/1703.04824:1–16, 2017. URL: http://arxiv.org/abs/1703.04824.
[12] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3296–3297, 2017. doi:10.1109/CVPR.2017.351.
[13] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLO, January 2023. URL: https://github.com/ultralytics/ultralytics.
[14] Jaskirat Kaur and Williamjeet Singh. A systematic review of object detection from images using deep learning. Multimedia Tools and Applications, 83:1–86, June 2023. doi:10.1007/s11042-023-15981-y.
[15] Jiří Mayer, Milan Straka, Jan Hajič Jr., and Pavel Pecina. Practical end-to-end optical music recognition for pianoform music. In Elisa H. Barney Smith, Marcus Liwicki, and Liangrui Peng, editors, Document Analysis and Recognition, pages 55–73, Cham, 2024. Springer Nature Switzerland. doi:10.1007/978-3-031-70536-6.
[16] Alexander Pacha. MeasureDetector, April 2019. URL: https://github.com/OMR-Research/MeasureDetector/releases/tag/v1.0.
[17] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, Andre R.S. Marcal, Carlos Guedes, and Jamie dos Santos Cardoso. Optical music recognition: state-of-the-art and open issues.
International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012. doi:10.1007/s13735-012-0004-6.
[18] Antonio Ríos-Vila. Rotations are all you need: A generic method for end-to-end optical music recognition. In Jorge Calvo-Zaragoza, Alexander Pacha, and Elona Shatri, editors, Proceedings of the 5th International Workshop on Reading Music Systems, pages 34–38, Milan, Italy, 2023. doi:10.48550/arXiv.2311.04091.
[19] Antonio Ríos-Vila, Jorge Calvo-Zaragoza, and Thierry Paquet. Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription. In International Conference on Document Analysis and Recognition, pages 20–37, Athens, Greece, 2024. Springer. doi:10.48550/arXiv.2402.07596.
[20] Antonio Ríos-Vila, Jose M. Iñesta, and Jorge Calvo-Zaragoza. End-to-end full-page optical music recognition of monophonic documents via score unfolding. In Jorge Calvo-Zaragoza, Alexander Pacha, and Elona Shatri, editors, Proceedings of the 4th International Workshop on Reading Music Systems, pages 20–24, Online, 2022. doi:10.48550/arXiv.2211.13285.
[21] Jane Torres. YOLOv8 architecture explained, March 2024. URL: https://yolov8.org/yolov8-architecture-explained/.
[22] Ajantha Vijayakumar and Subramaniyaswamy Vairavasundaram. YOLO-based object detection models: A review and its applications. Multimedia Tools and Applications, pages 1–40, 2024. doi:10.1007/s11042-024-18872-y.
[23] Frank Zalkow, Angel Villar Corrales, TJ Tsai, Vlora Arifi-Müller, and Meinard Müller. Tools for semi-automatic bounding box annotation of musical measures in sheet music. In Demos and Late Breaking News of the International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019. URL: https://www.audiolabs-erlangen.de/resources/MIR/2019-ISMIR-LBD-Measures.
[24] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.
doi:10.1109/JPROC.2023.3238524.

On Designing a Representation for the Evaluation of Optical Music Recognition Systems

Pau Torras, Sanket Biswas, Alicia Fornés
Computer Vision Center, Computer Science Department
Universitat Autònoma de Barcelona, Bellaterra, Spain
ptorras@cvc.uab.cat, sbiswas@cvc.uab.cat, afornes@cvc.uab.es

Abstract: Optical Music Recognition (OMR) is currently fragmented, with incompatible datasets and methodologies making it difficult to combine or compare systems. This paper proposes the Music Tree Notation (MTN) format as a unified framework to promote collaboration, technology reuse, and fair evaluation in OMR research. MTN represents music using an abstract tree built upon the concept of visual primitives, a trade-off between fully graph-based and sequential-based formats. The authors also introduce a set of metrics and a typeset score dataset.

Index Terms: Optical Music Recognition, Representation, Evaluation, Datasets, Computer Vision

I. INTRODUCTION

Written music has been a key part of human cultural heritage for centuries, from ancient neumes to modern Western notation, with countless pages preserved over time. Given the vast number of notable, often forgotten works, scholars now turn to computers for help in preserving and analysing them. Optical Music Recognition (OMR) plays a crucial role in this process, converting scanned or imaged scores into a computer-readable format for further analysis [1].

However, the field of OMR is quite fragmented nowadays. Only a small community is fully devoted to it, and each of its members has developed a unique point of view and methodology [1], [2].
This is particularly evident when analysing the available datasets [3], as most of them are restricted to specific steps or approaches and almost none of them are compatible with each other (the most notable exception being the DoReMi dataset introduced in 2021 [4], which incorporates its ground truth in multiple formats).

Another point of disagreement among OMR researchers is the matter of the evaluation of models and systems [1], [5]–[7]. Evaluation of OMR models is currently performed on a per-methodology basis [1], [8], disregarding the full reconstruction of music scores. Most metrics revolve around measuring the fidelity of intermediate representations, such as object bounding boxes or agnostic text representations, without creating a final notation file. Moreover, the benchmarks these evaluation metrics are computed upon are rarely widespread, with each methodology using its own.

This work has been partially supported by the Spanish projects PID2021-126808OB-I00 (GRAIL) and CNS2022-135947 (DOLORES). Pau Torras is funded by the Spanish FPU Grant FPU22/00207. The authors acknowledge the support of the Generalitat de Catalunya CERCA Program to CVC's general activities.

To advance toward unifying the efforts of the OMR community, we propose establishing a shared framework. The first step is to define a target final representation that supports a wide range of use cases within Common Western Music Notation (CWMN), the system used in Europe from the early 18th century to the present. Our focus on CWMN, rather than the broader range of related western notations, is intentional. While CWMN evolved from Mensural notation and shares some graphical elements, its distinct semantic concepts set it apart. Given the unique nature of each notation system and the specific needs of OMR for CWMN, we believe it is best to focus on this system exclusively. Once a shared representation is chosen, a set of evaluation metrics can be fairly defined.
Thus, the contributions of this work can be summarised as follows:
• We try to bridge the gap between the different benchmark suites in OMR literature with a universal tree-based notation format designed to represent musical scores at the graphical level.¹
• We also present an evaluation toolkit which aims to unify existing benchmark OMR tasks for fairer comparison.
• We have produced a typeset dataset using public domain works with permissive licenses.²

II. THE MUSIC TREE NOTATION FORMAT

The cornerstone of the MTN format is understanding the task of structured OMR as the reconstruction of the score in the visual domain. The core idea of this format is therefore to build a notation that exclusively models relationships between graphical symbols and defers inference of music semantics until a later stage. Only those high-level music concepts that are strictly required to reconstruct the score unambiguously are kept, and only if there is a direct graphical cue that allows straightforward inference.
MTN is designed in order to
• normalise the set of music primitives to be recognised,
• simplify conversion to a final structured format,
• enable comparison of diverse OMR methods on equal grounds and
• facilitate the usage of non-OMR-specific data.

¹ Repository of the project: https://github.com/CVC-DAG/comref-converter
² Link to the dataset: https://datasets.cvc.uab.cat/comref/comref.zip

Fig. 1. Example showing a fragment of a measure in which the annotation format for attributes and staff-modifying elements is shown. Rectangular nodes represent primitives as tokens and rounded nodes are abstract elements.

A graphical representation of a simple measure engraved in MTN can be seen in Figure 1. The core element of this format is the Musical Primitive, a concept that is quite widespread in the OMR literature [9]–[12] and can be defined as any of the independent structural elements that may or may not be combined together to form a semantic unit in the music score.
The set of musical primitives includes all graphical elements in a score that are self-contained and require no other symbols to convey meaning (rests, clefs or time signature symbols), the set of graphical elements that compose notes (noteheads, stems, flags, dots, accidentals, etc.) and other miscellaneous elements such as numbers for compound time signatures. Every primitive is given a unique work-level identifier.

These primitives associate together to form more abstract constructs. This is modelled in MTN using a tree-like structure of higher-order elements, which defines the set of dependencies among objects in the score. This idea, present in works such as [13], emulates parsing the contents of the score using a grammar, enabling the bulk of tools and research on parsers, parser generators and AST analysis and processing to be used in the context of music. Furthermore, it is a structure that can be modelled very easily using an exchange format such as XML.

There are some elements in music that break the tree-like structure assumption. These are elements that connect multiple notes together outside their local note group structure: slurs, ties, parentheses and tuplets, among others. Both MEI and MusicXML acknowledge this limitation and circumvent it through the use of identifiers. MTN is no different: it provides a unique starting and ending token for each side of the object and gives both ends the same identifier.

In order to describe the position of MTN elements, two magnitudes are used. Firstly, for every token a tuple of two integers denotes the staff the element belongs to and its position within the staff. The position is expressed as the number of steps counted from the first ledger line below the staff. For those elements without a specific position (such as rests or stems), a null value is used. Secondly, for any object immediately below the class measure, an exact timing value is provided.
It is measured in fractions of a quarter note from the start of the measure itself. To simplify evaluation procedures, this information is also provided for every chord in a note group, even though it could be inferred.

Finally, to produce unambiguous scores, a reading order of sorts must be established. We propose the following ordering criterion:
• By starting time, counting from the beginning of the measure.
• By top level class (in this order): Attributes, Directions, Rests, Note Groups and Barlines.
• By staff position: first objects on upper staves and lower positions within them.
• In case of Note Groups, by direction of the first stem: first stems looking upwards.
• For anything else, token alphabetical order. This also guarantees stability of the notation if new token types are added.

For other elements such as text or bounding boxes, we propose the use of extensions to complement the format.

III. EVALUATION METRICS

We propose a set of evaluation metrics that acknowledge the existence of multiple paradigms for OMR while also providing ways to compare any structured output on equal terms. These metrics draw inspiration from the currently used Symbol Error Rate and some ideas from Hajič jr. [14]. We have divided our proposed metrics into tiers depending on the abstraction level they address and the problems they can help diagnose.

A. Tier 0: Methodology-Specific Metrics

In Tier 0 any methodology-specific metrics should be logged. This includes bounding box-level mean average precision, symbol error rate or other metrics that are standard for an OMR approach but are not regulated by MTN.

B. Tier 1: Primitive detection

The first set of metrics addresses the presence or absence of terminals within the MTN string. These metrics do not take into account structural matters, making them quick to compute.
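A minimal sketch of this kind of presence/absence counting, assuming terminals are compared purely by their token label and treated as multisets. The matching rule, the function name `primitive_scores` and the toy token lists are illustrative assumptions, not part of MTN or its toolkit. Per class, precision is the matched count over the number of predicted tokens and recall is the matched count over the number of ground-truth tokens; the per-class values are then combined using weights equal to each class's relative frequency in the ground truth.

```python
from collections import Counter


def primitive_scores(predicted, ground_truth):
    """Per-class and frequency-weighted precision/recall over token labels.

    Both arguments are iterables of token labels (e.g. "notehead", "stem").
    Matching is by label only, as a multiset intersection (an assumption).
    """
    pred = Counter(predicted)
    gold = Counter(ground_truth)

    per_class = {}
    for label in sorted(set(pred) | set(gold)):
        matched = min(pred[label], gold[label])  # size of P ∩ G for this class
        precision = matched / pred[label] if pred[label] else 0.0
        recall = matched / gold[label] if gold[label] else 0.0
        per_class[label] = (precision, recall)

    # Weights are the relative frequency of each class in the ground truth.
    total = sum(gold.values())
    if total == 0:
        return per_class, 0.0, 0.0
    weighted_precision = sum(gold[c] / total * per_class[c][0] for c in gold)
    weighted_recall = sum(gold[c] / total * per_class[c][1] for c in gold)
    return per_class, weighted_precision, weighted_recall


# Toy example: one notehead is missed and a spurious dot is predicted.
predicted = ["notehead", "notehead", "stem", "clef", "dot"]
ground_truth = ["notehead", "notehead", "notehead", "stem", "clef"]

per_class, precision, recall = primitive_scores(predicted, ground_truth)
for label, (p, r) in per_class.items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f}")
print(f"weighted   precision={precision:.2f} recall={recall:.2f}")
```

One consequence of weighting by ground-truth frequency is visible in the toy example: the spurious `dot` class has zero weight, so it lowers only its own per-class precision, not the aggregate. Whether that is desirable depends on how the aggregate is meant to be read.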
Given a set of Predicted Terminals P and a set of Ground truth Terminals G, we define
• Primitive-level precision
\[ \mathrm{precision} = \frac{\lVert P \cap G \rVert}{\lVert P \rVert} \tag{1} \]
• Primitive-level recall
\[ \mathrm{recall} = \frac{\lVert P \cap G \rVert}{\lVert G \rVert} \tag{2} \]

These metrics are computed per-class for the entire dataset. In order to produce a single precision and recall measure, the per-class results are aggregated using a weighted average, where the weights are the relative frequency of each token in the ground truth.

C. Tier 2: Structure Reconstruction

This tier takes into account the structure of the produced MTN and compares it directly with that of the ground truth. A matching from ground truth elements to those present in the prediction is performed using a tree edit distance algorithm. In particular, since there is a restriction on the ordering of sibling labels, the $O(n^3)$ solution from Zhang and Shasha can be employed [15]. In practice, we use a Python implementation [16] of Pawlik et al.'s APTED algorithm [17]. The following edit operations are considered:
• Substitution: Changing the label of a single node within the tree.
• Deletion: Removing a single node of the tree and setting its children as siblings.
• Insertion: Adding a new node under a parent one and setting a consecutive subsequence of its siblings as children.

Given a predicted tree and a ground truth tree whose set of vertices is G and assuming an equal edit cost of 1 for all operations, the Tree Error Rate (TER) is defined as
\[ \mathrm{TER} = \frac{S + D + I}{\lVert G \rVert} \tag{3} \]
where S, D and I are the number of substitution, deletion and insertion operations required to produce the ground truth tree from the predicted tree. This metric is designed mostly for benchmarking and is defined by analogy to the ubiquitous Symbol Error Rate (SER).

D. Tier 3: Semantic Reconstruction

This tier considers whether the subset of music semantics required by MTN has been extracted correctly. It depends on the matching extracted at the structural level.
Thus, the False Positive Rate (FPR) and Missing Note Rate (MNR) metrics are defined as the ratio of ground truth notes that do or do not have a corresponding prediction.

In MTN, a note n is defined semantically from its graphical properties: position and time. From this idea and the matching extracted from the previous tier, we define a few metrics. Pitch and Time Precision are defined as the number of correctly predicted graphical pitches and times w.r.t. the ground truth. Average Pitch Shift (APS) and Time Average Shift (TAS) are defined as the average offset in pitch and time from the predicted note w.r.t. its corresponding ground truth note. Signedness is kept in order to identify the direction in which the underlying OMR system tends to move the notes.

In order for all of these metrics to be independent of the sequence length, they should be computed and accumulated for the entire dataset and not averaged on a by-prediction basis.

IV. A PROOF OF CONCEPT

We have developed a dataset built on transcriptions of public domain works as a proof of concept of the notation format. In particular, we have used the OpenScore project's transcriptions of widely known works such as The Art of the Fugue by J.S. Bach or The Planets by Gustav Holst, among others. We have also incorporated the Lieder Corpus [18] and the String Quartet corpus [19]. All these scores are engraved from MusicXML files. In summary, the dataset is produced by processing 894 individual works into images at the measure level (including all staves that belong to it), to produce a total of 435,162 images after cleanup. Page-level images are also provided.

The process through which the dataset was produced is summarised in Figure 2. Scores are engraved through Verovio [20] into page-level SVG files. Using the hierarchical structure within the SVG and exploiting the optional identifier information Verovio can be instructed to attach, measures are engraved individually.
It also marks those measures at the beginning of a line to insert attribute elements. Once the images are produced, the converter uses the MusicXML file and produces the MTN notation.

In order to ensure all images have their corresponding ground truth, we use a cleaning script that finds matching identifiers for images in the MTN files. It also checks for outliers in case there are blatant mistakes in the notation. Although we have taken precautions to minimise the number of errors, there are a few images with objects far from the staff, either temporally or graphically. We remove these outliers heuristically to ensure the quality of the data.

We conducted a simple proof-of-concept experiment on this dataset to assess the feasibility of the methodology proposed in this paper. For this purpose, we used an off-the-shelf OMR system to produce a transcription of the test partition and we analysed its results.

TABLE I
EVALUATION RESULTS. FROM LEFT TO RIGHT, PRIMITIVE PRECISION AND RECALL, TREE ERROR RATE, AVERAGE TIME SHIFT, AVERAGE PITCH SHIFT, TIME PRECISION, PITCH PRECISION, STAFF PRECISION, FALSE POSITIVE RATE AND MISSING NOTE RATE

Prec.   Rec.    TER     Time Shift   Pitch Shift   Staff Shift   Time Prec.   Pitch Prec.   Staff Prec.   FPR     MNR
0.894   0.733   0.372   -0.096       -0.091        0.022         0.802        0.749         0.963         0.097   0.216

Fig. 2. The pipeline through which the COMREF dataset has been generated.

The OMR system used for this experiment is Audiveris [21], an Open Source page-level system capable of generating a MusicXML output from a single input image. The page-level images of the dataset are used as input, since Audiveris requires the information of the clef, key and beat. The output MusicXML is then converted to MTN and a simple matching between predicted and ground truth samples is generated by imposing a top-down reading order given the samples known to be present on each page.
If a prediction has more measures per page than the ground truth, the extra ones are simply discarded.

With the setup outlined above, Audiveris predicted 45822 measures out of the 52884 present in the ground truth. Out of these, 40622 measures from both sets could be matched together, corresponding to a coverage of 76.9%. The missed predictions are the result of the engine failing to give an output on certain pages.

Results for all tiers are shown in Table I. In general, the model tends to identify objects quite reliably but misses objects, as the precision is higher than the recall. Inspecting the per-class precision and recall values, we see that attributes and the smaller objects of the score are the ones that tend to be recognised worse. The 80% recall on black noteheads is alarming because it can cause a very significant drop in note detection performance. Consequently, note groups are missed and a temporal shift forward appears.

Overall, even if the results for this specific tool on the dataset still leave room for improvement, we consider that our proposed format and metric fulfil their design purposes: unique representation of scores and evaluation. Therefore, we consider this simple trial successful.

V. CONCLUSIONS

In this paper we have argued for the implementation of an Optical Music Recognition Framework through the development of a notation format in which score reconstruction is independent from the recognition methodology. Moreover, the resulting scores can be evaluated fairly and unambiguously. Our proposed reification of this idea is the MTN format. Since this method builds upon some of the most widely used abstractions of the community (e.g. symbols as combinations of primitives, time from ordering, etc.) it stands as a good candidate for a common endpoint for OMR as a whole. Of course, CWMN is a tremendously complex notation system which has been optimised and streamlined for hundreds of years.
Nevertheless, we believe the subset of music that can be expressed in this format is large enough to be useful for the community.

In this work, we have also presented a concrete implementation of a set of metrics for OMR practitioners in the hope of bringing together the community to speak the same language; a lingua franca thanks to which research can be shared and compared fairly and easily. We provide a simple baseline from which to demonstrate how the evaluation framework works. The work that lies ahead now is building a corpus of music that can be employed with this format into a benchmark for CWMN recognition, both in typeset and handwritten domains, which shall be the focus of our next efforts.

ACKNOWLEDGMENT

We gratefully thank Jan Hajič Jr. and Carles Badal for discussions that led to improvements in this paper.

REFERENCES

[1] J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha, “Understanding Optical Music Recognition,” ACM Comput. Surv., vol. 53, pp. 1–35, July 2021.
[2] A. Pacha, “Advancing OMR as a Community: Best Practices for Reproducible Research,” in 1st International Workshop on Reading Music Systems (J. Calvo-Zaragoza, J. Hajič jr., and A. Pacha, eds.), (Paris, France), pp. 19–20, 2018.
[3] A. Pacha, “The OMR Datasets Project,” 2017.
[4] E. Shatri and G. Fazekas, “DoReMi: First glance at a universal OMR dataset,” in Proceedings of the 3rd International Workshop on Reading Music Systems (J. Calvo-Zaragoza and A. Pacha, eds.), (Alicante, Spain), pp. 43–49, 2021.
[5] J. Hajič and P. Pecina, “The MUSCIMA++ Dataset for Handwritten Optical Music Recognition,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 39–46, Nov. 2017. ISSN: 2379-2140.
[6] J. Hajič jr., “A Case for Intrinsic Evaluation of Optical Music Recognition,” in 1st International Workshop on Reading Music Systems (J. Calvo-Zaragoza, J. Hajič jr., and A. Pacha, eds.), (Paris, France), pp. 15–16, 2018.
[7] L. Mengarelli, B. Kostiuk, J. G. Vitório, M. A. Tibola, W. Wolff, and C. N. Silla, “OMR metrics and evaluation: a systematic review,” Multimedia Tools and Applications, Dec. 2019.
[8] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” Int J Multimed Info Retr, vol. 1, pp. 173–190, Oct. 2012.
[9] A. Baró, P. Riba, and A. Fornés, “Towards the Recognition of Compound Music Notes in Handwritten Music Scores,” in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 465–470, Oct. 2016. ISSN: 2167-6445.
[10] A. Baró, C. Badal, and A. Fornés, “Handwritten Historical Music Recognition by Sequence-to-Sequence with Attention Mechanism,” in 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 205–210, Sept. 2020.
[11] J. Calvo-Zaragoza and D. Rizo, “Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores,” in 19th International Society for Music Information Retrieval Conference, (Paris, France), pp. 248–255, 2018.
[12] L. Tuggener, I. Elezi, J. Schmidhuber, M. Pelillo, and T. Stadelmann, “DeepScores - A Dataset for Segmentation, Detection and Classification of Tiny Objects,” in 24th International Conference on Pattern Recognition, (Beijing, China), ZHAW, 2018.
[13] F. Foscarin, F. Jacquemard, and R. Fournier-S’niehotta, “A diff procedure for music score files,” in Proceedings of the 6th International Conference on Digital Libraries for Musicology, DLfM ’19, (New York, NY, USA), pp. 58–64, Association for Computing Machinery, Nov. 2019.
[14] J. Hajič jr., J. Novotný, P. Pecina, and J. Pokorný, “Further Steps towards a Standard Testbed for Optical Music Recognition,” in 17th International Society for Music Information Retrieval Conference (M. Mandel, J. Devaney, D. Turnbull, and G. Tzanetakis, eds.), (New York, USA), pp. 157–163, New York University, 2016.
[15] K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems,” SIAM J. Comput., vol. 18, pp. 1245–1262, Dec. 1989. Publisher: Society for Industrial and Applied Mathematics.
[16] “JoaoFelipe/apted: Python APTED algorithm for the Tree Edit Distance.” https://github.com/JoaoFelipe/apted/tree/master, 2017. Accessed: 2024-03-10.
[17] M. Pawlik and N. Augsten, “Tree edit distance: Robust and memory-efficient,” Information Systems, vol. 56, pp. 157–173, Mar. 2016.
[18] M. R. H. Gotham and P. Jonas, “The OpenScore Lieder Corpus,” in Music Encoding Conference Proceedings 2021 (S. Münnich and D. Rizo, eds.), pp. 131–136, Humanities Commons, 2022.
[19] “String quartet corpus,” 2023. Accessed: 2023-10-10.
[20] L. Pugin, “Verovio, a music notation engraving library.” https://www.verovio.org/, 20?? Accessed: 2024-03-14.
[21] A. Project, “Audiveris - open-source optical music recognition.” https://github.com/Audiveris/audiveris/, 20?? Accessed: 2024-03-14.

Enhanced User-Machine Interaction for Historical Sheet Music Retrieval: a Musical Notation Approach

Aitana Menárguez-Box, PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain, amenbox@prhlt.upv.es
Alejandro H. Tosselli, PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain, ahector@prhlt.upv.es
Enrique Vidal, PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain, evidal@prhlt.upv.es

Abstract—Searching for musical information in historical handwritten scores poses a significant challenge, particularly for musicians and history researchers. Until now, an existing tool allowed querying Cod. 253 of the Vorau Abbey library through staff-relative-position symbols referred to as “geometrical notation”.
We present advancements in user-machine interaction by enabling queries expressed with pitch-relative symbols (musical notation) in this system, offering more intuitive and precise means of interaction. Leveraging a web piano interface, users can now input queries using real musical notes, enhancing both usability and accuracy.

Even though previous works have already explored this kind of implementation, it has only been tested on sheet music that was already transcribed and digitized. Our approach, based on fully automatic Probabilistic Indexing (PrIx) of a manuscript, addresses the intricacies inherent in historical scores, including variations in clef types and positions, to transform musical queries into complex Boolean geometric expressions. By integrating these enhancements into an existing search engine, we provide researchers with a more accessible and efficient means of exploring vast collections of historical sheet music.

This paper underscores the significance of user-machine interaction improvements in facilitating meaningful discoveries and insights in musicology and historical research.

Index Terms—Musical Probabilistic Indexing, Musical Information Retrieval, Historical Handwritten Music Recognition.

I. INTRODUCTION

Historical sheet music collections are invaluable resources for musicologists, historians, and musicians. These collections contain a wealth of information about the evolution of music notation, composition styles, and cultural practices. However, searching for specific musical information within these collections can be challenging due to the complexity of historical scores and the limitations of existing search tools. In particular, searching for musical information in handwritten scores can be difficult because of variations in notation styles, clef types, and other notational conventions.
To address these challenges, some work such as [1], [2] or [3] has been done to develop technologies based on Optical Music Recognition (OMR) to automatically transcribe and index historical scores. These enable the creation of tools that allow users to search for musical information in ancient sheet music.

In this paper, we took as a starting point the demonstrator¹ developed thanks to the work done in [4], which allowed querying Cod. 253 of the Vorau Abbey library (Vorau-253). We have tried to overcome the limitations of this tool in terms of usability and accuracy. These were mainly due to the fact that it required users to input queries using abstract symbols (positions within the staves) which had little to do with musical notation. This made the search process less intuitive and precise, especially for users who were familiar with music theory.

Here we present an enhanced user-machine interaction approach for historical sheet music retrieval that enables users to input queries using pitch-relative symbols (musical notation) within the same web-based search engine. Our approach leverages a web piano interface that allows users to input queries using real notes within the staff, making the search process more intuitive and precise. We also address the complexities in historical scores, such as variations in clef types and positions, by transforming musical queries into complex Boolean geometric expressions that can be used to search for matches inside the mentioned manuscript collection.

II. LIMITATIONS OF CURRENT MUSIC SEARCH SYSTEMS

When comparing our approach to existing music search tools, several critical distinctions emerge. Many current systems do not operate on automatically recognized sheet music but instead rely on transcriptions prepared by musicologists. A prime example of this is the search tool available within the Cantus database [5].
Tools of this kind, while robust for exploring chant manuscripts that have already been transcribed by hand, do not address the intricacies involved in working with untranscribed manuscripts, which constitute the vast majority of the millions of historical sheet music books in archives and libraries.

¹ Available at https://prhlt-carabela.prhlt.upv.es/musica.

A. Recognition Level Approaches

Methodologies employed for music recognition have evolved over time. Early systems like Aruspix or Gamera utilized traditional approaches, such as k-nearest neighbors (kNN) or hidden Markov models (HMMs) [6], to segment and recognize musical symbols. These and other similar systems [7] are built on workflows that treat musical elements in isolation as described in [2]. Although effective in specific cases, these segmentation-based approaches are inherently limited in accuracy and scalability.

More recent advancements, such as the OMMR4ALL² and the Cantus Ultimus³ projects, employ modern deep learning techniques to enhance Handwritten Music Recognition (HMR)⁴ capabilities. While these neural network-based systems represent significant progress, they still share a common limitation with their predecessors: they rely only on the most probable recognition hypothesis when performing information retrieval. In both old and new implementations, the recognition pipeline reduces the output to a single, definitive interpretation, which simplifies but also constrains the search process.

In contrast, our approach introduces a novel dimension by utilizing Probabilistic Indexing [4]. Rather than selecting a single hypothesis for each recognized musical symbol, all recognition hypotheses are preserved alongside their probabilities. This comprehensive approach, as will be seen later in this work, allows for more flexible and accurate search queries, as it accommodates the uncertainty and variability of historical handwritten scores. Searching within a probabilistic framework enables users to retrieve not just the most likely match but also alternative possibilities, thereby improving the overall accuracy and reliability of the system.

B. Web Interface Design

Another significant difference can be noticed in the design of the user interface. Several existing systems offer web-based search functionalities, but these often suffer from poor usability. For example, the Liber Usualis Search⁵, while pioneering in its intent, lacks the intuitive input mechanisms necessary for efficient query formulation. Similarly, the F-tempo⁶ system, which employs the Aruspix engine, inherits many of the above-mentioned limitations associated with segmentation-based recognition. There are also systems that offer interfaces which are closer to our proposal, such as Musiconn.scoresearch⁷, which allows users to input queries using musical notation through a simulated piano keyboard. However, like many others, it still relies on a deterministic search model that limits results to the highest-probability recognition, potentially overlooking valuable alternative interpretations.

² Available through https://ommr4all.informatik.uni-wuerzburg.de/en/.
³ Available through https://cantus.simssa.ca/.
⁴ HMR is the application of OMR for the specific case of handwritten sheet music.
⁵ Available through https://liber.simssa.ca/.
⁶ Available through https://f-tempo.org/.
⁷ Available through https://www.musiconn.de/services/.

On the other hand, our system addresses these shortcomings by introducing a more intuitive, pitch-relative input method using a web piano interface. This allows users to query the system with real musical notes, making the search process far more accessible, particularly for those familiar with music theory.
Coupled with the recognition approach used [8] and the Probabilistic Indexing [4], this interface enables a more nuanced and accurate search experience, offering multiple layers of recognition possibilities that are absent in other systems. By preserving the ambiguity and flexibility inherent in historical music notation, our approach enhances both the usability and the accuracy of music information retrieval.

III. QUERYING THE SYSTEM

Before the proposal in this paper, querying the Vorau-253 music collection was conducted using geometrical notation, which relied on the positional information of musical elements within the staff. This notation system, as described in previous works [4], was convenient for basic training and testing experiments with handwritten music images.

A. Geometrical Notation Drawbacks

In the geometrical notation, basic lowercase symbols (l for notes on lines, s for notes in spaces) were utilized, with appended numbers indicating the vertical position in the staff. Additionally, other symbols were used to represent clefs (c or f followed by a number depending on the line they were located on) and accidentals (i.e., the word flat). While this geometrical notation facilitated optical modeling and decoding of staff images, it fell short in representing melodic patterns adequately from a musical point of view. An example of this notation for an extract of the manuscript can be seen in Fig. 1.

Fig. 1. A small staff fragment of a real sheet music image from the dataset Vorau-253 used in this work. The sequence of notes (and clef) on this image is represented as ⟨c4,l3,l4,l4,l2,s2,l3,s3,s2,l2,s2,l2⟩.

One significant limitation of geometrical notation is its inability to capture the contextual nuances in traditional musical notation systems.
Unlike conventional musical notation, where a note’s interpretation relies heavily on its relationship with other musical symbols (thanks to its conversion into a specific pitch), geometrical notation treats each note as an isolated entity, solely determined by its position on the staff. This lack of contextual information poses challenges in accurately representing and querying musical patterns, hindering the system’s usability.

To overcome these limitations and enhance user-machine interaction, our approach introduces querying capabilities using pitch-relative symbols (musical notation) within the existing web-based search engine. By leveraging a web piano interface, users can now input queries using real notes within the staff, offering a more intuitive and precise search experience. Furthermore, our system addresses the complexities of historical scores, including variations in clef types and positions, by transforming musical queries into complex Boolean geometrical expressions. This integration of musical notation querying enhances accessibility and efficiency, empowering researchers to explore vast collections of historical sheet music with greater ease and accuracy.

B. From Geometrical to Musical Notation

A single geometrical symbol, such as l2, does not inherently denote a specific pitch. Its interpretation depends on the preceding clef symbol. For instance, if a c3 clef precedes it, the symbol represents an A3, whereas a c4 clef would render it as an F3. Moreover, a single note may have multiple geometric equivalents based on different clef positions. For example, the geometrical representation of D4 could be s3 with a preceding c3 clef or s4 with a c4 clef. These variations illustrate that each note can have a series of alternatives depending on the context provided by the clef.
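The dependence of a geometrical symbol on its clef can be made concrete with a small sketch. This is an illustration written for this text, not the demonstrator's code: the function names, the clef list and the choice of C4 and F3 as the pitches marked by the c and f clefs are assumptions, chosen so that the sketch reproduces the examples given above. Accidentals are not handled, and returning None for positions below the first line is also an assumption.

```python
LETTERS = "CDEFGAB"

# Assumed reference pitch sitting on the line marked by each clef type.
CLEF_REFERENCE = {"c": ("C", 4), "f": ("F", 3)}

# Hypothetical clef inventory, limited to the clefs named in this section.
CLEFS = ["c1", "c2", "c3", "c4", "f1", "f2"]


def diatonic_index(letter, octave):
    """Count diatonic steps (note names only, no accidentals) from C0."""
    return octave * 7 + LETTERS.index(letter)


def geometric_symbol(pitch, clef):
    """Translate a pitch such as 'D4' into a symbol such as 's3' for a clef."""
    letter, octave = pitch[0], int(pitch[1:])
    ref_letter, ref_octave = CLEF_REFERENCE[clef[0]]
    clef_line = int(clef[1:])
    # Steps above line 1: lines fall on even steps, spaces on odd steps.
    step = (2 * (clef_line - 1)
            + diatonic_index(letter, octave)
            - diatonic_index(ref_letter, ref_octave))
    if step < 0:
        return None  # below line 1; how this is encoded is not modelled here
    kind = "l" if step % 2 == 0 else "s"
    return f"{kind}{step // 2 + 1}"


def clef_alternatives(pitch):
    """Map every clef to the geometrical symbol the pitch takes under it."""
    return {clef: geometric_symbol(pitch, clef) for clef in CLEFS}


print(geometric_symbol("A3", "c3"))   # l2
print(geometric_symbol("F3", "c4"))   # l2
print(clef_alternatives("D4"))
```

Both calls printing l2 show the ambiguity described above: the same symbol stands for A3 under a c3 clef and for F3 under a c4 clef.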
In a search system based on geometrical notation, queries are typically constrained to a single clef at a time, leading to potential bias in the expressed information and the retrieved results. To address this limitation, a new querying approach is proposed, which involves converting queries into multiple translations corresponding to each possible clef position. This enables the representation of a note’s context (i.e., its clef) within the query.

Consider the sequence of notes “C4 D4 F4” in musical notation to be converted into geometrical notation. Each note in the sequence is translated for every possible clef, and the resulting translations are combined into a single query. Additionally, certain constraints, such as excluding notes outside the potential pitch range (E2 to D5), are applied to refine the query and improve its relevance. The translation of the proposed note sequence into geometrical notation yields a complex query structure, as shown below:

    (c1 & [l1 s1 s2]) || (c2 & [l2 s2 s3]) ||
    (c3 & [l3 s3 s4]) || (c4 & [l4 s4 s5]) ||
    (f1 & [l3 s3 s4]) || (f2 & [l4 s4 s5])

In this query, the || symbol denotes a Boolean OR operation, the square brackets [ and ] indicate a sequence of notes, and the & symbol represents an AND operator, ensuring that the clef must precede the sequence of notes on the same staff.

C. Expanding the Search: Ignoring the Key

After considering the use of this new querying approach, we must also take into account that sometimes the user may want to look for melodies that are not exactly the same as the one they input. This could be due to a mistake in the transcription of the pitch or because they want to find all melodies that share the same intervallic relation between notes. For example, if the user wants to search for an ascending melody of three notes separated by whole tones, a sequence query proposal would be “C4 D4 E4”.
Although the search is explicitly for these three notes, the option to search for all melodies that maintain the intervallic relationship should be given to satisfy the initial query. Thus, the notes need not be searched for exactly as entered, and the sequences “G4 A4 B4” or “C3 D3 E3” could also be found, both corresponding to the initial query of three ascending notes separated by whole tones.

We have also implemented this type of query inside the demonstrator. From now on, queries will be referred to as queries with key (if they take into account the original pitch) or without key (if they do not). Both search forms necessarily induce a significant increase in complexity compared to pure geometric queries, leading to potential performance implications for the search engine. Further research is needed to evaluate their impact on the system’s performance.

IV. THE WEB PIANO INTERFACE

The musical input to the web platform is facilitated through the piano tab, which has been implemented using an HTML dialog containing various elements for melody insertion and visualization. This dialog is accessible via a button on the right side of the interface. Below we describe the main components of the tab:
• Piano Keyboard: HTML buttons simulating the keys of a real piano, each linked to the corresponding note in musical notation. Pressing a button (by clicking it or through the computer keyboard) transmits the note’s name directly. Additionally, each button is associated with an MP3 audio file, enabling users to hear the sound of the note. Furthermore, notes that are certain not to appear in the manuscript are lighter in color. If one is pressed, the corresponding sound is played, but the note is not transmitted. The appearance of the piano keyboard is shown in Fig. 2.

Fig. 2. Screenshot of the piano keyboard, simulating a real one. The equivalence with the computer keyboard is included next to the corresponding keys.
• Record Button: Positioned at the top left corner of the piano tab, it enables users to play keys without transmitting the notes. This feature is beneficial for sound testing or practicing melodies without affecting search queries. When illuminated in bright red, recording is active, indicating that played notes will be transmitted. Activation or deactivation of recording is achieved with a simple click.
• Search Bar: Featuring an HTML input element, the search bar allows users to input queries in musical notation. As users play notes on the piano keyboard, the search bar dynamically populates with the played melody, provided that recording is active.
• Convert and Erase Buttons: The convert (translation) button initiates the conversion of the musical query into complex Boolean geometrical expressions for search purposes. The resulting query is then transferred to the external search bar of the website. The erase button clears the content of the search bar in the piano dialog. An example of usage for the translation of a melody can be seen in Fig. 3.

Fig. 3. Screenshot of the piano tab after a melody has been played and translated into a query. The resulting query is shown in the search bar.

• Play Button: After melody input, users can utilize the play button to audibly review the entered melody. This functionality enables users to verify the correctness of the melody before executing the search, or to listen to it again after the search has been performed.
• Mind Key Checkbox: Allows users to determine whether the search should consider the original pitch of the entered melody. When activated, it ensures that the search respects the original pitch.

The piano interface supports MIDI input, permitting users to play keys using a connected MIDI instrument. These inputs are managed through the MIDI API⁸, included in most browsers.
Further testing is essential to ascertain the full functionality and performance of the web piano interface. Refinement of its appearance and usability may be necessary to optimize user experience. As such, the interface remains in a testing phase, subject to iterative improvements based on user feedback and evaluation.

⁸ More information can be found at https://webaudio.github.io/web-midi-api/.

V. EVALUATION AND RESULTS

In this section we aim to determine whether allowing queries based on sequences in musical notation, along with the possible translation without considering the original pitch, results in good performance of the search engine.

To conduct the evaluation, it is necessary to send a substantial number of melody queries to the demonstrator to obtain a reasonably representative reference (185 musical queries in this case). Then, we can measure the quality of the results depending on the technology used to retrieve the information, i.e. the search method used (three different approaches), together with the type of queries performed (with and without tonality) to assess the search engine’s performance.

TABLE I
AVERAGE PRECISION (AP) RESULTS FOR THE DIFFERENT DETECTION METHODS AND THE TYPE OF QUERIES.

          AP
Method    With key    Without key
OP        0.72        0.82
ROP       0.73        0.82
GP        0.87        0.91

To simulate a realistic testing environment while maintaining stylistic criteria, melodies used for training the developed tool in [9] were employed to create the queries.

The effectiveness of information retrieval systems is generally measured using recall and interpolated precision standards [10]. We report results in terms of Average Precision (AP), defined as the area under the recall-precision (R-P) curve. The higher its value, the better the system’s performance. The set of staves in which the search has been performed is the same as in [4].

Within this evaluation, different approaches to detect queries of musical sequences in handwritten scores were employed in the experimentation.
Three detection alternatives were tested: based on the logical positions of the indexed detections (OP), based on the logical positions whose order is consistent with their geometric location (ROP), and based solely on the geometric positions of the indexed detections (GP). Tab. I presents the Average Precision (AP) results for all these combinations, where the use of GP together with tonality-free queries achieves the best result.

VI. CONCLUSION

In this paper, we introduced advancements in user-machine interaction for searching historical handwritten scores. By enabling queries using musical notation symbols within the web-based search engine at https://prhlt-carabela.prhlt.upv.es/musica, we enhanced its usability. Thanks to the implementation of the piano interface, users can now input queries intuitively with real notes on the staff.

Our approach addresses complexities in historical scores, such as variations in clef types and positions, by transforming musical queries into Boolean geometrical expressions. Evaluation results demonstrate the effectiveness of our method, especially when combining geometric positions with tonality-free queries.

While our results provide a promising starting point, further refinement and testing (especially of the web piano interface) are necessary. It is important to note that direct comparisons with previous studies [4] may not be feasible due to differences in methodologies and evaluation criteria.

These enhancements represent a meaningful advance in enabling discoveries in musicology and historical research, especially given the widespread use of the treated notation style across the vast corpus of musical documents. Future research could extend this methodology to other notational systems and later musical sources, incorporating additional elements (rhythm, polyphony, etc.).
Furthermore, continued testing and refinement will be essential for optimizing user experience and maximizing the impact of our approach in terms of computational efficiency.

REFERENCES

[1] J. Calvo-Zaragoza, J. H. Jr., and A. Pacha, “Understanding optical music recognition,” ACM Comput. Surv., vol. 53, no. 4, jul 2020. [Online]. Available: https://doi.org/10.1145/3397499
[2] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, vol. 1, pp. 173–190, 2012.
[3] M. Villarreal and J. A. Sánchez, “Handwritten music recognition improvement through language model re-interpretation for mensural notation,” in 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020, pp. 199–204.
[4] J. Calvo-Zaragoza, A. H. Toselli, E. Vidal, and J. A. Sánchez, “Music symbol sequence indexing in medieval plainchant manuscripts,” in 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 882–887.
[5] D. Lacoste, “The cantus database: Mining for medieval chant traditions,” Digital Medievalist, vol. 7, 2012.
[6] L. P. J. H. J. Ashley and B. I. Fujinaga, “Gamera versus aruspix two optical music recognition approaches,” ISMIR 2008, p. 139, 2008.
[7] Y.-H. Huang, X. Chen, S. Beck, D. Burn, and L. Van Gool, “Automatic handwritten mensural notation interpreter: From manuscript to midi performance.” in ISMIR, 2015, pp. 79–85.
[8] J. Calvo-Zaragoza, A. H. Toselli, and E. Vidal, “Handwritten music recognition for mensural notation with convolutional recurrent neural networks,” Pattern Recognition Letters, vol. 128, pp. 115–121, 2019.
[9] P. P. Cruz-Alcazar and E.
Vidal-Ruiz, “Modeling musical style using grammatical inference techniques: a tool for classifying and generating melodies,” in Proceedings Third International Conference on WEB Delivering of Music. IEEE Comput. Soc, 2004.
[10] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press Cambridge, 2008, vol. 39.

The CollabScore project – From Optical Recognition to Multimodal Music Sources

Bertrand Coüasnon, Univ. Rennes, CNRS, IRISA, INSA Rennes, bertrand.couasnon@irisa.fr
Mathieu Giraud, CNRS/Univ. Lille, mathieu.giraud@univ-lille.fr
Christophe Guillotel Nothmann, CNRS/Sorbonne Univ., christophe.guillotel-nothmann@cnrs.fr
Aurélie Lemaitre, Université Rennes 2, CNRS, IRISA, aurelie.lemaitre@irisa.fr
Philippe Rigaux, Cnam, philippe.rigaux@lecnam.net

Abstract—We introduce COLLABSCORE, a project funded by the French National Research Agency, devoted to the design and production of tools and methods to improve access to large collections of sheet music scans. The new optical music recognition (OMR) approach developed in COLLABSCORE is part of a larger goal, namely that of interlinking multimodal documents related to music works. In this perspective, the music notation obtained from the OMR process is seen as a pivot that associates related fragments of images, audio, video, XML, or text sources. As an application of this principle, COLLABSCORE supports the synchronization of sources, leveraging the raw content of digital libraries with listening and visualization experiences. The present paper introduces the project and presents some of its current achievements.

I. OVERVIEW

The core concept of the project is that of multimodal music sources, and the project’s main efforts aim at creating tools and methods to interlink these sources. We begin with an overview of this perspective before surveying some more technical aspects.

A.
Multimodal music sources

Given a music work (say, the Goldberg variations) seen as an abstract entity, we can find many concrete documents that provide a specific representation. These documents can be recordings, in audio or video format, images (scans) of score sheets, editable scores in MusicXML or MEI, and even textual sources that comment/annotate/enrich the music.

It turns out that each representation is difficult to use beyond its specific purpose. For non-specialists, we know it is hard to “hear” the music from a score and, conversely, it is hard to “replay” or analyse the music from a performance, live or recorded. Moreover, sources are usually self-contained, independent documents, encoded in some specific format. This prevents an easy mapping of music components (a voice, a harmonic sequence, a phrase) from one source to another, at a finer level of granularity than the whole document itself.

In COLLABSCORE, we address these issues with multimodal music scores (MMS). An MMS combines an encoding of the music notation (a MEI file) with links that associate the notation elements to the corresponding fragments of multimedia sources, e.g., a region on an image, a time frame in an audio/video source, a section of a textbook. Music notation is thus used as a description language for music content, which serves as a reference, or pivot, to link heterogeneous sources that encode the same content.

COLLABSCORE implements this model in a data store (Neuma, see footnote 1) which provides (i) a management of such pivot scores, (ii) a storage of each pivot with external or internal multimedia sources, and (iii) an annotation mechanism that maps the pivot fragments to the corresponding part of each source [1], [2].
Figure 1 shows an example of an MMS: the pivot score (here, La coccinelle, a melody from Saint-Saëns) stored as a MEI document in Neuma is the central piece that glues together several sources: an image (taken from the Gallica digital library), a video accessible on YouTube, a MIDI file (internal source).

Fig. 1. A multimodal score and its sources

The project’s work consists in designing tools to produce and manage MMS, including a powerful OMR system which is the privileged means to obtain a pivot. They are briefly summarized below.

B. Producing the pivot via optical recognition and crowdsourcing

Although pivot scores could be obtained by edition or transcription, COLLABSCORE integrates Optical Music Recognition (OMR) as the primary means to produce a notation from image sources. In this context, our definition of “OMR” corresponds to the class of “structured encoding” OMR in [3]: we aim to produce an editable score featuring all the notation elements visible on the sheet scan, along with their proper interpretation. In other words, we implement a process that attempts to invert the production of a printed score from specifications entered in a music notation engraver. Moreover, we combine this process with crowdsourcing phases to achieve a high-quality output, as discussed in [4]. The process is validated on a corpus mostly taken from the BnF Gallica Digital Library. These aspects are covered in Section II.

Footnote 1: http://neuma.huma-num.fr

C. Alignment of sources

Multimedia sources are aligned with the pivot as shown in Fig. 2. The XML encoding of the notation (in MEI) identifies each component (here, a chord) with a unique id which is the target of annotations that refer to the corresponding fragments of sources. In the case of an image, the annotation specifies a region on the image; in the case of audio/video, a time frame gives the start/end of the fragment.
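The annotation model just described can be pictured with a small data structure. The sketch below is only an illustration under stated assumptions: the class names, field names, source labels and the element id "chord-12" are hypothetical and do not reflect the actual Neuma data store or its API.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Region:
    """Rectangle on an image source (hypothetical pixel units)."""
    x: int
    y: int
    w: int
    h: int


@dataclass(frozen=True)
class TimeFrame:
    """Start and end of a fragment in an audio/video source, in seconds."""
    start: float
    end: float


@dataclass(frozen=True)
class Annotation:
    element_id: str                      # unique id of a notation element in the pivot
    source: str                          # hypothetical label of the annotated source
    fragment: Union[Region, TimeFrame]   # where the element appears in that source


@dataclass
class MultimodalScore:
    pivot_mei: str                       # reference to the MEI pivot document
    annotations: list = field(default_factory=list)

    def fragments_of(self, element_id):
        """Return, per source, the fragment linked to one pivot element."""
        return {a.source: a.fragment
                for a in self.annotations
                if a.element_id == element_id}


# Usage: one chord of the pivot linked to an image region and an audio time frame.
score = MultimodalScore(pivot_mei="pivot.mei")
score.annotations.append(Annotation("chord-12", "image", Region(120, 340, 60, 90)))
score.annotations.append(Annotation("chord-12", "audio", TimeFrame(3.2, 4.0)))
linked = score.fragments_of("chord-12")
```

Because every annotation targets an id in the pivot, two sources never need to reference each other directly; they are related only through the notation element they share.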
Fig. 2. Aligning sources: A multimodal score with three documents (an image source linked by region(x,y,w,h), the pivot in MEI, and an audio source linked by tframe(s,e))

The alignment method depends on the source. In the case of images, annotations are supplied by the OMR system as a side effect of the recognition process. For other sources, dedicated interfaces have been implemented (Section III).

D. Applications

Finally, through the music description available in the pivot score, the content of two sources can be associated at a fine granularity level. The OMR output, for instance, can be checked through a side-by-side display of both the source image and the pivot score rendering. Textual annotations (e.g., analytic comments) can be added on a score image at precise positions. An interface developed in COLLABSCORE allows listening to an audio/video source while highlighting the music being played on the original image source. Among many other advantages, this is likely to add considerable value to the content of digital libraries through attractive features (details in Section III).

II. THE OMR PROCESS

Among the various works on OMR [3], [5], two main types of approach can be observed in recent work. One is based on the detection of musical symbols [6], [7], inspired by architectures developed for object detection in natural scenes, with problems specific to OMR related to the large size of the images to be processed and the very small size of some musical symbols. The other is based on end-to-end recognition methods that directly produce a representation of the recognized score, which initially tackled monophonic scores and have only very recently started to handle polyphonic systems [8], [9]. For the moment, these methods do not produce the localization of the recognized information required, for example, for image-sound synchronization.
The OMR process we propose in COLLABSCORE to deal with polyphonic orchestra scores is founded on the DMOS method, completed with a collaborative process that aims at clarifying the interpretation of symbols that have been identified as ambiguous. We experiment with this combination on a large corpus for which a reference encoding has been produced.

A. Automatic syntactic OMR with DMOS

DMOS [10] relies on a grammatical method that enables the combination of visual clues with syntactic rules, in order to describe both the physical and the logical content of the document. The process follows two steps, as shown in Fig. 3.

Fig. 3. Overview of DMOS: combination of low level detectors and high level syntactic rules

In a first step, three low level extractors are applied on the image:
• a symbol extractor based on deep learning (Cascade R-CNN - FocalNet architecture), dedicated to the extraction of small musical symbols [11] from high-resolution full-page images;
• an existing line segment extractor, based on Kalman filtering [12], used to extract linear elements, such as staff lines and stems;
• the existing PeroOCR [13], for the extraction of textual elements, such as titles, lyrics, instrument names.

In a second step, those elements are given as input to a syntactic system, based on the DMOS method [10]. It produces a description of the graphical and syntactic content of a score image: a score is made of staff systems, containing measures, and each measure contains musical objects (notes, rests, ...) that respect time constraints.

Recognizing a measure involves three steps of analysis. First, the staves and barlines are identified. Then, inside of a score, the graphical content is detected based on the position and assembly constraints of both the symbols detected by the deep object detector and the linear elements extractor: key, notes, rests, dots, accidentals, ties, slurs, dynamics, articulation marks, lyrics...
Each detected element is localized in the image, and produced with its associated bounding box (Fig. 4). Finally, the system organises the content into voices. After the distribution of notes into voices, the system checks the global consistency of the recognition, and produces a warning if the detected elements do not follow some given rules. For example, if an eighth note is misdetected, the system will trigger a warning because the time signature is not respected.

Fig. 4. The OMR process: Detection of graphical content

Moreover, based on the vertical alignment of notes in a system, it is possible to locate or even correct the note with the wrong duration. Applying these rules makes detection more reliable in a context of ancient noisy documents.

All the elements identified by DMOS are organized in a document compliant with the music notation grammar. This document is the main source of information to initiate a multimodal music source in COLLABSCORE. Indeed, from the music notation symbols, a symbolic score in MEI is reconstructed (the pivot), and the source image is aligned with this pivot thanks to the regions identified by DMOS. Moreover, the alerts raised by DMOS are recorded and subsequently submitted to the collaborative process.

B. The collaborative process

The “raw” music score obtained from the OMR process enters a phase of corrections via a sequence of dedicated interfaces. A design choice of COLLABSCORE is to limit user actions to the list of alerts raised by the DMOS component. While this may seem restrictive, we believe that going beyond would ultimately lead to implementing a full online score editor (see footnote 2). The advantage of considering only the DMOS alerts is that we remain within the scope of an automatic recognition process, augmented with a one-time human assistance to solve difficult cases.
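To make the notion of a DMOS alert concrete, the time-signature consistency check mentioned above can be sketched as follows. This is an illustrative assumption, not the DMOS implementation, which is grammar-based; the function name, the representation of durations as fractions of a whole note, and the wording of the message are all hypothetical.

```python
from fractions import Fraction


def check_measure(durations, beats, beat_unit):
    """Return None if one voice fills the measure exactly, else a warning string.

    durations: note and rest durations of a single voice, each expressed
               as a fraction of a whole note (e.g. Fraction(1, 8) for an eighth).
    beats, beat_unit: the time signature, e.g. 4 and 4 for 4/4.
    """
    expected = Fraction(beats, beat_unit)
    actual = sum((Fraction(d) for d in durations), Fraction(0))
    if actual == expected:
        return None
    return (f"measure lasts {actual} of a whole note, "
            f"time signature expects {expected}")


# A 4/4 measure in which one quarter note was read as an eighth note.
alert = check_measure(
    [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 8)],
    beats=4,
    beat_unit=4,
)
```

Exact rational arithmetic avoids spurious alerts caused by floating-point rounding, which matters once dotted notes and tuplets are involved.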
This limits the competency expected from users, as well as the complexity of the required actions since they essentially consist in answering a question. This choice also provides a sound basis to evaluate the performance of DMOS: Given a ground truth, we can compare it first to the raw output, and second, to the corrected one, identifying the impact of human interaction on the final quality.

The alerts raised by DMOS are classified into three categories, based on how globally the potential error may impact the resulting score. These categories result in three correction phases:
• The first one, called Instrumentation, refers to the identification of music parts, and to the correct assignment of staves to the parts. Any error on these structural aspects has a dramatic impact on the whole score. This is the case for instance of a double-staff piano part not recognized as such, or when some parts are introduced/removed from one system to the other (e.g., a solo/melody arriving after an instrumental introduction, resulting in the introduction of a new staff in systems). Special cases difficult to identify automatically (e.g., transposing instrument) can also be solved during this step.
• The second one, Transcription context, refers to all the notation elements that dictate the transcription of music events: clefs, key signatures and time signatures. Here again, any misinterpretation severely hinders the music notation accuracy.
• Finally, the last phase, Music objects, addresses the notation of musical events: notes, chords, rests, ties. At this point, the user can locally correct a property of a faulty music object: duration, pitch, etc.

Footnote 2: Note that it always remains possible to import the MEI or MusicXML output in a standard score engraver.

Fig. 5. The collaborative process, phase 1: checking parts and their staves

For each phase, a list of microtasks is produced, and submitted to a group of users.
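The grouping of alerts into phase-specific microtask lists can be sketched as follows. The alert format (a dictionary with a "kind" key), the kind names and the mapping table are assumptions made for the illustration; the actual alert vocabulary of DMOS is not specified here.

```python
from collections import defaultdict

# Phases in the order in which they are submitted to users.
PHASES = ("instrumentation", "transcription_context", "music_objects")

# Hypothetical alert kinds, mapped to the phase that handles them.
PHASE_OF_KIND = {
    "part": "instrumentation",
    "staff_assignment": "instrumentation",
    "clef": "transcription_context",
    "key_signature": "transcription_context",
    "time_signature": "transcription_context",
    "note": "music_objects",
    "chord": "music_objects",
    "rest": "music_objects",
    "tie": "music_objects",
}


def microtasks_by_phase(alerts):
    """Group alerts into one microtask list per phase, in phase order."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[PHASE_OF_KIND[alert["kind"]]].append(alert)
    return [(phase, grouped[phase]) for phase in PHASES]


# Usage with three toy alerts.
alerts = [
    {"kind": "note", "measure": 12},
    {"kind": "clef", "measure": 1},
    {"kind": "staff_assignment", "system": 2},
]
plan = microtasks_by_phase(alerts)
```

Processing the phases in this fixed order reflects the idea that the most global errors are settled before more local ones are examined.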
At the end of each phase, the list of validated corrections is applied to the score, and this corrected version is proposed to the following phase.

Fig. 5 shows an example of the user interface dedicated to the first phase (Instrumentation). It heavily relies on information obtained from the DMOS analysis, which comes as the default interpretation. Here, the list of parts (chant and piano) has been identified, and each staff (or pair of staves) assigned to a part. The user can correct this information if needed.

The subsequent phases imply a display of both the initial image and the score for comparison purposes (see Fig. 6 for phase 2). Elements to be controlled (here, clefs and signatures) can be highlighted on both the image and the target score, thanks to the regions provided by the OMR and to the links between both sources. We implemented an interface that lets the user directly correct an object (a clef, Fig. 6), each action being immediately reported on the score.

Fig. 6. The collaborative process, phase 2: checking the transcription context

At the time of writing, we are finalizing the implementation of the collaborative system. It is based on the open-source Callico system [14] and available at https://collabscore.cnam.fr. An experiment will be conducted in early 2025 with a group of users on a large corpus to be described next.

C. The reference corpus

The reference corpus comprises all the works by Camille Saint-Saëns (1835-1921) with the exception of dramatic works (operas, oratorios, incidental music). Aside from considerations relating to the BnF’s promotion policy – COLLABSCORE coincided with a project to promote the composer’s work on the occasion of the hundredth anniversary of his death in 2021 – two criteria prevailed in the selection of this corpus, which totals more than 500 compositions.

1) Variety of genres and instrumentation.
The compositions include sacred and secular works for a cappella choir, chamber music, melodies for voice and piano, compositions for brass or military bands, keyboard repertoire and symphonic works with or without solo instruments. This diversity allows the software solution to be tested in different situations that present particular challenges, such as cross-staff notation in piano works, transposing instruments in orchestral works, or syllable positioning in melodies.

2) Particularities of French printed music from the period 1850-1920. These scores, which have been made available by the BnF on Gallica, differ from modern, standardised notation with regard to their implicit features (e.g. triplet notation), special signs (crotchet rests), complexities relating to the placement of the text and the presence of artifacts in the preserved scores. Thus, we see this case study as an appropriate starting point for follow-up projects dedicated to printed music from earlier periods and handwritten notation.

For all the items, MEI files were created containing MEI headers with metadata extracted from Gallica including title, date of creation, genre, authorial attribution(s), historical print identifier, location and physical description. A sample of 18 scores was then transcribed in full, either manually or using commercial software (PhotoScore) with post-correction.

The reference corpus will serve as a ground truth to evaluate the performance of DMOS (for raw output) and of the collaborative phases (for user-corrected output). OMR evaluation is a notably difficult task [15]–[17] and we hope to contribute to progress in this field. We started using the MusicDiff tool, designed by one of the project’s partners [18] and now available as a Python package (footnote 3), but additional work is required with the OMR community to achieve a commonly accepted yardstick.

Footnote 3: https://github.com/gregchapman-dev/musicdiff

III.
SOURCES ALIGNMENT AND SYNCHRONISATION

Once obtained, the pivot score can be aligned with multimedia sources. We tailored the Dezrann platform [19] of our partner Algomus to propose tools for synchronization and synchronized score playback.

Regarding images, as shown in Fig. 4, we can rely on the bounding box supplied by DMOS for each detected symbol, but also for all the measures, staves and systems. We link this region to the corresponding element ID in the pivot document.

Aligning with recordings (audio or video) involves identifying the time frame at the finest possible temporal granularity (we target the beat level). The fields of audio-score alignment and score following are actively researched [20]–[22]. Common methods involve dynamic time-warping algorithms or, more recently, deep learning approaches. In particular when sections are repeated, user interaction is often necessary to achieve a satisfying correspondence. We designed a simple interface to let users add and update alignment timestamps [19].

Finally, as a demonstration of the potential of our work to promote the content of digital libraries to a wide audience, COLLABSCORE proposes an interface where the sources of a multimodal score can be displayed simultaneously for an improved user experience. Fig. 7 shows how the original Gallica image, the pivot score and a YouTube recording can be associated, exhibiting at any moment a close correspondence between the performance, the notation, and the original image.

Fig. 7. COLLABSCORE interface showing three synchronized sources on La Coccinelle with the Dezrann libraries: the original image, the pivot score, and a YouTube performance.

IV. CONCLUSION

The COLLABSCORE project addresses many challenges in modeling and interlinking multimodal documents related to music, and has already required considerable effort to achieve its current state in OMR, collaborative process, score synchronization and playback.
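The synchronized playback described in Section III can be pictured as a lookup from playback time to the pivot element being played. The sketch below assumes that alignment timestamps are stored as a sorted list of (start time in seconds, element id) pairs; this format, the function name and the measure ids are hypothetical and are not taken from Dezrann.

```python
from bisect import bisect_right


def element_at(timestamps, t):
    """Return the id of the pivot element playing at time t (seconds).

    timestamps: list of (start_seconds, element_id), sorted by start time.
    Returns None when t lies before the first timestamp.
    """
    starts = [start for start, _ in timestamps]
    index = bisect_right(starts, t) - 1
    return timestamps[index][1] if index >= 0 else None


# Three measures with user-provided start times.
alignment = [(0.0, "m1"), (2.5, "m2"), (5.0, "m3")]
current = element_at(alignment, 3.0)
```

Once the element id is known, the image region linked to the same id can be highlighted, which is what ties the recording to the original scan.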
Each aspect would obviously deserve a much more detailed presentation and require further research and development, but we believe the results obtained so far are very promising. We are keen to showcase the COLLABSCORE project to the community, and to obtain informed feedback in return.

Footnote 3: https://gallica.bnf.fr/ark:/12148/bpt6k1162049x

REFERENCES

[1] S. Cherfi, C. Guillotel, F. Hamdi, P. Rigaux, and N. Travers, “Ontology-Based Annotation of Music Scores,” in Intl. Conf. on Knowledge Capture (K-CAP’17), Austin, Texas, Dec. 4-6 2017.
[2] R. Sanderson, P. Ciccarese, and B. Young, “Web annotation data model,” W3C Recommendation, 23 February 2017.
[3] J. Calvo-Zaragoza, J. Hajic, and A. Pacha, “Understanding optical music recognition,” ACM Computing Surveys (CSUR), vol. 53, pp. 1–35, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:199543265
[4] C. Saitis, A. Hankinson, and I. Fujinaga, “Correcting large-scale OMR data with crowdsourcing,” in 1st International Workshop on Digital Libraries for Musicology. ACM, 2014, pp. 1–3.
[5] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marçal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, vol. 1, pp. 173–190, 2012.
[6] L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann, “The DeepScoresV2 dataset and benchmark for music object detection,” in Proc. ICPR, 2021, pp. 9188–9195.
[7] Y. Zhang, Z. Huang, Y. Zhang, and K. Ren, “A detector for page-level handwritten music object recognition based on deep learning,” Neural Comput. Appl., 2023.
[8] J. Mayer, M. Straka, J. Hajič, and P. Pecina, “Practical end-to-end optical music recognition for pianoform music,” in Document Analysis and Recognition - ICDAR 2024, E. H. Barney Smith, M. Liwicki, and L. Peng, Eds.
Cham: Springer Nature Switzerland, 2024, pp. 55–73.
[9] A. Ríos-Vila, J. Calvo-Zaragoza, and T. Paquet, “Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription,” in Document Analysis and Recognition - ICDAR 2024, E. H. Barney Smith, M. Liwicki, and L. Peng, Eds. Cham: Springer Nature Switzerland, 2024, pp. 20–37.
[10] B. Coüasnon, “DMOS, a generic document recognition method: Application to table structure analysis in a general and in a specific way,” International Journal on Document Analysis and Recognition (IJDAR), vol. 8(2), pp. 111–122, 2006.
[11] A. Yesilkanat, Y. Soullard, B. Coüasnon, and N. Girard, “Full-page music symbols recognition: state-of-the-art deep models comparison for handwritten and printed music scores,” in DAS 2024 Workshop on Document Analysis System, Sep. 2024.
[12] C. Queguiner, J. Camillerapp, and I. Leplumey, “Kalman Filter Contributions Towards Document Segmentation,” in ICDAR 1995 Third International Conference on Document Analysis and Recognition, Montreal, Canada, Aug. 1995, pp. 765–769.
[13] O. Kodym and M. Hradis, “Page layout analysis system for unconstrained historic documents,” CoRR, vol. abs/2102.11838, 2021.
[14] C. Kermorvant, E. Bardou, M. Blanco, and B. Abadie, “Callico: A versatile open-source document image annotation platform,” in Document Analysis and Recognition - ICDAR 2024: 18th International Conference, Athens, Greece, August 30 – September 4, 2024, Proceedings, Part III. Berlin, Heidelberg: Springer-Verlag, 2024, pp. 338–353. [Online]. Available: https://doi.org/10.1007/978-3-031-70543-4_20
[15] D. Byrd and J. G. Simonsen, “Towards a standard testbed for optical music recognition: Definitions, metrics, and page images,” Journal of New Music Research, vol. 44, no. 3, pp. 169–195, 2015.
[16] J. j. Hajič, “A case for intrinsic evaluation of optical music recognition,” in 1st International Workshop on Reading Music Systems, J. Calvo-Zaragoza, J. H. jr., and A.
Pacha, Eds., Paris, France, 2018, pp. 15–16. [Online]. Available: https://sites.google.com/view/worms2018/proceedings
[17] P. Torras, S. Biswas, and A. Fornés, “A unified representation framework for the evaluation of Optical Music Recognition systems,” International Journal of Document Analysis and Recognition (IJDAR), vol. 27, no. 3, pp. 379–393, 2024.
[18] F. Foscarin, F. Jacquemard, and R. Fournier-S’niehotta, “A diff procedure for music score files,” in 6th International Conference on Digital Libraries for Musicology, 2019, pp. 58–64.
[19] L. Garczynski, M. Giraud, E. Leguy, and P. Rigaux, “Modeling and editing cross-modal synchronization on a label web canvas,” 2022.
[20] M. Dorfer, F. Henkel, and G. Widmer, “Learning to listen, read, and follow: Score following as a reinforcement learning game,” in Proceedings of International Conference on Music Information Retrieval (ISMIR), 2018.
[21] J. Thickstun, J. Brennan, and H. Verma, “Rethinking evaluation methodology for audio-to-score alignment,” arXiv preprint arXiv:2009.14374, 2020.
[22] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger, “Sync Toolbox: A Python package for efficient, robust, and accurate music synchronization,” Journal of Open Source Software (JOSS), vol. 6, no. 64, pp. 3434:1–4, 2021.

Semi-Automatic Annotation of Chinese Suzipu Notation Using a Component-Based Prediction and Similarity Approach

Tristan Repolusk∗† and Eduardo Veas∗†
∗Graz University of Technology, †Know Center Research GmbH
Email: †trepolusk@know-center.at, ∗eveas@tugraz.at
ORCID: 0009-0009-8435-1185, 0000-0002-0356-4034

Abstract—In recent years, the development of Optical Music Recognition (OMR) has progressed significantly. However, historical or smaller music cultures have only recently been considered in this process. This also includes Chinese music notations such as suzipu.
In this paper, a component-based clustering and similarity approach for this kind of notation is introduced. Its goal is to facilitate and automate the manual annotation of such notation instances through artificial intelligence, resulting in a more efficient digitization of machine-readable digital archives which can also be used for OMR applications. The suzipu notation studied in this work is taken from the KuiSCIMA dataset, containing Jiang Kui’s influential collection Baishidaoren Gequ 白石道人歌曲 from 1202. This contribution serves as the basis for the further development of OMR algorithms acting on suzipu and similar kinds of notations, thus fostering the dissemination, preservation and computational analysis of historical Chinese music.

Index Terms—Chinese music, Jiang Kui, optical music recognition, suzipu, banzipu, Baishidaoren Gequ

I. Introduction

The field of OMR is similar to optical character recognition (OCR) regarding the extraction of information that is available as optical data. However, recovering the musical semantics (such as pitch, onset, duration, velocity) is a crucial step and often involves implicit rules. This makes OMR tasks extremely challenging [3]. A well established OMR pipeline comprises phases of preprocessing, symbol recognition, notation assembly, and encoding [12, 3]. OMR systems are at best incomplete with regard to fulfilling all transcription phases when transcribing music scores, therefore inviting the use of at least partially manual workflows [16]. This is perhaps even more accentuated when dealing with handwritten scores [2]. The majority of the works concern common practice period music scores, and most approaches do not support notations of other kinds of musical traditions.

Baishidaoren Gequ is an important work in the history of Chinese music.
It is a compilation of the works of Jiang Kui 姜夔, a renowned poet, calligrapher, and music theorist of the Southern Song dynasty (1127-1279 CE), also known by his courtesy name Baishi 白石. This collection is one of the earliest surviving examples of melodized lyrics in Chinese history [9], reflecting the sophisticated musical culture of the Southern Song period. It provides researchers with valuable information about the musical practices and aesthetics of the Song dynasty, making it an important resource for studying the history of Chinese music. The collection primarily consists of ci poetry set to music, covering various themes including nature, emotions, and historical events.

17 out of the 109 pieces featured in Baishidaoren Gequ are endowed with the suzipu 俗字谱 (literal meaning: common character notation) notation, also known as banzipu 半字谱 (literal meaning: half character notation). This kind of notation was especially common in China during the Song dynasty (960–1279), with the 17 pieces in Baishidaoren Gequ being the largest historical source of this notation. Five handwritten editions of suzipu notation with optical annotations are contained in the publicly available KuiSCIMA dataset [13] (see footnote 1). Some contemporary musical practices such as Xi’an Guyue 西安鼓乐 also use related notations [7].

In the context of suzipu, many challenges arise for OMR:
• The notations consist of a pitch and a secondary component that can be realized in notation by different kinds of compositions (such as top-bottom or left-right).
• The number of existing labeled samples is scarce and some symbols appear rarely.
• The music notations may have implicit relationships to the poetry that is accompanying them.
• Only a handful of experts worldwide have deep insights into the musical semantics of the score.
Therefore, in this work the manual digitization is facilitated through AI techniques that are embedded into the graphical user interface of the Chinese Musical Annotation Tool [14], thus giving rise to a semi-automated annotation approach that is designed to guide and support the human annotator.

¹ https://github.com/SuziAI/KuiSCIMA

II. Related Works

The simultaneous recognition and encoding of music scores has been shown to be feasible, albeit only for notation types for which large annotated datasets exist [15, 10]. A major problem is the lack of large volumes of data for training and testing OMR models, which can be overcome with a semi-automated human-in-the-loop approach [6, 8].

Handwritten scores pose additional challenges due to nuances inherent in individual handwriting. MuRET (Music Recognition, Encoding, and Transcription), designed for music transcription and OMR, focuses on repertoires of handwritten monodic melody scores of traditional Spanish music and white mensural notation from 16th- to 18th-century manuscripts [16].

Regarding Chinese notations, an OMR architecture was designed and evaluated on 100 songs from a Chinese songbook with a regular structure of monophonic scores in the Chinese number notation jianpu 简谱 [17]. Two other works focus on gongche notation in Kunqu opera: [5] compares different algorithms for the recognition of gongche pitches, while [4] focuses on extracting semantic information, taking into account the spatial structures of Kunqu opera pieces.

III. Methods

Suzipu notation is characterized by two properties: a pitch component indicating the syllable’s pitch, and an optional secondary component providing rhythmical and ornamentation information [13]. The individual components with their machine-readable representations are found in Table I.
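
The two-component structure described above can be illustrated with a small data model. The following is a minimal sketch and not the representation used by the annotation tool or the dataset: the class name `SuzipuSymbol` and the helper `all_symbol_classes` are hypothetical, and only the ASCII names are taken from Table I.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional

# ASCII representations exactly as listed in Table I.
PITCHES = ("HE", "SI", "YI", "SHANG", "GOU", "CHE",
           "GONG", "FAN", "LIU", "WU", "GAO_WU")
SECONDARIES = ("DA_DUN", "XIAO_ZHU", "DING_ZHU", "DA_ZHU", "ZHE", "YE")


@dataclass(frozen=True)
class SuzipuSymbol:
    """Hypothetical container: one pitch plus an optional secondary component."""
    pitch: str
    secondary: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject names that are not part of Table I.
        if self.pitch not in PITCHES:
            raise ValueError(f"unknown pitch component: {self.pitch!r}")
        if self.secondary is not None and self.secondary not in SECONDARIES:
            raise ValueError(f"unknown secondary component: {self.secondary!r}")


def all_symbol_classes() -> List[SuzipuSymbol]:
    """Enumerate every pitch/secondary combination, including 'no secondary'."""
    secondary_options = (None,) + SECONDARIES
    return [SuzipuSymbol(p, s) for p, s in product(PITCHES, secondary_options)]
```

With 11 pitch components and 7 secondary options (6 components plus "none"), the enumeration yields 11 × 7 = 77 combinations, which is why factoring the label into two small classifiers is attractive when many combinations are rare.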
In this section, the three intelligent user interface components facilitating the manual annotation of suzipu notation instances are introduced. The user interface can be seen in Figure 1. Button (1) in the GUI starts the automatic OMR prediction, and button (2) can be used to overwrite the annotations in the tool with the model predictions. Panels (3) and (4) provide additional context: (3) contains the OMR model predictions with confidence scores, and (4) shows the most similar notations with respect to their optical features. Panel (5) displays instances in KuiSCIMA that have the same annotation as the one assigned to the symbol by the user.

A. Suzipu Classification

For the prediction, the currently available state-of-the-art OMR model introduced in [13] is used. The classifier consists of two small convolutional neural networks leveraging the special structure of suzipu notation. The first classifier handles the pitch component, while the second classifier deals with the notation’s secondary component. Similar methods have already been used successfully in settings where images share features that can be described by product spaces, e.g. the Ethiopic script in [1].

TABLE I: The machine-readable representations of each of the 11 pitch and 6 secondary components of suzipu notation. The representation is the capitalized pinyin realization of the symbol’s name without tone marks.

Suzipu (Pitch) Name      ASCII Representation
合                       "HE"
四                       "SI"
一                       "YI"
上                       "SHANG"
勾                       "GOU"
尺                       "CHE"
工                       "GONG"
凡                       "FAN"
六                       "LIU"
五                       "WU"
高五                     "GAO_WU"

Suzipu (Secondary) Name  ASCII Representation
大顿                     "DA_DUN"
小住                     "XIAO_ZHU"
丁住                     "DING_ZHU"
大住                     "DA_ZHU"
折                       "ZHE"
拽                       "YE"

Fig. 1: The Suzipu Intelligent Assistant window allows for visualization and automatic labeling of notation instances.

This decomposition is meaningful in settings of limited or imbalanced data, as is the case with the KuiSCIMA dataset.
It is quite small, with 7212 non-empty notation annotations, and highly imbalanced: of the 77 classes, only 66 occur. This makes OMR on this kind of notation especially challenging. The details of the training and validation processes for the classifier are found in [13]. On unseen test data (consisting of Baishidaoren Gequ’s Shanghai MS edition), the best model combination achieves a character error rate (CER) of 10.4%. Therefore, a user needs to correct only about every tenth model prediction instead of manually annotating every single instance.

To make the OMR algorithm more interpretable to users of the GUI, the classifiers are calibrated using temperature scaling on their respective validation sets. For the pitch classifier, the temperature is trained for 10 epochs with an SGD learning rate of 0.01, resulting in a temperature of 1.4574. The secondary classifier’s temperature is trained with a learning rate of 0.005, yielding a temperature of 1.2430. The reliability plots in Figure 2 show that the calibrated classifiers are well calibrated for confidences greater than 50%. Since the test dataset is quite small, with 1439 instances, a good calibration on the whole confidence scale is not feasible. In (3), the model predictions are displayed with the corresponding confidence scores.

B. Similarity Visualization

As additional guidance for the user, the similarity visualization in (4) displays the three most similar notation instances found in the KuiSCIMA dataset with respect to their optical features.

For each of the two component classifiers, the output of the layer fc2 (a 120-dimensional vector) is extracted as a feature encoding of the respective property. The features are collected for each notation in the dataset, and an unsupervised UMAP [11] dimensionality reduction to 2D is applied with a random_state of 42.
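
As a brief aside on the calibration step of Section III-A, temperature scaling divides the classifier logits by a single scalar T before the softmax and fits T by minimizing the negative log-likelihood on held-out data. The sketch below is an illustration under stated assumptions, not the authors’ training code: it uses plain full-batch gradient descent instead of an SGD optimizer, and the function names `softmax`, `mean_nll`, and `fit_temperature` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(r, temperature)[y]) for r, y in zip(rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(rows: Sequence[Sequence[float]], labels: Sequence[int],
                    lr: float = 0.01, epochs: int = 10) -> float:
    """Fit T by full-batch gradient descent on the mean NLL (simplified sketch)."""
    t = 1.0
    for _ in range(epochs):
        grad = 0.0
        for logits, y in zip(rows, labels):
            probs = softmax(logits, t)
            expected = sum(p * z for p, z in zip(probs, logits))
            # d NLL / dT = (z_y - E_p[z]) / T^2 for a single sample.
            grad += (logits[y] - expected) / (t * t)
        # Keep T strictly positive.
        t = max(t - lr * grad / len(labels), 1e-3)
    return t
```

A temperature above 1, such as the values reported above, flattens the softmax output, so an overconfident classifier reports lower confidence scores while its predicted class stays the same.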
Visualizations of the UMAP spaces with dataset samples are found in Figure 3. The UMAP representation of the currently investigated notation instance is compared against the precomputed UMAP representation of all instances in the KuiSCIMA dataset, and the three nearest neighbors are retrieved using K-means. The displayed similarity score is calculated as the inverse Euclidean distance between the current instance and the neighbor.

Fig. 2: Reliability scores for the individual OMR classifiers indicate a good calibration for confidences greater than 50%. [Panels: Suzipu Pitch (Uncalibrated), Suzipu Pitch (Calibrated), Suzipu Secondary (Uncalibrated), Suzipu Secondary (Calibrated); each panel shows a reliability diagram (accuracy over confidence) and a confidence histogram (number of samples over confidence).]

C. Display of Instances with Same Annotation

In (5), KuiSCIMA dataset instances with the same annotation as the currently investigated notation instance are displayed, which is useful to validate the currently annotated sample against already annotated instances featuring the same label, and no optical features are used for this. However, since some classes are very frequent and occur more than 800 times, not all instances can be displayed, and an intelligent selection is made.
In the case that an annotation occurs up to 39 times in the dataset, this selection consists of all of its instances. However, if there are more than 39 instances with this annotation in the dataset, the raw 28x28 pixel images are grouped into 39 clusters using K-means. From each of those clusters, the first element is chosen as a representative. With this method, the possibly large number of samples is reduced to a small selection of diverse samples in pixel space. The images for each annotation class are pre-generated and loaded when starting the tool to reduce loading times.

Fig. 3: UMAP embedding space visualization for both the pitch and secondary component classifiers. The second and fourth images show the space with some dataset examples. [Panels: Suzipu Pitch Embeddings (UMAP), with classes HE, SI, YI, SHANG, GOU, CHE, GONG, FAN, LIU, WU, GAO_WU; Suzipu Secondary Embeddings (UMAP), with classes None, DA_DUN, XIAO_ZHU, DING_ZHU, DA_ZHU, ZHE, YE.]

IV. Conclusion and Future Work

In this paper, an extension of the Chinese Musical Annotation Tool for suzipu musical notation was introduced for the semi-automatic collection of OMR datasets and ethnomusicological archives. This fosters the dissemination and preservation of cultural heritage and lays the foundation for computational studies of suzipu. Since the involved datasets and the annotation tool are open source and publicly available², a great impact on these areas of research is expected.
In order to provide a good user experience and make the annotation of suzipu as easy as possible, the tool is endowed with intelligent systems: OMR algorithms for the automated annotation of notation symbols, a similarity visualization to discover dataset samples that exhibit similar optical features, and a clustering of all annotation instances that occur in the KuiSCIMA dataset. In the case of suzipu notation OMR, a user needs to correct only about every tenth model prediction instead of manually annotating every single instance. Model calibration allows the user to interpret the confidence of the model’s predictions. With this, a significant reduction of human effort is achieved.

For future work, we propose the following directions:
1) Conducting user studies for a practical evaluation of the annotation tool involving multiple human annotators.
2) Developing suitable OMR algorithms for the other two kinds of musical notation that appear in Baishidaoren Gequ (lülüpu 律吕谱 and jianzipu 减字谱) and integrating them into the GUI.
3) Incorporating other kinds of Chinese or related musical notations, such as gongchepu 工尺谱 as it is used in the Chinese music theaters Kunqu 昆曲 or Jyutkek 粵劇, or even extending the tool to Japanese or Korean musical notations.
4) Creating an educational intelligent user interface for interactive teaching and learning of ancient Chinese music notations.

References

[1] Birhanu Belay et al. “Factored Convolutional Neural Network for Amharic Character Image Recognition”. In: 2019 IEEE International Conference on Image Processing (ICIP). 2019, pp. 2906–2910. doi: 10.1109/ICIP.2019.8804407.
[2] Manuel Burghardt and Sebastian Spanner. “Allegro: User-centered Design of a Tool for the Crowdsourced Transcription of Handwritten Music Scores”. In: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. DATeCH2017. Göttingen, Germany: Association for Computing Machinery, 2017, pp. 15–20. isbn: 9781450352659.
doi: 10.1145/3078081.3078101. url: https://doi.org/10.1145/3078081.3078101.

² https://github.com/SuziAI

[3] Jorge Calvo-Zaragoza, Jan Hajič Jr., and Alexander Pacha. “Understanding Optical Music Recognition”. In: ACM Comput. Surv. 53.4 (July 2020). issn: 0360-0300. doi: 10.1145/3397499. url: https://doi.org/10.1145/3397499.
[4] Gen-Fang Chen. “Music sheet score recognition of Chinese Gong-che notation based on Deep Learning”. In: 2021 International Conference on Big Data Analysis and Computer Science (BDACS). 2021, pp. 183–190. doi: 10.1109/BDACS53596.2021.00048.
[5] Gen-Fang Chen and Jia-Shing Sheu. “An optical music recognition system for traditional Chinese Kunqu Opera scores written in Gong-Che Notation”. In: EURASIP Journal on Audio, Speech, and Music Processing (2014), pp. 7–17. doi: 10.1186/1687-4722-2014-7.
[6] Liang Chen, Rong Jin, and Christopher Raphael. “Human-Guided Recognition of Music Score Images”. In: Proceedings of the 4th International Workshop on Digital Libraries for Musicology. DLfM ’17. Shanghai, China: Association for Computing Machinery, 2017, pp. 9–12. isbn: 9781450353472. doi: 10.1145/3144749.3144752. url: https://doi.org/10.1145/3144749.3144752.
[7] Yu Cheng. “Xi’an Guyue – Xi’an Old Music in New China. ‘Living fossil’ or ‘flowing river’?” Dissertation. School of Oriental and African Studies, University of London, 2005. url: https://eprints.soas.ac.uk/29336/1/10731431.pdf (visited on 08/03/2023).
[8] Stanisław Graczyk et al. “An Online Tool for Semi-Automatically Annotating Music Scores for Optical Music Recognition”. In: Proceedings of the 11th International Conference on Digital Libraries for Musicology. DLfM ’24. Stellenbosch, South Africa: Association for Computing Machinery, 2024, pp. 73–77. isbn: 9798400717208. doi: 10.1145/3660570.3660571. url: https://doi.org/10.1145/3660570.3660571.
[9] Joseph S. C. Lam. “Ci Songs From the Song Dynasty: A Ménage à Trois of Lyrics, Music, and Performance”. In: New Literary History 46.4 (2015), pp. 623–646. issn: 00286087, 1080661X. url: http://www.jstor.org/stable/24772762 (visited on 08/02/2023).
[10] Aozhi Liu et al. “Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores”. In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. MMPT ’21. Taipei, Taiwan: Association for Computing Machinery, 2021, pp. 23–27. isbn: 9781450385305. doi: 10.1145/3463945.3469056. url: https://doi.org/10.1145/3463945.3469056.
[11] Leland McInnes et al. “UMAP: Uniform Manifold Approximation and Projection”. In: Journal of Open Source Software 3.29 (2018), p. 861. doi: 10.21105/joss.00861. url: https://doi.org/10.21105/joss.00861.
[12] Ana Rebelo et al. “Optical music recognition: state-of-the-art and open issues”. In: International Journal of Multimedia Information Retrieval 1.3 (2012), pp. 173–190. doi: 10.1007/s13735-012-0004-6. url: https://doi.org/10.1007/s13735-012-0004-6.
[13] Tristan Repolusk and Eduardo Veas. “The KuiSCIMA Dataset for Optical Music Recognition of Ancient Chinese Suzipu Notation”. In: Document Analysis and Recognition - ICDAR 2024. Ed. by Elisa H. Barney Smith, Marcus Liwicki, and Liangrui Peng. Cham: Springer Nature Switzerland, 2024, pp. 38–54. doi: 10.1007/978-3-031-70552-6_3.
[14] Tristan Repolusk and Eduardo Veas. “The Suzipu Musical Annotation Tool for the Creation of Machine-Readable Datasets of Ancient Chinese Music”. In: Proceedings of the 5th International Workshop on Reading Music Systems (WoRMS). Ed. by Jorge Calvo-Zaragoza, Alexander Pacha, and Elona Shatri. Milan, Italy, 2023, pp. 7–11. doi: 10.48550/arXiv.2311.04091. url: https://sites.google.com/view/worms2023/proceedings.
[15] Antonio Ríos-Vila, Jorge Calvo-Zaragoza, and David Rizo.
“Evaluating Simultaneous Recognition and Encoding for Optical Music Recognition”. In: Proceedings of the 7th International Conference on Digital Libraries for Musicology. DLfM ’20. Montréal, QC, Canada: Association for Computing Machinery, 2020, pp. 10–17. isbn: 9781450387606. doi: 10.1145/3424911.3425512. url: https://doi.org/10.1145/3424911.3425512.
[16] David Rizo, Jorge Calvo-Zaragoza, and José M. Iñesta. “MuRET: a music recognition, encoding, and transcription tool”. In: Proceedings of the 5th International Conference on Digital Libraries for Musicology. DLfM ’18. Paris, France: Association for Computing Machinery, 2018, pp. 52–56. isbn: 9781450365222. doi: 10.1145/3273024.3273029. url: https://doi.org/10.1145/3273024.3273029.
[17] Fu-Hai Frank Wu and Jyh-Shing Roger Jang. “An Architecture for Optical Music Recognition of Numbered Music Notation”. In: Proceedings of International Conference on Internet Multimedia Computing and Service. ICIMCS ’14. Xiamen, China: Association for Computing Machinery, 2014, pp. 241–245. isbn: 9781450328104. doi: 10.1145/2632856.2632930. url: https://doi.org/10.1145/2632856.2632930.

OMR on Early Music Sources at the Bavarian State Library with MuRET – Prototyping, Automating, Scaling

Janosch Umbreit, M.A.
Bavarian State Library, Music Department
Munich, Germany
janosch.umbreit@bsb-muenchen.de

Silvana Schumann, M.A.
Bavarian State Library, Music Department
Munich, Germany
silvana.schumann@bsb-muenchen.de

Abstract—We describe the application of the OMR software MuRET to a corpus of mensural printed and handwritten music for the purpose of making a collection of music sources held at the Bavarian State Library searchable via the application musiconn.scoresearch.
We focus on our workflow with MuRET, describe the improvements and roadblocks we have noticed, and discuss coming challenges concerning automation and batch processing of such a heterogeneous dataset.

Index Terms—Optical Music Recognition, Mensural Notation, MuRET

musiconn is generously supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under the project number 249121324.

I. INTRODUCTION

The digital availability of ever more musical sources from libraries and archives creates a growing desire for efficient and granular search options, not only of metadata, but also of the musical texts themselves. This desire for the searchability of music texts presents institutions with a twofold challenge: on the one hand, search entry points must be created that allow search queries to be formulated both expressively and intuitively, and on the other hand, the music sources must not only be digitized, but their content must be recognized as accurately as possible and made machine-readable. The development of an automated OMR workflow based on MuRET – an OMR application developed by Prof. David Rizo (University of Alicante) [1] – for the search portal musiconn.scoresearch showcases the particular difficulties of this task when working with a corpus of early modern mensural music. The aim of the collaboration – to create a robust OMR model that can reliably process sources of varying quality from the 16th and 17th centuries and to automate the recognition of musical characters across well over 1,000 sources – illustrates both the strides OMR is making and the challenges that heterogeneous and idiosyncratic sources such as mensural choir- and partbooks pose for the OMR itself, as well as for its automation.

II. AIM AND SCOPE OF THE PROJECT

musiconn.scoresearch¹ is an interface that facilitates the search for digitized music notation based on optical music recognition (OMR) and a search mask that allows users to search for melodies via a digital keyboard. It is currently being developed by the Specialized Information Service Musicology (FID Musikwissenschaft – musiconn) at the Bavarian State Library [2]. At present, the corpus of searchable music consists of roughly 159,000 scanned pages of works by selected composers of the 18th and 19th centuries, such as Beethoven, Händel, and Schubert, as well as the first two series of the “Denkmäler Deutscher Tonkunst” and older music prints from the Schott archive.

For the next project phase (2024–2026), one of the goals for musiconn.scoresearch is to expand the searchable repertoire by including older music sources from the 16th and 17th centuries, including over 1,800 printed partbooks and 75 choir books in manuscript form held at the Bavarian State Library. In order to achieve high-quality OMR results for this corpus of mostly white mensural notation, which differs substantially from the more recent sources that are already searchable, Prof. David Rizo and musiconn are cooperating to further extend the OMR software MuRET (Music Recognition, Encoding, and Transcription). The goal of this collaboration is to train a robust model for OMR on a representative selection of sources from the corpus, to develop needed additions for MuRET, such as new mensural meters and extended ligatures, and finally to implement a full offline pipeline that includes OMR, layout recognition, and indexing of the resulting MEI files. This approach presents a change from the workflow for scoresearch thus far – which relied on the SmartScore² application and MusicXML output – due to the special requirements of the early modern repertoire.

MuRET itself was initially developed for the HISPAMUS project, which has aims that are very similar to those of musiconn.scoresearch. However, unlike HISPAMUS, our focus lies less with providing material for editors and performers and more with devising a workflow for indexing that can be scaled and automated sufficiently to be applied to the collection held at the Bavarian State Library [3]. The F-TEMPO project, which is likewise concerned with OMR, indexing, and query matching, also acts as a point of reference. The issues with OMR and indexing that the project’s authors describe closely resemble the ones we are encountering [4]. However, while F-TEMPO aims at exposing a broad selection of musical sources through its API, scoresearch is mainly concerned with making the holdings of the Bavarian State Library accessible within the framework of the existing research options offered by musiconn.

¹ https://scoresearch.musiconn.de/ScoreSearch/about?lang=en
² https://www.musitek.de/produkte/smartscore.php

III. WORKFLOW WITH MURET

The training of the OMR model with MuRET follows an iterative approach and comprises three discrete phases: At the beginning, the first page of each source in the training set is processed and then corrected manually. In the second step, the first 20 pages of each source are processed and corrected. Finally, 10 % of each remaining source is processed and corrected.

OMR with MuRET is divided into three individual tasks that are performed for every scan: document analysis, transcription, and semantic interpretation. Document analysis starts with identifying the individual staves on a given page. Once the staves are identified, all symbols on each staff are transcribed diplomatically. Finally, this diplomatic transcription is interpreted semantically. During both the document analysis and the transcription step, the supervisor can intervene and correct MuRET’s classification by redrawing bounding boxes, identifying symbols the model omitted, or changing the values of transcribed symbols.
Over the past months, the project has progressed to the third phase, focusing first on document analysis and now on transcription. In the coming months, more focus will be placed on the semantic interpretation and, based on this work, on the connections between the individual parts that make up pieces and on the assembly of a more automated pipeline from scan to MEI file.

IV. OBSERVATIONS AND IMPROVEMENTS

Concerning the accuracy of MuRET, we can report that the training up until now has yielded significantly improved results: The staff detection is very robust and can correctly process even crooked scans. The accuracy of the transcription has steadily improved as well, with features such as dots, rests of varying length, and flats being recognized much better than at the outset, while pitch recognition has been comparatively solid since the beginning. Duration, on the other hand, can sometimes still pose an issue, for example, when it comes to differentiating between fusae (eighths) and semifusae (sixteenths).

In weighing the performance of MuRET, especially with regard to the ultimate goals of musiconn.scoresearch, we should keep in mind that in OMR, errors in classifying symbols such as the clef or the meter are disproportionately graver than errors of, e.g., pitch, as they propagate throughout whole sections of music. Thankfully, MuRET has yielded consistently good results in recognizing both clefs and meters.

Some features MuRET has to address when transcribing the musiconn.scoresearch corpus are unique to the music publishing industry of the early modern period. For example, sharp accidentals are frequently used instead of naturals. Furthermore, accidentals often do not appear exactly on the same line or space as the note to which they refer.
The architecture of MuRET facilitates dealing with these issues, as they force us to differentiate between the correct recognition and diplomatic transcription of the symbol on the one hand, and its semantic interpretation, which may yield a more standardized representation, on the other.

V. FURTHER DEVELOPMENT

The application of a tool like MuRET to a task such as the expansion of musiconn.scoresearch highlights the special requirements and adjustments that are necessary for OMR on mensural notation: On a fundamental level, mensural notation requires a somewhat different set of glyphs, such as different meters and ligatures. The latter in particular can pose problems, as they can be synthesized in a great number of combinations and require specialized rules for their semantic interpretation. The enhancement of the glyph repertoire and the further development of ligature processing are some of the most imminent tasks in our continuing work with MuRET.

Further considerations include the next steps to be taken in recognizing parts and whole pieces. Early modern polyphonic music differs from common modern western layouts in that voices are typically either split up in partbooks or arranged in choir book format. Of the two, choir books are simpler to assemble from their parts: As voices are arranged in the corners of a given spread of two pages, they cover the same musical duration, and the difficulty lies mainly in distinguishing between the individual parts on a given page. In the case of partbooks, the different voices belonging to a polyphonic composition might be far apart in terms of scanned pages, and their alignment requires a sophisticated understanding of the given musical structures, as well as additional information derived from non-musical texts or other visual cues, such as initials. At present, boundaries between individual pieces can only be designated manually.
Part names, too, can only be assigned by users, although this task can be sped up significantly wherever parts repeat in a predictable pattern, which is often the case for this repertoire. However, in order to automate the OMR and the subsequent export to MEI, more work needs to be done here. Finding a balance between automatic segmentation and labeling on the one hand and human intervention on the other, to ensure valid and meaningful encoding, will be a major task for the further development of the project.

The collaboration between the Bavarian State Library and Prof. Rizo and his team is still in the early stages. However, we can already report very promising results that showcase the great reliability of MuRET in analyzing page structure and recognizing musical characters. We are confident that our ongoing work on the improvement of the model’s performance, as well as the layout recognition, will further advance the automatic processing of early modern music sources, a task that has proven just as intriguing as it is complex.

REFERENCES

[1] J. M. Iñesta, D. Rizo, and J. Calvo-Zaragoza. “MuRET as a Software for the Transcription of Historical Archives.” In Proceedings of the 2nd International Workshop on Reading Music Systems (WoRMS). Delft: 2019, pp. 12–15.
[2] S. P. Achankunju. “Music search engine from noisy OMR data.” In 1st International Workshop on Reading Music Systems. Paris: 2018, pp. 23–24.
[3] J. M. Iñesta, P. J. Ponce de León, D. Rizo, J. Oncina, L. Micó, J. R. Rico-Juan, C. Pérez-Sancho, and A. Pertusa. “HISPAMUS: Handwritten Spanish Music Heritage Preservation by Automatic Transcription.” In Proceedings of the 1st International Workshop on Reading Music Systems. Paris: 2018, pp. 17–18.
[4] T. Crawford, D. Lewis, and A. Porter. 2023. “Exploring Early Vocal Music and Its Lute Arrangements: Using F-TEMPO as a Musicological Tool.”
In Proceedings of the 10th International Conference on Digital Libraries for Musicology (DLfM ’23). Association for Computing Machinery, New York: 2023, pp. 77–81.

OMMR4all revisited – a Semiautomatic Online Editor for Medieval Music Notations

1st Alexander Hartelt
Dep. for Artificial Intelligence and Knowledge Systems
University of Wuerzburg
D-97074 Germany
alexander.hartelt@uni-wuerzburg.de

2nd Frank Puppe
Dep. for Artificial Intelligence and Knowledge Systems
University of Wuerzburg
D-97074 Germany
frank.puppe@uni-wuerzburg.de

Abstract—Five years ago, OMMR4all was presented at the WoRMS 2019 workshop. In this paper we report on advances and new evaluations of the editor for medieval music notations. The main contribution is the handling of lyrics (text) within the chants to be transcribed. The 2019 version of OMMR4all recognized staff lines, layout and symbols but required the syllables of the text to be entered in advance. In addition, the current version recognizes the start and end of a chant within a page, transcribes the text with an automatic alignment to the most similar chant from a chant database, and assigns the symbols to the syllables of the text. Evaluations show a speed-up of the transcription process of medieval square notations by a factor of 6 to 9, compared to a factor of 1.2 to 1.3 for the 2019 version, with a comfortable manual editor used as the baseline.

Index Terms—optical music recognition, web app, medieval manuscripts, square notation, user interface, pipeline

I. INTRODUCTION

A major challenge for optical music recognition is the alignment of music notation and lyrics, which is essential for cultural heritage, where vocal music is prevalent [1]. This is particularly challenging for medieval chants, where the lyrics are written in a style that is very demanding for HTR systems even with fine-tuning (see Fig. 1).
While there are generally few systems dealing with medieval music notations (see Section II), the text recognition and alignment issues are largely circumvented by assuming that the text with its syllables, which are the units for alignment, is given as input, as e.g. in OMMR4all in a WoRMS paper from 2019 [2]. In this paper we report major advances in the OMMR4all system during the last five years and, in particular, how we solve the lyrics transcription and alignment problem. The main idea is to use background knowledge from a large corpus of transcribed chants, e.g. from the Cantus database¹ or our Corpus Monodicum project², to extract a chant text repository and a word dictionary. The chances are high that a new chant to be transcribed uses the same or very similar lyrics as other chants with different music. Together with general advances in deep learning technology for OMR, we achieve a significant speed-up in the transcription process of medieval chants: from a factor of 1.3 reported for OMMR4all-2019, compared to manual transcription used as the baseline, to a factor of 8.9 in the best constellation of the new OMMR4all-2024 system³.

¹ https://cantus.uwaterloo.ca/
² https://corpus-monodicum.de/

II. RELATED WORK

The literature lacks workflows that cover an entire OMR pipeline for historical data. While recent years have witnessed a surge in research on end-to-end OMR workflows [3], [4], these studies predominantly focus on clean and modern datasets. Such works typically rely on sequence-to-sequence architectures trained with the CTC loss for recognition tasks. Moreover, they often limit their scope to symbol recognition, frequently using a musical region as input, thereby necessitating a preceding segmentation step. The combined assignment of symbols, text and syllables is rarely addressed in these studies.

In [5], Fujinaga et al. presented an OMR workflow for processing and encoding medieval music manuscripts used in the SIMSSA [6] project.
It uses a combination of convolutional neural networks, interactive classifiers, and web-based tools to process and encode the music in the MEI format. The workflow includes steps for layout analysis (Pixel.js), symbol classification using an interactive classifier, neume processing, text recognition (using the Calamari OCR engine) and MEI generation. The layout analysis divides the image into distinct components such as staff lines, text, and musical symbols. The symbols are then classified using a web-based k-nearest neighbor classifier. Afterwards the pitch is determined using the output of a CNN. In addition to the text layer, the transcription of the text serves as input for the text recognition. Calamari is then used to calculate text alignment information (position of the text/chars/syllables on the page). The workflow is managed by the workflow management software Rodan4, which allows the entire process to be configured. The resulting encoded music is afterwards merged with metadata from the Cantus Manuscript Database and displayed on a website. The project has already encoded over 8000 chants from medieval manuscripts.

3 A detailed description of the pipeline used in the OMMR4all-2024 system is given in [7].
4 http://ddmal.music.mcgill.ca/e2e-omr-documentation/overview/rodan.html

Fig. 1. Snippet of the split-view component of OMMR4all-2024. Displayed is the transcription result next to the original image. Different overlays (e.g. display of symbols, text, syllables, layout, etc.) can be applied to the original image. Many letters of the lyrics in the original image look alike, posing a great challenge for automatic transcription and syllable separation.

In summary, both OMMR4all and SIMSSA share a common objective: to expedite the transcription of historical documents by providing tools and interfaces that support editors.
While OMMR4all offers a comprehensive interface encompassing all functionalities (pipeline, training, correction), the SIMSSA project leverages numerous separate projects. Moreover, the projects employ different ground truth (GT) types. SIMSSA requires pixel-precise labeling of the original documents, whereas OMMR4all uses polygon-based GT for training staff, layout, and symbol recognition, which significantly accelerates the creation of GT. Beyond additional differences in algorithm functionalities and post-processing, OMMR4all-2024 uniquely supports chant segmentation, both manually and automatically.

III. ADVANCES IN OMMR4ALL

In OMMR4all-2019 the problem of recognizing the handwritten chant texts and assigning them to the note symbols was circumvented by entering the text manually, including syllables, and assigning each syllable consecutively to a neume component. This approach requires a perfect aggregation of symbols into neume components, which needs a manual correction step before the aggregation. In OMMR4all-2024 we experimented with different HTR (Handwritten Text Recognition) engines but did not get satisfactory results due to the chant notation style, where, among other problems, letters such as u, v, n, m, I, r, t, e and c look very similar (compare Fig. 1). However, we had a large corpus of transcribed chants from the Corpus Monodicum project and found that many texts are duplicated within this corpus. Therefore, we take the faulty transcription produced by the HTR engine for a new chant, correct some words with a dictionary generated from the corpus, and then select the most similar chant from the library, which succeeded for more than 80% of the new chants. The assignment of symbol sequences to syllables uses the corrected transcription results, because the position of a syllable usually agrees with the position of the corresponding note symbols. However, there are some exceptions requiring additional knowledge.
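The corpus-based correction described above (dictionary lookup for individual words, then selection of the most similar known chant text) can be illustrated with a minimal sketch. It is not the OMMR4all implementation: the similarity measure (Python's `difflib`), the cutoff value, the function names and the three-entry toy repository are all assumptions made for illustration, and the text does not state which measure the system actually uses.

```python
import difflib


def correct_words(htr_words, dictionary, cutoff=0.6):
    """Replace each HTR word by its closest dictionary entry, if one is close enough.

    The cutoff of 0.6 is a hypothetical value, not taken from the paper.
    """
    corrected = []
    for word in htr_words:
        matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


def most_similar_chant(words, chant_texts):
    """Return (index, similarity) of the repository text closest to the word sequence."""
    query = " ".join(words)
    scores = [difflib.SequenceMatcher(None, query, text).ratio() for text in chant_texts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy repository, a hypothetical stand-in for texts extracted from a chant corpus.
CHANT_TEXTS = [
    "ad te levavi animam meam",
    "puer natus est nobis",
    "resurrexi et adhuc tecum sum",
]
DICTIONARY = sorted({word for text in CHANT_TEXTS for word in text.split()})

# Simulated faulty HTR output with confusions between similar-looking letters (u/n).
htr_output = ["pner", "uatus", "est", "uobis"]
corrected = correct_words(htr_output, DICTIONARY)
index, score = most_similar_chant(corrected, CHANT_TEXTS)
print(corrected, "->", CHANT_TEXTS[index], round(score, 2))
```

In this toy setting the dictionary step already repairs every word, so the repository lookup returns an exact match. With real manuscripts the two steps complement each other: the dictionary fixes isolated words, while the repository lookup supplies a complete candidate text that the editor can accept or adjust.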
The corpus-based approach drastically speeds up transcription. Further changes from OMMR4all-2019 to OMMR4all-2024 include an improved algorithm for recognizing staff lines, which achieves nearly 100% accuracy and enables a completely automatic layout recognition that separates symbol and text regions. The recognition of symbols is also improved in OMMR4all-2024: both versions use a U-Net [8] architecture with an encoder and a decoder, but the custom encoder of OMMR4all-2019 has been replaced in OMMR4all-2024 by an EfficientNet-b3 [9] architecture. The decoder was custom fine-tuned for this encoder. The clefs at the beginning (and sometimes in the middle) of a line and the duplicated symbols at the end of a line (they are identical to the first symbol of the next line, compare Fig. 2) are now recognized very well. Since OMMR4all is trained with a large corpus, a pretraining on the same source as in OMMR4all-2019 is not always needed; the evaluation results of OMMR4all-2024 (see Section IV) use a generic model for square notations.

Furthermore, OMMR4all-2024 tackles the task of document separation: in the sources, a new document (chant) usually starts in the middle of a page and even in the middle of a line and is marked by a prominent drop capital, i.e. a decorated first letter of the new chant. Recognizing these drop capitals with a separate component, based initially on a Mask R-CNN and later on a YOLOv8 architecture, allows the segmentation of the transcribed symbols and texts into chants, which is a prerequisite for the corpus-based text recognition approach (see above). Since chants can span more than one page, the new approach includes a management component for sources containing many pages instead of transcribing individual pages only as in OMMR4all-2019. Finally, the editor support for validating and correcting the transcription results has been improved.
Usually, a single correction step is sufficient in OMMR4all-2024, because the error rates are low. Since the exact position of note symbols relative to the lines is sometimes difficult to assess, the editor highlights notes lying on a line and notes lying between two lines in different colors. An example of the editor is shown in Fig. 2, containing two chants with two drop capitals in the main view and a list of consecutive pages from a source on the left. The editor provides different views and split views, where the transcription result can be compared directly to the original scan.

Fig. 2. The images were cropped from the OMMR4all-2024 overlay editor. The purple areas mark drop capitals. Green areas are music regions. Red areas are lyric regions. Gray regions are also lyric regions, but they additionally mark the start of a new chant. Yellow and green squares within the music region mark symbols. The different colors of the symbols indicate whether they lie on a staff line or between two staff lines. The reading order of the symbols is represented by a thin line that connects the symbols to each other. Turquoise lines between symbols mark graphic connections. Vertical lines in the music regions mark note sequences to which a syllable is assigned.

IV. EVALUATION RESULTS

OMMR4all-2019 published the following evaluation results for the transcription of chants in square notation, in minutes per page for postcorrection (average of five pages with 267 symbols per page): 0.6 minutes for correcting staff lines, 3.3 minutes for correcting symbols and 2.9 minutes for correcting the assignment of symbols to syllables, in total 6.9 minutes per page. Compared to completely manual entry with the Monodi editor [10], which took 8.5 minutes, a speed-up by a factor of 1.3 was observed.
However, transcribing the text of the chant and separating it into syllables was not part of that evaluation, since the text was given in advance for both editors. In OMMR4all-2024 these steps are covered too, and the full postcorrection time, including the transcription of the text in syllables and the determination of the beginning and end of the chants, is evaluated. The speed-up factor compared to manual transcription with the Monodi editor jumped from 1.3 in OMMR4all-2019 to 5.7 with a two-step process and even 8.9 with a one-step process (see Table I).

TABLE I
TIME IN MINUTES FOR POSTCORRECTION PER PAGE WITH CA. 200 SYMBOLS AND CHANT TEXT (COMPUTED AS AVERAGE FROM 20 PAGES) FROM THREE DIFFERENT PERSONS, WHERE PERSON3 CORRECTED THE OUTPUT OF OMMR4ALL-2024 IN ONE STEP, WHEREAS PERSON1 AND PERSON2 USED TWO STEPS FOR CORRECTION OF SYMBOLS FIRST AND THEN FOR THE CHANT TEXT INCLUDING ASSIGNMENT OF SYLLABLES TO SYMBOLS. FOR COMPARISON, THE MANUAL TRANSCRIPTION TIME WITH THE MONODI EDITOR IS STATED FOR PERSON2 (COMPUTED AS AVERAGE FROM 5 PAGES).

               OMMR4all-2024                   Monodi
Person         Person1   Person2   Person3     Person2
Symbol Level   1.1       1.0       -           11.1
Text Level     2.0       2.1       -           6.7
Total          3.1       3.1       2.0         17.8

V. DISCUSSION

OMMR4all-2024 excels at staff line recognition, which is nearly error-free, and at symbol recognition, which includes differentiating between normal notes, clefs and accidentals, handling duplicated notes at the end of a line, and determining pitch relative to the staff lines, with combined error rates of 2.2% without fine-tuning and 1.5% with fine-tuning. The main error sources are the recognition of graphical connections and the assignment of these note complexes to syllables of the text. Using a large chant corpus and a corresponding word dictionary for automatic correction of the erroneous transcription of the handwritten texts was a major breakthrough, but it does not deliver perfect results, since there are often minor variations in the chant texts, which currently must be corrected manually. In addition, the position of the syllables cannot be determined precisely when the handwriting recognition results are poor, and it corresponds usually, but not always, to the start of the matching note complex, which requires manual correction steps as well.

VI. CONCLUSION AND FUTURE WORK

OMMR4all was developed in the context of the Corpus Monodicum project, which consists of three parts: “editions”, “transcriptions”, and “graduale synopticum”. While editions are transcribed mainly manually with the Monodi editor (to a large degree before OMMR4all-2024 was available), the transcriptions and the graduale synopticum (a retro-digitisation project based on analog transcriptions from http://gregorianik.uni-regensburg.de/gr/) were made available with OMMR4all and manual postcorrection. Recently, we completed the transcription of the Graduale Synopticum5, involving the transcription of over 5,500 chants using the pipeline. Currently, we are engaged in the transcription of additional manuscripts employing square notation, such as the “Köln, Dombibl. 1001b”6 manuscript. Furthermore, we are actively exploring the applicability of the pipeline to other similar notations, particularly “Hufnagel” notation (e.g., Geesebook7).

REFERENCES

[1] Calvo-Zaragoza, J., Martinez-Sevilla, J.C., Penarrubia, C., Rios-Vila, A. (2023). Optical Music Recognition: Recent Advances, Current Challenges, and Future Directions. In: Coustaty, M., Fornés, A. (eds.) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14193. Springer, Cham.
https://doi.org/10.1007/978-3-031-41498-5_7
[2] Wick, Christoph and Puppe, Frank, OMMR4all — a Semiautomatic Online Editor for Medieval Music Notations, In: 2nd International Workshop on Reading Music Systems, 2019, pp. 31–34
[3] Ríos-Vila, A., Rizo, D., Iñesta, J.M. et al. End-to-end optical music recognition for pianoform sheet music. IJDAR 26, pp. 347–362 (2023). https://doi.org/10.1007/s10032-023-00432-z
[4] Ríos-Vila, A., Rizo, D., Calvo-Zaragoza, J. (2021). Complete Optical Music Recognition via Agnostic Transcription and Machine Translation. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. https://doi.org/10.1007/978-3-030-86334-0_43
[5] Fujinaga, Ichiro and Vigliensoni, Gabriel, “Optical Music Recognition Workflow for Medieval Music Manuscripts,” 5th International Workshop on Reading Music Systems, 2023.
[6] Fujinaga, Ichiro and Hankinson, Andrew and Cumming, Julie E., Introduction to SIMSSA (Single Interface for Music Score Searching and Analysis), In: Proceedings of the 1st International Workshop on Digital Libraries for Musicology, 2014, pp. 1–3
[7] Hartelt, Alexander and Eipert, Tim and Puppe, Frank, Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants, In: Applied Sciences, 2024
[8] Olaf Ronneberger and Philipp Fischer and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. pp. 234–241
[9] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pp. 6105–6114
[10] Eipert, T., Haug, A., Herrmann, F., Puppe, F., and Wick, C., Editor Support for Digital Editions of Medieval Monophonic Music, In: 2nd International Workshop on Reading Music Systems (WoRMS), 2019

5 http://gregorianik.uni-regensburg.de/gr/
6 https://digital.dombibliothek-koeln.de/hs/handschriften/content/zoom/312082
7 http://geesebook.ab-c.nl/

Crafting Handwritten Notations: Towards Sheet Music Generation

1st Nivesara Tirupati, 2nd Elona Shatri, 3rd György Fazekas
Centre for Digital Music, Queen Mary University of London
London, UK
Emails: n.tirupati@se23.qmul.ac.uk, e.shatri@qmul.ac.uk, george.fazekas@qmul.ac.uk

Abstract—Handwritten musical notation represents a significant part of the world’s cultural heritage, yet its complex and unstructured nature presents challenges for digitisation through Optical Music Recognition (OMR). While existing OMR systems perform well with printed scores, they struggle with handwritten music due to inconsistencies in writing styles and the quality of scanned images. This paper addresses these challenges by applying Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) to generate high-quality, synthetic handwritten music sheets. The generated sheets can then be used to improve OMR handwritten datasets with more style variability. Experimental results demonstrate that ESRGAN outperforms conventional models, producing detailed and high-fidelity synthetic music sheets. This research offers a practical approach to improving the preservation and digitisation of handwritten music, benefiting musicologists, educators, and archivists.

Index Terms—Handwritten Musical Notation, OMR, Enhanced Super-Resolution Generative Adversarial Networks (ESRGANs), Synthetic Handwriting

I. INTRODUCTION

Handwritten music manuscripts form a critical part of the world’s musical heritage, preserving unique details about composition, performance styles, and historical notations.
Digitising these manuscripts is essential for future generations and for advancing musicological research, education, and archival work. However, transforming handwritten music into digital formats through Optical Music Recognition (OMR) is a highly complex task due to variations in handwriting styles, inconsistencies in notation, and degradation of physical manuscripts.

While OMR systems have achieved considerable success with printed scores, where symbols and layouts are standardised, they falter when applied to handwritten music. The natural variability in handwriting, along with issues such as blurred ink, faded writing, and physical wear, leads to significant recognition errors. These challenges create a pressing need for more sophisticated methods to accurately process and digitise handwritten music.

Recent advances in machine learning, particularly with Generative Adversarial Networks (GANs), have opened new possibilities for generating realistic synthetic data. Enhanced Super-Resolution GANs (ESRGANs) [21], known for their ability to produce detailed high-resolution images, offer a promising solution for overcoming the limitations of traditional OMR systems. By generating synthetic handwritten music sheets that replicate the complexity and nuances of real manuscripts, ESRGAN can be leveraged to train OMR systems more effectively.

This paper proposes a novel approach to improving OMR for handwritten music by using ESRGAN-generated synthetic data. Our method enhances the quality of synthetic music sheets, providing OMR systems with training data that closely resembles real-world manuscripts. The results demonstrate improved accuracy in recognising handwritten music, offering a new pathway for preserving and digitising the world’s musical heritage.

II. RELATED WORK

The preservation of cultural heritage through handwritten music is crucial, as it represents a legacy passed down through generations.
The digitisation of such notations is essential for ensuring their longevity and accessibility to future generations. As noted in [9], the introduction of the MUSCIMA++ dataset marked a significant advancement in the field of OMR. This dataset offers a comprehensive collection of notated manuscript music, encompassing a broad array of symbols, from basic notes and rests to complex articulations and dynamic markings. MUSCIMA++ provides researchers with essential tools for developing and evaluating OMR systems, serving as the ground truth for addressing key challenges in recognising music symbols, such as determining their occurrence and location.

Despite the availability of this dataset, OMR systems continue to face substantial challenges due to the variability in handwriting styles and the complexity of music notation. Even with the comprehensive MUSCIMA++ dataset, the accuracy of current OMR technologies remains a significant issue, as outlined in [6]. This highlights the need for more advanced techniques to improve the digitisation process.

GANs have had a transformative impact on image processing and offer potential solutions to some of the persistent challenges in OMR. Initially proposed in [10], GANs are capable of generating images with highly realistic features, which can enhance the optical recognition of digitised handwritten music. The authors in [2] introduced StyleGAN, a style-based generator architecture that offers enhanced control over generated attributes, improving the quality of image generation. Previous research has demonstrated the effectiveness of deep learning techniques [7], [8], particularly GANs, in enhancing image resolution and improving recognition accuracy. ESRGAN [21] focuses on improving image resolution while preserving fine details, which makes it a candidate for the specific challenges of handwritten music digitisation.
In this study, ESRGAN is applied to enhance the clarity of handwritten music sheets, building upon previous work to address the accuracy challenges in OMR.

III. METHODOLOGY

Our methodology is divided into several key stages: dataset acquisition and preparation, image preprocessing, model architecture design, and training. Each stage is designed to ensure that the ESRGAN model generates high-fidelity handwritten music sheets that replicate the complexity and variability found in real-world music manuscripts.

The following sections outline each stage of the methodology in detail, starting with the preparation of the MUSCIMA++ dataset, followed by the image preprocessing steps, the architecture of the ESRGAN model, and the training procedure designed to ensure optimal performance.

A. Dataset Acquisition and Image Preprocessing

For training the ESRGAN model, we utilised the MUSCIMA++ dataset [9], a collection of handwritten music. This dataset includes a variety of music notation styles, from simple notes to complex symbols from different writers, which allows the model to generalise effectively across different handwriting styles. Its diversity is crucial for training ESRGAN to accurately replicate real-world handwritten music.

Fig. 1: Segmentation of handwritten music sheets into smaller patches (256x256 pixels) for training.

1) Image Preprocessing and Cropping: To standardise the input data for ESRGAN, all music sheets were resized to 256x256 pixels and converted to greyscale. This ensures a uniform input format while preserving sufficient detail for effective model training [21].

a) Contrast Enhancement: Handwritten music often contains fine details [3], [4] that can be lost during digitisation, such as thin lines or varying ink densities. To preserve these details, we applied contrast enhancement to ensure the ESRGAN model receives clear, well-defined input images [6].
This step is especially important for distinguishing staff lines from background noise in the scanned images [1].

b) Gaussian Noise: To simulate real-world imperfections, such as smudged ink and paper texture variations, we introduced Gaussian noise into the images. This noise makes the dataset more robust, ensuring that ESRGAN can handle the kinds of distortions often encountered in practical applications [10], [11]. By keeping pixel values within the valid range [0, 1], we prevent the model from over-darkening or over-brightening the images, thus maintaining overall quality.

c) Image Cropping and Patching: To ensure the model focuses on capturing fine details in handwritten music notation, each music sheet image was divided into 256x256 pixel patches. This approach enables the ESRGAN model to process smaller, more manageable sections, improving its ability to learn subtle and complex features, such as variations in musical symbols [6]. Additionally, patching reduces computational overhead and standardises the input format across the dataset, facilitating efficient and consistent model training, as depicted in Fig. 1.

B. ESRGAN Architecture

The ESRGAN architecture is composed of two primary components: the generator and the discriminator [21]. These components work together in an adversarial framework to enhance the quality and resolution of the synthetic handwritten music sheets.

The generator is responsible for transforming input image patches into high-resolution outputs. It achieves this through several residual blocks, which allow for the preservation of important features while avoiding issues like vanishing gradients [15]. Each residual block contains convolutional layers, batch normalisation, and activation functions to progressively refine image details. By focusing on the fine details of music symbols, the generator ensures that high-resolution synthetic images closely resemble real handwritten sheets.
The final output is activated with a Tanh function, ensuring pixel values remain within a usable range for image synthesis [11].

The discriminator acts as a binary classifier, distinguishing between real and generated images [10]. It consists of a series of convolutional layers followed by Leaky ReLU activation functions and batch normalisation [24]. These layers downsample the input, progressively focusing on distinguishing fine details. The final layer uses a Sigmoid activation to output a probability score, guiding the generator in producing more realistic images over successive iterations [23].

By training the generator and discriminator together in an adversarial loop, ESRGAN produces high-resolution synthetic handwritten music sheets that effectively capture the nuances of real-world notation. This architecture can become pivotal to improving the performance of OMR systems in handling the variability of handwritten music scores.

The ESRGAN model was trained using an adversarial training loop, where the generator and discriminator were updated alternately. The generator aimed to produce realistic high-resolution images, while the discriminator worked to distinguish these from real images [10]. We employed the Adam optimiser [23] to adjust learning rates and stabilise convergence. The loss function consisted of adversarial loss, which encourages realism in generated images, and L1 loss, ensuring pixel-wise accuracy between generated and target images [11]. This process was iteratively refined over 80 epochs, allowing the ESRGAN to effectively model the variability in handwritten music notation.

In addition to the standard adversarial and L1 losses, the ESRGAN model incorporates a perceptual loss to enhance image quality. This loss is calculated by comparing high-level features between the generated and real images, as extracted by a pre-trained network.
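How the three loss terms named above could be combined is shown in the following minimal sketch. It rests on several assumptions that the text does not specify: the weights `w_adv`, `w_l1` and `w_perc` are hypothetical, the perceptual term is taken as a mean absolute difference between feature vectors, and plain Python lists stand in for the image and feature tensors that a deep learning framework would provide in real training.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for a single probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean_abs_diff(a, b):
    """Mean absolute difference between two flat lists of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_loss(d_prob_fake, fake_pixels, real_pixels, fake_feats, real_feats,
                   w_adv=1.0, w_l1=1.0, w_perc=1.0):
    """Weighted sum of adversarial, L1 and perceptual terms (weights are hypothetical)."""
    adversarial = bce(d_prob_fake, 1.0)                  # generator wants "real" verdicts
    pixel = mean_abs_diff(fake_pixels, real_pixels)      # pixel-wise L1 term
    perceptual = mean_abs_diff(fake_feats, real_feats)   # distance in feature space
    return w_adv * adversarial + w_l1 * pixel + w_perc * perceptual


def discriminator_loss(d_prob_real, d_prob_fake):
    """Binary cross-entropy on one real and one generated sample."""
    return bce(d_prob_real, 1.0) + bce(d_prob_fake, 0.0)


# Toy values: an undecided discriminator, a small pixel error, identical features.
g = generator_loss(0.5, [0.0, 0.5], [0.0, 1.0], [1.0, 2.0], [1.0, 2.0])
d = discriminator_loss(0.5, 0.5)
print(round(g, 4), round(d, 4))
```

With an undecided discriminator (probability 0.5) the adversarial term equals ln 2, so in the toy call the generator loss is ln 2 plus the pixel error of 0.25, while the perceptual term vanishes because the feature vectors are identical.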
Unlike pixel-based losses, perceptual loss focuses on the overall structure and content of the image, ensuring that the generated music sheets are not only visually similar but also maintain the intricate relationships between musical symbols and notation features. This helps in preserving the subtle variations in handwriting styles.

C. Post-processing and Image Reconstruction

We initially generated these images at a resolution of 256x512 pixels. The images were then upscaled to match the dimensions of the original sheets for comparison purposes. We acknowledge that this resizing could introduce slight resizing artefacts, which may have affected the visual quality in Figure 6. However, given resource constraints, generating directly at higher resolutions was challenging.

To minimise visible seams between patches, a 4x4 pixel overlap was used, and pixel values were averaged across overlapping regions [21]. This ensures smooth transitions between patches, maintaining the fidelity and integrity of the overall image. Once all patches are recombined, the image is resized to match the original handwritten music sheet dimensions.

IV. EXPERIMENTAL SETUP

The ESRGAN model was trained iteratively, with the generator producing high-resolution handwritten music sheet images and the discriminator distinguishing between real and synthetic images. This adversarial process continuously refines the output quality, ensuring more realistic and accurate handwritten music representations.

The generator and discriminator networks were initialised and optimised using the Adam optimiser [23], with a learning rate of 0.00001. The adaptive nature of the Adam optimiser allowed for stable convergence and fine-tuning of model parameters, effectively managing the complex loss landscapes typical in GAN training [24].
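The overlap averaging used when recombining patches (Section III-C) can be illustrated with a small, self-contained sketch. The function name `stitch_patches` and the tiny 2x4 toy patches are assumptions chosen for readability. The paper works with 256x256 patches and a 4x4 pixel overlap, and its actual implementation is not shown.

```python
def stitch_patches(patches, positions, height, width):
    """Recombine patches into one image, averaging pixels where patches overlap.

    patches   -- list of 2-D lists (rows of pixel values)
    positions -- list of (top, left) coordinates, one per patch
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for patch, (top, left) in zip(patches, positions):
        for r, row in enumerate(patch):
            for c, value in enumerate(row):
                total[top + r][left + c] += value
                count[top + r][left + c] += 1
    # Pixels covered by no patch stay at 0.0; all others receive the mean value.
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]


# Two 2x4 toy patches that overlap by two columns.
dark = [[0.0] * 4 for _ in range(2)]
bright = [[1.0] * 4 for _ in range(2)]
image = stitch_patches([dark, bright], [(0, 0), (0, 2)], height=2, width=6)
print(image[0])  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

Accumulating a sum and a coverage count per pixel keeps the procedure independent of patch order and of how many patches meet at a corner, which is why this form is often preferred over blending patches pairwise.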
The training process followed a standard adversarial loop [10], where the generator produced batches of synthetic images, which the discriminator then classified alongside real images [21]. The generator’s loss function combined three key components:

• Adversarial Loss: Encouraged realistic image generation [24].
• L1 Loss: Ensured pixel-level accuracy between generated and real images [25].
• Perceptual Loss: Maintained high-level feature alignment between generated and real images, ensuring the structural integrity of the music notation [19].

The discriminator was trained with a binary cross-entropy loss to improve its ability to distinguish real from synthetic images, thereby refining the generator’s outputs over successive iterations.

Fig. 2: Handwritten music sheet generated by the ESRGAN model after 80 training epochs.

The model underwent 80 training epochs, during which the generator incrementally refined its outputs to better resemble original handwritten music sheets [21]. Intermediate outputs were inspected visually after each epoch to evaluate how well the model captured musical notation details, guiding further parameter adjustments as needed [18]. An example of these outputs is shown in Figures 2 and 7, illustrating the model’s ability to synthesise high-fidelity handwritten music sheets.

A. Evaluation Metrics

To assess the performance of the ESRGAN model and compare it with Pix2Pix, we employed two key sets of evaluation metrics: one for the overall quality of the generated images and another for edge detection. These metrics were selected to provide a comprehensive evaluation of the models’ ability to replicate handwritten music sheets with both structural and perceptual accuracy.

a) Fréchet Inception Distance (FID) and Inception Score (IS): FID is a widely used metric that measures the distance between the distributions of real and generated images.
Lower FID scores indicate that the generated images are more similar to real images, making it an ideal metric for assessing the realism of generated handwritten music sheets. The IS, on the other hand, evaluates the quality and diversity of the generated images. Higher IS values reflect better image quality and variety, ensuring that the model generates not only realistic images but also diverse representations of handwritten music.

b) Edge Detection (MSE): Edge detection is crucial for preserving the structural details of handwritten music, such as staff lines and note stems. To evaluate this, we used the mean squared error (MSE) between the detected edges in the generated images and the ground truth images. Lower MSE values indicate that the model is better at replicating the fine details of the handwritten music sheets. This metric was chosen to assess how well each model retains the critical edge structures necessary for accurate digitisation.

These evaluation metrics were chosen to provide a balanced and thorough assessment of the models’ capabilities in terms of both visual realism and structural accuracy.

V. RESULTS AND DISCUSSION

After 80 training epochs, the ESRGAN model demonstrated stable performance improvements, with both generator loss (GLoss) and discriminator loss (DLoss) stabilising over time, as shown in Fig. 3. Initially, GLoss dropped significantly as the generator refined its ability to produce realistic images. Over time, fluctuations in both losses diminished, indicating that the adversarial training loop reached equilibrium, a common characteristic in GAN models [11].

Fig. 3: Graph showing the generator loss (GLoss) and discriminator loss (DLoss) over 80 training epochs for the ESRGAN model.

The t-SNE clustering plot (Fig.
4) highlights how closely the ESRGAN-generated images resemble the original handwritten music sheets, with the generated images clustering tightly with the real ones. This close overlap suggests that ESRGAN successfully captures musical symbol features [9].

Fig. 4: t-SNE clustering plot showing the distribution of real and ESRGAN-generated handwritten music sheets in a reduced feature space.

TABLE I: FID and IS Scores Comparison

Model     FID Score   IS Score
ESRGAN    29.47       2.08
Pix2Pix   50.14       2.08

ESRGAN consistently outperformed Pix2Pix [14] in edge detection and image quality, as demonstrated by the lower MSE values (Table II) and Fig. 5. The clearer and sharper edges produced by ESRGAN underscore its ability to retain fine structural details crucial for OMR tasks. Furthermore, ESRGAN achieved a significantly lower FID score (29.47) than Pix2Pix (50.14), indicating its superior ability to generate images that closely resemble real handwritten music sheets (Table I).

TABLE II: MSE Comparison for Edge Detection between ESRGAN and Pix2Pix

Model     MSE (Edge Detection)
ESRGAN    0.1081
Pix2Pix   0.3781

Fig. 5: Edge detection results comparing ESRGAN-generated handwritten music sheets with the ground truth.

VI. CONCLUSIONS

This study demonstrates that ESRGAN is highly effective in synthesising high-fidelity handwritten music sheets, outperforming Pix2Pix in key areas such as edge detection, FID score, and overall image quality. By preserving finer details like musical notations and staff lines, ESRGAN is better suited for tasks requiring precise digitisation of handwritten music sheets. The lower FID score and more accurate edge detection metrics confirm ESRGAN’s ability to generate images that are not only visually similar to real music sheets but also capture critical features necessary for OMR systems. The results suggest that ESRGAN is an ideal candidate for advancing music sheet digitisation and preservation.
Moving forward, integrating ESRGAN with multi-modal datasets, including audio data, could broaden its applications in music recognition and cultural preservation. Additionally, expanding the diversity of training datasets to include a wider range of notations and historical periods would improve the model’s generalisation, making it applicable to a broader array of musical traditions. The success of ESRGAN in this domain opens the door for future innovations in AI-driven cultural heritage preservation.

ACKNOWLEDGMENTS

The authors acknowledge the support of the AI and Music CDT, funded by UKRI and EPSRC under grant agreement no. EP/S022694/1, and our industry partner Steinberg Media Technologies GmbH for their continuous support.

REFERENCES

[1] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2017, pp. 2223–2232.
[2] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
[3] E. Shatri and G. Fazekas, “DoReMi: First glance at a universal OMR dataset,” arXiv preprint, arXiv:2107.07786, Jul. 2021.
[4] E. Shatri and G. Fazekas, “Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation,” in Proc. Int. Conf. Knowledge Discovery and Information Retrieval (KDIR), 2024.
[5] A. Brock, J. Donahue, and K. Simonyan, “Large-scale GAN training for high-fidelity natural image synthesis,” arXiv preprint, arXiv:1809.11096, 2018.
[6] E. Shatri and G. Fazekas, “Optical music recognition: State of the art and major challenges,” arXiv preprint, arXiv:2006.07885, 2020.
[7] E. Shatri, K. Palavala, and G.
Fazekas, “Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN,” to appear in 2nd Workshop on AI Music Generation (AIMG 2024), IEEE Big Data, Washington D.C., 2024.
[8] P. Hande, E. Shatri, B. Timms, and G. Fazekas, “Towards Artificially Generated Handwritten Sheet Music Datasets,” in Proc. 5th Int. Workshop on Reading Music Systems, 2023, p. 25.
[9] J. Hajič and P. Pecina, “The MUSCIMA++ Dataset for Handwritten Optical Music Recognition,” in Proc. 14th IAPR Int. Conf. Document Analysis and Recognition (ICDAR), 2017, pp. 39–46, doi: 10.1109/ICDAR.2017.16.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2014, doi: 10.1145/3422622.
[11] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. Bharath, “Generative Adversarial Networks: An Overview,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, 2017, doi: 10.1109/MSP.2017.2765202.
[12] N. Li, “Generative Adversarial Network for Musical Notation Recognition during Music Teaching,” Computational Intelligence and Neuroscience, 2022, doi: 10.1155/2022/8724688.
[13] S. Lee, U. Hwang, S. Min, and S. Yoon, “Polyphonic Music Generation with Sequence Generative Adversarial Networks,” arXiv preprint, arXiv:1710.11418, 2017.
[14] M. Raghavendra and P. Sarappadi, “Transfer Learning with Pix2Pix GAN for Generating Realistic Photographs from Viewed Sketch Arts,” Journal of Southwest Jiaotong University, 2022, doi: 10.35741/issn.0258-2724.57.4.17.
[15] H. Dong, W. Hsiao, L. Yang, and Y. Yang, “MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment,” in Proc. AAAI Conf. Artificial Intelligence, 2017, pp. 34–41, doi: 10.1609/aaai.v32i1.11312.
[16] H. Chen, Q. Xiao, and X.
Yin, “Generating Music Algorithm with Deep Convolutional Generative Adversarial Networks,” in Proc. IEEE Int. Conf. Electronics Technology (ICET), 2019, pp. 576–580, doi: 10.1109/ELTECH.2019.8839521.
[17] M. Liu, X. Huang, J. Yu, T. Wang, and A. Mallya, “Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications,” Proc. IEEE, vol. 109, no. 5, pp. 839–862, 2020, doi: 10.1109/JPROC.2021.3049196.
[18] É. Clabaut, M. Lemelin, M. Germain, Y. Bouroubi, and T. St-Pierre, “Model Specialization for the Use of ESRGAN on Satellite and Airborne Imagery,” Remote Sens., vol. 13, no. 20, p. 4044, 2021, doi: 10.3390/rs13204044.
[19] Z. Zhu, Y. Lei, Y. Qin, C. Zhu, and Y. Zhu, “IRE: Improved Image Super-Resolution Based on Real-ESRGAN,” IEEE Access, vol. 11, pp. 45334–45348, 2023, doi: 10.1109/ACCESS.2023.3256086.
[20] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-object detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network,” Remote Sens., vol. 12, no. 9, p. 1432, 2020, doi: 10.20944/preprints202003.0313.v1.
[21] X. Wang et al., “ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks,” in Proc. Eur. Conf. Computer Vision (ECCV), 2018, pp. 63–79, doi: 10.1007/978-3-030-11021-5_5.
[22] T. Le-Tien, T. Nguyen-Thanh, H. Xuan, G. Nguyen-Truong, and V. Ta-Quoc, “Deep Learning-Based Approach Implemented to Image Super-Resolution,” J. Adv. Inf. Technol., vol. 11, no. 4, pp. 209–216, 2020, doi: 10.12720/jait.11.4.209-216.
[23] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data,” in Proc. IEEE/CVF Int. Conf. Computer Vision Workshops (ICCVW), 2021, pp. 1905–1914, doi: 10.1109/ICCVW54120.2021.00217.
[24] N. Singh, A. F, M. Rastogi, and R. Prasad, “Performance Analysis of Conditional GANs-based Image-to-Image Translation Models for Low-Light Image Enhancement,” in Proc. Int. Conf. Signal Process.
and Communication (ICSC), 2022, pp. 468–474, doi: 10.1109/ICSC56524.2022.10009340.
[25] T. Fujioka, Y. Satoh, T. Imokawa, M. Mori, E. Yamaga, K. Takahashi, K. Kubota, H. Onishi, and U. Tateishi, “Proposal to Improve the Image Quality of Short-Acquisition Time-Dedicated Breast Positron Emission Tomography Using the Pix2pix Generative Adversarial Network,” Diagnostics, vol. 12, 2022, doi: 10.3390/diagnostics12123114.
[26] J. Calvo-Zaragoza, A. Gallego, and A. Pertusa, “Recognition of Handwritten Music Symbols with Convolutional Neural Codes,” in Proc. 14th IAPR Int. Conf. Document Analysis and Recognition (ICDAR), 2017, pp. 691–696, doi: 10.1109/ICDAR.2017.118.

APPENDIX

Fig. 6: Examples, panels (a) and (b), of ESRGAN-generated scores with a resolution of 256x512 before post-processing.

Fig. 7: Examples, panels (a) to (d), of ESRGAN-generated scores after resizing and post-processing.