Selbstlernende Optische Notenerkennung

Self-Learning Optical Music Recognition

DISSERTATION

submitted in partial fulfillment of the requirements for the degree of Doktor der Technischen Wissenschaften

by Alexander Pacha, B.Sc. M.Sc. with honours
Registration Number 00828440

to the Faculty of Informatics at the TU Wien

Advisor: Ao. Univ.-Prof. Mag. Dr. Horst Eidenberger

The dissertation has been reviewed by: Ichiro Fujinaga, Oge Marques

Vienna, 18th June, 2019
Alexander Pacha

Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.ac.at

The approved original version of this doctoral thesis is available at the TU Wien Bibliothek (tuwien.at/bibliothek).
Erklärung zur Verfassung der Arbeit (Declaration of Authorship)

Alexander Pacha, B.Sc. M.Sc. with honours

I hereby declare that I have written this thesis independently, that I have fully specified all sources and resources used, and that I have clearly marked as borrowed, with a reference to the source, all parts of this work (including tables, maps, and figures) that are taken from other works or from the internet, whether verbatim or in substance.

Vienna, 18th June, 2019
Alexander Pacha

Danksagung

I dedicate this thesis to my father, the best father a child could wish for.

A huge thank-you goes to my advisor Horst Eidenberger, who supervised me excellently and gave me a little nudge in the right direction whenever it was needed. I would also like to thank my mother, my sister, and my friends, in particular Daniela Stoll, Iris Stuhr, Peter Frühwirt, Markus Pöschel, and Florian Ganglberger, who encouraged me to pursue this work and were there for me whenever I needed them.
Special thanks also go to my colleagues Jorge Calvo-Zaragoza and Jan Hajič jr. for the great collaboration, without which this work would not have been possible. In addition, I thank Peter Kán, Iana Podkosova, and Khrystyna Vasylevska for the many little things that enriched my life at the university. Finally, I would like to thank my friend Friedrich Plank for proofreading.

Acknowledgements

I dedicate this work to my father, the best father a child could hope for.

Many thanks to my supervisor Horst Eidenberger, who supervised me excellently and provided guidance whenever I needed it. I would also like to thank my mother, my sister, and my friends, especially Daniela Stoll, Iris Stuhr, Peter Frühwirt, Markus Pöschel, and Florian Ganglberger, who encouraged me to pursue this work and always supported me. A special thank-you goes to my colleagues Jorge Calvo-Zaragoza and Jan Hajič jr. for the fantastic collaboration. Without you, this work would not have been possible. I would also like to thank Peter Kán, Iana Podkosova, and Khrystyna Vasylevska for all the small things that made my day. Finally, I would like to thank my dear friend Friedrich Plank for proofreading this thesis.
Kurzfassung

Music is an essential part of our culture and heritage. Over the centuries, millions of songs have been composed and recorded on paper using music notation. Optical Music Recognition (OMR) is the research field that investigates how the computer can learn to read music notation. Despite decades of research, OMR is still considered anything but solved. One reason is that many traditional approaches rely on heuristics that are hard to generalize. In this thesis I therefore propose a different path: letting the computer learn to read music notation by itself by means of machine learning, in particular deep learning.

In numerous experiments I was able to demonstrate that, with supervision of the learning process, the computer can robustly learn most of the challenges of OMR. These challenges include the analysis of the document structure, the detection and classification of symbols, and the construction of a music notation graph, which serves as an intermediate representation that can be exported into a suitable format for further processing.
A trained neural network can reliably predict whether an image contains music scores or not, while another is able to find the same measure in different editions of the same music and link them, so that one can conveniently navigate back and forth between them. The detection of symbols in typeset and handwritten scores can also be learned, provided that sufficient annotated data is available. The classification of the detected symbols even achieves a lower error rate than that of humans. For scores written in mensural notation, the entire recognition can be simplified into three steps, two of which can be solved with machine learning.

Besides writing scientific articles, I have also gathered and documented the largest collection of datasets for OMR, as well as probably the most comprehensive bibliography currently available. Both collections are available online. Furthermore, I was involved in organizing the 1st International Workshop on Reading Music Systems and gave a tutorial on OMR together with colleagues at the International Society For Music Information Retrieval Conference; a further workshop at the Music Encoding Conference takes place in summer 2019.

Many challenges of OMR can be solved efficiently with deep learning, such as the analysis of the layout or the detection of music objects. However, music notation is a configurational writing system in which the relations and the interplay between the individual objects determine the semantics.
A music notation graph is a suitable data structure for representing this information, and it allows two things to be clearly distinguished: the reconstruction of information from the image, and the encoding of the reconstructed information into a specific format in compliance with the rules of music notation. While the construction of such a music notation graph can be learned, some research questions remain open. I am confident that training the computer on a sufficiently large dataset under human supervision is a sustainable approach with which many use cases of OMR will be solved in the future.

Abstract

Music is an essential part of our culture and heritage. Throughout the centuries, millions of songs were composed and written down in documents using music notation. Optical Music Recognition (OMR) is the research field that investigates how the computer can learn to read those documents. Despite decades of research, OMR is still considered far from being solved. One reason is that traditional approaches rely heavily on heuristics and often do not generalize well. In this thesis, I propose a different approach: letting the computer learn to read music notation documents mostly by itself using machine learning, especially deep learning.

In several experiments, I have demonstrated that the computer can learn to robustly solve many tasks involved in OMR by using supervised learning.
These include the structural analysis of the document, the detection and classification of symbols in the scores, and the construction of the music notation graph, which is an intermediate representation that can be exported into a format suitable for further processing. A trained deep convolutional neural network can reliably detect whether an image contains music or not, while another one is capable of finding and linking individual measures across multiple sources for easy navigation between them. Detecting symbols in typeset and handwritten scores can be learned, given a sufficient amount of annotated data, and classifying isolated symbols can be performed at even lower error rates than those of humans. For scores written in mensural notation, the complete recognition can even be simplified into just three steps, two of which can be solved with machine learning.

Apart from publishing a number of scientific articles, I have gathered and documented the most extensive collection of datasets for OMR as well as probably the most comprehensive bibliography currently available. Both are available online. Moreover, I was involved in the organization of the International Workshop on Reading Music Systems, in a joint tutorial on OMR at the International Society For Music Information Retrieval Conference, and in another workshop at the Music Encoding Conference.

Many challenges of OMR can be solved efficiently with deep learning, such as layout analysis or music object detection. As music notation is a configurational writing system where the relations and interplay between symbols determine the musical semantics, these relationships have to be recognized as well. A music notation graph is a suitable representation for storing this information.
It allows a clear distinction between the challenges involved in recovering information from the music score image and the encoding of the recovered information into a specific output format while complying with the rules of music notation. While the construction of such a graph can be learned as well, there are still many open issues that need future research. But I am confident that training the computer on a sufficiently large dataset under human supervision is a sustainable approach that will help to solve many applications of OMR in the future.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
2 Understanding Optical Music Recognition
3 Towards Self-Learning Optical Music Recognition
4 Towards A Universal Music Symbol Classifier
5 Music Object Detection
    5.1 Handwritten Music Object Detection
    5.2 General Music Object Detection
6 Measure Detection and Structure Analysis
7 Music Notation Graph Construction
8 OMR for Mensural Notation
9 Other contributions
    9.1 Optical Music Recognition Datasets project
    9.2 ISMIR Tutorial “Optical Music Recognition for Dummies”
    9.3 Workshop on Reading Music Systems (WoRMS)
    9.4 Workshop at MEC 2019: Let’s Formalize Music Notation
    9.5 Discussion Group Summary: Optical Music Recognition
    9.6 Community Engagement and Website for OMR-Research
    9.7 OMR Bibliography
10 Conclusions and Outlook
List of Figures
Bibliography

CHAPTER 1
Introduction

“Music is the one incorporeal entrance into the higher world of knowledge which comprehends mankind but which mankind cannot comprehend.”
Ludwig van Beethoven [Sul36]

Music, rhythm, and dance amount to a universal language that is used and understood worldwide. It existed long before spoken languages emerged. It is used to convey information and emotions as well as to entertain us.

Music manifests itself as sound pressure waves that travel through the air. These waves are a temporal phenomenon that exists only between the musician emitting them and the listener perceiving them. To preserve music, it has to be either reproduced by a musician or recorded in one way or another.

Long before electricity was harnessed, people thought it worthwhile to preserve music in order to reproduce it. They invented a language called music notation, which is an abstraction that captures the essential bits of music. As with other languages, music notation evolved over the centuries and emerged in many different forms.
Millions of pieces have been composed and written down through the centuries, and this heritage still lives on and is actively extended by contemporary composers. It represents an essential part of our culture.

Unfortunately, we are not born with the ability to read and understand music notation but acquire this skill by practicing it throughout our life. Starting to read music notation is very challenging and presents a large obstacle for beginners. Even experienced musicians are often surprised when they learn about yet another aspect of music notation. The reason why music notation is so hard to learn is its enormous complexity, imposed by the underlying information it tries to abstract and capture: music, which is virtually without limits.

Arguably the most prominent music notation is called Common Western Music Notation (CWMN) or Modern Staff Notation (see Fig. 1.1). It is a visual representation of the musical parameters pitch, duration, velocity, and timbre. The sequence of notes and rests is described by specific glyphs within a reference system of (typically) five parallel lines, called a stave. The position on the y-axis represents the relative pitch, while the x-axis depicts the temporal sequence. Additional symbols can contain instructions regarding velocity, timbre, or the lyrics to be sung.

Figure 1.1: Excerpt from the waltz “An der schönen blauen Donau” by Johann Strauss, Jr.

By following these instructions, musicians can comprehend the original ideas of the composer, which enables them to reproduce the music, much as books capture ideas, facts, moods, and the like for others to learn about.
However, due to the complexity of the syntactic and semantic rules of CWMN, which require years of practice before they can be mastered, a large portion of the population cannot read it. One way of teaching them is a computer-assisted conversion of written music scores into an audible version of the same piece. This process of reading music notation and automatically decoding it into a machine-readable format is the goal of Optical Music Recognition (OMR). More precisely:

“Optical Music Recognition is the research field that investigates how to computationally read music notation in documents.” [CZHjP19]

OMR has plenty of applications, including teaching students how to read music notation. It can also be used to digitize handwritten manuscripts for restoration and publication, support musicological examinations of large bodies of music, or enable practical applications such as providing accompanying voices while practicing a piece of music.

Follow me in this fictional story: imagine Lisa, a sixteen-year-old girl who loves music. Recently she discovered her passion for rock music. She loves it so much that she decided to pick up playing the guitar. She got a guitar from her parents for Christmas and took a few classes but quickly got bored by the music her teacher wanted her to play. She went on the internet and found a website that offers free scores of her favorite band in tablature notation, which she quickly understood. After all, tablature notation can be much easier to read, since each line corresponds directly to one string on the guitar and the number indicates the fret of that string (see Fig. 1.2). While playing the music that she enjoys so much, she keeps on practicing, and her skills improve considerably.

Figure 1.2: First measures from the guitar riff of the song “Enter Sandman” by Metallica.

One day, her favorite band releases a new song. Another enthusiastic fan goes through the lengthy process of transcribing the entire song by ear and publishes the scores on the same website shortly afterward. Unfortunately, it is written in CWMN instead of tablature. She grabs her smartphone and takes a picture of the music score. That picture is processed by an OMR system that produces a digital version of the scores that she can open in a music score editor. The editor supports her in automatically converting the music into tablature notation, which she can comprehend.

After playing alone for half a year, she decides to join a band. But given her lack of experience, she struggles to keep up with the other musicians. So she decides to practice every day at home, but without the accompanying voices she does not really get in the right mood for the music. So she grabs her smartphone again and takes a picture of the full score with all voices of the band. The OMR system detects and reads all voices and produces a digital version of the song that she can play along to at a slower tempo. After a while, she disables the guitar voice and just keeps the other voices to simulate the presence of her bandmates while she keeps practicing.

Eventually, she learns how to read and write CWMN and composes her first song for the band. She writes it down on a piece of paper (see Fig. 1.3).

Figure 1.3: The initial three measures of Lisa’s first composition for the piano.

However, Lisa is uncertain whether she got everything right and how the song sounds when played on the piano. Again, she picks up her smartphone, takes a photo of her handwritten manuscript, and runs the OMR application.
While listening to the replay, she notices that the digital version has some errors, so she quickly fixes them in her music notation editor before creating the final version of a nicely rendered score that she hands over to her band.

I hope this short fictional story demonstrates the potential of Optical Music Recognition in helping musicians learn and practice their art. It can even be useful in the everyday life of professional musicians and composers. Completely new use cases have been invented in the last few years, such as a digital music stand that turns pages automatically. The conductor can jump to a specific location in everyone’s music scores at the same time without having to wait for them to turn their pages manually.

Despite some existing applications and a history of over 50 years of research, OMR is still considered a wide-open challenge for everything except very simple music scores. While there are a few commercial applications, they all have significant drawbacks and are far from being products that can be used to digitize music scores robustly on a larger scale. For example, a common wish of many musicians and librarians would be to have a born-digital version of the International Music Score Library Project (IMSLP), which is the largest collection of freely available music scores, with over 460,000 scores. But instead of using commercial products, initiatives like OpenScore [GJB+18] rather use humans to digitize these scores manually. A similar approach is also used in many libraries, as Laplante and colleagues learned from interviews with librarians [LF16].
While the potential benefit is unquestioned, they still refrain from using OMR systems because of the high error rate.

So why is OMR still performing so poorly? There are a couple of reasons. Underestimating the challenges is probably the most common one. Whenever someone joins the field, they see some scores like the example in Fig. 1.4 and classify it as a moderately difficult task. Only when they actually start building the system do they realize the number of problems that the recognition entails.

Figure 1.4: A born-digital version of music scores, typeset by a music score editor and without artifacts or degradations.

A (naive) computer scientist might see the score above and think: “There are always five parallel lines, larger and smaller black dots with vertical lines going up or down, and several additional glyphs. This task of recognizing the symbols can be solved by running a line detector to find the horizontal staff lines, which should be removed first. Then a connected-component analysis can be applied to find the individual symbols. Finally, one runs a few scan-lines and template matching algorithms to find the remaining symbols, which should result in the recognition of everything in that image.”

However, that is only half of the story. First of all, scores more often look like Fig. 1.5. Or they might even be handwritten, like Fig. 1.6.

Figure 1.5: The same musical snippet as in Fig. 1.4, but degraded, as can happen in real-world scenarios: the stave is slightly slanted, the image is blurred and noisy due to a poor image capturing process, and some straight lines are bent, which frequently happens when taking photos of scores that are bound in a book.
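To make the naive pipeline quoted above concrete, the following is a minimal sketch of its first two stages (staff-line detection by row projection, staff-line removal, and connected-component analysis) on a tiny binary image. The function names, the `min_fill` threshold, and the toy image are assumptions made for illustration and are not taken from this thesis; the sketch only works because the toy image is perfectly clean and horizontal, which is exactly the assumption that breaks on scores like Fig. 1.5.

```python
from collections import deque


def find_staff_line_rows(image, min_fill=0.8):
    """Rows whose share of black pixels (1) is at least min_fill count as staff lines."""
    width = len(image[0])
    return [y for y, row in enumerate(image) if sum(row) >= min_fill * width]


def remove_rows(image, rows):
    """Blank out the given rows. Naive: this also cuts through overlapping symbols."""
    rows = set(rows)
    return [[0] * len(row) if y in rows else list(row) for y, row in enumerate(image)]


def connected_components(image):
    """Bounding boxes (top, left, bottom, right) of 4-connected black regions."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy image: two "staff lines" (rows 1 and 4), a blob between them, one stray pixel.
toy = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
staff_rows = find_staff_line_rows(toy)
symbols = connected_components(remove_rows(toy, staff_rows))
print(staff_rows)  # [1, 4]
print(symbols)     # [(2, 2, 3, 3), (5, 5, 5, 5)]
```

A slanted or bent stave no longer fills any single pixel row, so the row projection finds no staff lines at all, and the stray pixel is reported as a symbol even though it could just as well be noise.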
Figure 1.6: The same musical snippet as in Fig. 1.4, but handwritten on a tablet with a stylus.

What can be learned from these examples is that the same scores might look very different although they contain the same information: the stave lines might be skewed, or the image quality might be so poor that it is difficult to reliably count the number of flags attached to grace notes or to distinguish an articulation dot from noise. Humans usually fill this gap with their experience and prior knowledge about the rules of music notation. Given two ways of interpreting a particular situation, they choose the one that makes more sense.

But even if we were able to devise a perfect algorithm for detecting everything in that score, i.e. we know exactly which pixel belongs to which object and have the right class information for each object (e.g., quarter rest, g-clef, or notehead), we would still only be halfway through, because unlike Optical Character Recognition (OCR), which tries to read texts, OMR attempts to read music notation. And unlike text, music notation is a configurational writing system. This means that the semantics of the primitives appearing in music scores are determined by their configuration, i.e. the position and positional relationship to other primitives. In other words, the letter ‘a’ in the word ‘Research’ remains an ‘a’, regardless of whether it is slightly shifted upwards or downwards, whereas an ‘A’ in music scores becomes a ‘B’ when moved a little bit up or a ‘G’ when moved a little bit down (see Fig. 1.7).
Figure 1.7: The word ‘Research’ written three times with vertically shifted letters always remains the word ‘Research’, whereas three notes that are slightly shifted vertically represent three different pitches: A, B, and G.

Apart from the vertical position of a note within the reference system of five parallel lines, the pitch can furthermore be altered by the presence of accidentals before that note, the clef at the beginning of the stave, the key signature, and other symbols that might appear in the music score. To illustrate how primitives interact with each other, consider the snippet in Fig. 1.8.

Figure 1.8: Three quarter notes appear in the second space from the top within the reference system. The reference system’s origin is given by the G-clef at the beginning, which specifies the G to be on the second line from the bottom. So the first note corresponds to a C, but the key signature at the beginning depicts two sharps, one of them placed on the second space from the top, which makes the note a C#. The second note has a local modifier that undoes this alteration from the key signature, which makes the note a C. The third note has no local modifier, but the effect of the local modifier from the second note is propagated to consecutive notes within the measure, making it also a C. So even though the first and third notes look exactly the same, their semantics (pitch) is different.

As demonstrated, OMR requires more than just the recognition of the primitives. It also requires something like the construction of a (notation) graph that holds the configuration of the primitives and their relationships and, finally, the generation of music notation in the desired machine-readable format, typically a standard for music exchange such as MIDI, MusicXML, or MEI. Both tasks can become very complex when a system tries to recognize and process more sophisticated scores.
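The interaction of clef, key signature, and accidentals described for Fig. 1.7 and Fig. 1.8 can be illustrated with a small sketch. It is a deliberately simplified model under stated assumptions: only the G-clef is handled, key signatures contain sharps only, octaves are ignored, and an accidental stays active for the same staff position until the end of the measure. The function names and the encoding of staff positions (0 is the bottom line, counting lines and spaces upward) are hypothetical and chosen for this example only.

```python
LETTERS = "CDEFGAB"
SHARP_ORDER = "FCGDAEB"  # order in which sharps are added to a key signature


def letter_in_g_clef(staff_position):
    """Pitch letter for a staff position; position 0 is the bottom line (E in G-clef)."""
    return LETTERS[(LETTERS.index("E") + staff_position) % 7]


def read_measure(notes, sharps_in_key):
    """Resolve pitch names for one measure.

    notes: list of (staff_position, accidental) with accidental in
           {None, "#", "b", "natural"}.
    """
    key_alteration = {letter: "#" for letter in SHARP_ORDER[:sharps_in_key]}
    active = {}  # accidentals seen so far in this measure, keyed by staff position
    pitches = []
    for position, accidental in notes:
        letter = letter_in_g_clef(position)
        if accidental is not None:
            active[position] = accidental
        # A local accidental overrides the key signature until the measure ends.
        alteration = active.get(position, key_alteration.get(letter, ""))
        pitches.append(letter + ("" if alteration == "natural" else alteration))
    return pitches


# Fig. 1.7: three notes on neighbouring staff positions, no key signature.
print(read_measure([(3, None), (4, None), (2, None)], sharps_in_key=0))
# ['A', 'B', 'G']

# Fig. 1.8: three notes in the second space from the top (position 5), two sharps,
# with a natural sign in front of the second note.
print(read_measure([(5, None), (5, "natural"), (5, None)], sharps_in_key=2))
# ['C#', 'C', 'C']
```

The first and third notes of the second call have identical inputs, `(5, None)`, yet resolve to different pitches, which is the point of Fig. 1.8: the meaning of a primitive depends on state built up from other primitives.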
OMR can also be seen as teaching the machine to read and understand music scores to a certain extent. This task can certainly be automated, as demonstrated by applications as early as 1985, when the Wabot-2 robot [MSH+85] read music scores and played them on an organ. Unfortunately, the robot was only capable of playing a very limited number of songs. Likewise, many systems that were developed in the last thirty years only worked well on the limited set of scores that were used during their development. The reason is that the machines did not learn to read music scores but were given a set of rules and processing directives by their developers, optimized for a certain body of music. If a music sheet violates the assumptions that were woven into these rules or contains cases that were forgotten during development, the system breaks down and propagates errors through the process.

This state of affairs is not satisfying. It would be preferable if an OMR system were more independent of the developer and the datasets. Ideally, the system would learn the rules of music by itself and be more generalizable, extensible, and robust. This brings us to the fundamental research question of this thesis:

Can a machine learn to read music scores reliably?

Throughout the last few years I investigated this question from several perspectives and tried to find ways to teach the computer to read music scores mostly by itself. The central idea is to devise a data-driven approach that requires as little human intervention as possible.
The most suitable technology for this approach available today is Machine Learning, especially Deep Learning, which has proven to provide superior solutions to many image recognition problems, among other things. Given that developing an entire OMR system can be very complex, I decided to adapt existing workflows and reformulate the individual steps to make them machine-learnable. They are:

1. Detect and analyze the structure of the music score: This can be a simple decision whether there are scores in the image at all, or finding the positions of staffs and measures, depending on the design of the following steps.
2. Find all objects in the music score: Music scores can contain hundreds of (tiny) objects in a single image. This step is responsible for finding them and classifying them accordingly. In computer vision, this task is called object detection, and its goal is to retrieve the bounding boxes and class labels of all objects in an image.
3. Understand the relationships between music objects: Once the individual objects are found, their relationships have to be determined, and a notation graph can be constructed that holds this information.
4. Export the notation graph into music notation: The complete notation graph is still an abstraction that cannot be read by music notation editors or other programs. It needs to be exported into a portable format to enable compatibility with these editors.

Except for the last step, I investigated how to machine-learn them and published my findings in the following articles.
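To make the hand-over between steps 2 and 3 more tangible, the following minimal sketch shows one possible in-memory representation. It is an assumption for illustration only, not the data model used in the publications: `DetectedObject`, `NotationGraph`, and the class names such as "notehead" are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DetectedObject:
    """Result of step 2: one symbol with a class label and a bounding box."""
    object_id: int
    class_name: str
    box: tuple  # (left, top, right, bottom) in pixels


@dataclass
class NotationGraph:
    """Result of step 3: the detected objects plus directed relationships."""
    objects: dict = field(default_factory=dict)  # object_id -> DetectedObject
    edges: set = field(default_factory=set)      # (source_id, target_id) pairs

    def add_object(self, obj):
        self.objects[obj.object_id] = obj

    def relate(self, source_id, target_id):
        """Record that the target symbol belongs to the source symbol."""
        if source_id not in self.objects or target_id not in self.objects:
            raise KeyError("both objects must be detected before relating them")
        self.edges.add((source_id, target_id))

    def attached_to(self, source_id):
        """Class names of all symbols related to the given source symbol."""
        return sorted(self.objects[t].class_name
                      for s, t in self.edges if s == source_id)


# A notehead with its stem and a preceding sharp (coordinates are made up).
graph = NotationGraph()
graph.add_object(DetectedObject(0, "notehead", (40, 52, 52, 62)))
graph.add_object(DetectedObject(1, "stem", (51, 20, 53, 57)))
graph.add_object(DetectedObject(2, "sharp", (26, 45, 36, 70)))
graph.relate(0, 1)
graph.relate(0, 2)
print(graph.attached_to(0))
```

Keeping detection results immutable and storing relationships separately mirrors the separation of the two steps: the detector can be replaced without touching the relationship logic, and vice versa.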
CHAPTER 2
Understanding Optical Music Recognition

During the last two years, I worked closely together with several other researchers. Most notable was my collaboration with Jorge Calvo-Zaragoza from the University of Alicante and Jan Hajič jr. from Charles University in Prague. We co-authored several papers, and our biggest venture was the paper “Understanding Optical Music Recognition” [CZHjP19]. It is currently under review as a tutorial paper for ACM Computing Surveys. It discusses fundamental questions, such as: What is OMR? Why is it worth attempting? What are the underlying challenges that make it such a hard problem? What are the outputs of OMR systems, and how can existing research be classified with regard to them?

To understand what OMR is, we collected and reviewed more than 200 papers that define or talk about OMR in many different ways. We tried to put an umbrella over them by proposing the following definition, which we hope will be adopted by future researchers:

Optical Music Recognition is the field of research that investigates how to computationally read music notation in documents.

The second major contribution of this paper is an in-depth analysis of how OMR inverts the music encoding process. We begin with the creation of a musical composition, how it is conceptualized, and then materialized. We then show how OMR can be described as the inversion of this encoding process. Furthermore, we discuss how OMR relates to other fields, such as Text Recognition or other Graphics Recognition challenges, and what makes it particularly different from them, including the complex typographical alignment of objects, the interactions between objects, and the extremely complex semantics, which can be hard even for humans to interpret correctly.
Finally, we propose a comprehensive taxonomy of OMR inputs and outputs. We realized that the complexity of OMR systems is directly related to the required level of comprehension of the document. We propose four categories, ranging from document metadata extraction, which requires only limited comprehension, up to structured encoding, which tries to recover not only the musical content but also the information on how it was encoded. We conclude the paper with a brief discussion of current approaches, but in contrast to most survey papers, we do not discuss technical details. We also provide a list of open issues and perspectives for future research.

Understanding Optical Music Recognition

JORGE CALVO-ZARAGOZA∗, University of Alicante, Spain
JAN HAJIČ JR.∗, Charles University, Czech Republic
ALEXANDER PACHA∗, TU Wien, Austria

For over 50 years, researchers have been trying to teach computers to read music notation, referred to as Optical Music Recognition (OMR). However, this field is still difficult to access for new researchers, especially those without a significant musical background: few introductory materials are available, and furthermore the field has struggled with defining itself and building a shared terminology.
In this tutorial, we address these shortcomings by (1) providing a robust definition of OMR and its relationship to related fields, (2) analyzing how OMR inverts the music encoding process to recover the musical notation and the musical semantics from documents, and (3) proposing a taxonomy of OMR, most notably a novel taxonomy of applications. Additionally, we discuss how deep learning affects modern OMR research, as opposed to the traditional pipeline. Based on this work, the reader should be able to attain a basic understanding of OMR: its objectives, its inherent structure, its relationship to other fields, the state of the art, and the research opportunities it affords.

CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Music retrieval; • Applied computing → Document analysis; Graphics recognition and interpretation; Sound and music computing; Digital libraries and archives.

Additional Key Words and Phrases: Optical Music Recognition, Music Notation, Music Scores

ACM Reference Format:
Jorge Calvo-Zaragoza, Jan Hajič jr., and Alexander Pacha. 2019. Understanding Optical Music Recognition. ACM Comput. Surv. 1, 1, Article 1 (January 2019), 50 pages. https://doi.org/0000001.0000001

1 INTRODUCTION

Music notation refers to a group of writing systems with which a wide range of music can be visually encoded so that musicians can later perform it. In this way, it is an essential tool for preserving a musical composition, facilitating permanence of the otherwise ephemeral phenomenon of music. In a broad, intuitive sense, it works in the same way that written text may serve as a precursor for speech. In the same way that Optical Character Recognition (OCR) technology has enabled the automatic processing of written texts, reading music notation also invites automation. In an analogy to OCR, the field of Optical Music Recognition (OMR) covers the automation of this task of “reading” in the context of music.
However, while musicians can read and interpret very complex music scores even in real time, there is still no computer system that is capable of doing so with success.

∗Equal contribution

Authors’ addresses: Jorge Calvo-Zaragoza, University of Alicante, Carretera San Vicente del Raspeig, Alicante, 03690, Spain, jcalvo@dlsi.ua.es; Jan Hajič jr., Charles University, Prague, Czech Republic, hajicj@ufal.mff.cuni.cz; Alexander Pacha, TU Wien, Institute of Information Systems Engineering, Favoritenstraße 9-11, Vienna, 1040, Austria, alexander.pacha@tuwien.ac.at.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2019 Copyright held by the owner/author(s).
0360-0300/2019/1-ART1
https://doi.org/0000001.0000001

ACM Comput. Surv., Vol. 1, No. 1, Article 1. Publication date: January 2019.

We argue that besides the technical challenges, one reason for this state of affairs is also that OMR has not defined its goals with sufficient rigor to formulate its motivating applications clearly, in terms of inputs and outputs. Work on OMR is thus fragmented, and it is difficult for a would-be researcher, and even harder for external stakeholders such as librarians, musicologists, composers, and musicians, to understand and follow up on the aggregated state of the art.
The individual contributions are formulated with relatively little regard to each other, although fewer than 500 works on OMR have been published to date. This makes it hard to combine the numerous contributions and use previous work from other researchers, leading to frequent “reinventions of the wheel.” The field, therefore, has been relatively opaque for newcomers, despite its clear, intuitive appeal.

One reason for the unsatisfactory state of affairs was a lack of practical OMR solutions: when one is hard-pressed to solve basic subproblems like staff detection or symbol classification, it seems far-fetched to define applications and chain subsystems. However, some of these traditional OMR sub-steps, which do have a clear definition and evaluation methodologies, have recently seen great progress, moving from the category of “hard” problems to “close to solved,” or at least clearly solvable [70, 118]. Therefore, the breadth of OMR applications that have long populated merely the introductory sections of articles now comes within practical reach. As the field garners more interest within the document recognition and music information retrieval communities [1, 11, 34, 50, 78, 83, 92, 114, 135], we see further need to clarify how OMR talks about itself.

The primary contributions of this paper are to clearly define what OMR is, what problems it seeks to solve, and why. Readers should be able to fully understand what OMR is, even without prior knowledge of music notation. OMR is, unfortunately, a somewhat opaque field due to the fusion of the music-centric and document-centric perspectives. Even for researchers, it is difficult to clearly relate their work to the field, as illustrated in Section 2. Many authors also think of OMR as notoriously difficult to evaluate [84].
However, we show that this clarity also disentangles OMR tasks which are genuinely hard to evaluate, such as full re-typesetting of the score, from those where established methodologies can be applied straightforwardly, such as searching scenarios. Furthermore, the separation between music notation as a visual language and music as the information it encodes is sometimes not made clear, which leads to a confusing terminology. The way we formulate OMR should provide a framework of thought in which this distinction becomes obvious.

In order to be a proper tutorial on OMR, this paper addresses certain shortcomings in the current literature, specifically by providing:

• A robust definition of what OMR is, and a thorough analysis of its inherent structure;
• Terminological clarifications that should make the field more accessible and easier to survey;
• A review of OMR uses and applications, well-defined in terms of inputs and outputs, and, as much as possible, recommended evaluation methodologies;
• A brief discussion of how OMR was traditionally approached and how modern machine learning techniques (namely deep learning) affect current and future research;
• As supplementary material, an extensive, extensible, accessible and up-to-date bibliography of OMR (see Appendix A: OMR Bibliography).1

The novelty of this paper thus lies in collecting and systematizing the fragments found in the existing literature, all in order to make OMR more approachable, easier to collaborate on, and, hopefully, progress faster.

1 https://github.com/OMR-Research/omr-research.github.io
2 WHAT IS OPTICAL MUSIC RECOGNITION?

So far, the literature on OMR does not really share a common definition of what OMR is. Most authors agree on some intuitive understanding, which can be sketched out as “computers reading music.” But until now, no rigorous analysis of this question has been carried out, as most of the literature on the field focuses on providing solutions or, more accurately, solutions to certain subproblems. These solutions are usually justified by a certain envisioned application or by referencing a review paper that elaborates on common motivations, with [132] being the most prominent one. However, even these review papers [7, 22, 111, 132] focus almost exclusively on technical OMR solutions and avoid elaborating the scope of the research.

A critical review of the scientific literature reveals a wide variety of definitions for OMR (see Appendix B: List of OMR definitions and descriptions from published works) with two extremes: On one end, the proposed definitions are clearly motivated by the (sub)problem which the authors sought to solve (e.g., “transforming images of music scores into MIDI files”), which leads to a definition that is too narrow and does not capture the full spectrum of OMR. On the other end, there are some definitions that are so generic that they fail to outline what OMR actually is and what it tries to achieve. An obvious example would be to define OMR as “OCR for music.” This definition is overly vague, and the authors are, as in many other papers, particularly unspecific when it comes to clarifying what it actually includes and what not. We have observed that the problem statements and definitions in these papers are commonly adapted to fit the provided solution or to demonstrate the relevance to a particular target audience, e.g., computer vision, music information retrieval, document analysis, digital humanities, or artificial intelligence.
While people rely on their intuition to compensate for this lack of accuracy, we prefer to put an umbrella over OMR and name its essence by proposing the following definition.

Definition 1. Optical Music Recognition is a field of research that investigates how to computationally read music notation in documents.

The first claim of this definition is that OMR is a research field. In the published literature, many authors refer to OMR as a “task” or “process,” which is insufficient, as OMR cannot be properly formalized in terms of unique inputs and outputs (as discussed in Section 6). OMR must, therefore, be considered something bigger, like the embracing research field, which investigates how to provide a computer with the ability to read music notation. Within this research field, several tasks can be formulated with specific, unambiguous input/output pairs.

The term “computationally” distinguishes OMR from the musicological and paleographic studies of how to decode a particular notation system. It also excludes studying how humans read music. OMR does not study the music notation systems themselves; rather, it builds upon this knowledge, with the goal that a computer should be able to read the music notation as well.

The last part of the definition, “reading music notation in documents,” tries to define OMR in a concise, clear, specific, and inclusive way. To fully understand this part of the definition, the next section clarifies what kind of information is captured in a music notation document and outlines the process by which it gets generated. The subsequent section then elaborates on how OMR attempts to invert this process to read and recover the encoded information.

It should be noted that the output of OMR is omitted intentionally from its definition, as different tasks require different outputs (see Section 6) and specifying any particular output representation would make the definition unnecessarily restrictive. To conclude this section, Fig.
1 illustrates how various definitions of OMR in the literature relate to our proposed definition and are captured by it. A full list of the formulations that have appeared in OMR papers so far can be found in Appendix B: List of OMR definitions and descriptions from published works.

Fig. 1. How OMR tends to be defined or described and how our proposed definition relates to them. For example: “OMR is the challenge of (automatically) converting (handwritten) scores into a digital representation.”
[Figure content: the proposed definition, “OMR is a field of research that investigates how to computationally read music notation in documents,” is set against a sentence template assembled from the literature: “OMR is a/the {process, technique, algorithm, task, tool, challenge, discipline, program, system} to/of [automatically] {extract, transform, understand, translate, convert, recognize, read, detect, interpret, transcribe, decode, digitize, process, (re-)set} {[handwritten], [printed], [pen-based], [symbolic], [scanned], [paper-based]} {music, music scores, score images, scores, manuscripts, music sheets, music documents, music notation, note information, musical information, music works} into a/an {machine-readable format, symbolic format, MIDI file, MusicXML file, symbolic representation, musical codes, editable form, symbolic music library, electronic format, symbolic notation format, digital representation, digital notation format}.”]

3 FROM “MUSIC” TO A DOCUMENT

Music can be conceptualized as a structure of notes in time.
This is not necessarily the only way to conceptualize music,2 but it is the only one that has a consistent, broadly accepted visual language used to transmit it in writing, so it is the conceptualization we consider for the purposes of OMR. A note is a musical object that is defined by four parameters: pitch, duration, loudness, and timbre. Additionally, it has an onset: a placement onto the axis of time, which in music does not mean wall-clock time, but is measured in relative units called beats.3 Periods of musical time during which no note is supposed to be played are marked by rests, which only have an onset and a duration. Notes and rests are grouped hierarchically into phrases, voices, and other musical units that can have logical relationships to one another. This structure is a vital part of music: working it out is essential for making a composition comprehensible.

In order to record this “conceptualization of music” visually, for it to be performed over and over in (roughly) the same way, at least at the relatively coarse level of notes, multiple music notation systems have evolved. A music notation system is a visual language that encodes music into a graphical form and enriches it with information on how to perform it (e.g., bowing marks, fingerings or articulations).4 To do that, it defines a set of symbols as its alphabet and specific rules for how to position these symbols to capture a musical idea. Note that all music notation systems entail a certain loss of information, as they are designed to preserve the most relevant properties of the composition very accurately, especially the pitches, durations, and onsets of notes, while under-specifying or even intentionally omitting other aspects. Tempo could be one of these aspects, where the composer might have expressed a precise metronomic indication, given a verbal hint, or stated nothing at all. It is therefore considered the responsibility of the performer to fill those gaps appropriately. We consider this as a natural boundary of OMR: it ends where musicians start to disagree over the same piece of music.

2 As evidenced by either very early music (plainchant) or some later twentieth century compositional styles (mostly spectralism).
3 Musical time is projected onto wall-clock time with an underlying tempo, which can further be stretched and compressed by the performer. Strictly speaking, the notion of beats might not be entirely applicable to some very early music and some contemporary music, where the rhythmic pulse is not clearly defined. However, the notation used to express such music usually does have beats.
4 Feist [57] refers to notation whimsically as a “haphazard Frankenstein soup of tangentially related alphabets and hieroglyphics via which music is occasionally discussed amongst its wonkier creators.”

Arguably the most frequently used notation system is Common Western Music Notation (CWMN, also known as modern staff notation), which evolved during the seventeenth century from its mensural notation predecessors and stabilized at the beginning of the nineteenth century. There have been attempts to supersede it in the avant-garde and postmodern movements, but so far, these have not produced workable alternatives. Apart from CWMN, there exists a wealth of modern tablature scores for guitar, used, e.g.,
to write down popular music, as well as a significant body of historical musical manuscripts that use earlier notation systems (e.g., mensural notations, quadratic notation for plainchant, early organum, or a wealth of tablature notations for lutes).

Once a music notation system is selected for writing down a piece of music, it is still a challenging task to engrave5 the music because a single set of notes can be expressed in many ways. For example, one must make sure that the stem directions mark voices consistently and that appropriate clefs are used, in order to make the music as readable as possible [57, 79, 89, 143]. These decisions not only affect the visual appearance but also help to preserve the logical structure (see Fig. 2). Afterwards, it can be embodied in a document, whether physically or digitally.

To summarize, music can be formalized as a structured assembly of notes, enriched through additional instructions for the performer, that are encoded visually using a music notational language and embodied in a medium such as paper (see Fig. 3). Once this embodiment is digitized, OMR can be understood in terms of inverting this process.

4 INVERTING THE MUSIC ENCODING PROCESS

OMR starts after a musical composition has been expressed visually with music notation in a document.6 The music notation document serves as a medium, designed to encode and transmit a musical idea from the composer to the performer, enabling the recovery and interpretation of that envisioned music by reading through it. The performer would:

(1) Read the visual signal to determine what symbols are present and what their configuration is,
(2) Use this information to parse and decode the notes and their accompanying instructions (e.g., indications of which technique to use), and
(3) Apply musical intuition, prior knowledge, and taste to interpret the music and fill in the remaining parameters which music notation did not capture.
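The split between what the notation fixes and what the performer adds can be illustrated with a short sketch. It is a simplified model under stated assumptions: `Note`, `to_seconds`, and `schedule` are hypothetical names, pitch is encoded as a MIDI note number (60 = middle C), loudness and timbre are left out, and the tempo is assumed to be constant.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Note:
    """What steps (1) and (2) can recover from the page."""
    pitch: int          # MIDI note number (assumed encoding, 60 = middle C)
    onset: Fraction     # position in beats from the start of the piece
    duration: Fraction  # length in beats


def to_seconds(beats, tempo_bpm):
    """Project musical time (beats) onto wall-clock time at a constant tempo."""
    return float(beats) * 60.0 / tempo_bpm


def schedule(notes, tempo_bpm):
    """Return (pitch, start, end) in seconds for a tempo chosen by the performer."""
    return [(n.pitch,
             to_seconds(n.onset, tempo_bpm),
             to_seconds(n.onset + n.duration, tempo_bpm))
            for n in notes]


# A quarter note followed by an eighth note, with the quarter note as the beat.
written = [Note(60, Fraction(0), Fraction(1)), Note(62, Fraction(1), Fraction(1, 2))]
print(schedule(written, tempo_bpm=120))  # one possible interpretation
print(schedule(written, tempo_bpm=60))   # another, equally faithful to the page
```

Both schedules are derived from the same written notes; only the `tempo_bpm` argument differs, which corresponds to information that step (3) supplies and that lies outside the document.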
Note that step (3) is clearly outside of OMR since it needs to deal with information that is not written into the music document, and where human performers start to disagree, although they are reading the very same piece of music [98].7 Coming back to our definition of OMR, based on the stages of the writing/reading process we outlined above, there are two fundamental ways to interpret the term “read” in reading music notation, as illustrated in Fig. 4. We may wish to:

(A) Recover music notation and information from the engraving process, i.e., what elements were selected to express the given piece of music and how they were laid out. This corresponds to stage (1) in the analysis above and does not necessarily require specific musical knowledge, but it does require an output representation that is capable of storing music notation, e.g., MusicXML or MEI, which can be quite complex.
(B) Recover musical semantics, which we define as the notes, represented by their pitches, velocities, onsets, and durations. This corresponds to stage (2); we use the term “semantics” to refer only to the information that can be unambiguously inferred from the music notation document. In practical terms, MIDI would be an appropriate output representation for this goal.

This is a fundamental distinction that dictates further system choices, as we discuss in the next sections. Note that, counter-intuitively, going backwards through this process just one step (A, recover music notation) might in fact be more difficult than going back two steps (B, recover musical semantics) directly. This is because music notation contains a logical structure and more information than simply the notes. Skipping the explicit description of music notation allows bypassing this complexity.

There is, of course, a close relationship between recovering music notation and musical semantics. A single system may even attempt to solve both at the same time because once the full score with all its notational details is recovered, the musical semantics can be inferred unambiguously. Keep in mind that the other direction does not necessarily work: if only the musical semantics are restored from a document without the engraving information that describes how the notes were arranged, those notes may still be typeset using meaningful engraving defaults, but the result is probably much harder to comprehend (see Fig. 2b for such an example).

Fig. 2. Excerpt of Robert Schumann’s “Von fremden Ländern und Menschen” (Engl. “Of foreign countries and people”), Op. 15 for piano. Properly engraved (a), it has two staffs for the left and the right hand with three visible voices, a key signature and phrase markings to assist the musician. In a poor engraving of the same music (b), that logical structure is lost, and it becomes painfully hard to read and comprehend the music, although these two versions contain the same notes.

Fig. 3. How music is typically expressed and embodied (written down). [Figure content: “The music” → conceptualized with notes → engraved using music notation → embodied in a document.]

Fig. 4. How “reading” music can be interpreted as the operations of inverting the encoding process. [Figure content: the same chain as in Fig. 3, with the two inverse operations “Recover music notation” and “Recover musical semantics.”]

5 Normally, music engraving is defined as the process of drawing or typesetting music notation with a high quality for mechanical reproduction. However, we use the term to refer to “planning the page”: selecting music notation elements and planning their layout to most appropriately capture the music, before it is physically (or digitally) written on the page. This is a loose analogy to the actual engraving process, where the publisher would carefully prepare the printing plates from soft metal, and use them to produce many copies of the music; in our case, this “printing process” might not be very accurate, e.g., in manuscripts. The engraving process involves complex decisions [24] that can affect only a local area, like spacings between objects, but can also have global effects, like where to insert a page break to make it convenient for the musician to turn the page.
6 While OMR mainly works with a complete image or document, it is also possible to perform online OMR with the temporal signal as it is being generated, e.g., by capturing the stylus input on an electronic tablet device, which also results in a document.
7 Analogously, speech synthesis is not considered a part of optical character recognition. However, there exists expressive performance rendering software that attempts to simulate more authentic playback, addressing step (3) in our analysis. More information can be found in [36].

4.1 Alternative Names

Optical Music Recognition is a well-established term, and we do not seek to establish a new one. We just notice a lack of precision in its definition.
Therefore, it is no wonder that people have been interpreting it in many different ways, to the extent that even the optical detection of lip motion for identifying the musical genre of a singer [53] has been called OMR. Alternative names that might not exhibit this vagueness are Optical Music Notation Recognition, Optical Score Recognition,8 or Optical Music Score Recognition. While the prefix “Optical” is not compulsory, it could still prove beneficial in highlighting the visual characteristics and help distinguish it from techniques that work on audio recordings.

8 which is similar to the German equivalent “Optische Notenerkennung”

5 RELATION TO OTHER FIELDS

Now that we have thoroughly described what Optical Music Recognition is, we briefly set it in the context of other disciplines, both scientific and general fields of human endeavor. Figure 5 lays out the various key areas that are relevant for OMR, both as its tools and the “consumers” of its outputs.

Fig. 5. Optical Music Recognition with its most important related fields, methods, and applications.

From a technical point of view, OMR can be considered a subfield of computer vision and document analysis, with deep learning acting as a catalyst that opens up promising novel approaches. Within the context of Music Information Retrieval (MIR), OMR should enable the application of MIR algorithms that rely on symbolic data and audio inputs (through rendering the recognized scores).
It can furthermore enrich digital music score libraries and make them much more searchable and accessible, which broadens the scope of digital musicology to compositions for which we only have the written score (which is probably the majority of Western musical heritage). Finally, OMR has practical implications for composers, conductors, and the performers themselves, as it cuts down the costs of digitizing scores and therefore brings the benefits of digital formats to their everyday practice.

5.1 Optical Music Recognition vs. Text Recognition

One must also address the obvious question: why should OMR be singled out besides Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR), given that they are tightly linked [18], and OMR has frequently been called “OCR for music” [25, 26, 68, 80, 93, 94, 109, 128, 129, 147]?⁹ What is the justification of talking specifically about music notation and what differentiates it from other graphics recognition challenges? What are the special considerations in OMR that one does not encounter in other writing systems?

⁹ Even the English Wikipedia article on OMR has been calling it “Music OCR” for over 13 years.

A part of the justification lies in the properties of music notation as a featural writing system. While its alphabet consists of well-defined primitives (e.g., stems, noteheads, or flags) that have a clear interpretation, it is only their configuration (how they are placed and arranged on the staffs, and with respect to each other) that specifies what notes should be played. The properties of music notation that make it a challenge for computational reading have been discussed exhaustively by Byrd and Simonsen [29]; we hypothesize that these difficulties are ultimately caused by this featural nature of music notation.

Another major reason for considering the field of OMR distinct from text recognition is the application domain itself: music. When processing a document of music notation, there is a
natural requirement to recover its musical semantics (see Section 4, setting B) as well, as opposed to text recognition, which typically does not have to go beyond recognizing letters or words and ordering them correctly. There is no proper equivalent of this interpretation step in text recognition since there is no definite answer to how a symbol configuration (i.e., words) should be further interpreted; therefore, one generally leaves interpretation to humans or to other well-defined tasks from the Natural Language Processing field. However, given that music is overwhelmingly often conceptualized as notes, and notes are well-defined objects that can be inferred from the score, OMR is, not unreasonably, asked to produce this additional level of outputs that text recognition does not.

Fig. 6. How the translation of the graphical concept of a note into a pitch is affected by the clef and accidentals. The effective pitch is written above each note. Accidentals immediately before a note propagate to other notes within the same measure, but not to the next measure. Accidentals at the beginning of a measure indicate a new key signature that affects all subsequent notes.

Fig. 7. This excerpt by Ludwig van Beethoven, Piano Sonata op. 2 no. 2, Largo appassionato, m. 31 illustrates some properties of the music notation that distinguish it from other types of writing systems: a wide range of primitive sizes, the same primitives appearing at different scales and rotations, and the ubiquitous two-dimensional spatial relationships.
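The rules stated in the caption of Fig. 6 can be made concrete with a small sketch. The following Python code is only an illustration under simplifying assumptions, not a method taken from the OMR literature: the names `NoteSymbol` and `measure_to_midi` are hypothetical, only treble and bass clefs are modeled, a written accidental is assumed to propagate only to later notes on the same staff position within the measure, and pitches are expressed as MIDI note numbers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

STEPS = "CDEFGAB"
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# Diatonic index (octave * 7 + step index) of the bottom staff line per clef.
CLEF_BOTTOM_LINE = {
    "treble": 4 * 7 + STEPS.index("E"),  # bottom line is E4
    "bass": 2 * 7 + STEPS.index("G"),    # bottom line is G2
}


@dataclass
class NoteSymbol:
    """Purely graphical description of a note, as a detector might report it."""
    staff_position: int               # 0 = bottom line, 1 = first space, 2 = second line, ...
    accidental: Optional[int] = None  # +1 sharp, -1 flat, 0 natural, None = nothing written


def measure_to_midi(notes: List[NoteSymbol], clef: str,
                    key_signature: Optional[Dict[str, int]] = None) -> List[int]:
    """Convert the notes of ONE measure from staff positions to MIDI pitches."""
    key_signature = key_signature or {}
    active: Dict[int, int] = {}  # accidentals written earlier in this measure
    pitches = []
    for note in notes:
        diatonic = CLEF_BOTTOM_LINE[clef] + note.staff_position
        octave, step_index = divmod(diatonic, 7)
        step = STEPS[step_index]
        if note.accidental is not None:
            active[diatonic] = note.accidental
        # A written accidental overrides the key signature until the measure ends.
        alteration = active.get(diatonic, key_signature.get(step, 0))
        pitches.append(12 * (octave + 1) + SEMITONES[step] + alteration)
    return pitches
```

For example, `measure_to_midi([NoteSymbol(2, accidental=1), NoteSymbol(2)], "treble")` returns `[68, 68]` because the sharp carries over to the second note, whereas the same staff position read in a fresh measure, or in the bass clef, yields a different pitch. The same graphical position therefore maps to several musical pitches depending on context.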
Perhaps the simplest example to illustrate this difference is given by the concept of the pitch of the notes (see Fig. 6). While graphically a note lies on a specific vertical position of the staff, other objects, such as the clefs and accidentals, determine its musical pitch. It is therefore insufficient for an OMR system to provide just the results in terms of positions; it also has to take the context into account in order to convert positions (graphical concept) into pitches (musical concept). In this regard, OMR is more ambitious than text recognition, since there is an additional interpretation step specifically for music that has no good analogy in other natural languages.

The character set poses another significant challenge, compared to text recognition. Although writing systems like Chinese have extraordinarily complex character sets, the set of primitives for OMR spans a much greater range of sizes, ranging from small elements like a dot to big elements spanning an entire page like the brace. Many of the primitives may appear at various scales and rotations, like beams, or have a nearly unrestricted appearance, like slurs, which are only defined as more-or-less smooth curves that may be interrupted anywhere. Finally, in contrast to text recognition, music notation involves ubiquitous two-dimensional spatial relationships, which are salient for the symbols’ interpretation. Some of these properties are illustrated in Fig. 7.

Furthermore, Byrd and Simonsen [29] argue that because of the vague limits of what one may want to express using music notation, its syntactic rules can be expected to be bent accordingly; this
happens to such an extent that Homenda et al. [90] argued that there is no universal definition of music notation at all. Figure 7 actually contains an instance of such rule-breaking: while one would expect all notes in one chord to share the same duration, the chord on the bottom left contains a mix of white and black noteheads, corresponding to half- and quarter-notes. At the same time, however, the musical intent is yet another: the two quarter-notes in the middle of the chord are actually played as eighth notes, to add to the rich sonority of the fortissimo chord on the first beat.¹⁰ We believe this example succinctly illustrates the intricacies of the relationship between musical comprehension and music notation. This last difference between a written quarter and interpreted eighth note is, however, beyond what one may expect OMR to do, but it serves as further evidence that the domain of music presents its own difficulties, compared to the domains where text recognition normally operates.

Fig. 8. Brahms Intermezzo, Op. 117 no. 1. Adjacent notes of the chords in the first bar in the top staff are shifted to the right to avoid overlaps (yellow dotted boxes). The moving eighths in the second bar are forced even further to the right, although they are played simultaneously with the chord (red dashed boxes).

5.2 Optical Music Recognition vs. Other Graphics Recognition Challenges

Apart from text, documents can contain a wide range of other graphical information, such as engineering drawings, floor plans, mathematical expressions, comics, maps, patents, diagrams, charts or tables [44, 58]. Recognizing any of these comes with its own set of challenges, e.g., comics combine text and other visual information in order to narrate a story, which makes recovering the correct reading order a non-trivial endeavor.
Similarly, the arrangement of symbols in engineering drawings and floor plans can be very complex, with rather arbitrary shapes. Even tasks that are seemingly easy, such as the recognition of tables, must not be underestimated and are still subject to ongoing research [131, 144]. The hardest aspects of OMR are much closer to these challenges than to text recognition: the ubiquitous two-dimensionality, long-distance spatial relationships, and the permissive way in which individual elements can be arranged and appear at different scales and rotations.

One thing that makes CWMN more complex than many graphics recognition challenges, like mathematical formulae recognition, is the complex typographical alignment of objects [7, 29] that is dictated by the content, e.g., each space between multiple notes of the same length should be equal. This complexity is often driven by interactions between individual objects that force other elements to move around, breaking the principal horizontal alignment of simultaneous events (see Fig. 8, 9 and 10).

¹⁰ This effect would be especially prominent on the Hammerklavier instruments prevalent around the time Beethoven was composing this sonata.

Fig. 9. Sample from the CVC-MUSCIMA dataset [60] with the same bar transcribed by two different writers. The first three notes and the second three notes each form a chord and should be played simultaneously (see right figure), but they are sometimes spelled out horizontally (see left figure); the spelled-out form is sometimes used in violin scores.

Fig. 10.
Sample from the Songbook of Romeo & Julia by Gerard Presgurvic [124] with uneven spacing between multiple sixteenth notes of the same length in the middle voice to align the notes with the lyrics.

Apart from the typographical challenges, OMR also has to cope with extremely complex semantics, with many implicit rules. To handle this complexity, researchers started long ago to leverage the rules that govern music notation and formulate them into grammars [4, 123]. For instance, the fact that the note durations (in each notated voice) have to sum up to the length of a measure has been integrated into OMR as a post-processing step [120]. Fujinaga [67] even states that music notation can be recognized by an LL(k) grammar. Nevertheless, the following quotation from Blostein and Baird [22] (p. 425; sec. 7 - Syntactic Methods) is still mostly true: “Various methods have been suggested for extending grammatical methods which were developed for one-dimensional languages. While many authors suggest using grammars for music notation, their ideas are only illustrated by small grammars that capture a tiny subset of music notation.”

There has been progress on enlarging the subset of music notation captured by these grammars, most notably in the DMOS system [49], but there are still no tractable 2-D parsing algorithms that are powerful enough for recognizing music notation without relying on fragile segmentation heuristics. It is not clear whether current parsers used to recognize mathematical expressions [3] are applicable to music notation or simply have not been applied yet; at least we are not aware of any such works.

6 A TAXONOMY OF OMR

Now that we have progressed in our effort to define Optical Music Recognition, we can turn our attention to systematizing the field with respect to motivating applications, subtasks, and their interfaces.
We reiterate that our objective is not to review the methods by which others have attempted to reach the goals of their OMR work; rather, we are proposing a taxonomy of the field’s goals themselves. Our motivation is to find natural groups of OMR applications and tasks for which we can expect, among other things, shared evaluation protocols. The need for such systematization has long been felt [23, 30], but subsequent reviews [111, 132] have focused almost entirely on technical solutions.

6.1 OMR Inputs

The taxonomy of inputs of OMR systems is generally established. The first fundamental difference can be drawn between offline and online¹¹ OMR: offline OMR operates on a static image, while online OMR operates on a time series of user interactions, typically pen positions that were captured from a touch interface [31, 72, 73, 150]. Online OMR is generally considered easier since the decomposition into strokes provides a high-quality over-segmentation essentially for free. Offline OMR can be further subdivided by the engraving mechanism that has been used: the music can be either typeset by a machine, often inaccurately referred to as printed¹², or handwritten by a human, with an intermediate, yet common, scenario of handwritten notation on pre-printed staff paper.

Importantly, music can be written down in many different notation systems that can be seen as different languages to express musical concepts (see Fig. 11). CWMN is probably the most prominent one. Before CWMN was established, other notations such as mensural or neumes preceded it, so we refer to them as early notations.
Although this may seem like a tangential issue, the recognition of manuscripts in ancient notations has motivated a large number of works in OMR that facilitate the preservation and analysis of the cultural heritage as well as enabling digital musicological research of early music at scale [50, 51, 69, 158]. Another category of notations that are still being actively used today are instrument-specific notations, such as tablature for string instruments or percussion notation. The final category captures all other notations including, e.g., modern graphic notation, braille music or numbered notation that are only rarely used and for which the existing body of music is much smaller than for the other notations. To illustrate how varied the visual expression of music can be, the Standard Music Font Layout [148] currently lists over 2440 recommended characters, plus several hundred optional glyphs.

Byrd and Simonsen [29] further characterize OMR inputs by the complexity of the notated music itself, ranging from simple monophonic music to “pianoform.” They use both the presence of multiple staffs as well as the number of notated voices inside a single staff as a dimension of notational complexity. In contrast, we do not see the number of staffs as a driver of complexity since a page typically contains many staffs and a decision on how to group them into systems has to be made anyway. Additionally, we explicitly add a category for homophonic music that only has a single logical voice, even though that voice may contain chords with multiple notes being played simultaneously. The reason for singling out homophonic music is that, once notes are grouped into chords, inferring onsets becomes trivial: one can simply read the chords left-to-right without having to do a voice assignment, as would be required for polyphonic music with multiple logical voices. Therefore, we propose the following four categories (see Fig. 12):

(a) Monophonic: only one note (per staff) is played at a time.
(b) Homophonic: multiple notes can occur at the same time to build up a chord, but only as a single voice.
(c) Polyphonic: multiple voices can appear in a single staff.
(d) Pianoform: scores with multiple staffs and multiple voices that exhibit significant structural interactions. They can be much more complex than polyphonic scores and cannot be disassembled into a series of monophonic scores, such as in polyphonic renaissance vocal part books. This term was coined by Byrd and Simonsen [29].

Fig. 11. Examples of scores written in various notations: (a) Common Western Music Notation (Dvorak Symphony No.9, IV), (b) White Mensural Notation (Belli [121]), (c) Tablature (Regondi, Etude No.10) and (d) Braille (Beethoven, Sonata No.14 Op.27 No.2).

¹¹ Although it might sound ambiguous, the term online recognition has been used systematically in the handwritten recognition community. Sometimes, this scenario is also referred to as pen-based recognition.
¹² Handwritten manuscripts can also be printed out, if they were scanned previously, therefore we prefer the word typeset.

This complexity of the encoded music has significant implications on the model design since the various levels translate into different sets of constraints on the output. It cannot simply be adjusted or simulated like the visual complexity by applying an image operation on a perfect image [95] because it represents an intrinsic property of the music.
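The four categories can be read as a simple decision procedure over structural properties of a score. The sketch below is only one possible illustration: `StaffSummary`, `classify`, and the `staffs_interact` flag are hypothetical names, and deciding whether staffs "exhibit significant structural interactions" is assumed to be given as an input rather than computed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Complexity(Enum):
    MONOPHONIC = "monophonic"
    HOMOPHONIC = "homophonic"
    POLYPHONIC = "polyphonic"
    PIANOFORM = "pianoform"


@dataclass
class StaffSummary:
    """Hypothetical per-staff summary of the notated music."""
    voices: int          # number of logical voices notated in the staff
    max_chord_size: int  # largest number of simultaneous notes within one voice


def classify(staffs: List[StaffSummary], staffs_interact: bool = False) -> Complexity:
    """Assign one of the four categories (a)-(d) to a score."""
    # (d) several staffs whose voices interact structurally
    if len(staffs) > 1 and staffs_interact:
        return Complexity.PIANOFORM
    # (c) more than one voice inside a single staff
    if any(s.voices > 1 for s in staffs):
        return Complexity.POLYPHONIC
    # (b) a single voice that contains chords
    if any(s.max_chord_size > 1 for s in staffs):
        return Complexity.HOMOPHONIC
    # (a) one note at a time per staff
    return Complexity.MONOPHONIC
```

Note that the number of staffs alone does not change the result: several independent single-voice staffs are still classified as monophonic, which mirrors the argument that the number of staffs is not itself a driver of complexity.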
Finally, as with other digital document processing, OMR inputs can be classified according to their image quality, which is determined by two independent factors: the underlying document quality and the digital imaging acquisition mode. The underlying document quality is a continuum on a scale from perfect or nearly flawless (e.g., if the document was born-digital and printed) to heavily degraded or defaced documents (e.g., ancient manuscripts that deteriorated over time and exhibit faded ink, ink blots, stains, or bleedthrough) [29]. The image acquisition mode is also a continuum that can reach from born-digital images, over scans of varying quality, to low-quality, distorted photos that originate from camera-based scenarios with handheld cameras, such as smartphones [2, 160].

Fig. 12. Examples of the four categories of music notation complexity: (a) Monophonic, (b) Homophonic, (c) Polyphonic, (d) Pianoform.

6.2 OMR Outputs

The taxonomy of OMR outputs, on the other hand, has not been treated as systematically in the OMR literature. Lists of potential or hypothetical applications are typically given in introductory sections [22, 38, 67, 111]. While this may not seem like a serious issue, it makes it hard to categorize different works and compare their results with each other because one often ends up comparing apples to oranges [7].

The need for a more principled treatment is probably best illustrated by the unsatisfactory state of OMR evaluation. As pointed out by [29, 81, 84], there is currently still no good way to measure and compare the performance of OMR systems.
The lack of such evaluation methods is best illustrated by the way the OMR literature presents the state of the field: some consider it a mature area that works well (at least for typeset music) [5, 12, 61, 62, 134]. Others describe their systems with reports of very high accuracies of up to nearly 100% [33, 91, 99, 104, 110, 122, 145, 160, 161], giving an impression of success; however, many of these numbers are symbol detection scores on a small corpus with a limited vocabulary that are not straightforward to interpret in terms of actual usefulness, since they do not generalize [19, 29].¹³ The existence of commercial applications [71, 106–108, 112, 130, 149] is also sometimes used to support the claim that OMR “works” [13].

¹³ The problem of incomparable results has already been noted in the very first review of OMR in 1972 by Kassler [96] when he reviewed the first two OMR theses by Pruslin [126] and Prerau [123].
¹⁴ Some users of the Choral Public Domain Library (CPDL) project use commercial applications such as SharpEye or PhotoScore Ultimate: http://forums.cpdl.org/phpBB3/viewtopic.php?f=9&t=9392

On the other hand, many researchers think otherwise [19, 28, 40, 46, 82, 83, 109, 118, 132, 133], emphasizing that OMR does not provide satisfactory solutions in general, not even for typeset music. Some indirect evidence of this can be gleaned from the fact that even for high-quality scans of typeset music, only a few projects rely on OMR,¹⁴ while other projects still prefer to
crowdsource the manual transcription instead of using systems for automatic recognition [78], or at least crowdsource the correction of the errors produced by OMR systems [141].

Given the long-standing absence of OMR evaluation standards, this ambivalence is not surprising. However, a scientific field should be able to communicate its results in comprehensible terms to external stakeholders, something OMR is currently unable to do.

We feel that to a great extent this confusion stems from the fact that the question “Does OMR work?” is an overly vague question. As our analysis in Section 2 shows, OMR is not a monolithic problem; therefore, asking about the “state of OMR” is under-specified. “Does OMR work?” must be followed by “... as a tool for X,” where X is some application, in order for such questions to be answerable. There is, again, evidence for this in the OMR literature. OMR systems have been properly evaluated in retrieval scenarios [1, 10, 66] or in the context of digitally replicating a musicological study [83]. It has, in fact, been explicitly asserted [81] that evaluation methodologies are only missing for a limited subset of OMR applications. Specifically, there is no known meaningful edit distance between two scores (whatever their underlying representation).

At the same time, the granularity at which we define the various tasks should not be too fine, otherwise one risks entering a different swamp: instead of no evaluation at all, each individual work is evaluated on the merits of a narrowly defined (and often merely hypothetical) application scenario, which also leads to incomparable contributions. In fact, this risk has already been illustrated on the subtask of symbol detection, which seems like a well-defined problem where the comparison should be trivial.
In 2018, multiple music notation object detection papers were published [82, 116, 117, 152], but each reported results in a different way while presenting a good argument for choosing that kind of evaluation, so significant effort was necessary in order to make these contributions directly comparable [119]. A compromise is therefore necessary between fully specifying the question of whether OMR “works” by asking for a specific application scenario and, on the other hand, retaining sufficiently general categories of such tasks.

Having put forward the reasoning for why systematizing the field of OMR with respect to its outputs is desirable, we proceed to do so. For defining meaningful categories of outputs for OMR, we come back to the fundamentals of how OMR inverts the music encoding process to recover the musical semantics and musical notation (see Section 2). These two prongs of reading musical documents roughly correspond to two broad areas of OMR applications [105] that overlap to a certain extent:

• Replayability: recovering the encoded music itself in terms of pitch, velocity, onset, and duration. This application area sees OMR as a component inside a bigger music processing pipeline that enables the system to operate on music notation documents as just another input. Notice that readability by humans is not required for these applications, as long as the computer can process and “play” the symbolic data.
• Structured Encoding: recovering the music along with the information on how it was encoded using elements of music notation. This avenue is oriented towards providing the score for music performance, which requires a (lossless) re-encoding of the score and assumes that humans read the OMR output directly.
Recovering the musical semantics might not in fact be strictly necessary, but in practice, one often wishes to obtain that information too, in order to enable digitally manipulating the music in a way that would be easiest done with the semantics being recovered (e.g., transposing a part to make it suitable for another instrument).

In other words, the output of an application that targets replayability is typically processed by a machine, whereas humans usually demand the complete recognition of the structured encoding to allow for a readable output (see Fig. 2).

While the distinction between replayability and structured encoding is already useful, there are other reasons that make it interesting to read musical notation from a document, for example, to search for specific content or to draw paleographic conclusions about the document itself. Therefore, we need to broaden the scope of OMR to actually capture these applications. We realized that some use-cases require much less comprehension of the input and music notation than others. To account for this, we propose the following four categories that demand an increasing level of comprehension: Document Metadata Extraction, Search, Replayability, and Structured Encoding (see Fig. 13).

[Figure 13: axis “Level of Comprehension” running from Partial to Complete, with the categories Document Metadata Extraction, Search, Replayability, and Structured Encoding.]

Fig. 13.
Taxonomy of four categories of OMR applications that require an increasing level of comprehension, starting with metadata extraction where a minimal understanding might be sufficient, up to structured encoding that requires a complete understanding of music notation with all its intricacies.

Depending on the goal, applications differ quite drastically in terms of requirements, foremost in the choice of output representation. Furthermore, this taxonomy allows us to use different evaluation strategies.

6.2.1 Document Metadata Extraction. The first application area requires only a partial understanding of the entire document and attempts to answer specific questions about it. These can be very primitive ones, like whether a document contains music scores or not, but the questions can also be more elaborate, for example:

• In which period was the piece written?
• What notation was used?
• How many instruments are depicted?
• Are two segments written by the same copyist?

All of the aforementioned tasks entail a different level of underlying computational complexity. However, we are not organizing applications according to their difficulty but instead by the type of answer they provide. In that sense, all of these tasks can be formulated as classification or regression problems, for which the output is either a discrete category or a continuous value, respectively.

Definition 2. Document metadata extraction refers to a class of Optical Music Recognition applications that answer questions about the music notation document.

The output representations for document metadata extraction tasks are scalar values or category labels; if not, their structure is determined by the user, not by the properties of the domain. Again, this does not imply that extracting the target values is necessarily easy, but that the difficulties are not related to the output representation, as is the case for other uses.
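Because these tasks reduce to classification or regression, their outputs can be scored with standard metrics such as accuracy, F-measure, or mean squared error. The following sketch shows these three metrics in plain Python; the function names and the example labels (`"score"`, `"other"`) are illustrative assumptions and do not come from any particular OMR system or dataset.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that equal the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
              positive: Hashable) -> float:
    """Harmonic mean of precision and recall for one 'positive' class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference, e.g., for a predicted year of composition."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: does an image depict a music score?
truth = ["score", "score", "other", "other"]
predicted = ["score", "other", "other", "other"]
print(accuracy(truth, predicted))            # 0.75
print(f_measure(truth, predicted, "score"))  # precision 1.0, recall 0.5

# Regression: estimated year in which a piece was written (made-up values).
print(mean_squared_error([1800, 1750], [1810, 1730]))  # 250.0
```

The point of the sketch is that none of these metrics depends on music notation itself, which is why evaluation in this category is comparatively unproblematic.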
Although this type of application has not been very popular in the OMR literature, there are some works that approach this scenario. In [9] and [118] the authors describe systems that classify images according to whether they depict music scores or not. While the former used a basic computer vision approach with a Hough transform and run-length ratios, the latter uses a deep convolutional neural network. Such systems can come in handy if one has to automatically classify a very large number of documents [114]. Perhaps the most prominent application is identifying the writer of a document [63, 64, 77, 139] (which can be different from the composer). This task was one of the main motivations behind the construction of the CVC-MUSCIMA dataset [60] and was featured in the ICDAR 2011 Music Score Competition [59].

The document metadata extraction scenario has the advantage of its unequivocal evaluation protocols. Tasks are formulated as either classification or regression, and these have well-defined metrics such as accuracy, F-measure, or mean squared error.

6.2.2 Search. Nowadays we have access to a vast amount of musical documents. Libraries and communities have made considerable efforts to catalog and digitize music scores, by scanning them and freely providing users access to them, e.g., IMSLP [125], SLUB [140], DIAMM [20] or CPDL [113], to name a few. There is a fast-growing interest in automated methods which would allow users to search for relevant musical content inside these sources systematically.
Unfortunately, searching for specific content often remains elusive because many projects only provide the images and manually entered metadata. We capture all applications that enable such lookups under the category Search. Examples of search questions could be:

• Do I have this piece of music in my library?
• On which page can I find this melody?
• Where does this sequence of notes (e.g., a theme) repeat itself?
• Was a melody copied from another composition?
• Find the same measure in different editions for comparing them.

Definition 3. Search refers to a class of Optical Music Recognition applications that, given a collection of sheet music and a musical query, compute the relevance of individual items of the collection with respect to the given query.

Applications from this class share a direct analogy with keyword spotting (KWS) in the text domain [74] and a common formulation: the input is a query as well as the collection of documents where to look for it; the output is the selection of elements from that collection that match the query. However, “where” is a loose concept and can refer to a complete music piece, a page, or in the most specific cases, a particular bounding-box or even a pixel-level location. In the context of OMR, the musical query must convey musical semantics (as opposed to general search queries, e.g., by title or composer; hence the term “musical” query in Definition 3). The musical query is typically represented in a symbolic way, interpretable unambiguously by the computer (similar to query-by-string in KWS), yet it is also interesting to consider queries that involve other modalities, such as image queries (query-by-example in KWS) or audio queries (query-by-humming in audio information retrieval or query-by-speech in KWS).
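The input/output formulation of Definition 3 can be illustrated with a deliberately naive symbolic (query-by-string) search. The sketch assumes that every document has already been reduced to a flat list of MIDI pitch numbers, which is a strong simplification made only for illustration; the function name `search`, the document identifiers, and the use of the number of occurrences as the relevance score are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def search(collection: Dict[str, List[int]],
           query: List[int]) -> List[Tuple[str, List[int]]]:
    """Return (document id, match positions), most occurrences first."""
    results: List[Tuple[str, List[int]]] = []
    if not query:
        return results  # an empty query matches nothing by convention here
    for doc_id, pitches in collection.items():
        positions = [
            start
            for start in range(len(pitches) - len(query) + 1)
            if pitches[start:start + len(query)] == query
        ]
        if positions:
            results.append((doc_id, positions))
    # Relevance is approximated by how often the query occurs in a document.
    results.sort(key=lambda item: len(item[1]), reverse=True)
    return results


collection = {
    "sonata": [60, 62, 64, 65, 60, 62, 64],
    "etude": [67, 60, 62, 64],
    "waltz": [72, 71],
}
print(search(collection, [60, 62, 64]))
# [('sonata', [0, 4]), ('etude', [1])]
```

Here the answer to "where" is a note index inside a document; a real system could equally report a page or a bounding box, depending on how much layout information is retained.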
Additionally, it makes sense to establish different domain-specific types of matching, as it is useful to perform searches restricted to specific music concepts such as melodies, sequences of intervals, or contours, in addition to exact matching.

A direct approach for search within music collections is to use OMR technology to transform the documents into symbolic pieces of information, over which classical content-based or symbolic retrieval methods can be used [1, 14, 47, 52, 55, 88, 97, 151]. The problem is that these transformations require a more comprehensive understanding of the processed documents (see Sections 6.2.3 and 6.2.4 below). To avoid the need for an accurate symbol-by-symbol transcription, search applications can resort to other methods to determine whether (or how likely) a given query is in a document or not. For instance, in cross-modal settings, where one searches a database of sheet music using a MIDI file [10, 66] or a melodic fragment that is given by the user on the fly [1], OMR can be used as a hash function. When the queries and documents are both projected into the search space by the same OMR system, some limitations of the system may even cancel out (e.g., ignoring key signatures), so that retrieval performance might deteriorate less than one would expect. Unfortunately, if either the query or the database contains the true musical semantics, such errors do become critical [83].

A few more works have also approached the direct search of music content without the need to convert the documents into a symbolic format first.
Examples comprise the works by [100] dealing with a query-by-example task in the CVC-MUSCIMA dataset, and by [35], considering a classical query-by-string formulation over early handwritten scores. In the cross-modal setting, the audio-sheet music retrieval contributions of [54] are an example of a system that explicitly attempts to gain only the minimum level of comprehension of music notation necessary for performing its retrieval job.

Search systems usually retrieve not just a single result but all those that match the input query, typically sorted by confidence. This setting can re-use general information retrieval methodologies for evaluating performance [87, 101], such as precision and recall as well as encompassing metrics like average precision and mean average precision.

6.2.3 Replayability. Replayability applications are concerned with reconstructing the notes encoded in the music notation document. Notice that producing an actual audio file is not considered to be part of OMR, despite being one of the most frequent use-cases of OMR. In any case, OMR can enable these applications by recovering the pitches, velocities, onsets, and durations of notes. This symbolic representation, usually stored as a MIDI file, is already a very useful abstraction of the music itself and allows for plugging in a vast range of computational tools such as:
• synthesis software to produce an audio representation of the composition
• music information retrieval tools that operate on symbolic data
• tools that perform large-scale music-theoretical analysis
• creativity-focused applications [162]

Definition 4. Replayability refers to a class of Optical Music Recognition applications that recover sufficient information to create an audible version of the written music.
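To make the replayable representation concrete, the following is a minimal sketch of a note event carrying the four attributes named above, together with the standard equal-temperament conversion from a MIDI note number to a frequency that synthesis software would need. The class name, field names, and the choice of beats as the time unit are illustrative assumptions, not part of any format discussed in this article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NoteEvent:
    """One replayable note; names and units are illustrative assumptions."""
    pitch: int       # MIDI note number, 0-127 (69 = A4)
    velocity: int    # MIDI velocity, 0-127
    onset: float     # start time in beats
    duration: float  # length in beats


def frequency_hz(pitch: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency of a MIDI note number."""
    return a4_hz * 2.0 ** ((pitch - 69) / 12.0)


def to_seconds(note: NoteEvent, bpm: float) -> Tuple[float, float]:
    """Convert onset and duration from beats to seconds at a fixed tempo."""
    seconds_per_beat = 60.0 / bpm
    return (note.onset * seconds_per_beat, note.duration * seconds_per_beat)
```

A synthesizer or analysis tool would consume a list of such events; nothing about stems, beams, or layout is needed at this level.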
Producing a MIDI (or an equivalent) representation is one key goal for OMR, at least for the foreseeable future, since MIDI is a representation of music that has a long tradition of computational processing for a vast variety of purposes. Many applications have been envisioned that only require replayability, for example, applications that can sight-read the scores to assist practicing musicians or provide missing accompaniment.

Replayability is also a major concern for digital musicology. Historically, the majority of compositions have probably never been recorded and are therefore only available in written form as scores; of these, most compositions have also never been typeset, since typesetting has been a very expensive endeavor, reserved essentially either for works with assured commercial success or for composers with substantial backing by wealthy patrons. Given the price of manual transcription, it is prohibitive to transcribe large historical archives. OMR that produces MIDI, especially if it can do so for manuscripts, is probably the only tool that could open up the vast amount of compositions to quantitative musicological research, which, in turn, could perhaps finally start answering broad questions about the evolution of average musical styles, instead of just relying on the works of the relatively few well-known composers.

Systems designed for the goal of replayability traditionally seek first to obtain the structured encoding of the score (see Section 6.2.4), from which the sequences of notes can be straightforwardly retrieved [82]. However, if the specific goal is to obtain something equivalent to a MIDI representation, it is possible to simplify the recognition and ignore many of the elements of musical notation, as demonstrated by numerous research projects [16, 65, 90, 91, 102, 116, 138]. An even clearer example of this distinction can be observed in the works of Shi et al. [146] as well as van der Wel and Ullrich [157]; both focus only on obtaining the sequence of note pairs (duration, pitch) that are depicted in single-staff images, regardless of how these notes were actually expressed in the document. Another instance of a replay-oriented application is the Gocen system [5], which reads handwritten notes with a specially designed device with the goal of producing a musical performance while ignoring the majority of music notation syntax.

Once a system is able to arrive at a MIDI-like representation, evaluating the results is a matter of comparing sets of pitch-onset-duration triplets. Velocities may optionally be compared too, once the note-by-note correspondence has been established, but can be seen as secondary for many applications. Note, however, that even on the level of describing music as configurations of pitch-velocity-onset-duration quadruples, MIDI is a further simplification that is heavily influenced by its origin as a digital representation of performance, rather than of a composition: the most obvious inadequacy of MIDI is its inability to distinguish pitches that sound equivalent but are named differently, e.g., F-sharp and G-flat (footnote 15). Multiple similarity metrics for comparing MIDI files have been proposed during the Symbolic Melodic Similarity track of the Music Information Retrieval Evaluation eXchange (MIREX) (footnote 16), e.g., by determining the local alignment between the geometric representations of the melodies [153–156].
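The set comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: notes match only if pitch, onset, and duration are exactly equal (no timing tolerance), duplicates collapse because plain sets are used, and the function names are hypothetical. The helper `midi_number` additionally shows how F-sharp and G-flat collapse onto the same MIDI note number.

```python
from typing import Iterable, Tuple

# (MIDI pitch, onset in beats, duration in beats)
Note = Tuple[int, float, float]

# Semitone offsets of the natural notes within an octave.
_STEP = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(step: str, alter: int, octave: int) -> int:
    """MIDI note number of a named pitch (C4 = 60); alter is +1 for sharp, -1 for flat."""
    return 12 * (octave + 1) + _STEP[step] + alter


def note_scores(reference: Iterable[Note],
                predicted: Iterable[Note]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching note triplets."""
    ref, pred = set(reference), set(predicted)
    hits = len(ref & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


reference = [(60, 0.0, 1.0), (62, 1.0, 1.0), (64, 2.0, 2.0)]
# One wrong duration and one spurious note in the system output.
predicted = [(60, 0.0, 1.0), (62, 1.0, 0.5), (64, 2.0, 2.0), (65, 4.0, 1.0)]
precision, recall, f1 = note_scores(reference, predicted)
```

In this toy case two of the four predicted notes are correct and two of the three reference notes are found; a practical evaluation would likely add an onset tolerance and handle repeated notes explicitly.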
Other options could be multi-pitch estimation evaluation metrics [17], Dynamic Time Warping [54], or edit distances between two time-ordered sequences of pitch-duration pairs [33, 163].

6.2.4 Structured Encoding. It can be reasonably stated that digitizing music scores for “human consumption” and score manipulation tasks that a vollkommener Capellmeister (footnote 17) [103] routinely performs, such as part exporting, merging, or transposing for available instruments, is the original motivation of OMR ever since it started [6, 67, 123, 126] and the one that appeals to the widest audience. Given that typesetting music is troublesome and time-consuming, OMR technology represents an attractive alternative to obtain a digital version of music scores on which these operations can be performed efficiently with the assistance of the computer.

This brings us to our last category that requires the highest level of comprehension, called structured encoding. Structured encoding aims to recognize the entire music score while retaining all the engraving information available to a human reader. Since there is no viable alternative to music notation, the system has to fully transcribe the document into a structured digital format with the ultimate goal of keeping the same musical information that could be retrieved from the physical score itself.

Definition 5. Structured Encoding refers to a class of Optical Music Recognition applications that fully decode the musical content, along with the information of 'how' it was encoded by means of music notation.

Note that the difference between replayability and structured encoding can seem vague: for instance, imagine a system that detects all notes and all other symbols and exports them into a MusicXML file. The result, however, is not the structured encoding unless the system also attempts to preserve the information on how the scores were laid out.
That does not mean it has to store the bounding box and exact location of every single symbol, but the engraving information that conveys musical semantics, like whether the stem of a note went up or down.

To illustrate this, consider the musical snippet in Fig. 14. If a system like the one described in [33] recognized this, it would remain restricted to replayability. Not because of the current limitations to monophonic, single-staff music, but due to the selected output representation, which does not store engraving information such as the simplifications that start in the second measure of the top staff (the grayed out 3s that would be omitted in the printing) or the stem directions of the notes in the bottom staff (green and blue) that depict two different voices.

Fig. 14. Beginning of Franz Schubert, Impromptu D.899 No. 2 with omitted thirds starting in the second measure of the top staff (gray) and a color-coding of the two distinct voices in the second staff (green and blue).

In summary, any system discarding engraving information that conveys musical semantics cannot reach, by definition, the structured encoding goal.

Footnote 15: This is the combined heritage of equal temperament, where these two pitches do correspond to the same fundamental frequency, and of the origins of MIDI in genres dominated by fretted and keyboard instruments.
Footnote 16: https://www.music-ir.org/mirex/wiki/MIREX_HOME
Footnote 17: Roughly translated from German as “ideal conductor”.
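The distinction can be illustrated with a small sketch: two notes that differ only in engraving information (stem direction and voice) become indistinguishable once they are projected onto a MIDI-like triplet. The `EngravedNote` class is a hypothetical, heavily reduced stand-in for a structured encoding; real formats such as MusicXML or MEI record far more than these five fields.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EngravedNote:
    """A note as written on the page; a hypothetical, minimal field set."""
    pitch: int       # MIDI note number
    onset: float     # in beats
    duration: float  # in beats
    stem: str        # "up" or "down"
    voice: int       # voice index within the staff


def to_replayable(note: EngravedNote) -> Tuple[int, float, float]:
    """Project onto a MIDI-like triplet, discarding engraving information."""
    return (note.pitch, note.onset, note.duration)


# Same sounding note, written once in each of two voices of one staff.
upper = EngravedNote(pitch=67, onset=0.0, duration=1.0, stem="up", voice=1)
lower = EngravedNote(pitch=67, onset=0.0, duration=1.0, stem="down", voice=2)
```

Here `upper` and `lower` are different objects at the structured level, yet `to_replayable` maps both to the same triplet, so the voice assignment cannot be recovered from the replayable output alone.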
To help understand why structured encoding poses such a difficult challenge, we would like to avail ourselves of the intuitive comparison given by Donald Byrd (footnote 18): representing music as time-stamped events (e.g., with MIDI) is similar to storing a piece of writing in a plain text file, whereas representing music with music notation (e.g., with MusicXML) is similar to a structured description like an HTML website. By analogy, obtaining the structured encoding from the image of a music score can be as challenging as recovering the HTML source code from the screenshot of a website.

Since this use-case appeals to the widest audience, it has seen development both from the scientific research community and commercial vendors. Notable products that attempt full structured encoding include SmartScore [106], Capella Scan [37], PhotoScore [108] as well as the open-source application Audiveris [21]. While the projects described in many scientific publications seem to be striving for structured encoding to enable interesting applications such as the preservation of the cultural heritage [39], music renotation [41], or transcriptions between different music notation languages [135], we are not aware of any systems in academia that would actually produce structured encoding.

A major stumbling block for structured encoding applications has for a long time been the lack of practical formats for representing music notation that would be powerful enough to retain the information from the input score, and at the same time be a natural endpoint for OMR. This is illustrated by papers that propose OMR-specific representations, both before the emergence of MusicXML [75, 76] as a viable interchange format [105] and after [86].
At the same time, however, even without regard for OMR, there are ongoing efforts to improve music notation file formats: further development of MusicXML has moved into the W3C Music Notation Community Group (footnote 19), and there is an ongoing effort in the development of the Music Encoding Initiative format [137], best illustrated by the annual Music Encoding Conference (footnote 20).

Supporting the whole spectrum of music notation situations that arise in a reasonably-sized archive is already a difficult task. This can be evidenced by the extensive catalog of requirements for music notation formats that Byrd and Isaacson [27] list for a multi-purpose digital archive of music scores. Incidentally, the same paper also mentions support for syntactically incorrect scores among the requirements, which is one of the major problems that OMR has with outputting to existing formats directly. Although these formats are becoming more precise and descriptive, they are not designed to store information about how the content was automatically recognized from the document. This kind of information is actually relevant for the evaluation of systems, as it allows, for example, determining whether a pitch was misclassified because of a wrongly detected position in the staff or a wrongly detected clef.

The imperfections of representation standards for music notation are also reflected in a lack of evaluation standards for structured encoding.

Footnote 18: http://music.informatics.indiana.edu/don_notation.html
Footnote 19: https://www.w3.org/community/music-notation/
Footnote 20: https://music-encoding.org/conference/past.html
Given the ground truth representation of a score and the output of a recognition system, there is currently no automatic method that is capable of reliably computing how well the recognition system performed. Ideally, such a method would be rigorously described and evaluated, have a public implementation, and give meaningful results. Within the traditional OMR pipeline, the partial steps (such as symbol detection) can use rather general evaluation metrics. However, when OMR is applied to obtain the structured encoding of the score, no evaluation metric is available, or at least generally accepted, partially because of the lack of a standard representation for OMR output, as mentioned earlier. The notion of “edit cost” or “recognition gain”, which defines success in terms of how much time a human editor saves by using an OMR system, is yet more problematic, as it depends on the editor and on the specific toolchain [19].

There is no reason why a proper evaluation should not be possible, since there is only a finite amount of information that a music document retains, which can be exhaustively enumerated. It follows that we should be able to measure what proportion of this information our systems recovered correctly. The reason why this is still such a hard problem is that there is no underlying formal model of music notation. Such a model could support structured encoding evaluation by being:
• Comprehensive: integrating naturally both the “reprintability” and “replayability” level (also called graphical and semantical level in the literature), by being capable of describing the various corner cases (which implies extensibility);
• Useful: enabling tractable inference (at least approximate) and an adequate distance function; and
• Sufficiently supported through open-source software.

The existing XML formats for encoding music notation are inadequate representations for OMR.
For example, the XML tree structure is unsuitable, as evidenced by the frequent need for referencing the XML elements across arbitrarily distant subtrees. Historically, context-free grammars have been the most explored avenue for a unified formal description of music notation, both with an explicit grammar [4, 49] and implicitly using a modified stack automaton [8]: this feels natural, given that music notation has strict syntactic rules and hierarchical structures that invite such descriptions. The 2-D nature of music notation also inspired graph grammars [56] and attributed graph grammars [15]. Recently, modeling music notation as a directed acyclic graph has been proposed as an alternative [82, 86]. However, none of these formalisms has yet been adopted: the notation graph is too recent and does not have sufficient software and community support, and the older grammar-based approaches lack up-to-date open-source implementations altogether (and are insufficiently detailed in the respective publications for re-implementation). Without an appropriate formalism and the corresponding tooling, the evaluation of structured encoding can hardly hope to move beyond ad-hoc methods.

Hajič [81] argues that a good OMR evaluation metric should be intrinsic (footnote 21) and independent of a certain use-case.
The benefits would be the independence from the selected score editing toolchain as well as the music notation format and a clearly interpretable automatic metric for guiding OMR development (which could ideally be used as a differentiable loss function for training full-pipeline end-to-end machine learning-based systems). This question is still one of the major issues in the field.

7 APPROACHES TO OMR

In order to complete our journey through the landscape of Optical Music Recognition, we have yet to visit the arena of OMR techniques. These have recently undergone a paradigm shift towards machine learning that has brought about a need to revisit the way that OMR methods have traditionally been systematized. As opposed to OMR applications, the vocabulary of OMR methods and subtasks already exists [132] and only needs to be updated to reflect the new reality of the field.

As mentioned before, obtaining the structured encoding of the scores has been the main motivation to develop the OMR field. Given the difficulty of such an objective, the process was usually approached by dividing it into smaller stages that could represent challenges within reach with the available technologies and resources. Over the years, the pipeline described by Bainbridge and Bell [7] and refined by Rebelo et al. in 2012 [132] became the de-facto standard. That pipeline is traditionally organized into the following four blocks, sometimes with slightly varying names and scopes of the individual stages:
(1) Preprocessing: Standard techniques to ease further steps, e.g., contrast enhancement, binarization, skew-correction or noise removal. Additionally, the layout should be analyzed to allow subsequent steps to focus on actual content and ignore the background.
(2) Music Object Detection: Finding and classifying all relevant symbols or glyphs in the image.
(3) Notation Assembly: Recovering the music notation semantics from the detected and classified symbols.
The output is a symbolic representation of the symbols and their relationships, typically as a graph.
(4) Encoding: Encoding the music into any output format unambiguously, e.g., into MIDI for playback or MusicXML/MEI for further editing in a music notation program.

With the appearance of deep learning in OMR, many steps that traditionally produced suboptimal results, such as staff-line removal or symbol classification, have seen drastic improvements [70, 118] and are nowadays considered solved or at least clearly solvable. This caused some steps to become obsolete or collapse into a single (bigger) stage. For instance, the music object detection stage was traditionally separated into a segmentation stage and a classification stage. Since staff lines make it hard to separate isolated symbols through connected component analysis, they typically were removed first, using a separate method. However, deep learning models with convolutional neural networks have been shown to be able to deal with the music object detection stage holistically without having to remove staff lines at all. In addition to the performance gains, a compelling advantage is that these models can be trained in a single step by merely providing pairs of images and positions of the music objects to be found, eliminating the preprocessing step altogether. A baseline of competing approaches on several datasets containing both handwritten and typeset music can be found in the work of Pacha et al. [119].

The recent advances also diversified the way OMR is approached altogether: there are alternative pipelines with their own ongoing research that attempt to face the whole process in a single step. This holistic paradigm, also referred to as end-to-end systems, has been dominating the current state of the art in other tasks such as text, speech, or mathematical formula recognition [45, 48, 163]. However, due to the complexity of how musical semantics are inferred from the image, it is difficult (for now) to formulate it as a learnable optimization problem. While end-to-end systems for OMR do exist, they are still limited to a subset of music notation, at best. Pugin pioneered this approach utilizing hidden Markov models for the recognition of typeset mensural notation [127], and some recent works have considered deep recurrent neural networks for monophonic music written in both typeset [32, 33, 146, 157] and handwritten [13] modern notation. Unfortunately, polyphonic and pianoform scores are currently out of reach for end-to-end models: it is not just that the results would be disappointing; there is simply no appropriate model formulation. Therefore, even when only trying to produce the “notes” (semantics), one may choose to recover some of the engraving decisions explicitly as well, relying on the rules of inferring musical semantics as in the last stages of the traditional pipeline.

Footnote 21: Extrinsic evaluation means evaluating the system in an application context: “How good is this system for purpose X?” Intrinsic evaluation attempts to evaluate a system without reference to a specific use-case, asking how much of the encoded information has been recovered. In the case of OMR, this essentially reduces evaluation to error counting.
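Returning to the four-stage pipeline listed earlier in this section, the hand-over between its last stages can be sketched in a few lines. This is a toy illustration only: it assumes that stages 1 and 2 have already produced a list of detected symbols, and it uses a deliberately naive nearest-stem rule for notation assembly. The class, the function names, the `max_gap` threshold, and the dictionary output are all hypothetical and do not correspond to any system cited here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Symbol:
    """A detected music object (output of stage 2); fields are illustrative."""
    kind: str  # e.g. "notehead" or "stem"
    x: float   # horizontal centre in pixels
    y: float   # vertical centre in pixels


def assemble(symbols: List[Symbol], max_gap: float = 10.0) -> List[Tuple[int, int]]:
    """Stage 3 (toy): link each notehead to the horizontally nearest stem.

    Returns directed edges (notehead index -> stem index), i.e. a tiny
    notation graph. Real assembly needs far richer rules or a learned model.
    """
    stems = [i for i, s in enumerate(symbols) if s.kind == "stem"]
    edges = []
    for i, s in enumerate(symbols):
        if s.kind != "notehead" or not stems:
            continue
        nearest = min(stems, key=lambda j: abs(symbols[j].x - s.x))
        if abs(symbols[nearest].x - s.x) <= max_gap:
            edges.append((i, nearest))
    return edges


def encode(symbols: List[Symbol], edges: List[Tuple[int, int]]) -> Dict[str, list]:
    """Stage 4 (toy): serialise nodes and edges into a plain dictionary."""
    return {
        "nodes": [(s.kind, s.x, s.y) for s in symbols],
        "edges": list(edges),
    }


detected = [
    Symbol("notehead", 100.0, 50.0),
    Symbol("stem", 106.0, 30.0),
    Symbol("notehead", 200.0, 60.0),  # no stem nearby, e.g. a whole note
    Symbol("stem", 300.0, 40.0),
]
graph = encode(detected, assemble(detected))
```

The point of the sketch is the interface rather than the rule: detection yields a flat list of objects, assembly adds relationships between them, and encoding serialises the result. An end-to-end system would replace all of these functions with a single learned mapping from image to output.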
Along with the paradigm shift towards machine learning (which nowadays can be considered widely established), several public datasets have emerged, such as MUSCIMA++ [86], DeepScores [152] or Camera-PrIMuS [32] (footnote 22). There are also significant efforts to develop tools by which training data for OMR systems can be obtained, including MUSCIMarker [85], Pixel.js [142], and MuRET [135]. On the other hand, while the machine learning paradigm has undeniably brought significant progress, it has shifted the costs onto data acquisition. This means that while the machine learning paradigm is more general and delivers state-of-the-art results when appropriate data is available, it does not necessarily drive down the costs of applying OMR. Still, we would say, tentatively, that once these resources are spent, the chances of OMR yielding useful results for the specific use-case are higher compared to earlier paradigms.

Tangentially to the way of dealing with the process itself, there has been continuous research on interactive systems for years. The idea behind such systems is based on the insight that OMR systems might always make some errors, and if no errors can be tolerated, the user is essential to correct the output. These systems attempt to incorporate user feedback into the OMR process in a more efficient way than just post-processing system output. Most notable is the interactive system developed by Chen et al. [42, 43], where the user directly interacts with the OMR system by specifying which constraints to take into account while visually recognizing the scores. The user can then iteratively add or remove constraints before re-recognizing individual measures until satisfied. The most powerful feature of interactive systems is probably the display of recognition results superimposed on top of the original image, which allows errors to be spotted quickly [21, 37, 135, 159].
Footnote 22: A full list of all available datasets can be found at https://apacha.github.io/OMR-Datasets/

8 CONCLUSIONS

In this article, we have first addressed what Optical Music Recognition is and proposed to define it as the research field that investigates how to computationally read music notation in documents, a definition that should adequately delimit the field and set it in relation to other fields such as OCR, graphics recognition, computer vision, and fields that await OMR results. We furthermore analyzed in depth the inverse relation of OMR to the process of writing down a musical composition and highlighted the relevance of engraving music properly, something that must also be recognized to ensure readability for humans. The investigation of what OMR is revealed why this seemingly easy task of reading music notation has turned out to be such a hard problem: besides the technical difficulties associated with document analysis, many fundamental challenges arise from the way music is expressed and captured in music notation. By providing a sound, concise and inclusive definition, we capture how the field sees and talks about itself.

We have then reviewed and improved the taxonomy of OMR, which should help systematize the current and future contributions to the field. While the inputs of OMR systems have been described systematically and established throughout the field, a taxonomy of OMR outputs and applications has not been proposed before. An overview of this taxonomy is given in Fig. 15.
Finally, we have also updated the general breakdown of OMR systems into separate subtasks in order to reflect the paradigm shift towards machine learning methods and discussed alternative paradigms such as end-to-end systems and interactive scenarios.

One of the key points we wanted to stress is the internal diversity of the field: OMR is not a monolithic task. As analyzed in Section 4, it enables various use-cases that require fundamentally different system designs, as discussed in Section 6.2. So before creating an OMR system, one should be clear about the goals and the associated challenges.

The sensitivity to errors is another relevant issue that needs to be taken into account. As long as errors are inevitable [43, 50], it is important to consider the impact of those errors on the envisioned application. If someone wants to transcribe a score with an OMR system, but the effort needed for correcting the errors is greater than the effort for directly entering the notes into a music notation program, such an OMR system would obviously be useless [19]. Existing literature on error-tolerance is inconclusive: while we tend to believe that users, especially practicing musicians, would not tolerate false recognitions [136], we also see systems that can handle a substantial amount of OMR errors [1, 50, 83] and still produce meaningful results, e.g., when searching in a large database of scores. Therefore, it cannot be decided in advance how severe errors are, as it is always the end user who sets the extent of tolerable errors.

The reader should now comprehend the spectrum of what OMR might do, understand the challenges that reading music notation entails, and have a solid basis for further exploring the field independently; in other words, be equipped to address the issues described in the next section.
[Fig. 15 diagram. Inputs: document quality (perfect or degraded), image acquisition (born-digital or distorted), notational complexity (monophonic, homophonic, polyphonic, pianoform), signal (offline or online), production (typeset or handwritten), notation (common Western, instrument-specific, or other). System architectures: preprocessing, object detection, notation assembly, encoding; or end-to-end recognition. Outputs: document metadata extraction, search, replayability, structured encoding.]

Fig. 15. An overview of the taxonomy of OMR inputs, architectures, and outputs. A fairly simple OMR system could, for example, read high-quality scans (offline) of well-preserved documents that contain typeset, monophonic, mensural notation, process them in a traditional pipeline and output the results in a MIDI file to achieve replayability. An extremely complex system, on the other hand, would allow images (offline) of handwritten music in common western notation from degraded documents as input and strive to recognize the full structured encoding in an end-to-end system.

8.1 Open Issues and Perspectives for Future Research

We conclude this paper by listing major open problems in Optical Music Recognition that significantly impede its progress and usefulness. While some of them are technical challenges, there are also many non-technical issues:
• Legal aspects: Written music is the intellectual property of the composer and its allowed uses are defined by the respective publisher. Recognizing and sharing music scores can be seen as copyright infringement, like digitizing books without permission. To avoid this dispute, many databases such as IMSLP only store music scores whose copyright protection has expired. So an OMR dataset is either limited to old scores or one enters a legal gray area if not paying close attention to the respective license of every piece stored therein.
• Stable community: For decades, OMR research was conducted by just a few individuals who worked in a distributed and mostly uncoordinated manner. Most OMR researchers joined the field with minor contributions but left again soon afterward. Furthermore, due to a lack of dedicated venues, researchers rarely met in person [30]. This unstable setting, together with researchers not paying sufficient attention to reproducibility, led to the same problems being solved over and over again [115].
• Lack of standard representations: There exist no standard representation formats for OMR outputs, especially not for structured encoding, and virtually every system comes with its own internal representation and output format, even for intermediate steps. This causes incompatibilities between different systems and makes it very hard to replace subcomponents. Work on underlying formalisms for describing music notation can also potentially have a wide impact, especially if done in collaboration with the relevant communities (W3C Community Group on Music Notation, Music Encoding Initiative).
• Evaluation: Due to the lack of standards for outputting OMR results, evaluating them is currently in an equally unsatisfactory state.
An ideal evaluation method would be rigorously described and verified, have a public implementation, give meaningful results, and not rely on a particular use case, thus evaluating the system only intrinsically [81].

On the technical side, there are also many interesting avenues where future research is needed, including:

• Music Object Detection: Recent work has shown that the music object detection stage can be addressed in one step with deep neural networks. However, the accuracy is still far from optimal, which is especially detrimental to the following stages of the pipeline that are based on these results. In order to improve the detection performance, it might be interesting to develop models that are specific to the type of inputs that OMR works on: large images with a high quantity of densely packed objects of various sizes from a vast vocabulary.
• Semantic reconstruction: Merely detecting the music objects in a document does not constitute a complete music notation recognition system, so the music object detection stage must be complemented with semantic reconstruction. Traditionally, this stage is addressed by hand-crafted heuristics that either hardly generalize or do not cover the full spectrum of music notation. Machine learning-based semantic reconstruction represents an unexplored line of research that deserves further consideration.
• Structured encoding research: Despite being the main motivation for OMR in many cases, there is a lack of scientific research and open systems that actually pursue the objective of retrieving the full structured encoding of the input.
• Full end-to-end systems: End-to-end systems are responsible for major advances in machine learning tasks such as text recognition, speech recognition, or machine translation. The state of the art in these fields is based on recurrent neural networks. For design reasons, these networks currently deal only with one-dimensional output sequences.
This fits the aforementioned tasks quite naturally since their outputs are mainly composed of word sequences. However, their application to music notation, except for simple monophonic scores, is not so straightforward, and it is unknown how to formulate an end-to-end learning process for the recognition of fully-fledged music notation in documents.
• Statistical modeling: Most machine learning algorithms are based on statistical models that are able to provide a probability distribution over the set of possible recognition hypotheses. When it comes to recognition, we are typically interested in the best hypothesis (the one that is proposed as the answer) and disregard the probability that the model assigns to it. However, it could be interesting to exploit this uncertainty. For example, in the standard decomposition of stages in OMR systems, the semantic reconstruction stage could benefit from having a set of hypotheses about the objects detected in the previous stage, instead of single proposals. Then, the semantic reconstruction algorithm could establish relationships that are more logical a priori, even if the objects involved have a lower probability according to the object detector. These types of approaches have not been explored in depth in the OMR field. Statistical modeling could also enable the system to report its certainty about the output, giving the end user some notion of the accuracy obtained for the given input.
• Generalizing systems: A pressing issue is generalizing from training datasets to various real-world collections because the costs for data acquisition are still significant and currently
represent a bottleneck for applying state-of-the-art machine learning models in stakeholders’ workflows. However, music notation follows the same underlying rules, regardless of graphical differences such as whether it is typeset or handwritten. Can one leverage a typeset sheet music dataset to train for handwritten notation? Given that typeset notation can be synthetically generated, this would open several opportunities to train systems for handwritten notation without the effort of manually labeling data. Although it seems more difficult to transfer knowledge across different kinds of music notation, a system that recognizes one specific music notation could still be useful for the recognition of shared elements in other styles as well, e.g., across the various mensural notation systems.
• Interactive systems: Interactive systems are based on the idea of including users in the recognition process, given that they are necessary if there is no tolerance for errors, something that at the moment can only be ensured by human verification. This paradigm reformulates the objective of the system, which is no longer improving accuracy but reducing the effort (usually measured as time) that the users invest in aiding the machine to achieve that perfect result. This aid can be provided in many different ways, for example, through error corrections that are fed back into the system or by manually activating and deactivating constraints on the content to be recognized. However, since user effort is the most valuable resource, there is still a need to reformulate the problem based on this concept, which also includes aspects related to human-computer interfaces.
The conventional interfaces of computers are designed to enter text (keyboard) or perform very specific actions (mouse); therefore, it would be interesting to study the use of more ergonomic interfaces for working with musical notation, such as an electronic pen or a MIDI piano, in the context of interactive OMR systems.

We hope that these lists demonstrate that OMR still provides many interesting challenges that await future research.

ACKNOWLEDGMENTS

The authors would like to thank David Rizo and Horst Eidenberger for their valuable feedback and helpful comments on the manuscript.

REFERENCES

[1] Sanu Pulimootil Achankunju. 2018. Music Search Engine from Noisy OMR Data. In 1st International Workshop on Reading Music Systems. Paris, France, 23–24.
[2] Julia Adamska, Mateusz Piecuch, Mateusz Podgórski, Piotr Walkiewicz, and Ewa Lukasik. 2015. Mobile System for Optical Music Recognition and Music Sound Generation. In Computer Information Systems and Industrial Management. Cham, 571–582.
[3] Francisco Álvaro, Joan-Andreu Sánchez, and José-Miguel Benedí. 2016. An integrated grammar-based approach for mathematical expression recognition. Pattern Recognition 51 (2016), 135–147.
[4] Alfio Andronico and Alberto Ciampa. 1982. On Automatic Pattern Recognition and Acquisition of Printed Music. In International Computer Music Conference. Venice, Italy.
[5] Tetsuaki Baba, Yuya Kikukawa, Toshiki Yoshiike, Tatsuhiko Suzuki, Rika Shoji, Kumiko Kushiyama, and Makoto Aoki. 2012. Gocen: A Handwritten Notational Interface for Musical Performance and Learning Music. In ACM SIGGRAPH 2012 Emerging Technologies. New York, USA, 9–9.
[6] David Bainbridge and Tim Bell. 1997. Dealing with Superimposed Objects in Optical Music Recognition. In 6th International Conference on Image Processing and its Applications. 756–760.
[7] David Bainbridge and Tim Bell. 2001. The Challenge of Optical Music Recognition. Computers and the Humanities 35, 2 (2001), 95–121.
[8] David Bainbridge and Tim Bell. 2003.
A music notation construction engine for optical music recognition. Software: Practice and Experience 33, 2 (2003), 173–200.
[9] David Bainbridge and Tim Bell. 2006. Identifying music documents in a collection of images. In 7th International Conference on Music Information Retrieval. Victoria, Canada, 47–52.
[10] Stefan Balke, Sanu Pulimootil Achankunju, and Meinard Müller. 2015. Matching Musical Themes Based on Noisy OCR and OMR Input. In International Conference on Acoustics, Speech and Signal Processing. 703–707.
[11] Arnau Baró, Pau Riba, Jorge Calvo-Zaragoza, and Alicia Fornés. 2017. Optical Music Recognition by Recurrent Neural Networks. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 25–26.
[12] Arnau Baró, Pau Riba, and Alicia Fornés. 2016. Towards the recognition of compound music notes in handwritten music scores. In 15th International Conference on Frontiers in Handwriting Recognition. 465–470.
[13] Arnau Baró, Pau Riba, and Alicia Fornés. 2018. A Starting Point for Handwritten Music Recognition. In 1st International Workshop on Reading Music Systems. Paris, France, 5–6.
[14] Louis W. G. Barton. 2002. The NEUMES Project: digital transcription of medieval chant manuscripts. In 2nd International Conference on Web Delivering of Music. 211–218.
[15] Stephan Baumann. 1995. A Simplified Attributed Graph Grammar for High-Level Music Recognition. In 3rd International Conference on Document Analysis and Recognition. 1080–1083.
[16] Stephan Baumann and Andreas Dengel. 1992. Transforming Printed Piano Music into MIDI.
In Advances in Structural and Syntactic Pattern Recognition. World Scientific, 363–372.
[17] Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. 2009. Evaluation of Multiple-F0 Estimation and Tracking Systems. In 10th International Society for Music Information Retrieval Conference. Kobe, Japan, 315–320.
[18] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. 2001. Optical music sheet segmentation. In 1st International Conference on WEB Delivering of Music. 183–190.
[19] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. 2007. Assessing Optical Music Recognition Tools. Computer Music Journal 31, 1 (2007), 68–93.
[20] Margaret Bent and Andrew Wathey. 1998. Digital Image Archive of Medieval Music. https://www.diamm.ac.uk/
[21] Hervé Bitteur. 2004. Audiveris. https://github.com/audiveris
[22] Dorothea Blostein and Henry S. Baird. 1992. A Critical Survey of Music Image Analysis. In Structured Document Image Analysis. Springer Berlin Heidelberg, 405–434.
[23] Dorothea Blostein and Nicholas Paul Carter. 1992. Recognition of Music Notation: SSPR’90 Working Group Report. In Structured Document Image Analysis. Springer Berlin Heidelberg, 573–574.
[24] Dorothea Blostein and Lippold Haken. 1991. Justification of Printed Music. Commun. ACM 34, 3 (1991), 88–99.
[25] John Ashley Burgoyne, Johanna Devaney, Laurent Pugin, and Ichiro Fujinaga. 2008. Enhanced Bleedthrough Correction for Early Music Documents with Recto-Verso Registration. In 9th International Conference on Music Information Retrieval. Philadelphia, PA, 407–412.
[26] John Ashley Burgoyne, Ichiro Fujinaga, and J. Stephen Downie. 2015. Music Information Retrieval. In A New Companion to Digital Humanities. Wiley Blackwell, 213–228.
[27] Donald Byrd and Eric Isaacson. 2016. A Music Representation Requirement Specification for Academia. Technical Report. Indiana University, Bloomington.
[28] Donald Byrd and Megan Schindele. 2006. Prospects for Improving OMR with Multiple Recognizers.
In 7th International Conference on Music Information Retrieval. 41–46.
[29] Donald Byrd and Jakob Grue Simonsen. 2015. Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. Journal of New Music Research 44, 3 (2015), 169–195.
[30] Jorge Calvo-Zaragoza, Jan Hajič jr., and Alexander Pacha. 2018. Discussion Group Summary: Optical Music Recognition. In Graphics Recognition, Current Trends and Evolutions (Lecture Notes in Computer Science). 152–157.
[31] Jorge Calvo-Zaragoza and Jose Oncina. 2014. Recognition of Pen-Based Music Notation: The HOMUS Dataset. In 22nd International Conference on Pattern Recognition. 3038–3043.
[32] Jorge Calvo-Zaragoza and David Rizo. 2018. Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores. In 19th International Society for Music Information Retrieval Conference. Paris, France, 248–255.
[33] Jorge Calvo-Zaragoza and David Rizo. 2018. End-to-End Neural Optical Music Recognition of Monophonic Scores. Applied Sciences 8, 4 (2018).
[34] Jorge Calvo-Zaragoza, Alejandro Toselli, and Enrique Vidal. 2017. Handwritten Music Recognition for Mensural Notation: Formulation, Data and Baseline Results. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 1081–1086.
[35] Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal. 2018. Probabilistic Music-Symbol Spotting in Handwritten Scores. In 16th International Conference on Frontiers in Handwriting Recognition. Niagara Falls, USA, 558–563.
[36] Carlos E. Cancino-Chacón, Maarten Grachten, Werner Goebl, and Gerhard Widmer. 2018. Computational Models of Expressive Music Performance: A Comprehensive and Critical Review. Frontiers in Digital Humanities 5 (2018), 25.
[37] capella-software AG. 1996. Capella Scan. https://www.capella-software.com
[38] Nicholas Paul Carter. 1992. A New Edition of Walton’s Façade Using Automatic Score Recognition. In Advances in Structural and Syntactic Pattern Recognition. World Scientific, 352–362.
[39] Gen-Fang Chen and Jia-Shing Sheu. 2014. An optical music recognition system for traditional Chinese Kunqu Opera scores written in Gong-Che Notation. EURASIP Journal on Audio, Speech, and Music Processing 2014, 1 (2014), 7.
[40] Liang Chen and Kun Duan. 2016. MIDI-assisted egocentric optical music recognition. In Winter Conference on Applications of Computer Vision.
[41] Liang Chen, Rong Jin, and Christopher Raphael. 2015. Renotation from Optical Music Recognition. In Mathematics and Computation in Music. Cham, 16–26.
[42] Liang Chen, Rong Jin, and Christopher Raphael. 2017. Human-Guided Recognition of Music Score Images. In 4th International Workshop on Digital Libraries for Musicology.
[43] Liang Chen and Christopher Raphael. 2018. Optical Music Recognition and Human-in-the-loop Computation. In 1st International Workshop on Reading Music Systems. Paris, France, 11–12.
[44] Atul K. Chhabra. 1998. Graphic symbol recognition: An overview. In Graphics Recognition Algorithms and Systems. Berlin, Heidelberg, 68–79.
[45] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4774–4778.
[46] Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg, and Richard Zanibbi. 2017. Bootstrapping Samples of Accidentals in Dense Piano Scores for CNN-Based Detection. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan.
[47] G. Sayeed Choudhury, M. Droetboom, Tim DiLauro, Ichiro Fujinaga, and Brian Harrington. 2000. Optical Music Recognition System within a Large-Scale Digitization Project. In 1st International Symposium on Music Information Retrieval.
[48] Arindam Chowdhury and Lovekesh Vig. 2018. An Efficient End-to-End Neural Model for Handwritten Text Recognition. In 29th British Machine Vision Conference.
[49] Bertrand Coüasnon and Jean Camillerapp. 1994. Using Grammars to Segment and Recognize Music Scores. In International Association for Pattern Recognition Workshop on Document Analysis Systems. Kaiserslautern, Germany, 15–27.
[50] Tim Crawford, Golnaz Badkobeh, and David Lewis. 2018. Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words. In 19th International Society for Music Information Retrieval Conference. Paris, France, 233–239.
[51] Christoph Dalitz, Georgios K. Michalakis, and Christine Pranzas. 2008. Optical recognition of psaltic Byzantine chant notation. International Journal of Document Analysis and Recognition 11, 3 (2008), 143–158.
[52] Jürgen Diet. 2018. Innovative MIR Applications at the Bayerische Staatsbibliothek. In 5th International Conference on Digital Libraries for Musicology. Paris, France.
[53] Ing-Jr Ding, Chih-Ta Yen, Che-Wei Chang, and He-Zhong Lin. 2014. Optical music recognition of the singer using formant frequency estimation of vocal fold vibration and lip motion with interpolated GMM classifiers. Journal of Vibroengineering 16, 5 (2014), 2572–2581.
[54] Matthias Dorfer, Jan Hajič jr., Andreas Arzt, Harald Frostel, and Gerhard Widmer. 2018. Learning Audio–Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification.
Transactions of the International Society for Music Information Retrieval 1, 1 (2018), 22–33.
[55] Matthew J. Dovey. 2004. Overview of the OMRAS Project: Online Music Retrieval and Searching. Journal of the American Society for Information Science and Technology 55, 12 (2004), 1100–1107.
[56] Hoda M. Fahmy and Dorothea Blostein. 1993. A graph grammar programming style for recognition of music notation. Machine Vision and Applications 6, 2 (1993), 83–99.
[57] Jonathan Feist. 2017. Berklee Contemporary Music Notation. Berklee Press.
[58] Alicia Fornés and Bart Lamiroy (Eds.). 2018. Graphics Recognition, Current Trends and Evolutions. Lecture Notes in Computer Science, Vol. 11009. Springer International Publishing.
[59] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. 2011. The ICDAR 2011 Music Scores Competition: Staff Removal and Writer Identification. In International Conference on Document Analysis and Recognition. 1511–1515.
[60] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. 2012. CVC-MUSCIMA: A Ground-truth of Handwritten Music Score Images for Writer Identification and Staff Removal. International Journal on Document Analysis and Recognition 15, 3 (2012), 243–251.
[61] Alicia Fornés, Josep Lladós, and Gemma Sánchez. 2006. Primitive Segmentation in Old Handwritten Music Scores. In Graphics Recognition. Ten Years Review and Future Perspectives. Berlin, Heidelberg, 279–290.
[62] Alicia Fornés, Josep Lladós, and Gemma Sánchez. 2008. Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. In Graphics Recognition.
Recent Advances and New Opportunities. Berlin, Heidelberg, 51–60.
[63] Alicia Fornés, Josep Lladós, Gemma Sánchez, and Horst Bunke. 2008. Writer Identification in Old Handwritten Music Scores. In 8th International Workshop on Document Analysis Systems. Nara, Japan, 347–353.
[64] Alicia Fornés, Josep Lladós, Gemma Sánchez, and Horst Bunke. 2009. On the Use of Textural Features for Writer Identification in Old Handwritten Music Scores. 10th International Conference on Document Analysis and Recognition (2009), 996–1000.
[65] Stavroula-Evita Fotinea, George Giakoupis, Aggelos Livens, Stylianos Bakamidis, and George Carayannis. 2000. An Optical Notation Recognition System for Printed Music Based on Template Matching and High Level Reasoning. In RIAO ’00 Content-Based Multimedia Information Access. Paris, France, 1006–1014.
[66] Christian Fremerey, Meinard Müller, Frank Kurth, and Michael Clausen. 2008. Automatic Mapping of Scanned Sheet Music to Audio Recordings. In 9th International Conference on Music Information Retrieval. 413–418.
[67] Ichiro Fujinaga. 1988. Optical Music Recognition using Projections. Master’s thesis. McGill University.
[68] Ichiro Fujinaga and Andrew Hankinson. 2014. SIMSSA: Single Interface for Music Score Searching and Analysis. Journal of the Japanese Society for Sonic Arts 6, 3 (2014), 25–30.
[69] Ichiro Fujinaga, Andrew Hankinson, and Julie E. Cumming. 2014. Introduction to SIMSSA (Single Interface for Music Score Searching and Analysis). In 1st International Workshop on Digital Libraries for Musicology. 1–3.
[70] Antonio-Javier Gallego and Jorge Calvo-Zaragoza. 2017. Staff-line removal with selectional auto-encoders. Expert Systems with Applications 89 (2017), 138–148.
[71] Gear Up AB. 2017. iSeeNotes. http://www.iseenotes.com/
[72] Susan E. George. 2003. Online Pen-Based Recognition of Music Notation with Artificial Neural Networks. Computer Music Journal 27, 2 (2003), 70–79.
[73] Susan E. George. 2004.
Wavelets for Dealing with Super-Imposed Objects in Recognition of Music Notation. In Visual Perception of Music Notation: On-Line and Off Line Recognition. IRM Press, Hershey, PA, 78–107.
[74] Angelos P. Giotis, Giorgos Sfikas, Basilis Gatos, and Christophoros Nikou. 2017. A survey of document image word spotting techniques. Pattern Recognition 68 (2017), 310–332.
[75] Michael Good. 2001. MusicXML: An Internet-Friendly Format for Sheet Music. Technical Report. Recordare LLC.
[76] Michael Good and Geri Actor. 2003. Using MusicXML for File Interchange. In Third International Conference on WEB Delivering of Music. 153.
[77] Albert Gordo, Alicia Fornés, and Ernest Valveny. 2013. Writer identification in handwritten musical scores with bags of notes. Pattern Recognition 46, 5 (2013), 1337–1345.
[78] Mark Gotham, Peter Jonas, Bruno Bower, William Bosworth, Daniel Rootham, and Leigh VanHandel. 2018. Scores of Scores: An Openscore Project to Encode and Share Sheet Music. In 5th International Conference on Digital Libraries for Musicology. Paris, France, 87–95.
[79] Elaine Gould. 2011. Behind Bars. Faber Music.
[80] Gianmarco Gozzi. 2010. OMRJX: A framework for piano scores optical music recognition. Master’s thesis. Politecnico di Milano.
[81] Jan Hajič jr. 2018. A Case for Intrinsic Evaluation of Optical Music Recognition. In 1st International Workshop on Reading Music Systems. Paris, France, 15–16.
[82] Jan Hajič jr., Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. 2018. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In 19th International Society for Music Information Retrieval Conference. Paris, France, 225–232.
[83] Jan Hajič jr., Marta Kolárová, Alexander Pacha, and Jorge Calvo-Zaragoza. 2018. How Current Optical Music Recognition Systems Are Becoming Useful for Digital Libraries. In 5th International Conference on Digital Libraries for Musicology. Paris, France, 57–61.
[84] Jan Hajič jr., Jiří Novotný, Pavel Pecina, and Jaroslav Pokorný. 2016. Further Steps towards a Standard Testbed for Optical Music Recognition. In 17th International Society for Music Information Retrieval Conference. New York, USA, 157–163.
[85] Jan Hajič jr. and Pavel Pecina. 2017. Groundtruthing (Not Only) Music Notation with MUSICMarker: A Practical Overview. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 47–48.
[86] Jan Hajič jr. and Pavel Pecina. 2017. The MUSCIMA++ Dataset for Handwritten Optical Music Recognition. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 39–46.
[87] Donna Harman. 2011. Information Retrieval Evaluation (1st ed.). Morgan & Claypool Publishers.
[88] Kate Helsen, Jennifer Bain, Ichiro Fujinaga, Andrew Hankinson, and Debra Lacoste. 2014. Optical music recognition and manuscript chant sources. Early Music 42, 4 (2014), 555–558.
[89] George Heussenstamm. 1987. The Norton Manual of Music Notation. W. W. Norton & Company.
[90] Władysław Homenda. 1996. Automatic recognition of printed music and its conversion into playable music data. Control and Cybernetics 25, 2 (1996), 353–367.
[91] Yu-Hui Huang, Xuanli Chen, Serafina Beck, David Burn, and Luc Van Gool. 2015. Automatic Handwritten Mensural Notation Interpreter: From Manuscript to MIDI Performance. In 16th International Society for Music Information Retrieval Conference. Málaga, Spain, 79–85.
[92] José Manuel Iñesta, Pedro J.
Ponce de León, David Rizo, José Oncina, Luisa Micó, Juan Ramón Rico-Juan, Carlos Pérez-Sancho, and Antonio Pertusa. 2018. HISPAMUS: Handwritten Spanish Music Heritage Preservation by Automatic Transcription. In 1st International Workshop on Reading Music Systems. Paris, France, 17–18.
[93] Linn Saxrud Johansen. 2009. Optical Music Recognition. Master’s thesis. University of Oslo.
[94] Graham Jones, Bee Ong, Ivan Bruno, and Kia Ng. 2008. Optical Music Imaging: Music Document Digitisation, Recognition, Evaluation, and Restoration. In Interactive multimedia music technologies. IGI Global, 50–79.
[95] Nicholas Journet, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images. Journal of Imaging 3, 4 (2017), 62.
[96] Michael Kassler. 1972. Optical Character-Recognition of Printed Music: A Review of Two Dissertations. Automatic Recognition of Sheet Music by Dennis Howard Pruslin; Computer Pattern Recognition of Standard Engraved Music Notation by David Stewart Prerau. Perspectives of New Music 11, 1 (1972), 250–254.
[97] Klaus Keil and Jennifer A. Ward. 2017. Applications of RISM data in digital libraries and digital musicology. International Journal on Digital Libraries (2017).
[98] Daniel Lopresti and George Nagy. 2002. Issues in Ground-Truthing Graphic Documents. In Graphics Recognition Algorithms and Applications. Springer Berlin Heidelberg, Ontario, Canada, 46–67.
[99] Nawapon Luangnapa, Thongchai Silpavarangkura, Chakarida Nukoolkit, and Pornchai Mongkolnam. 2012. Optical Music Recognition on Android Platform. In International Conference on Advances in Information Technology. 106–115.
[100] Rakesh Malik, Partha Pratim Roy, Umapada Pal, and Fumitaka Kimura. 2013. Handwritten Musical Document Retrieval Using Music-Score Spotting. In 12th International Conference on Document Analysis and Recognition. 832–836.
[101] Christopher D.
Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[102] T. Matsushima, I. Sonomoto, T. Harada, K. Kanamori, and S. Ohteru. 1985. Automated High Speed Recognition of Printed Music (WABOT-2 Vision System). In International Conference on Advanced Robotics. 477–482.
[103] Johann Mattheson. 1739. Der vollkommene Capellmeister. Herold, Christian, Hamburg.
[104] Apurva A. Mehta and Malay S. Bhatt. 2015. Optical Music Notes Recognition for Printed Piano Music Score Sheet. In International Conference on Computer Communication and Informatics. Coimbatore, India.
[105] Hidetoshi Miyao and Robert Martin Haralick. 2000. Format of Ground Truth Data Used in the Evaluation of the Results of an Optical Music Recognition System. In 4th International Workshop on Document Analysis Systems. Brasil, 497–506.
[106] Musitek. 2017. SmartScore X2. http://www.musitek.com/smartscore-pro.html
[107] Neuratron. 2015. NotateMe. http://www.neuratron.com/notateme.html
[108] Neuratron. 2018. PhotoScore 2018. http://www.neuratron.com/photoscore.htm
[109] Kia Ng, Alex McLean, and Alan Marsden. 2014. Big Data Optical Music Recognition with Multi Images and Multi Recognisers. In EVA London 2014 on Electronic Visualisation and the Arts. 215–218.
[110] Tam Nguyen and Gueesang Lee. 2015. A Lightweight and Effective Music Score Recognition on Mobile Phones. Journal of Information Processing Systems 11, 3 (2015), 438–449.
[111] Jiří Novotný and Jaroslav Pokorný. 2015. Introduction to Optical Music Recognition: Overview and Practical Challenges. In Annual International Workshop on DAtabases, TExts, Specifications and Objects. 65–76.
[112] Organum. 2016. PlayScore. http://www.playscore.co/
[113] Rafael Ornes. 1998. Choral Public Domain Library. http://cpdl.org
[114] Tuula Pääkkönen, Jukka Kervinen, and Kimmo Kettunen. 2018. Digitisation and Digital Library Presentation System – Sheet Music to the Mix.
In 1st International Workshop on Reading Music Systems. Paris, France, 21–22.
[115] Alexander Pacha. 2018. Advancing OMR as a Community: Best Practices for Reproducible Research. In 1st International Workshop on Reading Music Systems. Paris, France, 19–20.
[116] Alexander Pacha and Jorge Calvo-Zaragoza. 2018. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In 19th International Society for Music Information Retrieval Conference. Paris, France, 240–247.
[117] Alexander Pacha, Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg, Richard Zanibbi, and Horst Eidenberger. 2018. Handwritten Music Object Detection: Open Issues and Baseline Results. In 13th International Workshop on Document Analysis Systems. 163–168.
[118] Alexander Pacha and Horst Eidenberger. 2017. Towards a Universal Music Symbol Classifier. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 35–36.
[119] Alexander Pacha, Jan Hajič jr., and Jorge Calvo-Zaragoza. 2018. A Baseline for General Music Object Detection with Deep Learning. Applied Sciences 8, 9 (2018), 1488–1508.
[120] Victor Padilla, Alan Marsden, Alex McLean, and Kia Ng. 2014. Improving OMR for Digital Music Libraries with Multiple Recognisers and Multiple Sources. In 1st International Workshop on Digital Libraries for Musicology. London, United Kingdom, 1–8.
[121] Emilia Parada-Cabaleiro, Anton Batliner, Alice Baird, and Björn Schuller. 2017. The SEILS Dataset: Symbolically Encoded Scores in Modern-Early Notation for Computational Musicology.
In 18th International Society for Music Information Retrieval Conference. Suzhou, China.
[122] Viet-Khoi Pham, Hai-Dang Nguyen, and Minh-Triet Tran. 2015. Virtual Music Teacher for New Music Learners with Optical Music Recognition. In International Conference on Learning and Collaboration Technologies. 415–426.
[123] David S. Prerau. 1971. Computer pattern recognition of printed music. In Fall Joint Computer Conference. 153–162.
[124] Gérard Presgurvic. 2005. Songbook Romeo & Julia. https://www.musicalvienna.at/de/souvenirs/12/ANDERE-MUSICALS/10/Songbook-Romeo-und-Julia
[125] Project Petrucci LLC. 2006. International Music Score Library Project. http://imslp.org/
[126] Denis Pruslin. 1966. Automatic Recognition of Sheet Music. Ph.D. Dissertation. Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
[127] Laurent Pugin. 2006. Optical Music Recognition of Early Typographic Prints using Hidden Markov Models. In 7th International Conference on Music Information Retrieval. Victoria, Canada, 53–56.
[128] Laurent Pugin, John Ashley Burgoyne, and Ichiro Fujinaga. 2007. Reducing Costs for Digitising Early Music with Dynamic Adaptation. In Research and Advanced Technology for Digital Libraries. Berlin, Heidelberg, 471–474.
[129] Laurent Pugin and Tim Crawford. 2013. Evaluating OMR on the Early Music Online Collection. In 14th International Society for Music Information Retrieval Conference. Curitiba, Brazil, 439–444.
[130] Gene Ragan. 2017. KompApp. http://kompapp.com/
[131] Sheikh Faisal Rashid, Abdullah Akmal, Muhammad Adnan, Ali Adnan Aslam, and Andreas Dengel. 2017. Table Recognition in Heterogeneous Documents Using Machine Learning. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). 777–782.
[132] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, Andre R.S. Marcal, Carlos Guedes, and Jamie dos Santos Cardoso. 2012. Optical music recognition: state-of-the-art and open issues.
International Journal of Multimedia Information Retrieval 1, 3 (2012), 173–190.
[133] Pau Riba, Alicia Fornés, and Josep Lladós. 2017. Towards the Alignment of Handwritten Music Scores. In Graphic Recognition. Current Trends and Challenges (Lecture Notes in Computer Science). 103–116.
[134] Adrià Rico Blanes and Alicia Fornés Bisquerra. 2017. Camera-Based Optical Music Recognition Using a Convolutional Neural Network. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 27–28.
[135] David Rizo, Jorge Calvo-Zaragoza, and José M. Iñesta. 2018. MuRET: A Music Recognition, Encoding, and Transcription Tool. In 5th International Conference on Digital Libraries for Musicology. Paris, France, 52–56.
[136] Heinz Roggenkemper and Ryan Roggenkemper. 2018. How can Machine Learning make Optical Music Recognition more relevant for practicing musicians? In 1st International Workshop on Reading Music Systems. Paris, France, 25–26.
[137] Perry Roland. 2002. The music encoding initiative (MEI). In 1st International Conference on Musical Applications Using XML. 55–59.
[138] Florence Rossant and Isabelle Bloch. 2004. A fuzzy model for optical recognition of musical scores. Fuzzy Sets and Systems 141, 2 (2004), 165–201.
[139] Partha Pratim Roy, Ayan Kumar Bhunia, and Umapada Pal. 2017. HMM-based writer identification in music score documents without staff-line removal. Expert Systems with Applications 89 (2017), 222–240.
[140] Sächsische Landesbibliothek. 2007. Staats- und Universitätsbibliothek Dresden. https://www.slub-dresden.de
[141] Charalampos Saitis, Andrew Hankinson, and Ichiro Fujinaga. 2014. Correcting Large-Scale OMR Data with Crowdsourcing. In 1st International Workshop on Digital Libraries for Musicology. 1–3.
[142] Zeyad Saleh, Ke Zhang, Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. Pixel.js: Web-Based Pixel Classification Correction Platform for Ground Truth Creation.
In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 39–40.
[143] Eleanor Selfridge-Field. 1997. Beyond MIDI: The Handbook of Musical Codes. MIT Press, Cambridge, MA, USA.
[144] Asif Shahab, Faisal Shafait, Thomas Kieninger, and Andreas Dengel. 2010. An Open Approach Towards the Benchmarking of Table Structure Recognition Systems. In 9th International Workshop on Document Analysis Systems. Boston, Massachusetts, USA, 113–120.
[145] Muhammad Sharif, Quratul-Ain Arshad, Mudassar Raza, and Wazir Zada Khan. 2009. [COMSCAN]: An Optical Music Recognition System. In 7th International Conference on Frontiers of Information Technology. 34.
[146] Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (2017), 2298–2304.
[147] Mahmood Sotoodeh, Farshad Tajeripour, Sadegh Teimori, and Kirk Jorgensen. 2017. A music symbols recognition method using pattern matching along with integrated projection and morphological operation techniques. Multimedia Tools and Applications (2017).
[148] Daniel Spreadbury and Robert Piéchaud. 2015. Standard Music Font Layout (SMuFL). In First International Conference on Technologies for Music Notation and Representation - TENOR2015. Paris, France, 146–153.
[149] StaffPad Ltd. 2017. StaffPad. http://www.staffpad.net (Last visited 16.04.2019).
[150] Gabriel Taubman. 2005.
MusicHand: A Handwritten Music Recognition System. Technical Report. Brown University.
[151] Jessica Thompson, Andrew Hankinson, and Ichiro Fujinaga. 2011. Searching the Liber Usualis: Using CouchDB and ElasticSearch to Query Graphical Music Documents. In 12th International Society for Music Information Retrieval Conference.
[152] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, Marcello Pelillo, and Thilo Stadelmann. 2018. DeepScores - A Dataset for Segmentation, Detection and Classification of Tiny Objects. In 24th International Conference on Pattern Recognition. Beijing, China.
[153] Julián Urbano. 2013. MIREX 2013 Symbolic Melodic Similarity: A Geometric Model supported with Hybrid Sequence Alignment. Technical Report. Music Information Retrieval Evaluation eXchange.
[154] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. 2010. MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Representations. Technical Report. Music Information Retrieval Evaluation eXchange.
[155] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. 2011. MIREX 2011 Symbolic Melodic Similarity: Sequence Alignment with Geometric Representations. Technical Report. Music Information Retrieval Evaluation eXchange.
[156] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. 2012. MIREX 2012 Symbolic Melodic Similarity: Hybrid Sequence Alignment with Geometric Representations. Technical Report. Music Information Retrieval Evaluation eXchange.
[157] Eelco van der Wel and Karen Ullrich. 2017. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. In 18th International Society for Music Information Retrieval Conference. Suzhou, China.
[158] Gabriel Vigliensoni, John Ashley Burgoyne, Andrew Hankinson, and Ichiro Fujinaga. 2011. Automatic Pitch Detection in Printed Square Notation. In 12th International Society for Music Information Retrieval Conference. Miami, Florida, 423–428.
[159] Gabriel Vigliensoni, Jorge Calvo-Zaragoza, and Ichiro Fujinaga. 2018. Developing an environment for teaching computers to read music. In 1st International Workshop on Reading Music Systems. Paris, France, 27–28.
[160] Quang Nhat Vo, Guee Sang Lee, Soo Hyung Kim, and Hyung Jeong Yang. 2017. Recognition of Music Scores with Non-Linear Distortions in Mobile Devices. Multimedia Tools and Applications (2017).
[161] Matthias Wallner. 2014. A System for Optical Music Recognition and Audio Synthesis. Master’s thesis. TU Wien.
[162] Gus G. Xia and Roger B. Dannenberg. 2017. Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment. In New Interfaces for Musical Expression. Aalborg University Copenhagen, Denmark.
[163] Jianshu Zhang, Jun Du, Shiliang Zhang, Dan Liu, Yulong Hu, Jinshui Hu, Si Wei, and Lirong Dai. 2017. Watch, Attend and Parse: An End-to-end Neural Network Based Approach to Handwritten Mathematical Expression Recognition. Pattern Recognition (2017).

APPENDIX A: OMR BIBLIOGRAPHY

Along with this paper, we are also publishing the most comprehensive bibliography on OMR that we were able to compile, at https://omr-research.github.io/. It is a curated list of verified publications in an open-source GitHub repository (https://github.com/OMR-Research/omr-research.github.io) that is open for submissions both via pull requests and via templated issues. The website is automatically generated from the underlying BibTeX files using bibtex2html, available at https://www.lri.fr/~filliatr/bibtex2html/.
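As a rough illustration of how such a generation step might be scripted, the following Python sketch assembles one bibtex2html command line per BibTeX file. The file names, the page names, and the choice of options (`-d`, `-r`, `-o`) are assumptions made for this example and do not reflect the repository's actual build configuration; the script only builds and prints the commands and executes nothing.

```python
from pathlib import PurePosixPath

# Hypothetical page-to-file mapping; the repository's real file names may differ.
BIB_FILES = {
    "omr-research": "OMR-Research.bib",
    "omr-related": "OMR-Related-Research.bib",
    "omr-unverified": "OMR-Research-Unverified.bib",
}


def build_commands(bib_dir, out_dir):
    """Return one bibtex2html argument list per BibTeX file (nothing is executed)."""
    commands = []
    for page, bib_name in BIB_FILES.items():
        commands.append([
            "bibtex2html",
            "-d",  # sort entries by date
            "-r",  # reverse the sort order
            "-o", str(PurePosixPath(out_dir) / page),  # output file name
            str(PurePosixPath(bib_dir) / bib_name),  # input BibTeX file
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be inspected before running them.
    for command in build_commands("bibliography", "site"):
        print(" ".join(command))
```

On a machine where bibtex2html is installed, each argument list could be passed to `subprocess.run` to render the corresponding page.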
The repository contains three distinct bibliographic files that are rendered into separate pages:
(1) OMR Research Bibliography: A collection of scientific and technical publications that were manually verified against a trustworthy source (see below). Most of these entries have either a Digital Object Identifier (DOI) or a link to the website where the publication can be found.
(2) OMR Related Bibliography: A collection of scientific and technical publications that were manually verified against a trustworthy source but are not primarily directed towards OMR, such as musicological research or general computer vision papers.
(3) Unverified OMR Bibliography: A collection of scientific and technical publications that are related to Optical Music Recognition but could not be verified against a trustworthy source and might contain incorrect information. Many publications in this collection were authored before 1990 and are often not indexed by search engines, or the respective proceedings could no longer be accessed and verified by us.

Acquisition and Verification Process

The bibliography was compiled by merging multiple sources, such as the public and private collections that several researchers have built up over time, including a recent one by Andrew Hankinson, who provided us with an extensive BibTeX library. Additionally, we have a Google Scholar Alert on [174], as it currently represents the latest survey and is cited by almost every publication.
To verify the information of each entry in the bibliography, we proceeded with the following steps:
(1) Search on Google Scholar for the title of the work, if necessary together with the authors' last names and the year of publication.
(2) Find a trustworthy source such as the original publisher, the authors’ website, the website of the venue (that lists the article in the program) or indexing services including IEEE Xplore Digital Library, ACM Digital Library, Springer Link, Elsevier ScienceDirect, arXiv.org, dblp.org or ResearchGate. Information from the last three services is used with caution and, if possible, backed up with information from other sources.
(3) Manually verify the correctness of the metadata by inspecting it and, where necessary, correct it by obtaining the necessary information from another source, e.g., the conference website or the information stated in the document. Suspicious information includes an author’s name that is missing letters because of special characters, or a year of publication that precedes that of the cited references.
Once we have verified an entry, we add it to the respective bibliography with JabRef (http://www.jabref.org/) and link the original PDF file or at least the DOI. Articles that were only found as a PDF without the associated venue of publication were classified as technical reports. Bachelor theses and online sources such as websites of commercial applications were classified as ’Misc’ because BibTeX lacks an appropriate category.

APPENDIX B: LIST OF OMR DEFINITIONS AND DESCRIPTIONS FROM PUBLISHED WORKS

To demonstrate the variety of ways in which OMR has been referred to in the literature, we collected a list of definitions and descriptions (alphabetically ordered by the first author name).
While most of these are direct quotations (we omitted quotation marks for better readability), some were shortened or slightly rephrased to unify their structure and make them comparable.
Optical Music Recognition has been defined or described as:
• technology which transforms sheet music or printed scores into a machine readable format [1]
• automatic recognition and classification of symbolic music notation [2]
• system that aims to minimise human involvement in music input. The musical score is scanned to a bitmap image, and the computer attempts to parse the bitmap [3]
• form of structured document analysis where symbols overlaid on the conventional five-line stave are isolated and identified so that the music can be played through a MIDI system, or edited in a music publishing system [5]
• identifying musical symbols on a scanned sheet of music, and interpreting them so that the music can either be played by the computer, or put into a music editor [6]
• system to convert optically scanned pages of music into a versatile machine-readable format [4]
• system that aims at converting optically scanned pages of music into a versatile machine-readable format [9]
• system that aims at converting the vast repositories of sheet music in the world into an on-line digital format [11]
• computer system that can ’read’ printed music [7]
• system that can be used to convert music scanned from paper into a format suitable for playing or editing on a computer [8]
• technique that makes it possible to automatically build indexes on the actual content of sheet music [10]
• process to automatically extract symbolic note information from scanned pages [12]
• system to convert sheet music images to symbolic music representations [13]
• the recognition of music scores [15]
• field devoted to transcribe sheet music into some machine-readable format [14]
• the process to convert a music score image into a machine-readable format [16]
• task of transcribing a music score into a
machine readable format [17]
• task of recognizing and interpreting printed music and its transformation into MIDI [19]
• research directed towards the recognition of printed scores as well as handwritten music notation [18] (Actually referred to as Optical Music Reading)
• systems for music score recognition [20]
• software that recognises music notation and produces a symbolic representation of music [21]
• key problem for coding western music sheets in the digital world [22]
• system that aims at saving time in converting hardcopy of the music score into an electronic version [23]
• task devoted to convert an image of a music score into a machine-readable format, such as MIDI, MEI or MusicXML [179]
• systems that consist of three main steps, namely image pre-processing, symbol recognition and musical reconstruction [24]
• software unit called Computerized Note Recognition, whose function is to interpret and recognize handwritten musical notes [26]
• musical analog to optical character recognition [27]
• musical analogue to optical character recognition [29]
• converting images of musical scores into faithful symbolic representations of the same score [28]
• electronic conversion of scanned or photographed images of handwritten or printed sheet music into symbolic and therefore editable form [30]
• automatic score transcription tool [36]
• task of automatically extracting the musical information from an image of a score in order to export it to some digital format [32]
• offline music score recognition systems [37]
• field (of computer science) devoted to providing computers (with) the ability to extract the musical content of a score from the optical scanning of its source [40, 42, 44, 46]
• ability of a computer to understand the musical information contained in the image of a music score [35]
• field of computer science devoted to understanding the musical information contained in the image of a music score [38]
• research field that consists in [sic] extracting the musical content of a given score image in a structured, symbolic format [41]
• field devoted to the automatic transcription of sheet music into some machine-readable format [43]
• branch of artificial intelligence, focused on automatically recognizing the content of a musical score from the optical scan of its source [45]
• systems to import a scanned version of the music sheet and try to automatically export the information into some type of structured format such as MusicXML, MIDI or MEI [34]
• systems, whose objective is to automatically extract the information contained in the image of a musical score [48]
• system to automatic transcription of musical documents into a structured digital format [47]
• field of research that investigates how to computationally decode music notation from images [39]
• research
field that focuses on the automatic detection and encoding of musical content from scanned images [33]
• research field that investigates how to make computers be capable of reading music [31]
• technology for automatically transcribing musical documents [25]
• digitization of music works [49, 50]
• computational process that reads musical notation from images, with the aim of automatically exporting the content to a structured format [52]
• technique that converts (or interprets) printed musical documents into computer readable/editable formats [60]
• automatic processing and analysis of images of musical notation [53]
• musical cousin of Optical Character Recognition, (which) seeks to convert score images into symbolic (music) representations [54, 59]
• system to transform score images into symbolic music libraries [57]
• key technology in Music Information Retrieval by mining symbolic knowledge directly from images of scores [56]
• seeking to convert music score images into symbolic representations [55, 58]
• software to convert scanned sheets of music into computer readable formats [62]
• software to generate a logical representation of the score [61]
• software to transform an image of a score into symbolic format [63]
• field of document analysis [64–66]
• automatic transcription of scores [67]
• process similar to the well-known optical character recognition to extract score data such as note events, the key and time signatures and other musical symbols [68]
• process of recognising a printed music score and converting it to a format that is understood by computers [69]
• program to automatically recognize music scores (translated from German “Ein Programm zur automatischen Erkennung von Musiknoten”) [70]
• the study of automatic techniques in information engineering, which can be used to determine the musical style of the singer [71]
• field to recognize and play live the notes in images captured from sheet music [72]
• process of automatically (re-)setting the score to create a symbolic, computer-readable representation of sheet music, such as MusicXML or MIDI [73, 74]
• technology that promises to accelerate the process of entering music scores in a machine-readable format by automatically interpreting the musical content from (the digitized image of) the printed score [75, 76]
• transformation of digital music score images to computer readable format symbols [77]
• automatic recognition of a scanned page of printed music [78, 79]
• research area that consists in [sic] the identification of music information from images of scores and their conversion into a machine readable format [80]
• process of identifying music information from images of scores and converting them into machine legible format [84]
• classical area of interest of Document Image Analysis and Recognition that combines textual and graphical information [86]
• classical application area of interest, whose aim is the identification of music information from images of scores and their conversion into a machine
readable format [85]
• research field that consists in [sic] the understanding of information from music scores and its conversion into a machine readable format [87], [82]
• recognition of handwritten music scores [81], [83]
• automatic recognition of music notation by the computer [88]
• task of converting scanned sheet music into a computer readable symbolic music format such as MIDI or MusicXML [90]
• process of extracting musical note parameters (onset times, pitches, durations) along with 2D position parameters from the scanned image [89]
• task of converting scores into a machine-readable format [91]
• program for recognition of musical notation [92]
• technology which transforms digital images of music into searchable representations of music notation [93, 94]
• process of automatically transcribing music notation from a digital image [95]
• research field, which focuses on detecting and storing the musical content of a score from a scanned image. The objective is to import a scanned musical score and export its musical content to a machine-readable format, typically MusicXML or MEI [96]
• technique to transform paper musical scores into musical acoustic, and it is a basic way to apply to digital medium music data, large digital music library, robot reading musical score and perform, computer music education, Chinese tradition music digitalization [sic] [97]
• technique to convert scanned pages of music into a machine-readable format [98]
• problem of recognising and interpreting the symbols of printed music notation from a scanned image [100]
• systems designed to perform recognition of music notation, chiefly from a scanned image of music notation [99]
• process that aims to “recognize” images of music notation and capture the “meaning” of the music [101]
• system for recognizing music notation [102]
• systems, designed to recognise printed sheet music scores [103]
• branch of OCR oriented to musical documents [104]
• field of document analysis that aims to automatically read musical scores [110]
• process that attempts to extract musical information from its written representation, the musical score [108]
• task of recovering symbolic musical information such as MIDI from the image of the written score [105]
• field of graphics recognition that aims to automatically read music [109]
• field of document analysis that aims to automatically read music [111]
• field of computationally reading music notation in documents [107]
• field of automatically reading music notation from images [106]
• tool for document transcription that tries to extract symbolic music from page images for use in an editor [113]
• technology that can transform large quantities of music document page images into searchable and retrievable document entities [112]
• field of research that attempts to transcribe musical symbols into digital format [114]
• process of structured data processing applied to music notation [115]
• research and technological field aimed on recognizing and representing music notation [117]
• technology to automatically recognize music notation [116]
• technique for processing music notes in old manuscripts and books [118]
• form of optical character recognition that use different method and algorithms to convert printed music into its digital form [sic] [119]
• direct path to create rich and extensive symbolic databases for music in machine-generated common Western notation [121]
• process that
automatically converts the image of a music score into symbolic data [120]
• systems that convert music scores into a computer-readable format, similar to Optical Character Recognition (OCR) except that it is used to recognize musical symbols instead of letters [122]
• OCR for music [123]
• system that can play printed or handwritten music score images without any knowledge of music primitives or musical instruments [124]
• system to transform a sheet music into a format readable by a machine [125]
• case of optical character recognition for the automatic recognition and classification of music notation [126]
• system that can convert digital image data into digital semantic data [127]
• system that addresses the problem of musical data acquisition, with the aim of converting optically scanned music scores into a versatile machine-readable format [128, 129]
• subcategory of optical character recognition that recognizes an image of printed sheet music and interprets it to a machine-readable document [130]
• technology that is a rewarding subject for pattern recognition researches [131]
• system for music score recognition [132]
• technology that makes it possible to extract symbolic representations from scores or microfilms of scores [133]
• technique that involves interpreting the symbols in a picture, such as a scanned image of sheet music, and recreating the information in a format that encapsulates the implied audio content [134]
• process of converting a graphical representation of music (such as sheet music) into a symbolic format [135]
• process of automatically extracting musical meaning from a printed music score. It is sometimes also called musical score recognition or simply score recognition [137]
• process of automatically processing and understanding an image of a music score [136]
• process that recognized music from any form of score sheet and makes sheet readable and editable for computer [sic] [138]
• automatic conversion of scanned music scores into computer readable data in variable formats, e.g., MusicXML, or MEI [139]
• technique that achieves the automatic recognition of music notation with high-speed and further plays music automatically, which is an important topics (sic!)
in the process [140]
• process to convert handwritten music symbols on sheets of paper into computer readable data [141]
• systems that analyze and convert digitized music scores to machine readable formats [142]
• process of automatically recovering the information present on music scores based on scanned data [144]
• input technique to obtain a machine representation of music [147]
• efficient and automatic method to transform paper-based music scores into a machine representation [148]
• system that can provide an automated and time-saving input method to transform paper-based music scores into a machine readable representation, for a wide range of music software, in the same way as Optical Character Recognition is useful for the processing applications [145]
• system to transform paper-based music scores and manuscripts into a machine-readable symbolic format [146]
• equivalent task for music, that is OCR for digital images of words [149]
• system that can automatically interpret the images and automatically create new scores that can be understood by the computer [150]
• discipline that investigates music score recognition systems [151]
• area of document analysis that aims to automatically understand written music scores. Given an image of musical scores, an OMR system attempts to recognize the content and translate it into a machine-readable format such as MusicXML [155]
• branch of artificial intelligence that aims at automatically recognizing and understanding the content of music scores [156]
• challenge of understanding the content of musical scores [154]
• research field that investigates how to automatically decode written music into a machine-readable format [152]
• field of research that investigates how to build systems that decode music notation from images [153]
• field of research that investigates how to computationally read music notation in documents [157]
• task of recognizing all music symbols in a score sheet [158]
• system to convert music scores into a machine-readable data that could be reproduced in computer and stored as compact digitalised data [sic] [159]
• process of identifying music from an image of a music score [160]
• system to transform paper-based music scores and manuscripts into a machine-readable symbolic format [161, 162]
• tools for the creation of searchable digital music libraries [163]
• systems that create encodings of the musical content in digital images automatically [164]
• musical analogue to optical character recognition (OCR) [165]
• applications that enable document images to be encoded into digital symbolic music representations [167]
• the equivalent of OCR for music [166]
• pathway to a large set of symbolic scores [168]
• analogous to optical character recognition to convert music score images into symbolic form [169]
• form of structured document image analysis where music symbols are isolated and identified so that the music can be conveniently processed [172]
• system to transform paper-based music scores and manuscripts into a machine-readable (symbolic) format [51, 170, 173]
• system with three main objectives: the recognition, the representation and the storage of musical scores in a machine-readable format [177]
• tool for the automatic recognition of digitized music scores [176]
• computer system that can automatically decode and create new scores [174]
• research field, that deals with the recognition, the representation and the storage of musical scores in a machine-readable format [171]
• tool to transform pen-based music scores and manuscripts into a
machine-readable symbolic format [175]
• system capable of recognizing printed music of reasonable quality [178]
• task of recognizing images of musical scores [180]
• recognition of images of musical scores [181]
• key tools for publication of music score collections that are currently found only on paper [182]
• system that can automatically recognize the main musical symbols of a scanned paper-based music score [183]
• field of research that aims at reading automatically scanned scores in order to convert them in an electronic format, such as a midi file [184]
• method that aims at automatically reading scanned scores [185]
• method that aims at automatically reading scanned scores in order to convert them into an electronic format, such as MIDI file, or an audio waveform [186]
• automatic recognition of a scanned page of printed music notation by a computer program [187]
• translation of a digitized image of a music score into a representation more amenable to computer manipulation of the musical content [188]
• systems that analyse images of music scores to convert their content to machine readable formats [143]
• problem of obtaining a complete representation of a musical document given only a digital image [189]
• problem of recognizing musical scores in images [190]
• application of optical character recognition to interpret sheet music or printed scores into editable or playable form [191]
• systems that play a very important role in the process of creating the digital libraries of musical documents [192]
• tools for automatic sheet music transcription [193]
• system for extracting musical symbols from images similar to the Optical Character Recognition [194]
• method that involves identifying musical symbols on a scanned sheet of music and transforming them into a computer readable format [195]
• process of converting digitized sheets of music into an electronic form that is suitable for further processing such as editing and performing by computer [196]
• efficient and automatic method for transforming paper-based music scores into a machine representation [197]
• algorithm for processing images of musical scores [198]
• work for automatically recognizing music expressions for printed and handwritten music [199]
• program to convert scanned score into an electronic format and even recognize and understand the contents of the score [200]
• application to automatically transcribe digitized page images of music [202]
• automatic recognition of a scanned music score [203]
• system to input music by detecting musical symbols, based on strokes drawn by the user [201]
• automatic recognition of scanned music scores [203]
• area of document recognition and computer vision that aims at converting scans of written music to machine-readable form, much like optical character recognition [204]
• area within music information retrieval with the goal of transforming images of printed or handwritten music scores into machine readable form, thereby understanding the semantic meaning of music notation [205]
• process of identifying music from an image of a music score [207]
• process of turning musical notation represented in a digital image into a computer-manipulable
symbolic notation format [208]
• process of converting a scanned image of pages of music into computer readable and manipulable symbols using a variety of image processing techniques [209]
• process that reads and extracts the content from digitized images of music documents [210]
• a computer system for automatically storing and interpreting musical information (of music scores) [212]
• system that can automatically interpret images of music scores and create new scores that the computer could understand [211]
• particular case of high-level document analysis [214, 215]
• task of interpreting the content of the bitmap image of a musical score and reformulating it with a high-level symbolic structure [213]
• way to convert music notation into a digital representation, and its acoustic rendition [216]
• systems whose main purpose is to convert images of paper-based music scores into digitised formats [217]
• application of recognition algorithms to musical scores, to encode the musical content to some kind of digital format [206]
• tool to recognize a scanned page of music scores automatically [218, 219]
• conversion of scanned pages of music into a musical database [220]
• process of a computer reading sheet music [221]
• process of converting paper sheets of music score into an electronic format which can be “read” by computer [222]
• tool that takes a score that is likely to be correct, scans it and tries to recreate what it scans in a digital notation format [223]

REFERENCES
[1] Sanu Pulimootil Achankunju. 2018. Music Search Engine from Noisy OMR Data.
In 1st International Workshop on Reading Music Systems. Paris, France, 23–24.
[2] Julia Adamska, Mateusz Piecuch, Mateusz Podgórski, Piotr Walkiewicz, and Ewa Lukasik. 2015. Mobile System for Optical Music Recognition and Music Sound Generation. In Computer Information Systems and Industrial Management. Cham, 571–582.
[3] Jamie Anstice, Tim Bell, Andy Cockburn, and Martin Setchell. 1996. The design of a pen-based musical input system. In 6th Australian Conference on Computer-Human Interaction. 260–267.
[4] David Bainbridge. 1997. Extensible optical music recognition. Ph.D. Dissertation. University of Canterbury.
[5] David Bainbridge and Tim Bell. 1996. An extensible optical music recognition system. Australian Computer Science Communications 18 (1996), 308–317.
[6] David Bainbridge and Tim Bell. 1997. Dealing with Superimposed Objects in Optical Music Recognition. In 6th International Conference on Image Processing and its Applications. 756–760.
[7] David Bainbridge and Tim Bell. 2001. The Challenge of Optical Music Recognition. Computers and the Humanities 35, 2 (2001), 95–121.
[8] David Bainbridge and Tim Bell. 2003. A music notation construction engine for optical music recognition. Software: Practice and Experience 33, 2 (2003), 173–200.
[9] David Bainbridge and Nicholas Paul Carter. 1997. Automatic reading of music notation. In Handbook of Character Recognition and Document Image Analysis. World Scientific, Singapore, 583–603.
[10] David Bainbridge, Xiao Hu, and J. Stephen Downie. 2014. A Musical Progression with Greenstone: How Music Content Analysis and Linked Data is Helping Redefine the Boundaries to a Music Digital Library. In 1st International Workshop on Digital Libraries for Musicology.
[11] David Bainbridge and Stuart Inglis. 1998. Musical image compression. In Data Compression Conference. 209–218.
[12] Stefan Balke, Sanu Pulimootil Achankunju, and Meinard Müller. 2015. Matching Musical Themes Based on Noisy OCR and OMR Input.
In International Conference on Acoustics, Speech and Signal Processing. 703–707.
[13] Stefan Balke, Christian Dittmar, Jakob Abeßer, Klaus Frieler, Martin Pfleiderer, and Meinard Müller. 2018. Bridging the Gap: Enriching YouTube Videos with Jazz Music Annotations. Frontiers in Digital Humanities 5 (2018), 1–11.
[14] Arnau Baró, Pau Riba, Jorge Calvo-Zaragoza, and Alicia Fornés. 2017. Optical Music Recognition by Recurrent Neural Networks. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 25–26.
[15] Arnau Baró, Pau Riba, and Alicia Fornés. 2016. Towards the recognition of compound music notes in handwritten music scores. In 15th International Conference on Frontiers in Handwriting Recognition. 465–470.
[16] Arnau Baró, Pau Riba, and Alicia Fornés. 2018. A Starting Point for Handwritten Music Recognition. In 1st International Workshop on Reading Music Systems. Paris, France, 5–6.
[17] Arnau Baró-Mas. 2017. Optical Music Recognition by Long Short-Term Memory Recurrent Neural Networks. Master’s thesis. Universitat Autònoma de Barcelona.
[18] Stephan Baumann. 1995. A Simplified Attributed Graph Grammar for High-Level Music Recognition. In 3rd International Conference on Document Analysis and Recognition. 1080–1083.
[19] Stephan Baumann and Andreas Dengel. 1992. Transforming Printed Piano Music into MIDI. In Advances in Structural and Syntactic Pattern Recognition. World Scientific, 363–372.
[20] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. 2001. Optical music sheet segmentation. In 1st International Conference on WEB Delivering of Music. 183–190.
[21] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. 2004. An Off-Line Optical Music Sheet Recognition. In Visual Perception of Music Notation: On-Line and Off Line Recognition. IGI Global, 40–77.
[22] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. 2008. Optical Music Recognition: Architecture and Algorithms. In Interactive Multimedia Music Technologies.
IGI Global, Hershey, PA, USA, 80–110.
[23] Tomáš Beran and Tomáš Macek. 1999. Recognition of Printed Music Score. In Machine Learning and Data Mining in Pattern Recognition. 174–179.
[24] Alexandra Bonnici, Julian Abela, Nicholas Zammit, and George Azzopardi. 2018. Automatic Ornament Localisation, Recognition and Expression from Music Sheets. In ACM Symposium on Document Engineering. Halifax, NS, Canada, 25:1–25:11.
[25] Vicente Bosch Campos, Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal Ruiz. 2016. Sheet Music Statistical Layout Analysis. In 15th International Conference on Frontiers in Handwriting Recognition. 313–318.
[26] Alex Bulis, Roy Almog, Moti Gerner, and Uri Shimony. 1992. Computerized recognition of hand-written musical notes. In International Computer Music Conference. 110–112.
[27] John Ashley Burgoyne, Johanna Devaney, Laurent Pugin, and Ichiro Fujinaga. 2008. Enhanced Bleedthrough Correction for Early Music Documents with Recto-Verso Registration. In 9th International Conference on Music Information Retrieval. Philadelphia, PA, 407–412.
[28] John Ashley Burgoyne, Ichiro Fujinaga, and J. Stephen Downie. 2015. Music Information Retrieval. In A New Companion to Digital Humanities. Wiley Blackwell, 213–228.
[29] John Ashley Burgoyne, Yue Ouyang, Tristan Himmelman, Johanna Devaney, Laurent Pugin, and Ichiro Fujinaga. 2009. Lyric Extraction and Recognition on Digital Images of Early Music Sources. In 10th International Society for Music Information Retrieval Conference. Kobe, Japan, 723–727.
[30] Donald Byrd and Jakob Grue Simonsen. 2015.
Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. Journal of New Music Research 44, 3 (2015), 169–195.
[31] Jorge Calvo-Zaragoza. 2018. Why WoRMS?. In 1st International Workshop on Reading Music Systems. Paris, France, 7–8.
[32] Jorge Calvo-Zaragoza, Isabel Barbancho, Lorenzo J. Tardón, and Ana M. Barbancho. 2015. Avoiding staff removal stage in optical music recognition: application to scores written in white mensural notation. Pattern Analysis and Applications 18, 4 (2015), 933–943.
[33] Jorge Calvo-Zaragoza, Francisco J. Castellanos, Gabriel Vigliensoni, and Ichiro Fujinaga. 2018. Deep Neural Networks for Document Processing of Music Score Images. Applied Sciences 8, 5 (2018).
[34] Jorge Calvo-Zaragoza, Antonio-Javier Gallego, and Antonio Pertusa. 2017. Recognition of Handwritten Music Symbols with Convolutional Neural Codes. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 691–696.
[35] Jorge Calvo-Zaragoza, Luisa Micó, and Jose Oncina. 2016. Music staff removal with supervised pixel classification. International Journal on Document Analysis and Recognition 19, 3 (2016), 211–219.
[36] Jorge Calvo-Zaragoza and Jose Oncina. 2014. Recognition of Pen-Based Music Notation: The HOMUS Dataset. In 22nd International Conference on Pattern Recognition. 3038–3043.
[37] Jorge Calvo-Zaragoza and Jose Oncina. 2015. Clustering of strokes from pen-based music notation: An experimental study. Lecture Notes in Computer Science 9117 (2015), 633–640.
[38] Jorge Calvo-Zaragoza, Antonio Pertusa, and Jose Oncina. 2017. Staff-line detection and removal using a convolutional neural network. Machine Vision and Applications (2017), 1–10.
[39] Jorge Calvo-Zaragoza and David Rizo. 2018. End-to-End Neural Optical Music Recognition of Monophonic Scores. Applied Sciences 8, 4 (2018).
[40] Jorge Calvo-Zaragoza, David Rizo, and José Manuel Iñesta. 2016.
Two (note) heads are better than one: pen-based multimodal interaction with music scores. In 17th International Society for Music Information Retrieval Conference. New York City, 509–514.
[41] Jorge Calvo-Zaragoza, Alejandro Toselli, and Enrique Vidal. 2017. Handwritten Music Recognition for Mensural Notation: Formulation, Data and Baseline Results. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 1081–1086.
[42] Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal. 2017. Early handwritten music recognition with Hidden Markov Models. In 15th International Conference on Frontiers in Handwriting Recognition. 319–324.
[43] Jorge Calvo-Zaragoza, Jose J. Valero-Mas, and Antonio Pertusa. 2017. End-to-end Optical Music Recognition using Neural Networks. In 18th International Society for Music Information Retrieval Conference. Suzhou, China.
[44] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2016. Document Analysis for Music Scores via Machine Learning. In 3rd International workshop on Digital Libraries for Musicology. New York, USA, 37–40.
[45] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. A machine learning framework for the categorization of elements in images of musical documents. In 3rd International Conference on Technologies for Music Notation and Representation. A Coruña, Spain.
[46] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. One-step detection of background, staff lines, and symbols in medieval music manuscripts with convolutional neural networks. In 18th International Society for Music Information Retrieval Conference. Suzhou, China.
[47] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017. Pixel-wise binarization of musical documents with convolutional neural networks. In 15th International Conference on Machine Vision Applications. 362–365.
[48] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2017.
Pixelwise classification for music document analysis. In 7th International Conference on Image Processing Theory, Tools and Applications. 1–6.
[49] Artur Capela, Jamie dos Santos Cardoso, Ana Rebelo, and Carlos Guedes. 2008. Integrated recognition system for music scores. In International Computer Music Conference. 3–6.
[50] Artur Capela, Ana Rebelo, Jamie dos Santos Cardoso, and Carlos Guedes. 2008. Staff Line Detection and Removal with Stable Paths. In International Conference on Signal Processing and Multimedia Applications.
[51] Jamie dos Santos Cardoso, Artur Capela, Ana Rebelo, Carlos Guedes, and Joaquim Pinto da Costa. 2009. Staff Detection with Stable Paths. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 6 (2009), 1134–1139.
[52] Francisco J. Castellanos, Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. 2018. Document Analysis of Music Score Images with Selectional Auto-Encoders. In 19th International Society for Music Information Retrieval Conference. Paris, France, 256–263.
[53] Gen-Fang Chen and Jia-Shing Sheu. 2014. An optical music recognition system for traditional Chinese Kunqu Opera scores written in Gong-Che Notation. EURASIP Journal on Audio, Speech, and Music Processing 2014, 1 (2014), 7.
[54] Liang Chen, Rong Jin, and Christopher Raphael. 2014. Optical Music Recognition with Human Labeled Constraints. In CHI’14 Workshop on Human-Centred Machine Learning. Toronto, Canada.
[55] Liang Chen, Rong Jin, and Christopher Raphael. 2017. Human-Guided Recognition of Music Score Images.
In 4th International Workshop on Digital Libraries for Musicology.
[56] Liang Chen, Rong Jin, Simo Zhang, Stefan Lee, Zhenhua Chen, and David Crandall. 2016. A Hybrid HMM-RNN Model for Optical Music Recognition. In Extended abstracts for the Late-Breaking Demo Session of the 17th International Society for Music Information Retrieval Conference.
[57] Liang Chen and Christopher Raphael. 2016. Human-Directed Optical Music Recognition. Electronic Imaging 2016, 17 (2016), 1–9.
[58] Liang Chen and Christopher Raphael. 2018. Optical Music Recognition and Human-in-the-loop Computation. In 1st International Workshop on Reading Music Systems. Paris, France, 11–12.
[59] Liang Chen, Erik Stolterman, and Christopher Raphael. 2016. Human-Interactive Optical Music Recognition. In 17th International Society for Music Information Retrieval Conference. 647–653.
[60] Yung-Sheng Chen, Feng-Sheng Chen, and Chin-Hung Teng. 2013. An Optical Music Recognition System for Skew or Inverted Musical Scores. International Journal of Pattern Recognition and Artificial Intelligence 27, 07 (2013).
[61] G. Sayeed Choudhury, Tim DiLauro, Michael Droettboom, Ichiro Fujinaga, and Karl MacMillan. 2001. Strike Up the Score: Deriving searchable and playable digital formats from sheet music. D-Lib Magazine 7, 2 (2001).
[62] G. Sayeed Choudhury, M. Droetboom, Tim DiLauro, Ichiro Fujinaga, and Brian Harrington. 2000. Optical Music Recognition System within a Large-Scale Digitization Project. In 1st International Symposium on Music Information Retrieval.
[63] Maura Church and Michael Scott Cuthbert. 2014. Improving Rhythmic Transcriptions via Probability Models Applied Post-OMR. In 15th International Society for Music Information Retrieval Conference. 643–648.
[64] Bertrand Coüasnon, Pascal Brisset, and Igor Stéphan. 1995. Using Logic Programming Languages For Optical Music Recognition. In 3rd International Conference on the Practical Application of Prolog.
[65] Bertrand Coüasnon and Jean Camillerapp. 1994.
Using Grammars to Segment and Recognize Music Scores. In International Association for Pattern Recognition Workshop on Document Analysis Systems. Kaiserslautern, Germany, 15–27.
[66] Bertrand Coüasnon and Jean Camillerapp. 1995. A Way to Separate Knowledge From Program in Structured Document Analysis: Application to Optical Music Recognition. In 3rd International Conference on Document Analysis and Recognition. 1092–1097.
[67] Tim Crawford, Golnaz Badkobeh, and David Lewis. 2018. Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words. In 19th International Society for Music Information Retrieval Conference. Paris, France, 233–239.
[68] David Damm, Christian Fremerey, Frank Kurth, Meinard Müller, and Michael Clausen. 2008. Multimodal Presentation and Browsing of Music. In 10th International Conference on Multimodal Interfaces. Chania, Greece, 205–208.
[69] Arnaud F. Desaedeleer. 2006. Reading Sheet Music. Master’s thesis. University of London.
[70] Jürgen Diet. 2018. Optical Music Recognition in der Bayerischen Staatsbibliothek. BIBLIOTHEK – Forschung und Praxis (2018).
[71] Ing-Jr Ding, Chih-Ta Yen, Che-Wei Chang, and He-Zhong Lin. 2014. Optical music recognition of the singer using formant frequency estimation of vocal fold vibration and lip motion with interpolated GMM classifiers. Journal of Vibroengineering 16, 5 (2014), 2572–2581.
[72] Cong Minh Dinh, Hyung-Jeong Yang, Guee-Sang Lee, and Soo-Hyung Kim. 2016. Fast lyric area extraction from images of printed Korean music scores. IEICE Transactions on Information and Systems E99D, 6 (2016), 1576–1584.
[73] Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. 2016. Towards Score Following In Sheet Music Images. In 17th International Society for Music Information Retrieval Conference. 789–795.
[74] Matthias Dorfer, Florian Henkel, and Gerhard Widmer. 2018. Learning To Listen, Read And Follow: Score Following As A Reinforcement Learning Game. In 19th International Society for Music Information Retrieval Conference. Paris, France, 784–791.
[75] Michael Droettboom and Ichiro Fujinaga. 2001. Interpreting the semantics of music notation using an extensible and object-oriented system. Technical Report. Johns Hopkins University.
[76] Michael Droettboom, Ichiro Fujinaga, and Karl MacMillan. 2002. Optical Music Interpretation. In Structural, Syntactic, and Statistical Pattern Recognition. Berlin, Heidelberg, 378–387.
[77] Yang Fang and Teng Gui-fa. 2015. Visual music score detection with unsupervised feature learning method based on K-means. International Journal of Machine Learning and Cybernetics 6, 2 (2015), 277–287.
[78] Miguel Ferrand, João Alexandre Leite, and Amilcar Cardoso. 1999. Hypothetical reasoning: An application to Optical Music Recognition. In Appia-Gulp-Prode’99 joint conference on declarative programming. 367–381.
[79] Miguel Ferrand, João Alexandre Leite, and Amilcar Cardoso. 1999. Improving Optical Music Recognition by Means of Abductive Constraint Logic Programming. In Progress in Artificial Intelligence. Berlin, Heidelberg, 342–356.
[80] Alicia Fornés. 2005. Analysis of Old Handwritten Musical Scores. Master’s thesis. Universitat Autònoma de Barcelona.
[81] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. 2011. The ICDAR 2011 Music Scores Competition: Staff Removal and Writer Identification. In International Conference on Document Analysis and Recognition. 1511–1515.
[82] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. 2012. CVC-MUSCIMA: A Ground-truth of Handwritten Music Score Images for Writer Identification and Staff Removal. International Journal on Document Analysis and Recognition 15, 3 (2012), 243–251.
[83] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. 2013. The 2012 Music Scores Competitions: Staff Removal and Writer Identification. In Graphics Recognition. New Trends and Challenges. Berlin, Heidelberg, 173–186.
[84] Alicia Fornés, Josep Lladós, and Gemma Sánchez. 2006. Primitive Segmentation in Old Handwritten Music Scores. In Graphics Recognition. Ten Years Review and Future Perspectives. Berlin, Heidelberg, 279–290.
[85] Alicia Fornés, Josep Lladós, and Gemma Sánchez. 2008. Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. In Graphics Recognition. Recent Advances and New Opportunities. Berlin, Heidelberg, 51–60.
[86] Alicia Fornés, Josep Lladós, Gemma Sánchez, and Horst Bunke. 2008. Writer Identification in Old Handwritten Music Scores. In 8th International Workshop on Document Analysis Systems. Nara, Japan, 347–353.
[87] Alicia Fornés, Josep Lladós, Gemma Sánchez, and Horst Bunke. 2009. On the Use of Textural Features for Writer Identification in Old Handwritten Music Scores. 10th International Conference on Document Analysis and Recognition (2009), 996–1000.
[88] Stavroula-Evita Fotinea, George Giakoupis, Aggelos Livens, Stylianos Bakamidis, and George Carayannis. 2000. An Optical Notation Recognition System for Printed Music Based on Template Matching and High Level Reasoning. In RIAO ’00 Content-Based Multimedia Information Access. Paris, France, 1006–1014.
[89] Christian Fremerey, David Damm, Frank Kurth, and Michael Clausen. 2009. Handling Scanned Sheet Music and Audio Recordings in Digital Music Libraries. In International Conference on Acoustics NAG/DAGA. 1–2.
[90] Christian Fremerey, Meinard Müller, Frank Kurth, and Michael Clausen. 2008.
Automatic Mapping of Scanned Sheet Music to Audio Recordings. In 9th International Conference on Music Information Retrieval. 413–418.
[91] Ichiro Fujinaga. 1988. Optical Music Recognition using Projections. Master’s thesis. McGill University.
[92] Ichiro Fujinaga. 1996. Exemplar-based learning in adaptive optical music recognition system. In International Computer Music Conference. Hong Kong, 55–56.
[93] Ichiro Fujinaga and Andrew Hankinson. 2014. SIMSSA: Single Interface for Music Score Searching and Analysis. Journal of the Japanese Society for Sonic Arts 6, 3 (2014), 25–30.
[94] Ichiro Fujinaga, Andrew Hankinson, and Julie E. Cumming. 2014. Introduction to SIMSSA (Single Interface for Music Score Searching and Analysis). In 1st International Workshop on Digital Libraries for Musicology. 1–3.
[95] Ichiro Fujinaga, Andrew Hankinson, and Laurent Pugin. 2018. Automatic Score Extraction with Optical Music Recognition (OMR). In Springer Handbook of Systematic Musicology. Springer Berlin Heidelberg, Berlin, Heidelberg, 299–311.
[96] Antonio-Javier Gallego and Jorge Calvo-Zaragoza. 2017. Staff-line removal with selectional auto-encoders. Expert Systems with Applications 89 (2017), 138–148.
[97] Chen Genfang, Zhang Wenjun, and Wang Qiuqiu. 2009. Pick-up the Musical Information from Digital Musical Score Based on Mathematical Morphology and Music Notation. In 1st International Workshop on Education Technology and Computer Science. 1141–1144.
[98] Susan E. George. 2003. Online Pen-Based Recognition of Music Notation with Artificial Neural Networks. Computer Music Journal 27, 2 (2003), 70–79.
[99] Susan E. George. 2004. Evaluation in the Visual Perception of Music Notation. In Visual Perception of Music Notation: On-Line and Off Line Recognition. IRM Press, Hershey, PA, 304–349.
[100] Susan E. George. 2004. Visual Perception of Music Notation On-Line and Off-Line Recognition. IRM Press.
[101] Susan E. George. 2004. Wavelets for Dealing with Super-Imposed Objects in Recognition of Music Notation. In Visual Perception of Music Notation: On-Line and Off Line Recognition. IRM Press, Hershey, PA, 78–107.
[102] Velissarios G. Gezerlis and Sergios Theodoridis. 2002. Optical character recognition of the Orthodox Hellenic Byzantine Music notation. Pattern Recognition 35, 4 (2002), 895–914.
[103] Roland Göcke. 2003. Building a system for writer identification on handwritten music scores. In IASTED International Conference on Signal Processing, Pattern Recognition, and Applications. 250–255.
[104] Gianmarco Gozzi. 2010. OMRJX: A framework for piano scores optical music recognition. Master’s thesis. Politecnico di Milano.
[105] Jan Hajič jr. and Matthias Dorfer. 2017. Prototyping Full-Pipeline Optical Music Recognition with MUSCIMARKER. In Extended abstracts for the Late-Breaking Demo Session of the 18th International Society for Music Information Retrieval Conference. Suzhou, China.
[106] Jan Hajič jr., Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. 2018. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In 19th International Society for Music Information Retrieval Conference. Paris, France, 225–232.
[107] Jan Hajič jr., Marta Kolárová, Alexander Pacha, and Jorge Calvo-Zaragoza. 2018. How Current Optical Music Recognition Systems Are Becoming Useful for Digital Libraries. In 5th International Conference on Digital Libraries for Musicology. Paris, France, 57–61.
[108] Jan Hajič jr. and Pavel Pecina. 2017.
Detecting Noteheads in Handwritten Scores with ConvNets and Bounding Box Regression. Computing Research Repository abs/1708.01806 (2017).
[109] Jan Hajič jr. and Pavel Pecina. 2017. Groundtruthing (Not Only) Music Notation with MUSCIMarker: A Practical Overview. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 47–48.
[110] Jan Hajič jr. and Pavel Pecina. 2017. In Search of a Dataset for Handwritten Optical Music Recognition: Introducing MUSCIMA++. Computing Research Repository abs/1703.04824 (2017), 1–16.
[111] Jan Hajič jr. and Pavel Pecina. 2017. The MUSCIMA++ Dataset for Handwritten Optical Music Recognition. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 39–46.
[112] Andrew Hankinson. 2014. Optical music recognition infrastructure for large-scale music document analysis. Ph.D. Dissertation. McGill University.
[113] Andrew Hankinson, John Ashley Burgoyne, Gabriel Vigliensoni, Alastair Porter, Jessica Thompson, Wendy Liu, Remi Chiu, and Ichiro Fujinaga. 2012. Digital Document Image Retrieval Using Optical Music Recognition. In 13th International Society for Music Information Retrieval Conference. 577–582.
[114] Ali Hemmatifar and Ashish Krishna. 2018. DeepPiano: A Deep Learning Approach to Translate Music Notation to English Alphabet. Technical Report. Stanford University.
[115] Władysław Homenda. 2001. Optical Music Recognition: the Case of Granular Computing. In Granular Computing: An Emerging Paradigm. Physica-Verlag HD, Heidelberg, 341–366.
[116] Władysław Homenda. 2006. Automatic understanding of images: integrated syntactic and semantic analysis of music notation. In International Joint Conference on Neural Network. Vancouver, Canada, 3026–3033.
[117] Władysław Homenda and Marcin Luckner. 2004. Automatic Recognition of Music Notation Using Neural Networks. In International Conference on AI and Systems. Divnormorkoye, Russia.
[118] Yu-Hui Huang, Xuanli Chen, Serafina Beck, David Burn, and Luc Van Gool. 2015. Automatic Handwritten Mensural Notation Interpreter: From Manuscript to MIDI Performance. In 16th International Society for Music Information Retrieval Conference. Málaga, Spain, 79–85.
[119] Krzysztof Jastrzębski. 2014. OMR for sheet music digitization. Master’s thesis. Politechnika Wrocławska.
[120] Rong Jin. 2017. Graph-Based Rhythm Interpretation in Optical Music Recognition. Ph.D. Dissertation. Indiana University.
[121] Rong Jin and Christopher Raphael. 2012. Interpreting Rhythm in Optical Music Recognition. In 13th International Society for Music Information Retrieval Conference. Porto, Portugal, 151–156.
[122] Linn Saxrud Johansen. 2009. Optical Music Recognition. Master’s thesis. University of Oslo.
[123] Graham Jones, Bee Ong, Ivan Bruno, and Kia Ng. 2008. Optical Music Imaging: Music Document Digitisation, Recognition, Evaluation, and Restoration. In Interactive multimedia music technologies. IGI Global, 50–79.
[124] Elyor Kodirov, Sejin Han, Guee-Sang Lee, and YoungChul Kim. 2014. Music with Harmony: Chord Separation and Recognition in Printed Music Score Images. In 8th International Conference on Ubiquitous Information Management and Communication. Siem Reap, Cambodia, 1–8.
[125] Worapan Kusakunniran, Attapol Prempanichnukul, Arthid Maneesutham, Kullachut Chocksawud, Suparus Tongsamui, and Kittikhun Thongkanchorn. 2014. Optical music recognition for traditional Thai sheet music. In International Computer Science and Engineering Conference. 157–162.
[126] Wojciech Lesinski and Agnieszka Jastrzebska. 2015. Optical Music Recognition: Standard and Cost-Sensitive Learning with Imbalanced Data. In IFIP International Conference on Computer Information Systems and Industrial Management.
601–612.
[127] Karen Lin and Tim Bell. 2000. Integrating Paper and Digital Music Information Systems. In International Society for Music Information Retrieval. 23–25.
[128] Xiaoxiang Liu. 2012. Note Symbol Recognition for Music Scores. In Intelligent Information and Database Systems. Berlin, Heidelberg, 263–273.
[129] Xiaoxiang Liu, Mi Zhou, and Peng Xu. 2015. A Robust Method for Musical Note Recognition. In 14th International Conference on Computer-Aided Design and Computer Graphics. 212–213.
[130] Nawapon Luangnapa, Thongchai Silpavarangkura, Chakarida Nukoolkit, and Pornchai Mongkolnam. 2012. Optical Music Recognition on Android Platform. In International Conference on Advances in Information Technology. 106–115.
[131] Marcin Luckner. 2006. Recognition of Noised Patterns Using Non-Disruption Learning Set. In 6th International Conference on Intelligent Systems Design and Applications. 557–562.
[132] Simone Marinai and Paolo Nesi. 1999. Projection Based Segmentation of Musical Sheets. In 5th International Conference on Document Analysis and Recognition. 3–6.
[133] Cory McKay and Ichiro Fujinaga. 2007. Style-independent computer-assisted exploratory analysis of large music collections. Journal of Interdisciplinary Music Studies 1, 1 (2007), 63–85.
[134] John R. McPherson. 1999. Page Turning – Score Automation for Musicians. Technical Report. University of Canterbury, New Zealand.
[135] John R. McPherson. 2002. Introducing Feedback into an Optical Music Recognition System. In 3rd International Conference on Music Information Retrieval. Paris, France.
[136] John R. McPherson. 2006. Coordinating Knowledge To Improve Optical Music Recognition.
Ph.D. Dissertation. The University of Waikato. [137] John R. McPherson and David Bainbridge. 2002. Coordinating Knowledge Within an Optical Music Recognition System. Technical Report. University of Waikato, Hamilton, New Zealand. [138] Apurva A. Mehta and Malay S. Bhatt. 2015. Optical Music Notes Recognition for Printed Piano Music Score Sheet. In International Conference on Computer Communication and Informatics. Coimbatore, India. [139] Yevgen Mexin, Aristotelis Hadjakos, Axel Berndt, Simon Waloschek, Anastasia Wawilow, and Gerd Szwillus. 2017. Tools for Annotating Musical Measures in Digital Music Editions. In 14th Sound and Music Computing Conference. Espoo, Finland, 279ś286. [140] Du Min. 2011. Research on numbered musical notation recognition and performance in a intelligent system. In International Conference on Business Management and Electronic Information. 340ś343. [141] Hidetoshi Miyao and Minoru Maruyama. 2004. An online handwritten music score recognition system. In 17th International Conference on Pattern Recognition. [142] Igor dos Santos Montagner, Roberto Jr. Hirata, and Nina S. T. Hirata. 2014. Learning to remove staff lines from music score images. In International Conference on Image Processing. 2614ś2618. [143] Igor dos Santos Montagner, Roberto Jr. Hirata, and Nina S. T. Hirata. 2014. A Machine Learning based method for Staff Removal. In 22nd International Conference on Pattern Recognition. 3162ś3167. [144] Diego Nehab. 2003. Staff Line Detection by Skewed Projection. Technical Report. [145] Kia Ng. 2002. Music manuscript tracing. Lecture Notes in Computer Science 2390 (2002), 322ś334. [146] Kia Ng. 2004. Optical Music Analysis for Printed Music Score and Handwritten Music Manuscript. In Visual Perception of Music Notation: On-Line and Off Line Recognition. IGI Global, 108ś127. [147] Kia Ng and Roger Boyle. 1992. Segmentation of Music Primitives. In BMVC92. London, 472ś480. [148] Kia Ng, Roger Boyle, and David Cooper. 1995. 
Low- and high-level approaches to optical music score recognition. In IEE Colloquium on Document Image Processing and Multimedia Environments. 31ś36. [149] Kia Ng, Alex McLean, and Alan Marsden. 2014. Big Data Optical Music Recognition with Multi Images and Multi Recognisers. In EVA London 2014 on Electronic Visualisation and the Arts. 215ś218. [150] Vo Quang Nhat and GueeSang Lee. 2014. Adaptive Line Fitting for Staff Detection in Handwritten Music Score Images. In 8th International Conference on Ubiquitous Information Management and Communication. Siem Reap, Cambodia, 991ś996. [151] Jiri Novotnỳ and Jaroslav Pokornỳ. 2015. Introduction to Optical Music Recognition: Overview and Practical Challenges. In Annual International Workshop on DAtabases, TExts, Specifications and Objects. 65ś76. [152] Alexander Pacha. 2018. Self-learning Optical Music Recognition. In Vienna Young Scientists Symposium. 34ś35. ISBN: 978-3-9504017-8-3. [153] Alexander Pacha and Jorge Calvo-Zaragoza. 2018. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In 19th International Society for Music Information Retrieval Conference. Paris, France, 240ś247. [154] Alexander Pacha, Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg, Richard Zanibbi, and Horst Eidenberger. 2018. Handwritten Music Object Detection: Open Issues and Baseline Results. In 13th International Workshop on ACM Comput. Surv., Vol. 1, No. 1, Article 1. Publication date: January 2019. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek 1:48 Calvo-Zaragoza et al. Document Analysis Systems. 163ś168. [155] Alexander Pacha and Horst Eidenberger. 2017. Towards a Universal Music Symbol Classifier. 
In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 35ś36. [156] Alexander Pacha and Horst Eidenberger. 2017. Towards Self-Learning Optical Music Recognition. In 16th International Conference on Machine Learning and Applications. 795ś800. [157] Alexander Pacha, Jan Hajič jr., and Jorge Calvo-Zaragoza. 2018. A Baseline for General Music Object Detection with Deep Learning. Applied Sciences 8, 9 (2018), 1488ś1508. [158] Viet-Khoi Pham, Hai-Dang Nguyen, and Minh-Triet Tran. 2015. Virtual Music Teacher for New Music Learners with Optical Music Recognition. In International Conference on Learning and Collaboration Technologies. 415ś426. [159] Roberto M. Pinheiro Pereira, Caio E.F. Matos, Geraldo Jr. Braz, João D.S. de Almeida, and Anselmo C. de Paiva. 2016. A Deep Approach for Handwritten Musical Symbols Recognition. In 22nd Brazilian Symposium on Multimedia and the Web. Teresina, Piau; Brazil, 191ś194. [160] João Caldas Pinto, Pedro Vieira, and João M. Sousa. 2003. A new graph-like classification method applied to ancient handwritten musical symbols. Document Analysis and Recognition 6, 1 (2003), 10ś22. [161] Telmo Pinto, Ana Rebelo, Gilson Giraldi, and Jamie dos Santos Cardoso. 2010. Content Aware Music Score Binarization. Technical Report. Universidade do Porto, Portugal. [162] Telmo Pinto, Ana Rebelo, Gilson Giraldi, and Jamie dos Santos Cardoso. 2011. Music Score Binarization Based on Domain Knowledge. In Pattern Recognition and Image Analysis. 700ś708. [163] Laurent Pugin, John Ashley Burgoyne, and Ichiro Fujinaga. 2007. Goal-directed Evaluation for the Improvement of Optical Music Recognition on Early Music Prints. In 7th ACM/IEEE-CS Joint Conference on Digital Libraries. Vancouver, Canada, 303ś304. [164] Laurent Pugin, John Ashley Burgoyne, and Ichiro Fujinaga. 2007. MAP Adaptation to Improve Optical Music Recognition of Early Music Documents Using Hidden Markov Models. 
In 8th International Conference on Music Information Retrieval. 513ś516. [165] Laurent Pugin, John Ashley Burgoyne, and Ichiro Fujinaga. 2007. Reducing Costs for Digitising Early Music with Dynamic Adaptation. In Research and Advanced Technology for Digital Libraries. Berlin, Heidelberg, 471ś474. [166] Laurent Pugin and Tim Crawford. 2013. Evaluating OMR on the Early Music Online Collection. In 14th International Society for Music Information Retrieval Conference. Curitiba, Brazil, 439ś444. [167] Laurent Pugin, Jason Hockman, John Ashley Burgoyne, and Ichiro Fujinaga. 2008. Gamera versus Aruspix ś Two Optical Music Recognition Approaches. In 9th International Conference on Music Information Retrieval. [168] Christopher Raphael. 2011. Optical Music Recognition on the IMSLP. Technical Report. Indiana University, Bloomington. [169] Christopher Raphael and Rong Jin. 2013. Optical music recognition on the international music score library project. In IS&T/SPIE Electronic Imaging. [170] Ana Rebelo. 2008. New Methodologies Towards an Automatic Optical Recognition of Handwritten Musical Scores. Master’s thesis. Universidade do Porto. [171] Ana Rebelo. 2012. Robust Optical Recognition of Handwritten Musical Scores based on Domain Knowledge. Ph.D. Dissertation. University of Porto. [172] Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, and Jamie dos Santos Cardoso. 2007. A Shortest Path Approach for Staff Line Detection. In 3rd International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution. 79ś85. [173] Ana Rebelo, G. Capela, and Jamie dos Santos Cardoso. 2010. Optical recognition of music symbols. International Journal on Document Analysis and Recognition 13, 1 (2010), 19ś31. [174] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, Andre R.S. Marcal, Carlos Guedes, and Jamie dos Santos Cardoso. 2012. Optical music recognition: state-of-the-art and open issues. 
International Journal of Multimedia Information Retrieval 1, 3 (2012), 173ś190. [175] Ana Rebelo, André Marçal, and Jamie dos Santos Cardoso. 2013. Global constraints for syntactic consistency in OMR: an ongoing approach. In International Conference on Image Analysis and Recognition. [176] Ana Rebelo, Filipe Paszkiewicz, Carlos Guedes, Andre R. S. Marcal, and Jamie dos Santos Cardoso. 2011. A Method for Music Symbols Extraction based on Musical Rules. In Bridges 2011: Mathematics, Music, Art, Architecture, Culture. 81ś88. [177] Ana Rebelo, Jakub Tkaczuk, Sousa Sousa, and Jamie dos Santos Cardoso. 2011. Metric Learning for Music Symbol Recognition. In 10th International Conference on Machine Learning and Applications and Workshops. 106ś111. [178] K. Todd Reed and J. R. Parker. 1996. Automatic Computer Recognition of Printed Music. In 13th International Conference on Pattern Recognition. 803ś807. [179] Adrià Rico Blanes and Alicia Fornés Bisquerra. 2017. Camera-Based Optical Music Recognition Using a Convolutional Neural Network. In 14th International Conference on Document Analysis and Recognition. Kyoto, Japan, 27ś28. ACM Comput. Surv., Vol. 1, No. 1, Article 1. Publication date: January 2019. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek Understanding Optical Music Recognition 1:49 [180] Dan Ringwalt, Roger Dannenberg, and Andrew Russell. 2015. Optical Music Recognition for Interactive Score Display. In International Conference on New Interfaces for Musical Expression. Baton Rouge, Louisiana, USA, 95ś98. [181] Dan Ringwalt and Roger B. Dannenberg. 2015. Image Quality Estimation for Multi-Score OMR. In 16th International Society for Music Information Retrieval Conference. 17ś23. [182] David Rizo, Jorge Calvo-Zaragoza, and JoséM. 
Iñesta. 2018. MuRET: AMusic Recognition, Encoding, and Transcription Tool. In 5th International Conference on Digital Libraries for Musicology. Paris, France, 52ś56. [183] Florence Rossant. 2002. A global method for music symbol recognition in typeset music sheets. Pattern Recognition Letters 23, 10 (2002), 1129ś1141. [184] Florence Rossant and Isabelle Bloch. 2004. A fuzzy model for optical recognition of musical scores. Fuzzy Sets and Systems 141, 2 (2004), 165ś201. [185] Florence Rossant and Isabelle Bloch. 2005. Optical music recognition based on a fuzzy modeling of symbol classes and music writing rules. In IEEE International Conference on Image Processing 2005. IIś538. [186] Florence Rossant and Isabelle Bloch. 2006. Robust and Adaptive OMR System Including Fuzzy Modeling, Fusion of Musical Rules, and Possible Error Detection. EURASIP Journal on Advances in Signal Processing 2007, 1 (2006), 081541. [187] Martin Roth. 1994. An approach to recognition of printed music. Technical Report. Swiss Federal Institute of Technology. [188] Alan Ruttenberg. 1991. Optical Reading of Typeset Music. Master’s thesis. Massachusetts Institute of Technology, Boston, MA. [189] W. Brent Seales and Arcot Rajasekar. 1995. Interpreting music manuscripts: A logic-based, object-oriented approach. In Image Analysis Applications and Computer Graphics. Berlin, Heidelberg, 181ś188. [190] Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (2017), 2298ś2304. [191] Rui Miguel Filipe da Silva. 2013. Mobile framework for recognition of musical characters. Master’s thesis. Universidade do Porto. [192] Maciej Smiatacz and Witold Malina. 2008. Matrix-based classifiers applied to recognition of musical notation symbols. In 1st International Conference on Information Technology. 1ś4. 
[193] Javier Sober-Mira, Jorge Calvo-Zaragoza, David Rizo, and José Manuel Iñesta. 2017. Multimodal Recognition for Music Document Transcription. In 10th International Workshop on Machine Learning and Music. Barcelona, Spain. [194] Mahmood Sotoodeh, Farshad Tajeripour, Sadegh Teimori, and Kirk Jorgensen. 2017. A music symbols recognition method using pattern matching along with integrated projection and morphological operation techniques. Multimedia Tools and Applications (2017). [195] Mu-Chun Su, Chee-Yuen Tew, and Hsin-Hua Chen. 2001. Musical symbol recognition using SOM-based fuzzy systems. In Joint 9th IFSA World Congress and 20th NAFIPS International Conference. 2150ś2153 vol.4. [196] Mariusz Szwoch. 2005. A Robust Detector for Distorted Music Staves. In Computer Analysis of Images and Patterns. Berlin, Heidelberg, 701ś708. [197] Mariusz Szwoch. 2007. Guido: A Musical Score Recognition System. In 9th International Conference on Document Analysis and Recognition. 809ś813. [198] Mariusz Szwoch. 2008. Using MusicXML to Evaluate Accuracy of OMR Systems. In International Conference on Theory and Application of Diagrams. Herrsching, Germany, 419ś422. [199] Paul Taele, Laura Barreto, and Tracy Hammond. 2015. Maestoso: An Intelligent Educational Sketching Tool for Learning Music Theory. In 27th Conference on Innovative Applications of Artificial Intelligence. Austin, Texas, 3999ś 4005. [200] Lorenzo J. Tardón, Simone Sammartino, Isabel Barbancho, Verónica Gómez, and Antonio Oliver. 2009. Optical Music Recognition for Scores Written in White Mensural Notation. EURASIP Journal on Image and Video Processing 2009, 1 (2009), 843401. [201] Gabriel Taubman. 2005. MusicHand : A Handwritten Music Recognition System. Technical Report. Brown University. [202] Jessica Thompson, Andrew Hankinson, and Ichiro Fujinaga. 2011. Searching the Liber Usualis: Using CouchDB and ElasticSearch to Query Graphical Music Documents. 
In 12th International Society for Music Information Retrieval Conference. [203] Fubito Toyama, Kenji Shoji, and Juichi Miyamichi. 2006. Symbol Recognition of Printed Piano Scores with Touching Symbols. In 18th International Conference on Pattern Recognition. 480ś483. [204] Lukas Tuggener, Isamil Elezi, Jürgen Schmidhuber, Marcello Pelillo, and Stadelmann Thilo. 2018. DeepScores - A Dataset for Segmentation, Detection and Classification of Tiny Objects. In 24th International Conference on Pattern Recognition. Beijing, China. [205] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, and Thilo Stadelmann. 2018. Deep Watershed Detector for Music Object Recognition. In 19th International Society for Music Information Retrieval Conference. Paris, France, 271ś278. ACM Comput. Surv., Vol. 1, No. 1, Article 1. Publication date: January 2019. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek 1:50 Calvo-Zaragoza et al. [206] Eelco van der Wel and Karen Ullrich. 2017. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. In 18th International Society for Music Information Retrieval Conference. Suzhou, China. [207] Pedro Vieira and João Caldas Pinto. 2001. Recognition of musical symbols in ancient manuscripts. In International Conference on Image Processing. 38ś41 vol.3. [208] Gabriel Vigliensoni, John Ashley Burgoyne, Andrew Hankinson, and Ichiro Fujinaga. 2011. Automatic Pitch Detection in Printed Square Notation. In 12th International Society for Music Information Retrieval Conference. Miami, Florida, 423ś428. [209] Gabriel Vigliensoni, Gregory Burlet, and Ichiro Fujinaga. 2013. Optical measure recognition in common music notation. In 14th International Society for Music Information Retrieval Conference. Curitiba, Brazil. 
[210] Gabriel Vigliensoni, Jorge Calvo-Zaragoza, and Ichiro Fujinaga. 2018. Developing an environment for teaching computers to read music. In 1st International Workshop on Reading Music Systems. Paris, France, 27ś28. [211] Quang Nhat Vo, Guee Sang Lee, Soo Hyung Kim, and Hyung Jeong Yang. 2017. Recognition of Music Scores with Non-Linear Distortions in Mobile Devices. Multimedia Tools and Applications (2017). [212] Quang Nhat Vo, Tam Nguyen, Soo-Hyung Kim, Hyung-Jeong Yang, and Guee-Sang Lee. 2014. Distorted music score recognition without Staffline removal. In 22nd International Conference on Pattern Recognition. 2956ś2960. [213] Marc Vuilleumier Stückelberg and David Doermann. 1999. On musical score recognition using probabilistic reasoning. In 5th International Conference on Document Analysis and Recognition. 115ś118. [214] Marc Vuilleumier Stückelberg, Christian Pellegrini, and Mélanie Hilario. 1997. An architecture for musical score recognition using high-level domain knowledge. In 4th International Conference on Document Analysis and Recognition. 813ś818 vol.2. [215] Marc Vuilleumier Stückelberg, Christian Pellegrini, and Mélanie Hillario. 1997. A preview of an architecture for musical score recognition. Technical Report. University of Geneva. [216] Matthias Wallner. 2014. A System for Optical Music Recognition and Audio Synthesis. Master’s thesis. TU Wien. [217] Lee Ling Wei, Qussay A. Salih, and Ho Sooi Hock. 2008. Optical Tablature Recognition (OTR) system: Using Fourier Descriptors as a recognition tool. In International Conference on Audio, Language and Image Processing. 1532ś1539. [218] Cuihong Wen, Ana Rebelo, Jing Zhang, and Jamie dos Santos Cardoso. 2014. Classification of optical music symbols based on combined neural network. In International Conference on Mechatronics and Control. 419ś423. [219] Cuihong Wen, Ana Rebelo, Jing Zhang, and Jamie dos Santos Cardoso. 2015. A new optical music recognition system based on combined neural network. 
CHAPTER 3
Towards Self-Learning Optical Music Recognition

The paper “Towards Self-Learning Optical Music Recognition” [PE17b], published at the International Conference on Machine Learning and Applications 2017 in Cancun, Mexico, was the first publication of my research and lays out its overall plan. In particular, it discusses why a new paradigm is needed for OMR and what it could look like. Along with the discussion, two experiments are presented. In the first one, a convolutional neural network was trained to distinguish images of music scores from images depicting something else, such as natural photographs, documents, or tables. The goal was to see whether a neural network was capable of learning the concept of “what music scores look like.” A real-world application emerged at WoRMS 2018, when a librarian expressed her need for such a classifier to assist her in automatically finding music scores among millions of documents in her library.
For training the network, a new dataset of 2000 images was collected by taking real photos of scores and other documents from various angles and under various lighting conditions. The results on that dataset were exceptional, with nearly 100% accuracy, which suggests that the task is easily solvable with deep learning.

The second experiment reproduced a previously conducted study, trying to classify isolated, handwritten music symbols from the HOMUS [CZO14] dataset with a deep convolutional neural network. Previously reported results were already very good, with 96% and 97% accuracy, so the state of the art could only be improved slightly to 98% accuracy, which is even better than the performance of humans on the same task (95% accuracy). Another interesting observation was made: the trained network coped exceptionally well (97% accuracy) with superimposed staves that were artificially introduced into the images of isolated symbols. This indicates that the removal of staves might be superfluous when using convolutional neural networks.

Towards Self-Learning Optical Music Recognition

Alexander Pacha, Horst Eidenberger
Interactive Media Systems, TU Wien, Vienna, Austria
alexander.pacha@tuwien.ac.at, horst.eidenberger@tuwien.ac.at

Abstract: Optical Music Recognition (OMR) is a branch of artificial intelligence that aims at automatically recognizing and understanding the content of music scores in images. Several approaches and systems have been proposed that try to solve this problem by using expert knowledge and specialized algorithms, but they tend to generalize poorly to a broader set of scores, imperfect image scans, or data with different formatting.
In this paper we propose a new approach to solve OMR by investigating how humans read music scores and by imitating that behavior with machine learning. To demonstrate the power of this approach, we conduct two experiments that teach a machine to distinguish entire music sheets from arbitrary content through frame-by-frame classification and to distinguish between 32 classes of handwritten music symbols, which can serve as a basis for object detection. Both tasks can be performed at high rates of confidence (>98%), which is comparable to the performance of humans on the same task.

I. INTRODUCTION

Music plays a central role in our cultural heritage, with written music scores being an essential way of communicating the composer's intention to the musicians who perform a piece of music. Music notation encodes this information in a graphical form that follows certain syntactic and semantic rules to encode pitch, rhythm, tempo, and articulation. Optical Music Recognition (OMR) tries to recognize and understand the notation and the contents of an image so that a machine is able to comprehend the music. Given a system that is able to translate an image into a machine-readable format, the applications are manifold, including the preservation and digitization of handwritten manuscripts, supporting music education, or accompanying musicians who practice their performance.

Although considerable research has been conducted and many systems have been developed [1] that reportedly perform well on the specific set of music scores for which they have been designed, the robustness and extensibility of these systems is limited. The underlying architecture and the algorithms used discard information and propagate errors from one step to the next, e.g., an error in the binarization, which is often the first step of an OMR system, might cause the symbol detection to detect notes where there are none.
Many algorithms have been proposed to improve individual steps of this linear process, but to the best of our knowledge, there exists no system that is capable of automatically recognizing a large set of real-world data with satisfactory precision, good usability, and reasonably low editing costs [2] for errors that were introduced during the process.

Many people could benefit from a large, digitized body of music scores that is accessible and searchable [3]. As a result, there are ongoing projects that work towards this goal, including SIMSSA¹ and OpenScore². To support such projects, we propose a new approach: rather than designing features and defining rules by hand, the system should learn to extract features and appropriate rules by itself (given a certain amount of supervision). Ideally, such a system is capable of transcribing music scores as accurately as humans.

II. RELATED WORK

OMR has been a subject of interest at least since 1966 [4], and received substantial attention from Bainbridge and Bell [5], who established a general framework for OMR that has been adopted by many researchers [1]. Since then, many researchers have suggested entire OMR systems [6], [7] or proposed specialized algorithms for solving or improving sub-tasks such as binarization [8] or staff-line detection and removal [9], [10]. However, most of them use ad-hoc solutions based on expert knowledge and follow widely used practices that work best on datasets fulfilling certain prerequisites, e.g., detecting staff-lines with horizontal projections requires the scores to have straight staff-lines. Unfortunately, these systems tend to experience difficulties when confronted with images that deviate from the expected input format for which they were designed (e.g., if the staff-lines are curved due to the binding of a textbook). Adding another preprocessing step or improving an algorithm can help to overcome one limitation or another, but might not help a system to gain robustness beyond a certain level.
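The dependence of projection-based staff-line detection on straight staff-lines can be illustrated with a small sketch. It is not taken from any of the cited systems: the function names (`row_ink_ratio`, `detect_staff_rows`, `make_staff`), the threshold of 0.8, and the synthetic images are assumptions chosen purely for illustration.

```python
def row_ink_ratio(image):
    """Horizontal projection: fraction of dark pixels (value 1) in each row."""
    return [sum(row) / len(row) for row in image]


def detect_staff_rows(image, threshold=0.8):
    """Rows that are dark enough to count as staff-line candidates."""
    return [y for y, ratio in enumerate(row_ink_ratio(image)) if ratio >= threshold]


def make_staff(width, height, line_rows, step=None):
    """Draw one-pixel-thick staff-lines into a binary image.

    step=None draws perfectly horizontal lines; an integer step lowers
    each line by one pixel every `step` columns, imitating a skewed scan.
    """
    image = [[0] * width for _ in range(height)]
    for start in line_rows:
        for x in range(width):
            y = start if step is None else start + x // step
            if 0 <= y < height:
                image[y][x] = 1
    return image


LINE_ROWS = [5, 9, 13, 17, 21]  # hypothetical staff with 4-pixel spacing
straight = make_staff(width=40, height=30, line_rows=LINE_ROWS)
skewed = make_staff(width=40, height=30, line_rows=LINE_ROWS, step=5)

straight_rows = detect_staff_rows(straight)
skewed_rows = detect_staff_rows(skewed)

print(straight_rows)  # → [5, 9, 13, 17, 21]
print(skewed_rows)    # → []
```

In the skewed image no row contains more than a quarter of dark pixels, so the fixed threshold finds nothing even though all five lines are still present. This is the kind of violated prerequisite that the paragraph above refers to.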
In the last few years, machine learning, and especially Deep Learning with Convolutional Neural Networks (CNNs), has received a lot of attention, with results that surpass human-level performance on computer vision tasks such as image classification [11]. Wen et al. proposed a machine learning approach for symbol segmentation and symbol classification [12] in combination with a pre-defined ruleset. Calvo-Zaragoza et al. [13] classify music scores at pixel level with CNNs into foreground, background, and staff-lines. Gallego et al. [14] use auto-encoders to remove staff-lines, and finally Pinheiro Pereira et al. [15] classify handwritten music symbols from the HOMUS database [16] into 32 different categories with a precision of over 96%. Together, they provide strong evidence that machine learning can successfully be applied to develop new types of OMR systems that are robust and extensible to a wide range of scores.

¹ http://simssa.ca/, last visited on Oct. 4, 2017
² http://openscore.cc, last visited on Oct. 4, 2017

2017 16th IEEE International Conference on Machine Learning and Applications. 0-7695-6321-X/17/31.00 ©2017 IEEE. DOI 10.1109/ICMLA.2017.00-60

III. HOW HUMANS READ SCORES

We believe that an OMR system should be able to read and comprehend music scores with all their facets as well as humans do. To the best of our knowledge, there exists no system that comes close to human performance [1]. As far as it is understood today, humans process visual scenes in a hierarchical way at three levels [17, p. 557]:

1) Low-level, where contrast, orientation, color, and movement are processed, primarily in the retina and ganglion cells [17, p. 600].
2) Intermediate-level, where the layout of the scene is processed by parsing the visual image into contours and surfaces of objects and segregating them from the background, involving the primary visual cortex [17, p. 619].
3) High-level, where actual object identification is performed by matching surfaces and contours to known shapes from our memory (or, more precisely, to their neuronal representation), which happens primarily in the Inferior Temporal Cortex [17, p. 622].

By processing visual information in this hierarchical way, humans become very good at arriving at scene descriptions and grasping the gist of a scene. But reading music scores includes not only the visual perception of objects, but also relating objects to each other and to the context. Unfortunately, little is known today about how humans perform this task, apart from certain brain regions that have been identified as being involved in it [18], [17, p. 1353]. Note that for relating elements to each other and interpreting them correctly, humans appear to use all the information available. For music scores, this includes the staff-lines as the reference system, knowledge about the type of music, the notational system, and also prior knowledge such as the probabilities of continuations within idioms [18] to resolve ambiguities if the available information is incomplete or doubtful. The expectancy can even replace a stimulus, making up for misprints, as shown in the Goldovsky experiment [18], which indicates that reading involves both top-down (or conceptually-driven) and bottom-up (or data-driven) processes.

Learning from the way humans read scores, binarizing the image as a first step or removing staff-lines seems to be counterproductive, as it discards potentially relevant information.
In summary, we conclude that OMR systems could benefit from operating directly on the input image (which is possible with Deep Learning), from providing feedback loops from later steps to refine earlier steps, and from considering information that might not have been used so far.

IV. HOW MACHINES READ SCORES

David Marr proposed a computational framework of vision that has three levels and that appears very useful to us when discussing vision problems [19]:

• Computational theory, which specifies how a vision task can be solved in principle.
• Algorithmic level, which gives precise details on how the theory can be implemented. In other words: what are the input and the output, and how can the output be obtained from the input?
• Hardware for realizing the algorithm in a physical system (which is not necessarily computer hardware, but in our case it is).

Given this framework, we think that the computational theory of how humans or machines can read scores is correct and sound: detecting systems, staves, and staff-lines and using them as structural guidance is a solid foundation; segmenting elements into smaller parts and constructing a relational mapping leads to a symbolic representation; finally, this symbolic representation can be interpreted in its context, according to syntactic and semantic rules that correspond to a particular notational language.

The algorithmic level, however, seems to be much harder to solve, possibly because the inherent complexity of the problem is often underestimated. Many proposed approaches can be seen as concept-driven because they use prior knowledge of the specific object, in this case, music sheets. We believe that a data-driven, Deep Learning approach is a viable alternative that should be investigated further. Therefore, we propose the following five questions as a model for bottom-up music processing. They are specifically formulated to facilitate the development of such an Optical Music Recognition algorithm. Can a machine mimic human behavior in ...
Q-I distinguishing between music scores and arbitrary content?
Q-II understanding the structure of music scores (staves, systems) and distinguishing basic music symbols from each other and from the background?
Q-III detecting and locating music symbols (notes, rests, ornaments, accidentals, bar-lines, articulations, ...) in the scores?
Q-IV understanding the relation of objects to each other in music scores (the relation between a note and the staff-lines, an accidental to the left of a note which relates to that note, etc.)?
Q-V fully understanding the syntax and semantics of music scores (inferring the actual note from relative position, shape, and preceding symbols such as key signatures or accidentals)?

These five questions define our research program for the data-driven investigation of the OMR problem using deep networks. In our opinion, each question can be solved using an appropriate model and sufficient data. Note that the questions are of increasing complexity, with Q-V representing a complete system that is capable of reading scores and fully understanding their content like humans. Q-I and Q-II can be implemented by using CNNs that operate directly on the raw input data. A promising approach for Q-III is to extend a classifier into an object detector by using region proposal networks [20]. As for questions Q-IV and Q-V, Recurrent Neural Networks (RNNs) seem to be a good fit [21], as they can learn relationships in sequential data and have already achieved remarkable results in Optical Character Recognition [22], a task that is comparable to OMR but in many regards simpler [5].

V. EXPERIMENTATION

To evaluate whether a data-driven approach is suitable for improving the state of the art in OMR, two experiments were conducted that try to answer Q-I and, partially, Q-II. The first recognizes music scores in an image and classifies that image into one of two categories: 'scores' or 'other'. The second classifies isolated handwritten music symbols into 32 different classes, reproducing [15] in greater depth and improving their results significantly. For both experiments, a Convolutional Neural Network was trained using the popular Deep Learning frameworks Keras³ and TensorFlow⁴. The resulting models can then be used for inference on almost any machine, including mobile devices (see Figure 1), to classify images from the live camera feed and display a frame-by-frame classification.

A. Datasets

The dataset used for training, validation, and testing in the first experiment contains over 5500 images, of which 2000 images contain scores and 3500 images contain something else (see Table I). The largest portion was obtained from two publicly available datasets: the MUSCIMA database, which contains 1000 handwritten music scores [23], and the training database of the Pascal VOC Challenge 2006, which contains over 2600 images [24] that were considered part of the ground truth for the category 'other'. Additionally, we created a new dataset containing 2000 imperfect but realistic images by taking 1000 images depicting music scores and 1000 images of text documents and other objects with a smartphone camera. Preliminary testing showed that text documents were likely to be confused with scores, especially if they contain tables. Hence, a large portion of the additional images contains such documents in order to enable the network to learn the distinction. The complexity of the scores ranges from simple children's tunes to modern orchestral scores, taken in various lighting conditions and from different angles.

³ http://keras.io/, last visited on Oct. 4, 2017
⁴ http://www.tensorflow.org/, last visited on Oct. 4, 2017

Figure 1: Screenshots of the Android application, classifying a sheet of music scores (top left) and a table with data (top right) with a certainty of 99%. When presented with images that contain scores and text (bottom left) or unusual forms (bottom right), certainty drops to approximately 70%, but the system still classifies the image correctly.

The dataset for the second experiment is the Handwritten Online Musical Symbols (HOMUS) dataset [16], which contains 15200 samples of handwritten musical symbols, written by 100 different musicians⁵.

⁵ Note that the original dataset contained a few mistakes and artifacts that were reported to the authors and corrected before the training; see https://github.com/apacha/Homus for details, last visited on Oct. 4, 2017

B. Architecture and Training

For both experiments, various network architectures were evaluated, including a VGG-like architecture [25] and residual networks [26]. The first experiment attempts to answer Q-I and uses color images that are non-uniformly resized to 128x128 pixels for the first trial and to 256x256 pixels for the second. For the second experiment, which is targeted towards Q-II, black and white images are generated from the textual representation of strokes by connecting the points of each stroke. Since individual symbols vary drastically in size, while CNNs expect a fixed-size image as input, the following two approaches were evaluated:
at /b ib lio th ek Handwritten scores Images of scores Images of documents Other images Table I: Sample images of the various categories, as they were shown to the classifier during training (non-uniformly resized). The upper two rows form the class ’scores’ and the lower two rows the category ’other’. 1) Drawing the symbols in the center of a large enough canvas that fits most of them (e.g. 192x96 pixels, with only 23 out of 15200 symbols exceeding this size) 2) Drawing each symbol in a canvas that exactly fits its size and rescaling all symbols non-uniformly to a fixed size, e.g. 96x96 pixels These particular sizes were empirically selected because they yielded the best results while allowing multiple down- scaling operations by a factor of two without interpolation. Batch-normalization, early-stopping, weight-decay and dynamic learning-rate-reduction are used as regularization strategies to improve training speed and overall performance. Random-rotation by 10° and random-zoom of 20% are used as data-augmentation strategies to simulate the images being taken from slightly different points-of-view which leads to results that are robust to minor variations. C. Evaluation To evaluate each experiment, the respective dataset was split into three parts of which 80% are used as training data, 10% are used for validation during the training and for hyperparameter optimization and the final 10% are used for evaluating the performance of the trained model on previously unseen data. To obtain a baseline, a subset of the images was also shown to a number of people that were asked to perform the same classification task in a desktop application on a computer screen. The application did not allow for zooming and the users classified the images using the keyboard but were allowed to go back and revise their decisions without any time constraints. 1) First Experiment: Typical training took 30 epochs before early stopping the training to prevent overfitting. 
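The 80/10/10 split described at the beginning of Section V-C can be illustrated with a minimal sketch. It assumes a plain random split with a fixed seed; whether the original experiments stratified the split by class is not stated here, and the function name, the seed and the file names are hypothetical.

```python
import random


def split_dataset(samples, train=0.8, validation=0.1, seed=42):
    """Shuffle the samples and split them into training, validation and test parts.

    The remainder after the training and validation parts becomes the test part.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumption) for repeatability
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return training, validation_set, test


# Hypothetical file names standing in for the 15200 HOMUS samples.
samples = [f"symbol_{i:05d}.png" for i in range(15200)]
training, validation_set, test = split_dataset(samples)
print(len(training), len(validation_set), len(test))  # 12160 1520 1520
```

With 15200 samples this yields a test part of 1520 images, which matches the test-set size reported for the second experiment.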
The trained model classified 98.5% of the images in the test set correctly on the 128x128 pixels condition and 100% on the 256x256 condition, meaning that this task appears almost trivial to the machine.

The more than 500 images from the test set were also shown to three different users, who were asked to manually classify them either as 'something that displays music scores' or 'something else'. The images were down-scaled to the same 128x128 pixels, which corresponds to approximately 3.5 cm on a desktop screen. In total, they classified over 1500 images with an average precision of 96.49%. The main source of error was the very small size of the images. Partially repeating the process with images of size 256x256 pixels, which corresponds to approximately 7 cm on a desktop screen, showed that humans can perform this task without exceptional errors.

Figure 2: Superimposed staff-lines over isolated symbols to create meaningful context. Five parallel lines are drawn with an equal spacing of 14 pixels between each line [16]. From left to right: Quarter-Note, F-Clef, Eighth-Rest, Sharp, Whole-Half-Rest, Sixty-Four-Note

2) Second Experiment: The second experiment contains a wide range of conditions whose effects were investigated: image-size, stroke-thickness, superimposing staff-lines (see Figure 2) and of course the hyperparameters for the training of a deep neural network, including the network architecture, the optimizer used, and the minibatch-size. A total of over 150 different hyperparameter-combinations were tested and documented.
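Superimposing staff-lines as described for Figure 2 can be sketched as follows. Only the five lines and the 14-pixel spacing come from the text; the vertical centering, the one-pixel line thickness and the list-of-rows image representation are assumptions made for illustration.

```python
def superimpose_staff_lines(image, spacing=14, line_count=5):
    """Return a copy of a binary image (list of rows, 1 = ink) with staff-lines added.

    The staff is centered vertically (an assumption); each line is one pixel thick.
    """
    height = len(image)
    result = [row[:] for row in image]  # leave the input image untouched
    staff_height = (line_count - 1) * spacing
    top = (height - staff_height) // 2
    for i in range(line_count):
        y = top + i * spacing
        if 0 <= y < height:  # skip lines that would fall outside small images
            result[y] = [1] * len(result[y])
    return result


# A blank 96x96 canvas standing in for a rendered symbol.
canvas = [[0] * 96 for _ in range(96)]
with_lines = superimpose_staff_lines(canvas)
line_rows = [y for y, row in enumerate(with_lines) if all(row)]
print(line_rows)  # [20, 34, 48, 62, 76]
```

Because the function only sets pixels and never clears them, ink that belongs to the symbol itself is preserved where it crosses a staff-line.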
The following hyperparameters have empirically been shown to work very well for this task:

• Monitoring the accuracy on the validation set after each epoch and reducing the learning-rate by a factor of 0.5 if it does not improve for 8 epochs. Similarly, the entire training was stopped if no improvement was observed for 20 epochs.
• Adam, Adadelta and Stochastic gradient descent (SGD) were evaluated as optimizers, with Adadelta performing slightly better than Adam and much better than SGD.
• Evaluated minibatch-sizes included 16, 32 and 64, but the impact is rather small and in our opinion can be neglected.

The obtained results reach up to 98.02% accuracy on a test-set of 1520 images, which is a significant improvement compared to previously reported results of 97.26% [27] and 96.01% [15]. For images with undistorted symbols drawn on a fixed canvas (Section V-B, approach 1), a ResNet architecture with 25 convolutional layers and about five million parameters performed best. Similar results were obtained with a VGG architecture for non-uniformly resized symbols (Section V-B, approach 2) that consists of 13 convolutional layers and about 8 million parameters. The results of the best run, broken down by symbol class, are given in Table II and show that the network struggled most with notes and rests that are only discriminable by the number of flags, such as Thirty-Two- and Sixty-Four-Notes.

Five users were asked to perform the same task on a random sample of the dataset. In total, they classified 1520 images with an average precision of 95% and experienced most difficulties with Quarter-Rests and Sixteenth-Rests, both of which have manifestations that deviate dramatically from their printed counterparts or are simply ambiguous (see Figure 3).

Table II: The recall and precision per class for the best trained residual network in comparison to human performance on the same task.

                    Residual Network      Human test subjects
Class name          Recall   Precision    Recall   Precision
12-8-Time           1.00     1.00         1.00     0.97
2-2-Time            1.00     1.00         0.95     1.00
2-4-Time            0.97     0.95         1.00     0.98
3-4-Time            0.95     1.00         1.00     0.97
3-8-Time            1.00     1.00         1.00     1.00
4-4-Time            1.00     0.98         0.97     1.00
6-8-Time            1.00     1.00         1.00     1.00
9-8-Time            1.00     1.00         1.00     1.00
Barline             1.00     0.98         0.97     0.92
C-Clef              1.00     1.00         1.00     0.91
Common-Time         1.00     1.00         0.97     1.00
Cut-Time            0.95     1.00         0.98     0.98
Dot                 0.97     1.00         1.00     1.00
Double-Sharp        1.00     1.00         0.97     1.00
Eighth-Note         0.99     0.95         0.92     0.98
Eighth-Rest         1.00     1.00         0.98     0.86
F-Clef              1.00     1.00         0.97     0.92
Flat                0.97     1.00         0.95     0.95
G-Clef              1.00     0.95         0.98     0.98
Half-Note           1.00     1.00         0.97     0.94
Natural             0.95     1.00         0.74     1.00
Quarter-Note        1.00     1.00         0.93     0.95
Quarter-Rest        0.95     0.95         0.89     0.82
Sharp               1.00     1.00         1.00     0.97
Sixteenth-Note      0.94     0.95         0.90     0.92
Sixteenth-Rest      0.97     0.97         0.76     0.81
Sixty-Four-Note     0.96     0.95         0.94     0.94
Sixty-Four-Rest     0.97     0.97         0.83     0.97
Thirty-Two-Note     0.91     0.95         0.99     0.91
Thirty-Two-Rest     0.97     0.95         0.91     0.89
Whole-Half-Rest     1.00     0.98         1.00     1.00
Whole-Note          1.00     0.98         1.00     0.98

Another very interesting detail was observed: When superimposing staff-lines as depicted in Figure 2, test-accuracy remains at high rates of up to 97.03%, indicating that the network can learn to ignore them almost entirely, thus providing evidence that staff-line removal might be omitted in future systems, as discussed in Section III.

VI. CONCLUSION

Given the results presented in Section V-C, we conclude that Q-I can be answered with yes, showing that humans and machines can achieve similar results on the given dataset.
Detecting music scores and distinguishing them from arbitrary content is a relatively easy problem compared to the entire challenge of Optical Music Recognition, but experiment 1 shows that machines can learn something as abstract as the concept of 'what music scores look like' when given enough data and a Deep Learning approach. As for Q-II, we showed that a CNN can be trained to distinguish handwritten music symbols from each other at high rates of confidence, even with staff-lines being present. When combining these results with the work from [28] and [13], we conclude that Q-II can also be answered with yes.

Figure 3: Examples of symbols from the test set that were misclassified by the machine (a) and by humans (b). Their intended classes from left top to right bottom: Sixteenth-Rest, 2-4-Time, Sixteenth-Note, Cut-Time, Quarter-Rest, Sixteenth-Note, Quarter-Note, Sixty-Four-Note, Sixty-Four-Rest, Quarter-Rest, Natural, and Sixty-Four-Note.

VII. FUTURE WORK

To promote collaboration and reproducibility, all datasets, the entire source-code and the raw data from both experiments have been released on GitHub at https://github.com/apacha/MusicScoreClassifier and https://github.com/apacha/MusicSymbolClassifier under a liberal MIT-license. We are confident that by following the described path, an OMR system can be created that is capable of not only classifying entire images but also recognizing the structure of the document, reliably detecting objects in the image and even understanding the relation of elements to each other, without formulating explicit rules and only by training appropriate models on a comprehensive dataset.

REFERENCES

[1] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, vol. 1, no. 3, pp. 173–190, 2012.
[2] P. Bellini, I. Bruno, and P. Nesi, “Assessing optical music recognition tools,” Computer Music Journal, vol. 31, no. 1, pp. 68–93, 2007.
[3] A. Laplante and I. Fujinaga, “Digitizing musical scores: Challenges and opportunities for libraries,” in Proceedings of the 3rd International workshop on Digital Libraries for Musicology. ACM, 2016.
[4] A. Rebelo, G. Capela, and J. S. Cardoso, “Optical recognition of music symbols,” International Journal on Document Analysis and Recognition (IJDAR), vol. 13, no. 1, pp. 19–31, 2010.
[5] D. Bainbridge and T. Bell, “A music notation construction engine for optical music recognition,” Software: Practice and Experience, vol. 33, no. 2, pp. 173–200, 2003.
[6] L. Pugin, J. Hockman, J. A. Burgoyne, and I. Fujinaga, “Gamera versus Aruspix – two optical music recognition approaches,” in ISMIR 2008–Session 3C–OMR, Alignment and Annotation, 2008.
[7] Y.-S. Chen, F.-S. Chen, and C.-H. Teng, “An optical music recognition system for skew or inverted musical scores,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 27, no. 7, 2013.
[8] Q. N. Vo, S. H. Kim, H. J. Yang, and G. Lee, “An MRF model for binarization of music scores with complex background,” Pattern Recognition Letters, vol. 69, pp. 88–95, 2016.
[9] C. Dalitz, M. Droettboom, B. Pranzas, and I. Fujinaga, “A comparative study of staff removal algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5, May 2008.
[10] J. dos Santos Cardoso, A. Capela, A. Rebelo, C. Guedes, and J. P. da Costa, “Staff detection with stable paths,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 6, June 2009.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[12] C. Wen, A. Rebelo, J. Zhang, and J. Cardoso, “A new optical music recognition system based on combined neural network,” Pattern Recognition Letters, vol. 58, pp. 1–7, 2015.
[13] J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga, “Document analysis for music scores via machine learning,” in Proceedings of the 3rd International workshop on Digital Libraries for Musicology. ACM, 2016, pp. 37–40.
[14] A.-J. Gallego and J. Calvo-Zaragoza, “Staff-line removal with selectional auto-encoders,” Expert Systems with Applications, vol. 89, 2017.
[15] R. M. Pinheiro Pereira, C. E. Matos, G. Braz Junior, J. D. de Almeida, and A. C. de Paiva, “A deep approach for handwritten musical symbols recognition,” in Proceedings of the 22nd Brazilian Symposium on Multimedia and the Web, ser. Webmedia ’16. New York, NY, USA: ACM, 2016, pp. 191–194.
[16] J. Calvo-Zaragoza and J. Oncina, “Recognition of pen-based music notation: The HOMUS dataset,” in 2014 22nd International Conference on Pattern Recognition, Aug 2014, pp. 3038–3043.
[17] E. R. Kandel, J. H. Schwartz, T. M. Jessell, S. A. Siegelbaum, and A. J. Hudspeth, Principles of neural science. McGraw-Hill New York, 2012, vol. 5.
[18] J. Sloboda, Exploring the musical mind. Oxford University Press, 2005.
[19] J. P. Frisby and J. V. Stone, Seeing, Second Edition: The Computational Approach to Biological Vision, 2nd ed. The MIT Press, 2010.
[20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015.
[21] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 99, 2016.
[22] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for OCR in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[23] A. Fornés, A. Dutta, A. Gordo, and J. Lladós, “CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal,” International Journal on Document Analysis and Recognition (IJDAR), vol. 15, no. 3, pp. 243–251, 2012.
[24] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, “The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results,” http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf, 2006.
[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[27] J. Calvo-Zaragoza, A.-J. Gallego, and A. Pertusa, “Recognition of handwritten music symbols with convolutional neural codes,” Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, 2017.
[28] J. Calvo-Zaragoza, A. Pertusa, and J. Oncina, “Staff-line detection and removal using a convolutional neural network,” Machine Vision and Applications, pp. 1–10, 2017.

CHAPTER 4
Towards A Universal Music Symbol Classifier

The classification of symbols can be seen as a substantial part of detecting objects in an image because modern approaches usually tackle the problem in two stages: the first stage proposes regions of interest in the image and the second stage classifies these proposals accordingly. Therefore, it is useful to evaluate how well a classification task can be solved with Deep Learning, especially when using both handwritten and typeset music scores.

The paper “Towards a Universal Music Symbol Classifier” [PE17a], presented at the 12th IAPR International Workshop on Graphics Recognition 2017 in Kyoto, Japan, extended the previously conducted classification experiment [PE17b] to a much larger scale. While the HOMUS dataset already contains 15 000 samples, the dataset collected for this work was significantly larger with over 90 000 isolated musical symbols, collected from seven heterogeneous datasets and categorized into 79 classes (see Fig. 4.1 for a few samples). The resulting dataset contains more than 74 000 handwritten symbols and more than 16 000 symbols that were typeset, unfortunately with a heavy class-imbalance.

A convolutional neural network was trained to classify the symbols, and the results were promising with an error rate below 2%. More than 200 different hyperparameter combinations were evaluated, including a range of model architectures (inspired by VGG [SZ14] and ResNet [HZRS16]), image-sizes and class-balancing methods. Some combinations performed slightly better than others. However, it should be noted that all tested combinations achieved error rates between 2% and 3%.
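The class-balancing methods mentioned above are not spelled out in this chapter. One common option, shown here purely as an illustration and not necessarily the method used in the experiments, is to weight each class inversely to its frequency so that rare classes contribute as much to the training loss as frequent ones. The class counts below are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    The weights are scaled so that every class contributes the same total
    weight and the weights summed over all samples equal the sample count.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {name: total / (n_classes * count) for name, count in counts.items()}


# Hypothetical, heavily imbalanced label distribution.
labels = ["Quarter-Note"] * 900 + ["G-Clef"] * 90 + ["Breath-Mark"] * 10
weights = inverse_frequency_weights(labels)
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.2f}")
```

Such a dictionary could then be passed to a training routine that accepts per-class weights; oversampling rare classes is an alternative with a similar intent.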
As with all other experiments, the source code as well as the dataset and the results were made publicly available online [Pac17a]. The exact details of the best-performing hyperparameter combination can be found there.

Towards a Universal Music Symbol Classifier

Alexander Pacha
Institute of Software Technology and Interactive Systems
TU Wien
Vienna, Austria
alexander.pacha@tuwien.ac.at

Horst Eidenberger
Institute of Software Technology and Interactive Systems
TU Wien
Vienna, Austria
horst.eidenberger@tuwien.ac.at

Abstract: Optical Music Recognition (OMR) aims to recognize and understand written music scores. With the help of Deep Learning, researchers were able to significantly improve the state-of-the-art in this research area. However, Deep Learning requires a substantial amount of annotated data for supervised training. Various datasets have been collected in the past, but without a common standard that defines data formats and terminology, combining them is a challenging task. In this paper we present our approach towards unifying multiple datasets into the largest currently available body of over 90000 musical symbols that belong to 79 classes, containing both handwritten and printed music symbols. A universal music symbol classifier, trained on such a dataset using Deep Learning, can achieve an accuracy that exceeds 98%.

Index Terms: Optical Music Recognition, dataset, classification, deep learning

I. INTRODUCTION

Optical Music Recognition (OMR) is an area of document analysis that aims to automatically understand written music scores [1].
Given an image of musical scores, an OMR system attempts to recognize the content and translate it into a machine-readable format such as MusicXML. Music symbol classification is the subtask of OMR in which isolated symbols are assigned class labels. In this work we present the first attempt at building a universal music symbol classifier that is capable of classifying music symbols regardless of whether they are printed or handwritten. To build such a classifier, we propose a data-driven approach. Therefore, we developed tools that can unify multiple datasets into a single large dataset on which the universal music symbol classifier can be trained. In our test setup, we unified seven datasets into a collection of over 90000 samples, belonging to 79 classes.

II. DATASETS

For training a universal music symbol classifier, we tried to obtain the largest possible dataset that contains both printed and handwritten symbols. We did so by combining the following publicly available datasets:

• The Handwritten Online Musical Symbols (HOMUS) dataset [2] contains 15200 samples of isolated music symbols of 32 different classes.
• The MUSCIMA++ dataset [3] is the largest available dataset that contains detailed annotations for the underlying CVC-MUSCIMA dataset [4] of handwritten music scores. More than 55000 complete symbols can be extracted from the music symbol primitives.
• The group of Rebelo et al. collected at least three different datasets [5], containing more than 15000 printed music symbols.
• The group of Fornés et al. collected a dataset of approximately 4100 images of handwritten symbols [6] depicting accidentals and clefs.
• The Audiveris OMR dataset¹ is a small dataset of four images of scores, along with annotations of 400 printed symbols in those images.
• The Printed Music Symbols dataset² is a new dataset created by us, in which we collected more than 200 printed music symbols of 36 different classes.
• The OpenOMR dataset³ is the last included dataset; it contains 500 printed music symbols of seven different classes.

The resulting dataset contains more than 74000 handwritten and more than 16000 printed symbols, with a substantial amount of inter-class variation.

¹ https://github.com/Audiveris/omr-dataset-tools
² https://github.com/apacha/PrintedMusicSymbolsDataset
³ https://sourceforge.net/projects/openomr/

2017 14th IAPR International Conference on Document Analysis and Recognition, 2379-2140/17 $31.00 © 2017 IEEE, DOI 10.1109/ICDAR.2017.265

III. UNITING THE DATASETS

A. Selecting classes and resolving ambiguities

Modern musical notation comprises over 100 different symbol classes. Some classes, like quarter notes or G clefs, are very common, whereas others, like glissando or breath marks, are rarely used or only used for specific instruments. Apart from selecting which classes to include in the dataset (ideally all of them), one has to deal with ambiguous class names. For example, a quarter note may also be called a crotchet, and a G clef is also referred to as a treble clef. To resolve this issue, a common terminology is selected and all aliases and variations are mapped to those names. The actual names are secondary, as long as the schema is clear. We follow the naming conventions of the HOMUS dataset and map all other names to their respective counterparts, or to similar class names if they did not exist in the HOMUS dataset.

Besides class names, symbols themselves can be ambiguous too. Although having the same visual appearance, they might resolve to different semantics depending on the context (e.g. tie vs. slur vs. phrase mark or staccato vs. dot of a dotted note).
This ambiguity cannot be resolved when working with isolated symbols outside of a context which determines the class. Therefore, all ambiguous symbols are placed in a unifying super-class such as Dot or Whole-Half-Rest.

B. Joining different levels of decomposition

Some creators of OMR systems suggest decomposing music symbols into individual primitives (e.g. note-heads, stems, numbers, letters) and combining them in a later stage, whereas others choose to work with entire sets of symbols that might consist of multiple smaller units (e.g. eighth-note, 2/4-time). This decision can be made for notes, accidentals, numbers, and letters. While some primitives form a class on their own (e.g. flat or sharp), others do not (e.g. stem, flag). Datasets with different conventions are at least partially incompatible. To integrate them nevertheless, a decision has to be made for each type: whether to exclude samples, use primitive symbol classes, preprocess primitives into compound symbols, or enumerate all variants of combining primitives (e.g. 2/4-time, 3/4-time, 6/8-time, ...). To lose as little data as possible when joining the mentioned datasets, we propose a mixed approach: notes only appear as compound classes, which requires preprocessing in some cases; time signatures are enumerated; and key signatures consisting of multiple flats or sharps are excluded, with only their primitives being considered.

C. Tools for the automatic unification

We have built tools that are capable of automatically downloading all datasets and processing them. As images are the input for music symbol classification in OMR, all other representations have to be processed to obtain images: Our HOMUS image generator renders textual descriptions into symbol images, and the MUSCIMA++ image generator creates symbol images from the underlying masks.
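How a textual stroke description can be turned into a symbol image may be easier to see in code. The sketch below assumes that each stroke is given as a semicolon-separated list of x,y points (the exact HOMUS file layout may differ) and connects consecutive points with Bresenham's line algorithm. It is not the HOMUS image generator from the published tools, and all function names are hypothetical.

```python
def parse_stroke(text):
    """Parse a stroke given as 'x,y;x,y;...' (assumed format) into integer points."""
    points = []
    for pair in text.strip().split(";"):
        if pair:
            x, y = pair.split(",")
            points.append((int(x), int(y)))
    return points


def draw_line(canvas, start, end):
    """Set every pixel on the segment from start to end (Bresenham's algorithm)."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy
    while True:
        if 0 <= y0 < len(canvas) and 0 <= x0 < len(canvas[0]):
            canvas[y0][x0] = 1  # pixels outside the canvas are silently skipped
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:
            error += dy
            x0 += step_x
        if doubled <= dx:
            error += dx
            y0 += step_y


def render_strokes(strokes, width, height):
    """Render stroke descriptions into a binary image (list of rows, 1 = ink)."""
    canvas = [[0] * width for _ in range(height)]
    for stroke in strokes:
        points = parse_stroke(stroke)
        # A single-point stroke is paired with itself so that it is drawn as a dot.
        for start, end in zip(points, points[1:] or points):
            draw_line(canvas, start, end)
    return canvas


# Hypothetical stroke: a horizontal segment followed by a vertical one.
image = render_strokes(["1,1;5,1;5,4"], width=7, height=6)
for row in image:
    print("".join("#" if pixel else "." for pixel in row))
```

A real generator would additionally handle stroke thickness and the canvas-size strategies described for the classification experiments; both are left out here to keep the sketch short.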
The image extractor for the Audiveris OMR dataset takes annotations and extracts sub-images that contain individual symbols, while the image inverter converts the white-on-black images from the Fornés dataset to black-on-white images. Finally, the entire dataset can be obtained and split into a training-, validation-, and test-set by calling a single script, the training dataset provider.

IV. BUILDING A UNIVERSAL MUSIC CLASSIFIER

A universal music classifier should be able to recognize all sorts of music symbols, regardless of whether they are handwritten or printed. Deep neural networks, especially convolutional neural networks, offer a convenient yet powerful way of solving computer vision tasks like the one at hand [7]. Therefore, we aim to build such a classifier by training a convolutional neural network on the presented dataset. Extending it to other notations is possible by adding a respective dataset. To the best of our knowledge, no such work has been done before.

V. DISCUSSION AND CONCLUSION

By providing tools for easily obtaining and merging multiple datasets, we believe that building a universal music symbol classifier can be reduced to the training of a suitable deep neural network. We evaluated this thesis by training various networks on the presented dataset, and our preliminary results are promising, with an error rate below 2% and over 98% precision and recall on an unseen test-set containing 10% of the data⁴. Our next step is to analyze the results and build a music symbol object detector on top of the classifier.

The united dataset is not perfect and currently suffers from being somewhat unbalanced: some classes have fewer than 10 instances while others have more than 1000, with the quarter note alone having almost 18000 samples. This poses a problem to any classifier that optimizes for accuracy on this dataset, as it might just learn the underlying distribution and simply ignore the classes with the fewest samples.
Therefore, there is a need to gather more samples from classes with in- sufficient instances. Furthermore, our dataset has the following limitations: • It currently contains modern notation symbols only. • Some datasets have one dedicated class for non- recognizable symbols, including text fragments and dy- namics. We incorporated that container class and store symbols in there, that currently do not fit our categoriza- tion as opposed to discarding them. In the next version, some symbols will be extracted from this container and put into their appropriate classes. • Despite their prominence, beamed notes are currently underrepresented, because most underlying datasets do not contain any or decompose them into primitives that can not be joined easily. To have the greatest possible impact, we publish all tools under a liberal MIT license along with a list of other OMR datasets at https://apacha.github.io/OMR-Datasets/. REFERENCES [1] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, vol. 1, no. 3, pp. 173–190, 2012. [2] J. Calvo-Zaragoza and J. Oncina, “Recognition of pen-based music notation: The HOMUS dataset,” in 2014 22nd International Conference on Pattern Recognition, Aug 2014, pp. 3038–3043. [3] J. j. Hajič and P. Pecina, “In search of a dataset for handwritten optical music recognition: Introducing MUSCIMA++,” arXiv preprint arXiv:1703.04824, vol. 1, pp. 1–16, 2017. [4] A. Fornés, A. Dutta, A. Gordo, and J. Lladós, “CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal,” International Journal on Document Analysis and Recognition (IJDAR), vol. 15, no. 3, pp. 243–251, 2012. [5] A. Rebelo, G. Capela, and J. S. Cardoso, “Optical recognition of music symbols,” International Journal on Document Analysis and Recognition (IJDAR), vol. 13, no. 1, pp. 19–31, 2010. 
[6] A. Fornés, J. Lladós, and G. Sánchez, Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 51–60.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.

⁴ https://github.com/apacha/MusicSymbolClassifier

Figure 4.1: A small sample of music symbols that are part of the collected music symbols dataset. It depicts ten different classes of handwritten and typeset symbols in modern notation.

CHAPTER 5
Music Object Detection

Being able to classify music symbols so well laid the foundations for building an object detector with a deep convolutional neural network, which in its simplest form is just a classifier over a sliding window. However, more advanced approaches for solving object detection with deep learning were used in subsequent experiments.

5.1 Handwritten Music Object Detection

The first experiments on solving music object detection with deep learning were conducted for the paper “Handwritten Music Object Detection: Open Issues and Baseline Results” [PCC+18], presented at the 13th IAPR International Workshop on Document Analysis Systems 2018 in Vienna, Austria.
The main idea is to unify all steps of music object detection into a single, learnable stage that can be solved by a deep convolutional neural network. The MUSCIMA++ dataset [HjP17] served as the data source because it provides a large body of handwritten music scores that were manually annotated. The full score images were preprocessed into smaller chunks to ease detection and fed sequentially into the network. Each image was first cropped such that every sub-image contains only one stave, and then cropped horizontally to maintain an aspect ratio of approximately 1:2 (see Fig. 5.1). It turned out later that while cropping images per stave makes sense, the additional horizontal cropping does not, because objects such as beams or slurs often cross crop boundaries and therefore cannot be detected reliably.

Various models and hyperparameter configurations were evaluated, with a Faster R-CNN model performing best. The mean average precision (mAP), a commonly used metric for object detection tasks, exceeded 80%, which is very high and comparable to the best results for detecting objects in natural images¹. We also showed that the removal of staves has no significant impact on the detection performance, in line with previously found evidence.

Figure 5.1: Illustration of the sliding window approach, used to crop music scores into sub-images (red boxes). Boxes overlap both vertically with the boxes above and below as well as horizontally with adjacent crops (orange).
¹ Top entry of the COCO Detection leaderboard [LPR+] as of 2018 was Megvii (Face++) with an Average Precision of 0.53 for IoU=0.5:0.05:0.95 and 0.73 for IoU=0.5, submitted 05.10.2017.

Handwritten Music Object Detection: Open Issues and Baseline Results

Alexander Pacha, Horst Eidenberger
Institute of Visual Computing and Human-Centered Technology, TU Wien, Vienna, Austria
{first name}.{last name}@tuwien.ac.at

Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg
Univ Rennes, CNRS, IRISA, F-35000 Rennes, France
{first name}.{last name}@irisa.fr

Richard Zanibbi
Rochester Institute of Technology, Rochester, USA
rlaz@cs.rit.edu

2018 13th IAPR International Workshop on Document Analysis Systems
978-1-5386-3346-5/18 $31.00 © 2018 IEEE, DOI 10.1109/DAS.2018.51

Abstract: Optical Music Recognition (OMR) is the challenge of understanding the content of musical scores. Accurate detection of individual music objects is a critical step in processing musical documents because a failure at this stage corrupts any further processing. So far, all proposed methods were either limited to typeset music scores or were built to detect only a subset of the available classes of music symbols. In this work, we propose an end-to-end trainable object detector for music symbols that is capable of detecting almost the full vocabulary of modern music notation in handwritten music scores. By training deep convolutional neural networks on the recently released MUSCIMA++ dataset, which has symbol-level annotations, we show that a machine learning approach can be used to accurately detect music objects with a mean average precision of over 80%.

Keywords: Optical Music Recognition; Object Detection; Handwritten Scores; Deep Learning

I. INTRODUCTION

Optical Music Recognition (OMR) attempts to understand the musical content of documents containing printed or handwritten music scores by recognizing the visual structure and the objects within a music sheet. Once all objects are recognized, a semantic reconstruction step attempts to understand the relations of objects to each other and recover the musical semantics. With recent advances in computer vision, accelerated by the popularity of deep convolutional neural networks (CNN), OMR received a number of groundbreaking contributions that generate very accurate results for particular sub-problems, such as staff line removal [1] or symbol classification [2].

In this work, we investigate the challenge of music object detection, which aims at accurately detecting music objects in music scores. Music objects can be either primitive glyphs (e.g. notehead, stem, beam) or compound symbols (e.g. notes, key signatures, time signatures) used in music notation. A music object detector takes an image and outputs the bounding box and class label for each found object. Traditionally, this was solved by first removing the staff lines, followed by symbol segmentation and classification [3] (see Figure 1).

[Figure 1 (diagram). Labels: Digital Image of Scores; Image Preprocessing; Staff line detection and removal; Music symbol segmentation; Music symbol classification; Music notation reconstruction; Music encoded data file; Playback, Reprinting; Music Object Detection; Optical Music Recognition.]

Figure 1. The traditional pipeline for Optical Music Recognition. Music object detection subsumes segmentation and classification of music symbols.

In this work, we present the first attempt to establish a baseline for music object detection of handwritten scores with the full vocabulary of modern music notation. By following a machine learning approach and using an end-to-end trainable object detector on the recently published MUSCIMA++ dataset, we demonstrate how to build a generalizable and accurate music object detector and investigate the effects of various technical choices, like the use of a particular detector or feature extractor.

II. RELATED WORK

Visual object detection is a very active field of research, with remarkable results on detecting objects in natural images and a variety of active competitions. Many competing approaches have been proposed in the last few years, such as Faster R-CNN [4], R-FCN [5] and single-shot detectors [6], [7]. While some optimize for accuracy, others strive for high speed [8]. However, all of them rely heavily on deep convolutional neural networks.

The traditional pipeline of segmenting and classifying symbols has been shown to work well on simple typeset music scores with a known music font [9]. But when considering low-quality images, complex scores or even handwritten ones [10], these systems tend to fail, mainly because errors propagate from one step to subsequent steps [11], e.g. a segmentation error could cause incorrectly detected objects. Initial attempts to overcome this limitation by directly detecting music objects with CNNs were made by Hajič and colleagues, who suggest an adaptation of Faster R-CNN with a custom region proposal mechanism based on the morphological skeleton to accurately detect noteheads [12], and by Choi and colleagues, who are able to detect accidentals in dense piano scores with high accuracy, given previously detected noteheads that are used as an input feature to the network [13]. However, both are limited to experiments on a tiny subset of the full vocabulary used in modern music notation. Although both approaches can be extended to other classes, it remains an open question whether a general-purpose detector that can learn a large vocabulary is superior to multiple class-specific detectors.

A very interesting alternative to the traditional OMR pipeline is the attempt to solve OMR in a holistic fashion. The first notable attempt at doing so was by Pugin [14], who used Hidden Markov Models to read typographic prints of early music. More recently, the combination of CNNs with Recurrent Neural Networks to build an end-to-end trainable OMR system [15] was adapted and extended in [16] and [17]. Both train very similar models on a very large set of monophonic music scores containing a single staff per image. Although the reported results on the given datasets are very good, the latter two systems currently exhibit the following limitations:

• They operate only on very primitive, printed, monophonic scores. Extending their pipeline to more complex music scores with multiple voices requires a different formulation of the output data to at least include onset and offset of each note and not only the pitch and duration.
• By using pooling operations during the feature extraction, the network gains location invariance that conflicts with the interest of precise location information, which is needed to correctly infer the pitch of a note.
• By omitting the positional information of individual symbols and only considering the audible information of music symbols as output, such systems restrict themselves to replayability, as reprinting of music scores requires precise positional information [18].

While in theory semantic segmentation of the scores would go one step further and extract considerably more information (basically a classification of each pixel), two things should be noted: classifying pixels assumes that the class of each pixel is unique and mutually exclusive [19], an assumption that might not hold for overlapping symbols but can probably be ignored for practical applications; and most traditional systems that attempt to perform semantic reconstruction operate on detected objects, not on individual pixels, thus requiring a clustering step after the semantic segmentation. Therefore, we argue that detecting bounding boxes of musical objects directly is preferable for OMR.

III. THE CHALLENGE OF DETECTING MUSIC SYMBOLS

When comparing music object detection to detection of objects in natural scenes or optical character recognition, two unique challenges are worth noting. Firstly, music scores often have a very high density of objects, with more than 1000 objects printed on a single page. Secondly, the relative position between a symbol and its staff lines is crucial. Even a tiny error along the y-axis may have a significant impact on recovering the correct pitch of a note.

Figure 2. Beginning of Franz Schubert’s Ave Maria D. 839, with simplifications in the second bar that intentionally violate the syntactic rules of common music notation.

The detection of music objects is of paramount importance to the overall OMR process because once all symbols have been detected accurately, a set of rules can be applied to infer the semantics of the objects and perform music notation reconstruction, as demonstrated by [20].
We also suggest that the point right after individual objects have been detected and classified is probably the best moment for putting the user into the loop, if that is intended. Fixing errors at this stage can be performed locally without dealing with complicated semantic rules or affecting neighboring symbols (changing the duration of a single note in a music notation program often entails side effects on other notes within the same or subsequent bars). Highlighting uncertain detections and suggesting likely alternatives could improve the usability and reduce editing costs even further.

Note that even with all symbols being correctly detected and classified, recovering the musical semantics still remains a very challenging problem, as demonstrated in Figure 2. Here, the second staff in the first bar contains a small 6 for each tuplet, indicating that the first rest and the following five chords sum up to a quarter note. This small number is intentionally omitted in the second bar for simplification, but the bar would now result in an invalid meter if interpreted in isolation. Only with the preceding information and prior knowledge about common simplifications can a musician interpret such scores correctly.

To be able to introduce such semantics into an OMR system, it is necessary to formalize and use musical notation knowledge. Rule-based systems can perform such formalization. For example, with the DMOS system [20] it has been possible to formalize the musical notation, graphically and syntactically, for full polyphonic scores, and to produce a system that can assign notes to multiple voices and use the vertical alignment of synchronized notes in orchestral scores as well as the number of beats in a bar to detect and correct recognition errors. This grammatical formalization is built on terminals which correspond to the musical objects we propose to recognize with deep convolutional neural networks.
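The beat-counting check mentioned above can be illustrated with a few lines of exact arithmetic. This is only a sketch of the idea, not how DMOS or any other system implements it: the function names, the event representation as `(written, actual, normal)` triples and the assumption of a 4/4 bar are all hypothetical.

```python
from fractions import Fraction
from typing import Iterable, Tuple

# Hypothetical event format: (written duration as a fraction of a whole note,
# number of notes in the tuplet, number of notes they replace).
Event = Tuple[Fraction, int, int]


def sounding_duration(written: Fraction, actual: int = 1, normal: int = 1) -> Fraction:
    """Duration of one event after applying a tuplet ratio actual:normal."""
    return written * Fraction(normal, actual)


def bar_is_complete(events: Iterable[Event], beats: int, beat_unit: int) -> bool:
    """True if the events exactly fill a bar of the given time signature."""
    total = sum((sounding_duration(*e) for e in events), Fraction(0))
    return total == Fraction(beats, beat_unit)


# Six sixteenth-note events marked as a sextuplet (6 in the time of 4) ...
sextuplet = [(Fraction(1, 16), 6, 4)] * 6
# ... and the same six events with the tuplet number omitted.
unmarked = [(Fraction(1, 16), 1, 1)] * 6
```

With the tuplet ratio, one group sums to a quarter note, so four groups fill an assumed 4/4 bar. Without it, the same symbols add up to more than the bar allows, which is the kind of inconsistency a rule-based stage can flag or repair.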
IV. BUILDING A MUSIC OBJECT DETECTOR

For building a robust and extensible music object detector, we propose a machine-learning approach with deep convolutional neural networks, which operate directly on the input image. This simplifies the OMR process to the following steps: preprocessing, music object detection, and semantic reconstruction. Steps such as removing the staff lines and segmenting symbols do not need to be addressed explicitly.

Existing state-of-the-art object detectors such as Faster R-CNN or R-FCN were designed to detect objects in natural scenes and have been shown to work well on challenging datasets such as COCO [21] or ImageNet [22]. But applying them out of the box to sheets of music can lead to suboptimal performance, due to the dense nature of music scores with many tiny objects. Therefore, we suggest applying a certain amount of preprocessing to the data and tailoring these detectors to perform well on the task at hand.

A. Dataset and Preprocessing Steps

For training a music object detector, we use the MUSCIMA++ dataset [23], as it contains 140 high-quality images with over 90,000 symbol-level annotations, made by human annotators across 105 different classes of music symbols for the underlying CVC-MUSCIMA dataset [24]. The images have a high resolution of about 3500×2000 pixels, are binarized and optionally come with staff lines removed. For consistency, all white-on-black images are first inverted and then converted to RGB, as the evaluated implementations take colored images as input¹. To efficiently train an object detector on such images, the image size has to be reduced.
We propose to crop the images in a context-sensitive way, by cutting images first vertically and then horizontally, such that each image contains exactly one staff and has a width-to-height ratio of no more than 2:1, with about 15% horizontal overlap to adjacent slices (see Figure 3). Basically, each horizontal slice extends from the bottom of the staff above to the top of the staff below. This cropping can also be done by automatically detecting staffs and then applying the same slicing rules, leading to image crops that partially overlap both horizontally and vertically.

For splitting the cropped images into a training and a test set, we follow the recommendations from [23] to ensure that the test set contains scores of all complexities and that there is no overlap of writers between the training and the test set. We furthermore used 10% of the remaining training set for validation during the training. In total, we obtained 6181 samples that were divided into a training, validation and test set, containing 4794, 533 and 854 images respectively.

One limitation of this approach is that all objects significantly exceeding the size of such a cropped region will not appear in the data, as only annotations that have an intersection-over-area of 0.8 or higher between the object and the cropped region are considered part of the ground truth.

¹ The overhead created by this conversion is only minimal, as the duplicated information gets merged again in the first layer of the CNN.

Figure 3. Illustration of the sliding window approach, used to crop music scores into meaningful subimages (red) with horizontally overlapping areas (orange) between adjacent crops.

As music objects, we consider the full vocabulary of all 105 classes contained in the MUSCIMA++ dataset, containing both primitives such as noteheads as well as compound objects such as key signatures that consist of one or multiple accidentals.

B. Experimental Design

For evaluating our suggested approach, we conducted several experiments to study the performance of various object detectors and feature extractors, as well as the effects of staff line removal, transfer learning and removing classes with rare symbols. Using the deep learning library TensorFlow², we adapted the work from [8] to detect music objects by training on the data described in Section IV-A. The entire source code, including training protocols and detailed instructions to reproduce our results, can be found at http://github.com/apacha/MusicObjectDetector-TF. We considered:

• the three meta-architectures Faster R-CNN, R-FCN, and SSD as object detectors. Faster R-CNN and R-FCN are both two-stage detectors with a region proposal network and a region classifier. The difference is that Faster R-CNN uses a sliding window for classification, whereas R-FCN uses position-sensitive score maps and per-RoI pooling, which is more efficient at the cost of a slightly reduced precision. SSD is a generalized region proposal network for one-stage detection on multiple feature maps
• ResNet50, Inception-ResNet-v2, MobileNet-v1 and Inception-v2 as feature extractors, explicitly excluding custom-made networks that cannot benefit from transfer learning
• images with and without staff lines (based on the images provided along with the CVC-MUSCIMA dataset)
• the full vocabulary of all 105 classes included in the MUSCIMA++ dataset, as well as a reduced set of only 71 classes, removing 34 classes that appear fewer than 50 times in the ground truth and are only of minor importance, such as uncommon numerals and letters. Exceptions were only made for the classes double sharp and the numerals 5, 6, 7 and 8: although they appear fewer than 50 times in the dataset, we consider them essential to recover music semantics such as pitch and time signature.
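The reduction of the vocabulary described in the last item can be sketched as a frequency filter with an explicit list of exceptions. This is a minimal illustration rather than the authors' code: the function name `reduced_vocabulary` and the exact label strings are assumptions, and the sketch applies only the frequency criterion, whereas the paper additionally judged the importance of each class.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set

# Classes kept although they occur fewer than 50 times.
# The label spellings are assumed for illustration.
ALWAYS_KEEP: FrozenSet[str] = frozenset(
    {"double sharp", "numeral 5", "numeral 6", "numeral 7", "numeral 8"}
)


def reduced_vocabulary(labels: Iterable[str], min_count: int = 50,
                       always_keep: FrozenSet[str] = ALWAYS_KEEP) -> Set[str]:
    """Return the class names that are frequent enough or explicitly protected."""
    counts = Counter(labels)
    return {name for name, n in counts.items()
            if n >= min_count or name in always_keep}
```

Annotations whose class is not in the returned set would then be dropped from the ground truth before training and evaluation.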
² https://www.tensorflow.org, last seen 9th February 2018

Figure 4. Typical sample of a cropped image that serves as input for the music object detector.

All of the above-mentioned object detectors have a certain set of hyperparameters that need to be fine-tuned for the particular dataset. For example, [7] shows that using statistical analysis to obtain a sensible number of anchor boxes, anchor box sizes, and anchor box ratios can improve the results significantly compared to handpicked priors. When running a similar analysis on the cropped images, we obtain the following characteristics: for a typical input image of 600 pixels width and 300 pixels height (see Figure 4), the average square box size is about 37 pixels with a standard deviation of 48 pixels. Note that the dataset also contains extreme cases of small objects like dots with only a few pixels and large objects that span hundreds of pixels. The mean ratio of width to height of boxes is 0.7, which indicates that boxes tend to be taller than they are wide. Furthermore, cropped images that are to be fed to the detector contain 19 symbols on average, with a standard deviation of 11. Concluding the analysis, we decided to use a grid of 32×32 pixels with a stride of 8 pixels and aspect ratios of 0.06, 0.29, 0.48, and 2.2 with the scales 0.25, 0.5, 0.75, 1.0, 1.75, and 4.0 to reflect the wide range of object shapes in the dataset.

C. Evaluation and Results

Following the evaluation protocols of the Pascal VOC challenge [25], we report the mean average precision (mAP) for each completed training in Table I and the detailed average precision per class for the combination that yielded the best results in Table II.
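For readers unfamiliar with the metric, the average precision of a single class can be sketched as follows. This is a simplified, single-image illustration and not the evaluation code behind the reported numbers: the function names are hypothetical, and the IoU threshold of 0.5 and the all-point interpolation of the precision-recall curve are assumptions about the protocol variant. The mAP is then the mean of this value over all classes.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def average_precision(detections: Sequence[Tuple[float, Box]],
                      ground_truth: Sequence[Box],
                      iou_threshold: float = 0.5) -> float:
    """Area under the interpolated precision-recall curve for one class."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits: List[bool] = []
    # Visit detections from most to least confident; each ground-truth
    # box may be matched only once.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in ground_truth]
        best = max(range(len(ground_truth)), key=overlaps.__getitem__)
        if overlaps[best] >= iou_threshold and not matched[best]:
            matched[best] = True
            hits.append(True)
        else:
            hits.append(False)
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap
```

A full evaluation would pool the detections of one class across all test images before ranking them; the sketch omits this bookkeeping for brevity.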
Figure 5 shows a typical detection within a single image. We find that the best performing detector with regard to precision is the Faster R-CNN using the Inception-ResNet-v2 feature extractor, pre-trained on the COCO dataset. This model produces a mAP of over 80%. The training on a GeForce GTX 1080 Ti takes approximately one day per configuration before results become stable. Validating ~500 images takes about 2-4 minutes, so inference should take less than half a second per (cropped) image. When comparing the results of training on images with and without staff lines, the difference is not significant, supporting the claim of [14] that staff line removal might no longer be necessary. However, readers should also note that the staff lines in the CVC-MUSCIMA dataset are synthetic and do not exhibit the usual distortions that apply to scans or pictures of real music scores.

Figure 5. Typical detection results with most symbols recognized correctly.

Other detectors like the R-FCN or SSD produce good results as well, with a mAP of 75% and 71% respectively. Our results, therefore, agree with the findings of [8], where in particular the SSD model trades some accuracy for higher processing speed. Using pre-trained weights instead of random initialization, and the RMSprop optimizer as opposed to Stochastic Gradient Descent, improved the results significantly and sped up convergence; both were therefore used throughout the experiments. Modifying the set of classes by removing underrepresented classes as described in Section IV-B boosted the mAP by up to 6% in some cases. Note that Table II is missing six classes that did not have any instances in the test set because they exceeded the size of the image crops and were thus discarded during the preprocessing.

V. DISCUSSION AND CONCLUSION

In this work, we show that state-of-the-art deep learning detectors like Faster R-CNN, R-FCN and SSD can produce accurate detection results on a wide range of music symbols.
After optimizing different hyperparameters, we achieve a mAP of over 80%, which is a solid baseline. However, there are still a couple of open issues that need to be addressed in future work, like how to process a whole page of a score. In this work, we used a simple overlapping sliding window approach. This method, although simple to use, has many well-known downsides, like the poor performance of processing empty images or cutting up large symbols, as well as a non-trivial merging step that has to fuse information from multiple overlapping sections.

Another problem, specific to OMR, is the inherent imbalance of symbol classes: some symbols like noteheads are extremely frequent, whereas others like double sharps are rare and often tied to a specific type of score. Having experimented with state-of-the-art deep learning object detectors, we found that classes do not interact with each other: simplifying the task by removing line-shaped classes did not improve the overall precision. There also seems to be a minimum threshold of about 20 samples per class for a class to be meaningful during the training. Currently, there is no guarantee that the model does not overfit, but with recently published work like RetinaNet and its focal loss [26], the effects of this class imbalance could be mitigated to improve the training, especially on hard-to-detect classes.

Table I
DETAILED RESULTS FOR VARIOUS HYPERPARAMETER COMBINATIONS OF THE MUSIC OBJECT DETECTOR.
Meta-Architecture | Feature Extractor | Number of classes | Images have staff lines | Mean Average Precision on Test Set (%) | Weighted Mean Average Precision on Test Set (%)
--- | --- | --- | --- | --- | ---
Faster R-CNN | Inception-ResNet-v2 | 105 | ✓ | 81.56 | 94.22
Faster R-CNN | Inception-ResNet-v2 | 105 | ✗ | 81.23 | 94.56
Faster R-CNN | Inception-ResNet-v2 | 71 | ✓ | 85.12† | 94.68
Faster R-CNN | Inception-ResNet-v2 | 71 | ✗ | 87.80‡ | 95.05
Faster R-CNN | ResNet50 | 105 | ✓ | 76.39 | 93.07
Faster R-CNN | ResNet50 | 105 | ✗ | 78.45 | 93.10
Faster R-CNN | ResNet50 | 71 | ✓ | 82.30 | 93.47
Faster R-CNN | ResNet50 | 71 | ✗ | 84.85 | 93.63
R-FCN | Inception-ResNet-v2 | 105 | ✓ | 69.75 | 89.12
R-FCN | Inception-ResNet-v2 | 105 | ✗ | 70.88 | 89.42
R-FCN | ResNet50 | 105 | ✓ | 75.53 | 92.59
R-FCN | ResNet50 | 105 | ✗ | 74.29 | 92.33
SSD | Inception-v2 | 105 | ✓ | 71.52 | 82.44
SSD | Inception-v2 | 105 | ✗ | 70.40 | 81.75
SSD | MobileNet-v1 | 105 | ✓ | 62.30 | 74.97
SSD | MobileNet-v1 | 105 | ✗ | 61.56 | 76.74

Although we used the test set proposed by the MUSCIMA++ authors, where writers in the test set do not appear in the training set, we are still not certain whether this system is truly writer-independent. One way to confirm this would be to perform a cross-validation, where each writer in the dataset is evaluated independently.

Finally, we have shown that removing staff lines can be omitted for music object detection when using CNNs. Future experiments that apply data augmentation using noise models and deformed images, as proposed for the staff removal challenge [27], can give even more insights into the robustness of our approach.

Table II
DETAILED PRECISION RESULTS PER CLASS FOR THE BEST OBTAINED MUSIC OBJECT DETECTOR ON THE REDUCED SET OF CLASSES (SEE TABLE I, LINE 3† AND 4‡).

Class name | Total number of instances | Average precision on the test set (%), with staff lines† | Average precision on the test set (%), w/o staff lines‡
--- | --- | --- | ---
notehead-full | 31084 | 99.85 | 99.64
stem | 27108 | 98.82 | 98.71
ledger line | 14500 | 97.89 | 97.40
beam | 8677 | 93.86 | 94.57
slur | 3859 | 90.34 | 88.54
duration-dot | 3195 | 95.12 | 94.21
thin barline | 3071 | 99.49 | 99.64
8th flag | 2744 | 93.46 | 93.37
measure separator | 2649 | 43.64 | 52.09
staccato-dot | 2507 | 94.23 | 94.97
sharp | 2420 | 99.42 | 99.46
notehead-empty | 2385 | 99.31 | 99.11
flat | 1467 | 96.97 | 97.98
natural | 1427 | 96.90 | 97.61
dynamics text | 1374 | 85.25 | 87.12
8th rest | 1339 | 98.86 | 99.36
tie | 1085 | 82.39 | 81.85
quarter rest | 1060 | 96.05 | 96.78
letter p | 1038 | 89.70 | 89.84
letter f | 1035 | 93.10 | 92.77
letter e | 926 | 82.12 | 85.29
letter r | 750 | 51.64 | 62.25
key signature | 697 | 79.31 | 77.80
letter o | 655 | 94.47 | 93.82
16th flag | 652 | 36.62 | 40.19
letter s | 649 | 71.89 | 74.30
grace-notehead-full | 576 | 85.75 | 85.37
numeral 3 | 548 | 98.73 | 98.04
16th rest | 531 | 96.17 | 99.93
letter t | 513 | 92.10 | 94.42
other text | 508 | 83.99 | 89.30
letter c | 469 | 89.82 | 88.57
tuple | 459 | 30.41 | 77.11
accent | 421 | 99.08 | 95.75
g-clef | 403 | 100.00 | 100.00
other-dot | 402 | 94.40 | 95.19
repeat-dot | 359 | 99.75 | 100.00
trill | 315 | 100.00 | 99.74
letter d | 313 | 93.49 | 89.36
letter m | 293 | 74.19 | 74.43
f-clef | 285 | 100.00 | 98.21
half rest | 241 | 95.53 | 91.16
time signature | 221 | 96.33 | 95.02
tenuto | 216 | 88.45 | 74.79
letter l | 192 | 78.75 | 86.00
c-clef | 190 | 97.68 | 98.68
whole rest | 183 | 90.73 | 84.66
letter P | 177 | 45.83 | 45.80
tempo text | 174 | 69.40 | 78.32
letter i | 171 | 66.48 | 81.08
letter n | 164 | 79.51 | 80.26
numeral 4 | 155 | 99.60 | 99.47
letter a | 134 | 90.36 | 83.81
multiple-note tremolo | 126 | 81.01 | 82.42
ornament(s) | 123 | 85.22 | 83.90
letter M | 115 | 65.83 | 71.47
grace strikethrough | 110 | 98.14 | 97.96
letter u | 106 | 65.98 | 62.69
repeat | 73 | 84.42 | 88.87
double sharp | 44 | 100.00 | 100.00
numeral 2 | 40 | 100.00 | 92.50
numeral 6 | 36 | 100.00 | 100.00
numeral 8 | 36 | 100.00 | 91.67
numeral 7 | 24 | 28.32 | 62.59
numeral 5 | 11 | 26.67 | 100.00

ACKNOWLEDGMENT

The authors would like to thank all creators of public OMR datasets for collecting them and making them freely available to other researchers.

REFERENCES
[1] A.-J. Gallego and J. Calvo-Zaragoza, “Staff-line removal with selectional auto-encoders,” Expert Systems with Applications, vol. 89, pp. 138–148, 2017.
[2] A. Pacha and H. Eidenberger, “Towards a universal music symbol classifier,” in Proceedings of the 12th IAPR International Workshop on Graphics Recognition, IAPR TC10 (Technical Committee on Graphics Recognition). New York, USA: IEEE Computer Society, 2017, pp. 35–36.
[3] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” International Journal of Multimedia Information Retrieval, vol. 1, no. 3, pp. 173–190, 2012.
[4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[5] Y. Li, K. He, J. Sun et al., “R-FCN: Object detection via region-based fully convolutional networks,” in Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[7] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
[8] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” CoRR, vol. abs/1611.10012, 2016.
[9] F. Rossant and I. Bloch, “Robust and adaptive OMR system including fuzzy modeling, fusion of musical rules, and possible error detection,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, p. 081541, 2006.
[10] A. Baró, P. Riba, and A. Fornés, “Towards the recognition of compound music notes in handwritten music scores,” in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). Institute of Electrical and Electronics Engineers Inc., 2016, pp. 465–470.
[11] A. Pacha and H. Eidenberger, “Towards self-learning optical music recognition,” in Proceedings of the 16th IEEE International Conference On Machine Learning and Applications, 2017, in print.
[12] J. Hajič jr. and P. Pecina, “Detecting noteheads in handwritten scores with convnets and bounding box regression,” arXiv preprint arXiv:1708.01806, 2017.
[13] K.-Y. Choi, B. Coüasnon, Y. Ricquebourg, and R. Zanibbi, “Bootstrapping samples of accidentals in dense piano scores for CNN-based detection,” in Proceedings of the 12th IAPR International Workshop on Graphics Recognition, IAPR TC10 (Technical Committee on Graphics Recognition). New York, USA: IEEE Computer Society, 2017.
[14] L. Pugin, “Optical music recognition of early typographic prints using hidden Markov models,” in ISMIR, 2006, pp. 53–56.
[15] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2016.
[16] J. Calvo-Zaragoza, J. J. Valero-Mas, and A. Pertusa, “End-to-end optical music recognition using neural networks,” in 18th International Society for Music Information Retrieval Conference, 2017.
[17] E. van der Wel and K. Ullrich, “Optical music recognition with convolutional sequence-to-sequence models,” in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, 2017.
[18] H. Miyao and R. M. Haralick, “Format of ground truth data used in the evaluation of the results of an optical music recognition system,” in IAPR Workshop on Document Analysis Systems, 2000, pp. 497–506.
[19] J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga, “A machine learning framework for the categorization of elements in images of musical documents,” in Third International Conference on Technologies for Music Notation and Representation. A Coruna: University of A Coruna, 2017.
[20] B. Coüasnon, “DMOS: a generic document recognition method, application to an automatic generator of musical scores, mathematical formulae and table structures recognition systems,” in Proceedings of Sixth International Conference on Document Analysis and Recognition, 2001, pp. 215–220.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, Microsoft COCO: Common Objects in Context. Cham: Springer International Publishing, 2014, pp. 740–755.
[22] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
[23] J. Hajič jr. and P. Pecina, “The MUSCIMA++ dataset for handwritten optical music recognition,” in Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, 2017.
[24] A. Fornés, A. Dutta, A. Gordo, and J. Lladós, “CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal,” International Journal on Document Analysis and Recognition (IJDAR), vol. 15, no. 3, pp. 243–251, 2012.
[25] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
[26] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” CoRR, vol. abs/1708.02002, 2017.
[27] A. Fornés, A. Dutta, A. Gordo, and J. Lladós, The 2012 Music Scores Competitions: Staff Removal and Writer Identification. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 173–186.

5.2 General Music Object Detection

Given that multiple competing approaches for solving music object detection were developed simultaneously, Jorge Calvo-Zaragoza, Jan Hajič jr., and I joined forces to evaluate these approaches on common ground, with the same datasets and the same evaluation protocol. Our efforts resulted in the paper “A Baseline for General Music Object Detection with Deep Learning,” published in a special issue of the Applied Sciences journal [PHjCZ18]. The three approaches Faster R-CNN, RetinaNet, and U-Net were evaluated on three different datasets: DeepScores [TES+18, ETPS18], MUSCIMA++ [HjP17] and Capitan [PCZ18].
The same evaluation protocol was used for all experiments, reporting the mAP and the weighted mAP. In contrast to the previous paper [PCC+18], the stricter version of the metric was used, as defined by the COCO evaluation protocol [LMB+14]. This means that the average precision was not just taken at a single point where the Intersection over Union (IoU) is 50%, but averaged across a range of IoU thresholds from 50% to 95% in steps of 5%.

The results were mixed but particularly disappointing for the Faster R-CNN method on the MUSCIMA++ dataset, which only produced a mAP of 3.9%, as opposed to over 80% in previous research [PCC+18]. There are two reasons for this drop. First, the evaluation metric became much stricter. Second, we trained both the Faster R-CNN and the RetinaNet network on the entire image instead of on sub-images. This increased the required memory so much that we were forced to reduce the image sizes, which in turn caused small objects to nearly disappear. On the Capitan dataset, which contained mostly bigger objects, Faster R-CNN and RetinaNet performed much better, with 15.2% mAP and 14.5% mAP, respectively.

In contrast, U-Nets are more independent of the input image size, since they contain convolutional filters only. Therefore, they could process the entire image in its full resolution. They also avoid a problem of Faster R-CNN, whose limited number of internal region proposals can become an issue in densely populated regions.

Unfortunately, the Deep Watershed Detector [TESS18] was not yet available when this paper was written. However, according to Lukas Tuggener, their results on the DeepScores dataset were approximately as good as ours.
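To illustrate why the stricter metric hurts small symbols in particular, here is a minimal sketch of the IoU computation and the ten COCO-style thresholds. The function names are illustrative and not taken from the COCO toolkit, and the box layout (y1, x1, y2, x2) is an assumption. The full protocol additionally ranks detections by confidence and averages precision over recall levels, which is omitted here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (y1, x1, y2, x2)."""
    ay1, ax1, ay2, ax2 = box_a
    by1, bx1, by2, bx2 = box_b
    # Height and width of the overlap region (zero if the boxes are disjoint).
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter = inter_h * inter_w
    union = (ay2 - ay1) * (ax2 - ax1) + (by2 - by1) * (bx2 - bx1) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 as used by the stricter metric.
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_passed(predicted, ground_truth):
    """Count at how many of the ten thresholds the prediction is a match."""
    value = iou(predicted, ground_truth)
    return sum(1 for t in IOU_THRESHOLDS if value >= t)


# The same one-pixel offset on a small and on a large box.
small = thresholds_passed((0, 0, 10, 10), (1, 1, 11, 11))
large = thresholds_passed((0, 0, 100, 100), (1, 1, 101, 101))
```

With a one-pixel offset, the 10 × 10 box reaches an IoU of 81/119 ≈ 0.68 and matches at only 4 of the 10 thresholds, whereas the 100 × 100 box still matches at all 10. This is consistent with the observation that downscaled small objects suffer most under the averaged metric.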
applied sciences

Article

A Baseline for General Music Object Detection with Deep Learning

Alexander Pacha 1,*, Jan Hajič, Jr. 2 and Jorge Calvo-Zaragoza 3

1 Institute for Visual Computing and Human-Centered Technology, TU Wien, 1040 Wien, Austria
2 Institute of Formal and Applied Linguistics, Charles University, 116 36 Staré Město, Czech Republic; hajicj@ufal.mff.cuni.cz
3 PRHLT Research Center, Universitat Politècnica de València, 46022 València, Spain; jcalvo@upv.es
* Correspondence: alexander.pacha@tuwien.ac.at

Received: 31 July 2018; Accepted: 26 August 2018; Published: 29 August 2018

Abstract: Deep learning is bringing breakthroughs to many computer vision subfields including Optical Music Recognition (OMR), which has seen a series of improvements to musical symbol detection achieved by using generic deep learning models. However, so far, each such proposal has been based on a specific dataset and different evaluation criteria, which made it difficult to quantify the new deep learning-based state-of-the-art and assess the relative merits of these detection models on music scores. In this paper, a baseline for general detection of musical symbols with deep learning is presented. We consider three datasets of heterogeneous typology but with the same annotation format, three neural models of different nature, and establish their performance in terms of a common evaluation standard. The experimental results confirm that direct music object detection with deep learning is indeed promising, but at the same time illustrate some of the domain-specific shortcomings of the general detectors. A qualitative comparison then suggests avenues for OMR improvement, based both on properties of the detection model and how the datasets are defined. To the best of our knowledge, this is the first time that competing music object detection systems from the machine learning paradigm are directly compared to each other.
We hope that this work will serve as a reference to measure the progress of future developments of OMR in music object detection.

Keywords: optical music recognition; deep learning; object detection; music scores

Appl. Sci. 2018, 8, 1488; doi:10.3390/app8091488 www.mdpi.com/journal/applsci

1. Introduction

Optical Music Recognition (OMR) is the field of research that investigates how to computationally read music notation in documents. Having accurate OMR technology would enable fully integrating written music into the ecosystem of digital music processing.

In recent years, diverse initiatives have been launched to digitize musical heritage in the written form, such as the Digital Image Archive of Medieval Music project [1] on the academic side, or the crowd-sourced International Music Score Library Project (IMSLP) repository of public-domain and openly available music [2], which has grown to become a primary provider of sheet music worldwide. However, making not only the digital images of all these compositions, but also their structured representation accessible at scale, as attempted, e.g., by the Single Interface for Music Score Searching and Analysis (SIMSSA) project [3], would constitute a breakthrough in interacting with written music, and make it accessible to both professionals and the general public in previously unseen ways: content-based search in large sheet music libraries including cross-modal retrieval, digital musicology at scale and with access to structured representations of music that only exists in written form, renotation of early notation to modern notation, manuscript transcription and part-matching to directly cut costs of music directors and composers. These (and more) applications have been envisioned in OMR literature for a long time [4,5]; however, results have not been forthcoming [6].

In order to be able to apply Music Information Retrieval (MIR) algorithms on music scores and enable this wide range of applications, it is first necessary to bring them into a symbolic, machine-readable format. Manually creating such symbolic representations by means of specialized music typesetting software is an expensive effort, and constitutes the bottleneck to digitally encoding music at large scales, which is, in turn, a bottleneck for digital musicology, subsequent MIR applications, and music accessibility. OMR is expected to provide the enabling technology for scalable structured encoding. From this perspective, OMR can be seen as the key to diversifying the available symbolic music sources in reasonable time and cost.

Crucially, OMR has seen a shift in paradigms in the last few years, mainly triggered by advances in the field of computer vision and machine learning through deep learning [7–10]. This development is further fueled by the availability of large annotated datasets (e.g., MUSCIMA++, DeepScores) and sufficient computational power to work with such large datasets. This new paradigm, combined with a better understanding of the challenges [11,12], allows approaching the problem of OMR somewhat differently. The entire process of OMR can be broken down into the following steps [6,13–15]:

1. Preprocessing: Standard techniques to ease further steps, e.g., contrast enhancement, binarization, skew-correction or noise removal. Additionally, the layout should be analyzed to allow subsequent steps to focus on actual content and ignore the background.

2. Music Object Detection: This step is responsible for finding and classifying all relevant symbols or glyphs in the image.
Note that music object detection is sometimes referred to as music symbol recognition, but we use the former term because of its relation to “object detection”, which is commonly used in computer vision to refer to the very same localization and classification task in (natural) images, answering the question “What is where in this image?”.

3. Relational Understanding: From the detected and classified symbols, a music notational graph (MuNG) can be constructed that holds both the symbols and their relationships to each other. Note that, for a complete and unambiguous reconstruction, two kinds of relations are necessary: a logical relationship (e.g., between a notehead and a stem) and a temporal relationship to guarantee the correct order of the symbols. The graph formulation essentially re-casts the notation reconstruction algorithms like that of [16] as a problem of recovering binary labels over symbol pairs, therefore also making it amenable to machine learning approaches. Again, other works sometimes refer to the stage after object detection as semantical reconstruction. Note that, in this approach, this stage only attempts to reconstruct the relations between symbols and a large part of the semantics is assigned in the encoding stage.

4. Encoding: Given a complete music notation graph, the music can be encoded into any output format unambiguously, e.g., into MIDI for playback or MusicXML/MEI for further editing in a music notation program. Keep in mind that this step potentially has to deal with the subtleties of music notation, such as omitted symbols.

Currently, the hardest challenge of this pipeline is posed by the music object detection step. Unfortunately, it is unclear to what extent deep learning has been successful in addressing this stage. Existing studies that focus on music notation objects are dispersed and not comparable with each other in terms of the used algorithms, datasets, and metrics, which has so far made a fair comparison impossible.
However, there is no good reason for this state of affairs: music object detection can borrow standard evaluation from generic object detection settings, and the deep learning models are similarly domain-agnostic. Therefore, this work aims to fill an obvious gap: provide a direct comparison between the different general deep learning models for object detection that were recently proposed for the task of music object detection, across the available musical symbol datasets, and thus establish a clear state-of-the-art baseline.

We evaluate three competing approaches on three distinct datasets containing both handwritten and typeset music. To compare the different approaches on common ground, we propose a standard bounding-box based data model, usable with multiple OMR datasets, and use an up-to-date standard for evaluating object detection, namely the Common Objects in Context (COCO) evaluation protocol [17]. All scripts for obtaining the test-bed, preprocessing the data and evaluating the results are being made publicly available [18].

To the best of our knowledge, this marks the first time that music object detection methods based on machine learning are directly compared against each other. Bellini et al. [19] evaluated a number of commercial OMR applications in 2007, but it was done manually, making it difficult to replicate, and, more importantly, the systems have no published descriptions, which means the comparison has limited value for guiding future developments. The evaluation methodology in [19] also does not correspond to current object detection evaluation protocols.

2. Background on Music Object Detection

Traditionally, OMR has been approached by workflows composed of several stages, as outlined in the previous section. In addition, these stages were further subdivided into smaller steps. Inside of the music object detection stage, the key step used to be the staff-line detection and removal [20]. Although staves are essential for the understanding of music notation, their presence hindered the isolation of musical primitives using classical algorithms such as connected-components analysis. That is why, for many years, much research was devoted to improving staff-line removal [21]. Currently, thanks to the use of deep neural networks, staff-line removal can be considered a solved problem, with selectional auto-encoders outperforming all previously existing methods given a sufficient amount of training data [22].

However, even with an ideal staff-line removal algorithm, isolating musical symbols by means of connected components remains problematic, since multiple primitives could be connected to each other (e.g., a beam group can be a single connected component that includes several heads, stems, and beams) or a single unit can have multiple disconnected parts (e.g., a fermata, voltas, f-clef). The second case is particularly severe in the context of handwritten notation, where symbols can be written with such a high variability (e.g., detached noteheads) that modeling all possible appearances becomes intractable.

Recently, it has been shown that the use of region-based machine learning models is an alternative that can deal with the stage of music object detection holistically. These models have been widely developed in the computer vision community, attaining high performance in detecting objects in images by using convolutional neural networks.
In addition to the performance, a compelling advantage is that these models can be trained in an end-to-end manner, that is, by merely providing pairs of images and positions where the objects to be detected are located; these models, therefore, make it possible to bypass several stages of the classical OMR workflow by directly detecting symbols in music score images.

Pacha et al. [23] presented the first work that considered region-based convolutional neural networks for the task of music object detection. They proposed a sliding-window based approach that cuts the image in a context-sensitive way into smaller chunks containing no more than one staff and runs a Faster R-CNN detector to obtain the positions and classes of all symbols in the cropped image. While the evaluation is limited to the detection performance on small image chunks instead of the entire images, the extension of this approach to full pages of handwritten music scores, written in mensural notation, is reported to yield promising results [24].

Hajič jr. et al. [25] use a different approach: instead of applying an object detection model directly, they use a semantic segmentation model and a subsequent detection stage. More specifically, the semantic segmentation is done with the U-Net architecture [26]. The overall detection problem is broken down into a set of binary pixel classification problems, and a connected components detector is subsequently used to arrive at the final detection proposals. The object detection results are reported in terms of F-scores, broken down by symbol class with no aggregate result, and the experiments are done only for a subset of the symbol classes available in the MUSCIMA++ dataset; on the other hand, the notation reconstruction step is subsequently applied, and the object detection is evaluated in terms of the subsequent MIDI inference.

The Deep Watershed Detector proposed by Tuggener et al. [27] is another attempt to solve music object detection by training a convolutional neural network to learn a custom energy function that is used in a watershed transformation to perform semantic segmentation of an entire score. They evaluate their approach on the DeepScores and the MUSCIMA++ datasets. While the results for some classes are promising, e.g., it works exceptionally well on small objects such as staccato dots, the algorithm generally struggles with rare classes, overlapping symbols, and accurate bounding box regression. Unfortunately, no overall results of the detection performance are given by the authors.

As discussed above, while these studies use standard object detection models, they use completely different datasets, vocabularies, and metrics for the reported results. A major part of the motivation for this paper is to evaluate these advances in music object detection in a consistent manner, so that future advances have a clear, up-to-date formulation and baseline.

3. Task Formulation

We formulate the task of object detection in images in the following way. Given an image, a variable-length list of 6-tuples (y1, x1, y2, x2, c, s) is obtained, where y1, x1 and y2, x2 denote the coordinates of the top-left and bottom-right corners, respectively, of a predicted bounding box, c is the category assigned to the object therein, and s is the confidence score given by the model to such a prediction. In the specific case of music object detection, the categories correspond to the music-notation primitives that are considered relevant to the user, depending on the specific OMR task.
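As a concrete illustration of this output format, the 6-tuple can be written down as a small typed record. This is only a sketch: the names `Detection` and `keep_confident`, the use of strings for categories, and the example values are assumptions made for illustration, not part of the published data model.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One predicted music object, mirroring the 6-tuple (y1, x1, y2, x2, c, s)."""
    y1: float      # top edge of the bounding box
    x1: float      # left edge of the bounding box
    y2: float      # bottom edge of the bounding box
    x2: float      # right edge of the bounding box
    category: str  # c: symbol class assigned to the box
    score: float   # s: confidence of the model in this prediction


def keep_confident(detections: List[Detection], min_score: float) -> List[Detection]:
    """Return detections with score >= min_score, most confident first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical detector output for one image (all values are made up).
page = [
    Detection(80, 60, 132, 63, "stem", 0.91),
    Detection(120, 48, 132, 62, "notehead-full", 0.98),
    Detection(118, 300, 134, 318, "sharp", 0.35),
]
confident = keep_confident(page, min_score=0.5)
```

Because the list has variable length, downstream stages can filter or sort it freely; here the low-confidence sharp is dropped and the remaining two detections are ordered by score.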
Note that the requirements may vary depending on both the input music notation and the pursued application: the interesting primitives for replayability may differ from the interesting ones for getting a structured encoding of the music.

The main reason to formulate music object detection as bounding box retrieval is that it provides a direct relationship between the detection results and the entities to be recognized in the music score image. It has already been discussed in Section 2 that the traditional segmentation step based on connected components can produce both super-symbols (a single component that gathers several symbols) and sub-symbols (a single symbol separated into several parts), which increases the complexity of post-processing considerably. Similarly, a pixel-wise categorization (known as semantic segmentation in the computer vision community) might avoid predicting super-symbols, yet the problem with sub-symbols remains. In addition, a pixel-level annotation introduces ambiguities that are difficult to handle when nearby or touching pixels are labeled in the same way while belonging to different entities (for example, multiple noteheads in a chord).

Furthermore, the prediction with bounding boxes provides an implicit grouping. Thus, detecting isolated entities directly, along with their positions in the image, is the kind of information that the following stages of the OMR workflow might need, in which detected symbols are grouped to reconstruct the actual music notation. Therefore, once objects have been detected, the image is no longer relevant, since the bounding boxes are sufficient representatives of the graphical information that needs to be recovered from the music score image.
For example, bounding box dimensions have long been used as features for symbol classification in pipelines where this step is separate [4]; they are suitable for filtering false positives [28]; in the dependency graph approach of MUSCIMA++, bounding boxes already provide useful features for the reconstruction step [14]; and they could also be used to model terminals of a music notation grammar for the reconstruction stage [29].

In addition to the above, the reality with music documents is that the stylistic and graphical differences amongst different manuscripts are very pronounced, especially in the case of handwritten notation. That means it is advisable to build ground-truth data for each type of manuscript with which to train the recognition models, as is happening in other similar domains such as text recognition [30]. We believe that annotating images at the bounding box level is less expensive than building a dataset to train a traditional multi-stage system, in which each stage needs its own ground truth. Furthermore, this level of annotation represents a good trade-off between effort and accuracy in comparison to other current approaches in computer vision that include pixel-wise labeling [31]. Although these fine-grained annotations could eventually lead to better localization results, the required initial effort for building ground-truth data is much higher, which is especially detrimental when dealing with a new type of music manuscript.

4. Experimental Setup

4.1. Object Detection Models

The objective of this work is to provide a good baseline for the music object detection task, and so we consider three neural models of different nature for performing the experiments. While we do want our detectors to be as accurate as possible, we primarily wish to exemplify the different deep learning approaches to object detection. We believe that this is more interesting from the point of view of reference results, and can help to draw more interesting conclusions. Thus, we use Faster R-CNN as a representative of two-stage detectors, RetinaNet as a representative of one-stage detectors, and U-Nets as a representative of models based on pixel-level segmentation. Figure 1 overviews the general operation of these types of detectors.

4.1.1. Faster R-CNN

Faster Region-based Convolutional Neural Network (Faster R-CNN) [32] is the evolution of the first convolutional network schemes for object detection, R-CNN [33] and Fast R-CNN [34]. Faster R-CNN belongs to the class of two-stage detectors, with the first stage generating a sparse set of region proposals that are classified and further refined in the second stage. While the previous R-CNN schemes used an external mechanism for generating the proposals, such as Selective Search [35] or EdgeBoxes [36], Faster R-CNN attempts to learn the object proposal stage directly from the data by employing a region proposal network. The whole process can be carried out efficiently because the convolutional features are shared between both stages, and therefore computing the region proposals does not represent a bottleneck. This also makes training such a network more efficient.

The details for training this model followed the recommendations given in the work of Pacha et al. [23]. That is, an Inception-ResNet-V2 [37] is used for the feature extraction stage, initialized with pre-trained weights from ImageNet (as provided by TensorFlow Object Detection API [38]).
Input images are rescaled so that the longest edge is no longer than 1000 pixels. A clustering of symbol bounding box shapes is done for each dataset, in order to establish an appropriate set of bounding box shapes to predict, therefore providing appropriate hyperparameters for the object proposal stage.

4.1.2. RetinaNet

The RetinaNet [39] belongs to the family of one-stage detectors that are built on convolutional neural networks. Other prominent representatives are OverFeat [40], Single Shot Detector (SSD) [41] or You Only Look Once (YOLO) [42]. These one-stage detectors create a dense set of proposals along a grid and directly classify and refine those proposals. As opposed to the two-stage detectors, they have to handle a large number of background samples, which potentially can dominate the learning signal.

The RetinaNet [39] is an adaptation of a Residual Network [43] with lateral connections to create features on multiple scales [44]. Small convolutional subnetworks perform classification and bounding box regression on each output layer. RetinaNet was proposed along with the focal loss function, which tries to overcome the hard object-background imbalance issue by dynamically shifting weight to increase the contribution of hard negative examples and decrease the contribution of easy positives.

The configuration of the network model requires setting several hyperparameters. We specifically checked four different back-ends for feature extraction, namely: ResNet50 [43], MobileNet128 [45], DenseNet121 [46], and a highly simplified version of the DenseNet.
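The re-weighting idea behind the focal loss mentioned above can be sketched for a single binary prediction as follows. This is a simplified illustration rather than the implementation used in the experiments; the values γ = 2 and α = 0.25 are commonly used defaults that are assumed here and are not reported in this paper, and the function names are hypothetical.

```python
import math


def cross_entropy(p, is_foreground):
    """Plain binary cross-entropy for a predicted foreground probability p."""
    p_t = p if is_foreground else 1.0 - p
    return -math.log(max(p_t, 1e-12))  # clamp to avoid log(0)


def focal_loss(p, is_foreground, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma.

    gamma and alpha are assumed defaults, not values from the paper.
    """
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, is_foreground)


# An easy background sample (confidently rejected) versus a hard one
# (background that the model mistakes for a symbol).
easy_ratio = focal_loss(0.01, False) / cross_entropy(0.01, False)
hard_ratio = focal_loss(0.90, False) / cross_entropy(0.90, False)
```

Under these assumptions the easy background sample keeps less than 0.01% of its cross-entropy loss, while the hard one keeps about 61%, so the many easy background locations produced by a dense grid of proposals no longer dominate the learning signal.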
Various anchor dimension settings were also examined. The ResNet50 feature extractor performed best in preliminary experiments and was subsequently chosen. The negative overlap threshold was set to 40%, so every box with lower Intersection over Union (IoU) counts as background; similarly, the positive overlap threshold was set to 50%, and every box with a higher IoU is treated as foreground; boxes in between are omitted from the training signal.

[Figure 1. Basic architectures of the considered types of object detectors. (a) One-stage detector: a feature extractor followed by a detection generator that performs object classification and box regression. (b) Two-stage detector: a feature extractor, a proposal generator with objectness classification and box regression, a crop step, and a box classifier with object classification and box regression. (c) U-Net detector: a U-Net feature extractor that produces a probability map, followed by a detection generator that applies thresholding and a connected components search.]

4.1.3. U-Net

The U-Net [26] is a model for performing semantic segmentation that assigns each pixel of the input image to a certain class. It can be extended to perform object detection, as defined in Section 3. The U-Net architecture combines three key elements: standard 2D convolutions, the “hourglass” architecture inspired by auto-encoders, and residual connections from ResNets [43]. As no other operations than convolutions and element-wise sums of corresponding layers in the “hourglass” are used, the U-Net can in parallel assign a label (or a numerical value, or a probability distribution) to each pixel of an arbitrarily large image. The architecture is depicted in Figure 2.
[Figure 2. The U-Net architecture, with computation flowing left-to-right; the “hourglass” is unrolled downwards. Green arrows indicate 2D convolution with 3 × 3 kernels (with batch normalization and ELU), downward orange arrows indicate 2 × 2 Max-Pooling, upward purple arrows indicate 2 × 2 up-convolution (stride 2), and blue arrows indicate element-wise sums that form the residual connections between corresponding parts of the two “hourglass” halves. The input image is mapped to a probability map by a final 1 × 1 convolution with sigmoid activation; the levels are labeled 8, 16, 32 and 64.]

In order to generate the binary pixel mask training data from the bounding box ground truth, we set all pixels within the bounding boxes of a given symbol class to 1, resulting in rectangular foreground regions for each symbol instance (despite the fact that the symbols themselves are not rectangles).

One drawback of U-Nets is that they were initially designed for semantic segmentation: based on the pixel-wise outputs (such as a probability map), one needs to add a detector stage to actually perform object detection. However, if we thus decide on the detector in advance, we can manipulate the output masks on which we train the behavior of this detector. In the case of music notation, for symbols that may consist of multiple connected components or have complex shapes (the f-clef is an example that combines both), this can be attenuated by training on masks computed from their convex hulls rather than directly from their pixels [25]. Fortunately, as a side effect of using bounding box data in this paper to generate the rectangular pixel-wise masks, we are in essence already getting crude approximations of convex hulls. Note, however, that the bounding box data model thus forces the model to classify background pixels to belong to the symbol, which might otherwise be some way off; this is pronounced especially with beams that are slanted or close to each other.
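The mask generation described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: boxes are integer (y1, x1, y2, x2) tuples with an exclusive bottom-right corner, and the function name `render_class_mask` is hypothetical.

```python
def render_class_mask(height, width, boxes):
    """Render a binary mask with 1 inside every bounding box of one class.

    boxes: iterable of (y1, x1, y2, x2); the bottom-right corner is treated
    as exclusive, which is an assumption made for this sketch.
    """
    mask = [[0] * width for _ in range(height)]
    for y1, x1, y2, x2 in boxes:
        # Clip to the image so partially visible boxes are still rendered.
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = 1
    return mask


# Two overlapping boxes of the same class, e.g. two beams close to each other.
mask = render_class_mask(4, 6, [(0, 0, 2, 2), (1, 1, 3, 4)])
foreground_pixels = sum(sum(row) for row in mask)
```

In this example the two rectangles merge into a single foreground region of 9 pixels, so the mask alone no longer tells the two instances apart. This is the effect noted above for beams that are slanted or close to each other.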
By not considering the bounding boxes themselves at all during training, U-Nets avoid questions of granularity and the corresponding anchor box hyperparameters, which is a welcome property given the variability of musical symbol shapes, both inter-class and in some cases intra-class. On the other hand, the arbitrary detector step, of course, introduces its own hyperparameters: the masking threshold and the pixel merging strategy. One can consider the pixel-wise labels as a very fine-grained over-segmentation; the detector then acts as the over-segment merging step. The only architectural hyperparameter one has to set is the size of the receptive field of an output pixel, which is defined implicitly through the number of convolutional and max-pooling layers and their filter sizes; if we fix the size of the network, we can also trade off the receptive field size and resolution by downscaling the images.

Model specifics. We follow the architecture of [26] and our U-Nets have four “depth” levels, as depicted in Figure 2. The final layer that produces the probability map uses 1 × 1 convolutions with just one filter, with a sigmoid activation. (This is an efficient implementation of computing a weighted combination of the convolutional features for each pixel from the second-to-last layer.)

Training setup. To go from bounding box ground truth to labels for each pixel, we render the rectangles specified by the bounding box ground truth as foreground. Each image is downscaled with a factor of 0.5.
Training is not performed on entire images; instead, in each epoch, we uniformly sample a random 256 × 512 window from each training image (corresponding to a 512 × 1024 window from the original image). If this window contains no foreground pixel for the given class, we re-sample up to 5 times; this is a general way of slightly oversampling rare classes. For each symbol class, one U-Net is trained with exactly the same setup. We minimize the cross-entropy loss with the Adam optimizer, using the default parameters suggested in [47]. Batch size is set to 2. We use a learning rate attenuation schedule: starting from 0.001, if the validation loss does not improve for 50 epochs, we multiply the learning rate by 0.2, a process that is repeated five times. Again, none of these steps are domain specific.

Detection is then performed independently for each symbol class: in this setup, the fact that a pixel is classified as belonging to, e.g., a barline, does not preclude it from also being classified as a stem pixel (note that certain music notation symbols indeed overlap to a great extent, e.g., noteheads and ledger lines). As opposed to [25], we do not experiment with multi-channel outputs, as this is a step that already requires domain-specific knowledge. For the detection stage, we use simple thresholding at 50% and a connected component detector, this time following the setup of [25]. The detector does not output any natural confidence score, so we add a placeholder value of 1 for each detected foreground region.

4.2. Datasets

As we are considering generic object detection methods, we can evaluate all of them across a range of OMR datasets for symbol detection [48]. As a side effect of this evaluation, we also obtain a notion of the difficulty of these datasets for object detection in general. Each dataset contains a different kind of typography, adding to the breadth of the baselines we establish.
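As a brief aside on the U-Net training setup of Section 4.1.3, the window sampling rule and the learning rate schedule can be sketched as follows. This is only an illustration under stated assumptions: the names `sample_window` and `plateau_schedule` are hypothetical, the mask is a plain list of rows, and the schedule is assumed to hold the last learning rate once five reductions have been used up (the text does not say whether training stops at that point).

```python
import random


def sample_window(mask, win_h, win_w, max_resamples=5, rng=random):
    """Draw a window's top-left corner uniformly at random; if the window
    contains no foreground pixel, redraw up to `max_resamples` times and
    then accept whatever was drawn last."""
    height, width = len(mask), len(mask[0])
    for _ in range(max_resamples + 1):
        top = rng.randint(0, height - win_h)   # randint is inclusive on both ends
        left = rng.randint(0, width - win_w)
        rows = mask[top:top + win_h]
        if any(any(row[left:left + win_w]) for row in rows):
            break
    return top, left


def plateau_schedule(val_losses, initial_lr=0.001, factor=0.2,
                     patience=50, max_reductions=5):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved for `patience` consecutive epochs, at most `max_reductions` times.
    """
    lr, best, stale, reductions = initial_lr, float("inf"), 0, 0
    rates = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience and reductions < max_reductions:
                lr *= factor
                reductions += 1
                stale = 0
        rates.append(lr)
    return rates
```

With the values quoted in the text, the window would be 256 × 512 pixels on the downscaled image, and five reductions would take the learning rate from 0.001 down to 0.001 × 0.2^5 = 3.2 × 10^-7.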
• DeepScores: DeepScores [49] is a very large synthetic dataset of music scores in Common Western Modern Notation (CWMN), consisting of 300,000 images along with their ground-truth annotations for performing symbol classification, image segmentation, and object detection. It is based on a large collection of freely available MusicXML files from MuseScore [50] that were converted into Lilypond files and digitally rendered into images using five different fonts to obtain a higher visual variability. The first version of this dataset only has annotations for a limited vocabulary that is missing essential glyphs, such as stems, beams, barlines, ledger lines or slurs. The second version, which is currently under development, contains these missing annotations and has been made available to us by the original authors. This set contains only 100 pages, but has full annotations for all relevant music symbols.
• MUSCIMA++: MUSCIMA++ [14] is a dataset of handwritten music that has over 90,000 manually annotated handwritten musical symbols in CWMN. The dataset is built on the CVC-MUSCIMA dataset for staff removal [51]. The ground truth is defined as a notation graph: in addition to the individual symbols, their relationships are annotated as well, so that the semantics (pitch, duration, and onset) can be inferred and the full OMR pipeline can be trained on the dataset. However, in this paper, we only focus on symbol detection, equivalent to recovering the vertices of the notation graph.
• Capitan: Capitan consists of 46 fully-annotated pages in Spanish mensural notation from the 16th–18th century.
The manuscripts represent sacred music, composed for vocal interpretation. The compositions were written in music books by different copyists of that time. To preserve the integrity of the physical sources, images of the manuscripts were taken with a camera instead of scanning them in a flatbed scanner, leading to suboptimal conditions in some cases. The corpus is based on the dataset used in the work of Pacha and Calvo-Zaragoza [24]. However, the refined version used in this work is focused on obtaining a diplomatic transcript, keeping the information of how symbols were written in the source as intact as possible. That is why there is a higher number of categories, since symbols that have the same meaning (for example, a minima with the stem pointing up or down) are now considered as different categories.

An overview of the corpora considered is given in Table 1, while we show some patches extracted from their images in Figure 3. As can be observed, the characteristics of the different corpora are quite heterogeneous, which is interesting for drawing generalizable conclusions from our experiments.

Table 1. Overview of the considered datasets.

| Dataset | Notation | Engraving | Images | Categories | Scores | Symbols |
|---|---|---|---|---|---|---|
| DeepScores | CWMN | Printed | Binary | 39 | 100 | 87,703 |
| MUSCIMA++ | CWMN | Handwritten | Binary | 107 | 140 | 91,254 |
| Capitan | Mensural | Handwritten | Color | 56 | 46 | 11,242 |

It is important to mention the variability in the shapes of the bounding boxes of the elements within these datasets. This variability appears not only amongst elements of different classes but also, especially in the case of handwritten notation, amongst elements of the same class. To illustrate this scenario, Figure 4 shows the different shapes of the boxes to be recognized in each dataset. The majority of objects in the DeepScores dataset are very tiny. The MUSCIMA++ dataset shows a greater variation in aspect ratios with one dominant cluster, the noteheads.
In addition, the Capitan dataset contains a significant number of bigger objects compared to the other two datasets, with distinct clusters.

Figure 3. Samples of notation from the considered datasets: (a) DeepScores, (b) MUSCIMA++, (c) Capitan.

Figure 4. Scatter plot (top row) and density plot (bottom row) of the normalized object sizes for the considered corpora, illustrating the challenges of each dataset (best viewed in color): (a) DeepScores, (b) MUSCIMA++, (c) Capitan. Each point in the top row depicts one instance from the dataset, with the color encoding the respective class. The width and height of a sample are reported as the fraction of the full image size.

To evaluate the models in the different corpora, we followed a fixed partitioning scheme for training, validating, and testing. Therefore, the experiments are reproducible, and future results will be directly comparable. Specifically, 60% of the available data is used for training, to learn the weights of the neural models; 20% for validation and hyperparameter optimization; and 20% for testing and computing the final evaluation metrics.

4.3. Evaluation

As stated in Section 3, our formulation expects models to provide a set of detection proposals, each of which consists of a bounding box and the recognized class of the object therein. The models are also expected to provide a score of their confidence for each proposal. A bounding box proposal $B_p$ is considered a positive sample if it overlaps with the ground-truth bounding box $B_g$ according to the Intersection over Union (IoU) criterion, that is, if

$$\mathrm{IoU}(B_p, B_g) = \frac{\mathrm{area}(B_p \cap B_g)}{\mathrm{area}(B_p \cup B_g)}$$

exceeds a certain threshold $t_{\mathrm{IoU}}$.
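The IoU criterion can be written as a small function. The sketch below assumes axis-aligned boxes given as `(left, top, right, bottom)` tuples; the names `iou` and `is_positive` and the default threshold of 0.5 are illustrative choices, not part of the original formulation.

```python
def iou(box_p, box_g):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (left, top, right, bottom); this layout is an assumption
    made for the example.
    """
    inter_w = min(box_p[2], box_g[2]) - max(box_p[0], box_g[0])
    inter_h = min(box_p[3], box_g[3]) - max(box_p[1], box_g[1])
    intersection = max(inter_w, 0) * max(inter_h, 0)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(box_p, box_g, t_iou=0.5):
    """A proposal counts as positive if its IoU exceeds the threshold."""
    return iou(box_p, box_g) > t_iou
```

For instance, two 2 × 2 boxes offset by one pixel in each direction share one unit of area out of seven, so their IoU is 1/7 and the proposal would not count as positive at a threshold of 0.5.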
If the predicted category matches the actual category of the object, it is considered a true positive (TP), being otherwise a false positive (FP). Additional detections of the same object are considered as false positives as well. Those ground-truth objects for which the model makes no proposal are considered false negatives (FN). From these values, precision (P) and recall (R) metrics can be computed as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$

P measures how reliable detections are (ratio of correct detections), whereas R measures the ability of the model to detect symbols (ratio of detected symbols).

Object detection can be seen as a retrieval task, in which bounding boxes are ordered by their associated scores. Then, P and R can be computed as previously described from the top k predictions. However, different values of P and R are obtained by varying the parameter k. To obtain a single metric encompassing the performance of the model, the average precision (AP) can be computed, which is defined as the area under the precision–recall curve for all possible values of k. A single AP value is obtained independently for each class, and then the mean AP (mAP) is computed as the average across all classes. Since our problem is highly unbalanced with respect to the number of objects of each class, we also compute the weighted mAP (w-mAP), in which the mean value is weighted according to the frequency of each class. The difference between mAP and w-mAP gives a quick idea of how the evaluated models deal with the rare classes. When $t_{\mathrm{IoU}}$ is set to 50%, the described evaluation protocol matches the PASCAL Visual Object Classes (VOC) challenge [52].
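The metrics above can be illustrated with a short sketch. It assumes that each detection has already been marked as a true or false positive, and it computes AP as a plain, non-interpolated sum over the ranked list; evaluation toolkits typically use interpolated variants, so the numbers here are not claimed to match theirs. The function names and the dictionary-based inputs are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    `detections` is an iterable of (score, is_true_positive) pairs.
    Each true positive at rank k adds precision@k times a recall step of
    1 / num_ground_truth (non-interpolated AP, an assumption of this sketch).
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    total = 0.0
    for k, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            total += true_positives / k  # precision among the top k
    return total / num_ground_truth


def mean_ap(ap_per_class):
    """Unweighted mean of the per-class AP values (mAP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def weighted_mean_ap(ap_per_class, count_per_class):
    """Mean AP weighted by the number of objects of each class (w-mAP)."""
    total = sum(count_per_class[name] for name in ap_per_class)
    return sum(ap * count_per_class[name] / total
               for name, ap in ap_per_class.items())
```

As a toy example, a frequent class with AP 1.0 (nine objects) and a rare class with AP 0.0 (one object) give an mAP of 0.5 but a w-mAP of 0.9, which shows how the gap between the two figures reflects performance on rare classes.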
The accuracy of the localization is especially important for OMR, as objects are often packed densely. Failing to locate them correctly heavily affects the subsequent recognition. To account for this, we average mAP and w-mAP over different values of $t_{\mathrm{IoU}}$, ranging from 50% to 95% in steps of 5%. This evaluation protocol is taken from the COCO challenge [17], and it is expected to provide figures that are more sensitive to precise symbol localization.

5. Results

The aggregate detection performance of the individual models over each of the datasets is reported in Table 2, presenting both mAP and w-mAP as defined for the COCO challenge [17]. These results should serve as the baseline for further music object detection research.

Generally, it can be observed that the results are still very far from optimal. The evaluated models struggle most with the MUSCIMA++ dataset, with the U-Net performing best at around 16% mAP and 33% w-mAP. It might be that the comparison is not entirely fair since the U-Net was specially designed for this dataset. However, U-Net outperforms the rest of the models in the case of DeepScores as well, where it attains around 24% in both mAP and w-mAP, leaving Faster R-CNN and RetinaNet below 20% and 10%, respectively, in both metrics. Concerning the Capitan dataset, all models behave quite similarly, except for the superior performance of RetinaNet regarding the w-mAP metric.

Table 2. Results in terms of mAP (%) and w-mAP (%) with respect to the dataset and object detector model following the COCO evaluation protocol.

| Model | mAP (%): DeepScores | mAP (%): MUSCIMA++ | mAP (%): Capitan | w-mAP (%): DeepScores | w-mAP (%): MUSCIMA++ | w-mAP (%): Capitan |
|---|---|---|---|---|---|---|
| Faster R-CNN | 19.6 | 3.9 | 15.2 | 14.4 | 7.9 | 23.2 |
| RetinaNet | 9.8 | 7.7 | 14.5 | 1.9 | 4.9 | 34.9 |
| U-Net | 24.8 | 16.6 | 17.4 | 23.3 | 33.6 | 26.0 |

In general, Faster R-CNN performs better than RetinaNet. However, it is especially sensitive to the selection of hyperparameters that regulate the shape and scale of the objects to be detected.
The high variability in the bounding box shapes shown in Figure 4 might explain why Faster R-CNN is far from offering the performance it demonstrates for detecting objects in natural images. Compared to previous works that reported 80% mAP for snippets [23] and 76% mAP for full pages [24], a few differences need to be pointed out to understand the large difference between the numbers: the experiments from this work used less training data due to a stricter dataset split, the vocabulary of the Capitan dataset became larger, and the final results are computed following the strict COCO evaluation protocol as opposed to reporting the PASCAL VOC metrics [52].

In the case of RetinaNet, an in-depth analysis of its operation reveals that it is not capable of detecting small objects. This explains the noticeable discrepancy between its mAP and w-mAP in DeepScores, where the noteheads (small objects) are the most represented category. Note that Faster R-CNN also exhibits this behavior on the DeepScores dataset, where more frequent symbols are also more problematic for the model than the rarer symbols.

In practical settings, inference speed, and in some situations (re-)training speed, can offset small differences in detection performance. We give a rough comparison when running the experiments on a standard consumer PC, equipped with a GTX 1080 graphics card:

• Faster R-CNN: Training time: 8–12 h; inference time: 20–50 s per image,
• RetinaNet: Training time: 1–2 h; inference time: less than 1 s per image,
• U-Net: Training time: 2–3 h per symbol class; inference time: 40–80 s per image, or about 0.8 s per symbol class.
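Because the U-Net figures in the list above are quoted per symbol class, their totals grow with the size of the vocabulary. The following back-of-the-envelope sketch scales them by the category counts of Table 1. It assumes one U-Net per category, trained and run one after another on a single GPU; whether every category was actually trained this way is not stated, so the totals are illustrative only, and the helper names are hypothetical.

```python
# Per-class figures quoted in the timing list above.
TRAIN_HOURS_PER_CLASS = (2.0, 3.0)
INFER_SECONDS_PER_CLASS = 0.8

# Category counts from Table 1 (assumed here to equal the number of U-Nets).
CATEGORIES = {"DeepScores": 39, "MUSCIMA++": 107, "Capitan": 56}


def unet_training_hours(num_classes, hours_per_class=TRAIN_HOURS_PER_CLASS):
    """Lower and upper bound on sequential training time, in hours."""
    low, high = hours_per_class
    return low * num_classes, high * num_classes


def unet_inference_seconds(num_classes, seconds_per_class=INFER_SECONDS_PER_CLASS):
    """Sequential inference time per image, in seconds."""
    return seconds_per_class * num_classes


for name, num_classes in CATEGORIES.items():
    low, high = unet_training_hours(num_classes)
    seconds = unet_inference_seconds(num_classes)
    print(f"{name}: {low:.0f}-{high:.0f} h training, "
          f"{seconds:.1f} s inference per image")
```

Under these assumptions, even the smallest vocabulary (39 categories) corresponds to 78–117 h of sequential training, that is, more than three days.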
In this comparison, the RetinaNet has a clear advantage: if one were to find a way to improve its accuracy to an acceptable level, it would be a clear champion for interactive OMR or online recognition settings. U-Nets, on the other hand, are impractical for situations where frequent re-training is needed: unless one has a cluster of graphics processing units (GPUs), training even the minimum 30+ classes that are necessary for pitch and duration inference would take several days.

Qualitative Results

To illustrate the differences in performance, we show samples of detector outputs across the three datasets for some selected classes. Figure 5 shows how the detectors fare with the born-digital printed music of DeepScores. As the rendered symbols have relatively little variability, this sample allows for comparing the strengths and weaknesses of the models’ designs, especially with respect to music notation data.

The Faster R-CNN model (Figure 5 top) has trouble with symbols that are bunched together closely, especially in the upper left corner. This may be due to too few available proposals in a given region. On the other hand, it can distinguish slanted parallel beams (first and third measure). The RetinaNet (Figure 5 middle) is unable to deal with symbols smaller than the beams and does not even find all of them. The U-Nets (Figure 5 bottom) shine in this specific example, perhaps a bit more than the quantitative results suggest: they also recover the heavily overlapping eighth rest in the third and fourth measures. On the other hand, the inherent limitation of the connected component detector causes beams with overlapping bounding boxes to get lumped together. If one were to choose an image with dense chords, noteheads within a chord would also invariably get merged into one.

Detection performance on the MUSCIMA++ dataset (Figure 6) displays a similar pattern.
The RetinaNet again cannot detect anything but the large objects; Faster R-CNN again seems to run out of proposals in cluttered regions, or perhaps proposals get inadvertently merged into one due to insufficient feature map resolution. U-Nets are lucky in this image: the descending thirds in the first measure are just far enough from each other so that they get detected separately; if they were as close to each other as the bottom two noteheads on the third and fourth beat of the second measure of the sample, they would get merged into one. Beams, even though their bounding boxes do not necessarily overlap (bottom staff, second measure), again get merged, and there are false positive beams in hairpins.

Figure 5. Detection sample on some selected classes from the DeepScores dataset. (top–bottom): Faster R-CNN, RetinaNet, and U-Nets detection results.

On the Capitan dataset, the situation changes, as illustrated in Figure 7. We hypothesize that the main driver for this difference is the change in symbol class definition: instead of using notation primitives such as noteheads or stems, the Capitan dataset uses composite symbols such as note.quarter-up, note.beamedLeft1. This discrepancy in defining music notation objects has persisted throughout the literature on music object detection [19]. This presents a problem for the U-Nets: the most prominent feature of a note, whether facing down or up, is the notehead. As the symbols are processed independently, there is a risk that noteheads will be detected as instances of all applicable objects according to the notehead type.
If one looks at the U-Nets’ output (Figure 7 bottom), e.g., the middle of the second staff on the second page, eighth notes get classified as quarter notes, and half-note stems fool the quarter-note detector into false positives. In addition, as the symbols get larger, the U-Net runs into one of its inherent risks concerning the connected components detector: symbol fragmentation. As the easily classified pixels of a symbol tend to lie at its extremes, the system may become less certain in its center, and the symbol falls apart after thresholding the U-Net output probability map. We have observed this behavior on barlines and long stems on the MUSCIMA++ dataset as well. This breakup produces many false positives (in Figure 7, especially for quarter notes).

Figure 6. Detection sample on some selected classes from the MUSCIMA++ dataset. (top–bottom): Faster R-CNN, RetinaNet, and U-Nets detection results.

On the other hand, while Faster R-CNN still struggles with false negatives (although to a much smaller extent), RetinaNet no longer faces symbols that are too small, and learns well: when symbol class frequencies are used to weight the result, it outperforms both contenders by a large margin. It falls into none of the U-Nets’ traps.

What can we say regarding the datasets? For DeepScores, our results seem to confirm the intentions of the dataset authors: the main difficulty of the dataset is the large number of tiny objects [49].
While Faster R-CNN does outperform the same baseline architecture of [49] (which, according to the authors, does not detect anything at all), it does still encounter the limitations that they expected of this class of models. The single-shot RetinaNet detector runs into even worse trouble (and thus the authors of [49] were probably right to not use single-shot detection at all).

Figure 7. Detection sample on some selected classes from the Capitan dataset. (top–bottom): Faster R-CNN, RetinaNet, and U-Nets detection results.

The Capitan dataset seems to present a more straightforward object detection challenge. The close relationship of the composite object classes does not seem to be a problem for standard detectors; semantic segmentation, however, struggles.

From the perspective of music object detection, the MUSCIMA++ dataset has turned out to be essentially a more difficult version of DeepScores: the ground truth is defined at the level of notation primitives, the music contained in the datasets has similar complexity, but MUSCIMA++ is handwritten, which makes the shapes more variable, and topological features such as corners less reliable.

6. Conclusions

In this work, we establish a baseline for detecting music notation objects with deep learning models for generic object detection.
Experiments were performed over three diverse major OMR datasets: the synthesized DeepScores dataset of born-digital modern notation, the MUSCIMA++ dataset of handwritten modern notation with varying degrees of writing quality, and the Capitan dataset that contains mensural notation which is also handwritten, but of consistently high quality. Three types of neural models have been evaluated, namely the two-stage Faster R-CNN detector, the one-stage RetinaNet detector, and the U-Net detection mechanism that combines flexible semantic segmentation with a connected component detector. The choice of experimental setup and evaluation in this paper can serve as a basis for further music object detection experiments that will, therefore, be directly comparable to these baselines and will enable drawing conclusions and model design recommendations from these direct comparisons.

Based on the quantitative and qualitative results in this paper, can we already formulate tentative practical recommendations for choosing a certain detection approach over another? We are well aware that three datasets may not be enough to draw such general conclusions; however, it is the most comprehensive experimentation that the current state of OMR allows in terms of available data. The suggestions should, therefore, be treated as tentative suggestions for further targeted investigations rather than fully-fledged conclusions.

U-Nets, except for merging nearby symbols of the same class, do not seem to have a problem with recall. Because they process symbol classes independently and do not reduce the resolution of the output features, they cannot run into the same (hypothesized) problems as Faster R-CNN, which has a limited number of region proposals for any single region of the image, for which the symbols in effect compete. The number of available proposals depends on a hyperparameter setting that might be difficult to set appropriately for areas densely populated with ground truth objects.
Furthermore, the proposal merging step (such as non-maximum suppression) may also lead to false negatives in cluttered environments. None of these disadvantages concern the U-Nets.

On the other hand, while these properties are ideal for very cluttered data where symbol classes are set to notation primitives, the design drawbacks of U-Nets do appear when the symbol vocabulary consists of composite symbols. Conversely, this is where the cluttering that presumably hinders the bounding box-based models ceases to be an important factor, and the relative strength of these models (the ability to consider a particular region as a whole) becomes more relevant, because composite symbols share visual elements that correspond to the primitives. The choice of a musical symbol detection model, therefore, seems to depend on the way the detection ground truth is defined.

Now that a deep learning baseline for music object detection has been established, where can subsequent research be heading? First, one can use the first insights gained from comparing the models over various datasets to improve the music object detectors themselves. The weak point of U-Nets seems to be settings with composite objects; experiments with composites built from MUSCIMA++ primitives by leveraging their syntactic relationships would be a logical step to investigate this. In order for U-Nets to improve on datasets with composite symbols (which are cheaper to annotate, as they generally contain fewer symbol instances, and are therefore more likely to be encountered during various music digitization efforts), a combination of the pixel-wise approach, which deals very well with highly cluttered areas or occlusion, and combined properties of the resulting pixel groups can be a viable avenue, while also perhaps alleviating the problem of parallel beams. In [28], a YOLOv3-like approach has been used to detect noteheads with joint pixel classification and bounding box regression. A post-filtering step then significantly improved precision, which is a much bigger problem for U-Nets than recall. The Deep Watershed Detector used by [27] exhibits a similar combination.

For improving the Faster R-CNN results on music notation data, we would need a better understanding of the relationship between anchor hyperparameters and expected symbol density.

The inability of the RetinaNet to detect small symbols is disappointing and merits further investigation, as it persisted regardless of various anchor hyperparameter settings. An idea to test the hypothesis of some minimum detectable absolute symbol size would be to upscale the image until the objects of interest reach sufficient size, and run detection on windows of the upscaled image that fit into GPU memory. The speed of this model both in training and inference would make it an attractive choice for interactive OMR, which is now probably the most viable approach towards building OMR systems that can best support creating digital editions of music, such as the Ceres system [53] or the Pixel.js editor [54].

More can also be done in terms of evaluation to make the baseline more informative regarding the outputs expected from OMR downstream. While music object detection is a critical step in OMR pipelines, it is not the final step; for evaluating a detector as part of an OMR system, one should be able to attribute downstream errors, e.g., in pitch or duration inference, to detection errors or uncertainties. For instance, Ref.
[25] uses several ways of evaluating MIDI inferred on top of the object detection results, using a baseline reconstruction model. Furthermore, the graph model of MUSCIMA++ offers hope that the edges can serve as “conduits” from higher-level errors to their lower-level causes, but, so far, we are not aware of any method that would allow combining such structured gradient flows with the object detection architectures. Then, there are exciting challenges of transfer learning. Modern notation follows the same underlying rules, regardless of whether it is printed or handwritten: can one leverage a printed music dataset to train for handwritten object detection? At least between DeepScores and MUSCIMA++, many symbol classes can be directly mapped onto each other—experiments in this direction should be possible. In this context, the effect of image deformations and other, perhaps more realistic data augmentation can be explored. Finally, while it is obvious that merely detecting the musical elements in score images does not represent a complete OMR system, we believe that addressing music object detection in a generic machine learning manner brings a series of changes that are quite interesting for the development of the OMR field. Except for the few attempts at end-to-end OMR that are so far limited to monophonic output [7,8,55], all OMR systems are explicitly detecting music objects at some point in their recognition pipeline. Generic deep learning approaches may have the potential to decouple object detection from actual knowledge of music notation itself—nevertheless, users now need to be aware of how these systems learn and design them accordingly. 
The proposed general machine learning approach can then be used by all of them, regardless of the musical notation system (except for hyperparameter tuning and cookbook-style model choice recommendations), as opposed to approaches that exploit specific characteristics of how the music notation system works to build segmentation heuristics. Then, as the music object detection stage is done, image processing can in principle be forgotten: the only remaining link to the original image is the bounding box and potentially pixel mask features associated with the detected objects. The remaining stages (notation reconstruction and exporting an output representation) then, in turn, do not require computer vision knowledge, while now requiring, of course, some understanding of how music notation stores content.

On the other hand, one can utilize the syntactic regularities of music notation to improve the object detection stage (and perhaps perform detection and relational understanding jointly). Incorporating the graph structure, and further prior knowledge about the properties of music notation (such as expected voice leading), into a differentiable loss function that can be optimized by the neural network learning process, represents an interesting avenue for future research.

Both approaches, therefore, open up the possibility for experts from different areas to establish a synergy that pushes the development of the OMR field from both perspectives.

Author Contributions: A.P., J.H. and J.C.-Z. all contributed equally.

Funding: The authors wish to thank the TU Wien Bibliothek for the financial support through its Open Access Funding Program.
The second author additionally acknowledges support by the Czech Science Foundation Grant No. P103/12/G084, Charles University Grant Agency grants 1444217 and 170217, and by SVV project 260 453. The third author additionally acknowledges the support from the Spanish Ministerio de Ciencia, Innovación y Universidades through a Juan de la Cierva Formación grant (Ref. FJCI-2016-27873).

Conflicts of Interest: The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations

The following abbreviations and acronyms are used in this manuscript:

OMR      Optical Music Recognition
IMSLP    International Music Score Library Project
SIMSSA   Single Interface for Music Score Searching and Analysis
MIR      Music Information Retrieval
MuNG     Music Notational Graph
MIDI     Musical Instrument Digital Interface
MEI      Music Encoding Initiative
MUSCIMA  Music Score Images
COCO     Common Objects in Context
PASCAL   Pattern Analysis, Statistical Modelling and Computational Learning
VOC      Visual Object Classes
R-CNN    Region-based Convolutional Neural Network
API      Application Programming Interface
SSD      Single Shot Detector
YOLO     You Only Look Once
CWMN     Common Western Modern Notation
IoU      Intersection over Union
mAP      Mean Average Precision
GPU      Graphics Processing Unit
ELU      Exponential Linear Unit

References

1. Craig-McFeely, J. Digital Image Archive of Medieval Music: The evolution of a digital resource. Digit. Med. 2008, 3. [CrossRef] 2. The International Music Score Library Project. Available online: http://imslp.org/ (accessed on 28 August 2018). 3. Fujinaga, I.; Hankinson, A.; Cumming, J.E. Introduction to SIMSSA (Single Interface for Music Score Searching and Analysis). In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, London, UK, 12 September 2014; pp. 1–3. 4. Fujinaga, I.
Optical Music Recognition Using Projections. Master’s Thesis, McGill University, Montreal, QC, Canada, 1988. 5. Blostein, D.; Baird, H.S. A Critical Survey of Music Image Analysis. In Structured Document Image Analysis; Springer: Berlin/Heidelberg, Germany, 1992; pp. 405–434. 6. Pacha, A.; Eidenberger, H. Towards Self-Learning Optical Music Recognition. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 795–800. 7. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 11, 2298–2304. [CrossRef] [PubMed] 8. Van der Wel, E.; Ullrich, K. Optical Music Recognition with Convolutional Sequence-to-Sequence Models. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 23–27 October 2017. 9. Choi, K.Y.; Coüasnon, B.; Ricquebourg, Y.; Zanibbi, R. Bootstrapping Samples of Accidentals in Dense Piano Scores for CNN-Based Detection. In Proceedings of the 12th IAPR International Workshop on Graphics Recognition, Kyoto, Japan, 9–10 November 2017. 10. Calvo-Zaragoza, J.; Rizo, D. End-to-End Neural Optical Music Recognition of Monophonic Scores. Appl. Sci. 2018, 8, 606. [CrossRef] 11. Byrd, D.; Simonsen, J.G. Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. J. New Music Res. 2015, 44, 169–195. [CrossRef] 12. Hajič jr., J.; Novotný, J.; Pecina, P.; Pokorný, J. Further Steps towards a Standard Testbed for Optical Music Recognition.
In Proceedings of the 17th International Society for Music Information Retrieval Conference, New York, NY, USA, 7–11 August 2016; Mandel, M., Devaney, J., Turnbull, D., Tzanetakis, G., Eds.; New York University: New York, NY, USA, 2016; pp. 157–163. 13. Rebelo, A.; Fujinaga, I.; Paszkiewicz, F.; Marcal, A.R.; Guedes, C.; Cardoso, J.S. Optical music recognition: state-of-the-art and open issues. Int. J. Multimed. Inf. Retr. 2012, 1, 173–190. [CrossRef] 14. Hajič, J.J.; Pecina, P. The MUSCIMA++ Dataset for Handwritten Optical Music Recognition. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, 10–15 November 2017. 15. Calvo-Zaragoza, J.; Castellanos, F.J.; Vigliensoni, G.; Fujinaga, I. Deep Neural Networks for Document Processing of Music Score Images. Appl. Sci. 2018, 8, 654. [CrossRef] 16. Bainbridge, D.; Bell, T. A music notation construction engine for optical music recognition. Software 2003, 33, 173–200. [CrossRef] 17. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D.,Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. 18. Music Object Detection Repository on Github. Available online: http://github.com/apacha/ MusicObjectDetection (accessed on 28 August 2018). 19. Bellini, P.; Bruno, I.; Nesi, P. Assessing Optical Music Recognition Tools. Comput. Music J. 2007, 31, 68–93. [CrossRef] 20. Dalitz, C.; Droettboom, M.; Pranzas, B.; Fujinaga, I. A Comparative Study of Staff Removal Algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 753–766. [CrossRef] [PubMed] 21. Fornés, A.; Dutta, A.; Gordo, A.; Lladós, J. The 2012 Music Scores Competitions: Staff Removal and Writer Identification. 
In Graphics Recognition, Proceedings of the 9th International Workshop, Seoul, Korea, 15–16 September 2011; Kwon, Y.B., Ogier, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 173–186. 22. Gallego, A.J.; Calvo-Zaragoza, J. Staff-line removal with selectional auto-encoders. Expert Syst. Appl. 2017, 89, 138–148. [CrossRef] 23. Pacha, A.; Choi, K.Y.; Coüasnon, B.; Ricquebourg, Y.; Zanibbi, R.; Eidenberger, H. Handwritten Music Object Detection: Open Issues and Baseline Results. In Proceedings of the 2018 13th IAPR Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018. 24. Pacha, A.; Calvo-Zaragoza, J. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. 25. Hajič jr., J.; Dorfer, M.; Widmer, G.; Pecina, P. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. 26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. 27. Tuggener, L.; Elezi, I.; Schmidhuber, J.; Stadelmann, T. Deep Watershed Detector for Music Object Recognition.
In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018. 28. Hajič, J.j.; Pecina, P. Detecting Noteheads in Handwritten Scores with ConvNets and Bounding Box Regression. arXiv 2017, arXiv:1708.01806. 29. Coüasnon, B.; Brisset, P.; Stéphan, I. Using Logic Programming Languages For Optical Music Recognition. In Proceedings of the Third International Conference on the Practical Application of Prolog, Paris, France, 3–6 April 1995. 30. Villegas, M.; Sánchez, J.A.; Vidal, E. Optical modelling and language modelling trade-off for Handwritten Text Recognition. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 831–835. 31. Chen, L.; Hermans, A.; Papandreou, G.; Schroff, F.; Wang, P.; Adam, H. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. arXiv 2017, arXiv:1712.04837. 32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA 2015; pp. 91–99. 33. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2014; pp. 580–587. 34. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. 35. Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [CrossRef] 36. Zitnick, L.; Dollar, P. Edge Boxes: Locating Object Proposals from Edges. 
In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. 37. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. 38. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. 39. Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. 40. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229. 41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. 42. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. 43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, CA, USA, 26 June–1 July 2016; pp. 770–778. 44. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. 45. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. 46. Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2017, arXiv:1608.06993. 47. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. 48. The OMR datasets project on Github. Available online: http://apacha.github.io/OMR-Datasets/ (accessed on 28 August 2018). 49. Tuggener, L.; Elezi, I.; Schmidhuber, J.; Pelillo, M.; Thilo, S. DeepScores – A Dataset for Segmentation, Detection and Classification of Tiny Objects. In Proceedings of the 24th International Conference on Pattern Recognition, Beijing, China, 20–28 August 2018. 50. MuseScore. The free and open-source score writer. Available online: http://musescore.org (accessed on 28 August 2018). 51. Fornés, A.; Dutta, A.; Gordo, A.; Lladós, J. CVC-MUSCIMA: A ground truth of handwritten music score images for writer identification and staff removal. Int. J. Doc. Anal. Recognit. 2012, 15, 243–251. [CrossRef] 52. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef] 53. Chen, L.; Jin, R.; Raphael, C. Human-Guided Recognition of Music Score Images.
In Proceedings of the 4th International Workshop on Digital Libraries for Musicology, Shanghai, China, 28 October 2017. 54. Saleh, Z.; Zhang, K.; Calvo-Zaragoza, J.; Vigliensoni, G.; Fujinaga, I. Pixel.js: Web-Based Pixel Classification Correction Platform for Ground Truth Creation. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 10–15 November 2017; pp. 39–40. 55. Calvo-Zaragoza, J.; Rizo, D. Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

CHAPTER 6 Measure Detection and Structure Analysis

Knowing the structure of music scores can bring significant benefits to music object detection, as the two previously mentioned papers showed. When the images were cut into smaller images containing only one stave, the results were extraordinary, whereas they were disappointing when a shrunken version of the whole image was fed into the Faster R-CNN network. However, serving as a preprocessing step for music object detection is only one reason why it can make sense to analyze the layout and structure of music scores.
In the paper “Identification and Cross-Document Alignment of Measures in Music Score Images,” Simon Waloschek, Aristotelis Hadjakos, and I worked on structural analysis for a completely different reason [WHP19]. When creating critical editions of musical works, musicologists regularly compare multiple sources of the same musical piece. To navigate between these sources efficiently, they need cross-source navigation that is aware of the musical content. Traditionally, measures were annotated by hand and then related to each other. In this paper, we trained a deep convolutional neural network, similar to the ones used for music object detection, to detect musical measures on a large, diverse body of over 8000 pages of music scores, containing both handwritten and typeset scores. The interesting challenge is that a musical measure can span multiple staves, so detecting it requires a certain amount of understanding of how individual measures are joined into a system. Fortunately, the trained object detectors were capable of learning this very well, and the results look very promising.

Once the individual measures have been detected, they need to be aligned across multiple scores to allow navigation between them. To this end, a second convolutional neural network was trained to compute the similarity between two measures to determine if they contain the same music and should, therefore, be linked. Sequences are matched using Dynamic Time Warping. My contribution to this work was limited to the first challenge: detecting measures.
Simon Waloschek provided me with the body of manually annotated music scores, and I was in charge of training and optimizing a convolutional neural network that is capable of solving this task without human intervention. This part of the work is publicly available on GitHub [Pac19b]. This paper has been accepted for the 20th International Society for Music Information Retrieval Conference 2019 in Delft, The Netherlands.

IDENTIFICATION AND CROSS-DOCUMENT ALIGNMENT OF MEASURES IN MUSIC SCORE IMAGES

Simon Waloschek, Aristotelis Hadjakos
Center of Music and Film Informatics, Detmold University of Music, Germany
{s.waloschek, a.hadjakos}@cemfi.de

Alexander Pacha
Institute of Information Systems Engineering, TU Wien, Austria
alexander.pacha@tuwien.ac.at

© Simon Waloschek, Aristotelis Hadjakos, Alexander Pacha. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Simon Waloschek, Aristotelis Hadjakos, Alexander Pacha. “Identification and Cross-Document Alignment of Measures in Music Score Images”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

ABSTRACT

In the course of editing musical works, musicologists regularly compare multiple sources of the same musical piece, such as composers’ autographs, handwritten copies, and various prints. For efficient comparison, cross-source navigation is essential: it lets editors jump quickly back and forth between multiple sources without losing the current musical position. In practice, measures are first annotated by hand in the individual source images and then related to each other. Our approach automates this time-consuming and error-prone process with the help of deep learning. For this purpose, we train a neural network that automatically finds bounding boxes of all measures in images. A second network is trained to compute the similarity between two measures to determine if they have the same musical content and should, therefore, be linked for navigation. Sequences of outputs from the second network are matched using Dynamic Time Warping to provide the final proposal of measure relationships, so-called concordances. In addition to cross-source navigation, the results can be used to spot structural differences across the sources, which are essential for editorial work, so that musicologists can focus more on analytical tasks.

1. INTRODUCTION

Modern musical editions are the result of a long musicological process. From the composer’s manuscript to the printed music book, a musical work usually undergoes a large number of iterations and minor corrections, occasionally even substantial changes, such as striking or reworking complete parts [1]. Many of these changes are either unintentional (e.g., errors in handwritten copies or typographical errors by publishers) or generally not documented in a transparent manner. Musicologists, therefore, work on this genesis when editing a work and try to record the chronological order and causalities in their edition creation process.

The first step in this process is, therefore, the screening of the source material to identify differences between the various sources of a work. To facilitate this process, links are created between the sources so that editors can quickly switch back and forth between them. Musical measures usually provide an adequate granularity for these links, a feasible compromise between annotation effort and accuracy [29].
Currently, the measures of all sources are manually annotated with bounding boxes and related to each other in a very time-consuming and error-prone way.

We have automated this multi-stage process by first recognizing and sorting measures in score images (both handwritten and typeset) and then linking them according to their musical content. For this purpose, deep learning was used to develop a distance metric in an end-to-end fashion without an intermediate representation. The results can be further processed using classic alignment algorithms from the MIR community such as Dynamic Time Warping (DTW). While DTW-based approaches have achieved sufficient quality for practical use, audio-to-score alignment is still an active field of research [31]. Promising approaches for the synchronization of scans and sound recordings [5, 6] are currently limited to monophonic and piano music and have not yet achieved sufficient accuracy for most real-world scenarios. With the contribution of this paper, we narrow a potential gap in the "audio – symbolic score – image" triangle and offer a new way for measure-accurate alignment across modal boundaries.

2. RELATED WORK

Detecting measures can be seen as a preprocessing step in Optical Music Recognition (OMR). Therefore, it was rarely singled out as a dedicated task. While Pedersoli and Tzanetakis perform document segmentation, they only distinguish between music scores and text blocks [22]. The only research we know of that specifically addresses the automatic extraction of measures is by Vigliensoni et al. [30]. In their work, they attempt to extract measures with a traditional computer vision approach by heuristically finding all bar lines and then joining them into measures. Their approach requires human intervention for each page and straight bar lines to work well.

For retrieval of sixteenth-century musical texts, Crawford et al. [4] have recently proposed a two-step procedure. They run an OMR algorithm to obtain an intermediate format, followed by a second step that uses n-grams and minimal absent words (MAWs) to find duplicates, related texts, or parts that have the same musical material. Neural networks make such intermediate formats partly obsolete and allow for learning bimodal embeddings end-to-end, as shown by Dorfer et al. [5, 6], who correlate the scanned music score with a sound recording. For this purpose, synchronization was considered either a reinforcement learning problem [6] or a metric learning problem [5]. In the metric learning approach, Dorfer et al. use the pairwise ranking loss (also known as triplet loss [26]), which draws triplets from a dataset, each consisting of an anchor, a positive example (picture fits the audio), and a negative example (picture does not fit the audio). This loss function creates an embedding where images and audio with the same content appear close together, while non-matching images and audio are placed relatively far apart. This approach has been used successfully before in other application domains, such as facial recognition [26]. We resort to a similar cost function for metric learning (see Section 4.2).

As the basis for our detection, we use a convolutional neural network (CNN). While CNNs are currently an active field of research for OMR, the most influential approaches come from the research area of computer vision. They are used for many tasks, including image recognition, semantic segmentation, object detection, and instance segmentation.
R-CNN [9] performs object detection by analyzing a large number of heuristically generated region proposals that are classified as background or as one of the classes of interest. Additionally, the bounding box is refined with regression. R-CNN uses a CNN that extracts features for object detection. These features are used in a downstream SVM for classification and regression. Faster R-CNN [23] improves the process by incorporating the region proposal step as well as the classification and regression into the architecture of the neural network.

CNN-based computer vision approaches are largely transferable to OMR and actively used for Music Information Retrieval: Gallego and Calvo-Zaragoza use auto-encoders to remove staff lines [8]. Pacha et al. compare various CNN-based approaches for detecting music symbols in scores [21]. CNNs can also be used for semantic segmentation for staff-line removal, music and text separation, as well as for layout analysis, as shown by Calvo-Zaragoza et al. [3]. Using U-Nets [25], Hajič et al. perform semantic segmentation of handwritten music [10]. Pacha and Calvo-Zaragoza recognize musical objects in mensural notation using region-based CNNs [20]. By learning energy levels that are used as inputs to a watershed algorithm, Tuggener et al. recognize music symbols [28]. In addition to the energy levels, the network also predicts class labels and bounding boxes. Finally, Calvo-Zaragoza and Rizo use convolutional recurrent neural networks trained with a Connectionist Temporal Classification (CTC) loss to recognize musical symbols in monophonic music scores [2]. To simulate non-ideal image conditions, they artificially distort the images.

3. DATA & ANNOTATIONS

The success of Deep Learning approaches largely depends on the amount and diversity of data used during training.
Since no dataset of sufficient size was available for measure recognition or the concordance task, we created a large dataset ourselves in cooperation with musicologists and professional musicians.

Our dataset contains measure annotations that were created manually by musicologists for digital music editions. In most cases, the image sources are high-resolution scans of facsimiles, occasionally supplemented by early music prints and PDFs exported directly from music engraving software. Due to an imbalance between handwritten and typeset scores, we additionally obtained scores from the IMSLP/Petrucci Music Library while paying attention to varying image quality, the engraving mechanism used, and diverse musical content. We complemented our collection with 140 pages from the MUSCIMA++ dataset (footnote 1) [7, 11]. Our data collection has a total of 8 251 pages with 81 124 annotated measures. The distribution according to engraving type and the number of systems per page is given in Table 1. One category is particularly overrepresented: handwritten music scores with just one system per page, because of a large quantity of full orchestral scores from operas by Carl Maria von Weber. Pages with zero systems include book covers, text pages, and prefaces.

                      Pages per engraving type
Systems per page      Handwritten      Typeset
0                             413          113
1                            5627          932
2                             175          553
3                             122          175
4 or more                     102           39
Total pages                  6439         1812

Table 1. Overall distribution of the dataset used.

The accuracy of the measure annotations varies. Since the exact boundaries are not relevant for musicologists, they were recorded only roughly. That is why many bounding boxes contain small overlaps with adjacent measures, as shown in Figure 1.

To annotate the measures in the individual pictures, the Android app Vertaktoid (footnote 2) [18] was used. It allows the annotator to conveniently draw bounding boxes for all measures with a pen directly on the tablet screen.
The results can then be exported to the MEI format [24] and used as ground truth training data.

Data coming from digital music editions are partly provided with concordance annotations between the measures.

Footnote 1: The measure annotations are published as a separate dataset at https://apacha.github.io/OMR-Datasets/#muscima
Footnote 2: https://github.com/cemfi/vertaktoid

Figure 1. Examples of cropped measures originating from different sources of the same work. All measures represent the same musical position, i.e., the same measure, within the work, but are in part extremely diverse in terms of instrumentation, graphic representation, and also image resolution.

4. ALIGNING MEASURE SEQUENCES

Our proposed solution for the given task can be split into three individual parts. First, we have to find the bounding boxes of all measures in the score images. Then we need a metric to compute the similarity between two given measures in terms of musical content. Finally, we have to compute the actual concordances for multiple sources of the same music.

4.1 Optical Measure Recognition

For automatically detecting measures in complete music scores, we propose a machine-learning approach with deep convolutional neural networks and a Faster R-CNN detector [23]. Faster R-CNN has been shown to work well in a range of situations, including detecting music objects [21]. In this case, there is just one class of objects that needs to be detected, and the objects typically cover large portions of the entire image with little overlap. Our implementation is based on the TensorFlow Object Detection API framework [14] and is freely available online (footnote 3).

Footnote 3: https://github.com/OMR-Research/MeasureDetector

We split the dataset randomly into 80% for training, 10% for validation, and 10% for testing. To avoid a bias toward scores with just one system, we categorize the samples into the ten categories depicted in Table 1. From the training set, we use only about 2000 images, drawn in equal numbers from each of these ten categories, which results in some examples being used more than once. The only exception is the images without systems, which are sampled only half as often as the other categories. For the validation and test sets, we use all images from the respective split.

We tested three different backbones: ResNet50, ResNet101 [13], and Inception-ResNet-V2 [27]. We restricted ourselves to these to enable transfer learning by initializing the networks with weights trained on ImageNet, which generally improves the learning process, especially at the beginning. Input images are resized to be no longer than 1024 pixels on the longest edge. The Intersection over Union (IoU) measures how well two bounding boxes overlap. If two predictions are very close, non-maximum suppression filters out the box with the lower score. The IoU threshold is set to 0.6, and a maximum of 600 objects are detected per image. These parameters were derived from a statistical analysis of the entire dataset and cover more than 99.99% of it.

We evaluated the optical measure detection with the commonly used average precision (AP) metric, as defined for the COCO detection challenge [15]. It produces a single number that measures how well objects were detected. A detection is considered a match with the underlying ground truth if the IoU is above a certain threshold. The top-performing model, with an Inception-ResNet-V2 [27] backbone, achieves 78.7% AP (IoU=0.5:0.95) on the test set. A few samples of the detection output are depicted in Figure 2.
Given that the measure recognition step does not necessarily return the measures of a page in the musically correct order, we sort them according to the measure numbering rules outlined by Mexin et al. in [18].

Figure 2. Three samples of the detection results. The neural network is capable of detecting measures robustly in typeset and handwritten scores, regardless of whether they contain piano scores or full orchestral scores. It does make occasional errors, but the majority of measures are recognized correctly.

4.2 Metric Learning

Now that the scans of all scores are divided into individual measures, they have to be compared with each other to identify equivalent measures. Again, we decided to take a deep learning approach to learn such a musical similarity metric between two measures directly from the images. The neural network is trained to compute an embedding for measure images so that similar measures are placed in the proximity of one another in the embedding space. This allows for convenient comparison of two measures by computing their distance, e.g., using the L2 norm.

The idea is based on triplet loss [26]: A pair of equivalent measure images from two different sources is drawn from the list of concordances. We will call them the anchor image and the positive image. Additionally, a negative measure image is drawn from the same source as the positive image, serving as a counterexample, i.e., having no musical relation to the anchor or the positive measure image. Each of these three images is fed separately into the same neural network, resulting in three k-dimensional vectors.
The loss function is defined as

$$L = \max\bigl(d(f_a, f_p) - d(f_a, f_n) + \alpha,\ 0\bigr) \tag{1}$$

with $f_a$, $f_p$, and $f_n$ being the resulting vectors from the network $f$ for the three images and a distance measure $d$. Training with this loss function minimizes the distance from the anchor to the positive image while maximizing the distance between the anchor and the negative image. The additional margin $\alpha$ specifies by how much the anchor-negative distance should at least exceed the anchor-positive distance. Finally, the surrounding max(...) function ensures that the loss never becomes negative.

We chose ResNet50 as the base network and replaced the usual final average pooling and classification layers by a fully connected layer with k-dimensional output. (Other CNN-based networks used for computer vision would most likely work comparably well.) All measure images are resized to 512 × 512 pixels, but the original width and height information is also passed to the network as additional input.

The success of the loss function depends heavily on the sampling strategy for the image triplets, as discussed by Wojke and Bewley in [32]. In our context, there are three specific problems in the dataset:

1. A randomly sampled negative image might accidentally have the same musical content as the two other images. Those cases are not covered in the concordance dataset since not all measures with equal content have to be linked together.
2. Intuitively, it seems beneficial to take the previous or subsequent measure of the positive sample as the negative measure, with the goal of enhancing the contrast between them in terms of increased distance in the embedding space. This would make adjacent measures more distinguishable. But again, the chance of these measures having the same content is higher compared to random sampling.
3. Handwritten sources in particular sometimes exhibit excessive use of measure repeats and other abbreviations, as can be seen in the left part of Figure 1. Such symbols are meaningless if their immediate context is not given.
The first two problems could be solved by manually adding all measures with the same content to the list of concordances. Given the number of images, we decided against doing so and rely on collisions being rare owing to the large amount of data. We also discarded the (perfectly valid) idea of looking at adjacent measures to form the triplets.

The third problem, the presence of measure repeats and abbreviations, has a direct impact on the appropriate choice of the distance metric $d$ in our loss function. When using triplet loss, it is common practice to normalize the embedding vectors. This constraint puts all embeddings on a k-dimensional hypersphere, leading to some advantages for further processing (see [26]). Furthermore, cosine distance is often used to calculate the distances. Both decisions make it impossible to get an embedding vector that is equally distant to all other possible vectors. This very property, however, characterizes the meaning of measure repeats if no context is given. We, therefore, opted for no vector normalization and chose the L2 norm as our distance metric, resulting in

$$L = \sum_{i=1}^{N} \Bigl[ \lVert f_i^a - f_i^p \rVert_2 - \lVert f_i^a - f_i^n \rVert_2 + \alpha \Bigr]_+ \tag{2}$$

for a training batch of size $N$. To speed up training and ensure fast convergence, we select triplets that violate the following constraint:

$$\lVert f_i^a - f_i^p \rVert_2 + \alpha < \lVert f_i^a - f_i^n \rVert_2 . \tag{3}$$

This filter step is performed for each batch during training and makes sure that only those triplets are used that significantly contribute to the learning process. It also prevents the network from overfitting.
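Equations (2) and (3) can be illustrated with a small, framework-free sketch. It assumes that embeddings are plain sequences of floats and uses hypothetical function names; an actual training setup would compute the same quantities on tensors inside the deep learning framework.

```python
import math


def triplet_terms(anchors, positives, negatives, alpha):
    """Per-triplet value of ||f_a - f_p||_2 - ||f_a - f_n||_2 + alpha."""
    terms = []
    for f_a, f_p, f_n in zip(anchors, positives, negatives):
        d_ap = math.dist(f_a, f_p)  # unnormalized L2 distance anchor-positive
        d_an = math.dist(f_a, f_n)  # unnormalized L2 distance anchor-negative
        terms.append(d_ap - d_an + alpha)
    return terms


def batch_triplet_loss(anchors, positives, negatives, alpha):
    """Eq. (2): sum of the hinged per-triplet terms over the batch."""
    terms = triplet_terms(anchors, positives, negatives, alpha)
    return sum(max(t, 0.0) for t in terms)


def violating_triplets(anchors, positives, negatives, alpha):
    """Indices of triplets that violate Eq. (3), i.e. d_ap + alpha >= d_an."""
    terms = triplet_terms(anchors, positives, negatives, alpha)
    return [i for i, t in enumerate(terms) if t >= 0.0]
```

Triplets that already satisfy Eq. (3) have a negative term and therefore contribute zero to Eq. (2), which is why selecting only the violating triplets leaves the loss value of a batch unchanged.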
4.3 Concordance Computation & Manual Adjustments

Given the embedding vectors for all measures of each source of a musical work, we can compare two sources by computing the distances between all measures from one source to the other. The resulting similarity matrices can then be used for dynamic time warping (DTW), as described by Müller in [19], to get an alignment path between the sources as shown in Figure 3.

We implemented the canonical DTW algorithm without any noteworthy modifications to the core. Allowed step sizes inside the similarity matrix during path computation are (0, 1), (1, 0), and (1, 1). It rarely happens that a measure gets divided into two parts at system or page breaks, so we penalized steps along a single axis by a factor of 2 to slightly favor one-to-one mappings of the measures.

The quality of the alignment was evaluated using a dataset with two sources and given ground-truth concordances, as outlined in Table 2. We have decided in favor of this particular dataset because it offers several challenges that occur only rarely in other works:

Split measures: Some measures are split into two parts at page breaks. Therefore, one measure of source A maps to two other measures of source B.
Completely different sections: An entire part of the piece was replaced in source B. Finding the "correct" concordance is impossible.
Additional parts: Source B contains a 16-measure Aria that is not present in the other source.
Missing measure annotations: We also intentionally removed measures from source A to simulate annotation errors.

| | Pages | Measures |
| Source A (typeset) | 250 | 3098 |
| Source B (handwritten) | 532 | 3176 |
| Total | 782 | 6274 |

Table 2. Structure of the evaluation dataset.

In the MIR community, DTW is often used to synchronize audio and/or symbolic score sources with each other [12]. The time resolution of the features in such scenarios is usually in the range of several dozen milliseconds.
Deviations in the alignment path are therefore undesirable, but can often be neglected as long as they do not exceed certain limits. In our context, however, any deviation from the ground truth marks a significant error. We took this into account and defined a very simple score for the overall performance:

$$\text{score} = 1 - \frac{\text{Number of } (x, y) \text{ pairs from alignment not in ground truth}}{\text{Total number of concordances in ground truth}} \tag{4}$$

Our evaluation showed 14 errors in relation to 3079 concordance pairs, resulting in a score of 99.545%.

Figure 3. Interface for inspecting the computed measure concordances. The alignment (white) and ground truth (blue, only available in evaluation dataset) are plotted over the currently visible part of the similarity matrix. Measures of both sources (right) can be compared by moving a cursor within the matrix (green crosshair). A plot at the bottom indicates potentially interesting positions.

As pointed out, the remaining 0.455% error rate still presents a non-negligible problem. Therefore, we developed an interface for manual adjustments to the alignment. Apart from being able to quickly compare the measures from two sources as shown in Figure 3, users can define points in the similarity matrix that have to be part of the alignment path. Each of these points splits the matrix into two parts, and the warping path is computed for each part individually, ensuring that either the beginning or the end of the path matches the desired point. An event plot at the bottom of the matrix helps to identify regions with potential errors by showing where the alignment path is not diagonal, i.e., where it takes a step in the (0, 1) or (1, 0) direction.

The mentioned obstacles for correct alignment have been handled successfully, either by resulting in a correct alignment or, in case of substantial structural differences, by indicating a problem that cannot be solved without human intervention and marking these parts in the plot below the similarity matrix.
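The alignment step and the score of Eq. (4) can be sketched in a few lines. This is a minimal illustration, not the implementation used for the evaluation: it assumes a precomputed distance matrix given as nested lists, and it interprets the factor-of-2 penalty as a multiplier on the local cost of (0, 1) and (1, 0) steps, which is only one plausible reading. All function names are hypothetical.

```python
def dtw_path(cost, axis_penalty=2.0):
    """Warping path through `cost` with steps (1, 1), (1, 0), and (0, 1)."""
    n, m = len(cost), len(cost[0])
    acc = [[float("inf")] * m for _ in range(n)]   # accumulated cost
    back = [[None] * m for _ in range(n)]          # predecessor of each cell
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # diagonal step: one-to-one mapping, no penalty
                candidates.append((acc[i - 1][j - 1] + cost[i][j], (i - 1, j - 1)))
            if i > 0:            # step along the first axis only, penalized
                candidates.append((acc[i - 1][j] + axis_penalty * cost[i][j], (i - 1, j)))
            if j > 0:            # step along the second axis only, penalized
                candidates.append((acc[i][j - 1] + axis_penalty * cost[i][j], (i, j - 1)))
            acc[i][j], back[i][j] = min(candidates)
    # Follow the predecessors from the last cell back to the first one.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return path


def alignment_score(path, ground_truth):
    """Eq. (4): 1 - (alignment pairs not in ground truth) / (ground-truth pairs)."""
    truth = set(ground_truth)
    errors = sum(1 for pair in path if pair not in truth)
    return 1.0 - errors / len(truth)
```

With a distance matrix in which the first measure of one source is close to the first two measures of the other source, the path takes a single-axis step there, which corresponds to the split-measure case described above.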
This alignment and adjustment step has to be repeated for each source with regard to a master source of choice. The corrected alignment data can then finally be imported into the tools used by musicologists for their editorial work.

5. CONCLUSIONS AND FUTURE WORK

In this paper we proposed an approach to automate the tedious task of annotating and linking measures in heterogeneous score images, thereby allowing for cross-source navigation between measures without losing the current musical position. We used deep learning to find bounding boxes of measures in score images, learned a distance metric for measures, and used that to align measures from various sources, effectively linking equivalent musical positions across sources. The evaluation showed that our approach is feasible and solves a real-world problem while still retaining complete flexibility in case editors need to make manual adjustments, thanks to an interactive correction tool.

The presented solution still does not cover all possible situations that might occur in the editorial process. If the measure sequences to be compared have a different order, the alignment fails for these parts, if not completely. We will address this specific problem in the future by identifying such passages and proposing a reasonable re-ordering.

Having a musically meaningful distance metric for measures also allows closing the gap between score images and symbolic scores. The latter can be rendered with suitable engraving software and divided into individual measures, followed by the steps of our alignment pipeline. Since audio can also be rendered from symbolic scores, alignments between all three modalities are possible.
Another interesting application of our distance metric is the ability to visualize datasets in image fields as shown in Figure 4. Using dimensionality reduction algorithms such as t-SNE [16] or UMAP [17], the measures are positioned such that musically similar measures appear close to one another, giving new insight into a musical piece but also into the inner workings of the distance metric. For example, the visualization shows that measure repeats are placed almost in the center, indicating that their learned embedding retains the musical property of being close to basically every other measure in the embedding space.

Figure 4. 46 344 measure images from 15 different sources of the same piece are projected into a two-dimensional manifold with the UMAP algorithm. The map is interactively zoomable.

6. REFERENCES

[1] Benjamin W. Bohl, Axel Berndt, Simon Waloschek, and Aristotelis Hadjakos. Dem Igel Sitte lehren... Musikedition: von der digitalen Verfügbarkeit zur aktiven Nutzung. In Kristina Richts and Peter Stadler, editors, „Ei, dem alten Herrn zoll’ ich Achtung gern’“ – Festschrift für Joachim Veit zum 60. Geburtstag, chapter 12, pages 141–163. Allitera Verlag, Munich, Germany, 2016.
[2] Jorge Calvo-Zaragoza and David Rizo. Camera-primus: Neural end-to-end optical music recognition on realistic monophonic scores. In 19th International Society for Music Information Retrieval Conference, pages 248–255, Paris, France, 2018.
[3] Jorge Calvo-Zaragoza, Gabriel Vigliensoni, and Ichiro Fujinaga. A machine learning framework for the categorization of elements in images of musical documents. In 3rd International Conference on Technologies for Music Notation and Representation, A Coruña, Spain, 2017. University of A Coruña.
[4] Tim Crawford, Golnaz Badkobeh, and David Lewis. Searching page-images of early music scanned with OMR: A scalable solution using minimal absent words.
In 19th International Society for Music Information Retrieval Conference, pages 233–239, Paris, France, 2018.
[5] Matthias Dorfer, Jan Hajič jr., Andreas Arzt, Harald Frostel, and Gerhard Widmer. Learning audio–sheet music correspondences for cross-modal retrieval and piece identification. Transactions of the International Society for Music Information Retrieval, 1(1):22–33, 2018.
[6] Matthias Dorfer, Florian Henkel, and Gerhard Widmer. Learning to listen, read and follow: Score following as a reinforcement learning game. In 19th International Society for Music Information Retrieval Conference, pages 784–791, Paris, France, 2018.
[7] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. CVC-MUSCIMA: A ground-truth of handwritten music score images for writer identification and staff removal. International Journal on Document Analysis and Recognition, 15(3):243–251, 2012.
[8] Antonio-Javier Gallego and Jorge Calvo-Zaragoza. Staff-line removal with selectional auto-encoders. Expert Systems with Applications, 89:138–148, 2017.
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2016.
[10] Jan Hajič jr., Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. Towards full-pipeline handwritten OMR with musical symbol detection by u-nets. In 19th International Society for Music Information Retrieval Conference, pages 225–232, Paris, France, 2018.
[11] Jan Hajič jr. and Pavel Pecina. The MUSCIMA++ dataset for handwritten optical music recognition.
In 14th International Conference on Document Analysis and Recognition, pages 39–46, Kyoto, Japan, 2017.
[12] Yun Hao. Real-time audio to score alignment (a.k.a score following). https://www.music-ir.org/mirex/wiki/2019:Real-time_Audio_to_Score_Alignment_(a.k.a_Score_Following), 2019.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[14] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014.
[16] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[17] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[18] Yevgen Mexin, Aristotelis Hadjakos, Axel Berndt, Simon Waloschek, Anastasia Wawilow, and Gerd Szwillus. Tools for annotating musical measures in digital music editions. In 14th Sound and Music Computing Conf. (SMC-17), Espoo, Finland, 2017. Aalto University.
[19] Meinard Müller. Information Retrieval for Music and Motion. Springer-Verlag, Berlin, Heidelberg, 2007.
[20] Alexander Pacha and Jorge Calvo-Zaragoza. Optical music recognition in mensural notation with region-based convolutional neural networks. In 19th International Society for Music Information Retrieval Conference, pages 240–247, Paris, France, 2018.
[21] Alexander Pacha, Jan Hajič jr., and Jorge Calvo-Zaragoza. A baseline for general music object detection with deep learning. Applied Sciences, 8(9):1488–1508, 2018.
[22] Fabrizio Pedersoli and George Tzanetakis. Document segmentation and classification into musical scores and text. International Journal on Document Analysis and Recognition, 19(4):289–304, 2016.
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. 2015.
[24] Perry Roland. The music encoding initiative (MEI). In 1st International Conference on Musical Applications Using XML, pages 55–59, 2002.
[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 234–241. Springer, 2015.
[26] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[27] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
[28] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, and Thilo Stadelmann. Deep watershed detector for music object recognition. In 19th International Society for Music Information Retrieval Conference, pages 271–278, Paris, France, 2018.
[29] Joachim Veit and Kristina Richts. Current status and perspectives of MEI usage in musicology and in libraries. Bibliothek Forschung und Praxis, 42(2):292–301, 2018.
[30] Gabriel Vigliensoni, Gregory Burlet, and Ichiro Fujinaga. Optical measure recognition in common music notation.
In 14th International Society for Music Information Retrieval Conference, Curitiba, Brazil, 2013.
[31] S. Waloschek and A. Hadjakos. Driftin’ down the scale: Dynamic time warping in the presence of pitch drift and transpositions. In 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.
[32] Nicolai Wojke and Alex Bewley. Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 748–756. IEEE, 2018.

CHAPTER 7
Music Notation Graph Construction

To complete the proposed OMR pipeline, two steps remain after the music objects have been successfully detected in the image: constructing the notation graph and exporting it into the desired output format. In the article “Learning Notation Graph Construction for Full-Pipeline Optical Music Recognition,” Jorge Calvo-Zaragoza, Jan Hajič jr., and I investigated how the construction of a notation graph can be formulated as a machine-learning problem and thus be solved robustly and efficiently [PCZHj19].
The foundation for this work is the (music notation) graph representation inside the OMR pipeline, which consists of three things: the vertices, which represent the primitives appearing in the image; the syntactic edges, which relate these primitives to each other; and the precedence edges, which specify the order of events and are crucial when recognizing polyphonic music with simultaneous events. The vertices are created as a result of the music object detection stage. The edges, on the other hand, are still missing. The initial attempt to build a binary classifier that decides whether a pair of nodes has an edge or not showed room for significant improvement [HjDWP18]. We extended the initial work by adding a grammar that eliminates the proposal of illegal pairs, such as between two rests. The input of the neural network was changed to three channels: one for the image patch that contains the two objects in question, one for the binary mask of the first object, and one for the binary mask of the second object. The results improved significantly, and the best model achieves an F1-score of over 95%.

This paper has been accepted for the 20th International Society for Music Information Retrieval Conference 2019 in Delft, The Netherlands.

LEARNING NOTATION GRAPH CONSTRUCTION FOR FULL-PIPELINE OPTICAL MUSIC RECOGNITION

Alexander Pacha
Institute of Information Systems Engineering, TU Wien, Austria
alexander.pacha@tuwien.ac.at

Jorge Calvo-Zaragoza
Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain
jcalvo@dlsi.ua.es

Jan Hajič jr.
Institute of Formal and Applied Linguistics, Charles University, Prague
hajicj@ufal.mff.cuni.cz

ABSTRACT

Optical Music Recognition (OMR) promises great benefits to Music Information Retrieval by reducing the costs of making sheet music available in a symbolic format. Recent advances in deep learning have turned typical OMR obstacles into clearly solvable problems, especially the stages that visually process the input image, such as staff line removal or detection of music-notation objects. However, merely detecting objects is not enough for retrieving the actual content, as music notation is a configurational writing system, where the semantics of a primitive is defined by its relationship to other primitives. Thus, OMR systems must employ a notation assembly stage to infer such relationships among the detected objects. So far, this stage has been addressed by devising a set of predefined rules or grammars, which hardly generalize well. In this work, we formulate the notation assembly stage from a set of detected primitives as a machine learning problem. Our notation assembly is modeled as a graph that stores syntactic relationships among primitives, which allows us to capture the configuration of symbols in a music-notation document. Our results over the handwritten sheet music corpus MUSCIMA++ show 95.2% precision, 96.0% recall, and an F-score of 95.6% in establishing the correct syntactic relationships. When inferring relationships on data from a music object detector, the model achieves 93.2% precision, 91.5% recall, and an F-score of 92.3%.

1. INTRODUCTION

Optical Music Recognition is the field of research that investigates how to read music notation in documents computationally. This technology enables many computational tasks that, otherwise, could not be performed directly on the music sources themselves [17].
One interesting application of OMR is concerned with reconstructing the notes encoded in the music-notation document, also referred to as replayability [22]. In particular, the objective of the replayability application is to recover the pitches, onsets, durations, and velocities of notes from a document and export them into a symbolic representation. This symbolic representation (e.g., a MIDI file) is already a very useful abstraction of the music itself and allows for plugging in a wide range of music information retrieval tools. However, despite prolonged efforts, the replayability application is still under research [4, 7, 16, 36].

© Alexander Pacha, Jorge Calvo-Zaragoza, Jan Hajič jr. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Alexander Pacha, Jorge Calvo-Zaragoza, Jan Hajič jr. “Learning Notation Graph Construction for Full-Pipeline Optical Music Recognition”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

Given the wealth of information that is contained in a music score, the task of decoding its content is usually addressed by dividing the process into smaller stages that represent limited challenges. The general pipeline, proposed first by Bainbridge and Bell [3] and later refined by Rebelo et al.
[29], is considered a de facto standard, which organizes the process into four main blocks: i) preprocessing, which works over the input image to ease further steps and make the system more robust; ii) music object detection, which is in charge of retrieving and classifying all objects and glyphs of the image; iii) notation assembly, which must infer the relationships among the detected objects to reconstruct the music notation itself; and iv) encoding, which exports the symbolic reconstruction into the desired format, typically MIDI for replayability or an XML-based encoding such as MusicXML [15] or MEI [19] for further computational processing.

As our starting point towards completing the OMR pipeline, we assume that the music object detection stage can be solved reliably, which allows us to investigate how to deal with the later stages. In this paper, we want to focus in particular on the third stage, which is responsible for the notation assembly. Although previous work exists, most approaches are based on predefined rules that hardly generalize and that only work for a limited set of scenarios. In contrast, we propose a well-principled machine learning approach, which addresses the problem in a generalizable way, provided that suitable training data is available.

2. RELATED WORK

Most literature on OMR focuses on the first stages of the pipeline. This comes as no surprise: if one struggles with reliably detecting music objects in an image, it is understandable that the subsequent stages that build on top of that detection are often neglected. With the appearance of deep
learning in OMR, however, many steps that traditionally produced suboptimal results, such as staff-line removal or symbol classification, have seen drastic improvements [14, 26] and are no longer considered obstacles for OMR development.

Deep learning also caused some steps to become obsolete or collapse into a single (bigger) stage. For instance, the music object detection stage, which was traditionally separated into segmentation plus classification stages, is currently addressed in a single step. Convolutional neural networks have been shown to be able to deal with the music object detection stage holistically, without having to remove staff lines at all [25]. A compelling advantage is the capability of these models to be trained in a single step by merely providing pairs of images and positions of the music objects to be found, eliminating the preprocessing step altogether [24, 35]. This issue has been the subject of intense recent research. A comparison of existing approaches to holistic music object detection is presented in the work of Pacha et al. [27].

Since the beginning of OMR research, there have been attempts to complete the full pipeline, including the notation assembly stage. Below, we introduce some particular proposals to perform this stage that can be found in the existing literature. They can be broadly divided into grammar-based approaches and approaches that rely on heuristics and pre-defined rules.

2.1 Grammar-based approaches

Formal grammars represent the most widely used description of music notation. This feels natural, given that music notation has syntactic rules and hierarchical structures that invite such a formalization. These grammars are manually built to describe the expected relationships among music-notation objects and then used to reconstruct the music notation from the detected primitives [1–3, 5, 6, 30, 33].
The 2D nature of music notation also inspired graph grammars, as in the work of Fahmy and Blostein [12]. A prominent example of this approach is the DMOS system, proposed by Coüasnon et al. [8, 9], which uses a definite clause grammar for describing the relations between graphical objects on two levels: a graphical one that assists the recognition of symbols and a syntactic one, which introduces the musical semantics into the process.

2.2 Heuristic approaches

The other set of approaches relies on ad hoc rules for the music notation at hand. This includes assumptions about the configuration and position of the occurring primitives to reconstruct composite symbols and the notation graph [10, 23, 28, 34]. Rossant et al. [31] additionally considered fuzzy modeling, which allows for self-correction during the recognition [32]. Their system evaluated different hypotheses of recognized symbols to verify the compatibility between them.

3. NOTATION ASSEMBLY

The related works clearly show a lack of machine learning approaches. This work aims to fill that gap by proposing a formulation of the notation assembly stage based on machine learning models. The advantage of such models is that they provide greater flexibility, since they can vary their behavior by just changing the provided training set. This is especially interesting for OMR, where there is a great diversity of scenarios depending on the epoch or type of composition of the music scores.

The conventional OMR pipeline foresees that the notation assembly stage infers the relationships among previously detected music objects to retrieve the information necessary to derive the sequence of notes and rests. Our approach understands that music notation can be modeled as a directed graph $G = (V, T)$, hereafter referred to as Music Notation Graph (MuNG). $V$ represents the set of vertices, where $\zeta(v)$, $v \in V$ is the label associated with a vertex.
T represents the set of directed edges, such that ti = (v1, v2), with ti ∈ T and v1, v2 ∈ V, denotes an edge from vertex v1 to vertex v2. The primitives that make up the music notation, such as noteheads or stems, are modeled as vertices of this graph, while the relationships between these symbols are modeled by the edges. In our MuNG, the edges are not labeled, but there are two types of relationships:

• Syntactic edges that relate elements syntactically. This includes relationships between primitives that make up a composite symbol, such as an eighth note, which consists of a notehead, a stem, and a flag or beam, as well as general relationships, e.g., between an accidental and the notehead that is affected by it.

• Precedence edges that specify the temporal order between notes. In most cases, the position on the horizontal axis is sufficient to infer this kind of relationship; however, for polyphonic music, a more sophisticated mechanism is needed to handle ambiguous situations.

We can, therefore, define the set of edges as T = S ∪ P, where S is the set of edges that define the syntactic relationships and P is the set of edges that define the precedence relationships. A graphical representation of MuNG is shown in Fig. 1.

The primary goal of our work is to train a machine learning model to construct such a MuNG G from a music score image.

4. LEARNING MUSIC NOTATION GRAPH ASSEMBLY

There are existing algorithms that are capable of dealing with the input image and retrieving a set of detected music-notation primitives. In other words, these algorithms process the input and provide the set of vertices V, along with their associated labels and bounding boxes. In order to complete the OMR pipeline for replayability, we also need to recover the set of edges T.
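The graph definition above can be made concrete with a small data structure. The following is a minimal sketch, not the representation used by the authors' implementation: the class names `Vertex` and `MuNG`, the bounding-box layout, and the vertex ids are all hypothetical and chosen only for illustration. It keeps the two edge sets S and P separate and exposes T as their union.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    """A music-notation primitive; `label` plays the role of zeta(v)."""
    id: int
    label: str
    bbox: tuple  # assumed layout: (left, top, right, bottom) in pixels


@dataclass
class MuNG:
    """Directed graph G = (V, T) with T = S ∪ P (hypothetical sketch)."""
    vertices: dict = field(default_factory=dict)   # V, keyed by vertex id
    syntactic: set = field(default_factory=set)    # S: (from_id, to_id) pairs
    precedence: set = field(default_factory=set)   # P: (from_id, to_id) pairs

    def add_vertex(self, vertex):
        self.vertices[vertex.id] = vertex

    def add_edge(self, src, dst, kind="syntactic"):
        # Edges are unlabeled; only the set they belong to distinguishes them.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be vertices of the graph")
        if kind == "syntactic":
            self.syntactic.add((src, dst))
        elif kind == "precedence":
            self.precedence.add((src, dst))
        else:
            raise ValueError(f"unknown edge kind: {kind}")

    @property
    def edges(self):
        """T = S ∪ P."""
        return self.syntactic | self.precedence


# An eighth note (notehead, stem, flag) followed by a second notehead.
graph = MuNG()
graph.add_vertex(Vertex(0, "notehead-full", (10, 40, 30, 56)))
graph.add_vertex(Vertex(1, "stem", (28, 0, 31, 48)))
graph.add_vertex(Vertex(2, "8th_flag", (30, 0, 42, 25)))
graph.add_vertex(Vertex(3, "notehead-full", (80, 32, 100, 48)))
graph.add_edge(0, 1)                      # notehead -> stem
graph.add_edge(0, 2)                      # notehead -> flag
graph.add_edge(0, 3, kind="precedence")   # first note precedes second note
```

Keeping S and P in separate sets mirrors the later decision to infer them with independent models, while `edges` recovers the combined set T whenever the complete graph is needed.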
Figure 1: Graphical representation of a Music Notation Graph in a selected excerpt of music notation: vertices are highlighted with transparent yellow bounding boxes around the music-notation primitives, syntactic edges are shown as transparent cyan lines, and precedence edges are shown as transparent purple lines connecting the noteheads.

We propose a principled way of inferring T that does not resort to a set of fixed rules but uses machine learning instead. Our system learns to establish these relationships from a suitably annotated training set, so that the rules are implicitly modeled by the machine learning model.

The edges of the set T are unlabeled and binary in nature, i.e., for each pair of vertices, a relationship either exists or not. Formally speaking, the inference of these relationships can be formulated as a function f : V × V → {0, 1}. However, given their different nature, the edge sets S and P are inferred by independent models. To learn the functions fS and fP for the edges of S and P, respectively, we propose to train binary classifiers that receive two vertices and predict whether a relationship must be established between them or not. Applied naively, this requires estimating the potential relationship between each pair of symbols, which entails high computational costs. However, most of these relationships are clearly infeasible. Since the music object detection stage also retrieves some associated information, such as the label ζ(v) associated with each vertex and the bounding box of that object in the input score image, we can use this information to filter edges by two criteria:

1. An edge is only feasible if the distance between the bounding boxes of its vertices falls below a certain threshold t.
In other words, two vertices that are too far apart cannot be related. 2. An edge is only feasible if the labels of its associ- ated vertices are “compatible”, e.g., a notehead with a stem. This eliminates a large number of incom- patible combinations, such as an edge between an accidental and a rest. The compatibility map is a fixed list of vertex pairs that, according to the syntax of modern music notation, can hold a relationship to each other. Then, given two vertices v1 and v2, for which their edge is declared feasible, we train a deep convolutional neural network to predict whether there must be an edge from v1 to v2 or not. We generate a multi-channel image with a fixed size that serves as input features for the neural net- work, which consists of: • Channel 1: the patch of the input score image that is centered at the objects represented by v1 and v2. • Channel 2: the binary mask of the object v1 • Channel 3: the binary mask of the object v2 The required information to generate these multi- channel images can be obtained from the bounding boxes of v1 and v2, which are expected to be generated during the preceding music object detection stage. Note, that the masks for channel 2 and 3 are obtained from the bound- ing boxes and the underlying image, which means that the masks can (partially) include other objects as well un- less the exact masks are provided via pixelwise segmenta- tion [16, 35]. The network is then fed with this 3-channel image and trained to predict 1 if there should be a relationship be- tween the vertices, and 0 otherwise. Visualizations of the input images are given in Fig. 2. 4.1 Dataset To carry out our experiments we need a corpus, which has annotations for both the individual symbols as well as their relationships. Currently, the only publicly avail- able dataset which fulfills this requirement is the MUS- CIMA++ dataset [18] of handwritten music notation. 
It provides symbol-level annotations as well as relationship annotations for 140 out of 1 000 images from the CVC-MUSCIMA dataset [13]. The MUSCIMA++ dataset contains 91 254 annotated symbols, consisting of both notation primitives and higher-level notation objects, such as key signatures or time signatures, as well as 82 247 explicitly marked relationships between symbol pairs.

Unfortunately, the precedence relationships between notes are not included in the MUSCIMA++ dataset, so our experiments consider only the syntactic edges. However, the formulation is the same for both kinds of edges, so the proposed approach should carry over to precedence edges as well.

4.2 Relationship Reconstruction

For learning the relationships, we train a convolutional neural network in PyTorch with five consecutive blocks, each consisting of a convolution, batch normalization, a non-linearity (ReLU), and max-pooling, before going into a fully connected layer with a single output neuron followed by a sigmoid activation function that produces the final estimation. The network has 28 865 parameters in total. We use the Binary Cross-Entropy loss and train with the Adam optimizer [20], stopping as soon as the validation performance has not improved for ten epochs.

The data-loading routine presents the biggest challenge because it has to construct the multi-channel images as described in Sect. 4. To efficiently generate the set of vertex pairs, we compute the pairwise distance between all objects in an image and then filter them considerably by the distance and compatibility criteria (see Sect. 4). The distance threshold was set to t = 200 pixels, which covers most valid edges in the MUSCIMA++ dataset. Valid relationships between objects that are further apart than 200 pixels are extremely rare and were neglected in favor of computational efficiency. Our compatibility map contains 225 valid combinations of primitives. To improve the performance even further and simplify the classification task, the input image for the neural network is cropped to a sub-image of 512 × 256 pixels (width × height), containing the two objects of interest at its center. Both the distance threshold and the sub-image dimensions are hyper-parameters that are dataset-dependent but can be obtained by running a statistical analysis on the used dataset.

Figure 2: Three samples of images that are used during training. (a) A positive example of two objects that are related. (b) A negative example of two objects that are unrelated. (c) A hard negative example of a dot that could be related to the notehead, but is not. The mask given in channel 2 is shown as a bright green overlay and the mask from channel 3 as a cyan overlay.

We split the 140 images of the dataset into 60 % training data, 20 % validation data, and 20 % test data. In each epoch, the network is trained on approximately 250 000 images of candidate pairs. Approximately 25 % of the candidates are positive examples. The best results were obtained after just 12 epochs, before the network started to overfit and the validation performance declined. The source code is publicly available on GitHub (https://github.com/OMR-Research/MungLinker).

4.3 Music Object Detection

Since the notation assembly stage begins after the music objects have been detected in the score image, we also wanted to evaluate how well our approach works on actual detection results. To obtain such results, we resort to a state-of-the-art music object detector as proposed by Pacha et al. [25] with a minor modification: while we do divide the full page into sub-images containing one stave each, we do not see the need for cutting the images any further. The model selection and training procedure remain unchanged. We split the dataset into 100 images for training, 20 images for validation, and 20 images for testing, as proposed by the authors of the MUSCIMA++ dataset. The improved implementation is publicly available (https://github.com/apacha/MusicObjectDetector-TF).

We evaluate the trained model on the test set for the stave-wise individual images and report the Mean Average Precision (mAP) as defined for the COCO challenge [21], a unified metric commonly used for object detection tasks. The trained model achieves 69.5 % mAP. For comparison, we also report a mAP of 93.3 % when using the mAP as defined for the PASCAL VOC challenge [11], which was used in the original paper. Finally, the images are merged into full-page results, upon which we achieve 34.5 % mAP / 45.2 % w-mAP (see footnote 3) with the COCO definition and 53.8 % mAP / 80.9 % w-mAP with the PASCAL definition. As our main focus is on learning relationships and not on music object detection, we do not go into further details on these numbers. However, we want to point out that the COCO metric is very strict and probably underestimates the performance of the music object detector (see Fig. 3 for an example output).

4.4 Evaluation Protocol

Once the music objects have been detected and their relationships established, the system can produce a complete MuNG that can be compared with the reference MuNG provided as ground truth. However, it is necessary to first establish the correspondences between vertices from the prediction and the ground truth. To do so, we assume that a detected object v1 corresponds to a ground-truth object v2 if they share the same class, ζ(v1) = ζ(v2), and the Intersection over Union (IoU) of their bounding boxes exceeds 50 %. Once the vertices of the ground truth are matched with the detected objects, it is possible to compute the statistics.
If an established relationship is correct, it is considered a true positive (TP); if an established relationship is incorrect, it is considered a false positive (FP); and, if an expected relationship is not predicted, it is considered a false negative (FN). Then, we can compute precision (P), recall (R), and F-score (F1) metrics:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R} \]

P measures how reliable the established relationships are, whereas R measures the ability of the model to retrieve as many relationships as possible. F1 summarizes both metrics with a single value.

3 Weighted Mean Average Precision is the Mean Average Precision, weighted by the frequency of the occurring classes, which is higher because frequent classes yielded better results than rare ones.

Figure 3: Sample output of the improved music object detector. Each detected object v has a box around it, with the color representing the class ζ(v) of the particular object, e.g., light green for full-noteheads.

Note that, although our evaluation is primarily focused on the relationships between objects, the used metrics are affected by the performance of the music object detector. Errors from earlier stages of the OMR process propagate to later stages. So if musical objects were missed, their relationships are counted as false negatives. To account for this, we evaluate our model in two ways:

• over a hypothetical set of perfect detections, which we can extract from the ground truth of the corpus, and

• over the result of an actual music object detector, specifically using the state-of-the-art model described in Sect. 4.3.
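The precision, recall, and F-score defined in this section can be computed directly from two sets of edges. The sketch below assumes that both the predicted and the expected edges are expressed as `(from_id, to_id)` pairs over ground-truth vertex ids, i.e., after the vertex matching has been applied; the function name `edge_metrics` is hypothetical.

```python
def edge_metrics(predicted_edges, expected_edges):
    """Return (precision, recall, f1) for two collections of directed edges."""
    predicted = set(predicted_edges)
    expected = set(expected_edges)
    tp = len(predicted & expected)   # established and correct
    fp = len(predicted - expected)   # established but incorrect
    fn = len(expected - predicted)   # expected but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


expected = {(0, 1), (0, 2), (3, 4), (3, 5)}
predicted = {(0, 1), (0, 2), (3, 4), (6, 7)}
precision, recall, f1 = edge_metrics(predicted, expected)
```

With three correct edges, one spurious edge, and one missed edge, all three values equal 0.75. Because the expected set is taken from the ground truth, an edge whose endpoint was never detected simply remains in `expected - predicted` and is counted as a false negative, which is the behavior described above.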
These settings allow us to answer two questions: Does the proposed approach for reconstructing the MuNG with a machine learning model work at all? If yes, how well does the system perform in a real-world scenario, when confronted with (imperfect) object detector results instead of the perfect ground-truth bounding boxes?

4.5 Results

The main objective of our work is to demonstrate that the notation assembly stage can be formulated as a machine learning task. The main results of our experiments are given in Table 1. It can be observed that the proposed approach is highly effective: in all cases, values above 90 % are reported.

When starting from ground-truth music object detection, our model yields P = 95.2 %, R = 96.0 %, and F1 = 95.6 %, which indicates a successful approach to completing the OMR pipeline. When starting from the actual results of a state-of-the-art detector, performance decreases slightly to P = 93.2 %, R = 91.5 %, and F1 = 92.3 %. We think this is because the location of the objects is not always exact (leading to a lower P) and missing symbols cause relationships to be irrecoverable (leading to a lower R).

Graph Edges / Relationships    Precision    Recall    F-Score
Perfect Detection              95.2 %       96.0 %    95.6 %
Real Detector                  93.2 %       91.5 %    92.3 %

Table 1: Overall performance of the proposed machine learning model to reconstruct syntactic edges of the Music Notation Graph (MuNG), given hypothetically perfect detection results (top row), and given results from a state-of-the-art detector (bottom row).

In order to provide more experimental insights, Table 2 reports the 10 out of the 225 compatible combinations of relationships that are most common in the MUSCIMA++ dataset. As might be expected, the notehead primitives are involved in all of these frequent combinations. In this regard, our model obtains nearly optimal results for these over-represented cases.
Note that these relationships are particularly relevant for decoding the notes that appear in the score. When comparing the individual results to the overall results in Table 1, the discrepancy can be explained by looking at the remaining 215 combinations that are not shown. Many of these have a much lower F1, probably because they are under-represented in the dataset.

From              To              Candidate pairs in the dataset    F-Score on the test set
notehead-full     stem            158064                            99.5 %
notehead-full     beam            61253                             98.7 %
notehead-full     leger_line      47503                             98.1 %
notehead-full     slur            24738                             96.4 %
notehead-full     8th_flag        12877                             97.7 %
notehead-full     sharp           12563                             97.5 %
notehead-full     duration-dot    12305                             96.7 %
notehead-empty    stem            9488                              100.0 %
notehead-full     staccato-dot    8628                              96.8 %
notehead-full     natural         7160                              98.7 %

Table 2: Overview of the ten most common combinations of object pairs, along with the number of generated candidate pairs in the dataset, as seen by the network. The last column contains the F-scores that were reported for the individual combinations when evaluating the trained model on the test set, containing the ground truth of music primitives v.

5. CONCLUSION AND OUTLOOK

In this work, we study how to complete the OMR pipeline, building on previous efforts to detect the music objects within the input image. Our approach seeks the construction of a music notation graph that stores the information of the notation primitives as well as their syntactic and precedence relationships. We propose a machine learning model that can predict whether two primitives are related to each other or not.
Results over the set of syntactic relationships from the handwritten sheet music dataset MUSCIMA++ show that our approach is very effective. We obtain success rates close to the optimum when establishing the correct rela- tionships from the ground-truth primitives (F1 = 95.6%). When re-evaluating the results starting from the primi- tives detected by a state-of-the-art music object detector, a slightly lower performance can be observed (F1 = 92.3%). These figures indicate that the notation assembly stage of the OMR pipeline can be solved reliably with a machine learning model, given a curated set of annotated scores. Comparing our approach to existing methods is extremely difficult, if not impossible, because: • most existing solutions are black boxes with closed source-code, or there is no available implementation at all, • only a few systems are capable of handling hand- written modern notation, and • it is unclear how to compare the music notation as- sembly stage between two different systems, espe- cially given that the notation graph is only an inter- mediate representation. So, although the results are promising, we still see many interesting avenues for further research. For instance, by adding data augmentation during training to make the no- tation assembly model more robust against variations in the bounding box retrieval of the first stage. Also, we plan to look into providing other meaningful features to the net- work, such as the class labels ζ(v) of the involved prim- itives v ∈ V . Furthermore, we observed that the fixed- sized input patch given to the network is often covering a much larger area than required to contain the objects of in- terest, especially when they are very close (see Fig. 2c). This could be handled by using size-independent neural network layers such as Global Pooling, instead of flatten- ing the features and feeding them into a fully-connected layer, allowing us to adjust the input patch for each sam- ple. 
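The idea behind the Global Pooling layers mentioned above can be illustrated without any deep learning framework. The following sketch is a plain-Python stand-in, not the layer of any particular library, and the function name `global_average_pool` is hypothetical. It reduces every channel of a feature map to its mean, so the output length depends only on the number of channels and not on the spatial size of the input patch.

```python
def global_average_pool(feature_maps):
    """Reduce C feature maps of arbitrary height and width to C values.

    `feature_maps` is a list of channels; each channel is a list of rows,
    and each row is a list of numbers.
    """
    pooled = []
    for channel in feature_maps:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two channels computed from a small patch (2 x 3) ...
small_patch_features = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0, 0], [0, 0, 6]],
]
# ... and two channels computed from a larger patch (4 x 8).
large_patch_features = [
    [[1.0] * 8 for _ in range(4)],
    [[2.0] * 8 for _ in range(4)],
]

small_vector = global_average_pool(small_patch_features)
large_vector = global_average_pool(large_patch_features)
```

Both calls return a vector with one entry per channel, so a subsequent fully connected layer would see the same input size in either case. Flattening, by contrast, would produce 12 values for the small patch and 64 for the large one, which is why a flattened representation ties the network to one fixed patch size.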
We also believe that the notation assembly stage could benefit from having a broader set of hypotheses about the objects detected in the previous stage, instead of a fixed set of proposals. State-of-the-art music object detectors are based on statistical neural models that are able to provide a probability distribution over the whole set of possible de- tection hypotheses. When it comes to recognizing, we are typically interested in the most likely hypothesis—the one that is proposed as an answer—forgetting the other ones. However, it is certainly interesting to exploit this statistical modeling: the notation assembly algorithm could establish relationships that are more logical a priori, although the objects involved have a lower probability according to the object detector. These types of approaches have yet to be explored in the field of OMR. And finally, for completing the OMR pipeline, the en- coding stage is still missing. However, we see two benefits of the notation graph representation: the encoding can be implemented by experts in music encoding that are pro- ficient in a particular format and given a complete graph representation, there is no restriction on the actual output format because the graph contains all the information that is present in the image. 6. REFERENCES [1] Alfio Andronico and Alberto Ciampa. On automatic pattern recognition and acquisition of printed music. In International Computer Music Conference, Venice, Italy, 1982. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek [2] David Bainbridge. Extensible optical music recogni- tion. PhD thesis, University of Canterbury, 1997. [3] David Bainbridge and Tim Bell. The challenge of opti- cal music recognition. Computers and the Humanities, 35(2):95–121, 2001. 
[4] Arnau Baró, Pau Riba, Jorge Calvo-Zaragoza, and Ali- cia Fornés. From optical music recognition to hand- written music recognition: A baseline. Pattern Recog- nition Letters, 123:1–8, 2019. [5] Pierfrancesco Bellini, Ivan Bruno, and Paolo Nesi. Op- tical music recognition: Architecture and algorithms. In Interactive Multimedia Music Technologies, pages 80–110. IGI Global, Hershey, PA, USA, 2008. [6] Dorothea Blostein and Henry S. Baird. A critical sur- vey of music image analysis. In Structured Document Image Analysis, pages 405–434. Springer Berlin Hei- delberg, 1992. [7] Jorge Calvo-Zaragoza and David Rizo. End-to-end neural optical music recognition of monophonic scores. Applied Sciences, 8(4), 2018. [8] Bertrand Coüasnon, Pascal Brisset, and Igor Stéphan. Using logic programming languages for optical mu- sic recognition. In 3rd International Conference on the Practical Application of Prolog, 1995. [9] Bertrand Coüasnon and Jean Camillerapp. A way to separate knowledge from program in structured doc- ument analysis: Application to optical music recog- nition. In 3rd International Conference on Document Analysis and Recognition, pages 1092–1097, 1995. [10] Michael Droettboom, Ichiro Fujinaga, and Karl MacMillan. Optical music interpretation. In Structural, Syntactic, and Statistical Pattern Recognition, pages 378–387, Berlin, Heidelberg, 2002. [11] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vi- sion, 111(1):98–136, 2015. [12] Hoda M. Fahmy and Dorothea Blostein. A graph gram- mar programming style for recognition of music no- tation. Machine Vision and Applications, 6(2):83–99, 1993. [13] Alicia Fornés, Anjan Dutta, Albert Gordo, and Josep Lladós. CVC-MUSCIMA: A ground-truth of hand- written music score images for writer identification and staff removal. 
International Journal on Document Analysis and Recognition, 15(3):243–251, 2012. [14] Antonio-Javier Gallego and Jorge Calvo-Zaragoza. Staff-line removal with selectional auto-encoders. Ex- pert Systems with Applications, 89:138–148, 2017. [15] Michael Good. MusicXML: An internet-friendly for- mat for sheet music. Technical report, Recordare LLC, 2001. [16] Jan Hajič jr., Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. Towards full-pipeline handwritten OMR with musical symbol detection by u-nets. In 19th Inter- national Society for Music Information Retrieval Con- ference, pages 225–232, Paris, France, 2018. [17] Jan Hajič jr., Marta Kolárová, Alexander Pacha, and Jorge Calvo-Zaragoza. How current optical music recognition systems are becoming useful for digital li- braries. In 5th International Conference on Digital Li- braries for Musicology, pages 57–61, Paris, France, 2018. [18] Jan Hajič jr. and Pavel Pecina. The MUSCIMA++ dataset for handwritten optical music recognition. In 14th International Conference on Document Analysis and Recognition, pages 39–46, Kyoto, Japan, 2017. [19] Andrew Hankinson, Perry Roland, and Ichiro Fujinaga. The music encoding initiative as a document-encoding framework. In 12th International Society for Music In- formation Retrieval Conference, pages 293–298, 2011. [20] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. Computing Re- search Repository, abs/1412.6980, 2014. [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common ob- jects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. [22] Hidetoshi Miyao and Robert Martin Haralick. Format of ground truth data used in the evaluation of the results of an optical music recognition system. In 4th Interna- tional Workshop on Document Analysis Systems, pages 497–506, Brasil, 2000. [23] Kia Ng. Music manuscript tracing. 
Lecture Notes in Computer Science, 2390:322–334, 2002. [24] Alexander Pacha and Jorge Calvo-Zaragoza. Optical music recognition in mensural notation with region- based convolutional neural networks. In 19th Interna- tional Society for Music Information Retrieval Confer- ence, pages 240–247, Paris, France, 2018. [25] Alexander Pacha, Kwon-Young Choi, Bertrand Coüas- non, Yann Ricquebourg, Richard Zanibbi, and Horst Eidenberger. Handwritten music object detection: Open issues and baseline results. In 13th International Workshop on Document Analysis Systems, pages 163– 168, 2018. [26] Alexander Pacha and Horst Eidenberger. Towards a universal music symbol classifier. In 14th International Conference on Document Analysis and Recognition, pages 35–36, Kyoto, Japan, 2017. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek [27] Alexander Pacha, Jan Hajič jr., and Jorge Calvo- Zaragoza. A baseline for general music object detec- tion with deep learning. Applied Sciences, 8(9):1488– 1508, 2018. [28] Christopher Raphael and Jingya Wang. New ap- proaches to optical music recognition. In 12th Inter- national Society for Music Information Retrieval Con- ference, pages 305–310, Miami, Florida, 2011. [29] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, An- dre R.S. Marcal, Carlos Guedes, and Jamie dos Santos Cardoso. Optical music recognition: State-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012. [30] K. Todd Reed and J. R. Parker. Automatic computer recognition of printed music. In 13th International Conference on Pattern Recognition, pages 803–807, 1996. [31] Florence Rossant and Isabelle Bloch. A fuzzy model for optical recognition of musical scores. 
Fuzzy Sets and Systems, 141(2):165–201, 2004. [32] Florence Rossant and Isabelle Bloch. Robust and adaptive OMR system including fuzzy modeling, fu- sion of musical rules, and possible error detection. EURASIP Journal on Advances in Signal Processing, 2007(1):081541, 2006. [33] Mariusz Szwoch. Guido: A musical score recognition system. In 9th International Conference on Document Analysis and Recognition, pages 809–813, 2007. [34] Lorenzo J. Tardón, Simone Sammartino, Isabel Bar- bancho, Verónica Gómez, and Antonio Oliver. Optical music recognition for scores written in white mensural notation. EURASIP Journal on Image and Video Pro- cessing, 2009(1):843401, 2009. [35] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, and Thilo Stadelmann. Deep watershed detector for music object recognition. In 19th International Soci- ety for Music Information Retrieval Conference, pages 271–278, Paris, France, 2018. [36] Eelco van der Wel and Karen Ullrich. Optical music recognition with convolutional sequence-to-sequence models. In 18th International Society for Music Infor- mation Retrieval Conference, Suzhou, China, 2017. D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek D ie a pp ro bi er te O rig in al ve rs io n di es er D is se rt at io n is t i n de r T U W ie n B ib lio th ek v er fü gb ar . T he a pp ro ve d or ig in al v er si on o f t hi s do ct or al th es is is a va ila bl e at th e T U W ie n B ib lio th ek . tu w ie n. at /b ib lio th ek CHAPTER 8 OMR for Mensural Notation Before modern stave notation was established, a number of preceding notations have evolved. One of them is mensural notation. It was used, for example, to write down sacred chants that were sung during the mass in the Catholic church (see Fig. 8.1). 
Figure 8.1: Chant from the Capitan collection, written in mensural notation during the 17th century.

In contrast to modern notation, this early notation system had a smaller vocabulary and was more limited with regard to what could be expressed with it. This motivated Jorge Calvo-Zaragoza and me to work on a complete OMR system for these scores, which requires fewer building blocks than OMR systems that attempt to recognize modern stave notation.

In our work “Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks,” published at the 19th International Society for Music Information Retrieval Conference 2018 [PCZ18], we devised a simplified pipeline that consists of only three stages: music object detection, position classification, and semantics recognition. The first stage is similar to the one described in the papers above. The position classification, on the other hand, represents a new building block that can be reused in other scenarios to improve the robustness of OMR systems. The idea is to obtain the vertical position, which corresponds to the pitch,1 with a neural network classifier. The benefit is that no stave recognition and removal stage is needed, while symbols can be classified robustly, relying on local information only. The position classification network worked exceptionally well, making virtually no errors and even spotting errors that were made by human annotators. The last step dealt with reconstructing the semantics, which can be done with a set of simple heuristics. For example, there are no simultaneous events, so notes can simply be read left to right to determine their order.
The encoding step, however, is non-trivial because the interpretation of the recognized symbols requires specialized domain knowledge. This part of the research was conducted by David Rizo and published along with a description of the MuRET project [RCZIn18].

¹ Note that the actual pitch still depends on other symbols, such as the clef and accidentals.

OPTICAL MUSIC RECOGNITION IN MENSURAL NOTATION WITH REGION-BASED CONVOLUTIONAL NEURAL NETWORKS

Alexander Pacha
Institute of Visual Computing and Human-Centered Technology, TU Wien, Austria
alexander.pacha@tuwien.ac.at

Jorge Calvo-Zaragoza
PRHLT Research Center, Universitat Politècnica de València, Spain
jcalvo@upv.es

ABSTRACT

In this work, we present an approach for the task of optical music recognition (OMR) using deep neural networks. Our intention is to simultaneously detect and categorize musical symbols in handwritten scores, written in mensural notation. We propose the use of region-based convolutional neural networks, which are trained in an end-to-end fashion for that purpose. Additionally, we make use of a convolutional neural network that predicts the relative position of a detected symbol within the staff, so that we cover the entire image-processing part of the OMR pipeline. This strategy is evaluated over a set of 60 ancient scores in mensural notation, with more than 15000 annotated symbols belonging to 32 different classes. The results reflect the feasibility and capability of this approach, with a weighted mean average precision of around 76% for symbol detection, and over 98% accuracy for predicting the position.

© Alexander Pacha, Jorge Calvo-Zaragoza. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Alexander Pacha, Jorge Calvo-Zaragoza. “Optical Music Recognition in Mensural Notation with Region-based Convolutional Neural Networks”, 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

1. INTRODUCTION

The preservation of musical heritage over the centuries makes it possible to study a particular artistic or cultural paradigm. Most of this heritage exists in written form and is stored in cathedrals or music libraries [10]. Apart from possible issues related to the ownership of the sources, this kind of storage ensures their physical preservation over time but also limits their accessibility. That is why efforts are being made to improve this situation through initiatives to digitize musical archives [17,21]. These digital copies can easily be distributed and studied without compromising their integrity.

Nevertheless, this digitization, although it represents progress with respect to the aforementioned situation, is not enough to exploit the actual potential of this heritage. To make the most of it, the musical content itself must be transcribed into a structured format that can be processed by a computer [6]. In addition to indexing the content and thereby enabling tasks such as content-based search, this could also facilitate large-scale data-driven musicological analysis in general [39].

Given that the transcription of sources is extremely time-consuming, it is desirable to resort to automatic systems. Optical music recognition (OMR) is a field of research that investigates how to build systems that decode music notation from images. Regardless of the approach used to achieve this objective, OMR systems vary significantly due to the differences among musical notations, document layouts, and printing mechanisms.
The work presented here deals with manuscripts written in mensural notation, specifically with sources from the 17th century, attributed to the Pan-Hispanic framework. Although this type of mensural notation is generally considered an extension of the European mensural notation, the Pan-Hispanic situation of that time underwent a particular development that fostered the massive use of handwritten copies. Due to this circumstance, the need for developing successful OMR systems for handwritten notation becomes evident.

Figure 1. A sample page of ancient music, written in mensural notation.

We address the optical music recognition of scores written in mensural notation (see Figure 1) as an object detection and classification task. In this notation, the symbols are atomic units,¹ which can be detected and categorized independently. Although there are polyphonic compositions from that era, each voice was placed on its own page, so we can consider the notation as monophonic on the graphical level. These simplifications allow us to formulate OMR as an object detection task in music score images, followed by a classification stage that determines the vertical position of each detected object within a staff. If the clef and other alterations are known, the vertical position of a note encodes its pitch.

¹ Except for beamed notes, in which the beam can be considered an atomic symbol itself.
We propose using region-based convolutional neural networks, which represent the state of the art in computer vision for object detection, and demonstrate their capability to detect and categorize the musical symbols that appear in the image of a music score with high precision. We believe that this work provides a solid foundation for the automatic encoding of scores into a machine-readable music format such as Music Encoding Initiative (MEI) [38] or MusicXML [15]. At present, there are thousands of manuscripts of this type that remain to be digitized and transcribed. Although each manuscript may have its own particularities (such as the handwriting style or the layout organization), the approach developed in this work presents a common and extensible formulation for all of them.

2. RELATED WORK

Most of the proposed solutions to OMR have focused on a multi-stage approach [34]. This traditional workflow involves steps that have been addressed in isolation, such as image binarization [4,47], staff and text segmentation [44], staff-line detection and removal [5, 11, 46], and symbol classification [3, 30, 33]. In other works, a full pipeline is proposed for a particular type of music score [31, 32, 43].

Recent works have shown that the image-processing pipeline can largely be replaced with machine-learning approaches, making use of deep learning techniques such as convolutional neural networks (CNNs) [1, 16, 29, 45]. CNNs represent a breakthrough in machine learning, especially when dealing with images. They have been applied with great success to many computer vision tasks, often reaching or even surpassing human performance [18, 22]. These neural networks are composed of a series of filters that operate locally (i.e., convolutions, pooling) and compute various representations of the input image. These filters form a hierarchy of layers, each of which represents a different level of abstraction [20].
The key is that these filters are not fixed but learned from the raw data through a gradient descent optimization process [23], meaning that the network can learn to extract data-specific, high-level features.

Here, we formulate OMR for mensural notation as an object detection task in music score images. Object detection in images is one of the fundamental problems in computer vision, for which deep learning can provide excellent solutions. Traditionally, the task has been addressed by means of heuristic strategies based on the extraction of low-level, general-purpose features such as SIFT [28] or HOG [7]. Szegedy and colleagues [8, 42] were the first to redefine the use of CNNs for object detection. Instead of classifying the image, the neural network predicted the bounding box of the object within the image. Around the same time, the ground-breaking work of Girshick et al. [14] definitively changed the traditional paradigm. In their work, a CNN was in charge of predicting whether each object of the vocabulary appeared in selected bottom-up regions of the image. This scheme has been referred to as region-based convolutional neural network (R-CNN). Afterwards, several extensions and variations have been proposed with the aim of improving both the quality of the detection and the efficiency of the process. Well-known examples include Fast R-CNN [13], Faster R-CNN [37], R-FCN [24], SSD [27] and YOLO [35, 36].

In this work, we use these region-based convolutional neural networks for OMR, which are trained for the direct detection and categorization of music symbols in a given music document. This allows for an elegant formulation of the task, since the training process only needs score images along with their corresponding set of symbols and the regions (bounding boxes) in which they appear.

3. AN OMR-PIPELINE FOR MENSURAL SCORES

Music scores written in mensural notation share many properties with scores written in modern notation: the sequence of tones and pauses is captured as notes and rests within a reference frame of five parallel lines, temporally ordered along the x-axis with the y-axis representing the pitch of notes. But unlike modern notation, mensural scores are notated monophonically with a smaller vocabulary of only around 30 different glyphs, which reduces the overall complexity significantly and allows for a simplified pipeline that consists of only three stages. A representative subset of the symbols that appear in the considered notation is depicted in Table 1.

Group     Symbols
Note      Semibrevis, Minima, Col. Minima, Semiminima
Rest      Longa, Brevis, Semibrevis, Semiminima
Clef      C Clef, G Clef, F Clef (I), F Clef (II)
Time      Major, Minor, Common, Cut
Others    Flat, Sharp, Dot, Custos

Table 1. Subset of classes from mensural notation. The symbols are depicted without considering their pitch or vertical position on the staff.

3.1 Music Object Detection

The first stage takes as input an entire high-quality image that contains music symbols. The entire image is fed into a deep convolutional neural network for object detection, which yields the bounding boxes of all detected objects along with their most likely class (e.g., g-clef, minima, flat).

Proceedings of the 19th ISMIR Conference, Paris, France, September 23-27, 2018
3.2 Position classification

After detecting the symbols and classifying them, the second stage performs position classification of each detected object to obtain its relative position with respect to the reference frame (staff), which is required to recover a note's pitch. For this process, we extract a local patch from the full image with the object of interest in the center and feed it into another CNN, which outputs the vertical position, encoded as shown in Figure 2.

Figure 2. Encoding of the vertical staff line position into discrete categories (labels shown in the figure: L0 to L6 and S0 to S6). The five continuous lines in the middle form the regular staff and the dashed lines represent ledger lines, which are inserted locally as needed. A note between the second and third line from the bottom would be classified as S2 (orange).

3.3 Semantics Reconstruction and Encoding

Given the detected objects and their relative position to the staff line, the final step is to reconstruct the musical semantics and encode the output into the desired format (e.g., into modern notation [48]). This step has to translate the detected objects into an ordered sequence for further processing. Depending on the application and desired output, semantic rules need to be taken care of, such as grouping beams with their associated notes to infer the right duration or altering the pitch of notes when accidentals are encountered.

4. EXPERIMENTS

To evaluate the proposed approach, we conducted experiments² for the first two steps of the pipeline. While a full system would also require the third step, we refrain from implementing it, so as not to restrict this approach to a particular application.
It is also noteworthy that translating mensural notation into modern notation can be seen as its own field of research that requires a deep understanding of both notational languages, which exceeds the scope of this work.

² Source code is available at https://github.com/apacha/Mensural-Detector

4.1 Dataset

Our corpus consists of 60 fully annotated pages in mensural notation from the 16th-18th century. The manuscript represents sacred music, composed for vocal interpretation.³ The compositions were written in music books by copyists of that time. To ensure the integrity of the physical sources, the images were taken with a camera instead of scanning the books in a flatbed scanner, leading to sub-optimal conditions in some cases. An overview of the considered corpus is given in Table 2.

Pages                                 60
Total number of symbols               15258
Different classes                     32
Different positions within a staff    14
Average size of a symbol (w × h)      44 × 84 pixels
Number of symbols per image           42–447 (∅ 250)
Image resolution (w × h)              ∼3000 × 2000 pixels
Dots per inch (DPI)                   300

Table 2. Statistics of the considered corpus.

The ground-truth data was collected using a framework in which an electronic pen is used to trace the music symbols, similar to that of [2]. The bounding boxes of the symbols are then obtained by computing the rectangular extent of the users' strokes.

4.2 Setup

Our experiments are based on previous research by [29], where a sliding-window approach is used to detect handwritten music symbols in sub-regions of a music score. In contrast to their work, we are able to detect hundreds of tiny objects in the full page within a single pass. To train a network in a reasonable amount of time within the constraints of modern hardware, it is currently necessary to shrink the input image to be no longer than 1000 px on the longest edge, which corresponds to downscaling by a factor of three on our dataset.
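As a rough illustration of this downscaling step, the following Python sketch computes the target size of a page so that its longest edge does not exceed 1000 pixels. The function name and the rounding behaviour are assumptions made for illustration; they are not taken from the released source code.

```python
def downscaled_size(width, height, max_edge=1000):
    """Return (new_width, new_height) so that the longest edge is <= max_edge.

    Images that are already small enough are returned unchanged.
    The aspect ratio is preserved up to rounding to whole pixels.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    new_width = round(width * max_edge / longest)
    new_height = round(height * max_edge / longest)
    return new_width, new_height


# A page of roughly 3000 x 2000 pixels, as reported in Table 2.
page_size = downscaled_size(3000, 2000)
small_size = downscaled_size(800, 600)
```

For a page of about 3000 × 2000 pixels this gives 1000 × 667 pixels, so an average symbol of 44 × 84 pixels shrinks to roughly 15 × 28 pixels.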
For detecting music objects, the Faster R-CNN approach [37] with the Inception-ResNet-v2 [41] feature extractor has been shown to yield very good results for detecting handwritten symbols [29]. It works by having a region-proposal stage that generates suggestions where an object might be, followed by a classification stage, which confirms or discards these proposals. Both stages are implemented as CNNs and trained jointly on the provided dataset.

³ The dataset is subject to ongoing musicological research and cannot be made public at this point in time, so it is only available upon request.

The first stage scans the image linearly along a regular grid with user-defined box proposals in each cell of that grid. To be able to generate meaningful proposals, the shape of these boxes has to be similar to the actual shape of the objects that should be found. Since the image contains a large number of very tiny objects (sometimes only a few pixels), a very fine grid is required. After a statistical analysis of the objects appearing in the given dataset, including dimension clustering [35], several experiments were conducted to study the effects of size, scale, and aspect ratios of the above-mentioned boxes, concluding that sensibly chosen priors for these boxes work about as well as the boxes obtained from the statistical analysis. For the downscaled image, boxes of 16×16 pixels, iterating with a stride of 8 pixels and using the scales 0.25, 0.5, 1.0, and 2.0, with aspect ratios of 0.5, 1.0, and 2.0, represent a meaningful default configuration.
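To make this default configuration more concrete, the sketch below enumerates the box shapes it implies. It assumes a common convention in which the aspect ratio is width divided by height and the box area depends only on the scale; whether the detector used here follows exactly this convention is not stated in the text, and the function names are hypothetical.

```python
import math


def anchor_shapes(base_size=16.0,
                  scales=(0.25, 0.5, 1.0, 2.0),
                  aspect_ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs for every scale/aspect-ratio combination.

    Assumed convention: aspect ratio = width / height, and the box area
    equals (base_size * scale) ** 2 regardless of the aspect ratio.
    """
    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = base_size * scale * math.sqrt(ratio)
            height = base_size * scale / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


def anchor_count(image_width, image_height, stride=8, shapes_per_cell=12):
    """Count the boxes on a regular grid (assumes one cell per full stride)."""
    cols = image_width // stride
    rows = image_height // stride
    return cols * rows * shapes_per_cell


shapes = anchor_shapes()
total_boxes = anchor_count(1000, 667, stride=8, shapes_per_cell=len(shapes))
```

Under these assumptions there are 12 box shapes per grid cell, ranging from an area of 16 to 1024 square pixels, and a downscaled page of 1000 × 667 pixels is covered by 124500 candidate boxes, which illustrates how fine the grid is.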
Accounting for the high density of objects, the maximum number of box proposals is set to 1200 with a maximum of 600 final detections per image.

For the second step of our proposed pipeline, another CNN is trained to infer the position of an object relative to the staff upon which it is notated (see Figure 2). Different off-the-shelf network architectures are evaluated (VGG [40], ResNet [19], Inception-ResNet-v2 [41]), with the more complex models slightly outperforming the simpler ones. Using pre-trained weights instead of random initialization accelerates the training, improves the overall result, and is therefore used throughout all experiments. The input to the classification network is a 224×448 pixel patch of the original image with the target object in the center (see Figure 3). The exact dimensions of the patch are not important, as long as the image contains enough vertical and horizontal context to classify even symbols notated above or below the staff. When objects appear too close to the border, the image is padded with its reflection along the extended edge to simulate the continuation of the page, as shown in Figures 3(d) and 3(e).

Figure 3. Sample inputs for the position classification network depicting a g-clef (a), semiminima (b), brevis rest (c), custos (d) and semibrevis (e), with vertical (d) and horizontal (e) reflections of the image to force the target object to be in the center, while preserving meaningful context.

It is important to note that the vertical position defines the semantic meaning only for some symbols (e.g., the pitch of a note or the upcoming pitch with a custos). Classes for which the position is either undefined or not of importance include barlines, fermatas, different time signatures, beams and, in particular for mensural notation, the augmentation dot. Symbols from these classes can be excluded from the second step.
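The patch extraction with reflection padding described above can be sketched with NumPy as follows. This is an illustration under assumptions, not the authors' implementation: it reads 224 as the patch width and 448 as its height (following the w × h convention of Table 2), uses `numpy.pad` with `mode="reflect"` as one possible way to mirror the page, and the function name is hypothetical.

```python
import numpy as np


def extract_patch(image, center_x, center_y, width=224, height=448):
    """Cut a width x height patch centred on (center_x, center_y).

    The image is first padded by half a patch on every side using
    reflection, so the target stays centred even near the page border.
    Works for grayscale (H, W) and colour (H, W, C) arrays.
    """
    half_w, half_h = width // 2, height // 2
    pad_width = [(half_h, half_h), (half_w, half_w)]
    pad_width += [(0, 0)] * (image.ndim - 2)  # never pad colour channels
    padded = np.pad(image, pad_width, mode="reflect")
    # Pixel (y, x) of the original lives at (y + half_h, x + half_w) in the
    # padded image, so the patch starts at (center_y, center_x) there.
    return padded[center_y:center_y + height, center_x:center_x + width]


# Synthetic 600 x 400 "page" whose pixel values encode their own index.
page = np.arange(600 * 400).reshape(600, 400)
inner_patch = extract_patch(page, center_x=200, center_y=300)
border_patch = extract_patch(page, center_x=10, center_y=5)
```

Padding the whole page once keeps the indexing simple; for many patches per page one would pad once and reuse the padded array rather than re-padding on every call.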
4.3 Evaluation metrics

Concerning the music object detection stage, the model provides a set of bounding box proposals, as well as the recognized class of the objects therein. The model also yields a confidence score for each proposal. A bounding box proposal $B_p$ is considered positive if its overlap with the ground-truth bounding box $B_g$ exceeds 60%, according to the Intersection over Union (IoU) criterion:⁴

\[ \frac{\mathrm{area}(B_p \cap B_g)}{\mathrm{area}(B_p \cup B_g)} \]

If the recognized class matches the actual category of the object, it is considered a true positive; otherwise, it is a false positive. Additional detections of the same object are counted as false positives as well. Those objects for which the model makes no proposal are considered false negatives. Given that the prediction is associated with a score, different values of precision and recall can be obtained for each possible threshold. To obtain a single metric, Average Precision (AP) can be computed, which is defined as the area under this precision-recall curve. An AP value can be computed independently for each class, and we then report the mean AP (mAP) as the mean across all classes. Since our problem is highly unbalanced with respect to the number of objects of each class, we also compute the weighted mAP (w-mAP), in which the mean value is weighted according to the frequency of each class.

For the second part of the pipeline (position classification), we evaluate the performance with the accuracy rate (ratio of correctly classified samples).

5. RESULTS

Both experiments yielded very promising results while leaving some room for improvement. The detection of objects in the full image (see Figure 4) was evaluated by training on 48 randomly selected images and testing on the remaining 12 images with a 5-fold cross-validation. This task can be performed very well and yielded 66% mAP and 76% w-mAP.
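The IoU criterion and the difference between mAP and w-mAP from Section 4.3 can be sketched in a few lines of Python. The box format (left, top, right, bottom), the function names, and the per-class values in the example are assumptions chosen for illustration; they are not results from the experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (left, top, right, bottom)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(proposal, ground_truth, threshold=0.6):
    """A proposal counts as positive if its IoU exceeds the 60% threshold."""
    return iou(proposal, ground_truth) > threshold


def mean_ap(ap_per_class):
    """Unweighted mean of the per-class average precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def weighted_mean_ap(ap_per_class, count_per_class):
    """Mean of the per-class AP values, weighted by class frequency."""
    total = sum(count_per_class[c] for c in ap_per_class)
    return sum(ap_per_class[c] * count_per_class[c] for c in ap_per_class) / total


# Made-up example: one frequent, well-detected class and one rare, hard class.
example_ap = {"minima": 0.9, "dot": 0.3}
example_counts = {"minima": 900, "dot": 100}
example_map = mean_ap(example_ap)
example_wmap = weighted_mean_ap(example_ap, example_counts)
```

In this made-up example the unweighted mean is 0.6 while the weighted mean is 0.84, which mirrors why w-mAP comes out higher than mAP when frequent classes are detected better than rare ones.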
When considering practical applications, the weighted mean average precision indicates the effort needed to correct the detection results, because it reflects the fact that symbols from classes that appear frequently are generally detected better than rare symbols.

When reviewing the error cases, a few things can be observed: Very tiny objects such as the dot, semibrevis rest and minima rest pose a significant challenge to the network, due to their small size and extremely similar appearance (see Figure 5). This problem might be mitigated by allowing the network to access the full-resolution image, which potentially has more discriminative information than the downsized image. Unsurprisingly, classes that are underrepresented, such as dots, barlines, or all types of rests, are also frequently missed or incorrectly classified, leading to average precision rates of only 10–40% for these classes. Another interesting observation is that in many cases, objects were detected, but the IoU with the underlying ground truth was too low to consider them a true positive detection (see Figure 6, with a red box being very close to a white box).

⁴ As defined for the PASCAL VOC challenge [9].

Figure 4. Detected objects in the full image with the detected class being encoded as the color of the box. This example achieves a mAP of approximately 68% and a w-mAP of 85%.

Figure 5. The smallest objects from the dataset that are hard to detect and often confused (from left to right): dot (a), semibrevis rest (b), and minima rest (c).
For the second experiment, a total of 13246 symbols were split randomly into a training (80%), validation (10%) and test set (10%). The pre-trained Inception-ResNet-v2 model is then fine-tuned on this dataset and achieves over 98% accuracy on the test set of 1318 samples. Analyzing the few remaining errors reveals that the model makes virtually no errors and that the misclassified samples are mostly human annotation errors or data inconsistencies.

For inference, both networks can be connected in series. Running both detection and classification takes about 30 seconds per image on a GPU (GeForce 1080 Ti) and 210 seconds on a CPU.

Figure 6. Visualization of the performance of the object detection stage with selected patches (a) to (d) of the music documents: green boxes indicate true positive detections; white boxes are false negatives that the network missed during detection; red boxes are false positive detections, where the model reported an object although there is no ground truth; yellow boxes are also false positives, where the bounding box is valid but the assigned class was incorrect.

6. CONCLUSION

In this work, we have shown that the optical music recognition of handwritten music scores in mensural notation can be performed accurately and extensibly by formulating it as an object detection problem, followed by a classification stage to recover the position of the notes within the staff. By using a machine learning approach with region-based convolutional neural networks, this problem can be solved by simply providing annotated data and training a suitable model on that dataset. However, we are aware that our proposal still has room for improvement. In future work we would like to:

• evaluate the use of different network architectures, such as feature pyramid networks [25,26], that might improve the detection of small objects, which we have identified as the biggest source of error at the moment. These networks allow the use of high-resolution images directly, without the inherent information loss that is caused by the downscaling operation.
• merge the staff position classification with the object detection network by adding another output to the neural network, so the model simultaneously predicts the staff position, the bounding box and the class label.
• apply and evaluate the same techniques for other notations, including modern notation.
• study models or strategies that reduce (or remove) the need for specific ground-truth data of each type of manuscript, for example, unsupervised training schemes such as the one proposed in [12], which allows the network to adapt to a new domain by simply providing new, unannotated images.

We believe that this research avenue represents ground-breaking work in the field of OMR, as the presented approach would potentially deal with any type of music score by just providing undemanding ground-truth data to train the neural models.

7. ACKNOWLEDGEMENT

Jorge Calvo-Zaragoza acknowledges the support of the European Union's H2020 grant READ (Ref. 674943), the Spanish Ministerio de Economía, Industria y Competitividad through a Juan de la Cierva - Formación grant (Ref.
FJCI-2016-27873), and the Social Sciences and Humanities Research Council of Canada.

8. REFERENCES

[1] J. Calvo-Zaragoza and D. Rizo. End-to-End Neural Optical Music Recognition of Monophonic Scores. Applied Sciences, 8(4):606–629, 2018.
[2] J. Calvo-Zaragoza, D. Rizo, and J. M. Iñesta. Two (note) heads are better than one: pen-based multimodal interaction with music scores. In 17th International Society for Music Information Retrieval Conference, pages 509–514, 2016.
[3] J. Calvo-Zaragoza, A. J. G. Sánchez, and A. Pertusa. Recognition of Handwritten Music Symbols with Convolutional Neural Codes. In 14th IAPR International Conference on Document Analysis and Recognition, pages 691–696, 2017.
[4] J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga. Pixel-wise binarization of musical documents with convolutional neural networks. In 15th IAPR International Conference on Machine Vision Applications, pages 362–365, 2017.
[5] J. S. Cardoso, A. Capela, A. Rebelo, C. Guedes, and J. P. da Costa. Staff detection with stable paths. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6):1134–1139, 2009.
[6] G. S. Choudhury, M. Droettboom, T. DiLauro, I. Fujinaga, and B. Harrington. Optical music recognition system within a large-scale digitization project. In 1st International Symposium on Music Information Retrieval, pages 1–6, 2000.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov.
Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[10] I. Fujinaga, A. Hankinson, and J. E. Cumming. Introduction to SIMSSA (Single Interface for Music Score Searching and Analysis). In 1st International Workshop on Digital Libraries for Musicology, pages 1–3, 2014.
[11] A.-J. Gallego and J. Calvo-Zaragoza. Staff-line removal with selectional auto-encoders. Expert Systems with Applications, 89:138–148, 2017.
[12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
[13] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[15] M. Good and G. Actor. Using MusicXML for file interchange. In Third International Conference on WEB Delivering of Music, page 153, 2003.
[16] J. Hajič Jr. and P. Pecina. Detecting Noteheads in Handwritten Scores with ConvNets and Bounding Box Regression. Computing Research Repository, abs/1708.01806, 2017.
[17] A. Hankinson, J. A. Burgoyne, G. Vigliensoni, A. Porter, J. Thompson, W. Liu, R. Chiu, and I. Fujinaga. Digital Document Image Retrieval Using Optical Music Recognition. In Proceedings of the 13th International Society for Music Information Retrieval Conference, pages 577–582, 2012.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
In IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.
[21] A. Laplante and I. Fujinaga. Digitizing musical scores: Challenges and opportunities for libraries. In 3rd International Workshop on Digital Libraries for Musicology, pages 45–48. ACM, 2016.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[24] Y. Li, K. He, J. Sun, et al. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[25] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[26] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. Computing Research Repository, abs/1708.02002, 2017.
[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
[28] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[29] A. Pacha, K.-Y. Choi, B. Coüasnon, Y. Ricquebourg, R. Zanibbi, and H. Eidenberger. Handwritten music object detection: Open issues and baseline results. In 13th IAPR Workshop on Document Analysis Systems, pages 163–168, 2018.
[30] A. Pacha and H. Eidenberger. Towards a Universal Music Symbol Classifier. In 12th IAPR International Workshop on Graphics Recognition, pages 35–36, 2017.
[31] L. Pugin. Optical music recognition of early typographic prints using hidden Markov models. In 7th International Conference on Music Information Retrieval, pages 53–56, 2006.
[32] C. Ramirez and J. Ohya. Automatic recognition of square notation symbols in western plainchant manuscripts. Journal of New Music Research, 43(4):390–399, 2014.
[33] A. Rebelo, A. Capela, and J. S. Cardoso. Optical recognition of music symbols. International Journal on Document Analysis and Recognition, 13(1):19–31, 2010.
[34] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Marcal, C. Guedes, and J. S. Cardoso. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012.
[35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[36] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6517–6525, 2017.
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[38] P. Roland. The music encoding initiative (MEI). In Proceedings of the First International Conference on Musical Applications Using XML, pages 55–59, 2002.
[39] X. Serra. The computational study of a musical culture through its digital traces. Acta Musicologica, 89(1):24–44, 2017.
[40] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Computing Research Repository, abs/1409.1556, 2014.
[41] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In 31st AAAI Conference on Artificial Intelligence, pages 4278–4284, 2017.
[42] C. Szegedy, A. Toshev, and D. Erhan. Deep Neural Networks for Object Detection. In Advances in Neural Information Processing Systems 26, pages 2553–2561, 2013.
[43] L. J. Tardón, S. Sammartino, I. Barbancho, V. Gómez, and A. Oliver. Optical Music Recognition for Scores Written in White Mensural Notation. EURASIP Journal on Image and Video Processing, 2009(1):1–23, 2009.
[44] R. Timofte and L. Van Gool. Automatic stave discovery for musical facsimiles. In Asian Conference on Computer Vision, pages 510–523, 2012.
[45] E. van der Wel and K. Ullrich. Optical music recognition with convolutional sequence-to-sequence models. In 18th International Society for Music Information Retrieval Conference, pages 731–737, 2017.
[46] M. Visani, V. C. Kieu, A. Fornés, and N. Journet. The ICDAR 2013 music scores competition: Staff removal. In International Conference on Document Analysis and Recognition, pages 1407–1411, 2013.
[47] Q. N. Vo, S. H. Kim, H. J. Yang, and G. Lee. An MRF model for binarization of music scores with complex background. Pattern Recognition Letters, 69:88–95, 2016.
[48] Yu-Hui Huang, Xuanli Chen, Serafina Beck, David Burn, and Luc J. Van Gool. Automatic Handwritten Mensural Notation Interpreter: From Manuscript to MIDI Performance. In 16th International Society for Music Information Retrieval Conference, pages 79–85, 2015.
CHAPTER 9 Other contributions

Apart from the scientific publications mentioned above and two more scientific papers ([Pac18c] and [Pac18a]), some side projects evolved over the last few years. They gained significant attention both from the community and from prospective researchers who are making use of these projects and the resources that I have shared publicly.

9.1 Optical Music Recognition Datasets project

One of the most pressing issues among OMR researchers has been the lack of datasets. While music in principle was available on a large scale, annotated datasets were not. So most researchers resorted to creating their own small datasets while researching the subject. This situation changed in recent years. Therefore, I collected the datasets that have been published so far and made that list available online [Pac17b]. It is a curated list with more than 20 datasets that were developed explicitly for OMR. Each entry contains a summary, a link to the official website, optionally the scientific publication in which the dataset was introduced, and a small example from the dataset, cf. Fig. 9.1.

Apart from the links and the summaries, the OMR datasets project also provides a Python software package, omrdatasettools [Pac18b], that facilitates working with the datasets. It includes downloader scripts, converters, and image generators for datasets that only have a textual description of the underlying data. The GitHub repository also mirrors most of the referenced datasets to prevent them from suddenly disappearing in case the original websites are taken down.
This contribution has gained significant attention in the community and is referenced from various scientific articles.

Figure 9.1: Screenshot of the website for the OMR Datasets project.

9.2 ISMIR Tutorial “Optical Music Recognition for Dummies”

At the International Society for Music Information Retrieval (ISMIR) Conference 2018 in Paris, France, Jorge Calvo-Zaragoza, Jan Hajič jr., Ichiro Fujinaga, and I gave a 3-hour tutorial on Optical Music Recognition, called “Optical Music Recognition for Dummies.” It spanned the entire spectrum of OMR: from the history of the field to modern approaches that were presented a few days later at the conference. We recorded the entire session and published it on YouTube [CZHjPF18]. As of April 2019, the videos have been viewed more than 400 times.

9.3 Workshop on Reading Music Systems (WoRMS)

In 2018, Jorge Calvo-Zaragoza, Jan Hajič jr., and I organized the first Workshop on Reading Music Systems (WoRMS) [CZHjP18b], a satellite event to ISMIR 2018 in Paris, France, with about 30 attendees. It was the first time that the majority of active researchers in OMR gathered in the same place. WoRMS was organized similarly to the GREC workshop [FL17], where the idea for a dedicated OMR workshop was born.
The workshop featured 12 talks from researchers who work on OMR as well as from users of OMR systems, such as librarians. Each session was followed by an interactive discussion on the presented papers. Another WoRMS is planned for 2019 in Delft, The Netherlands, again as a satellite event to ISMIR [CZPR19].

9.4 Workshop at MEC 2019: Let’s Formalize Music Notation

The Music Encoding Conference [DKKG19] is an annual conference on music encodings, digital musicology, digital editions, and symbolic music information retrieval. In 2019, the workshop “Let’s Formalize Music Notation for OMR” was held by Jorge Calvo-Zaragoza, Heinz Roggenkemper, and me. The goal of this workshop was to work towards a standard representation for OMR, something that does not yet exist. Given that the Music Encoding Initiative has a large body of knowledge in the field and significant interest in the results of OMR systems, it was an ideal place to jointly work on this subject.

9.5 Discussion Group Summary: Optical Music Recognition

In 2017, I attended the 12th IAPR International Workshop on Graphics Recognition [FL17] in Kyoto, Japan. During the workshop, multiple discussion groups were formed, including one on Optical Music Recognition. The discussion was summarized by Jorge Calvo-Zaragoza, Jan Hajič jr., and me, and published as part of Springer Lecture Notes in Computer Science [CZHjP18a].

9.6 Community Engagement and Website for OMR-Research

As a result of the Workshop on Reading Music Systems, the community decided that it wanted a website for future OMR research. A few months later, we launched https://omr-research.net, an expanding collection of resources on OMR, links to upcoming events, and blog entries with ideas that are still in rough shape. We have also established a Slack channel [Pac17c] that is actively used by researchers, as well as a GitHub organization [Pac19a] to channel the development of various OMR projects.
9.7 OMR Bibliography

A side effect of writing a thesis is that one has to study the literature of the subject thoroughly. For writing the paper “Understanding Optical Music Recognition,” we tried to gather all papers that were written on the subject of OMR. We collected the BibTeX citations for hundreds of articles and manually verified them. Similar efforts were made before by Ichiro Fujinaga [Fuj00], Kia Ng, and Andrew Hankinson [Han12]. They published extensive bibliographies on the internet as static websites. We wanted to go one step further and have published online a curated list of BibTeX entries on OMR research, along with a static website that is generated from those entries. The idea is to make the life of future OMR researchers easier by providing a verified bibliography of nearly all past research. The GitHub repository is open for submissions from the community, the rendered website can easily be updated, and our ultimate goal is to provide a valuable asset that helps all OMR researchers cite previous research correctly.

CHAPTER 10 Conclusions and Outlook

The most important conclusion from this thesis is that large parts of OMR can be formulated as a machine learning problem, which in turn can be solved efficiently with deep learning.
While the overall pipeline has been slightly reformulated, its general structure remains unchanged: preprocessing, symbol detection, semantic reconstruction, and encoding. Especially when trying to recover all of the information for a structured encoding, I believe that there will not be an alternative to this pipeline anytime soon.

While there have been other attempts to solve OMR in a complete end-to-end fashion (feeding in an image and getting encoded music out), they leave much to be desired. Single-stave, monophonic music can be processed this way, but as soon as polyphony is involved or interactions between multiple staves appear, these approaches reach their limits because they rely on the serializability of the score, that is, on encoding music as an ordered sequence. Certainly, it would be desirable if a system could learn everything from reading an image to producing a MIDI file, without any intervention being necessary at all. Unfortunately, I see no evidence that such a system is feasible. An alternative approach could be to learn the construction of the notation graph directly. But even that exceeds the boundaries of what I think is possible today.

Coming back to what is actually possible: Music Object Detection, one of the subjects I spent the most effort on, is now clearly solvable. Object detectors that use deep convolutional neural networks are powerful enough to provide very good results. I have shown that the approach works well on different datasets, meaning that a detector performs well on the dataset it was trained on. However, it is still unclear whether the trained networks generalize across datasets: can they transfer easily from one dataset to another?

While the music object detectors operate well, there is also a catch: they need large, annotated datasets for training. Building such a dataset is a costly endeavor; so, if anyone decides to put effort into it, sharing it publicly will be of immense benefit.
To this end, I see the OMR datasets project as a milestone that will help future researchers get started much faster. I also contributed two smaller datasets, but more importantly the facilities for finding and working with existing datasets.

With the newly introduced definition of OMR and its taxonomy, it became much clearer why there is no answer to the question “Does OMR work?”: it is an ill-defined question. Certain applications for OMR already work well, whereas others do not. Recovering the structured encoding remains a big challenge, and I believe it will still take some time before we see reliable OMR systems in commercial products.

One important distinction can help future systems become more flexible and robust: decoupling the internal representation used during recognition from the final representation into which the results are encoded. The reason is that music encodings such as MusicXML, MEI, MuseScore XML, or MIDI were not designed for the specific needs of OMR. For example, they struggle to represent syntactically incorrect scores, which can quickly occur if the recognition fails. By representing music notation in a graph with vertices and edges, this barrier is removed, which allows the system to store the information as faithfully as possible. The last part of the OMR pipeline, the export, can then be handled by someone who is an expert in music encoding and not necessarily in machine learning or computer vision.
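The idea of keeping recognition results in a graph that is independent of the final encoding can be illustrated with a minimal Python sketch. All class names, fields, symbol labels, and the consistency check below are hypothetical and chosen for illustration only; they are an assumption and not the data model used in this thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Vertex:
    """One detected symbol (hypothetical fields, for illustration only)."""
    id: int
    symbol_class: str                  # illustrative label, e.g. "notehead-full"
    bbox: Tuple[int, int, int, int]    # (top, left, bottom, right) in pixels


@dataclass
class NotationGraph:
    """Vertices are detected symbols, directed edges are their relationships."""
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.id] = vertex

    def add_edge(self, source: int, target: int) -> None:
        # Only endpoints are validated; musical syntax is deliberately not enforced.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges.append((source, target))

    def children(self, source: int) -> List[Vertex]:
        return [self.vertices[t] for s, t in self.edges if s == source]


def noteheads_without_stem(graph: NotationGraph) -> List[int]:
    """Toy check: ids of filled noteheads that have no attached stem."""
    return [
        v.id
        for v in graph.vertices.values()
        if v.symbol_class == "notehead-full"
        and not any(c.symbol_class == "stem" for c in graph.children(v.id))
    ]


graph = NotationGraph()
graph.add_vertex(Vertex(0, "notehead-full", (40, 10, 50, 22)))
graph.add_vertex(Vertex(1, "stem", (10, 21, 45, 23)))
graph.add_vertex(Vertex(2, "notehead-full", (44, 60, 54, 72)))  # its stem was missed
graph.add_edge(0, 1)

suspicious = noteheads_without_stem(graph)
```

In this sketch the graph stores the incomplete third symbol without complaint, so a later export stage can decide whether to repair it, drop it, or ask the user, which is the kind of decision a fixed output encoding would force much earlier.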
While some encodings only require a fraction of the information from the notation graph, it can be useful in other situations to have all the information, e.g., when deciding how to resolve ambiguities or where to ask for user intervention if the system fails.

Unfortunately, it is unclear whether the idea of the notation graph will be picked up and developed further by other researchers. The past has shown that existing tools will only be adopted and used if they provide useful features and are well-designed, documented, and ready to use. Therefore, one key ingredient that could make the notation graph successful would be an export into at least one widely used format that is available from the start.

A benefit of pushing music object detection into the area of clearly solvable problems is that later stages of the pipeline have started to receive more attention. For many years, OMR research was struggling with early stages such as the detection and removal of staff lines. This has changed for good. I believe that music object detection still has plenty of room for improvement. For instance, the trained models can probably be much smaller than 200 MB with hundreds of layers, while still producing excellent detection results.

As we started several community activities, we saw more collaboration between the research groups as well as new application scenarios emerging during discussions. I think that the research conducted for this thesis pushed the state of the art forward significantly. Maybe in five years, OMR as a whole will be considered a solved problem, although I doubt it. But at least we will be a big step closer to solving it.
List of Figures

1.1 Excerpt from the waltz “An der schönen blauen Donau” by Johann Strauss, Jr.
1.2 First measures from the guitar riff of the song “Enter Sandman” by Metallica.
1.3 The initial three measures of Lisa’s first composition for the piano.
1.4 A born-digital version of music scores, typeset by a music score editor and without artifacts or degradations.
1.5 The same musical snippet as in Fig. 1.4, but degraded, as it can happen in real-world scenarios: The stave is slightly slanted, the image is blurred and noisy due to a poor image capturing process, and some straight lines are bent, which frequently happens when making photos of scores that are bound in a book.
1.6 The same musical snippet as in Fig. 1.4, but handwritten on a tablet with a stylus.
1.7 The word ‘Research’ written three times with vertically shifted letters, which always remains the word research, whereas the values of the three notes that are also slightly shifted vertically represent three different notes with the pitches A, B, and G.
1.8 Three quarter-notes appear in the second space from the top within the reference system. The reference system’s origin is given by the G-Clef at the beginning, which specifies the G to be on the second line from the bottom. So the first note corresponds to a C, but with the given key-signature at the beginning which depicts two sharps with one of them placed on the second space from the top, it makes the note a C#. The second note has a local modifier that undoes this alteration from the key signature, which makes the note a C. The third note has no local modifier, but the effect of the local modifier from the second note is propagated to consecutive notes within the measure, making it also a C. So even if the first and third note visually look exactly the same, their semantics (pitch) is different.
4.1 A small sample of music symbols that are part of the collected music symbols dataset. It depicts ten different classes of handwritten and typeset symbols in modern notation.
5.1 Illustration of the sliding window approach, used to crop music scores into sub-images (red boxes). Boxes overlap both vertically with the boxes above and below as well as with adjacent crops (orange).
8.1 Chant from the Capitan collection, written in mensural notation during the 17th century.
9.1 Screenshot of the website for the OMR Datasets project.

Bibliography

[CZHjP18a] Jorge Calvo-Zaragoza, Jan Hajič jr., and Alexander Pacha. Discussion Group Summary: Optical Music Recognition. In Graphics Recognition, Current Trends and Evolutions, Lecture Notes in Computer Science, pages 152–157. Springer International Publishing, 2018.
[CZHjP18b] Jorge Calvo-Zaragoza, Jan Hajič jr., and Alexander Pacha.
Website of the Workshop on Reading Music Systems 2018. https://sites.google.com/view/worms2018/home (Last visited 18.06.2019), 2018.
[CZHjP19] Jorge Calvo-Zaragoza, Jan Hajič jr., and Alexander Pacha. Understanding Optical Music Recognition. ACM Computing Surveys (under review), 2019.
[CZHjPF18] Jorge Calvo-Zaragoza, Jan Hajič jr., Alexander Pacha, and Ichiro Fujinaga. The recording of the ISMIR Tutorial "OMR for Dummies" on YouTube. https://www.youtube.com/playlist?list=PL1jvwDVNwQke-04UxzlzY4FM33bo1CGS0 (Last visited 18.06.2019), 2018.
[CZO14] Jorge Calvo-Zaragoza and Jose Oncina. Recognition of Pen-Based Music Notation: The HOMUS Dataset. In 22nd International Conference on Pattern Recognition, pages 3038–3043. Institute of Electrical & Electronics Engineers (IEEE), 2014.
[CZPR19] Jorge Calvo-Zaragoza, Alexander Pacha, and Heinz Roggenkemper. Website of the Workshop on Reading Music Systems 2019. https://sites.google.com/view/worms2019/home (Last visited 18.06.2019), 2019.
[DKKG19] Norbert Dubowy, Robert Klugseder, Franz Kelnreiter, and Paul Gulewycz. Website of the Music Encoding Conference 2019. https://music-encoding.org/conference/2019/ (Last visited 18.06.2019), 2019.
[ETPS18] Ismail Elezi, Lukas Tuggener, Marcello Pelillo, and Thilo Stadelmann. DeepScores and Deep Watershed Detection: Current State and Open Issues. In 1st International Workshop on Reading Music Systems, pages 13–14, Paris, France, 2018.
[FL17] Alicia Fornés and Bart Lamiroy. Website of the 12th IAPR International Workshop on Graphics Recognition. https://grec2017.loria.fr/ (Last visited 18.06.2019), 2017.
[Fuj00] Ichiro Fujinaga. Optical Music Recognition Bibliography.
http://www.music.mcgill.ca/~ich/research/omr/omrbib.html (Last visited 18.06.2019), 2000.
[GJB+18] Mark Gotham, Peter Jonas, Bruno Bower, William Bosworth, Daniel Rootham, and Leigh VanHandel. Scores of Scores: An Openscore Project to Encode and Share Sheet Music. In 5th International Conference on Digital Libraries for Musicology, pages 87–95, Paris, France, 2018. ACM.
[Han12] Andrew Hankinson. Optical Music Recognition Bibliography. http://ddmal.music.mcgill.ca/research/omr/omr_bibliography (Last visited 18.06.2019), 2012.
[HjDWP18] Jan Hajič jr., Matthias Dorfer, Gerhard Widmer, and Pavel Pecina. Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets. In 19th International Society for Music Information Retrieval Conference, pages 225–232, Paris, France, 2018.
[HjP17] Jan Hajič jr. and Pavel Pecina. The MUSCIMA++ Dataset for Handwritten Optical Music Recognition. In 14th International Conference on Document Analysis and Recognition, pages 39–46, Kyoto, Japan, 2017.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[LF16] Audrey Laplante and Ichiro Fujinaga. Digitizing Musical Scores: Challenges and Opportunities for Libraries. In 3rd International Workshop on Digital Libraries for Musicology, pages 45–48, 2016.
[LMB+14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[LPR+] Tsung-Yi Lin, Genevieve Patterson, Matteo R. Ronchi, Yin Cui, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Larry Zitnick, and Piotr Dollár. COCO Detection Leaderboard. http://cocodataset.org/#detection-leaderboard (Last visited 18.06.2019).
[MSH+85] T. Matsushima, I. Sonomoto, T. Harada, K. Kanamori, and S. Ohteru. Automated High Speed Recognition of Printed Music (WABOT-2 Vision System). In International Conference on Advanced Robotics, pages 477–482, 1985.
[Pac17a] Alexander Pacha. Github Repository of the Music Score Classifier. https://github.com/apacha/MusicScoreClassifier (Last visited 18.06.2019), 2017.
[Pac17b] Alexander Pacha. The OMR Datasets Project. https://apacha.github.io/OMR-Datasets (Last visited 18.06.2019), 2017.
[Pac17c] Alexander Pacha. Slack Channel for Research on Optical Music Recognition. http://omr-research.slack.com (Last visited 18.06.2019), 2017.
[Pac18a] Alexander Pacha. Advancing OMR as a Community: Best Practices for Reproducible Research. In 1st International Workshop on Reading Music Systems, pages 19–20, Paris, France, 2018.
[Pac18b] Alexander Pacha. Documentation of the OMR Dataset Tools Python package. https://omr-datasets.readthedocs.io/en/latest (Last visited 18.06.2019), 2018.
[Pac18c] Alexander Pacha. Self-learning Optical Music Recognition. In Vienna Young Scientists Symposium, pages 34–35. Book-of-Abstracts.com, Heinz A. Krebs, 2018. ISBN: 978-3-9504017-8-3.
[Pac19a] Alexander Pacha. Github Organisation for Research on Optical Music Recognition. https://github.com/omr-research (Last visited 18.06.2019), 2019.
[Pac19b] Alexander Pacha. Github Repository for the Deep Learning Based Detector for Measures in Musical Scores. https://github.com/OMR-Research/MeasureDetector/ (Last visited 18.06.2019), 2019.
[PCC+18] Alexander Pacha, Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg, Richard Zanibbi, and Horst Eidenberger.
Handwritten Music Object Detection: Open Issues and Baseline Results. In 13th International Workshop on Document Analysis Systems, pages 163–168, 2018.
[PCZ18] Alexander Pacha and Jorge Calvo-Zaragoza. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In 19th International Society for Music Information Retrieval Conference, pages 240–247, Paris, France, 2018.
[PCZHj19] Alexander Pacha, Jorge Calvo-Zaragoza, and Jan Hajič jr. Learning Notation Graph Construction for Full-Pipeline Optical Music Recognition. In 20th International Society for Music Information Retrieval Conference (in press), 2019.
[PE17a] Alexander Pacha and Horst Eidenberger. Towards a Universal Music Symbol Classifier. In 14th International Conference on Document Analysis and Recognition, pages 35–36, Kyoto, Japan, 2017. IEEE Computer Society.
[PE17b] Alexander Pacha and Horst Eidenberger. Towards Self-Learning Optical Music Recognition. In 16th International Conference on Machine Learning and Applications, pages 795–800, 2017.
[PHjCZ18] Alexander Pacha, Jan Hajič jr., and Jorge Calvo-Zaragoza. A Baseline for General Music Object Detection with Deep Learning. Applied Sciences, 8(9):1488–1508, 2018.
[RCZIn18] David Rizo, Jorge Calvo-Zaragoza, and José M. Iñesta. MuRET: A Music Recognition, Encoding, and Transcription Tool. In 5th International Conference on Digital Libraries for Musicology, pages 52–56, Paris, France, 2018. ACM.
[Sul36] John W. N. Sullivan. Beethoven: His Spiritual Development. Knopf, Alfred A., 1936.
[SZ14] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition.
Computing Research Repository, abs/1409.1556, 2014.
[TES+18] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, Marcello Pelillo, and Thilo Stadelmann. DeepScores - A Dataset for Segmentation, Detection and Classification of Tiny Objects. In 24th International Conference on Pattern Recognition, Beijing, China, 2018.
[TESS18] Lukas Tuggener, Ismail Elezi, Jürgen Schmidhuber, and Thilo Stadelmann. Deep Watershed Detector for Music Object Recognition. In 19th International Society for Music Information Retrieval Conference, pages 271–278, Paris, France, 2018.
[WHP19] Simon Waloschek, Aristotelis Hadjakos, and Alexander Pacha. Identification and Cross-Document Alignment of Measures in Music Score Images. In 20th International Society for Music Information Retrieval Conference (in press), 2019.