This thesis has made contributions toward building a structured design and implementation flow for DNNs, addressing key challenges in estimation, quantization, and profiling. Through the development of new methodologies and frameworks such as ANNETTE, this work has introduced solutions that enhance the efficiency of DNN deployment on a range of hardware platforms, while also highlighting the areas that require further development to achieve the overall goal of a generalizable design and implementation flow.

In the area of estimation, this thesis successfully implemented latency estimation [Paper I] [1] to create a more holistic understanding of DNN behavior on embedded hardware. The combination of analytical and stochastic models proved to be crucial in this field. The introduction of a confidence framework [Paper II] [2] further improves the usability of the prediction results, allowing for more informed performance tuning across a wide range of architectures.

The next logical step is to connect latency estimation with resource utilization and power consumption models, bringing them together into a unified framework. Such a model would not only offer more comprehensive performance predictions but also make better use of benchmarking across diverse hardware settings. This could also enable accurate predictions for upcoming hardware generations by leveraging architectural insights. Moreover, refining the confidence methods to include optimizations that occur during graph compilation will further increase the trustworthiness of the predictions.

An additional focus will need to be on automated analysis and estimation of whether a network can be inferred on a target device at all. This way, developers can ensure that the selected or designed network contains only supported layers and fits within the memory constraints of the target hardware.

Quantization plays a crucial role in optimizing DNNs for deployment on resource-constrained devices.
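As a minimal illustration of such precision reduction, a uniform symmetric post-training INT8 quantization of a single weight tensor can be sketched as follows. This is a generic textbook scheme, not the per-layer method developed in this thesis; the tensor shape and values are illustrative:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(float(np.max(np.abs(weights))) / 127.0, 1e-12)  # avoid div by zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: a random FP32 tensor shrinks 4x (FP32 -> INT8); the worst-case
# rounding error is bounded by half the quantization step.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
max_error = float(np.abs(w - dequantize(q, s)).max())
```

Post-training schemes of this kind need no retraining, which is why the text above expects them to remain the default; the accuracy loss they introduce is exactly what QAT and criticality-aware bit-width selection try to contain.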
The development of WQR provides a method for dynamically adjusting layer precision based on its criticality. However, quantization remains a challenging and evolving area. No single method is ideal for all networks and hardware platforms, and given the rapid pace at which DNN architectures evolve, this is unlikely to change. The field is moving toward smaller integer datatypes, such as INT4 and INT2 [78], but striking the right balance between compression and accuracy will continue to be a key focus.

As the landscape of DNN architectures shifts, future research should aim to improve the flexibility of quantization techniques, ensuring that they can adapt to the demands of new architectures. Although post-training quantization will likely remain the standard due to its simplicity, methods for QAT will be essential in cases where precision must be maintained despite aggressive quantization. Standards and support for different quantization types will also need to evolve, providing developers with more tools to handle the increasing complexity of DNNs.

Profiling is another area where this thesis has made progress. The use of power side-channel analysis provided valuable insights into the power and latency performance of DNNs at a granular level [Paper IV] [4]. This allows for a detailed understanding of how individual operations impact overall performance, particularly when deployed on embedded platforms. However, despite these advancements, the complexity of the post-processing required for power side-channel analysis prevented the full automation of these measurements. This presents a challenge for scaling the methodology to broader use cases. Additionally, the introduction of the smart padding technique enabled accurate layer isolation for latency profiling, even in the absence of detailed hardware transparency.

Future efforts should aim to combine both approaches to enable joint power consumption and resource estimation.
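A first-order combination of the two profiling outputs could multiply per-layer latency by per-layer average power to obtain energy estimates. The following sketch uses purely hypothetical layer names and numbers, not measured values from this work:

```python
# Hypothetical per-layer measurements: latency as obtained from
# layer-isolated profiling, average power as obtained from
# side-channel traces (all values illustrative).
layers = [
    # (name, latency_ms, avg_power_mW)
    ("conv1", 1.8, 950.0),
    ("conv2", 3.2, 1100.0),
    ("fc",    0.6, 700.0),
]

def energy_mj(latency_ms: float, power_mw: float) -> float:
    """Energy E = P * t; mW * ms = microjoules, so divide by 1000 for mJ."""
    return power_mw * latency_ms / 1000.0

total_latency_ms = sum(lat for _, lat, _ in layers)
total_energy_mj = sum(energy_mj(lat, p) for _, lat, p in layers)
per_layer = {name: energy_mj(lat, p) for name, lat, p in layers}
```

Even this trivial aggregation shows why layer isolation matters: without clean per-layer boundaries in both the latency and the power trace, the per-layer energy attribution cannot be computed at all.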
The smart padding technique could potentially increase the reliability of the power measurements. Automation will also be key to making profiling more scalable, enabling real-time assessments across different hardware platforms and streamlining the benchmarking process. As hardware architectures continue to diversify, automated profiling tools will be crucial for ensuring that performance predictions remain accurate and consistent across platforms.

As the focus of AI development increasingly shifts toward Large Language Models (LLMs) such as GPT and BERT, the methodologies developed in this thesis take on new relevance. While this work primarily addressed Convolutional Neural Networks (CNNs), the growing prominence of LLMs presents a set of unique challenges that must be addressed in future research. LLMs are typically much larger than traditional CNNs for vision applications, and their resource requirements, particularly in terms of memory and power consumption, are substantially higher. The resource and power estimation techniques developed here provide a strong foundation, but they will need to be adapted to the specific needs of LLMs.

The primary difficulty lies in the fact that many LLM implementations are still highly customized, relying on hand-optimized code to achieve maximum performance. Unlike CNNs, where inference frameworks have become widely adopted, LLMs lack such standardization, making their optimization more complex.

Looking ahead, it will be essential to develop more sophisticated resource and power estimation models tailored specifically to LLMs.
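As a starting point, even a first-order LLM memory model must separate static weight storage from the KV cache, which grows with sequence length and batch size, a dependence CNN estimators do not have to capture. A rough sketch under assumed, illustrative model dimensions (the configuration below is generic and does not describe any specific model):

```python
def llm_memory_gib(n_params: float, n_layers: int, n_kv_heads: int,
                   head_dim: int, seq_len: int, batch: int,
                   bytes_per_value: int = 2) -> tuple[float, float]:
    """First-order footprint: weights plus KV cache (two tensors: K and V)."""
    weight_bytes = n_params * bytes_per_value
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * seq_len * batch * bytes_per_value)
    gib = 1024 ** 3
    return weight_bytes / gib, kv_bytes / gib

# Illustrative 7B-parameter configuration in 16-bit precision:
w_gib, kv_gib = llm_memory_gib(7e9, n_layers=32, n_kv_heads=32,
                               head_dim=128, seq_len=4096, batch=1)
```

Such a closed-form model ignores activation buffers and runtime overhead, but it already exposes the sequence-length term that a unified estimation framework would have to benchmark and calibrate per platform.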
Hardware support for LLMs is also still evolving, and future research should explore how to optimize these models for next-generation hardware platforms, ensuring that both efficiency and performance are maximized. Standards for LLM optimization will need to mature, and tools that can automate the benchmarking and profiling of these large-scale models will be crucial for their successful deployment.

As AI models and hardware continue to evolve, the need for adaptable, flexible, and efficient design methodologies will only grow. The contributions made in this thesis provide a strong foundation for this evolution, offering valuable insights and tools that will help drive further advancements in DNN optimization. By continuing to refine these methods and adapting them to new technologies, the field will move closer to realizing a fully automated, generalizable DNN design and implementation flow.