Strobl, S. (2026). Pinging the Internet : a study on efficient long-term storage, analysis and visualisation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.140042
E194 - Institut für Information Systems Engineering
Date (published): 2026
Number of Pages: 89
Keywords: Security (en)
Abstract:
Internet measurements generate vast amounts of data; for example, a single scan over the entire IPv4 address space produces 2^32 or roughly 4.3 billion data points. Consequently, storage and analysis solutions must be carefully designed to handle such amounts of data efficiently. In this thesis, we conduct a comprehensive study on efficient storage and querying of Internet measurement data. In addition, we present a visualization tool that enables interactive analysis of these measurements.

We begin by comparing file-based storage solutions for baseline measurement data. Specifically, we benchmark the CSV file format against the Parquet file format. Our results show that using Parquet reduces disk space requirements by a factor of 20.7 compared to CSV. Next, we investigate database-based storage solutions. We compare a traditional row-based relational database, PostgreSQL, with a column-oriented time-series database, ClickHouse. For each database, we evaluate multiple schema designs and benchmark their storage requirements. In terms of storage efficiency, ClickHouse significantly outperforms PostgreSQL. To further reduce disk space usage, we design aggregated schemas tailored for ClickHouse and analyze their query performance. We evaluate a diverse set of queries, differing in type and informational output, and observe that an aggregated schema using a single table outperforms the aggregated relational alternative. This schema is therefore selected for use in our analysis tool.

Finally, we deploy the selected database and schema using real-world measurement data. We collect data by performing ICMP echo request scans over the entire IPv4 address space using ZMap. An interactive analysis platform is implemented using the Python framework Streamlit, which uses our queries to provide a web-based interface for exploring and analyzing the collected data.
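
To make the scale mentioned in the abstract concrete, the following minimal sketch works through the arithmetic for one full IPv4 scan. The per-row CSV size of 30 bytes is a purely hypothetical figure chosen for illustration and is not a measurement from the thesis; only the 2^32 address count and the reported CSV-to-Parquet factor of 20.7 come from the abstract, and that factor may not carry over to other record layouts.

```python
# Back-of-the-envelope sizing for one full IPv4 scan.
# ASSUMPTION: HYPOTHETICAL_CSV_BYTES_PER_ROW is illustrative only,
# not a value reported in the thesis.

IPV4_ADDRESS_BITS = 32
HYPOTHETICAL_CSV_BYTES_PER_ROW = 30   # assumed average CSV line length
REPORTED_PARQUET_FACTOR = 20.7        # CSV-to-Parquet reduction stated in the abstract


def scan_data_points(bits: int = IPV4_ADDRESS_BITS) -> int:
    """Number of addresses, and thus data points, in one full scan."""
    return 2 ** bits


def estimated_size_gib(rows: int, bytes_per_row: float) -> float:
    """Rough storage estimate in GiB for a given average row size."""
    return rows * bytes_per_row / 2 ** 30


points = scan_data_points()
csv_gib = estimated_size_gib(points, HYPOTHETICAL_CSV_BYTES_PER_ROW)
# Applying the reported factor to the assumed CSV size is itself an assumption.
parquet_gib = csv_gib / REPORTED_PARQUET_FACTOR

print(f"data points per scan: {points}")        # 4294967296
print(f"assumed CSV size:     {csv_gib:.1f} GiB")
print(f"implied Parquet size: {parquet_gib:.1f} GiB")
```

Under these assumptions, repeated scans add up quickly, which is the motivation the abstract gives for comparing storage formats and schemas.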