Enforcing guardedness for acyclic aggregate queries to minimize materialization

Ortner, Andrea

doi:10.34726/hss.2025.127522

DC Field

Value

Language

dc.contributor.advisor

Pichler, Reinhard

dc.contributor.author

Ortner, Andrea

dc.date.accessioned

2025-07-14T11:34:38Z

dc.date.issued

2025

dc.date.submitted

2025-04

dc.identifier.citation

<div class="csl-bib-body"> <div class="csl-entry">Ortner, A. (2025). <i>Enforcing guardedness for acyclic aggregate queries to minimize materialization</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.127522</div> </div>

dc.identifier.uri

https://doi.org/10.34726/hss.2025.127522

dc.identifier.uri

http://hdl.handle.net/20.500.12708/217007

dc.description.abstract

Petabytes an Daten - das ist der Preis unserer globalisierten, digitalen Welt. Nur Hardware alleine kann diesen Berg an Daten nicht bewältigen, zumal ihre Mächtigkeit auch physisch limitiert ist (Moore’s Law). Will man bestimmte Informationen aus diesen Daten abfragen, ist dies mit hohen Kosten verbunden. Um diese zu senken, ist es notwendig, intelligente Software-Lösungen zu entwickeln. Doch warum ist diese Beantwortung bzw. Evaluierung so kostspielig? Ein Grund für die hohen anfallenden Kosten ist das Entstehen sehr großer Zwischenergebnisse bei relativ kleinen Endergebnissen. Dieses Problem konnte Yannakakis bereits 1981 etwas abschwächen, als er einen Algorithmus entwarf, der im Fall von Join-Abfragen nur jene Zwischenergebnisse materialisierte, die auch Teil des Endergebnisses waren. Im Fall von Aggregat-Abfragen bedeutet dies natürlich immer noch, dass große Zwischenergebnisse materialisiert werden, obwohl das Endergebnis nur aus den Aggregat(en) besteht. Die neueste Entwicklung bzgl. Yannakakis-artigen Optimierungen, genannt AggJoin-Operator [LPS24 ], vermeidet Materialisierung im Fall von azyklischen Aggregat-Abfragen komplett. AggJoin setzt dafür aber voraus, dass die zu beantwortenden Abfragen ”piecewise-guarded” sind. Das bedeutet, dass alle Attribute in der GROUP BY Klausel bzw. innerhalb eines Aggregat-Ausdrucks gemeinsam in einer der an den Joins beteiligten Relation vorkommen müssen. Dies stellt eine große Einschränkung dar, schließlich können so insbesondere Abfragen, deren GROUP BY Klauseln Attribute aus mehreren Relationen enthalten (ein häufiger Use-Case in Business Analytics), nicht mit AggJoin evaluiert werden. Aus diesem Grund ist es das Ziel dieser Arbeit, eine Brücke zwischen Yannakakis-artigen Optimierungen (AggJoin), und non-guarded Abfragen zu bauen (d.h. Abfragen, deren gruppierende Attribute aus verschiedenen Relationen stammen). Dazu haben wir PartAggJoin implementiert; eine Erweiterung von AggJoin, die nicht an dieselben Restriktionen gebunden ist: PartAggJoin zerlegt Abfragen-Bäume in Sub-Bäume, welche mit unterschiedlichen Methoden evaluiert werden. Sind gruppierende Attribute vorhanden, wird minimale Materialisierung erlaubt, um keine Informationen zu verlieren; im anderen Fall wird AggJoin angewandt. Dadurch können nun auch Abfragen mit mehreren Attributen im GROUP BY Teil von Yannakakis-artigen Optimierungsstrategien profitieren. Die Effizienz und Performanz unserer Optimierung wurde durch die Implementierung in SparkSQL bewiesen und empirisch anhand der Benchmarks TPC-H und Syn-TPC-Hgezeigt. Syn-TPC-H wurde als Datenset im Rahmen dieser Arbeit geschaffen, um einen gezielten Fokus auf non-guarded Abfragen zu erlauben.

dc.description.abstract

Petabytes of data - such is the cost of our modern, global, digital interconnectedness. The approach of relying solely on hardware to process this amount of data is slowly reaching its limit (Moore’s Law). This is why we need intelligent software solutions that reduce the energy, time and overall cost of data processing. But why is it even so costly?One reason for the high cost of processing is the explosion of intermediate results, even if the final result is rather small. To mitigate this problem to some extent, Yannakakis designed an algorithm in 1981 that, when multiple joins are necessary, only materializes tuples that are actually part of the final result. In case of aggregate queries, this still means that intermediate results are quite big in comparison to the final result (the aggregated values). The state-of-the-art optimization (the AggJoin operator [ LPS24 ]), provides powerful speed-gains and succeeded in avoiding materialization altogether for ”piecewise-guarded” acyclic aggregate queries. This means, that all attributes in the GROUP BY statement need to stem from the same relation, and so do attributes within an aggregate expression. This is very limiting, as queries that have grouping attributes frommultiple relations are a common use case in business analytics. For this reason, this work sought to build a bridge between Yannakakis-style query optimization and non-guarded queries by allowing minimal necessary materialization. To this end, we implemented PartAggJoin, an extension to AggJoin that lifts the restrictions in terms of guardedness:PartAggJoin splits a query tree into subtrees. During the subsequent analysis of these subtrees, PartAggJoin detects whether a subtree contains applicable grouping attributes. If this is the case, it performs the minimal materialization necessary to preserve the grouping attributes. If no grouping attributes are present, an AggJoin is performed. Thus, PartAggJoin removes the limitations of guardedness for queries that contain two or more attributes in GROUP BY clauses that stem from different relations, widening the field of application for Yannakakis-style optimization. The efficiency and performance power of our optimization is validated through its implementation in SparkSQL and empirical evaluation on the TPC-H benchmark as well as on the newly created Syn-TPC-H datasetthat specializes on the evaluation of non-guarded queries.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Anfragebeantwortung

dc.subject

azyklische Anfragen

dc.subject

Aggregate

dc.subject

Optimierung

dc.subject

query evaluation

dc.subject

acyclic queries

dc.subject

aggregates

dc.subject

optimization

dc.title

Enforcing guardedness for acyclic aggregate queries to minimize materialization

dc.title.alternative

Erzeugen von Guardedness für azyklische Aggregatabfragen zur Minimierung der Materialisierung von Zwischenergebnissen

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2025.127522

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Andrea Ortner

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Selzer, Alexander

tuw.publication.orgunit

E192 - Institut für Logic and Computation

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17582882

dc.description.numberOfPages

100

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0002-1760-122X

tuw.assistant.orcid

0000-0002-6867-5448

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

item.grantfulltext

open

item.openairetype

master thesis

item.fulltext

with Fulltext

item.languageiso639-1

item.mimetype

application/pdf

crisitem.author.dept

E192-02 - Forschungsbereich Databases and Artificial Intelligence

crisitem.author.parentorg

E192 - Institut für Logic and Computation

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(3.97 MB)

In Copyright

Show simple item record

Page view(s)

checked on Jul 14, 2025

Download(s)

checked on Jul 14, 2025

Google Scholar^TM

Check

Page view(s)

Download(s)

Google ScholarTM

Google Scholar^TM