Identifying related smart contracts by their bytecode

Yücel, Tan

doi:10.34726/hss.2022.96144

Record link:

https://doi.org/10.34726/hss.2022.96144
http://hdl.handle.net/20.500.12708/20180

Title:

Identifying related smart contracts by their bytecode

Citation:

Yücel, T. (2022). Identifying related smart contracts by their bytecode [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.96144

reposiTUm DOI:

10.34726/hss.2022.96144

CatalogPlus:

AC16527210

Publication Type:

Thesis - Diploma Thesis (Diplomarbeit)

Language:

English

Authors:

Yücel, Tan

Advisor:

Salzer, Gernot

Co-advisor:

di Angelo, Monika

Organisational Unit:

E192 - Institut für Logic and Computation

Date (published):

2022

Number of Pages:

108

Keywords:

Smart contracts; Similarity; Bytecode; Taint analysis; Control-flow graph; Graph embedding; graph2vec; Ethereum; Simulated execution

Abstract:

The growing interest in smart contracts (blockchain programs), the steadily rising number of participants in the crypto ecosystem, and novel business ideas all make the overall system increasingly complex. Since only a few developers disclose the source code of their smart contracts, identifying functional similarities between smart contracts can be useful, in particular when comparing open-source contracts with closed-source ones. The automated analysis of programs that are available only as machine code remains an active area of research.

This thesis investigates a method for detecting similarities between smart contracts that are available only as machine code. It builds on a publication by Huang et al. (2021) that combines machine learning methods, taint analysis, and simulated bytecode execution. Starting from code segments that contain a security vulnerability, the authors extract so-called slices and represent them as numerical vectors. By comparing these with analogously encoded smart contracts, they succeed in identifying similar vulnerabilities in other programs.

Our primary goal is to reproduce the work of Huang et al. and to verify it with our own datasets. One difficulty is that Huang et al. describe their method only incompletely and that their data is not publicly available. Our results are therefore not directly comparable, but our experiments provide indications that our reconstruction largely corresponds to the original method. In addition, we propose a number of improvements.

A further goal of our work is to extend the method of Huang et al. in order to determine the similarity between smart contracts as a whole. We compare the heuristic matching method developed for this purpose with established metrics such as the Jaccard index of the contracts' function signatures. Experiments conducted on datasets of wallet contracts show that there is a moderate correlation between these similarity measures.

Smart contracts (blockchain programs) have attracted significant interest, and the growing number of participants in the ecosystem as well as refined business cases add layers of complexity to the system. Automated tools quickly reach their limits when trying to make sense of closed-source smart contracts. As only a small proportion of live smart contracts provide their source openly, it can be helpful to automatically determine the functional similarity between smart contracts, especially between open-source and closed-source ones.

In this thesis, we attempt to identify related smart contracts by extending the work of Huang et al. (2021), who tried to detect vulnerabilities in smart contracts using a combination of machine learning techniques, taint analysis, and simulated bytecode execution. By extracting segments of bytecode into so-called slices and calculating their numerical vector representations, the authors were able to compare parts of smart contracts with each other. Applying this to a set of vulnerable contracts allowed for the detection of previously unknown vulnerabilities in other smart contracts.

The primary goal of this work is to create suitable datasets and verify the methods deployed by Huang et al. While our results do not reach the levels of Huang et al., we believe that they show that our implementation of the method works as intended. Based on our findings, we compile a list of suggestions for substantial improvements in future iterations.

The second goal of this thesis is to extend the above-mentioned method so that it can also detect similarities between contracts as a whole. We deploy a scalable heuristic method that matches several slices of different contracts with each other. By comparing our computed similarity score with existing metrics, e.g. the Jaccard index over the function signatures of two contracts, we try to argue in favor of our extension of the original method. Experiments conducted over a dataset consisting of wallet contracts show that a moderately high correlation between our method and the Jaccard index can be achieved.
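
The Jaccard index over function signatures that the abstract uses as a reference metric can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `jaccard_index`, the treatment of two empty sets as identical, and the example selector sets are assumptions made here for illustration only.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption made for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Illustrative sets of 4-byte function selectors for two hypothetical contracts.
contract_a = {"0xa9059cbb", "0x70a08231", "0x095ea7b3"}
contract_b = {"0xa9059cbb", "0x70a08231", "0x23b872dd"}

# Two shared selectors out of four distinct ones.
score = jaccard_index(contract_a, contract_b)
print(score)
```

A score of 1.0 means both contracts expose exactly the same set of signatures, and 0.0 means they share none. The metric only looks at the exposed interface, which is why it serves as a baseline for comparison with a bytecode-level similarity score rather than a replacement for one.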

License:

In Copyright

Appears in Collections:

Thesis