cover image: Exploration of Graph Databases : Triangle Enumeration in Large Graphs

Exploration of Graph Databases : Triangle Enumeration in Large Graphs

2 Dec 2023

This thesis falls within the field of large-scale graph analysis, with a particular focus on triangle enumeration. The search for triangles is a fundamental problem that allows us to detect all structures composed of three adjacent vertices, all interconnected by edges. These triangles play a crucial role in many graph-related applications, such as determining the clustering coefficient, assessing the age of a community in a social network, and detecting cliques.To address this study, we conducted an in-depth literature review on triangle detection in graphs. We identified two main categories of approaches: (a) those that assume the graph can fit in main memory, and (b) those that consider that the graph cannot be entirely stored in main memory. Analyzing these works, we found that the first approaches (which rely on main memory management) are effective when the graph can be fully loaded into memory, but this does not always apply to large-scale graphs that do not entirely fit into main memory. Therefore, it is essential to develop efficient algorithms for these types of graphs, which poses a major challenge for the scientific community.In the context of this thesis, we proposed a randomized algorithm (also known as adaptive algorithm) for triangle enumeration in large-scale graphs using distributed Relational Database Management System (RDBMS) technology. We chose Vertica, an RDBMS that is based on a column-oriented storage model, unlike traditional RDBMSs that use a row-oriented model. Our solution also leveraged the distributed $k-machines$ computing model. In this approach, the graph is represented as tables and processed through SQL queries executed across a set of machines while ensuring workload balancing among these nodes. To evaluate this workload balancing, we developed a tool called "PandaSQL," which allows us to analyze the graph's structure and compare our solution to existing algorithms dedicated to triangle enumeration. Furthermore, we made another contribution by studying the behavior of our randomized algorithm in a distributed environment using Python, while managing data communication between machines via the Message Passing Interface (MPI). A theoretical analysis of our algorithm, implemented in the two aforementioned environments, was also conducted.Finally, we conducted a series of experiments using large-scale graphs with different structures while varying the number of machines in the $k-machines$ model (4, 9, and 16 machines). These experiments focused on the performance of the randomized algorithm and its comparison against competing systems in different environments. Specifically, the classic solution on RDBMS, Networkx in Python, Spark Graphx, and G-thinker in a Big Data environment. The results obtained during this thesis highlight the benefits of combining RDBMS technology and parallel algorithms for efficient triangle enumeration in large-scale graphs.

Authors

Abir Farouzi

Bibliographic Reference
Abir Farouzi. Exploration des bases de données orientées graphe : Énumération des triangles dans les graphes à grande échelle. Autre [cs.OH]. ISAE-ENSMA Ecole Nationale Supérieure de Mécanique et d'Aérotechique - Poitiers; Ecole Nationale Supérieure d'Informatique (ESI) - Alger, 2023. Français. ⟨NNT : 2023ESMA0012⟩. ⟨tel-04326161⟩
HAL Collection
['Université de Poitiers', "Ecole Nationale Supérieure de Mécanique et d'Aérotechnique", 'STAR - Dépôt national des thèses électroniques', "Laboratoire d'Informatique et d'Automatique pour les Systèmes"]
HAL Identifier
4326161
Institution
['Université de Poitiers', 'École Nationale Supérieure de Mécanique et d’Aérotechnique [Poitiers]']
Laboratory
Laboratoire d'Informatique et d'Automatique pour les Systèmes
Published in
France

Table of Contents