Relational databases (RDBs) store vast amounts of structured data across multiple interconnected tables. This rich relational information has immense potential for predictive machine learning. However, the progress of predictive models on RDBs currently lags behind advancements in other domains like computer vision or natural-language processing. One key reason is the lack of established, publicly available RDB benchmarks for model training and evaluation.
Existing predictive models for RDBs often resort to using single-table datasets or graph datasets derived from preprocessed relational data. However, these approaches do not fully capture the native multi-table structure and characteristics of RDBs, potentially limiting model performance.
To address this gap, Amazon’s Shanghai Lablet has developed 4DBInfer, a comprehensive open-source benchmarking tool for graph-centric predictive modeling on RDBs.
4DBInfer enables systematic comparison of diverse baseline models across four key dimensions: (1) RDB datasets, (2) predictive tasks, (3) RDB-to-graph extraction methods, and (4) graph-based predictive architectures. This 4-D design facilitates a thorough exploration of the model design space for RDB predictive analytics.
![4DBInfer.16x9.png](https://assets.amazon.science/dims4/default/b502fff/2147483647/strip/true/crop/1680x945+0+0/resize/1200x675!/quality/90/?url=http%3A%2F%2Famazon-topics-brightspot.s3.amazonaws.com%2Fscience%2Fed%2F7a%2Ff606830c4fff8a769e5fb24d3514%2F4dbinfer-16x9.png)
Let's dive deeper into 4DBInfer's core components:
RDB datasets and tasks: We curate a suite of RDB benchmarks spanning real-world application domains, including e-commerce, advertising, and social networks. These datasets exhibit diverse characteristics in terms of scale (up to billions of rows), schema complexity, and temporal evolution. For each dataset, we define practically relevant predictive tasks, such as estimating missing cell values.
RDB-to-graph extraction: 4DBInfer supports multiple strategies for converting RDBs into graph representations while preserving rich tabular information. The Row2Node approach treats each table row as a graph node, with foreign-key relationships forming the edges. The Row2N/E method selectively converts some rows into edges to capture more nuanced relational structures. 4DBInfer also introduces "dummy tables" to enrich the graph connectivity.
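To make the two extraction strategies concrete, here is a minimal sketch in plain Python. It is not 4DBInfer's implementation: the toy schema, the `id` primary-key convention, the function names `row2node` and `row2ne`, and the rule that a table with exactly two foreign keys is turned into edges are all assumptions made for illustration.

```python
# Hypothetical toy schema: each table is a list of row dicts.
# Assumption: every table's primary key is a column named "id".
tables = {
    "user": [{"id": 1, "age": 30}, {"id": 2, "age": 41}],
    "item": [{"id": 10, "price": 9.5}, {"id": 11, "price": 20.0}],
    "purchase": [
        {"id": 100, "user_id": 1, "item_id": 10, "qty": 2},
        {"id": 101, "user_id": 2, "item_id": 10, "qty": 1},
        {"id": 102, "user_id": 2, "item_id": 11, "qty": 3},
    ],
}
# Foreign keys: table -> {fk_column: referenced_table}
fks = {"purchase": {"user_id": "user", "item_id": "item"}}


def row2node(tables, fks):
    """Every row becomes a node; every foreign-key reference becomes an edge."""
    nodes = {}  # (table, pk) -> non-key feature columns
    edges = []  # ((src_table, src_pk), (dst_table, dst_pk))
    for name, rows in tables.items():
        fk_cols = fks.get(name, {})
        for row in rows:
            key = (name, row["id"])
            nodes[key] = {
                c: v for c, v in row.items() if c != "id" and c not in fk_cols
            }
            for col, target in fk_cols.items():
                edges.append((key, (target, row[col])))
    return nodes, edges


def row2ne(tables, fks):
    """Rows of tables with exactly two foreign keys become edges (illustrative
    rule, assumed here); all other rows become nodes as in row2node."""
    nodes, edges = {}, []  # edges carry a feature dict as a third element
    for name, rows in tables.items():
        fk_cols = fks.get(name, {})
        for row in rows:
            feats = {
                c: v for c, v in row.items() if c != "id" and c not in fk_cols
            }
            if len(fk_cols) == 2:
                (c1, t1), (c2, t2) = fk_cols.items()
                edges.append(((t1, row[c1]), (t2, row[c2]), feats))
            else:
                key = (name, row["id"])
                nodes[key] = feats
                for col, target in fk_cols.items():
                    edges.append((key, (target, row[col]), {}))
    return nodes, edges
```

On this toy schema, `row2node` produces seven nodes and six edges, while `row2ne` keeps only the four user and item nodes and turns each purchase into a single user-item edge that carries `qty` as an edge feature. The two graphs encode the same rows, but a message-passing model sees them differently, which is one reason the extraction step is treated as its own design dimension.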
Graph-based predictive models: We implement a range of strong baseline architectures for graph-based learning, covering both early- and late-feature-fusion paradigms. These include graph neural networks (GNNs) that learn node embeddings based on relational message passing, as well as models that first extract tabular features from the graph using techniques like deep feature synthesis (DFS) before applying classical machine learning predictors.
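The contrast between the two paradigms can be sketched with toy data. The example below is a simplified illustration, not 4DBInfer's code or the API of any DFS library: the tables, the function names `dfs_style_features` and `mean_message_passing`, the scalar node features, and the fixed 0.5 mixing weight are all assumptions. A real GNN would use learned weights and vector-valued features.

```python
from statistics import mean

# Hypothetical toy data: parent rows (users) and child rows (purchases)
# linked by the foreign key user_id.
users = [{"id": 1, "age": 30.0}, {"id": 2, "age": 41.0}]
purchases = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 4.0},
    {"user_id": 2, "amount": 6.0},
]


def dfs_style_features(parents, children, fk, col):
    """Late fusion: summarize child rows as fixed aggregate columns on the
    parent table, which a classical tabular predictor can then consume."""
    out = []
    for p in parents:
        vals = [c[col] for c in children if c[fk] == p["id"]]
        out.append({
            **p,
            f"COUNT({col})": len(vals),
            f"MEAN({col})": mean(vals) if vals else 0.0,
            f"MAX({col})": max(vals) if vals else 0.0,
        })
    return out


def mean_message_passing(node_feat, neighbors):
    """Early fusion, heavily simplified: one untrained message-passing step
    that mixes each node's feature with the mean of its neighbors' features."""
    updated = {}
    for node, h in node_feat.items():
        nbrs = [node_feat[n] for n in neighbors.get(node, [])]
        updated[node] = 0.5 * h + 0.5 * mean(nbrs) if nbrs else h
    return updated


flat = dfs_style_features(users, purchases, fk="user_id", col="amount")
h1 = mean_message_passing(
    {"u1": 30.0, "u2": 41.0, "p1": 10.0, "p2": 4.0, "p3": 6.0},
    {"u1": ["p1"], "u2": ["p2", "p3"], "p1": ["u1"], "p2": ["u2"], "p3": ["u2"]},
)
```

In the late-fusion path, the aggregates are computed once and frozen before any model is trained. In the early-fusion path, the aggregation sits inside the model, so in a trained GNN it is optimized jointly with the prediction objective.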
Extensive experiments using 4DBInfer yield several key insights:
- Using graph-based models to leverage the full multi-table RDB structure generally yields better results than using single-table or simple table-joining models, highlighting the value of relational information.
- The choice of RDB-to-graph extraction strategy significantly influences model performance, underscoring the importance of flexibly exploring this design space.
- Graph models with early feature fusion (e.g., GNNs) tend to outperform late-fusion approaches overall, but the latter can still be competitive in some scenarios, particularly under computational constraints.
- Model performance exhibits dataset- and task-specific variations, emphasizing the need for diverse benchmarks to ensure reliable conclusions.
Through 4DBInfer, we aim to accelerate research on graph-centric predictive modeling for RDBs by providing a unified, fully open-source framework. We believe this work will enable the community to develop novel approaches that effectively harness the power of relational data for prediction tasks. Our experiments also suggest that the most successful solutions may emerge at the intersection of tabular and graph machine learning paradigms, an area ripe for further exploration.