Group Information
Haoyu Zhang: haoyuzha
Yuchen Liu: yuchenl6
Project Description
We would like to realize the techniques discussed in Paper: Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates in DuckDB. Aggify is an optimization pass that can rewrite cursor loops in Procedural User Defined Functions as custom aggregation functions. Then we can apply techniques like UDF inlining [2, 3] or UDF compilation [4]. We would like to target PL/PgSQL UDFs and the database platform DuckDB in this project. We choose PL/pgSQL because it is static and the language syntax is relatively conservative, making the compilation process easier. We choose DuckDB because we already have abundant experience with it, and the open-source community of it has been extremely helpful.
This technique requires constructing CFG for the entire procedure code, doing some data flow analysis including reaching definition analysis and live variable analysis as well as generating custom aggregation function in C++ and rewriting the original UDF that contains cursor loop to call the aggregation function.
75% Goal
Pass all six TPCH test cases.
Construct CFG of PL/pgSQL.
Do correct Data Flow Analysis.
Code generation.
Using the vectorized API (need to check if there is a scalar version)
Rewrite original UDF.
100% Goal
Perform optimizations related to loops, such as DCE, LICM.
125% Goal
Perform “Query Motion”. Take Query 2 as an example, if there is a filter condition on the top of the loop body, we can hoist it out of the loop and move it to the “where” clause in cursor select statement. By doing this, the loop iteration count is reduced and less data is materialized.
declare c1 cursor for (select PS_SUPPLYCOST, S_NAME
from dbo.partsupp, dbo.supplier where ...);
...
fetch next from c1 into @fetchedCost, @fetchedName;
if(@fetchedCost<@minCost) //<-- filter on top of loop body
begin
...
Logistics
Getting Started
Since this project is a subcomponent of an ongoing project, the CFG and compilation framework has already been under development, and we can reuse most of them. By now, the CFG is not fully functional. Neither is the code generation backend from CFG to C++. Therefore, we will take the progress of these prerequisites into account in the following schedule.
Implementing Aggify on the current framework will not only help us improve the framework to be more robust and generic, it will also reduce the time for cursor loops by 1000x (in SQL Server) if the UDF has one [1].
Schedule
Start from the week on Oct 30th to Dec 7th.
Week No. | Task |
---|---|
1 | Haoyu: Fully read the existing codebase. Yuchen: Generate the CFG for PL/pgSQL. |
2 | Haoyu: Do Data Flow Analysis on the CFG. Yuchen: Generate code from CFG to C++. |
3 | Haoyu: Correctly manipulate the CFG based on Aggify. Yuchen: Generate correct code using vectorized custom aggregation API. |
4 | Haoyu & Yuchen: Benchmark on TPC-H workloads. |
5 | Haoyu: Work on DCE, LICM. Yuchen: Work on Query Motion. |
6 | Haoyu & Yuchen: Prepare presentation, poster and report. |
Milestone
The milestone on Nov 20th will include all the tasks by week 3. More specifically, this means the whole pipeline should work by checking manually (without the need to pass all TPC-H test cases).
Resources Needed
Hard work.
References
[1] S. Gupta, S. Purandare, and K. Ramachandra, “Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland OR USA: ACM, Jun. 2020, pp. 559–573. doi: 10.1145/3318464.3389736.
[2] K. Ramachandra, K. Park, K. V. Emani, A. Halverson, C. Galindo-Legaria, and C. Cunningham, “Froid: optimization of imperative programs in a relational database,” Proc. VLDB Endow., vol. 11, no. 4, pp. 432–444, Dec. 2017, doi: 10.1145/3186728.3164140.
[3] D. Hirn and T. Grust, “One WITH RECURSIVE is Worth Many GOTOs,” in Proceedings of the 2021 International Conference on Management of Data, Virtual Event China: ACM, Jun. 2021, pp. 723–735. doi: 10.1145/3448016.3457272.
[4] M. Sichert and T. Neumann, “User-defined operators: efficiently integrating custom algorithms into modern databases,” Proc. VLDB Endow., vol. 15, no. 5, pp. 1119–1131, Jan. 2022, doi: 10.14778/3510397.3510408.