DataFrame Interchange: An Example
Recently I have been trying to explore the data analysis crates available in Rust, with the thesis that “Rust is data analysis ready”. It turns out that while there are tons of really great data analysis crates in Rust, mostly based on Polars, they don’t work well as an ecosystem. Because of Rust’s strong type system, you can’t take the Polars 0.45 output from one crate and give it to another crate that expects Polars 0.43. It just won’t work - you will get an error[E0308]: mismatched types. On top of that, many crates only output arrow-rs data, and support for arrow-rs was removed in Polars 0.44. All of this was a serious blocker to Rust being data analysis ready! Interoperability between these crates was assumed to take place on the Python side of things, not within Rust.
Since Polars uses Apache Arrow’s memory model, and the Arrow memory model implements a C data interchange format, zero-copy data interchange can be implemented between any version of Polars and any version of Arrow. This is what my df-interchange crate does! With the correct versions of Arrow or Polars enabled as feature flags (e.g. polars_0_41, polars_0_46, arrow_53) you can move data between any version of Polars (>=0.40) and any version of Arrow (>=50) directly within a data pipeline.
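For context, the C data interchange format in question is Arrow’s C Data Interface: a small, stable #[repr(C)] struct that every Arrow implementation can export and import. Here is a sketch of its layout (field layout per the Arrow spec; shown for illustration only - df-interchange handles all of this internally):

```rust
use std::ffi::c_void;

// The Arrow C Data Interface array struct (layout per the Arrow spec).
// Because this C ABI is stable across versions, two crates compiled
// against different Polars/Arrow versions can hand buffers to each
// other as raw pointers - no copying, no shared Rust types required.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    // The producer sets `release`; the consumer calls it when done,
    // so ownership crosses the version boundary safely.
    pub release: Option<unsafe extern "C" fn(array: *mut ArrowArray)>,
    pub private_data: *mut c_void,
}

fn main() {
    // On a 64-bit target: 5 i64 fields + 5 pointer-sized fields = 80 bytes.
    println!("{}", std::mem::size_of::<ArrowArray>());
}
```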
Data pipeline example
Here is a working data pipeline example that takes data from a .parquet file with Polars 0.46, a .duckdb database with DuckDB (returning an Arrow 53 RecordBatch vector) and a PostgreSQL database with ConnectorX (returning a Polars 0.45 DataFrame). These files are created using the Palmer Penguins data, and the three sources can be seeded with this Rust script.
Now that we have the seeded files, we can read them in, concatenate them and pass them to Plotlars for data visualization and to Hypors for hypothesis testing. The full script can be found here.
Lastly, here are the crates, versions and features for this example:
[dependencies]
polars = { version = "0.46", features = ["parquet", "pivot", "lazy"] }
connectorx = { version = "0.4.1", features = ["src_postgres", "dst_arrow", "dst_polars"] }
duckdb = "1.1"
hypors = "0.2.5"
plotlars = "0.8.1"
df-interchange = { version = "0.1", features = ["polars_0_43", "polars_0_45", "polars_0_46", "arrow_53"] }
Reading the data
To start, we can read the three parts of the Penguin data from the .parquet file, the .duckdb database and the PostgreSQL database.
Read ~1/3rd of the Penguin data from .parquet using Polars. Returns a Polars 0.46 DataFrame.
let mut file = std::fs::File::open("penguins_1.parquet").unwrap(); // file name assumed
let polars = ParquetReader::new(&mut file).finish().unwrap();
Read ~1/3rd of the Penguin data from DuckDB. Returns an Arrow 53 Vec<RecordBatch>.
let conn = Connection::open("penguins.duckdb").unwrap(); // path assumed
let mut stmt = conn.prepare("SELECT * FROM penguins").unwrap(); // query assumed
let duckdb: Vec<RecordBatch> = stmt.query_arrow([]).unwrap().collect();
Read ~1/3rd of the Penguin data from PostgreSQL with ConnectorX. Returns a Polars 0.45 DataFrame.
let source_conn =
    SourceConn::try_from("postgresql://user:pass@localhost:5432/penguins").unwrap(); // connection string assumed
let connectorx = get_arrow(&source_conn, None, &[CXQuery::from("SELECT * FROM penguins")]) // query assumed
    .unwrap()
    .polars()
    .unwrap();
Interchange
So now we have Polars 0.46, Arrow 53 and Polars 0.45 data in memory. If you try to concatenate the two Polars DataFrames you will get an error[E0308]: mismatched types error. The df-interchange crate can be used to convert two of the three data objects to Polars 0.46.
Let’s first convert the Arrow 53 Vec<RecordBatch> we got from DuckDB to Polars 0.46 using Interchange::from_arrow_53() and .to_polars_0_46():
let duckdb = Interchange::from_arrow_53(duckdb)
    .unwrap()
    .to_polars_0_46()
    .unwrap()
    .lazy()
    // Expressions assumed: cast columns so the schema matches the other sources
    .with_column(col("year").cast(DataType::Int64))
    .with_column(col("body_mass_g").cast(DataType::Float64));
Next we can convert the Polars 0.45 DataFrame we got from ConnectorX to Polars 0.46 using Interchange::from_polars_0_45() and .to_polars_0_46():
let connectorx = Interchange::from_polars_0_45(connectorx)
    .unwrap()
    .to_polars_0_46()
    .unwrap()
    .lazy();
Now that we have three in-memory data objects using the Polars 0.46 crate, we can concatenate them (using Polars’ LazyFrame):
let polars = concat([polars.lazy(), duckdb, connectorx], UnionArgs::default())
    .unwrap();
Plotlars
Now that we have one concatenated LazyFrame in memory called polars, we can pass a copy of it to Plotlars to create a graphic! Plotlars takes Polars 0.45, so let’s convert it to that with Interchange::from_polars_0_46() and .to_polars_0_45():
let polars_0_45 = Interchange::from_polars_0_46(polars.clone().collect().unwrap())
    .unwrap()
    .to_polars_0_45()
    .unwrap();
And now we can render the graph as html:
// Plot settings assumed (Palmer Penguins columns)
let html = ScatterPlot::builder()
    .data(&polars_0_45)
    .x("bill_length_mm")
    .y("flipper_length_mm")
    .group("species")
    .opacity(0.5)
    .size(12)
    .colors(vec![
        Rgb(178, 34, 34),
        Rgb(65, 105, 225),
        Rgb(255, 140, 0),
    ])
    .plot_title("Penguin Bill Length vs Flipper Length")
    .x_title("Bill Length (mm)")
    .y_title("Flipper Length (mm)")
    .legend_title("Species")
    .build()
    .to_html();
let mut file = std::fs::File::create("plot.html").unwrap(); // output path assumed
file.write_all(html.as_bytes()).unwrap();
See output here.
Hypors
Using the same concatenated LazyFrame called polars, we can modify and pivot it so that it is accepted by Hypors, in order to run an analysis of variance (ANOVA) test.
let polars = polars
    .select([col("species"), col("flipper_length_mm")]) // columns assumed
    .with_row_index("index", None);
// Pivot so each species becomes its own column (arguments assumed)
let polars_pivot = pivot_stable(
    &polars.collect().unwrap(),
    ["species"],
    Some(["index"]),
    Some(["flipper_length_mm"]),
    false,
    None,
    None,
)
.unwrap()
.drop("index")
.unwrap();
Once properly configured, we can convert it to Polars 0.43:
let polars_pivot = Interchange::from_polars_0_46(polars_pivot)
    .unwrap()
    .to_polars_0_43()
    .unwrap();
And now we can pass the columns to the anova() function and print the results!
let cols = polars_pivot.get_columns();
// anova() call assumed: one column per group, plus the significance level
let result = anova(&[&cols[0], &cols[1], &cols[2]], 0.05).unwrap();
println!("F-statistic: {}", result.test_statistic);
println!("p-value: {}", result.p_value);
F-statistic: 594.8016274385171
p-value: 0
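For intuition, the F-statistic printed above is the ratio of between-group variance to within-group variance. Here is a minimal std-only sketch of that computation on made-up flipper-length values (these are not the real penguin data, and this is not the Hypors implementation):

```rust
// One-way ANOVA F-statistic: (between-group mean square) / (within-group
// mean square). Illustration only; the pipeline uses Hypors' anova().
fn anova_f(groups: &[Vec<f64>]) -> f64 {
    let k = groups.len() as f64;
    let n: f64 = groups.iter().map(|g| g.len() as f64).sum();
    let grand_mean = groups.iter().flatten().sum::<f64>() / n;
    // Between-group sum of squares
    let ssb: f64 = groups
        .iter()
        .map(|g| {
            let m = g.iter().sum::<f64>() / g.len() as f64;
            g.len() as f64 * (m - grand_mean).powi(2)
        })
        .sum();
    // Within-group sum of squares
    let ssw: f64 = groups
        .iter()
        .map(|g| {
            let m = g.iter().sum::<f64>() / g.len() as f64;
            g.iter().map(|x| (x - m).powi(2)).sum::<f64>()
        })
        .sum();
    (ssb / (k - 1.0)) / (ssw / (n - k))
}

fn main() {
    // Made-up flipper lengths for three species
    let groups = [
        vec![189.0, 195.0, 193.0],
        vec![217.0, 221.0, 219.0],
        vec![196.0, 198.0, 197.0],
    ];
    println!("F = {:.3}", anova_f(&groups)); // F = 127.349
}
```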
Conclusion
Prior to df-interchange, attempting to do this in Rust would have been extremely hard. You would likely have had to read each source, convert it to Parquet, then re-read it with the correct version of Polars for each of the crates (Plotlars and Hypors). This requires a lot more reading and writing of data - trivial for small tables like this, but in a real-world pipeline it can be incredibly slow. Now it’s as simple as adding a few lines of code and passing the correct version of the object to the analysis crates.