Parquet
You can read from and write to Parquet files using Polars.
Reading
You can connect to a Parquet file, like the large ./data/large/census.parquet, without bringing it into memory, using LazyFrame::scan_parquet. Run this code using cargo run -r --example 2_3_1_read_parquet.
use polars::prelude::*;
// Connect to LazyFrame (no data is brought into memory)
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(PlPath::from_str("./data/large/census.parquet"), args).unwrap();
You can also connect to a partitioned Parquet folder (./data/large/partitioned) in exactly the same way:
// Connect to LazyFrame (no data is brought into memory)
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();
In both cases, just as with a CSV-backed LazyFrame, the data is not brought into memory. You can convert a few rows to a DataFrame (bringing them into memory) to inspect it.
println!("{}", lf.limit(5).collect().unwrap());
shape: (5, 21)
┌─────────────────┬────────┬───────┬──────┬───┬─────┬───────────┬────────┬───────┐
│ id ┆ social ┆ birth ┆ econ ┆ … ┆ sex ┆ keep_type ┆ income ┆ chunk │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════╪════════╪═══════╪══════╪═══╪═════╪═══════════╪════════╪═══════╡
│ PTS000000348231 ┆ 2 ┆ 1 ┆ -8 ┆ … ┆ 1 ┆ 1 ┆ 59292 ┆ 47 │
│ PTS000000059235 ┆ 1 ┆ 1 ┆ -8 ┆ … ┆ 1 ┆ 1 ┆ 25731 ┆ 47 │
│ PTS000000060206 ┆ 1 ┆ 1 ┆ -8 ┆ … ┆ 2 ┆ 1 ┆ 88277 ┆ 47 │
│ PTS000000468982 ┆ 3 ┆ 1 ┆ -8 ┆ … ┆ 2 ┆ 1 ┆ 82954 ┆ 47 │
│ PTS000000224308 ┆ 2 ┆ 1 ┆ -8 ┆ … ┆ 2 ┆ 1 ┆ 82315 ┆ 47 │
└─────────────────┴────────┴───────┴──────┴───┴─────┴───────────┴────────┴───────┘
Writing
You can write any DataFrame you have in memory to Parquet. For this example, we will bring one percent of the UK Census into memory. Run this code using cargo run -r --example 2_3_2_write_parquet.
use polars::prelude::*;
// Read `census_0.csv` as LazyFrame
let lf = LazyCsvReader::new(PlPath::from_str("./data/csv/census_0.csv"))
    .with_has_header(true)
    .finish()
    .unwrap();
// Bring it into memory (by converting it to DataFrame)
let mut df = lf.collect().unwrap();
In order to save it, you have to create a file and write to it:
// Write `census_0.parquet`
let mut file = std::fs::File::create("./data/temp_data/census_0.parquet").unwrap();
ParquetWriter::new(&mut file).finish(&mut df).unwrap();
This saves the data into a single .parquet file. The write_partitioned_dataset function can be used to write partitioned Parquet files, based on the values in one or more columns.
Warning
The write_partitioned_dataset function is unstable and undocumented.
For example, you can write one percent of the UK Census data by region and age_group using write_partitioned_dataset. Run this code using cargo run -r --example 2_3_3_write_partitioned_parquet.
Note
The value of 4294967296 bytes (4 GB) was selected for the chunk_size because it is the default for partitioned Parquet files in Polars for Python. This will be the approximate maximum size of each .parquet file created (if the data is large enough).
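As a quick sanity check (an illustration only, not part of the Polars API), the 4 GB figure can be derived with plain integer arithmetic:

```rust
fn main() {
    // 4 GB expressed in bytes: 4 * 1024^3
    // (CHUNK_SIZE is a hypothetical constant name, used here just for the check)
    const CHUNK_SIZE: u64 = 4 * 1024 * 1024 * 1024;
    assert_eq!(CHUNK_SIZE, 4_294_967_296);
    println!("chunk_size = {CHUNK_SIZE} bytes");
}
```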
// This functionality is unstable according to the docs
write_partitioned_dataset(
    &mut df,
    PlPath::from_str("./data/temp_data/partitioned/").as_ref(),
    vec!["region".into(), "age_group".into()],
    &ParquetWriteOptions::default(),
    None,
    4294967296,
)
.unwrap();
This will create a hive-partitioned Parquet dataset based on region and age_group:
folder/
├─ region=E12000001/
├─ region=E12000002/
├─ region=E12000003/
│ ├─ age_group=1/
│ │ ├─ 00000000.parquet
│ ├─ age_group=2/
│ │ ├─ 00000000.parquet
│ ├─ age_group=3/
│ │ ├─ 00000000.parquet
│ ├─ age_group=4/
│ │ ├─ 00000000.parquet
│ ├─ age_group=5/
│ │ ├─ 00000000.parquet
│ ├─ age_group=6/
│ │ ├─ 00000000.parquet
│ ├─ age_group=7/
│ │ ├─ 00000000.parquet
├─ region=E12000004/
├─ ...
├─ region=W92000004/
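To make the naming scheme concrete, here is a small std-only sketch (not Polars code; the column names and values are taken from the tree above) of how hive-style partition paths are composed from column/value pairs:

```rust
// Hive-style partitioning encodes column values in directory names:
// <root>/<col1>=<value1>/<col2>=<value2>/<file>.parquet
fn partition_path(root: &str, parts: &[(&str, &str)], file: &str) -> String {
    let dirs: Vec<String> = parts
        .iter()
        .map(|(col, val)| format!("{col}={val}"))
        .collect();
    format!("{root}/{}/{file}", dirs.join("/"))
}

fn main() {
    let path = partition_path(
        "./data/temp_data/partitioned",
        &[("region", "E12000003"), ("age_group", "4")],
        "00000000.parquet",
    );
    assert_eq!(
        path,
        "./data/temp_data/partitioned/region=E12000003/age_group=4/00000000.parquet"
    );
    println!("{path}");
}
```

Readers that understand hive partitioning (such as scan_parquet) can use these directory names to skip whole folders when a query filters on region or age_group.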
The filter chapter will go into more detail about the advantages of doing this.