Joins
This section explores how to combine two datasets, either by stacking them one on top of the other (same columns) or by joining them side by side (matching rows on a key). Run this code using cargo run -r --example 3_5_1_joins.
Concatenate
First, we create a vector containing five copies of the 1% Census sample. In ./data/parquet, we have 1% sample files of the UK Census; each 1% sample is identical.
use polars::prelude::*;
// Connect and process multiple census chunks
let mut census_chunk = vec![];
for c in 1..6 {
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(
PlPath::from_string(format!("./data/parquet/census_{c}.parquet")),
args,
)
.unwrap();
census_chunk.push(lf);
}
To concatenate data of the same row-shape on top of each other, we can use the concat function by listing the LazyFrames we want to stack together. Here, we concatenate all five 1% samples of the UK Census from the census_chunk vector to create a 5% sample of the Census.
// Concatenate vertically two (or more) datasets
let five_percent_sample = concat(
[
census_chunk[0].clone(), // Cloned, since we need it later
census_chunk[1].clone(),
census_chunk[2].clone(),
census_chunk[3].clone(),
census_chunk[4].clone(),
],
UnionArgs::default(),
)
.unwrap();
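Since concat accepts anything that can be viewed as a slice of LazyFrames, the same result can be written without enumerating each clone. A sketch, assuming the census_chunk vector built above:

```rust
// Sketch: concat takes any AsRef<[LazyFrame]>, so the whole vector can be
// passed by reference, leaving census_chunk available for later use.
let five_percent_sample = concat(&census_chunk, UnionArgs::default()).unwrap();
```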
If we print the result, it shows that we have a DataFrame of over 3 million rows (5 x ~600,000 rows per 1% sample). We can also see that the chunk column takes values from 1 to 5.
// See `chunk` going from 1 to 5
println!("{}", five_percent_sample.collect().unwrap());
shape: (3_021_755, 22)
┌─────────────────┬────────┬───────┬──────┬───┬───────────┬────────┬────────┬───────┐
│ id ┆ social ┆ birth ┆ econ ┆ … ┆ keep_type ┆ income ┆ weight ┆ chunk │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════╪════════╪═══════╪══════╪═══╪═══════════╪════════╪════════╪═══════╡
│ PTS000000588097 ┆ 4 ┆ 1 ┆ 1 ┆ … ┆ 1 ┆ 46223 ┆ 109 ┆ 1 │
│ PTS000000000320 ┆ -8 ┆ 1 ┆ 5 ┆ … ┆ 1 ┆ null ┆ 97 ┆ 1 │
│ PTS000000397448 ┆ -8 ┆ 2 ┆ 5 ┆ … ┆ 1 ┆ null ┆ 90 ┆ 1 │
│ PTS000000082442 ┆ -8 ┆ 1 ┆ 5 ┆ … ┆ 1 ┆ null ┆ 108 ┆ 1 │
│ PTS000000016066 ┆ 4 ┆ 1 ┆ 8 ┆ … ┆ 1 ┆ null ┆ 75 ┆ 1 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ PTS000000166524 ┆ -8 ┆ 2 ┆ 6 ┆ … ┆ 1 ┆ null ┆ 110 ┆ 5 │
│ PTS000000246489 ┆ 2 ┆ 1 ┆ 1 ┆ … ┆ 1 ┆ 26757 ┆ 107 ┆ 5 │
│ PTS000000177162 ┆ -8 ┆ 1 ┆ 9 ┆ … ┆ 1 ┆ null ┆ 75 ┆ 5 │
│ PTS000000377217 ┆ -8 ┆ 1 ┆ 6 ┆ … ┆ 1 ┆ null ┆ 113 ┆ 5 │
│ PTS000000377192 ┆ 1 ┆ 2 ┆ 2 ┆ … ┆ 1 ┆ 36811 ┆ 121 ┆ 5 │
└─────────────────┴────────┴───────┴──────┴───┴───────────┴────────┴────────┴───────┘
Joins
Polars has multiple options for joining data by row. To make the data easier to visualize, the code below processes the data in the census_chunk vector: it keeps only a few variables, removes rows without any income and renames the income column to reflect the chunk number. Note that we will use id as the linkage key, but each chunk is identical (the same 1% sample of the UK Census), so we drop a random 50% of the rows from each to better show how joins work. This can be done with DataFrame's sample_n_literal.
// Update the census_chunk vector to remove variables and update values
for c in 1..6 {
census_chunk[c - 1] = census_chunk[c - 1]
.clone()
.filter(col("income").is_not_null())
.select([col("id"), col("income").alias(format!("inc_{c}"))])
.collect() // Collecting is necessary for sampling, but the data is small here
.unwrap()
.sample_n_literal(150_000, false, false, Some(c as u64))
.unwrap()
.lazy();
}
Here is what each chunk of data approximately looks like at this point:
println!("{}", census_chunk[0].clone().collect().unwrap());
shape: (150_000, 2)
┌─────────────────┬───────┐
│ id ┆ inc_1 │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════╪═══════╡
│ PTS000000407151 ┆ 38619 │
│ PTS000000127436 ┆ 62296 │
│ PTS000000499339 ┆ 13414 │
│ PTS000000397534 ┆ 19402 │
│ PTS000000490519 ┆ 63475 │
│ … ┆ … │
│ PTS000000403316 ┆ 67194 │
│ PTS000000150748 ┆ 85943 │
│ PTS000000022816 ┆ 88017 │
│ PTS000000140902 ┆ 66287 │
│ PTS000000078467 ┆ 36567 │
└─────────────────┴───────┘
Now that we have simpler data, we can join them. In this example, we do multiple left joins in a row with the left_join function, always keeping the original population (akin to creating a cohort). To do these joins, all we need is the data and the left and right keys (id in this example).
// Left join (creating a cohort)
let cohort = census_chunk[0]
.clone()
.left_join(census_chunk[1].clone(), col("id"), col("id"))
.left_join(census_chunk[2].clone(), col("id"), col("id"))
.left_join(census_chunk[3].clone(), col("id"), col("id"))
.left_join(census_chunk[4].clone(), col("id"), col("id"));
println!("{}", cohort.collect().unwrap());
This gives us a longitudinal cohort, keeping the population from the first dataset:
shape: (150_000, 6)
┌─────────────────┬───────┬───────┬───────┬───────┬───────┐
│ id ┆ inc_1 ┆ inc_2 ┆ inc_3 ┆ inc_4 ┆ inc_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ PTS000000407151 ┆ 38619 ┆ 38619 ┆ null ┆ null ┆ null │
│ PTS000000127436 ┆ 62296 ┆ 62296 ┆ null ┆ null ┆ 62296 │
│ PTS000000499339 ┆ 13414 ┆ null ┆ 13414 ┆ 13414 ┆ 13414 │
│ PTS000000397534 ┆ 19402 ┆ null ┆ 19402 ┆ null ┆ 19402 │
│ PTS000000490519 ┆ 63475 ┆ null ┆ 63475 ┆ null ┆ 63475 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ PTS000000403316 ┆ 67194 ┆ null ┆ null ┆ 67194 ┆ 67194 │
│ PTS000000150748 ┆ 85943 ┆ 85943 ┆ null ┆ null ┆ 85943 │
│ PTS000000022816 ┆ 88017 ┆ null ┆ null ┆ 88017 ┆ null │
│ PTS000000140902 ┆ 66287 ┆ null ┆ 66287 ┆ null ┆ null │
│ PTS000000078467 ┆ 36567 ┆ 36567 ┆ 36567 ┆ null ┆ null │
└─────────────────┴───────┴───────┴───────┴───────┴───────┘
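The mechanics behind this result can be sketched in plain Rust (no Polars), using a HashMap as the right-hand table: every left row is kept, and the right value becomes None (null in Polars) when the key is absent. The names and data here are made up for illustration:

```rust
use std::collections::HashMap;

// Plain-Rust sketch of left-join semantics: keep every left row, and look up
// the matching right value, which is None when the key has no match.
fn left_join(
    left: &[(String, i64)],
    right: &HashMap<String, i64>,
) -> Vec<(String, i64, Option<i64>)> {
    left.iter()
        .map(|(id, inc)| (id.clone(), *inc, right.get(id).copied()))
        .collect()
}

fn main() {
    let left = vec![("A".to_string(), 10), ("B".to_string(), 20)];
    let right: HashMap<String, i64> = [("A".to_string(), 11)].into();
    let joined = left_join(&left, &right);
    assert_eq!(joined[0], ("A".to_string(), 10, Some(11)));
    assert_eq!(joined[1], ("B".to_string(), 20, None)); // unmatched -> None
    println!("{joined:?}");
}
```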
In the same way, we can use other types of joins, such as the inner_join:
// Inner join (creating a "always earning" cohort)
let longitudinal_all = census_chunk[0]
.clone()
.inner_join(census_chunk[1].clone(), col("id"), col("id"))
.inner_join(census_chunk[2].clone(), col("id"), col("id"))
.inner_join(census_chunk[3].clone(), col("id"), col("id"))
.inner_join(census_chunk[4].clone(), col("id"), col("id"));
println!("{}", longitudinal_all.collect().unwrap());
This creates a cohort of those who are in every dataset.
shape: (9_981, 6)
┌─────────────────┬───────┬───────┬───────┬───────┬───────┐
│ id ┆ inc_1 ┆ inc_2 ┆ inc_3 ┆ inc_4 ┆ inc_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ PTS000000282273 ┆ 78586 ┆ 78586 ┆ 78586 ┆ 78586 ┆ 78586 │
│ PTS000000221209 ┆ 19179 ┆ 19179 ┆ 19179 ┆ 19179 ┆ 19179 │
│ PTS000000098103 ┆ 41441 ┆ 41441 ┆ 41441 ┆ 41441 ┆ 41441 │
│ PTS000000538070 ┆ 74324 ┆ 74324 ┆ 74324 ┆ 74324 ┆ 74324 │
│ PTS000000364383 ┆ 72375 ┆ 72375 ┆ 72375 ┆ 72375 ┆ 72375 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ PTS000000182678 ┆ 45183 ┆ 45183 ┆ 45183 ┆ 45183 ┆ 45183 │
│ PTS000000122168 ┆ 30325 ┆ 30325 ┆ 30325 ┆ 30325 ┆ 30325 │
│ PTS000000079312 ┆ 74982 ┆ 74982 ┆ 74982 ┆ 74982 ┆ 74982 │
│ PTS000000461362 ┆ 96417 ┆ 96417 ┆ 96417 ┆ 96417 ┆ 96417 │
│ PTS000000345005 ┆ 31422 ┆ 31422 ┆ 31422 ┆ 31422 ┆ 31422 │
└─────────────────┴───────┴───────┴───────┴───────┴───────┘
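Inner-join semantics can be sketched the same way in plain Rust: a row survives only if its key exists on both sides, which is why the result above contains no nulls. Again, the names and data are illustrative only:

```rust
use std::collections::HashMap;

// Plain-Rust sketch of inner-join semantics: a row is emitted only when the
// key is present on both sides, so the result can never contain None.
fn inner_join(
    left: &[(String, i64)],
    right: &HashMap<String, i64>,
) -> Vec<(String, i64, i64)> {
    left.iter()
        .filter_map(|(id, inc)| right.get(id).map(|r| (id.clone(), *inc, *r)))
        .collect()
}

fn main() {
    let left = vec![("A".to_string(), 10), ("B".to_string(), 20)];
    let right: HashMap<String, i64> = [("A".to_string(), 11)].into();
    let joined = inner_join(&left, &right);
    assert_eq!(joined, vec![("A".to_string(), 10, 11)]); // "B" is dropped
    println!("{joined:?}");
}
```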
Polars has several of these “simple” joins, including left_join, semi_join, full_join, inner_join, anti_join and cross_join. But you can create significantly more complex joins by building the join yourself with the join function and all of its options. For example, here is a chain of full joins built with the join function:
Note
For some reason, Polars does not reconcile the values of the keys in a full join, in both the join and full_join functions. This means that any key not found in the left creates nulls in the original key column, and any key not found in the right creates nulls in the column with a _right suffix (e.g. id_right). This can be fixed with an expression applied after each of the four joins (called fix_full_join_vars below).
// More complex types of joins
let fix_full_join_vars = [when(col("id").is_not_null())
.then(col("id"))
.otherwise(col("id_right"))
.alias("id")];
let longitudinal_any = census_chunk[0]
.clone()
.join(
census_chunk[1].clone(),
[col("id")],
[col("id")],
JoinArgs::new(JoinType::Full),
)
.with_columns(fix_full_join_vars.clone())
.select([all().exclude_cols(["id_right"]).as_expr()])
.join(
census_chunk[2].clone(),
[col("id")],
[col("id")],
JoinArgs::new(JoinType::Full),
)
.with_columns(fix_full_join_vars.clone())
.select([all().exclude_cols(["id_right"]).as_expr()])
.join(
census_chunk[3].clone(),
[col("id")],
[col("id")],
JoinArgs::new(JoinType::Full),
)
.with_columns(fix_full_join_vars.clone())
.select([all().exclude_cols(["id_right"]).as_expr()])
.join(
census_chunk[4].clone(),
[col("id")],
[col("id")],
JoinArgs::new(JoinType::Full),
)
.with_columns(fix_full_join_vars.clone())
.select([all().exclude_cols(["id_right"]).as_expr()])
.sort(["id"], Default::default());
println!("{}", longitudinal_any.collect().unwrap());
This chain of full joins keeps a superpopulation: everyone who appears in any of the five datasets:
shape: (286_293, 6)
┌─────────────────┬───────┬───────┬───────┬───────┬───────┐
│ id ┆ inc_1 ┆ inc_2 ┆ inc_3 ┆ inc_4 ┆ inc_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ PTS000000000002 ┆ 19163 ┆ null ┆ null ┆ 19163 ┆ null │
│ PTS000000000005 ┆ 43481 ┆ 43481 ┆ 43481 ┆ null ┆ 43481 │
│ PTS000000000008 ┆ 70614 ┆ 70614 ┆ 70614 ┆ null ┆ null │
│ PTS000000000010 ┆ 62499 ┆ 62499 ┆ 62499 ┆ null ┆ null │
│ PTS000000000011 ┆ null ┆ 30589 ┆ 30589 ┆ null ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ PTS000000604345 ┆ null ┆ 20666 ┆ 20666 ┆ null ┆ null │
│ PTS000000604348 ┆ 54153 ┆ 54153 ┆ null ┆ 54153 ┆ null │
│ PTS000000604349 ┆ 74670 ┆ null ┆ 74670 ┆ 74670 ┆ null │
│ PTS000000604350 ┆ 22652 ┆ 22652 ┆ 22652 ┆ 22652 ┆ null │
│ PTS000000604351 ┆ null ┆ 73487 ┆ 73487 ┆ null ┆ 73487 │
└─────────────────┴───────┴───────┴───────┴───────┴───────┘
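Per row, the reconciliation that fix_full_join_vars performs is the same as Option::or: take the left key if present, otherwise fall back to the right one. A plain-Rust sketch of that logic:

```rust
// Plain-Rust sketch of the key-coalescing expression: after a full join, `id`
// may be None (key only on the right) or `id_right` may be None (key only on
// the left). when(...).then(...).otherwise(...) behaves like Option::or here.
fn fix_full_join_key(id: Option<&str>, id_right: Option<&str>) -> Option<String> {
    id.or(id_right).map(str::to_string)
}

fn main() {
    assert_eq!(fix_full_join_key(Some("A"), None), Some("A".to_string()));
    assert_eq!(fix_full_join_key(None, Some("B")), Some("B".to_string()));
    assert_eq!(fix_full_join_key(Some("A"), Some("A")), Some("A".to_string()));
    println!("ok");
}
```

Depending on your Polars version, a full join may also be able to coalesce the key columns for you (via JoinArgs's coalesce option, JoinCoalesce::CoalesceColumns), which would make the manual fix unnecessary.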