Select
This chapter will explore how to keep or drop columns from your data. You can run the examples with cargo run --example 3_2_1_select.
Select
To access the data, let's connect to the Census data stored as Parquet:
use polars::prelude::*;
// Connect to LazyFrame
let args = ScanArgsParquet::default();
let mut lf =
    LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();
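Using collect_schema(), we can retrieve the names of the columns in the LazyFrame. Because collect_schema() takes the LazyFrame by mutable reference, lf is declared with let mut above. The following code collects the column names into a vector: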
// Get names of columns
let cols: Vec<String> = lf
    .collect_schema()
    .unwrap()
    .iter_names()
    .map(|c| c.to_string())
    .collect();
println!(
    "Vector of the {} variables in the LazyFrame: {:?}",
    cols.len(),
    cols
);
Vector of the 21 variables in the LazyFrame: ["id", "social", "birth", "econ", "ethnic", "health", "fam_type", "hours_worked", "education", "industry", "london", "mar_stat", "occupation", "region", "religion", "residence_type", "age_group", "sex", "keep_type", "income", "chunk"]
With select(), you can keep specific columns by naming them with the col() function. With the Polars regex crate feature enabled, col() also accepts a regular expression to match columns by pattern; the pattern must start with ^ and end with $. This example keeps age_group (matched by the regex ^age.*$), region and income, and uses alias() to rename income to yearly_income.
// Select some columns by name & with regex & with rename
let lf = lf.select([
    col("^age.*$"), // regex: matches age_group
    col("region"),
    col("income").alias("yearly_income"),
]);
// Print the selected columns (first 5 rows)
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 3)
┌───────────┬───────────┬───────────────┐
│ age_group ┆ region    ┆ yearly_income │
│ ---       ┆ ---       ┆ ---           │
│ i64       ┆ str       ┆ i64           │
╞═══════════╪═══════════╪═══════════════╡
│ 1         ┆ E12000001 ┆ null          │
│ 1         ┆ E12000001 ┆ null          │
│ 1         ┆ E12000001 ┆ null          │
│ 1         ┆ E12000001 ┆ null          │
│ 1         ┆ E12000001 ┆ null          │
└───────────┴───────────┴───────────────┘
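To make the pattern rule above more concrete, here is a minimal sketch in plain Rust that does not call Polars at all. Both helpers, looks_like_regex and matches_age_pattern, are hypothetical names introduced only for illustration: the first mirrors the rule that a pattern must start with ^ and end with $, and the second is a hand-written stand-in for ^age.*$ (for ordinary single-line column names, this is the same as "starts with age").

```rust
/// Hypothetical helper (not part of the Polars API): mirrors the rule from
/// the text that a pattern must start with '^' and end with '$'.
fn looks_like_regex(name: &str) -> bool {
    name.starts_with('^') && name.ends_with('$')
}

/// Hypothetical stand-in for the pattern "^age.*$": for ordinary
/// single-line column names it reduces to "the name starts with age".
fn matches_age_pattern(column: &str) -> bool {
    column.starts_with("age")
}

fn main() {
    // A few of the column names printed earlier in this chapter.
    let columns = ["id", "region", "age_group", "income", "keep_type"];

    // Only the first candidate satisfies the start/end rule.
    for name in ["^age.*$", "region", "^age"] {
        println!("{name:?} looks like a regex: {}", looks_like_regex(name));
    }

    // Keep the names that the stand-in pattern matches.
    let kept: Vec<&str> = columns
        .iter()
        .copied()
        .filter(|c| matches_age_pattern(c))
        .collect();
    println!("Columns matched by ^age.*$: {kept:?}");
}
```

Of the sample names, only age_group is matched, which is consistent with the single age column in the output above.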
Remove
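You can also drop columns by selecting all() columns and passing the names of the columns to drop to exclude_cols(). Note that lf now holds the result of the previous select(), so the renamed column is referred to as yearly_income, not income.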
// Drop variables (better to simply select the columns needed)
let lf = lf.select([all().exclude_cols(["region", "yearly_income"]).as_expr()]);
// Print the remaining column (first 5 rows)
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 1)
┌───────────┐
│ age_group │
│ ---       │
│ i64       │
╞═══════════╡
│ 1         │
│ 1         │
│ 1         │
│ 1         │
│ 1         │
└───────────┘
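The effect of the exclusion on the column list can be sketched in plain Rust, without Polars. The helper remaining_columns is a hypothetical name used only for illustration; it filters a list of column names the way the result above suggests, keeping every name that is not listed for exclusion and preserving the original order.

```rust
/// Hypothetical helper (not part of the Polars API): return the column
/// names that are not listed in `excluded`, preserving their order.
fn remaining_columns<'a>(columns: &[&'a str], excluded: &[&str]) -> Vec<&'a str> {
    columns
        .iter()
        .copied()
        .filter(|c| !excluded.iter().any(|e| e == c))
        .collect()
}

fn main() {
    // The three columns left after the earlier select().
    let columns = ["age_group", "region", "yearly_income"];
    let excluded = ["region", "yearly_income"];

    let kept = remaining_columns(&columns, &excluded);
    println!("{kept:?}"); // prints ["age_group"]
}
```

In this sketch, a name in the exclusion list that is not among the columns simply has no effect; whether Polars itself behaves the same way for unknown names is not shown here.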
Use exclude_cols() sparingly. It is usually better to let query optimization do the work for you: an analytical pipeline naturally uses only some of the columns (e.g. a summary computed on the requested variables only), and Polars automatically drops the others once they are no longer needed (projection pushdown).