Variables
This section will explore how to create new variables, derived from other variables and literals. You can run the examples in this chapter with cargo run --example 3_3_1_variables.
Literals
To have access to data, lets connect to the parquet Census data:
use polars::prelude::*;
// Connect to LazyFrame
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();
In the same way as selecting existing columns with select(), you can create variables from literals using lit() by giving them a name with alias(). As you can see in this example, you can create a new column from a literal with lit(5).alias("five") or a new variable from a formula of a mix of literals (e.g. (lit(5) + lit(7) - lit(2)).alias("ten")). You can use any of the arithmetic expressions (-, +, *, /, %).
// Add new variables from literals
let lf = lf.select([
col("age_group"),
col("region"),
col("income"),
lit(5).alias("five"), // add single value literal
(lit(5) + lit(7) - lit(2)).alias("ten"), // add single value from two or more literals
]);
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 5)
┌───────────┬───────────┬────────┬──────┬─────┐
│ age_group ┆ region ┆ income ┆ five ┆ ten │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i32 ┆ i32 │
╞═══════════╪═══════════╪════════╪══════╪═════╡
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 │
└───────────┴───────────┴────────┴──────┴─────┘
Deriving from existing columns
Now that we have created the five and ten columns, we can derive the fifteen column by adding the two columns together with col(). In this example, we use with_column() instead of select() as it keeps the existing columns. For adding multiple columns with_columns() is available. A select() with an all() selection would do the same.
Note
You can not use a newly created variable (or a renamed variable) in the same
select(). For example, you would not be able to run(col("five") + col("ten")).alias("fifteen")in the sameselect()as you created thefiveandtencolumns.
// Can't use created columns in the same `select()`, but you can add new column(s) with `with_column()`
let lf = lf.with_column(
(col("five") + col("ten")).alias("fifteen"), // add two columns
);
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 6)
┌───────────┬───────────┬────────┬──────┬─────┬─────────┐
│ age_group ┆ region ┆ income ┆ five ┆ ten ┆ fifteen │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i32 ┆ i32 ┆ i32 │
╞═══════════╪═══════════╪════════╪══════╪═════╪═════════╡
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 ┆ 15 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 ┆ 15 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 ┆ 15 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 ┆ 15 │
│ 1 ┆ E12000001 ┆ null ┆ 5 ┆ 10 ┆ 15 │
└───────────┴───────────┴────────┴──────┴─────┴─────────┘
Now lets look at real data and how to modify it. As we can see, income is an i64. A 2% inflation increase can be added to this value, once it has been converted to an f64 using .cast(DataType::Float64), by multiplying by 1.02:
// Cast the value from an `i64` to a `f64` and modify it (add a inflation modifier)
let lf = lf
.select([all().exclude_cols(["five", "ten", "fifteen"]).as_expr()])
.filter(col("income").is_not_null())
.with_column((col("income").cast(DataType::Float64) * lit(1.02)).alias("income_infl"));
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 4)
┌───────────┬───────────┬────────┬─────────────┐
│ age_group ┆ region ┆ income ┆ income_infl │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞═══════════╪═══════════╪════════╪═════════════╡
│ 2 ┆ E12000001 ┆ 50811 ┆ 51827.22 │
│ 2 ┆ E12000001 ┆ 70224 ┆ 71628.48 │
│ 2 ┆ E12000001 ┆ 89534 ┆ 91324.68 │
│ 2 ┆ E12000001 ┆ 93123 ┆ 94985.46 │
│ 2 ┆ E12000001 ┆ 82122 ┆ 83764.44 │
└───────────┴───────────┴────────┴─────────────┘
Conditionally create values
You can also create conditional values for a column, using the when(), then() and otherwise() chain. This example below creates the income_cat column with <= 30_000 = Low, > 30_000 & <= 70_000 = Medium and > 70_000 = High.
// Create categorical variables
let lf = lf.with_column(
when(col("income_infl").lt_eq(lit(30_000)))
.then(lit("Low"))
.when(
col("income_infl")
.gt(lit(30000))
.and(col("income_infl").lt_eq(lit(70_000))),
)
.then(lit("Medium"))
.otherwise(lit("High"))
.alias("income_cat"),
);
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 5)
┌───────────┬───────────┬────────┬─────────────┬────────────┐
│ age_group ┆ region ┆ income ┆ income_infl ┆ income_cat │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 ┆ str │
╞═══════════╪═══════════╪════════╪═════════════╪════════════╡
│ 2 ┆ E12000001 ┆ 50811 ┆ 51827.22 ┆ Medium │
│ 2 ┆ E12000001 ┆ 70224 ┆ 71628.48 ┆ High │
│ 2 ┆ E12000001 ┆ 89534 ┆ 91324.68 ┆ High │
│ 2 ┆ E12000001 ┆ 93123 ┆ 94985.46 ┆ High │
│ 2 ┆ E12000001 ┆ 82122 ┆ 83764.44 ┆ High │
└───────────┴───────────┴────────┴─────────────┴────────────┘
Replace data
You can replace the values of a column using the replace_strict() function. For example, to change the GCC Geographic Code for the regions in the dataset, you can use a “from” Series and a “to” Series and the region code variable will be replaced with region names.
// Change alpha-numeric region code to the region name
let lf = lf.with_column(col("region").replace_strict(
lit(Series::from_iter(vec![
"E12000001",
"E12000002",
"E12000003",
"E12000004",
"E12000005",
"E12000006",
"E12000007",
"E12000008",
"E12000009",
"W92000004",
])),
lit(Series::from_iter(vec![
"North East",
"North West",
"Yorkshire and The Humber",
"East Midlands",
"West Midlands",
"East of England",
"London",
"South East",
"South West",
"Wales",
])),
None,
Some(DataType::String),
));
println!("{}", lf.clone().limit(5).collect().unwrap());
shape: (5, 5)
┌───────────┬────────────┬────────┬─────────────┬────────────┐
│ age_group ┆ region ┆ income ┆ income_infl ┆ income_cat │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 ┆ str │
╞═══════════╪════════════╪════════╪═════════════╪════════════╡
│ 2 ┆ North East ┆ 50811 ┆ 51827.22 ┆ Medium │
│ 2 ┆ North East ┆ 70224 ┆ 71628.48 ┆ High │
│ 2 ┆ North East ┆ 89534 ┆ 91324.68 ┆ High │
│ 2 ┆ North East ┆ 93123 ┆ 94985.46 ┆ High │
│ 2 ┆ North East ┆ 82122 ┆ 83764.44 ┆ High │
└───────────┴────────────┴────────┴─────────────┴────────────┘