By taylor In Recipes
FoldRows
Combine multiple rows that contain a duplicate value into a single row, you choose the row or features that pass through
Similar to [[DropRows]], but the instruction block lets the chef differentiate among all the matching rows and select one or reconstruct a composite row.
Instruction file arguments can contain multiple columns to fold on!
Helpful for disambiguating multiple events that happen on a single day.
Advanced Instruction
Signature
( ns: String, foldColumns: Seq[Iccn], foldValues: Seq[String], rows: Seq[IccnNamespaceRow] ) : Seq[IccnNamespaceRow]
Remember that FoldRows Insturctions expect a Sequence of rows to be returned, enabling you to build Many to Many mappings.
Combinations
Advanced Technique: AddColumn, then FoldRows! Adding a new identifier column, then folding it is a powerful combination you can use to synthesize or reduce a key space before deduplicating or sorting. A classic example is Social Security Number scrubbing: removing dashes, zero padding, the works.
Examples
FoldRows BenefitsDecisionDate.scala
//(ns: String, foldColumns: Seq[Iccn], foldValues: Seq[String], rows: Seq[IccnNamespaceRow]) : Seq[IccnNamespaceRow]
val sortBy = "BenefitsDecisionDate"
// Sort rows by time and use the last one
Seq(rows.sortBy( _.date(ns, sortBy) ).reverse.head)
FoldRows personnelNumber.scala
One of my data ingredients contained multiple rows per employee ID, and I needed to find if ANY of them contained a true flag.
// Do any of the rows have 'reviewed' flag true? If not, return the first one.
val bestRow = rows.find( _.stringEmpty("reviewed").size > 0 ).getOrElse( rows.head )
/*
// Println debugging example!
if ( rows.size > 1 ) {
println(s"FoldRows personnelNumber ${foldValues} -> ${rows.map{_.stringEmpty("reviewed")}.mkString(", ")} to ${bestRow}")
}
*/
// Fold Many to One
Seq(bestRow)