Filtering data with like
This filter selects the people whose surname starts with "Garc" (the pattern 'Garc%') and whose age is under 30.
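import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

// Sample data: name, surname, age, salary, registration date (as a string)
val df = sc.parallelize(Seq(
  ("Paco", "Garcia", 24, 24000, "2018-08-06 00:00:00"),
  ("Juan", "Garcia", 26, 27000, "2018-08-07 00:00:00"),
  ("Ana",  "Martin", 28, 28000, "2018-08-14 00:00:00"),
  ("Lola", "Martin", 29, 31000, "2018-08-18 00:00:00"),
  ("Sara", "Garcia", 35, 34000, "2018-08-20 00:00:00")
)).toDF("name", "surname", "age", "salary", "reg_date")

// Parse the reg_date string into a timestamp column
val type_df = df.select($"name", $"surname", $"age", $"salary",
  unix_timestamp($"reg_date", "yyyy-MM-dd HH:mm:ss").cast(TimestampType).as("timestamp"))
type_df.show()

// Keep rows whose surname starts with "Garc" and whose age is below 30
val filter_df = type_df.filter("surname like 'Garc%' AND age < 30")
filter_df.show()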
+------+--------+----+-------+-------------------+
|  name| surname| age| salary|          timestamp|
+------+--------+----+-------+-------------------+
|  Paco|  Garcia|  24|  24000|2018-08-06 00:00:00|
|  Juan|  Garcia|  26|  27000|2018-08-07 00:00:00|
|   Ana|  Martin|  28|  28000|2018-08-14 00:00:00|
|  Lola|  Martin|  29|  31000|2018-08-18 00:00:00|
|  Sara|  Garcia|  35|  34000|2018-08-20 00:00:00|
+------+--------+----+-------+-------------------+

+------+--------+----+-------+-------------------+
|  name| surname| age| salary|          timestamp|
+------+--------+----+-------+-------------------+
|  Paco|  Garcia|  24|  24000|2018-08-06 00:00:00|
|  Juan|  Garcia|  26|  27000|2018-08-07 00:00:00|
+------+--------+----+-------+-------------------+
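The same predicate can also be written with Column expressions instead of a SQL string. The sketch below is only an illustration: it assumes a local SparkSession, and the names `people`, `prefixMatch` and `anywhereMatch` are hypothetical. It also shows the difference between 'Garc%' (surname starts with "Garc") and '%Garc%' (surname contains "Garc" anywhere).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("like-sketch").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame mirroring the sample data (without the date column)
val people = Seq(
  ("Paco", "Garcia", 24, 24000),
  ("Juan", "Garcia", 26, 27000),
  ("Ana",  "Martin", 28, 28000),
  ("Lola", "Martin", 29, 31000),
  ("Sara", "Garcia", 35, 34000)
).toDF("name", "surname", "age", "salary")

// Column-based equivalent of "surname like 'Garc%' AND age < 30"
val prefixMatch = people.filter($"surname".like("Garc%") && ($"age" < 30))

// '%Garc%' matches "Garc" anywhere in the surname, not only at the start
val anywhereMatch = people.filter($"surname".like("%Garc%") && ($"age" < 30))

prefixMatch.show()
```

With this sample data both patterns select the same rows, because every surname containing "Garc" also starts with it.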
Filtering data by matching item
This filter selects the people with the surname "Garcia".
df.filter("surname == 'Garcia'").show()
+------+--------+----+-------+
|  name| surname| age| salary|
+------+--------+----+-------+
|  Paco|  Garcia|  24|  24000|
|  Juan|  Garcia|  26|  27000|
|  Sara|  Garcia|  35|  34000|
+------+--------+----+-------+
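An exact match can likewise be expressed with the Column API. This is a minimal sketch, assuming a local SparkSession and a hypothetical `people` DataFrame that mirrors the sample data. Column equality is written `===` and inequality `=!=`; a plain `==` on a Column is ordinary Scala object comparison and returns a Boolean, so it cannot be used as a filter condition.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("equality-sketch").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame mirroring the sample data (without the date column)
val people = Seq(
  ("Paco", "Garcia", 24, 24000),
  ("Juan", "Garcia", 26, 27000),
  ("Ana",  "Martin", 28, 28000),
  ("Lola", "Martin", 29, 31000),
  ("Sara", "Garcia", 35, 34000)
).toDF("name", "surname", "age", "salary")

// Rows whose surname is exactly "Garcia"
val garcias = people.filter($"surname" === "Garcia")

// Rows whose surname is anything else
val others = people.filter($"surname" =!= "Garcia")

garcias.show()
```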
Filtering data from the result of a grouping
This filter selects the surnames that appear more than twice.
df.groupBy("surname").count().filter("count > 2").show()
+--------+-----+
| surname|count|
+--------+-----+
|  Garcia|    3|
+--------+-----+
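When the aggregate is named explicitly, the filter on the grouped result can be easier to read. The following is a sketch under assumptions, not the only way to do it: it uses a local SparkSession, a hypothetical `people` DataFrame mirroring the sample data, and an illustrative alias `n` for the per-surname count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("groupby-sketch").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame mirroring the sample data (without the date column)
val people = Seq(
  ("Paco", "Garcia", 24, 24000),
  ("Juan", "Garcia", 26, 27000),
  ("Ana",  "Martin", 28, 28000),
  ("Lola", "Martin", 29, 31000),
  ("Sara", "Garcia", 35, 34000)
).toDF("name", "surname", "age", "salary")

// Count rows per surname under the alias "n", then keep surnames seen more than twice
val repeated = people
  .groupBy("surname")
  .agg(count("*").as("n"))
  .filter($"n" > 2)

repeated.show()
```

Filtering after the aggregation plays the role of a SQL HAVING clause: the condition applies to each group's count rather than to individual rows.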