Generate data to use for reading and writing in parquet format
Example of random data to use in the following sections
import random

data = []
for x in range(5):
    data.append((random.randint(0, 9), random.randint(0, 9)))
df = spark.createDataFrame(data, ("label", "data"))
df.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
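The rows above come from `random.randint`, so every run produces different values. A stand-alone sketch of the same generation in plain Python (the seed is an addition for reproducibility, not in the original):

```python
import random

random.seed(42)  # added for reproducibility; the original is unseeded
data = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(5)]
print(data)  # five (label, data) pairs, each value in 0..9
```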
Write data in parquet format
path_parquet = "/prueba.parquet"  # HDFS path
path_parquet = "/prueba.parquet"  # Local path
df.write \
    .mode("overwrite") \
    .format("parquet") \
    .save(path_parquet)
Read data in parquet format
df2 = spark \
    .read \
    .parquet(path_parquet)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
Write gzip-compressed data in parquet format
path_parquet_gzip = "/prueba_gzip.parquet"  # HDFS path
path_parquet_gzip = "D:/prueba_gzip.parquet"  # Local path
df.write \
    .mode("overwrite") \
    .format("parquet") \
    .option("compression", "gzip") \
    .save(path_parquet_gzip)
Read gzip-compressed data in parquet format
df2 = spark \
    .read \
    .parquet(path_parquet_gzip)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
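Why gzip helps: parquet stores data column by column, so runs of similar values compress well. A minimal stdlib illustration of that effect (plain `gzip` on repetitive bytes, not Spark or parquet itself):

```python
import gzip

# Repetitive, column-like data compresses dramatically under gzip.
raw = b"4,0\n7,0\n1,1\n3,8\n3,5\n" * 1000
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # the compressed payload is far smaller
```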
Write snappy-compressed data in parquet format
path_parquet_snappy = "/prueba_snappy.parquet"  # HDFS path
path_parquet_snappy = "D:/prueba_snappy.parquet"  # Local path

df.write \
    .mode("overwrite") \
    .format("parquet") \
    .option("compression", "snappy") \
    .save(path_parquet_snappy)
Read snappy-compressed data in parquet format
df2 = spark \
    .read \
    .parquet(path_parquet_snappy)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
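Snappy, Spark's default parquet codec, trades compression ratio for speed. Snappy itself is not in the Python standard library, so as a rough stand-in this sketch uses `zlib` levels to show the same ratio-versus-effort trade-off (an analogy only, not the snappy algorithm):

```python
import time
import zlib

raw = b"4,0\n7,0\n1,1\n3,8\n3,5\n" * 20000  # repetitive sample payload

for level in (1, 9):  # 1 = fast/light (snappy-like), 9 = slow/dense (gzip-like)
    t0 = time.perf_counter()
    out = zlib.compress(raw, level)
    elapsed = time.perf_counter() - t0
    print(f"level={level}: {len(out)} bytes in {elapsed * 1000:.2f} ms")
```

In practice, snappy is a good default for data that is read often, while gzip is worth the extra CPU when storage or network transfer dominates.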