Generate data to use for reading and writing in parquet format
Example of random data to use in the following sections
import random

data = []
for x in range(5):
    data.append((random.randint(0, 9), random.randint(0, 9)))
df = spark.createDataFrame(data, ("label", "data"))
df.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
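The loop above can be factored into a small helper so the same sample can be regenerated in the later sections; a minimal sketch (`make_rows` is a hypothetical name, and the optional seed is an addition for reproducibility):

```python
import random

def make_rows(n=5, seed=None):
    # Build n (label, data) pairs of random digits 0-9,
    # mirroring the loop above; an optional seed makes the
    # sample reproducible across runs.
    rng = random.Random(seed)
    return [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(n)]

rows = make_rows(5, seed=42)
print(rows)  # five (label, data) tuples
```

The list it returns can be passed to `spark.createDataFrame(rows, ("label", "data"))` exactly as above.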
Write data in parquet format
path_parquet = "/prueba.parquet"  # Read from HDFS
path_parquet = "/prueba.parquet"  # Read from local file
df.write \
.mode("overwrite") \
.format("parquet") \
    .save(path_parquet)

Read data in parquet format
df2 = spark\
    .read\
    .parquet(path_parquet)
df2.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
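To sanity-check that the part files Spark wrote really are parquet, the format's 4-byte magic can be inspected directly, without Spark; a minimal sketch (`is_parquet_file` is a hypothetical helper, and the path in the usage comment is a placeholder):

```python
def is_parquet_file(path):
    # The Parquet format spec requires a file to begin and end
    # with the 4-byte magic "PAR1"; checking both ends is a
    # quick sanity test for the part files Spark writes.
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek 4 bytes back from the end of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Hypothetical usage against one part file of the output above:
# is_parquet_file("/prueba.parquet/part-00000-....parquet")
```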
Write gzip-compressed data in parquet format
path_parquet_gzip = "/prueba_gzip.parquet"  # Read from HDFS
path_parquet_gzip = "D:/prueba_gzip.parquet"  # Read from local file
df.write\
.mode("overwrite")\
.format("parquet")\
.option("compression", "gzip")\
    .save(path_parquet_gzip)

Read gzip-compressed data in parquet format
df2 = spark\
    .read\
    .parquet(path_parquet_gzip)
df2.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
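Spark writes a parquet "file" as a directory of part files, and with gzip compression the parts typically carry a `.gz.parquet` suffix, so listing them is an easy way to confirm the codec actually used; a minimal sketch (`part_file_names` is a hypothetical helper, and the path in the usage comment is a placeholder):

```python
from pathlib import Path

def part_file_names(output_dir):
    # List the part files inside a Spark parquet output directory;
    # with gzip compression their names usually end in .gz.parquet.
    return sorted(p.name for p in Path(output_dir).glob("part-*.parquet"))

# Hypothetical usage against the directory written above:
# part_file_names("/prueba_gzip.parquet")
```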
Write snappy-compressed data in parquet format
path_parquet_snappy = "/prueba_snappy.parquet"# Read from HDFS
path_parquet_snappy = "D:/prueba_snappy.parquet" # Read from local file
df.write\
.mode("overwrite")\
.format("parquet")\
.option("compression", "snappy")\
    .save(path_parquet_snappy)

Read snappy-compressed data in parquet format
df2 = spark\
    .read\
    .parquet(path_parquet_snappy)
df2.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
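With the three directories written, their on-disk footprints can be compared to see what each codec saves; a minimal sketch (`dir_size_bytes` is a hypothetical helper, and the paths in the usage comment are placeholders):

```python
import os

def dir_size_bytes(path):
    # Sum the sizes of every file under a Spark output directory,
    # so the footprint of the three codecs (none, gzip, snappy)
    # can be compared side by side.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Hypothetical usage with the paths written above:
# for p in ("/prueba.parquet", "/prueba_gzip.parquet", "/prueba_snappy.parquet"):
#     print(p, dir_size_bytes(p), "bytes")
```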



