Generate data to use to read & write JSON
The random data generated here is reused in all of the following sections.
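import random
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in a notebook) or create a new one.
spark = SparkSession.builder.getOrCreate()

# Five rows of random single-digit integers.
data = []
for x in range(5):
    data.append((random.randint(0, 9), random.randint(0, 9)))
df = spark.createDataFrame(data, ("label", "data"))
df.show()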
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
Write data in JSON format
# Keep only the assignment that matches your environment;
# as written, the second one overrides the first.
path_json = "/prueba.json"    # HDFS path
path_json = "D:/prueba.json"  # local file path

df.write \
    .mode("overwrite") \
    .format("json") \
    .save(path_json)
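Note that `save` does not create a single file: Spark writes a directory named after `path_json`, containing one `part-*` file per partition plus a `_SUCCESS` marker, and each part file holds one JSON object per line. The following is a minimal sketch, using only the standard library, of how such a directory could be inspected on a local filesystem. The helper name `read_json_lines_dir` is hypothetical, and the demo directory is built by hand to mimic Spark's layout; it is not real Spark output.

```python
import json
import tempfile
from pathlib import Path


def read_json_lines_dir(directory):
    """Parse every part-*.json file found in a Spark-style output directory."""
    records = []
    for part in sorted(Path(directory).glob("part-*.json")):
        with part.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # ignore blank lines
                    records.append(json.loads(line))
    return records


# Demo: a hand-made directory that imitates Spark's layout (assumption, not real output).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "prueba.json"
    out.mkdir()
    (out / "_SUCCESS").touch()
    (out / "part-00000.json").write_text(
        '{"label":4,"data":0}\n{"label":7,"data":0}\n', encoding="utf-8"
    )
    (out / "part-00001.json").write_text('{"label":1,"data":1}\n', encoding="utf-8")
    demo_records = read_json_lines_dir(out)

print(demo_records)
```

Against real output, the same helper would be pointed at the local `path_json` directory instead of the temporary one.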
Read data in JSON format
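# Spark writes one JSON object per line, so the default line-delimited
# reader is used; "multiline" is only for records that span several lines.
df2 = spark \
    .read \
    .json(path_json)
df2.show()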
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
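The files Spark writes are in JSON Lines form: each line is a complete JSON object, and the file as a whole is not a single JSON document. The `multiline` read option is intended for the opposite case, where one record spans several lines. The sketch below uses only Python's `json` module to illustrate the difference between the two ways of parsing; the sample text is made up for the illustration and Spark's own parser is not involved.

```python
import json

# Two records in JSON Lines form, as a part file would contain them (sample data).
json_lines_text = '{"label":4,"data":0}\n{"label":7,"data":0}\n'

# Line-delimited parsing: every non-empty line is its own JSON document.
records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]

# Whole-document parsing: fails, because the text holds two top-level objects.
try:
    json.loads(json_lines_text)
    whole_document_ok = True
except json.JSONDecodeError:
    whole_document_ok = False

print(records)
print("parses as one document:", whole_document_ok)
```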
Write gzip compressed data in JSON format
# Keep only the assignment that matches your environment;
# as written, the second one overrides the first.
path_json_gzip = "/prueba_gzip.json"    # HDFS path
path_json_gzip = "D:/prueba_gzip.json"  # local file path

df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "gzip") \
    .save(path_json_gzip)
Read gzip compressed data in JSON format
# No compression option is needed when reading: Spark infers the codec
# from the file extension of the part files.
df2 = spark \
    .read \
    .json(path_json_gzip)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
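With gzip compression the part files end in `.json.gz`, and each one is an ordinary gzip stream of JSON Lines, so it can also be checked outside Spark. The following sketch round-trips a small gzip-compressed JSON Lines file with the standard library only. The file name and the rows are assumptions chosen for the illustration; real part files carry longer generated names.

```python
import gzip
import json
import tempfile
from pathlib import Path

rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]  # sample rows

with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "part-00000.json.gz"  # hypothetical part file name

    # Write one JSON object per line through a gzip text stream.
    with gzip.open(part, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # Read it back, decompressing and parsing line by line.
    with gzip.open(part, "rt", encoding="utf-8") as fh:
        restored = [json.loads(line) for line in fh if line.strip()]

print(restored)
```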
Write deflate compressed data in JSON format
# Keep only the assignment that matches your environment;
# as written, the second one overrides the first.
path_json_deflate = "/prueba_deflate.json"    # HDFS path
path_json_deflate = "D:/prueba_deflate.json"  # local file path

df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "deflate") \
    .save(path_json_deflate)
Read deflate compressed data in JSON format
# The codec is inferred from the file extension of the part files.
df2 = spark \
    .read \
    .json(path_json_deflate)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
Write bzip2 compressed data in JSON format
# Keep only the assignment that matches your environment;
# as written, the second one overrides the first.
path_json_bzip2 = "/prueba_bzip2.json"    # HDFS path
path_json_bzip2 = "D:/prueba_bzip2.json"  # local file path

df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "bzip2") \
    .save(path_json_bzip2)
Read bzip2 compressed data in JSON format
# The codec is inferred from the file extension of the part files.
df2 = spark \
    .read \
    .json(path_json_bzip2)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
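To get a rough feel for how the three codecs compare, the same JSON Lines payload can be compressed with Python's standard library. This is only an approximation under stated assumptions: `zlib` stands in for the deflate codec, the payload is 1,000 synthetic rows shaped like the example data, and Spark's actual file sizes will differ because it compresses each part file separately with its own codec implementations.

```python
import bz2
import gzip
import json
import random
import zlib

random.seed(0)  # fixed seed so the synthetic payload is reproducible

# 1,000 synthetic rows in JSON Lines form (assumption: same shape as the example data).
payload = "".join(
    json.dumps({"label": random.randint(0, 9), "data": random.randint(0, 9)}) + "\n"
    for _ in range(1000)
).encode("utf-8")

# Standard-library stand-ins for the codecs used above.
codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {"none": len(payload)}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    assert decompress(blob) == payload  # every codec must round-trip losslessly
    sizes[name] = len(blob)

for name, size in sizes.items():
    print(f"{name:>15}: {size} bytes")
```

Size is only one consideration when choosing a codec: bzip2 files can be split across tasks when read, whereas a gzip file has to be read by a single task.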