by Diego Calvo | Jun 20, 2018 | Big Data
HDFS definition
HDFS (Hadoop Distributed File System) is Hadoop’s primary file storage system. It works well with large volumes of data, reduces I/O, scales well, and provides availability and fault tolerance through data replication. The Hadoop file system is typically...
by Diego Calvo | Jun 20, 2018 | Apache Spark, Big Data
Components
Spark Core
Spark Core is the foundation on which the whole architecture rests. It provides task distribution, scheduling, and input/output operations, exposed through Java, Python, Scala and R programming interfaces built around the RDD abstraction. It establishes a...
by Diego Calvo | May 30, 2018 | Apache Spark, Big Data
Download Hortonworks Data Platform (HDP) Sandbox
VirtualBox installation
First install VirtualBox. Once it is installed, open the Hortonworks virtual machine and run it; this starts the installation of that machine in VirtualBox. Configure the features of the...
by Diego Calvo | May 24, 2018 | Python-example
Define a virtual environment from the command line
> python -m venv develop_virtual_enviroment
Activate the environment
> ..\develop_virtual_enviroment\Scripts\activate.bat (for Windows)
> source ../develop_virtual_enviroment/bin/activate (for Linux)
Disable the...
by Diego Calvo | Apr 26, 2018 | Apache Spark, Big Data, Python-example
Load CSV in Databricks
Databricks Community Edition provides a graphical interface for loading files. This interface is accessed under DataBase > Create New Table. Once inside, the following fields must be filled in: Upload to DBFS: name of the file to load. Select a cluster...
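For readers who prefer a notebook cell to the graphical interface, the same load can be sketched in PySpark. This is only an illustration under assumptions: the helper name `load_uploaded_csv` and the path `/FileStore/tables/example.csv` are hypothetical and do not come from the original post, and the options assume the file has a header row. `spark` is the session object that a Databricks notebook normally predefines.

```python
def load_uploaded_csv(spark, path="/FileStore/tables/example.csv"):
    """Read a CSV file from DBFS into a Spark DataFrame.

    `spark` is a SparkSession; `path` is a hypothetical upload location.
    """
    return (
        spark.read
        .option("header", "true")       # treat the first row as column names
        .option("inferSchema", "true")  # let Spark infer column types
        .csv(path)
    )

# Usage in a notebook, where `spark` already exists (assumed):
# df = load_uploaded_csv(spark)
# df.show(5)
```

Passing the session in as an argument keeps the helper independent of how the session was created, so it can be reused outside a notebook as well.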