Big Data Archives - Page 7 of 8

HDFS – Hadoop Distributed File System

by Diego Calvo | Jun 20, 2018 | Big Data

HDFS definition: HDFS (Hadoop Distributed File System) is Hadoop’s primary file storage system. It works well with large volumes of data, reduces I/O, and offers high scalability, availability, and fault tolerance through data replication. The Hadoop file system is typically...
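To make the link between replication and fault tolerance concrete, here is a minimal toy sketch in plain Python; it does not use Hadoop itself. It assumes the default HDFS block size of 128 MiB and replication factor of 3 (Hadoop 2 and later), and it places replicas round-robin, which is a simplification chosen for illustration rather than the rack-aware policy a real NameNode applies. The function and node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, default HDFS block size (Hadoop 2+)
REPLICATION = 3                 # default HDFS replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} for a file of file_size bytes.

    Placement is plain round-robin, an assumption made for illustration;
    a real NameNode uses a rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for b in range(n_blocks):
        # Each block gets `replication` copies on distinct nodes.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block keeps at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]      # hypothetical DataNodes
layout = place_blocks(300 * 1024 * 1024, nodes)   # a 300 MiB file
still_readable = readable_after_failure(layout, "node3")
```

With these assumptions, a 300 MiB file is split into three blocks, and losing any single node leaves at least two copies of every block available.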

Apache Spark Components

by Diego Calvo | Jun 20, 2018 | Apache Spark, Big Data

Components: Spark Core. Spark Core is the foundation on which the whole architecture rests. It provides task distribution, scheduling, and input/output operations, with Java, Python, Scala and R programming interfaces built around the RDD abstraction. It establishes a...

Install Hortonworks in Virtual Box for Spark

by Diego Calvo | May 30, 2018 | Apache Spark, Big Data

Download the Hortonworks Data Platform (HDP) Sandbox. VirtualBox installation: first install VirtualBox; once it is installed, open the Hortonworks virtual machine and run it, which starts the installation of that machine in VirtualBox. Configure the features of the...

Read CSV in Databricks in Spark

by Diego Calvo | Apr 26, 2018 | Apache Spark, Big Data, Python-example

Load CSV in Databricks: Databricks Community Edition provides a graphical interface for loading files, reached through DataBase > Create New Table. Once inside, fill in the following fields: Upload to DBFS: the name of the file to load. Select a cluster...

Big Data definition

by Diego Calvo | Nov 21, 2017 | Big Data

Big Data definition: The term Big Data refers to a volume of data that exceeds the capabilities of the software commonly used to capture, manage, and process data. As computing capacity keeps growing and the volume from which it is considered a...

Lambda Architecture (batch and stream processing combination)

by Diego Calvo | Nov 15, 2017 | Big Data

Before focusing on the Lambda architecture, it is worth describing the two types of data processing that compose it. Batch processing handles volumes of data at spaced intervals, for example every 10 minutes, every hour, or daily....
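As a rough illustration of processing data at spaced intervals, the sketch below groups timestamped events into fixed 10-minute windows and aggregates each window as one batch. It is plain Python rather than a Hadoop or Spark job, and the function name, event format, and sample values are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_by_interval(events, interval):
    """Group (timestamp, value) events into fixed windows of length `interval`."""
    epoch = datetime(1970, 1, 1)
    batches = defaultdict(list)
    for ts, value in events:
        # Floor the timestamp to the start of the window it falls in.
        offset = (ts - epoch) // interval        # timedelta // timedelta -> int
        window_start = epoch + offset * interval
        batches[window_start].append(value)
    return dict(batches)


# Hypothetical sample events: (timestamp, measured value)
events = [
    (datetime(2018, 6, 20, 9, 1), 5),
    (datetime(2018, 6, 20, 9, 8), 7),
    (datetime(2018, 6, 20, 9, 12), 1),
]

batches = batch_by_interval(events, timedelta(minutes=10))

# Each window is processed as a whole once it is complete.
totals = {start: sum(values) for start, values in batches.items()}
```

Changing `timedelta(minutes=10)` to `timedelta(hours=1)` or `timedelta(days=1)` gives the hourly or daily batches mentioned in the excerpt.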