Prerequisites
- Familiarity with terminals and shell commands, and fundamental knowledge of operating systems (Linux)
- Basic knowledge of virtualization and managing virtual machines
- Basic knowledge of containerization and working with Docker images and containers
Objectives
- Install a hypervisor or VM monitor such as VirtualBox
- Install Hortonworks Data Platform
- Access Hortonworks Data Platform via ssh
- Access HDFS and learn how to transfer files
Introduction
Every business is now a data business. Data is an organization's future and most valuable asset. The Hortonworks Data Platform (HDP) is a security-rich, enterprise-ready, open-source Hadoop distribution based on a centralized architecture (YARN). Hortonworks Sandbox is a single-node cluster that can be run as a Docker container installed on a virtual machine. HDP is a complete system with an open architecture for storing and processing complex, large-scale data. It is composed of numerous Apache Software Foundation (ASF) projects, including Apache Hadoop, and is built specifically to meet enterprise demands. Hortonworks was a standalone company until 2019, when it merged with Cloudera; Hortonworks is now a subsidiary of Cloudera, Inc.
Figure: Hortonworks merged with Cloudera in 2019
At the beginning of this course, we will review relational databases before dealing with big data. In practice, we will use the relational database management system PostgreSQL. Fortunately, HDP comes with a pre-installed PostgreSQL database server.
Hardware requirements
- Memory dedicated to the cluster (Minimum: 4 GiB, Recommended: 8+ GiB). More is better.
- CPU (Minimum: 4 Cores, Recommended: 6+ Cores)
- Virtualization should be enabled (Check Virtualization on Windows; on Linux: lscpu). Sometimes it is disabled in BIOS.
- Storage for HDP 2.5.0: 25-35 GiB
- Storage for HDP 2.6.5: 65-75 GiB
- Storage for HDP 3.0.1: 80-100 GiB
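The requirements above can be checked on a Linux host with a short script. The following is only a minimal sketch under a few assumptions: it reads /proc/cpuinfo and /proc/meminfo (so it is Linux-only), it treats the vmx (Intel VT-x) or svm (AMD-V) CPU flag as a sign that hardware virtualization is available, and it uses the lower bound of each storage range as the threshold. The function names and the MIN_STORAGE_GIB table are ours, for illustration only. It also compares the total host memory, while the requirement is about memory dedicated to the cluster, so treat the result as a rough first check.

```python
import os
import shutil

# Thresholds copied from the hardware requirements list; the lower bound of
# each storage range is used (an assumption made for this sketch).
MIN_CORES = 4
MIN_MEMORY_GIB = 4
MIN_STORAGE_GIB = {"2.5.0": 25, "2.6.5": 65, "3.0.1": 80}


def has_virtualization_flag(cpuinfo_text):
    """Return True if a 'flags' line lists vmx (Intel VT-x) or svm (AMD-V)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            flags = value.split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


def mem_total_gib(meminfo_text):
    """Convert the MemTotal value of /proc/meminfo (reported in kB) to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / 1024**2
    raise ValueError("MemTotal line not found")


def check_host(cores, memory_gib, free_disk_gib, virtualization, hdp_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU cores, minimum is {MIN_CORES}")
    if memory_gib < MIN_MEMORY_GIB:
        problems.append(f"only {memory_gib:.1f} GiB RAM, minimum is {MIN_MEMORY_GIB}")
    needed = MIN_STORAGE_GIB[hdp_version]
    if free_disk_gib < needed:
        problems.append(f"only {free_disk_gib:.1f} GiB free, HDP {hdp_version} needs {needed}+")
    if not virtualization:
        problems.append("no vmx/svm CPU flag found; check the BIOS settings")
    return problems


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo") and os.path.exists("/proc/meminfo"):
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        with open("/proc/meminfo") as f:
            meminfo = f.read()
        problems = check_host(
            cores=os.cpu_count() or 0,
            memory_gib=mem_total_gib(meminfo),
            free_disk_gib=shutil.disk_usage(".").free / 1024**3,
            virtualization=has_virtualization_flag(cpuinfo),
            hdp_version="2.6.5",  # change to the version you plan to install
        )
        for problem in problems:
            print("WARNING:", problem)
        if not problems:
            print("Host meets the minimum requirements.")
    else:
        print("This sketch reads /proc and therefore only works on Linux.")
```

Run the script from the disk where you plan to store the virtual machine or Docker data, since the free space is measured for the current directory. A missing flag does not always mean the CPU lacks virtualization support, so if the script warns about it, check the BIOS first.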
Installing Hortonworks Sandbox
There are two common ways to install the HDP Sandbox on your PC. The first is to use a hypervisor such as VirtualBox, where the virtual machine pulls the Docker image and runs a container for your cluster. The second is to use Docker directly. On Linux, you manage the resources via docker command-line options. On Windows, you can use the WSL backend and configure the resources via the .wslconfig file.
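For the WSL backend, the resource limits live in a small INI-style file. Below is a minimal sketch that renders such a file; it assumes the [wsl2] section with the memory and processors keys, which are the settings we know from the WSL configuration documentation, and the values 8 GB and 4 cores are just the figures from the hardware requirements. The helper name render_wslconfig is ours. Please verify the keys against the documentation of your WSL version before relying on them.

```python
import configparser
import io


def render_wslconfig(memory_gib, processors):
    """Return the text of a .wslconfig file that limits the WSL 2 VM resources."""
    config = configparser.ConfigParser()
    config["wsl2"] = {
        "memory": f"{memory_gib}GB",    # RAM available to the WSL 2 VM
        "processors": str(processors),  # number of logical processors
    }
    buffer = io.StringIO()
    # Write key=value without spaces, matching the style used in WSL examples.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_wslconfig(memory_gib=8, processors=4))
```

The printed text is meant to be saved as .wslconfig in your Windows user profile folder. As far as we know, WSL reads this file only at startup, so it has to be restarted (for example with wsl --shutdown) before new limits apply.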
A. Using a Hypervisor
If you are not familiar with Docker, you can follow this approach, where the cluster resources are configured via the hypervisor GUI. If your machine has limited resources, we recommend using Docker directly instead, so that you do not spend resources on running the guest virtual machine.
In this approach, you will run a virtual machine which in turn runs your cluster container, so the operating system of the virtual machine will be different from the operating system of the container (the HDP Sandbox cluster). You can see this by checking the content of the /home directory or the version of the operating system with cat /etc/redhat-release.
1. Installing a Hypervisor
We recommend VirtualBox as a hypervisor since it supports the most common operating systems (Linux, Windows, and macOS). Follow the links in the list below to download your preferred hypervisor.
- Oracle VM VirtualBox (Recommended)
- VMware Workstation Player (Only for Linux and Windows)
- VMware Fusion (Only for macOS)
For installation instructions, you can search online, but I share here a tutorial on installing VirtualBox on Windows 11. If you have an old version of the software, we recommend updating it to avoid issues when installing the virtual machines. On my PC, I installed VirtualBox 7.0.4 in January 2023.
2. Downloading the Sandbox
The Hortonworks Data Platform (HDP) Sandbox is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.
The Sandbox is distributed in .ova format, with a separate download for each hypervisor. If you are using VirtualBox, then download from here. For VMware users, the download link is here. You can also download them from the official website, but this requires an account on the Cloudera website. I share below the download links for all available versions of the HDP Sandbox.
- The download links of HDP Sandbox on VMware: