Friday, 1 July 2016

Deploying HDP on Google Cloud

Tools for creating Hadoop and Spark clusters on Google Compute Engine. See http://cloud.google.com/hadoop for more information.

Hortonworks Data Platform + Google Cloud Platform

Hortonworks Data Platform (HDP) on Google Cloud Platform

This extension to Google's bdutil adds support for deploying the Hortonworks Data Platform (HDP) with a single command.
The extension uses Apache Ambari's Blueprint Recommendations to fully configure the cluster, with no manual configuration needed.

Resources

Video Tutorial

Before you start

Create a Google Cloud Platform account

Create a Google Cloud Project

  • Open https://console.developers.google.com/
  • Open 'Create Project' and fill in the details.
    • As an example, this document uses 'hdp-00'
  • Within the project, open 'APIs & auth -> APIs'. Then enable:
    • Google Compute Engine
    • Google Cloud Storage
    • Google Cloud Storage JSON API
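
If you prefer the command line, the same project setup can be sketched with the gcloud CLI; the service IDs below are the current API identifiers and are an assumption on my part (the console flow above is the documented path):

  gcloud projects create hdp-00                      ## create the project
  gcloud services enable compute.googleapis.com \
      storage-component.googleapis.com \
      storage-api.googleapis.com --project hdp-00    ## enable Compute Engine & the two Cloud Storage APIs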

Configure Google Cloud SDK & Google Cloud Storage

  • Install Google Cloud SDK locally
  • Configure the SDK:
    gcloud auth login                   ## authenticate to Google cloud
    gcloud config set project hdp-00    ## set the default project
    gsutil mb -p hdp-00 gs://hdp-00     ## create a cloud storage bucket
    

Download bdutil

Quick start

  1. Set your project & bucket from above in bdutil_env.sh (see the sketch after this list)
  2. Deploy or Delete the cluster: see './bdutil --help' for more details
  • Deploy: ./bdutil -e ambari deploy
  • Delete: ./bdutil -e ambari delete
    • when deleting, be sure to use the same switches/configuration as the deploy
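
For step 1, a minimal sketch of the two values to set; the variable names PROJECT and CONFIGBUCKET are my assumption of bdutil's defaults in bdutil_env.sh:

  ## in bdutil_env.sh, matching the project and bucket created earlier
  PROJECT='hdp-00'            ## the Google Cloud project ID
  CONFIGBUCKET='hdp-00'       ## the Cloud Storage bucket created with 'gsutil mb'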

Configuration

  • You can deploy without setting any configuration, but you should have a look at platforms/hdp/ambari.conf.
Here are some of the defaults to consider (an example of overriding them follows the block below):
  GCE_ZONE='us-central1-a'           ## the zone/region to deploy in
  NUM_WORKERS=4                      ## the number of worker nodes. Total
                                     ##     is NUM_WORKERS + 1 master
  GCE_MACHINE_TYPE='n1-standard-4'   ## the machine type
  WORKER_ATTACHED_PDS_SIZE_GB=1500   ## 1500GB attached to each worker
  MASTER_ATTACHED_PD_SIZE_GB=1500    ## 1500GB attached to master

  ## The Hortonworks Data Platform services which will be installed.
  ##   This is nearly the entire stack
  AMBARI_SERVICES="ACCUMULO AMBARI_METRICS ATLAS FALCON FLUME GANGLIA HBASE HDFS
      HIVE KAFKA MAHOUT MAPREDUCE2 OOZIE PIG SLIDER SPARK SQOOP STORM TEZ YARN
      ZOOKEEPER"

  AMBARI_PUBLIC=false                ## Services listed on internal
                                     ##   hostname not public IP. Need
                                     ##   a socks proxy or tunnel to access
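
For example, an edited platforms/hdp/ambari.conf might look like this before deploying (the values here are illustrative, not recommendations):

  ## platforms/hdp/ambari.conf -- example overrides
  GCE_ZONE='europe-west1-b'            ## deploy in a different zone
  NUM_WORKERS=8                        ## 8 workers + 1 master
  GCE_MACHINE_TYPE='n1-highmem-8'      ## larger machines
  WORKER_ATTACHED_PDS_SIZE_GB=500      ## smaller attached disks
  AMBARI_SERVICES="HDFS MAPREDUCE2 YARN ZOOKEEPER HIVE PIG TEZ SPARK"   ## a reduced stack

  ## then deploy as usual
  ./bdutil -e ambari deploy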

Use the cluster

SSH

  • You'll have immediate SSH access with: ./bdutil shell
  • Or update your SSH config with: gcloud compute config-ssh
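
A quick sketch of both, assuming the default master name (hadoop-m), zone (us-central1-a), and project (hdp-00) used in this document:

  ./bdutil shell                        ## SSH straight to the master via bdutil

  gcloud compute config-ssh             ## or register all instances in ~/.ssh/config ...
  ssh hadoop-m.us-central1-a.hdp-00     ## ... then SSH using <instance>.<zone>.<project>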

Access Ambari & other services

a. With a local socks proxy:
  ./bdutil socksproxy             # opens a socks proxy to the cluster at localhost:1080

  # The Chrome extension 'Proxy SwitchySharp' can route requests to Google Compute hosts through the proxy automatically
  open http://hadoop-m:8080/      # open Ambari in a browser configured to use the proxy
b. Or a local SSH tunnel
  gcloud compute config-ssh                  # updates our SSH config for direct SSH access to all nodes
  ssh -L 8080:127.0.0.1:8080 hadoop-m  <TAB> # quick tunnel to Apache Ambari (tab-complete the full hostname)
  open http://localhost:8080/                # open Ambari in your browser
c. Or open a firewall rule from the Google Cloud Platform control panel
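
For option c, a rough equivalent with the gcloud CLI instead of the control panel; the rule name and source range are placeholders, and the source range should be restricted to your own IP:

  gcloud compute firewall-rules create allow-ambari \
      --allow tcp:8080 \
      --source-ranges 203.0.113.5/32    ## placeholder IP -- use your own address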

Use the cluster

You now have a full HDP cluster. If you are new to Hadoop check the tutorials at http://hortonworks.com/.
For command-line based jobs, 'bdutil' provides methods for passing commands through to the cluster: https://cloud.google.com/hadoop/running-a-mapreduce-job
For example: ./bdutil shell < ./extensions/google/gcs-validate-setup.sh
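
As a further, hypothetical example, a small script piped through './bdutil shell' can launch a MapReduce job whose output lands in Cloud Storage; the examples-jar path below is the usual HDP location and is an assumption here:

  ## teragen.sh -- a tiny job script to pass through to the master
  hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
      teragen 10000 gs://hdp-00/teragen-out

  ## run it from your workstation
  ./bdutil shell < teragen.sh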

Questions

Can I set/override Hadoop configurations during deployment?

To add or override Hadoop configurations, update configuration.json and then use the extension as documented. Please contribute back if you think the defaults should be changed.

Can I deploy HDP manually using Ambari and/or use my own Ambari Blueprints?

Yes. Set ambari_manual_env.sh as your environment (with the -e switch) instead of ambari_env.sh. That will configure Ambari across the cluster & handle all HDP prerequisites, but not trigger the Ambari Blueprints which install HDP.
After manually deploying your cluster, you can use ./bdutil <YOUR_FLAGS> -e platforms/hdp/ambari_manual_post_deploy_env.sh run_command_steps to configure HDFS directories and install the GCS connector. Note it uses run_command_steps instead of deploy.
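
Put together, the two steps might look like this (environment files passed by path, as in the command above):

  ## 1. create the instances, install Ambari and the HDP prerequisites -- no HDP services yet
  ./bdutil -e platforms/hdp/ambari_manual_env.sh deploy

  ## 2. install HDP yourself through Ambari, then finish the HDFS directories & GCS connector setup
  ./bdutil -e platforms/hdp/ambari_manual_post_deploy_env.sh run_command_steps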

Can I re-use the attached persistent disk(s) across deployments?

bdutil supports keeping persistent disks (aka ATTACHED_PDS) online when deleting machines. It can then deploy a new cluster using the same disks without loss of data, assuming the number of workers is the same.
The basic commands are below. Find more detail in TEST.md.
## deploy the cluster & create disks
./bdutil -e ambari deploy

## delete the cluster but don't delete the disks
export DELETE_ATTACHED_PDS_ON_DELETE=false
./bdutil -e ambari delete

## create with existing disks
export CREATE_ATTACHED_PDS_ON_DEPLOY=false
./bdutil -e ambari deploy
Another option is to use gs:// (Google Cloud Storage) instead of hdfs:// in your Hadoop jobs, even setting it as the default, or to back up HDFS to Google Cloud Storage before cluster deletion (see the sketch below).
Note: Hortonworks can't guarantee the safety of data throughout this process. You should always take care when manipulating disks and have backups where necessary.
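
For the backup route mentioned above, a minimal sketch using DistCp from the master (the HDFS path is a placeholder):

  ## copy an HDFS directory into the Cloud Storage bucket before deleting the cluster
  hadoop distcp hdfs:///user/important-data gs://hdp-00/backup/important-data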

What are the built-in storage options?

By default, HDFS is on attached disks ('pd-standard' or 'pd-ssd').
  • the size and type can be set in ambari.conf
The rest of the system resides on the local boot disk, unless configured otherwise.
Google Cloud Storage is also available with gs://. It can be used anywhere that hdfs:// is available, such as but not limited to mapreduce & hadoop fs operations.
  • Note: Adding an additional slash (gs:///) will allow you to use the default bucket (defined at cluster build) without needing to specify it. A few examples follow.
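
A few illustrative commands, using the gs:// scheme and the bucket from this document:

  hadoop fs -ls gs://hdp-00/                            ## list a specific bucket
  hadoop fs -ls gs:///                                  ## the same, via the default bucket
  hadoop fs -cp hdfs:///tmp/somefile gs:///backups/     ## copy from HDFS to Cloud Storage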

Can I deploy in the Google Cloud Platform Free Trial ?

You may use bdutil with HDP by lowering the machine type & count below the recommended specifications. To use the default configuration, upgrade the account from a free trial.
  • In 'platforms/hdp/ambari.conf':
    • GCE_MACHINE_TYPE='n1-standard-2'
    • NUM_WORKERS=3 # or fewer
  • Or, at the command line, provide these switches to 'deploy' & 'delete' (see the example below):
    • Deploy cluster: -n 3 -m n1-standard-2
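
For example, remembering to use the same switches for both deploy and delete:

  ./bdutil -e ambari -n 3 -m n1-standard-2 deploy    ## smaller cluster for the free trial
  ./bdutil -e ambari -n 3 -m n1-standard-2 delete    ## delete with the same switches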

Known Issues

Feedback & Issues

from https://github.com/GoogleCloudPlatform/bdutil/blob/master/platforms/hdp/README.md