To install and configure the RHadoop framework, you need Hadoop (2.6.0 or above, installed on every machine if you are running a cluster) and RStudio. You can refer to Michael G Noll's blog or Chalpritam's blog for both single-node and multi-node Hadoop setup. These installation and configuration steps have been tested on Ubuntu 14.04 LTS 32-bit. Please feel free to contact me via the comments if you find an error in the steps.
1. Switch to the root user so that all the RHadoop libraries are installed globally
sudo su
2. Start the R console using the command below
R
3. Install the libraries the RHadoop framework depends on using the following commands
install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava"))
install.packages(c("dplyr","R.methodsS3"))
install.packages(c("Hmisc"))
install.packages(c("caTools"))
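If you rerun this step on a machine that already has some of the libraries, it can help to check what is missing first. The snippet below is a minimal sketch of that idea in base R. The names `rhadoop_deps` and `missing_packages` are made up for this example and are not part of RHadoop.

```r
# CRAN packages from step 3 (the vector name `rhadoop_deps` is illustrative)
rhadoop_deps <- c("codetools", "Rcpp", "RJSONIO", "bitops", "digest",
                  "functional", "stringr", "plyr", "reshape2", "rJava",
                  "dplyr", "R.methodsS3", "Hmisc", "caTools")

# Hypothetical helper: return the subset of `pkgs` that is not installed
# in any of the current library paths
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(rhadoop_deps)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only the missing ones
} else {
  message("All listed packages are already installed")
}
```

The install call is left commented out so that the check itself never touches the network.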
4. Set up the system environment variables
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")  # replace 2.6.0 with your Hadoop version
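The streaming jar name depends on the installed Hadoop version, so instead of typing it by hand you could look it up. This is only a sketch: `find_streaming_jar` is a hypothetical helper written for this post, and it assumes the jar sits under share/hadoop/tools/lib inside the Hadoop directory, as in the paths above.

```r
# Hypothetical helper: list hadoop-streaming jars under a Hadoop install.
# Returns an empty character vector if the directory does not exist.
find_streaming_jar <- function(hadoop_home) {
  lib_dir <- file.path(hadoop_home, "share", "hadoop", "tools", "lib")
  list.files(lib_dir, pattern = "^hadoop-streaming-.*\\.jar$", full.names = TRUE)
}

hadoop_home <- "/usr/local/hadoop"  # install location used in this post
jars <- find_streaming_jar(hadoop_home)

if (length(jars) == 1) {
  # Exactly one candidate: safe to use it
  Sys.setenv(HADOOP_STREAMING = jars)
} else {
  # Zero or several candidates: set HADOOP_STREAMING by hand instead
  message("Expected exactly one streaming jar, found ", length(jars))
}
```

If the message is printed, check the directory yourself and set HADOOP_STREAMING manually.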
5. Download the rmr2 and rhdfs source packages from here and install them using the following commands, where path_to_rmr2package and path_to_rhdfspackage hold the paths to the downloaded files
install.packages(path_to_rmr2package, repos = NULL, type = "source")
install.packages(path_to_rhdfspackage, repos = NULL, type = "source")
6. After installing these packages, switch to RStudio and run the test code given below. Make sure you have executed the start-all.sh script of Hadoop in a separate terminal before running this program
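# Point R at the local Hadoop and Java installations
Sys.setenv(HADOOP_HOME = "/usr/local/hadoop")
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
# Adjust JAVA_HOME to the JDK on your machine (a 32-bit Ubuntu install typically uses a directory ending in -i386)
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java-7-openjdk-amd64")
Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")
library("rmr2")
library("rJava")
library("rhdfs")
# Initialise the connection to HDFS
hdfs.init()
# Write the integers 1 to 100 to HDFS
ints <- to.dfs(1:100)
# Map-only job: pair each value with its double
calc <- mapreduce(input = ints, map = function(k, v) cbind(v, 2 * v))
# Read the result back into the R session
from.dfs(calc)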
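To know what to expect from the test program, you can apply the same map function to the input in plain R, with no Hadoop involved. This is just an illustration, and the names `map_fn` and `out` are made up for it. If the job runs correctly, the values returned by from.dfs(calc) should hold the same two columns, although I would not rely on the rows coming back in the same order.

```r
# Same map function as in the test program; the key `k` is unused
map_fn <- function(k, v) cbind(v, 2 * v)

# Apply it directly to the input values, without going through Hadoop
out <- map_fn(NULL, 1:100)

dim(out)      # 100 rows, 2 columns
head(out, 3)  # first rows pair 1 with 2, 2 with 4, 3 with 6
```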