Today, I was working on the IBM Big Data University course Spark Fundamentals and found that there were some issues with the Data Scientist Workbench (DSWB) site: DSWB’s Jupyter Notebook link was not working. I worked around this by setting up Apache Spark in standalone mode on my home Windows 10 PC. This blog post summarizes the steps I performed.
To avoid the error below, download the winutils.exe binary and add it to the classpath:
Error: java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Please refer to the Wikipedia Apache Spark page ( https://en.wikipedia.org/wiki/Apache_Spark ) to start learning about it.
Software Version details:
- OS: Microsoft Windows 10 [Version 10.0.14393] 64bit
- Java JDK Version 1.8.0_101
- Apache Spark version 2.0.2
- Scala Version 2.12.0
Install Java
Java is required for Apache Spark. The Spark overview page clearly states: “It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation”. I’ve used Java JDK Version 1.8.0_101 for my setup.
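As a quick check from a command prompt (a minimal sketch; the JDK path below is the default install location for this version and may differ on your machine):

```
:: Verify that Java is reachable on the PATH
java -version

:: Alternatively, point JAVA_HOME at the JDK install folder for the current session
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_101
```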
Scala
Apache Spark is written in the Scala programming language and needs it installed on the local PC. I downloaded the Scala 2.12.0 binaries MSI installer from http://www.scala-lang.org/download/, followed the standard installation prompts, and installed Scala in the default path ( C:\Program Files (x86)\scala ).
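To confirm the installation (assuming the MSI added Scala’s bin folder to the PATH):

```
:: Print the installed Scala version
scala -version
```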
winutils
I referred to various sources and found that Spark can run locally but needs winutils.exe, which is a component of Hadoop. So what exactly is winutils, and why is it required? On further investigation, I found that, among other things, Spark uses Hadoop, which calls UNIX commands such as chmod to create files and directories. winutils calls are also made to read and write files on Windows. In summary, it is required for running shell commands on Windows OS. I’m running 64bit Windows 10 and downloaded winutils.exe from this GitHub URL: https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin . I placed the winutils.exe file in a folder.
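To verify that the binary runs on your machine (the path below is only an example; use the bin subfolder of whichever folder you chose):

```
:: Running winutils.exe with no arguments prints its usage text
D:\winutils\bin\winutils.exe
```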
Spark
I downloaded the latest Spark release, 2.0.2 (Nov 14, 2016), from the official download site ( https://spark.apache.org/downloads.html ). The downloaded file is a compressed .tgz archive, which I extracted with 7-Zip to a folder on my D drive, as my C drive has limited space.
Environment Variables
The following environment variables were set to specify where the required components are installed:
- HADOOP_HOME: the folder whose bin subfolder contains the winutils.exe file
- JAVA_HOME: the folder path of my JDK
- SCALA_HOME: the bin folder of the Scala installation
- SPARK_HOME: the bin folder of the uncompressed Spark distribution
Following are the values for my Desktop:
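(The paths below are illustrative, reconstructed from the locations mentioned in this post; D:\winutils is an assumed home for winutils.exe, and your JDK folder may differ.)

```
JAVA_HOME   = C:\Program Files\Java\jdk1.8.0_101
SCALA_HOME  = C:\Program Files (x86)\scala\bin
HADOOP_HOME = D:\winutils
SPARK_HOME  = D:\spark-2.0.2-bin-hadoop2.7\bin
```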
I also added the “D:\spark-2.0.2-bin-hadoop2.7\bin” folder to the PATH environment variable. This gives me the flexibility to run Spark from anywhere in the command prompt.
Start Spark
I opened a command prompt and ran “spark-shell.cmd”. I immediately got some errors on the console regarding Hive directory write permissions. I googled and found that Spark looks for a “\tmp\hive” folder, and this folder is expected to have 777 permissions. Further results hinted that this permission should be granted using the following winutils command.
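A sketch of that command based on those hints (run it from the folder containing winutils.exe, or give its full path; it assumes \tmp\hive sits on the drive you launch spark-shell from):

```
:: Grant Hadoop-style 777 (rwxrwxrwx) permissions on \tmp\hive
winutils.exe chmod 777 \tmp\hive
```

Re-run spark-shell.cmd afterwards to confirm the permission errors are gone.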
Testing Spark
I followed the Scala examples in the Spark Quick Start guide for tests.
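As a smoke test along the lines of the Quick Start (run spark-shell from the extracted Spark folder so the relative README.md path resolves; in Spark 2.x the shell pre-creates the spark session for you):

```
// Read Spark's bundled README as a Dataset of lines
val textFile = spark.read.textFile("README.md")

// Basic actions: count the lines and fetch the first one
textFile.count()
textFile.first()

// A simple transformation: count lines that mention "Spark"
textFile.filter(line => line.contains("Spark")).count()
```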
Apache Spark
Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data.
Steps to Install Spark on Windows.
Step 1: Install Java
You must have Java installed on your system. If you don't, download the appropriate Java version from this link.
Step 2: Downloading Winutils.exe and setting up Hadoop path
- To run Spark on Windows without any complications, we have to download winutils. Download winutils from this link.
- Now, create a folder in the C drive named winutils. Inside the winutils folder, create another folder named bin and place winutils.exe there.
- Now, open the environment variables dialog, click New, and add HADOOP_HOME pointing to C:\winutils (see the sketch after this list).
- Now, select the Path environment variable, which is already present in the list, and click Edit. In the Path variable, click New and add %HADOOP_HOME%\bin to it.
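Equivalently, from a command prompt (a sketch; setx writes user-level variables that take effect in new command prompt windows, and literal paths are used because a freshly set HADOOP_HOME would not expand in the same session):

```
:: Point HADOOP_HOME at the folder that contains bin\winutils.exe
setx HADOOP_HOME C:\winutils

:: Append the winutils bin folder to the user PATH
:: (the GUI editor described above is safer for long PATH values)
setx PATH "%PATH%;C:\winutils\bin"
```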
Step 3: Install Scala.
- If you don't have Scala installed on your system, download and install it from the official site.
Step 4: Downloading Apache Spark and Setting the path.
- Download the latest version of Apache Spark from the official site. At the time of writing, Spark 2.1.1 is the latest version.
- The downloaded file will be in .tgz form, so use software like 7-Zip to extract it.
- After extracting the files, create a new folder in the C drive named Spark (or any name you like) and copy the contents into that folder.
- Now, we have to set the SPARK_HOME environment variable. It is the same as setting the HADOOP_HOME path.
- Now, edit the Path variable to add %SPARK_HOME%\bin (see the sketch below).
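A command-line equivalent under the same assumptions (Spark extracted to C:\Spark; adjust if you chose a different folder name):

```
:: Point SPARK_HOME at the extracted Spark folder
setx SPARK_HOME C:\Spark

:: Add Spark's bin folder to the user PATH
setx PATH "%PATH%;C:\Spark\bin"
```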
That's it… :) Spark installation is done.
Running Spark
To run Spark, open a command prompt (CMD), type spark-shell, and hit Enter.
If everything is correct, the shell starts without errors, prints the Spark banner, and leaves you at a scala> prompt.
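As a quick sanity check at that prompt (spark is the SparkSession that spark-shell creates automatically):

```
// Should print the installed version, e.g. 2.1.1
spark.version
```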
To create a Spark project using SBT that works with Eclipse, check this link.
If you have any errors installing Spark, please post the problem in the comments and I will try to help solve it.