In this problem, you are required to use the spark.ml API. As in Problem 2, consider 3 objects:
(1) The first object, denoted by OA, is a ball centered at (0, 0, 0) of radius 1. As a set of points, we write OA = {(x, y, z) | x^2 + y^2 + z^2 ≤ 1}.
(2) The second object, denoted by OB, is a cylinder defined by OB = {(x, y, z) | …}.
(3) The third object, denoted by OC, is an ellipsoid OC = {(x, y, z) | (x − 2)^2/1.2 + y^2 + z^2/4 ≤ 1}.
Note that OA overlaps with OC a little bit; for example, the point (0.95, 0, 0) lies in both, since 0.95^2 ≈ 0.90 ≤ 1 and (0.95 − 2)^2/1.2 ≈ 0.92 ≤ 1.
Create a dataset in the following way:
(1) Each record in the dataset corresponds to a point contained in the union of OA, OB and OC. It has a "features" part consisting of the xyz coordinates of that point and a "label" part indicating which of OA, OB or OC the point is contained in. Note that since OA ∩ OC is nonempty, if a point happens to lie in OA ∩ OC, you can still only label it as OA or OC, but not both.
(2) The dataset you create should contain at least 500000 records. You should generate
the records randomly in the following way:
i. Each time, choose OA, OB or OC randomly. Suppose we choose OX (X is A,
B or C).
ii. Randomly create a point P contained in OX (think of how to do it). The features of the newly created record are the coordinates of P and the corresponding label is "OX".
iii. After creating all the records, you should load and transform the dataset into a Spark DataFrame (a sketch follows this list).
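One possible approach (not the required solution) is rejection sampling: choose an object, draw points uniformly from its bounding box, and keep the first point that satisfies the object's inequality. The sketch below is only illustrative. In particular, the membership test and bounding box used for OB are placeholders, since the cylinder's definition comes from Problem 2, and the points are generated on the driver before being loaded into a Spark DataFrame.

import random
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Membership tests for the three objects. The cylinder test and its bounding
# box are PLACEHOLDERS -- substitute the actual definition of OB from Problem 2.
def in_OA(x, y, z):
    return x * x + y * y + z * z <= 1.0

def in_OB(x, y, z):  # placeholder cylinder: x^2 + y^2 <= 1, |z| <= 2
    return x * x + y * y <= 1.0 and abs(z) <= 2.0

def in_OC(x, y, z):
    return (x - 2) ** 2 / 1.2 + y * y + z * z / 4.0 <= 1.0

# Bounding box of each object, used for rejection sampling.
objects = {
    'OA': (in_OA, [(-1, 1), (-1, 1), (-1, 1)]),
    'OB': (in_OB, [(-1, 1), (-1, 1), (-2, 2)]),  # placeholder box
    'OC': (in_OC, [(2 - 1.2 ** 0.5, 2 + 1.2 ** 0.5), (-1, 1), (-2, 2)]),
}

def sample_point(label):
    inside, box = objects[label]
    while True:  # draw uniformly in the box, keep the first point inside the object
        x, y, z = [random.uniform(lo, hi) for lo, hi in box]
        if inside(x, y, z):
            return (label, float(x), float(y), float(z))

# Steps i-ii: choose an object at random, then a random point inside it, 500,000 times.
rows = [sample_point(random.choice(['OA', 'OB', 'OC'])) for _ in range(500000)]

# Step iii: load the records into a Spark DataFrame and assemble the coordinates
# into a single "features" vector column.
df = spark.createDataFrame(rows, ['label', 'x', 'y', 'z'])
df = VectorAssembler(inputCols=['x', 'y', 'z'], outputCol='features').transform(df)
df.show(5)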
You are required to do the following work.
(1) Do classification using both logistic regression and the decision tree classifier. You should try several different training/test split ratios on your dataset, and for each trained model, evaluate the model and report its accuracy on the test set (an illustrative spark.ml sketch follows this list).
(2) Use K-means clustering to perform a cluster analysis of your data. Here only the "features" part of your data matters. Set the number of clusters K to 2, 3 and 4 respectively and compare the results. Show the location of the centroids in each case.
(3) Provide a visualization of the results of your classifications and cluster analysis.
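For reference, here is a minimal, illustrative spark.ml sketch of tasks (1) and (2). It assumes a DataFrame df with a string "label" column and a vector "features" column as constructed above; the split ratios, seeds and the labelIndex column name are arbitrary choices, not part of the assignment.

from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer

# The classifiers need a numeric label column; index the string labels OA/OB/OC.
indexed = StringIndexer(inputCol='label', outputCol='labelIndex').fit(df).transform(df)

evaluator = MulticlassClassificationEvaluator(labelCol='labelIndex',
                                              predictionCol='prediction',
                                              metricName='accuracy')

# (1) Logistic regression and decision tree, for a few train/test split ratios.
for ratio in [0.6, 0.7, 0.8]:
    train, test = indexed.randomSplit([ratio, 1 - ratio], seed=1)
    for clf in [LogisticRegression(labelCol='labelIndex', featuresCol='features'),
                DecisionTreeClassifier(labelCol='labelIndex', featuresCol='features')]:
        model = clf.fit(train)
        accuracy = evaluator.evaluate(model.transform(test))
        print(type(clf).__name__, ratio, accuracy)

# (2) K-means on the "features" column only, for K = 2, 3 and 4; print the centroids.
for k in [2, 3, 4]:
    centers = KMeans(k=k, featuresCol='features', seed=1).fit(indexed).clusterCenters()
    print(k, centers)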
In your report, you should provide both your code and a demonstration of the results. Take screenshots whenever necessary.
Apache Spark
Apache Spark is an open-source cluster-computing framework. Originally developed at UC Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it ever since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark ML
Apache Spark ML is a machine learning library providing common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.
Why Spark ML?
The transition to the era of big data requires heavy iterative computation on very large datasets. Standard implementations of machine learning algorithms require very powerful machines to run. Relying on such expensive machines is not cost-effective because of their high price and unjustified scaling costs. The idea behind using a distributed computing engine is to spread the computation across multiple low-end machines (commodity hardware) instead of a single high-end one. This clearly speeds up the learning phase and allows us to build better models.
Software requirements
To continue with this tutorial, you need to install the following:
Python
Apache Spark
the findspark library
NumPy
Jupyter
Apache Spark
Installing Apache Spark is very easy. You just need to download the package from the official website.
To test your installation:
unzip the downloaded archive
go to the bin directory
run the following command:
% ./pyspark --version
The output should look like this:
[Screenshot: testing the Apache Spark version]
The findspark library
To make Apache Spark easier to access, we will use findspark. It is a very simple library that automatically configures the development environment to import the Apache Spark library.
To install findspark, run the following in your shell:
% pip install findspark
NumPy
NumPy is a well-known library for numerical computing in Python. Spark ML uses it for its calculations.
Install it with the following command:
% pip install numpy
Jupyter
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.
To install Jupyter:
% pip install jupyter
Defining the problem
The first problem in this series is regression. We are going to train a model to predict house prices using the famous Boston Housing dataset (download from here).
This dataset contains information collected by the United States Census Service regarding housing in the Boston Mass area. It was obtained from the StatLib archive and has been widely used in the literature to compare algorithms.
The dataset is small, with only 506 cases in total. It contains 14 attributes, described below:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s
The goal is to use the 13 features to predict the value of MEDV (which represents the home price).
It's time to get your hands dirty. Let's jump into Spark and Spark ML.
Implementation
Apache Spark setup
To prepare your development environment, launch Jupyter and create a new notebook.
% jupyter notebook
Let's start by importing the findspark library and initializing the path to the Apache Spark folder.
import findspark
findspark.init('/opt/spark')
Every Spark application requires a SparkSession.
To create a SparkSession we write:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Loading data
data = spark.read.csv('./boston_housing.csv', header=True, inferSchema=True)
header=True signals that the first line contains a header
inferSchema=True enables automatic detection of the underlying data schema
To display data:
data.show()
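To check that inferSchema detected the column types correctly, and to sketch how the regression itself might proceed with spark.ml, something like the following can be used. This is only an illustrative sketch: the target column name MEDV, the 80/20 split and the seed are assumptions and should be adjusted to the actual CSV header.

data.printSchema()

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assemble all columns except the target into a single 'features' vector.
feature_cols = [c for c in data.columns if c != 'MEDV']
assembled = VectorAssembler(inputCols=feature_cols, outputCol='features').transform(data)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol='features', labelCol='MEDV').fit(train)

# Evaluate with RMSE on the held-out test split.
rmse = RegressionEvaluator(labelCol='MEDV', metricName='rmse').evaluate(model.transform(test))
print(rmse)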