Answer to Question #160025 in Python for TQTQ

Question #160025

In this problem, you are required to use the spark.ml API. As in Problem 2, consider 3 objects:

(1) The first object, denoted by OA, is a ball centered at (0, 0, 0) of radius 1. As a set of points, we write OA = {(x, y, z) | x^2 + y^2 + z^2 ≤ 1}.

(2) The second object, denoted by OB, is a cylinder defined by OB = {(x, y, z) | x^2 + y^2 ≤ 4, 2 ≤ z ≤ 4}.

(3) The third object, denoted by OC, is an ellipsoid OC = {(x, y, z) | (x − 2)^2 / 1.2 + y^2 + z^2 / 4 ≤ 1}.

Note that OA overlaps with OC a little bit.

Create a dataset in the following way:

(1) Each record in the dataset corresponds to a point contained in the union of OA, OB and OC. It has a "features" part made of the xyz coordinates of that point and a "label" part which tells which of OA, OB or OC this point is contained in. Note that since OA ∩ OC is nonempty, if the point happens to lie in OA ∩ OC, you still can only label it as OA or OC, but not both.

(2) The dataset you create should contain at least 500000 records. You should generate the records randomly in the following way:

i. Each time, choose OA, OB or OC randomly. Suppose we choose OX (X is A, B or C).

ii. Randomly create a point P contained in OX (think of how to do it). Now the features of the newly created record are the coordinates of P and the corresponding label is "OX".

iii. After creating all the records, you should load and transform the dataset into a Spark DataFrame (a minimal generation sketch follows this list).
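A minimal sketch of this generation procedure, assuming NumPy and PySpark are available; the helper names (sample_ball, sample_cylinder, sample_ellipsoid) and the rejection-sampling approach are illustrative choices, not prescribed by the question:

import random
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def sample_ball():
    # OA: rejection sampling inside the bounding cube of the unit ball
    while True:
        x, y, z = np.random.uniform(-1, 1, 3)
        if x**2 + y**2 + z**2 <= 1:
            return float(x), float(y), float(z)

def sample_cylinder():
    # OB: x^2 + y^2 <= 4, 2 <= z <= 4
    while True:
        x, y = np.random.uniform(-2, 2, 2)
        if x**2 + y**2 <= 4:
            return float(x), float(y), float(np.random.uniform(2, 4))

def sample_ellipsoid():
    # OC: (x - 2)^2 / 1.2 + y^2 + z^2 / 4 <= 1, sampled in its bounding box
    while True:
        x = np.random.uniform(2 - 1.2 ** 0.5, 2 + 1.2 ** 0.5)
        y = np.random.uniform(-1, 1)
        z = np.random.uniform(-2, 2)
        if (x - 2) ** 2 / 1.2 + y ** 2 + z ** 2 / 4 <= 1:
            return float(x), float(y), float(z)

samplers = {"OA": sample_ball, "OB": sample_cylinder, "OC": sample_ellipsoid}

rows = []
for _ in range(500000):
    label = random.choice(["OA", "OB", "OC"])   # step i: pick an object at random
    x, y, z = samplers[label]()                 # step ii: pick a point inside it
    rows.append((x, y, z, label))

# step iii: load the records into a Spark DataFrame
df = spark.createDataFrame(rows, ["x", "y", "z", "label"])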

You are required to do the following work.

(1) Do classification using both logistic regression and a decision tree classifier. You should try several different training/test split ratios on your dataset and, for each trained model, evaluate the model and show the accuracy on the test set.

(2) Use K-means clustering to make a cluster analysis of your data. Now only the "features" part of your data matters. Set the number K of clusters to 2, 3 and 4 respectively and make a comparison. Show the location of the centroids for each case.

(3) Provide a visualization of the results of your classifications and cluster analysis.

In your report, you should provide both your code and a demonstration of the results. Take screenshots whenever necessary. (A minimal spark.ml sketch of items (1) and (2) follows.)
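A minimal sketch of items (1) and (2) with spark.ml, assuming a DataFrame df with columns x, y, z and label like the one generated above; the split ratios and seeds are illustrative:

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# assemble the xyz coordinates into a single "features" vector column
assembler = VectorAssembler(inputCols=["x", "y", "z"], outputCol="features")
data = assembler.transform(df)

# encode the string labels OA/OB/OC as numeric indices
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
data = indexer.fit(data).transform(data)

evaluator = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="accuracy")

# (1) classification with several training/test split ratios
for ratio in [0.6, 0.7, 0.8]:
    train, test = data.randomSplit([ratio, 1 - ratio], seed=42)
    for clf in [LogisticRegression(labelCol="labelIndex", featuresCol="features"),
                DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")]:
        model = clf.fit(train)
        accuracy = evaluator.evaluate(model.transform(test))
        print(type(clf).__name__, "split", ratio, "accuracy", accuracy)

# (2) K-means clustering for K = 2, 3 and 4; only the features column is used
for k in [2, 3, 4]:
    kmeans = KMeans(k=k, featuresCol="features", seed=1)
    for center in kmeans.fit(data).clusterCenters():
        print("K =", k, "centroid:", center)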


Expert's answer
2021-01-30T22:37:58-0500

Apache Spark

Apache Spark is an open-source cluster-computing framework. Originally developed at UC Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it ever since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Spark ML

Apache Spark ML is a machine learning library of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and basic optimization primitives.

Why Spark ML?

The transition to the era of big data requires heavy iterative computation on very large datasets. Standard implementations of machine learning algorithms require very powerful machines to run. Relying on such expensive machines is not cost-effective because of their high price and unjustified scaling costs. The idea behind distributed computing engines is to spread the computation across many low-end machines (commodity hardware) instead of a single high-end one. This definitely speeds up the learning phase and allows us to create better models.


Software requirements

To continue with this tutorial, you need to install the following:

Python

Apache Spark

findspark library

NumPy

Jupyter

Apache Spark

Installing Apache Spark is very easy. You just need to download the package from the official website.

To verify the installation:

unzip the downloaded archive

go to the bin directory

run the following command

% ./pyspark --version

The output should display the installed Apache Spark version.

[Screenshot: Testing the Apache Spark version]

findspark library

To make Apache Spark easier to access, we will use findspark. It is a very simple library that automatically configures the development environment to import the Apache Spark library.

To install findspark, run the following in your shell:

% pip install findspark

NumPy

NumPy is a well-known library for numerical computing in Python. Spark ML uses it for its calculations.

Install it with the following command:


% pip install numpy

Jupyter

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.

To install Jupyter:

% pip install jupyter

Defining the problem

The first problem in this series is regression. We are going to train a model on the famous Boston Housing dataset (download from here).

This dataset contains information collected by the United States Census Service regarding housing in the Boston Mass area. It was obtained from the StatLib archive and has been widely used in the literature to compare algorithms.

The dataset is small, with 506 cases in total. It contains 14 features, described below:

CRIM: per capita crime rate by town

ZN: proportion of residential land zoned for lots over 25,000 sq. ft.

INDUS: proportion of non-retail business acres per town

CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)

NOX: nitric oxides concentration (parts per 10 million)

RM: average number of rooms per dwelling

AGE: proportion of owner-occupied units built prior to 1940

DIS: weighted distances to five Boston employment centres

RAD: index of accessibility to radial highways

TAX: full-value property tax rate per $10,000

PTRATIO: pupil-teacher ratio by town

B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town

LSTAT: % lower status of the population

MEDV: median value of owner-occupied homes in $1000s

The goal is to use the 13 features to predict the value of MEDV (which represents the home price).

It's time to get your hands dirty. Let's jump into Spark and Spark ML.

Implementation

Apache Spark setup

To prepare your development environment, launch Jupyter and create a new notebook.

% jupyter notebook

Let's start by importing the findspark library and initializing the path to the Apache Spark folder.

import findspark
findspark.init('/opt/spark')

Every Spark application requires a SparkSession.

To create a SparkSession we write:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Loading data

data = spark.read.csv('./boston_housing.csv', header=True, inferSchema=True)

header = True signals that the first line contains a header

inferSchema = True enables automatic detection of the underlying data schema

To display data:

data.show()
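From here, one possible continuation toward the stated goal (predicting MEDV) is sketched below; the column name MEDV comes from the dataset description above and may need adjusting to the actual CSV header:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# assemble the 13 predictor columns into a single "features" vector
feature_cols = [c for c in data.columns if c != "MEDV"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(data)

# split into training and test sets
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

# fit a linear regression model predicting MEDV
lr = LinearRegression(featuresCol="features", labelCol="MEDV")
model = lr.fit(train)

# evaluate on the test set with RMSE
evaluator = RegressionEvaluator(labelCol="MEDV", predictionCol="prediction",
                                metricName="rmse")
print("RMSE:", evaluator.evaluate(model.transform(test)))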



