$$
\begin{matrix}
\text{Data size (megabytes), } x & \text{Processed requests, } y \\
6 & 40 \\
7 & 55 \\
7 & 50 \\
8 & 41 \\
10 & 17 \\
10 & 26 \\
15 & 16
\end{matrix}
$$

The response variable here is the number of processed requests $(y)$, and we attempt to predict it from the size of a data set $(x)$.
$$
\def\arraystretch{1.5}
\begin{array}{c:c:c:c:c:c}
 & x & y & x^2 & xy & y^2 \\ \hline
 & 6 & 40 & 36 & 240 & 1600 \\
 & 7 & 55 & 49 & 385 & 3025 \\
 & 7 & 50 & 49 & 350 & 2500 \\
 & 8 & 41 & 64 & 328 & 1681 \\
 & 10 & 17 & 100 & 170 & 289 \\
 & 10 & 26 & 100 & 260 & 676 \\
 & 15 & 16 & 225 & 240 & 256 \\
Sum = & 63 & 245 & 623 & 1973 & 10027
\end{array}
$$
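The column sums in the table can be checked with a few lines of Python. This is only a sketch; the variable names (`sum_x`, `sum_xy`, and so on) are illustrative and not part of the original solution.

```python
# Data from the table: x = data size (MB), y = processed requests.
x = [6, 7, 7, 8, 10, 10, 15]
y = [40, 55, 50, 41, 17, 26, 16]

# Column sums used in the least-squares formulas.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_y2 = sum(yi * yi for yi in y)

print(sum_x, sum_y, sum_x2, sum_xy, sum_y2)  # 63 245 623 1973 10027
```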
$$\bar{x}={\sum_i x_i\over n}={63\over 7}=9,\ \bar{y}={\sum_i y_i\over n}={245\over 7}=35$$

$$S_{xx}=\sum_i x_i^2-n\cdot\bar{x}^2=623-7\cdot(9)^2=56$$

$$S_{xy}=\sum_i x_iy_i-n\cdot\bar{x}\bar{y}=1973-7\cdot(9)(35)=-232$$

$$S_{yy}=\sum_i y_i^2-n\cdot\bar{y}^2=10027-7\cdot(35)^2=1452$$

Therefore, based on the above calculations, the regression coefficients (the slope $m$ and the $y$-intercept $n$) are obtained as follows. Note that from here on $n$ denotes the intercept, not the sample size $n=7$ used in the sums above.
$$m={S_{xy}\over S_{xx}}={-232\over 56}=-{29\over 7}\approx-4.142857$$

$$n=\bar{y}-m\bar{x}=35-\left(-{29\over 7}\right)(9)={506\over 7}\approx72.285714$$

Therefore, we find that the regression equation is:

$$Y=72.285714-4.142857X$$
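As a cross-check of the slope and intercept, the same formulas can be evaluated with exact fractions. This is a minimal sketch that takes the column sums from the table as given; `n_obs`, `slope`, and `intercept` are illustrative names, chosen to avoid the clash between the sample size and the intercept $n$.

```python
from fractions import Fraction

# Sums taken from the table (7 observations).
n_obs = 7
sum_x, sum_y, sum_x2, sum_xy = 63, 245, 623, 1973

x_bar = Fraction(sum_x, n_obs)
y_bar = Fraction(sum_y, n_obs)

# S_xx and S_xy as defined in the solution.
s_xx = sum_x2 - n_obs * x_bar**2
s_xy = sum_xy - n_obs * x_bar * y_bar

slope = s_xy / s_xx                 # m in the text
intercept = y_bar - slope * x_bar   # n (the intercept) in the text

print(slope, round(float(slope), 6))          # -29/7 -4.142857
print(intercept, round(float(intercept), 6))  # 506/7 72.285714
```

Working in `Fraction` keeps the intermediate values exact, so the printed fractions can be compared directly with $-29/7$ and $506/7$.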
ii. Is there any correlation between the processing request and the size of incoming data?
What is the correlation coefficient?
Correlation coefficient

$$r={S_{xy}\over \sqrt{S_{xx}}\sqrt{S_{yy}}}={-232\over \sqrt{56}\sqrt{1452}}\approx-0.8136$$

Yes: since $|r|\approx0.81$, there is a strong negative correlation, meaning the number of processed requests tends to fall as the size of the incoming data grows.
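The correlation coefficient can be reproduced numerically from the three quantities $S_{xx}$, $S_{xy}$, and $S_{yy}$ computed earlier. A short sketch, with illustrative variable names:

```python
import math

# Values computed earlier in the solution.
s_xx, s_xy, s_yy = 56, -232, 1452

# Pearson correlation coefficient r = S_xy / (sqrt(S_xx) * sqrt(S_yy)).
r = s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

print(round(r, 4))  # -0.8136
```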
iii. By what percentage is the processing time dependent on the size of incoming data?
The coefficient of determination
$$r^2=(-0.8136)^2\approx0.6619=66.19\ \%$$

The proportion of the variance in $Y$ explained by the linear relationship between $X$ and $Y$ is $66.19\ \%$.
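The "variance explained" reading can be illustrated by computing $1-\mathrm{SSE}/\mathrm{SST}$ directly from the fitted line, which for simple linear regression equals $r^2$. This is a sketch under that standard identity; `sse`, `sst`, and `r_squared` are illustrative names not used in the original solution.

```python
# Data and fitted coefficients from the solution.
x = [6, 7, 7, 8, 10, 10, 15]
y = [40, 55, 50, 41, 17, 26, 16]
slope, intercept = -29 / 7, 506 / 7

y_bar = sum(y) / len(y)

# Residual sum of squares around the fitted line.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# Total sum of squares around the mean (equals S_yy).
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst
print(round(r_squared, 4))  # 0.6619
```

The remaining share of the variance in $Y$, about one third, is not accounted for by the linear fit.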