1 Introduction

This package complements the book ‘Introduction to Probability, Statistics and R for Data-Based Sciences’ by Sahu (2023). It distributes the data sets used in the book and provides code illustrating the statistical modelling of those data sets. In addition, it provides code illustrating various results in probability and statistics. For example, it includes code to simulate the Monty Hall game, which illustrates conditional probability, and simulation-based examples illustrating the central limit theorem and the weak law of large numbers. The package thus helps a beginner strengthen their understanding of a few elementary concepts in probability and statistics, and introduces them to linear statistical modelling, i.e., regression and ANOVA, which are among the key foundational concepts of data science, machine learning and, more generally, the data-based sciences.
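As an illustration of the kind of simulation mentioned above, the following is a minimal base-R sketch of the Monty Hall game. It is not the package's own function; the variable names (`n_games`, `prize`, `choice`) are chosen here for illustration only. It assumes the usual rules: the host always opens a door that hides no prize and is not the contestant's pick.

```r
# Base-R sketch of the Monty Hall game (illustrative; not the package's code)
set.seed(1)
n_games <- 10000
prize  <- sample(1:3, n_games, replace = TRUE)  # door hiding the prize
choice <- sample(1:3, n_games, replace = TRUE)  # contestant's first pick

# The host opens a door that is neither the pick nor the prize, so
# staying wins when the first pick was right and switching wins otherwise.
win_if_stay   <- mean(choice == prize)
win_if_switch <- mean(choice != prize)
print(c(stay = win_if_stay, switch = win_if_switch))
```

The two proportions should settle near 1/3 and 2/3 respectively, which is the conditional probability result the game is meant to illustrate.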

Figure 1: ipsRdbs book cover

1.1 Installing the required software packages

The reader should first install the R software. Searching the internet for CRAN leads to the web page https://cran.r-project.org/, from which the latest version of R suitable for the reader's operating system can be installed on their own computer. Please note that R cannot be installed on a mobile phone. Once R has been installed, the next task is to install the front-end software RStudio, which provides an easier interface for working with R.

After installing R and RStudio, the reader should launch RStudio on their computer. This opens a four-pane window, one pane of which holds a tab named ‘Console’ (next to a ‘Terminal’ tab). The Console accepts R commands (our instructions) and prints the results. For example, the reader may type 2+2 and then press the Enter key to see the result. Readers who would like a gentler introduction can find many tutorials and videos by searching the internet.

To get started, the reader is asked to install the add-on R package ipsRdbs simply by issuing the R command

install.packages("ipsRdbs", dependencies=TRUE)

without committing any typing mistakes. If this installation is successful, the reader can issue the following two commands to list all R objects (data sets and programmes) included in the package.

library(ipsRdbs)
ls("package:ipsRdbs")

Note that these commands produce the intended results only if the package has been successfully installed in the first place.
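One way to guard against a failed installation is to test for the package before attaching it. The sketch below uses base R's `requireNamespace()`; the variable name `pkg_ok` is introduced here for illustration.

```r
# Check whether ipsRdbs is installed before trying to use it
pkg_ok <- requireNamespace("ipsRdbs", quietly = TRUE)

if (pkg_ok) {
  library(ipsRdbs)
  print(ls("package:ipsRdbs"))   # list the data sets and functions
} else {
  message("ipsRdbs is not installed; run install.packages(\"ipsRdbs\") first.")
}
```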

1.2 How to learn more about the objects included in the package

All the objects listed in the output of the ls command in the previous section have associated help files. The reader can obtain information on each of these objects by typing a question mark immediately followed by the object name, e.g. ?butterfly, or by issuing the command help(butterfly).

The help files provide details about the objects, and the user can run all the code included as illustrations at the end of each help file. This can be done either by clicking the Run Examples link or simply by copying and pasting the commands into the console in RStudio. This is a great advantage of R, as it allows users to reproduce the results before they have learned all the commands and their syntax. Having gained this confidence, a beginner can examine and experiment with the commands further. More details regarding the objects are provided in the book by Sahu (2023).
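The examples in a help file can also be run from the console with base R's `example()` function, as an alternative to clicking the link or copying and pasting. The sketch below assumes the package is installed and uses the `butterfly` object mentioned above; the variable name `has_pkg` is illustrative.

```r
# Open a help page and run its examples from the console
has_pkg <- requireNamespace("ipsRdbs", quietly = TRUE)

if (has_pkg) {
  library(ipsRdbs)
  help("butterfly", package = "ipsRdbs")                  # same as ?butterfly
  example("butterfly", package = "ipsRdbs", ask = FALSE)  # run the examples
}
```

Setting `ask = FALSE` stops R from pausing before each new plot, which is convenient when the examples draw several figures.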

The remainder of this vignette elaborates on the help files for the main objects and programmes included in the package. The main intention is to enable the reader to reproduce all the results by actually running the commands and code already included in the help files.

Section 2 discusses all the data sets, and Section 3 discusses all the R functions.
Some summary remarks are provided in Section 4.

2 Data sets

2.1 beanie: Age and value of beanie baby toys

This data set contains the age and the value of 50 beanie baby toys. Source: Beanie world magazine. It is used as an example of simple linear regression modelling, where the exercise is to predict the value of a beanie baby toy from its age.

head(beanie)
#>      name age value
#> 1    Ally  52    55
#> 2   Batty  12    12
#> 3   Bongo  28    40
#> 4 Blackie  52    10
#> 5   Bucky  40    45
#> 6  Bumble  28   600
summary(beanie)
#>      name                age            value       
#>  Length:50          Min.   : 5.00   Min.   :  10.0  
#>  Class :character   1st Qu.:12.00   1st Qu.:  15.0  
#>  Mode  :character   Median :28.00   Median :  26.5  
#>                     Mean   :26.52   Mean   : 128.9  
#>                     3rd Qu.:40.00   3rd Qu.:  62.5  
#>                     Max.   :64.00   Max.   :1900.0
plot(beanie$age, beanie$value, xlab="Age", ylab="Value", pch="*", col="red")
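The regression exercise described above can be sketched with base R's `lm()`. This is a minimal illustration assuming the package is installed, not necessarily the analysis carried out in the book; the object name `fit` and the age value 36 are chosen here for illustration.

```r
# Simple linear regression of value on age (illustrative sketch)
if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  library(ipsRdbs)
  fit <- lm(value ~ age, data = beanie)
  print(summary(fit))

  # Scatter plot with the fitted line added
  plot(beanie$age, beanie$value, xlab = "Age", ylab = "Value")
  abline(fit, col = "blue")

  # Predicted value for a hypothetical toy of age 36
  print(predict(fit, newdata = data.frame(age = 36)))
}
```

The summary output shows a mean value far above the median, so a few very valuable toys may dominate the fit; it could be worth repeating the exercise with log(value) as the response.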

2.2 bill: Wealth and age of world billionaires

This data set contains the wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine. It can be used to illustrate exploratory data analysis, for example by producing side-by-side box plots of wealth for billionaires from different regions of the world. It can also be used for multiple linear regression modelling, although such tasks have not been undertaken here.

head(bill)
#>   wealth age region
#> 1   37.0  50      M
#> 2   24.0  88      U
#> 3   14.0  64      A
#> 4   13.0  63      U
#> 5   13.0  66      U
#> 6   11.7  72      E
summary(bill)
#>      wealth            age         region
#>  Min.   : 1.000   Min.   :  7.00   A:37  
#>  1st Qu.: 1.300   1st Qu.: 56.00   E:76  
#>  Median : 1.800   Median : 65.00   M:22  
#>  Mean   : 2.726   Mean   : 64.03   O:28  
#>  3rd Qu.: 3.000   3rd Qu.: 72.00   U:62  
#>  Max.   :37.000   Max.   :102.00
library(ggplot2)
gg <- ggplot(data=bill, aes(x=age, y=wealth)) +
  geom_point(aes(col=region, size=wealth)) +
  geom_smooth(method="loess", se=FALSE) +
  xlim(c(7, 102)) +
  ylim(c(1, 37)) +
  labs(subtitle="Wealth vs Age of Billionaires",
       y="Wealth (Billion US $)", x="Age",
       title="Scatterplot", caption="Source: Fortune Magazine, 1992.")
plot(gg)
#> `geom_smooth()` using formula = 'y ~ x'
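The side-by-side box plots mentioned in the description can be sketched with base R's `boxplot()`. The log scale on the vertical axis is a choice made here because wealth is strongly right-skewed; the object name `med_wealth` is illustrative, and the sketch assumes the package is installed.

```r
# Side-by-side box plots of wealth by region (illustrative sketch)
if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  library(ipsRdbs)
  boxplot(wealth ~ region, data = bill, log = "y", col = 2:6,
          xlab = "Region", ylab = "Wealth (Billion US $, log scale)")

  # Median wealth within each region, as a numerical companion to the plot
  med_wealth <- tapply(bill$wealth, bill$region, median)
  print(med_wealth)
}
```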

2.3 bodyfat: Body fat percentage and skinfold thickness of athletes

This data set contains body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport. This data set has been used to illustrate simple linear regression in Chapter 17 of the book by Sahu (2023).

summary(bodyfat)
#>     Skinfold         Bodyfat      
#>  Min.   : 28.00   Min.   : 5.630  
#>  1st Qu.: 37.52   1st Qu.: 6.968  
#>  Median : 47.70   Median : 8.625  
#>  Mean   : 51.42   Mean   : 9.251  
#>  3rd Qu.: 58.15   3rd Qu.:10.010  
#>  Max.   :113.50   Max.   :19.940
plot(bodyfat$Skinfold,  bodyfat$Bodyfat, xlab="Skin", ylab="Fat")

plot(bodyfat$Skinfold,  log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")

plot(log(bodyfat$Skinfold),  log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")

# Keep the transformed variables in the data set
bodyfat$logskin <- log(bodyfat$Skinfold)
bodyfat$logbfat <- log(bodyfat$Bodyfat)
# Create a grouped variable by cutting log skinfold into six intervals
bodyfat$cutskin <- cut(bodyfat$logskin, breaks=6)
boxplot(Bodyfat ~ cutskin, data=bodyfat, col=2:7)

library(ggplot2)
p2 <- ggplot(data=bodyfat, aes(x=cutskin, y=logbfat)) +
  geom_boxplot(col=2:7) +
  stat_summary(fun=mean, geom="line", aes(group=1), col="blue", linewidth=1) +
  labs(x="Grouped log skinfold", y="Log of body fat percentage",
       title="Boxplot of log-bodyfat percentage vs grouped log-skinfold")
plot(p2)
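The plots above suggest a roughly linear relationship on the log-log scale. A simple linear regression on that scale can be sketched as follows. Choosing the log-log model is an assumption made here for illustration, not necessarily the model fitted in the book, and the object name `fit_bf` is illustrative.

```r
# Simple linear regression of log body fat on log skinfold (illustrative sketch)
if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  library(ipsRdbs)
  fit_bf <- lm(log(Bodyfat) ~ log(Skinfold), data = bodyfat)
  print(coef(summary(fit_bf)))

  # Scatter plot on the log-log scale with the fitted line
  plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat),
       xlab = "log Skin", ylab = "log Fat")
  abline(fit_bf, col = "blue")

  # Residuals against fitted values as a basic model check
  plot(fitted(fit_bf), resid(fit_bf),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```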