This is a vignette for the R package ipsRdbs. This package contains data sets, programmes and illustrations discussed in the book “Introduction to Probability, Statistics and R: Foundations for Data-Based Sciences” by Sahu (2023).
ipsRdbs 1.0.0
beanie: Age and value of beanie baby toys
bill: Wealth and age of world billionaires
bodyfat: Body fat percentage and skinfold thickness of athletes
bombhits: Number and frequency of bomb hits in London
cement: Breaking strength of cement
cfail: Number of weekly computer failures
cheese: Taste of cheese
emissions: Exhaust emissions of cars
err_age: Error in guessing ages from photographs
ffood: Service times in a fast food restaurant
gasmileage: Gas mileage of cars
possum: Body weight and length of possums in Australian regions
puffin: Nesting habits of puffins in Newfoundland
rice: Data set on rice yield
wgain: Weight gain of students starting college
This package complements the book “Introduction to Probability, Statistics and R: Foundations for Data-Based Sciences” by Sahu (2023). The package distributes the data sets used in the book and provides code illustrating the statistical modelling of those data sets. In addition, it provides code illustrating various results in probability and statistics. For example, it provides code to simulate the Monty Hall game, which illustrates conditional probability, and gives simulation-based examples of the central limit theorem and the weak law of large numbers. The package thus helps a beginner to understand a few elementary concepts in probability and statistics, and introduces linear statistical modelling, i.e. regression and ANOVA, which are among the foundational concepts of data science, machine learning and, more generally, the data-based sciences.
The reader is first instructed to install the R software by searching for CRAN on the internet. The reader should then go to the web page https://cran.r-project.org/ and install the latest version of R appropriate for their own computer. Please note that R cannot be installed on a mobile phone.
Once R has been installed, the next task is to install the front-end software RStudio, which provides an easier interface for working with R.
After installing R and RStudio, the reader should launch RStudio on their computer. This opens a four-pane window in which one pane is named ‘Console’ or ‘Terminal’. This pane accepts commands (our instructions) and prints results. For example, the reader may type 2+2 and then hit the Enter key to examine the result. The reader is encouraged to search the internet for gentler introductions and videos.
To get started, the reader is asked to install the add-on R package ipsRdbs by issuing the R command
install.packages("ipsRdbs", dependencies=TRUE)
typed exactly as shown. If the installation is successful, the reader can issue the following two commands to list all R objects (data sets and programmes) included in the package.
library(ipsRdbs)
ls("package:ipsRdbs")
Note that these commands will produce the intended results only if the package has been successfully installed in the first place.
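The load-and-list step can be made a little more forgiving. The following is a minimal sketch and not part of the package: the helper name `pkg_available` is hypothetical, and it relies only on base R's `requireNamespace()`. It loads ipsRdbs and lists its objects when the package is present, and otherwise prints a reminder instead of stopping with an error.

```r
# Hypothetical helper: TRUE if the named package can be loaded, FALSE otherwise
pkg_available <- function(pkg) requireNamespace(pkg, quietly = TRUE)

if (pkg_available("ipsRdbs")) {
  library(ipsRdbs)
  # List all data sets and programmes exported by the package
  print(ls("package:ipsRdbs"))
} else {
  message("ipsRdbs is not installed; run ",
          "install.packages(\"ipsRdbs\", dependencies = TRUE) first.")
}
```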
All the objects listed in the output of the ls command above have associated help files. The reader can obtain information on each of those objects by typing a question mark immediately followed by the object name, e.g. ?butterfly, or by issuing the command help(butterfly).
The help files provide details about the objects, and the user can run all the code included as illustrations at the end of each help file. This can be done either by clicking the Run Examples link or by copying and pasting the commands into the console in RStudio. This is a great advantage of R: it allows users to reproduce results before they have learned all the commands and syntax. Having gained this confidence, a beginner can examine and experiment with the commands further. More details regarding the objects are provided in the book by Sahu (2023).
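The examples in a help file can also be run from the console with the `example()` function from R's utils package. The sketch below wraps that call in a hypothetical helper, `run_help_examples`, which returns FALSE without running anything when the named package is not installed. It assumes that the topic (here butterfly, the object mentioned above) has an Examples section in its help file.

```r
# Hypothetical helper: run the Examples section of a help topic if the
# package is installed; return FALSE (invisibly) otherwise
run_help_examples <- function(topic, pkg = "ipsRdbs") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; nothing was run.")
    return(invisible(FALSE))
  }
  # character.only = TRUE lets the topic be passed as a string
  utils::example(topic, package = pkg, character.only = TRUE)
  invisible(TRUE)
}

run_help_examples("butterfly")
```

Typing help(package = "ipsRdbs") shows the index of all help topics, which is a convenient way to choose what to run next.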
The remainder of this vignette elaborates on the help files for the main objects and programmes included in the package. The main intention is to enable the reader to reproduce all the results by actually running the commands and code already included in the help files.
Section 2 discusses all the data sets.
All the R functions are discussed in Section 3.
Some summary remarks are provided in Section 4.
beanie: Age and value of beanie baby toys
This data set contains the age and the value of 50 beanie baby toys. Source: Beanie World magazine. This data set has been used as an example of simple linear regression modelling, where the exercise is to predict the value of a beanie baby toy from its age.
head(beanie)
#>      name age value
#> 1    Ally  52    55
#> 2   Batty  12    12
#> 3   Bongo  28    40
#> 4 Blackie  52    10
#> 5   Bucky  40    45
#> 6  Bumble  28   600
summary(beanie)
#>      name                age            value
#>  Length:50          Min.   : 5.00   Min.   :  10.0
#>  Class :character   1st Qu.:12.00   1st Qu.:  15.0
#>  Mode  :character   Median :28.00   Median :  26.5
#>                     Mean   :26.52   Mean   : 128.9
#>                     3rd Qu.:40.00   3rd Qu.:  62.5
#>                     Max.   :64.00   Max.   :1900.0
plot(beanie$age, beanie$value, xlab="Age", ylab="Value", pch="*", col="red")
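Since this data set is described as an example of predicting value from age, a fitted line may make the idea concrete. The sketch below is an illustration only. To stay self-contained it uses just the six rows printed by head(beanie) above (the object name `beanie6` is hypothetical), so its coefficients will not match a fit to all 50 toys. With the package loaded, replacing `beanie6` by `beanie` fits the full data.

```r
# Six rows copied from the head(beanie) output; not the full data set
beanie6 <- data.frame(
  name  = c("Ally", "Batty", "Bongo", "Blackie", "Bucky", "Bumble"),
  age   = c(52, 12, 28, 52, 40, 28),
  value = c(55, 12, 40, 10, 45, 600)
)

# Simple linear regression of value on age
fit <- lm(value ~ age, data = beanie6)
print(coef(fit))

# Predicted value for a toy of age 36 (same units as the age column)
pred <- predict(fit, newdata = data.frame(age = 36))
print(pred)
```

The summary above shows that value is strongly right-skewed (mean 128.9 against a median of 26.5), so regressing log(value) on age may be worth trying as well.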
bill: Wealth and age of world billionaires
This data set contains the wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine. It can be used to illustrate exploratory data analysis by producing side-by-side box plots of wealth for billionaires from different regions of the world. It can also be used for multiple linear regression modelling, although that has not been undertaken here.
head(bill)
#>   wealth age region
#> 1   37.0  50      M
#> 2   24.0  88      U
#> 3   14.0  64      A
#> 4   13.0  63      U
#> 5   13.0  66      U
#> 6   11.7  72      E
summary(bill)
#>      wealth            age         region
#>  Min.   : 1.000   Min.   :  7.00   A:37
#>  1st Qu.: 1.300   1st Qu.: 56.00   E:76
#>  Median : 1.800   Median : 65.00   M:22
#>  Mean   : 2.726   Mean   : 64.03   O:28
#>  3rd Qu.: 3.000   3rd Qu.: 72.00   U:62
#>  Max.   :37.000   Max.   :102.00
library(ggplot2)
gg <- ggplot(data=bill, aes(x=age, y=wealth)) +
  geom_point(aes(col=region, size=wealth)) +
  geom_smooth(method="loess", se=FALSE) +
  xlim(c(7, 102)) +
  ylim(c(1, 37)) +
  labs(subtitle="Wealth vs Age of Billionaires",
       y="Wealth (Billion US $)", x="Age",
       title="Scatterplot", caption="Source: Fortune Magazine, 1992.")
plot(gg)
#> `geom_smooth()` using formula = 'y ~ x'
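The side-by-side box plots mentioned in the description are not drawn in this section; a minimal sketch using base R's `boxplot()` follows. The helper `region_medians` is a hypothetical name introduced here, the log scale on the wealth axis is a choice made for readability given the skew visible in the summary, and the plotting code runs only when ipsRdbs is installed.

```r
# Median wealth within each region; d needs columns wealth and region
region_medians <- function(d) tapply(d$wealth, d$region, median)

if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  bill <- ipsRdbs::bill
  # Side-by-side box plots of wealth by region, on a log scale
  boxplot(wealth ~ region, data = bill, log = "y", col = 2:6,
          xlab = "Region", ylab = "Wealth (Billion US $, log scale)")
  print(region_medians(bill))
}
```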
bodyfat: Body fat percentage and skinfold thickness of athletes
This data set contains body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport. This data set has been used to illustrate simple linear regression in Chapter 17 of the book by Sahu (2023).
summary(bodyfat)
#>     Skinfold         Bodyfat
#>  Min.   : 28.00   Min.   : 5.630
#>  1st Qu.: 37.52   1st Qu.: 6.968
#>  Median : 47.70   Median : 8.625
#>  Mean   : 51.42   Mean   : 9.251
#>  3rd Qu.: 58.15   3rd Qu.:10.010
#>  Max.   :113.50   Max.   :19.940
plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")
plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")
plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")
# Keep the log-transformed variables in the data set
bodyfat$logskin <- log(bodyfat$Skinfold)
bodyfat$logbfat <- log(bodyfat$Bodyfat)
# Group log-skinfold into six equal-width intervals
bodyfat$cutskin <- cut(bodyfat$logskin, breaks=6)
boxplot(Bodyfat~cutskin, data=bodyfat, col=2:7)
library(ggplot2)
# Box plots of log-bodyfat by group, with a line joining the group means
p2 <- ggplot(data=bodyfat, aes(x=cutskin, y=logbfat)) +
  geom_boxplot(col=2:7) +
  stat_summary(fun=mean, geom="line", aes(group=1), col="blue", linewidth=1) +
  labs(x="Grouped log-skinfold", y="Log of bodyfat percentage",
       title="Boxplot of log-bodyfat percentage vs grouped log-skinfold")
plot(p2)
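The log-log scatter plot suggests fitting a straight line on that scale. The sketch below is one way to do so and is not claimed to reproduce the analysis in Sahu (2023): `fit_loglog` is a hypothetical helper, and the fit runs only when ipsRdbs is installed.

```r
# Regress log(Bodyfat) on log(Skinfold); d needs columns Skinfold and Bodyfat
fit_loglog <- function(d) lm(log(Bodyfat) ~ log(Skinfold), data = d)

if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  bf <- ipsRdbs::bodyfat
  fit <- fit_loglog(bf)
  print(coef(fit))
  # Overlay the fitted line on the log-log scatter plot
  plot(log(bf$Skinfold), log(bf$Bodyfat), xlab = "log Skin", ylab = "log Fat")
  abline(fit, col = "blue")
}
```

On this scale the slope can be read as the approximate percentage change in body fat associated with a one percent change in skinfold thickness.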