*This is a personal introductory R notebook created for the purpose of enhancing my knowledge and skills in R, and more generally, statistical programming.*

In this notebook, I will be investigating the relationship between the battery capacity of modern mobile phones and the other specifications of the same phone, such as:

- Front and rear camera resolution (megapixels)
- Screen height and width
- Weight
- Number of processor cores
- Internal memory and RAM, etc.

The battery capacity of a mobile phone is intuitively dependent upon some of these variables, such as the number of processor cores (since more cores will consume more power). However, for other variables such as the dimensions and weight of a mobile phone, this relationship may not be as clear.

As such, this notebook aims to provide an exploratory analysis of the phone specifications and how they correlate with each other and with the battery capacity, as well as to model that relationship.

As the variable being modeled (battery capacity) is a real-valued number, the linear model seemed most applicable to this scenario. This is ideal since linear models are often used as a simple introduction to programming in R, using functions or packages such as `lm` or `glmnet`.

There are many resources dedicated to the teaching of linear models and linear regression, so I will not cover this to its full extent.

Through linear regression, we will attempt to learn a model that expresses the battery capacity of a mobile phone as a linear combination of its specifications, that is:

\[\text{Capacity}=\theta_0+\theta_1\text{FrontCam}+\theta_2\text{BackCam}+\cdots+\theta_{n-1}\text{Weight}+\theta_n\text{RAM}\]

where the specifications are the **explanatory variables** for our linear model, and the battery capacity is the **output variable**.

As expected, a mobile phone may have many specifications that its battery capacity might be dependent upon (some examples given above).
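To make the linear-combination form above concrete, here is a minimal sketch that fits such a model with `lm` on simulated data. The data frame `phones`, its value ranges, and the "true" coefficients are all made up for illustration and are not taken from the Kaggle dataset; only two explanatory variables are used to keep it short.

```r
# Simulated data only: every number here is invented for illustration
set.seed(1)
n <- 200
phones <- data.frame(
  ram       = runif(n, 256, 4000),  # hypothetical RAM in megabytes
  mobile_wt = runif(n, 80, 200)     # hypothetical weight in grams
)

# Generate the output from known coefficients plus Gaussian noise
phones$battery_power <- 500 + 0.2 * phones$ram + 3 * phones$mobile_wt +
  rnorm(n, sd = 25)

# Fit Capacity = theta_0 + theta_1 * RAM + theta_2 * Weight by least squares
fit <- lm(battery_power ~ ram + mobile_wt, data = phones)

# Estimated theta_0, theta_1, theta_2
theta <- coef(fit)
print(theta)
```

Because the data were generated from a known linear rule, the estimated coefficients should land close to the values used in the simulation, which is a quick sanity check that the formula interface is doing what the equation describes.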

The particular dataset we will be dealing with is the *Mobile Price Classification* dataset from Kaggle, by Abhishek Sharma. This dataset is designed for modeling mobile phone prices, but since the price variable is a category from 0 to 3 rather than a real-valued number, I opted for battery capacity as the output variable.

The dataset consists of instances of **22** variables (including the battery capacity and ID):

Name | Type | Description |
---|---|---|
`id` | Numeric | The auto-incrementing ID for each phone in the dataset |
`battery_power` | Numeric | Total energy a battery can store in one time (measured in mAh) |
`blue` | Boolean | Whether the phone has Bluetooth support or not |
`clock_speed` | Numeric | Speed at which microprocessor executes instructions |
`dual_sim` | Boolean | Whether the phone has dual-SIM support or not |
`fc` | Numeric | Front camera megapixels |
`four_g` | Boolean | Whether the phone has 4G support or not |
`int_memory` | Numeric | Internal memory in gigabytes |
`m_dep` | Numeric | Mobile depth in centimetres |
`mobile_wt` | Numeric | Weight of mobile phone in grams |
`n_cores` | Numeric | Number of processor cores |
`pc` | Numeric | Primary camera megapixels |
`px_height` | Numeric | Pixel resolution (height) |
`px_width` | Numeric | Pixel resolution (width) |
`ram` | Numeric | Random Access Memory in megabytes |
`sc_h` | Numeric | Screen height in centimetres |
`sc_w` | Numeric | Screen width in centimetres |
`talk_time` | Numeric | Longest time that a single battery charge will last when you are on a call |
`three_g` | Boolean | Whether the phone has 3G support or not |
`touch_screen` | Boolean | Whether the phone has a touch screen or not |
`wifi` | Boolean | Whether the phone has a WiFi adapter or not |
`price_range` | Numeric | Target variable with value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost) |

For our purposes, the `id` and `price_range` variables are not necessary, which reduces the dataset to the battery capacity output variable and **19** explanatory specification variables.
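One way to drop these two variables is by name rather than by column position, so the same line works whether or not a given file contains both columns. The sketch below uses a tiny stand-in data frame called `toy` with invented values; the name `drop_cols` is also just for illustration.

```r
# Tiny stand-in data frame; the values are invented for illustration
toy <- data.frame(
  id            = 1:3,
  battery_power = c(842, 1021, 563),
  ram           = c(2549, 2631, 2603),
  price_range   = c(1, 2, 2)
)

# Columns we do not want in the model
drop_cols <- c("id", "price_range")

# Keep every column whose name is not in drop_cols
specs <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(names(specs))
```

Selecting by name is a little more verbose than indexing by position, but it does not silently pick the wrong columns if the column order of a file changes.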

Installing and requiring dependencies and additional files:

```
# Set working directory
setwd('~/Downloads/R/reg-lm')
# Install dependencies (uncomment on first run)
#install.packages("readr")
#install.packages("glmnet")
#install.packages("glmnetUtils")
#install.packages("ggplot2")
#install.packages("GGally")
# Load dependencies (library() stops with an error if a package is missing)
library(readr)
library(glmnet)
library(glmnetUtils)
library(ggplot2)
library(GGally)
# Set seed for reproducibility
set.seed(9292)
```

Importing the training and test sets, ignoring the unnecessary features:

```
# Import train + test sets
train <- read_csv('data/train.csv')[,1:20] # Ignore price_range
test <- read_csv('data/test.csv')[,2:21] # Ignore ID
```

To understand the extent to which the variables of the dataset correlate with each other, our first step is to visualize the **correlation matrix** of the training data.

```
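# Correlation plot of all variables (including output)
# ggcorr computes the correlation matrix from the data itself, so pass the data frame rather than cor(train)
ggcorr(train, label=TRUE, label_size=3, label_round=2, label_alpha=TRUE, hjust=0.85)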
```
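Since the output variable is what we ultimately care about, it can also help to pull out just its column of the correlation matrix and rank the explanatory variables by the strength of their correlation with it. The following is a minimal sketch using base R's `cor` on simulated data; the data frame `toy` and the way its output is generated are assumptions made purely for illustration.

```r
# Simulated data only: the output depends on ram but not on mobile_wt
set.seed(2)
n <- 100
toy <- data.frame(ram = rnorm(n), mobile_wt = rnorm(n))
toy$battery_power <- 2 * toy$ram + rnorm(n)

# Pearson correlation matrix of all columns (including the output)
corr <- cor(toy)

# Correlation of each explanatory variable with the output
with_output <- corr[rownames(corr) != "battery_power", "battery_power"]

# Rank by absolute correlation, strongest first
ranked <- with_output[order(abs(with_output), decreasing = TRUE)]
print(ranked)
```

On the real training set, the same indexing would apply to `cor(train)` with `battery_power` as the column of interest.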