This is a personal introductory R notebook, created to improve my knowledge of R and, more generally, of statistical programming.

Introduction

In this notebook, I will be investigating the relationship between the battery capacity of modern mobile phones and the other specifications of the same phones, such as camera resolution, weight, RAM, and the number of processor cores.

The battery capacity of a mobile phone is intuitively dependent upon some of these variables, such as the number of processor cores (since more cores will consume more power). However, for other variables such as the dimensions and weight of a mobile phone, this relationship may not be as clear.

As such, this notebook aims to explore the specifications of the phones, how they correlate with each other and with the battery capacity, and then to model that relationship.

Linear modeling

As the variable being modeled (battery capacity) is real-valued, a linear model seemed most applicable to this scenario. This is also convenient, since linear models are often used as a simple introduction to programming in R, through the lm function or packages such as glmnet.

There are many resources dedicated to the teaching of linear models and linear regression, so I will not cover this to its full extent.


Through linear regression, we will attempt to learn a model that expresses the battery capacity of a mobile phone as a linear combination of its specifications—that is:

\[\text{Capacity}=\theta_0+\theta_1\text{FrontCam}+\theta_2\text{BackCam}+\cdots+\theta_{n-1}\text{Weight}+\theta_n\text{RAM}\]

where the specifications are the explanatory variables of our linear model, and the battery capacity is the output variable.
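To make the form of this model concrete, here is a minimal sketch using base R's lm on synthetic data. The data frame `phones`, its values, and the "true" coefficients are all made up for illustration; only the column names are borrowed from the dataset described later.

```r
# Synthetic stand-in data: values and coefficients are invented for illustration
set.seed(1)
n <- 200
phones <- data.frame(
  ram       = runif(n, 256, 4000),  # hypothetical RAM in megabytes
  mobile_wt = runif(n, 80, 200)     # hypothetical weight in grams
)

# Generate capacity as a known linear combination plus Gaussian noise
phones$battery_power <- 500 + 0.2 * phones$ram + 3 * phones$mobile_wt +
  rnorm(n, sd = 25)

# Fit Capacity = theta_0 + theta_1 * ram + theta_2 * mobile_wt by least squares
fit <- lm(battery_power ~ ram + mobile_wt, data = phones)
theta <- coef(fit)
print(theta)

# Predict the capacity of a new (made-up) phone
new_phone <- data.frame(ram = 2048, mobile_wt = 150)
pred <- predict(fit, newdata = new_phone)
print(pred)
```

Because the data were generated from a linear rule, the estimated coefficients should land close to the values used to simulate them. With real data there is no such guarantee.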

Dataset and specifications

As expected, a mobile phone may have many specifications that its battery capacity might be dependent upon (some examples given above).

The particular dataset we will be dealing with is the Mobile Price Classification dataset from Kaggle, by Abhishek Sharma. This dataset is designed for modeling mobile phone prices, but since its price_range variable is a class label from 0 to 3 rather than a real-valued price, I opted for battery capacity as the output variable.


The dataset consists of instances of 22 variables (including the battery capacity and ID):

Name Type Description
id Numeric The auto-incrementing ID for each phone in the dataset
battery_power Numeric Total charge the battery can store at one time (measured in mAh)
blue Boolean Whether the phone has Bluetooth support or not
clock_speed Numeric Speed at which microprocessor executes instructions
dual_sim Boolean Whether the phone has dual-SIM support or not
fc Numeric Front camera megapixels
four_g Boolean Whether the phone has 4G support or not
int_memory Numeric Internal memory in gigabytes
m_dep Numeric Mobile depth in centimetres
mobile_wt Numeric Weight of mobile phone in grams
n_cores Numeric Number of processor cores
pc Numeric Primary camera megapixels
px_height Numeric Pixel resolution (height)
px_width Numeric Pixel resolution (width)
ram Numeric Random Access Memory in megabytes
sc_h Numeric Screen height in centimetres
sc_w Numeric Screen width in centimetres
talk_time Numeric Longest time that a single battery charge will last when you are on a call
three_g Boolean Whether the phone has 3G support or not
touch_screen Boolean Whether the phone has a touch screen or not
wifi Boolean Whether the phone has a WiFi adapter or not
price_range Numeric Target variable with value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost)

For our purposes, the id and price_range variables are not necessary—reducing the dataset to the battery capacity output variable and 19 explanatory specification variables.

Setup

Setting the working directory, then installing and loading dependencies:

# Set working directory
setwd('~/Downloads/R/reg-lm')

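# Install dependencies (uncomment on first run)
#install.packages("readr")
#install.packages("glmnet")
#install.packages("glmnetUtils")
#install.packages("ggplot2")
#install.packages("GGally")

# Load dependencies (library() stops with an error if a package is missing,
# whereas require() only warns and returns FALSE)
library(readr)
library(glmnet)
library(glmnetUtils)
library(ggplot2)
library(GGally)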

# Set seed for reproducibility
set.seed(9292)

Importing the training and test sets, ignoring the unnecessary features:

# Import train + test sets
train <- read_csv('data/train.csv')[,1:20] # Ignore price_range
test <- read_csv('data/test.csv')[,2:21] # Ignore ID
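Selecting columns by position works only as long as the files keep their current column order. As a hedged alternative, the same columns can be dropped by name. The sketch below uses two tiny made-up data frames (`train_raw`, `test_raw`) in place of the real files, so only the column layout is meaningful.

```r
# Toy stand-ins for the two files (values are made up; only the layout matters)
train_raw <- data.frame(battery_power = c(800, 1200),
                        ram           = c(1024, 2048),
                        price_range   = c(1, 2))
test_raw  <- data.frame(id            = c(1, 2),
                        battery_power = c(900, 1500),
                        ram           = c(512, 3072))

# Drop the unwanted columns by name rather than by position
drop_cols <- c("id", "price_range")
train_toy <- train_raw[, !(names(train_raw) %in% drop_cols), drop = FALSE]
test_toy  <- test_raw[,  !(names(test_raw)  %in% drop_cols), drop = FALSE]

# Both sets should now expose the same columns in the same order
stopifnot(identical(names(train_toy), names(test_toy)))
print(names(train_toy))
```

The final check is a cheap safeguard: a model fitted on one set can only be applied to the other if the explanatory variables line up.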

Exploratory analysis of the training set

To understand the extent to which the variables of the dataset correlate with each other, our first step is to visualize the correlation matrix of the training data.

# Correlation plot of all variables (including output)
# ggcorr computes the correlations itself, so it is given the raw data rather than cor(train)
ggcorr(train, label=TRUE, label_size=3, label_round=2, label_alpha=TRUE, hjust=0.85)
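
Alongside the plot, it can help to read off the correlations with the output variable as plain numbers. The following is a minimal sketch in base R on synthetic data. The data frame `toy` and its columns `x1` and `x2` are hypothetical, and in the notebook the same steps would be applied to `train`.

```r
# Synthetic data: battery_power depends on x1 but not on x2
set.seed(2)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$battery_power <- 2 * toy$x1 + rnorm(100, sd = 0.5)

# Full correlation matrix, then the column of correlations with the output
cors <- cor(toy)
with_output <- cors[, "battery_power"]

# Drop the output's correlation with itself and sort by absolute strength
with_output <- with_output[names(with_output) != "battery_power"]
ranked <- with_output[order(abs(with_output), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would bury them.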