An advertising company sells a service of buying keywords in search engines on behalf of its customers. It is trying to optimise its keyword and funds allocation. The first step towards the optimal solution is to predict performance by keyword and fund. In this case, the goal is to predict the "Clicks" feature for new keywords that have no historical data.
The data comes from the train.csv file with the following features:
Date : The date the data was collected (yyyymmdd)
Market : The market (US/UK)
Keyword : The keyword
Average.Position : The average position the keyword had in a search engine
PPC : The amount of money agreed to be paid per click on a keyword
Impressions : The number of users who saw the ad
Clicks : The number of clicks a keyword had
Since this is a numerical predictive modelling problem, my original approach is to build a model using a linear regression algorithm.
0.1 Load data
0.2 Categorize features based on types
0.3 Define target variable
1.1 Identify missing values
1.2 Clean text data
1.3 Encode categorical feature
1.4 Convert the Date feature from integer type to a date object
1.5 Expand Date feature into 4 new features
2.1 Descriptive statistics
2.2 Identify correlation between features
2.3 Identify correlation between features and target variable
2.4 Distribution of target variable
2.5 Distribution of feature variables
3.1 Normalize distribution
3.2 Word Embedding for Keyword feature
4.1 Linear Regression
4.2 Lasso Regression
4.3 Ridge Regression
5.1 Prepare unseen data
5.2 Predict unseen data
6.1 Recap and insight
6.2 Future work
6.3 Answer test questions
This is the basic step where we import all the libraries we need and load the data.
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('train.csv')
data.shape
There are 264037 observations and 7 features. Let's see the first 10 rows of the data.
data.head(10)
data.dtypes
We have 5 numerical features and 2 object features, which are text data. We know that Date should not be an integer type; we will explore this further in the next step.
The goal of this task is to predict the number of clicks for new keywords, so it is clear that the 'Clicks' feature will be the target variable.
data.Clicks.value_counts()
Note that the data is heavily skewed: there is a large number of 0.0 values and a wide range of values. We need to normalize it, which we will explore in a later step.
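As a quick numeric check (using pandas' built-in skew method), a large positive skewness confirms the long right tail:
# skewness well above 0 indicates a right-skewed distribution
data['Clicks'].skew()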
# identify missing values in data
data.info()
From the result above, we can see there are missing values in the 'Keyword' feature.
# select data where Keyword is empty
data[data['Keyword'].isna()]
There are 383 missing values in the Keyword feature. Interestingly, these rows have an Average.Position of 1 and gain high impressions, so we will keep them. Another thing we must consider is that in the evaluate.csv file there are no NaN values for Keyword, only empty strings. Therefore, we need to convert the NaN values in Keyword to an empty string.
# Convert NaN value in Keyword to ''
data['Keyword'] = data.Keyword.fillna('')
data['Keyword']
# verify that no NaN values remain in Keyword
data[data['Keyword'].isna()]
When working with text data, we need to check that the text is free of noisy characters. We can do this by checking for non-ASCII characters with the regex pattern '[^\x00-\x7F]+': the range \x00-\x7F covers the ASCII characters 0-127, and the ^ inside the character class negates it, so the pattern matches any run of characters outside the ASCII range anywhere in the text.
# identify non ascii characters
data_non_ascii = data[data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data_non_ascii
Interestingly, all the non-ASCII characters that were flagged as noise come from words in other languages. However, we only have an English-language dictionary for further text processing, so we need to remove all rows containing non-English words, identified here as rows with non-ASCII characters. The total should now be 264037 - 4820 rows.
# total rows after removing non ascii characters
total_rows = 264037 - 4820
total_rows
# data without non ascii characters in Keyword column
# We invert this mask using ~ and use this to mask the data
data = data[~data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data
We have the feature 'Market', which is categorical. We will encode US-Market as 0 and UK-Market as 1.
data = data.copy()
# map the two market labels to integers
data['Market'] = data['Market'].map({'US-Market': 0, 'UK-Market': 1})
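As a quick sanity check, the encoded column should now contain only 0 and 1:
# sanity check: Market should now hold only the values 0 and 1
data['Market'].unique()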
data
The Date feature records the date the data was collected. We will convert it from integer type to a date object.
from datetime import date
def int2date(argdate: int) -> date:
    """
    If you have a date as an integer, use this method to obtain a datetime.date object.

    Parameters
    ----------
    argdate : int
        Date as a regular integer value (example: 20160618)

    Returns
    -------
    datetime.date
        A date object which corresponds to the given value `argdate`.
    """
    year = argdate // 10000
    month = (argdate % 10000) // 100
    day = argdate % 100
    return date(year, month, day)
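As a quick sanity check, using the example value from the docstring:
# sanity check with the docstring example
assert int2date(20160618) == date(2016, 6, 18)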
# convert integer to date object in Date column
data['New Date'] = data['Date'].apply(int2date)
data
We will expand this feature into 4 new features: year, month, day of the month, and day of the week. In this way we get new variables that an ML model will be able to process to find possible dependencies and correlations. The "Day of the week" variable contains values from 0 to 6, where each number represents a specific day of the week (from Monday to Sunday). We could drop the original "Date" variable now that we have these, but we will keep it for the moment and decide after checking its correlation with the target.
# create temporary column
data['Date Time'] = pd.to_datetime(data['New Date'])
data['Year'] = data['Date Time'].dt.year
data['Month'] = data['Date Time'].dt.month
data['Day of the month'] = data['Date Time'].dt.day
data["Day of the week"] = data['Date Time'].dt.dayofweek
# drop temporary columns
data = data.drop(['New Date'], axis=1)
data = data.drop(['Date Time'], axis=1)
#data = data.drop(['Date'], axis=1)
data
# getting summary stats on Data after preprocessing
data.describe()
A few observations we can see from the descriptive statistics above:
# Visualize trends on Average.Position, PPC, Impressions and Clicks
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Clicks']], palette=customPalette)
A few observations we can see from the scatter plots above:
To be more precise, we can look at the exact correlation scores between features using a heatmap.
# create a correlation heatmap of PPC, Impressions, and Average.Position
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[3:6]].corr()[['Average.Position']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);
We can see that PPC, Impressions and Average.Position are correlated, though not strongly.
# create a correlation heatmap of all features in relation to Clicks
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[1:]].corr()[['Clicks']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);
#sns.set(font_scale=4)
#plt.savefig("heatmap_fe_4.png")
A few observations we can see from the correlation map above:
The Date feature seems to have no correlation with the target variable, but we will not drop it here; by training our model, we will see whether the feature is worth keeping.
We will visualize the data of the target variable to better see and understand its distribution.
from IPython.display import display  # display() is available by default in notebooks

def plot_distribution(df, col):
    print('Unique values of {}:'.format(col))
    display(df[col].unique())
    print('Number of unique values of {}: {}\n'.format(col, len(df[col].unique())))
    fig, ax = plt.subplots(figsize=(15, 5))
    sns.set_style(style='whitegrid')
    feature = df[col]
    _ = feature.plot.hist(ax=ax, bins=15, color=['#d5ede4'])
    mean_line = ax.axvline(x=feature.mean(), linewidth=2, color='blue',
                           label='mean {:.2f}'.format(feature.mean()))
    median_line = ax.axvline(x=feature.median(), linewidth=2, color='red',
                             label='median {:.2f}'.format(feature.median()))
    _ = ax.legend(handles=[mean_line, median_line])
    _ = ax.set_title('Histogram of {} for Bins = 15'.format(col), fontsize='x-large')
    plt.show()
# Visualize distribution of target variable
plot_distribution(data, 'Clicks')
# Density plot of number of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True,
bins=15, color = 'lightblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
It's a right-skewed distribution, so the mean will be larger than the median. Right skewness can be reduced by applying root and logarithm transformations.
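As a small illustration (with arbitrary values, not taken from the data), a log transform compresses large values far more than small ones, which pulls in the right tail:
# log1p(x) = log(x + 1): maps 0 -> 0.0, 10 -> ~2.40, 1000 -> ~6.91
np.log1p([0, 10, 1000])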
# Visualize distribution on features
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']], palette=customPalette)
# Visualize distribution of Market
sns.countplot(x ='Market', data = data)
We see that US-Market and UK-Market have a balanced number of observations.
# Log transformation of the target variable
helpful_log = np.log(data.Clicks)
helpful_log.describe()
Many data points are 0, and the log of 0 is undefined (numpy returns -inf). For a quick fix, we can add 1 to each data point before taking the log. This works well since the log of 1 is 0. Furthermore, the spread is retained since all points are increased by the same amount.
click_log = np.log(data.Clicks + 1)
data['Clicks'] = np.sqrt(click_log)
data['Clicks'].describe()
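Since the transform applied above is sqrt(log(x + 1)), the inverse mapping back to the original click scale is exp(x^2) - 1. A quick sketch, which we will also need later when comparing predictions against actual clicks:
# inverse transform: recover the original click scale from sqrt(log(x + 1))
np.expm1(data['Clicks'] ** 2).describe()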
# Density Plot of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
# Log transformation of PPC
ppc_log = np.log(data.PPC + 1)
data['PPC'] = np.sqrt(ppc_log)
data['PPC'].describe()
# Density Plot of PPC
sns.distplot(data['PPC'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
# Log transformation of Impressions
impression_log = np.log(data.Impressions + 1)
data['Impressions'] = np.sqrt(impression_log)
data['Impressions'].describe()
# Density Plot of Impressions
sns.distplot(data['Impressions'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
We will create word embedding using Glove pretrained model for English https://nlp.stanford.edu/projects/glove/. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
# After downloading:
glove_path = "glove.6B.50d.txt"
def load_word_embeddings(file=glove_path):
    embeddings = {}
    with open(file, 'r', encoding="utf-8") as infile:
        for line in infile:
            values = line.split()
            # first token is the word, the rest is its 50-dim vector
            embeddings[values[0]] = np.asarray(values[1:], dtype='float64')
    return embeddings
embeddings = load_word_embeddings()
len(embeddings.keys())
# create average feature embedding for each sentence
def average_embedding(keyword, embeddings=embeddings, emb_size=50):
    token_keywords = keyword.lower().split()
    # keep only alphabetic tokens that exist in the GloVe vocabulary
    token_keywords = [w for w in token_keywords if w.isalpha() and w in embeddings]
    if len(token_keywords) == 0:
        return 0
    keywords_embedding = np.array([embeddings[w] for w in token_keywords])
    # mean over tokens yields a 50-dim vector; the second mean collapses it to a scalar
    sentences_embedding_avg = keywords_embedding.mean(axis=0)
    return sentences_embedding_avg.mean(axis=0)
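Note that this collapses each keyword's 50-dimensional average vector into a single scalar, so only one number per keyword enters the model. A quick illustration (the example keyword here is arbitrary, assuming the GloVe file loaded above):
# returns a single float for an in-vocabulary keyword, 0 otherwise
average_embedding('cheap flights')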
# apply average embedding function to Keyword column
data['Keyword'] = data['Keyword'].apply(average_embedding)
data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Define features and label
# x = data[['Keyword', 'Market', 'Average.Position', 'PPC', 'Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']]
x = data[['Keyword', 'Market', 'PPC', 'Year', 'Month', 'Day of the month', 'Day of the week']]
y = data['Clicks']
# Model initialization
lr_model = LinearRegression()
# Fit the data(train the model)
lr_model.fit(x, y)
# Predict
y_predicted = lr_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,lr_model.coef_)
print('Intercept:', lr_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
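Note that the metrics above are computed on the same data the model was trained on, so they are optimistic. A minimal sketch of a more honest evaluation on a held-out split (the 80/20 split and random_state are arbitrary choices, not part of the original analysis):
# (sketch) hold out 20% of the rows and evaluate on them
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
holdout_model = LinearRegression().fit(x_train, y_train)
y_test_pred = holdout_model.predict(x_test)
print('Hold-out RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('Hold-out R2:', r2_score(y_test, y_test_pred))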
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x + 1)) transform to compare on the original click scale
sns.distplot(np.expm1(data.Clicks ** 2), kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.expm1(y_predicted ** 2), kde=False, color='#69547C', bins=50, label='click_pred')
plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');
#plt.savefig("y_vs_yhat.png");
from sklearn.linear_model import LassoCV
# Model initialization
lasso_model = LassoCV()
# Fit the data(train the model)
lasso_model.fit(x, y)
# Predict
y_predicted = lasso_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,lasso_model.coef_)
print('Intercept:', lasso_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x + 1)) transform to compare on the original click scale
sns.distplot(np.expm1(data.Clicks ** 2), kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.expm1(y_predicted ** 2), kde=False, color='#69547C', bins=50, label='click_pred')
plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');
#plt.savefig("y_vs_yhat.png");
from sklearn.linear_model import RidgeCV
# Model initialization
ridge_model = RidgeCV()
# Fit the data(train the model)
ridge_model.fit(x, y)
# Predict
y_predicted = ridge_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,ridge_model.coef_)
print('Intercept:', ridge_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()