PREDICTING CLICKS FOR NEW KEYWORDS

Poetri Heriningtyas (poetri.h89@gmail.com)

Problem Statement

An advertising company sells a service of buying keywords in search engines on behalf of its customers. It is trying to optimise its keyword and fund allocation, and the first step towards the optimal solution is to predict performance by keyword and fund. In this case, the goal is to predict the feature “clicks” for new keywords with no historical data.

Data Description

Data comes from the train.csv file, with the following features:

Date : The date the data was collected (yyyymmdd)
Market : The market (US/UK)
Keyword : The keyword text
Average.Position : The average position the keyword had in a search engine
PPC : The amount of money agreed to be paid per click on a keyword
Impressions : The number of users who saw the ad
Clicks : The number of clicks a keyword had

Technique Overview

Since this is a numerical predictive modelling problem, my original approach was to create a model using a linear regression algorithm.

Technique Summary

0. Import Libraries and Load Data

0.1 Load data 
0.2 Categorize features based on types
0.3 Define target variable

1. Data Preprocessing

1.1 Identify missing values
1.2 Clean text data
1.3 Encode categorical feature
1.4 Convert integer type on Date feature to Date object
1.5 Expand Date feature into 4 new features

2. Exploratory Visualization

2.1 Descriptive statistics
2.2 Identify correlation between features
2.3 Identify correlation between features and target variable
2.4 Distribution of target variable
2.5 Distribution of feature variables

3. Feature Engineering

3.1 Normalize distribution
3.2 Word Embedding for Keyword feature

4. Create Prediction Model

4.1 Linear Regression
4.2 Lasso Regression
4.3 Ridge Regression

5. Evaluate Model On Unseen Data

5.1 Prepare unseen data
5.2 Predict unseen data

6. Summary

6.1 Recap and insight
6.2 Future work
6.3 Answer test questions

Step 0 - Import Libraries and Load Data

This is the basic step where we import all the libraries we will use and load the data.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

0.1. Load data

In [2]:
data = pd.read_csv('train.csv')
data.shape
Out[2]:
(264037, 7)

There are 264037 observations and 7 features. Let's see the first 10 rows of the data.

In [3]:
data.head(10)
Out[3]:
Date Keyword Market Average.Position PPC Impressions Clicks
0 20120524 secure online back up US-Market 0.0 0.00 0 0.00
1 20120524 agile management software US-Market 1.0 1.20 260 21.22
2 20120524 crm for financial US-Market 0.0 0.00 0 0.00
3 20120524 disaster recovery planning for it US-Market 0.0 0.00 0 0.00
4 20120524 tracking a vehicle US-Market 0.0 0.00 0 0.00
5 20120524 applications in the cloud US-Market 0.0 0.00 0 0.00
6 20120524 project management with scrum US-Market 0.0 0.00 0 0.00
7 20120524 servers for businesses US-Market 0.0 0.00 0 0.00
8 20120524 android application developers US-Market 0.0 0.00 0 0.00
9 20120524 android app developers US-Market 1.0 0.66 270 23.31

0.2. Categorize features based on types

In [4]:
data.dtypes
Out[4]:
Date                  int64
Keyword              object
Market               object
Average.Position    float64
PPC                 float64
Impressions           int64
Clicks              float64
dtype: object

We have 5 numerical features and 2 object features, which hold text data. Note that Date should not be an integer type; we will handle this in a later step.

0.3. Define Target Variable

The goal of this task is to predict the number of clicks for new keywords, so it is clear that the 'Clicks' feature will be the target variable.

In [5]:
data.Clicks.value_counts()
Out[5]:
0.00        162987
0.01           241
0.18           168
0.20           164
0.27           140
0.26           138
0.02           138
0.21           134
0.35           134
0.23           131
0.15           131
0.89           131
0.08           129
0.16           128
0.46           128
0.13           125
0.17           124
0.11           115
0.19           113
0.82           113
0.14           111
0.04           111
0.70           111
0.87           109
0.88           109
1.06           109
0.79           109
0.09           108
0.07           108
0.10           104
             ...  
37212.54         1
3427.18          1
4546.58          1
1701.78          1
117.16           1
581.14           1
814.02           1
342.32           1
653.43           1
3033.74          1
242.48           1
102.75           1
1539.33          1
1718.66          1
189.39           1
190.61           1
688.88           1
326.66           1
1162.47          1
7871.24          1
583.38           1
3904.13          1
1309.61          1
72.67            1
4897.10          1
429.54           1
107.29           1
101.69           1
200.62           1
105.80           1
Name: Clicks, Length: 36142, dtype: int64

Note that the target is heavily zero-inflated: there is a very large number of 0.00 values (162987 rows) alongside a very wide range of positive values. We will need to normalize it and explore it further in a later step.
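
A quick way to quantify the zero inflation (a minimal sketch, reusing the data frame loaded above):

In [ ]:
# fraction of rows with zero clicks; roughly 0.62 given the value_counts above
(data['Clicks'] == 0).mean()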

Step 1 - Data Preprocessing

In this step we will clean the data from missing values and manipulate some features.

1.1. Identify missing values

In [6]:
# identify missing values in data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264037 entries, 0 to 264036
Data columns (total 7 columns):
Date                264037 non-null int64
Keyword             263654 non-null object
Market              264037 non-null object
Average.Position    264037 non-null float64
PPC                 264037 non-null float64
Impressions         264037 non-null int64
Clicks              264037 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 14.1+ MB

From the result above, there are missing values in the 'Keyword' feature.

In [7]:
# select data where Keyword is empty
data[data['Keyword'].isna()]
Out[7]:
Date Keyword Market Average.Position PPC Impressions Clicks
1094 20120526 NaN US-Market 1.0 2.32 5499999 289365.25
1642 20120527 NaN US-Market 1.0 2.34 5499999 285877.47
2190 20120529 NaN US-Market 1.0 2.32 5499999 287599.59
2738 20120530 NaN US-Market 1.0 2.63 6714282 330769.03
3286 20120531 NaN US-Market 1.0 2.64 6714282 330107.41
3834 20120601 NaN US-Market 1.0 2.59 6714281 330516.59
4382 20120602 NaN US-Market 1.0 2.59 9571432 355690.41
4930 20120603 NaN US-Market 1.0 2.50 9571423 401689.91
5478 20120604 NaN US-Market 1.0 2.54 11166672 433008.50
6026 20120605 NaN US-Market 1.0 2.31 11750007 427226.03
6574 20120610 NaN US-Market 1.0 2.63 13400001 441963.50
7122 20120612 NaN US-Market 1.0 2.58 11166669 428694.63
7670 20120613 NaN US-Market 1.0 2.53 13428574 422703.03
8218 20120614 NaN US-Market 1.0 2.52 13428574 427788.47
8766 20120615 NaN US-Market 1.0 2.53 13428574 424935.56
9314 20120616 NaN US-Market 1.0 2.51 9571428 392467.78
9862 20120617 NaN US-Market 1.0 2.55 9571428 379890.72
10410 20120618 NaN US-Market 1.0 2.53 13428580 382663.88
10958 20120619 NaN US-Market 1.0 2.54 9571430 362908.22
11506 20120620 NaN US-Market 1.0 2.55 9571426 373766.03
12054 20120621 NaN US-Market 1.0 2.61 11166665 373583.75
12602 20120622 NaN US-Market 1.0 2.72 9571431 359941.81
13150 20120623 NaN US-Market 1.0 2.73 9571426 354189.47
13698 20120624 NaN US-Market 1.0 2.89 9571426 316026.56
14246 20120625 NaN US-Market 1.0 2.89 9571426 315998.50
14794 20120626 NaN US-Market 1.0 2.85 9571429 309798.00
15342 20120627 NaN US-Market 1.0 2.85 11166663 308133.22
15890 20120628 NaN US-Market 1.0 2.78 9571429 331790.28
16438 20120629 NaN US-Market 1.0 3.04 9571431 322409.72
16986 20120701 NaN US-Market 1.0 3.06 9571428 324125.81
... ... ... ... ... ... ... ...
247597 20130108 NaN UK-Market 1.0 5.42 2285713 44691.21
248145 20130109 NaN UK-Market 1.0 5.32 2285715 43786.67
248693 20130110 NaN UK-Market 1.0 5.37 2285715 43388.24
249241 20130111 NaN UK-Market 1.0 5.50 2285715 41995.05
249789 20130112 NaN UK-Market 1.0 5.54 2285715 41719.83
250337 20130113 NaN UK-Market 1.0 5.52 2285715 41901.30
250885 20130114 NaN UK-Market 1.0 5.53 1833333 42149.83
251433 20130115 NaN UK-Market 1.0 5.76 1571428 39829.72
251981 20130117 NaN UK-Market 1.0 6.30 1833332 37040.57
252529 20130118 NaN UK-Market 1.0 6.32 1571428 36058.94
253077 20130119 NaN UK-Market 1.0 6.67 1571428 35389.41
253625 20130120 NaN UK-Market 1.0 6.64 1571428 34820.20
254173 20130121 NaN UK-Market 1.0 6.52 1571428 35794.18
254721 20130122 NaN UK-Market 1.0 6.35 1571428 37103.77
255269 20130123 NaN UK-Market 1.0 6.18 1571428 38916.38
255817 20130124 NaN UK-Market 1.0 6.17 1833333 38951.21
256365 20130125 NaN UK-Market 1.0 6.33 1833332 38993.13
256913 20130126 NaN UK-Market 1.0 6.37 1571428 38653.07
257461 20130127 NaN UK-Market 1.0 6.27 1571428 39082.41
258009 20130128 NaN UK-Market 1.0 6.18 1571429 39547.60
258557 20130129 NaN UK-Market 1.0 6.22 1833332 40135.79
259105 20130130 NaN UK-Market 1.0 6.00 2285715 40734.42
259653 20130131 NaN UK-Market 1.0 6.33 2285714 40296.99
260201 20130201 NaN UK-Market 1.0 6.59 1571429 38847.38
260749 20130203 NaN UK-Market 1.0 6.89 1571429 37094.16
261297 20130204 NaN UK-Market 1.0 6.91 1571429 37020.55
261845 20130205 NaN UK-Market 1.0 6.76 1833333 36694.93
262393 20130206 NaN UK-Market 1.0 6.80 1833333 36694.93
262941 20130207 NaN UK-Market 1.0 6.93 1571428 37054.14
263489 20130212 NaN UK-Market 1.0 7.56 1571428 35155.06

383 rows × 7 columns

There are 383 rows with an empty Keyword. Interestingly, these rows have an Average.Position of 1 and gain very high impressions, so we will keep them. Another thing to consider is that in the evaluate.csv file, a missing Keyword is not NaN but an empty string. Therefore, we need to convert the NaN values in Keyword to an empty string.

In [8]:
# Convert NaN value in Keyword to ''
data['Keyword'] = data.Keyword.fillna('')
data['Keyword']
Out[8]:
0                                     secure online back up
1                                 agile management software
2                                         crm for financial
3                         disaster recovery planning for it
4                                        tracking a vehicle
5                                 applications in the cloud
6                             project management with scrum
7                                    servers for businesses
8                            android application developers
9                                    android app developers
10                                    training for indesign
11                           teach yourself web development
12                              web applications developers
13                            android application developer
14                                    android app developer
15                                    app developer android
16                                    android developer app
17                           mobile applications developers
18                                   design for programmers
19                              cloud computing for dummies
20                       android application development uk
21        web design and marketing solutions for busines...
22                           developers mobile applications
23                                        training indesign
24                                         apps development
25                                    computing for dummies
26                           bpp learning management system
27                                   mobile apps developers
28                                         joomla designers
29                              best pay as you go sim card
                                ...                        
264007                               21st century insurance
264008                              cheap insurance medical
264009                          angeles criminal lawyer los
264010                         low cost term life insurance
264011                                 home refinance loans
264012                                                eloan
264013                                 life insurance quote
264014                                 cheap life assurance
264015                                   anti spam exchange
264016                              credit equity home line
264017                                         att go phone
264018                                   dui florida lawyer
264019                                hair removal new york
264020                       merchant account international
264021                                   vioxx class action
264022                          life insurance quote online
264023                                   consolidation loan
264024                                equity line of credit
264025                             carolina law lemon north
264026                                         equity rates
264027                           direct home insurance line
264028                             angeles attorney los tax
264029                                     credit home line
264030                                       term insurance
264031                                      auto insurances
264032                                       debtfreedirect
264033                                 refinance comparison
264034                                acid reflux treatment
264035                        bad credit mortgage refinance
264036                               college loan refinance
Name: Keyword, Length: 264037, dtype: object

In [9]:
# select data where Keyword is empty
data[data['Keyword'].isna()]
Out[9]:
Date Keyword Market Average.Position PPC Impressions Clicks

1.2. Clean text data

When working with text data, we need to check whether the text is clean of any noisy characters. This can be done by checking for non-ASCII characters. We can use the regex pattern '[^\x00-\x7F]+': the range \x00-\x7F covers the ASCII codes 0-127, and the ^ inside the brackets negates the character class, so the pattern matches one or more non-ASCII characters anywhere in the text.
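
Before applying this to the whole column, a small demonstration of the pattern on made-up strings (a sketch):

In [ ]:
import re

# the pattern matches any run of characters outside the ASCII range 0-127
pattern = r'[^\x00-\x7F]+'
print(bool(re.search(pattern, 'cloud computing')))  # False: pure ASCII
print(bool(re.search(pattern, 'nürburgring')))      # True: 'ü' is non-ASCII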

In [10]:
# identify non ascii characters
data_non_ascii = data[data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data_non_ascii
Out[10]:
Date Keyword Market Average.Position PPC Impressions Clicks
247 20120524 트위터 US-Market 0.0 0.00 0 0.00
251 20120524 食べログ US-Market 0.0 0.00 0 0.00
254 20120524 как жрать суши US-Market 0.0 0.00 0 0.00
269 20120524 nürburgring US-Market 0.0 0.00 0 0.00
281 20120524 jörg kachelmann US-Market 0.0 0.00 0 0.00
282 20120524 menowin fröhlich US-Market 0.0 0.00 0 0.00
287 20120524 ресторан нью-йорк US-Market 0.0 0.00 0 0.00
288 20120524 машинное обучение US-Market 0.0 0.00 0 0.00
290 20120524 comercio electrónico US-Market 0.0 0.00 0 0.00
292 20120524 منتزه يلوستون الوطني US-Market 0.0 0.00 0 0.00
794 20120525 트위터 US-Market 0.0 0.00 0 0.00
798 20120525 食べログ US-Market 0.0 0.00 0 0.00
801 20120525 как жрать суши US-Market 0.0 0.00 0 0.00
816 20120525 nürburgring US-Market 0.0 0.00 0 0.00
828 20120525 jörg kachelmann US-Market 0.0 0.00 0 0.00
829 20120525 menowin fröhlich US-Market 0.0 0.00 0 0.00
834 20120525 ресторан нью-йорк US-Market 0.0 0.00 0 0.00
835 20120525 машинное обучение US-Market 0.0 0.00 0 0.00
837 20120525 comercio electrónico US-Market 0.0 0.00 0 0.00
839 20120525 منتزه يلوستون الوطني US-Market 0.0 0.00 0 0.00
1342 20120526 트위터 US-Market 0.0 0.00 0 0.00
1346 20120526 食べログ US-Market 0.0 0.00 0 0.00
1349 20120526 как жрать суши US-Market 0.0 0.00 0 0.00
1364 20120526 nürburgring US-Market 0.0 0.00 0 0.00
1376 20120526 jörg kachelmann US-Market 0.0 0.00 0 0.00
1377 20120526 menowin fröhlich US-Market 0.0 0.00 0 0.00
1382 20120526 ресторан нью-йорк US-Market 0.0 0.00 0 0.00
1383 20120526 машинное обучение US-Market 0.0 0.00 0 0.00
1385 20120526 comercio electrónico US-Market 0.0 0.00 0 0.00
1387 20120526 منتزه يلوستون الوطني US-Market 0.0 0.00 0 0.00
... ... ... ... ... ... ... ...
262641 20130206 트위터 UK-Market 0.0 0.00 0 0.00
262645 20130206 食べログ UK-Market 0.0 0.00 0 0.00
262648 20130206 как жрать суши UK-Market 0.0 0.00 0 0.00
262663 20130206 nürburgring UK-Market 1.0 2.15 64 1.21
262675 20130206 jörg kachelmann UK-Market 0.0 0.00 0 0.00
262676 20130206 menowin fröhlich UK-Market 0.0 0.00 0 0.00
262681 20130206 ресторан нью-йорк UK-Market 0.0 0.00 0 0.00
262682 20130206 машинное обучение UK-Market 0.0 0.00 0 0.00
262684 20130206 comercio electrónico UK-Market 0.0 0.00 0 0.00
262686 20130206 منتزه يلوستون الوطني UK-Market 0.0 0.00 0 0.00
263189 20130207 트위터 UK-Market 0.0 0.00 0 0.00
263193 20130207 食べログ UK-Market 0.0 0.00 0 0.00
263196 20130207 как жрать суши UK-Market 0.0 0.00 0 0.00
263211 20130207 nürburgring UK-Market 1.0 2.15 47 1.04
263223 20130207 jörg kachelmann UK-Market 0.0 0.00 0 0.00
263224 20130207 menowin fröhlich UK-Market 0.0 0.00 0 0.00
263229 20130207 ресторан нью-йорк UK-Market 0.0 0.00 0 0.00
263230 20130207 машинное обучение UK-Market 0.0 0.00 0 0.00
263232 20130207 comercio electrónico UK-Market 0.0 0.00 0 0.00
263234 20130207 منتزه يلوستون الوطني UK-Market 0.0 0.00 0 0.00
263737 20130212 트위터 UK-Market 0.0 0.00 0 0.00
263741 20130212 食べログ UK-Market 0.0 0.00 0 0.00
263744 20130212 как жрать суши UK-Market 0.0 0.00 0 0.00
263759 20130212 nürburgring UK-Market 1.0 1.86 44 0.78
263771 20130212 jörg kachelmann UK-Market 0.0 0.00 0 0.00
263772 20130212 menowin fröhlich UK-Market 0.0 0.00 0 0.00
263777 20130212 ресторан нью-йорк UK-Market 0.0 0.00 0 0.00
263778 20130212 машинное обучение UK-Market 0.0 0.00 0 0.00
263780 20130212 comercio electrónico UK-Market 0.0 0.00 0 0.00
263782 20130212 منتزه يلوستون الوطني UK-Market 0.0 0.00 0 0.00

4820 rows × 7 columns

Interestingly, all the non-ASCII characters that were flagged as noise are words from other languages. However, we only have an English-language dictionary to process the text further, so we need to remove all rows containing non-English words, identified here as rows with non-ASCII characters. The total should now be 264037 - 4820 rows.

In [11]:
# total rows after removing non ascii characters
total_rows = 264037 - 4820
total_rows
Out[11]:
259217
In [12]:
# data without non ascii characters in Keyword column
# We invert this mask using ~ and use this to mask the data
data = data[~data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data
Out[12]:
Date Keyword Market Average.Position PPC Impressions Clicks
0 20120524 secure online back up US-Market 0.0 0.00 0 0.00
1 20120524 agile management software US-Market 1.0 1.20 260 21.22
2 20120524 crm for financial US-Market 0.0 0.00 0 0.00
3 20120524 disaster recovery planning for it US-Market 0.0 0.00 0 0.00
4 20120524 tracking a vehicle US-Market 0.0 0.00 0 0.00
5 20120524 applications in the cloud US-Market 0.0 0.00 0 0.00
6 20120524 project management with scrum US-Market 0.0 0.00 0 0.00
7 20120524 servers for businesses US-Market 0.0 0.00 0 0.00
8 20120524 android application developers US-Market 0.0 0.00 0 0.00
9 20120524 android app developers US-Market 1.0 0.66 270 23.31
10 20120524 training for indesign US-Market 0.0 0.00 0 0.00
11 20120524 teach yourself web development US-Market 0.0 0.00 0 0.00
12 20120524 web applications developers US-Market 0.0 0.00 0 0.00
13 20120524 android application developer US-Market 1.0 1.02 546 46.92
14 20120524 android app developer US-Market 1.0 1.30 2469 147.46
15 20120524 app developer android US-Market 0.0 0.00 0 0.00
16 20120524 android developer app US-Market 0.0 0.00 0 0.00
17 20120524 mobile applications developers US-Market 1.0 0.57 260 21.69
18 20120524 design for programmers US-Market 0.0 0.00 0 0.00
19 20120524 cloud computing for dummies US-Market 1.0 0.50 330 37.89
20 20120524 android application development uk US-Market 0.0 0.00 0 0.00
21 20120524 web design and marketing solutions for busines... US-Market 0.0 0.00 0 0.00
22 20120524 developers mobile applications US-Market 0.0 0.00 0 0.00
23 20120524 training indesign US-Market 0.0 0.00 0 0.00
24 20120524 apps development US-Market 1.0 0.69 433 23.69
25 20120524 computing for dummies US-Market 0.0 0.00 0 0.00
26 20120524 bpp learning management system US-Market 0.0 0.00 0 0.00
27 20120524 mobile apps developers US-Market 1.0 2.53 529 22.94
28 20120524 joomla designers US-Market 1.0 1.37 268 17.84
29 20120524 best pay as you go sim card US-Market 0.0 0.00 0 0.00
... ... ... ... ... ... ... ...
264007 20130212 21st century insurance UK-Market 0.0 0.00 0 0.00
264008 20130212 cheap insurance medical UK-Market 0.0 0.00 0 0.00
264009 20130212 angeles criminal lawyer los UK-Market 0.0 0.00 0 0.00
264010 20130212 low cost term life insurance UK-Market 1.0 20.22 22 0.79
264011 20130212 home refinance loans UK-Market 1.0 3.69 178 7.33
264012 20130212 eloan UK-Market 0.0 0.00 0 0.00
264013 20130212 life insurance quote UK-Market 1.0 33.07 682 25.19
264014 20130212 cheap life assurance UK-Market 0.0 0.00 0 0.00
264015 20130212 anti spam exchange UK-Market 1.0 15.41 46 1.53
264016 20130212 credit equity home line UK-Market 0.0 0.00 0 0.00
264017 20130212 att go phone UK-Market 0.0 0.00 0 0.00
264018 20130212 dui florida lawyer UK-Market 0.0 0.00 0 0.00
264019 20130212 hair removal new york UK-Market 0.0 0.00 0 0.00
264020 20130212 merchant account international UK-Market 0.0 0.00 0 0.00
264021 20130212 vioxx class action UK-Market 0.0 0.00 0 0.00
264022 20130212 life insurance quote online UK-Market 0.0 0.00 0 0.00
264023 20130212 consolidation loan UK-Market 1.0 33.66 1675 52.48
264024 20130212 equity line of credit UK-Market 0.0 0.00 0 0.00
264025 20130212 carolina law lemon north UK-Market 0.0 0.00 0 0.00
264026 20130212 equity rates UK-Market 1.0 2.32 22 0.43
264027 20130212 direct home insurance line UK-Market 0.0 0.00 0 0.00
264028 20130212 angeles attorney los tax UK-Market 0.0 0.00 0 0.00
264029 20130212 credit home line UK-Market 0.0 0.00 0 0.00
264030 20130212 term insurance UK-Market 1.0 20.25 1698 62.03
264031 20130212 auto insurances UK-Market 0.0 0.00 0 0.00
264032 20130212 debtfreedirect UK-Market 1.0 413.53 22 0.74
264033 20130212 refinance comparison UK-Market 0.0 0.00 0 0.00
264034 20130212 acid reflux treatment UK-Market 1.0 0.98 67 1.77
264035 20130212 bad credit mortgage refinance UK-Market 0.0 0.00 0 0.00
264036 20130212 college loan refinance UK-Market 0.0 0.00 0 0.00

259217 rows × 7 columns

1.3. Encode categorical feature

We have the feature 'Market', which is categorical. We will encode US-Market as 0 and UK-Market as 1.

In [13]:
data = data.copy()
data['Market'] = data['Market'].apply(lambda item: 0 if item == 'US-Market' else item)
data['Market'] = data['Market'].apply(lambda item: 1 if item == 'UK-Market' else item)
        
data
Out[13]:
Date Keyword Market Average.Position PPC Impressions Clicks
0 20120524 secure online back up 0 0.0 0.00 0 0.00
1 20120524 agile management software 0 1.0 1.20 260 21.22
2 20120524 crm for financial 0 0.0 0.00 0 0.00
3 20120524 disaster recovery planning for it 0 0.0 0.00 0 0.00
4 20120524 tracking a vehicle 0 0.0 0.00 0 0.00
5 20120524 applications in the cloud 0 0.0 0.00 0 0.00
6 20120524 project management with scrum 0 0.0 0.00 0 0.00
7 20120524 servers for businesses 0 0.0 0.00 0 0.00
8 20120524 android application developers 0 0.0 0.00 0 0.00
9 20120524 android app developers 0 1.0 0.66 270 23.31
10 20120524 training for indesign 0 0.0 0.00 0 0.00
11 20120524 teach yourself web development 0 0.0 0.00 0 0.00
12 20120524 web applications developers 0 0.0 0.00 0 0.00
13 20120524 android application developer 0 1.0 1.02 546 46.92
14 20120524 android app developer 0 1.0 1.30 2469 147.46
15 20120524 app developer android 0 0.0 0.00 0 0.00
16 20120524 android developer app 0 0.0 0.00 0 0.00
17 20120524 mobile applications developers 0 1.0 0.57 260 21.69
18 20120524 design for programmers 0 0.0 0.00 0 0.00
19 20120524 cloud computing for dummies 0 1.0 0.50 330 37.89
20 20120524 android application development uk 0 0.0 0.00 0 0.00
21 20120524 web design and marketing solutions for busines... 0 0.0 0.00 0 0.00
22 20120524 developers mobile applications 0 0.0 0.00 0 0.00
23 20120524 training indesign 0 0.0 0.00 0 0.00
24 20120524 apps development 0 1.0 0.69 433 23.69
25 20120524 computing for dummies 0 0.0 0.00 0 0.00
26 20120524 bpp learning management system 0 0.0 0.00 0 0.00
27 20120524 mobile apps developers 0 1.0 2.53 529 22.94
28 20120524 joomla designers 0 1.0 1.37 268 17.84
29 20120524 best pay as you go sim card 0 0.0 0.00 0 0.00
... ... ... ... ... ... ... ...
264007 20130212 21st century insurance 1 0.0 0.00 0 0.00
264008 20130212 cheap insurance medical 1 0.0 0.00 0 0.00
264009 20130212 angeles criminal lawyer los 1 0.0 0.00 0 0.00
264010 20130212 low cost term life insurance 1 1.0 20.22 22 0.79
264011 20130212 home refinance loans 1 1.0 3.69 178 7.33
264012 20130212 eloan 1 0.0 0.00 0 0.00
264013 20130212 life insurance quote 1 1.0 33.07 682 25.19
264014 20130212 cheap life assurance 1 0.0 0.00 0 0.00
264015 20130212 anti spam exchange 1 1.0 15.41 46 1.53
264016 20130212 credit equity home line 1 0.0 0.00 0 0.00
264017 20130212 att go phone 1 0.0 0.00 0 0.00
264018 20130212 dui florida lawyer 1 0.0 0.00 0 0.00
264019 20130212 hair removal new york 1 0.0 0.00 0 0.00
264020 20130212 merchant account international 1 0.0 0.00 0 0.00
264021 20130212 vioxx class action 1 0.0 0.00 0 0.00
264022 20130212 life insurance quote online 1 0.0 0.00 0 0.00
264023 20130212 consolidation loan 1 1.0 33.66 1675 52.48
264024 20130212 equity line of credit 1 0.0 0.00 0 0.00
264025 20130212 carolina law lemon north 1 0.0 0.00 0 0.00
264026 20130212 equity rates 1 1.0 2.32 22 0.43
264027 20130212 direct home insurance line 1 0.0 0.00 0 0.00
264028 20130212 angeles attorney los tax 1 0.0 0.00 0 0.00
264029 20130212 credit home line 1 0.0 0.00 0 0.00
264030 20130212 term insurance 1 1.0 20.25 1698 62.03
264031 20130212 auto insurances 1 0.0 0.00 0 0.00
264032 20130212 debtfreedirect 1 1.0 413.53 22 0.74
264033 20130212 refinance comparison 1 0.0 0.00 0 0.00
264034 20130212 acid reflux treatment 1 1.0 0.98 67 1.77
264035 20130212 bad credit mortgage refinance 1 0.0 0.00 0 0.00
264036 20130212 college loan refinance 1 0.0 0.00 0 0.00

259217 rows × 7 columns
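
The two apply calls above can also be written as a single dictionary mapping; an equivalent one-liner for reference (assuming Market only ever takes the two values US-Market and UK-Market):

In [ ]:
# equivalent, more idiomatic alternative to the two apply calls above
# (for reference only; with .map, values outside the dict would become NaN)
# data['Market'] = data['Market'].map({'US-Market': 0, 'UK-Market': 1})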

1.4. Convert integer type on Date feature to Date object

The Date feature stores the date the data was collected as an integer in yyyymmdd format. We will convert this integer type to a Date object.

In [14]:
from datetime import date

def int2date(argdate: int) -> date:
    """
    If you have date as an integer, use this method to obtain a datetime.date object.

    Parameters
    ----------
    argdate : int
      Date as a regular integer value (example: 20160618)

    Returns
    -------
    dateandtime.date
      A date object which corresponds to the given value `argdate`.
    """
    year = int(argdate / 10000)
    month = int((argdate % 10000) / 100)
    day = int(argdate % 100)

    return date(year, month, day)
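
As an aside, pandas can do this conversion in a vectorized way; a minimal sketch that should produce the same dates (assuming the same yyyymmdd format):

In [ ]:
# vectorized alternative to applying int2date row by row
pd.to_datetime(data['Date'].astype(str), format='%Y%m%d').dt.date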
In [15]:
# convert integer to date object in Date column
data['New Date'] = data['Date'].apply(int2date)
data
Out[15]:
Date Keyword Market Average.Position PPC Impressions Clicks New Date
0 20120524 secure online back up 0 0.0 0.00 0 0.00 2012-05-24
1 20120524 agile management software 0 1.0 1.20 260 21.22 2012-05-24
2 20120524 crm for financial 0 0.0 0.00 0 0.00 2012-05-24
3 20120524 disaster recovery planning for it 0 0.0 0.00 0 0.00 2012-05-24
4 20120524 tracking a vehicle 0 0.0 0.00 0 0.00 2012-05-24
5 20120524 applications in the cloud 0 0.0 0.00 0 0.00 2012-05-24
6 20120524 project management with scrum 0 0.0 0.00 0 0.00 2012-05-24
7 20120524 servers for businesses 0 0.0 0.00 0 0.00 2012-05-24
8 20120524 android application developers 0 0.0 0.00 0 0.00 2012-05-24
9 20120524 android app developers 0 1.0 0.66 270 23.31 2012-05-24
10 20120524 training for indesign 0 0.0 0.00 0 0.00 2012-05-24
11 20120524 teach yourself web development 0 0.0 0.00 0 0.00 2012-05-24
12 20120524 web applications developers 0 0.0 0.00 0 0.00 2012-05-24
13 20120524 android application developer 0 1.0 1.02 546 46.92 2012-05-24
14 20120524 android app developer 0 1.0 1.30 2469 147.46 2012-05-24
15 20120524 app developer android 0 0.0 0.00 0 0.00 2012-05-24
16 20120524 android developer app 0 0.0 0.00 0 0.00 2012-05-24
17 20120524 mobile applications developers 0 1.0 0.57 260 21.69 2012-05-24
18 20120524 design for programmers 0 0.0 0.00 0 0.00 2012-05-24
19 20120524 cloud computing for dummies 0 1.0 0.50 330 37.89 2012-05-24
20 20120524 android application development uk 0 0.0 0.00 0 0.00 2012-05-24
21 20120524 web design and marketing solutions for busines... 0 0.0 0.00 0 0.00 2012-05-24
22 20120524 developers mobile applications 0 0.0 0.00 0 0.00 2012-05-24
23 20120524 training indesign 0 0.0 0.00 0 0.00 2012-05-24
24 20120524 apps development 0 1.0 0.69 433 23.69 2012-05-24
25 20120524 computing for dummies 0 0.0 0.00 0 0.00 2012-05-24
26 20120524 bpp learning management system 0 0.0 0.00 0 0.00 2012-05-24
27 20120524 mobile apps developers 0 1.0 2.53 529 22.94 2012-05-24
28 20120524 joomla designers 0 1.0 1.37 268 17.84 2012-05-24
29 20120524 best pay as you go sim card 0 0.0 0.00 0 0.00 2012-05-24
... ... ... ... ... ... ... ... ...
264007 20130212 21st century insurance 1 0.0 0.00 0 0.00 2013-02-12
264008 20130212 cheap insurance medical 1 0.0 0.00 0 0.00 2013-02-12
264009 20130212 angeles criminal lawyer los 1 0.0 0.00 0 0.00 2013-02-12
264010 20130212 low cost term life insurance 1 1.0 20.22 22 0.79 2013-02-12
264011 20130212 home refinance loans 1 1.0 3.69 178 7.33 2013-02-12
264012 20130212 eloan 1 0.0 0.00 0 0.00 2013-02-12
264013 20130212 life insurance quote 1 1.0 33.07 682 25.19 2013-02-12
264014 20130212 cheap life assurance 1 0.0 0.00 0 0.00 2013-02-12
264015 20130212 anti spam exchange 1 1.0 15.41 46 1.53 2013-02-12
264016 20130212 credit equity home line 1 0.0 0.00 0 0.00 2013-02-12
264017 20130212 att go phone 1 0.0 0.00 0 0.00 2013-02-12
264018 20130212 dui florida lawyer 1 0.0 0.00 0 0.00 2013-02-12
264019 20130212 hair removal new york 1 0.0 0.00 0 0.00 2013-02-12
264020 20130212 merchant account international 1 0.0 0.00 0 0.00 2013-02-12
264021 20130212 vioxx class action 1 0.0 0.00 0 0.00 2013-02-12
264022 20130212 life insurance quote online 1 0.0 0.00 0 0.00 2013-02-12
264023 20130212 consolidation loan 1 1.0 33.66 1675 52.48 2013-02-12
264024 20130212 equity line of credit 1 0.0 0.00 0 0.00 2013-02-12
264025 20130212 carolina law lemon north 1 0.0 0.00 0 0.00 2013-02-12
264026 20130212 equity rates 1 1.0 2.32 22 0.43 2013-02-12
264027 20130212 direct home insurance line 1 0.0 0.00 0 0.00 2013-02-12
264028 20130212 angeles attorney los tax 1 0.0 0.00 0 0.00 2013-02-12
264029 20130212 credit home line 1 0.0 0.00 0 0.00 2013-02-12
264030 20130212 term insurance 1 1.0 20.25 1698 62.03 2013-02-12
264031 20130212 auto insurances 1 0.0 0.00 0 0.00 2013-02-12
264032 20130212 debtfreedirect 1 1.0 413.53 22 0.74 2013-02-12
264033 20130212 refinance comparison 1 0.0 0.00 0 0.00 2013-02-12
264034 20130212 acid reflux treatment 1 1.0 0.98 67 1.77 2013-02-12
264035 20130212 bad credit mortgage refinance 1 0.0 0.00 0 0.00 2013-02-12
264036 20130212 college loan refinance 1 0.0 0.00 0 0.00 2013-02-12

259217 rows × 8 columns

1.5. Expand Date feature into 4 new features

We will expand this feature into four new features: year, month, day of the month, and day of the week. In this way, we get new variables that an ML model can process to find possible dependencies and correlations. The "Day of the week" variable contains values from 0 to 6, where each number represents a specific day of the week (from Monday to Sunday). Having created these new variables, the original "Date" column becomes redundant; we keep it for now and will decide later whether to drop it.

In [16]:
# create temporary column
data['Date Time'] = pd.to_datetime(data['New Date'])

data['Year'] = data['Date Time'].dt.year
data['Month'] = data['Date Time'].dt.month
data['Day of the month'] = data['Date Time'].dt.day
data["Day of the week"] = data['Date Time'].dt.dayofweek

# drop temporary columns
data = data.drop(['New Date'], axis=1)
data = data.drop(['Date Time'], axis=1)
#data = data.drop(['Date'], axis=1)

data
Out[16]:
Date Keyword Market Average.Position PPC Impressions Clicks Year Month Day of the month Day of the week
0 20120524 secure online back up 0 0.0 0.00 0 0.00 2012 5 24 3
1 20120524 agile management software 0 1.0 1.20 260 21.22 2012 5 24 3
2 20120524 crm for financial 0 0.0 0.00 0 0.00 2012 5 24 3
3 20120524 disaster recovery planning for it 0 0.0 0.00 0 0.00 2012 5 24 3
4 20120524 tracking a vehicle 0 0.0 0.00 0 0.00 2012 5 24 3
5 20120524 applications in the cloud 0 0.0 0.00 0 0.00 2012 5 24 3
6 20120524 project management with scrum 0 0.0 0.00 0 0.00 2012 5 24 3
7 20120524 servers for businesses 0 0.0 0.00 0 0.00 2012 5 24 3
8 20120524 android application developers 0 0.0 0.00 0 0.00 2012 5 24 3
9 20120524 android app developers 0 1.0 0.66 270 23.31 2012 5 24 3
10 20120524 training for indesign 0 0.0 0.00 0 0.00 2012 5 24 3
11 20120524 teach yourself web development 0 0.0 0.00 0 0.00 2012 5 24 3
12 20120524 web applications developers 0 0.0 0.00 0 0.00 2012 5 24 3
13 20120524 android application developer 0 1.0 1.02 546 46.92 2012 5 24 3
14 20120524 android app developer 0 1.0 1.30 2469 147.46 2012 5 24 3
15 20120524 app developer android 0 0.0 0.00 0 0.00 2012 5 24 3
16 20120524 android developer app 0 0.0 0.00 0 0.00 2012 5 24 3
17 20120524 mobile applications developers 0 1.0 0.57 260 21.69 2012 5 24 3
18 20120524 design for programmers 0 0.0 0.00 0 0.00 2012 5 24 3
19 20120524 cloud computing for dummies 0 1.0 0.50 330 37.89 2012 5 24 3
20 20120524 android application development uk 0 0.0 0.00 0 0.00 2012 5 24 3
21 20120524 web design and marketing solutions for busines... 0 0.0 0.00 0 0.00 2012 5 24 3
22 20120524 developers mobile applications 0 0.0 0.00 0 0.00 2012 5 24 3
23 20120524 training indesign 0 0.0 0.00 0 0.00 2012 5 24 3
24 20120524 apps development 0 1.0 0.69 433 23.69 2012 5 24 3
25 20120524 computing for dummies 0 0.0 0.00 0 0.00 2012 5 24 3
26 20120524 bpp learning management system 0 0.0 0.00 0 0.00 2012 5 24 3
27 20120524 mobile apps developers 0 1.0 2.53 529 22.94 2012 5 24 3
28 20120524 joomla designers 0 1.0 1.37 268 17.84 2012 5 24 3
29 20120524 best pay as you go sim card 0 0.0 0.00 0 0.00 2012 5 24 3
... ... ... ... ... ... ... ... ... ... ... ...
264007 20130212 21st century insurance 1 0.0 0.00 0 0.00 2013 2 12 1
264008 20130212 cheap insurance medical 1 0.0 0.00 0 0.00 2013 2 12 1
264009 20130212 angeles criminal lawyer los 1 0.0 0.00 0 0.00 2013 2 12 1
264010 20130212 low cost term life insurance 1 1.0 20.22 22 0.79 2013 2 12 1
264011 20130212 home refinance loans 1 1.0 3.69 178 7.33 2013 2 12 1
264012 20130212 eloan 1 0.0 0.00 0 0.00 2013 2 12 1
264013 20130212 life insurance quote 1 1.0 33.07 682 25.19 2013 2 12 1
264014 20130212 cheap life assurance 1 0.0 0.00 0 0.00 2013 2 12 1
264015 20130212 anti spam exchange 1 1.0 15.41 46 1.53 2013 2 12 1
264016 20130212 credit equity home line 1 0.0 0.00 0 0.00 2013 2 12 1
264017 20130212 att go phone 1 0.0 0.00 0 0.00 2013 2 12 1
264018 20130212 dui florida lawyer 1 0.0 0.00 0 0.00 2013 2 12 1
264019 20130212 hair removal new york 1 0.0 0.00 0 0.00 2013 2 12 1
264020 20130212 merchant account international 1 0.0 0.00 0 0.00 2013 2 12 1
264021 20130212 vioxx class action 1 0.0 0.00 0 0.00 2013 2 12 1
264022 20130212 life insurance quote online 1 0.0 0.00 0 0.00 2013 2 12 1
264023 20130212 consolidation loan 1 1.0 33.66 1675 52.48 2013 2 12 1
264024 20130212 equity line of credit 1 0.0 0.00 0 0.00 2013 2 12 1
264025 20130212 carolina law lemon north 1 0.0 0.00 0 0.00 2013 2 12 1
264026 20130212 equity rates 1 1.0 2.32 22 0.43 2013 2 12 1
264027 20130212 direct home insurance line 1 0.0 0.00 0 0.00 2013 2 12 1
264028 20130212 angeles attorney los tax 1 0.0 0.00 0 0.00 2013 2 12 1
264029 20130212 credit home line 1 0.0 0.00 0 0.00 2013 2 12 1
264030 20130212 term insurance 1 1.0 20.25 1698 62.03 2013 2 12 1
264031 20130212 auto insurances 1 0.0 0.00 0 0.00 2013 2 12 1
264032 20130212 debtfreedirect 1 1.0 413.53 22 0.74 2013 2 12 1
264033 20130212 refinance comparison 1 0.0 0.00 0 0.00 2013 2 12 1
264034 20130212 acid reflux treatment 1 1.0 0.98 67 1.77 2013 2 12 1
264035 20130212 bad credit mortgage refinance 1 0.0 0.00 0 0.00 2013 2 12 1
264036 20130212 college loan refinance 1 0.0 0.00 0 0.00 2013 2 12 1

259217 rows × 11 columns

Step 2 - Exploratory Visualization

2.1. Descriptive statistics

In [17]:
# getting summary stats on Data after preprocessing
data.describe()
Out[17]:
Date Market Average.Position PPC Impressions Clicks Year Month Day of the month Day of the week
count 2.592170e+05 259217.000000 259217.000000 259217.000000 2.592170e+05 259217.000000 259217.000000 259217.000000 259217.000000 259217.000000
mean 2.012233e+07 0.499998 0.403028 5.199379 2.150505e+04 471.990056 2012.153586 7.781338 15.960720 3.037390
std 3.329351e+03 0.500001 0.548785 27.054846 3.189934e+05 7848.330662 0.360552 3.399161 9.074103 1.983025
min 2.012052e+07 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2012.000000 1.000000 1.000000 0.000000
25% 2.012080e+07 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2012.000000 6.000000 8.000000 1.000000
50% 2.012100e+07 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2012.000000 8.000000 16.000000 3.000000
75% 2.012121e+07 1.000000 1.000000 1.390000 4.450000e+02 9.710000 2012.000000 10.000000 24.000000 5.000000
max 2.013021e+07 1.000000 11.000000 632.730000 1.566667e+07 441963.500000 2013.000000 12.000000 31.000000 6.000000

A few observations from the statistical description above:

  • For Average.Position, PPC, Impressions and Clicks, at least 50% of the values are 0, and those features span very large ranges.

2.2. Identify correlation between features

In [18]:
# Visualize trends on Average.Position, PPC, Impressions and Clicks
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Clicks']], palette=customPalette)
Out[18]:
<seaborn.axisgrid.PairGrid at 0x22223949828>

A few observations from the scatter plots above:

  • There is a correlation between PPC and Average.Position
  • No correlation between PPC and Impressions, or between Average.Position and Impressions
  • Strong correlation between Clicks and Impressions
  • No correlation between Clicks and PPC, or between Clicks and Average.Position

For a more precise view, we can look at the exact correlation scores between features using a heatmap.

In [19]:
# create a correlation heatmap of PPC, Impressions, and Average.Position
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[3:6]].corr()[['Average.Position']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);

We can see that PPC, Impressions and Average.Position are correlated, though not strongly.

2.3. Identify correlation between features and target variable

In [20]:
# create a correlation heatmap of all features in relation to Clicks
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[1:]].corr()[['Clicks']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);
#sns.set(font_scale=4)

#plt.savefig("heatmap_fe_4.png")

A few observations from the correlation map above:

  • Strong correlation between Impressions and Clicks
  • Zero correlation between Day of the month and Clicks, and between Day of the week and Clicks
  • Negative correlation between Month and Clicks, Year and Clicks, and PPC and Clicks
  • Weak correlation between Average.Position and Clicks

The Date feature seems to have no correlation with the target variable, but we will not drop it here. By training our model, we will see whether the feature is worth keeping.

2.4. Distribution of target variable

We will visualize the target variable to better see and understand its distribution.

In [21]:
def plot_distribution(df, col):
    print('Unique value of {}:'.format(col))
    display(df[col].unique())
    print('Number of unique value of {}: {}\n'.format(col, len(df[col].unique())))
    
    fig, ax = plt.subplots(figsize=(15,5))
    sns.set_style(style='whitegrid')
    feature = df[col]
    _ = feature.plot.hist(ax=ax, bins=15, color=['#d5ede4'])
    mean_line = ax.axvline(x=feature.mean(), linewidth=2, color='blue', label='mean {:.2f}'.format(
        feature.mean()))
    median_line = ax.axvline(x=feature.median(), linewidth=2, color='red', label='median {:.2f}'.format(
        feature.median()))
    _ = ax.legend(handles=[mean_line, median_line])
    _ = ax.set_title('Histogram of {} for Bins = 15'.format(col), fontsize='x-large')
    
    plt.show()
In [22]:
# Visualize distribution of target variable
plot_distribution(data, 'Clicks')
Unique value of Clicks:
array([  0.  ,  21.22,  23.31, ...,  98.5 ,  35.77, 163.68])
Number of unique value of Clicks: 36142

In [23]:
# Density plot of number of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True, 
             bins=15, color = 'lightblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 3})
C:\Users\user\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x22225726cc0>

It is a right-skewed distribution, so the mean is larger than the median. Right skewness can be reduced by applying root and logarithm transformations.
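
A quick way to quantify the skew (a minimal sketch using the pandas skew method):

In [ ]:
# positive skew indicates a right tail; a log1p transform should shrink it
print('skew before:', data['Clicks'].skew())
print('skew after :', np.log1p(data['Clicks']).skew())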

2.5. Distribution of feature variables

In [24]:
# Visualize distribution on features
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']], palette=customPalette)
Out[24]:
<seaborn.axisgrid.PairGrid at 0x222271e06a0>
In [25]:
# Visualize distribution of Market
sns.countplot(x ='Market', data = data)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2222dfb1c18>

We see that US-Market and UK-Market have balanced counts.

Step 3 - Feature Engineering

3.1. Normalize distribution

From the visualizations above, we can see that some features and the target variable are right-skewed. We will normalize them using a log transformation followed by a square root.

In [26]:
# Log transformation of target variable
click_log = np.log(data.Clicks)
click_log.describe()
C:\Users\user\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log
  
C:\Users\user\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3768: RuntimeWarning: invalid value encountered in multiply
  x2 = take(ap, indices_above, axis=axis) * weights_above
Out[26]:
count    2.592170e+05
mean             -inf
std               NaN
min              -inf
25%               NaN
50%               NaN
75%      2.273156e+00
max      1.299898e+01
Name: Clicks, dtype: float64

Many data points are 0 because many keywords have 0 Average.Position, and np.log(0) is -inf, which produces the warnings above. For a quick fix, we can add 1 to each data point before taking the log; this works well since the log of 1 is 0, and the ordering of all points is preserved because every point is shifted by the same amount. We also take the square root of the result to compress the tail further.
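
As a side note, numpy has a dedicated helper for this transform; a sketch (equivalent to np.log(x + 1), with better numerical precision for small x):

In [ ]:
# np.log1p(x) computes log(1 + x) accurately near 0
click_log = np.log1p(data.Clicks)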

In [27]:
click_log = np.log(data.Clicks + 1)
data['Clicks'] = np.sqrt(click_log)
data['Clicks'].describe()
Out[27]:
count    259217.000000
mean          0.691737
std           0.961168
min           0.000000
25%           0.000000
50%           0.000000
75%           1.539863
max           3.605410
Name: Clicks, dtype: float64
In [28]:
# Density Plot of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True, 
             bins=15, color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 3})
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x2222e3a3668>
In [29]:
# Log transformation of PPC
ppc_log = np.log(data.PPC + 1)
data['PPC'] = np.sqrt(ppc_log)
data['PPC'].describe()
Out[29]:
count    259217.000000
mean          0.443651
std           0.629027
min           0.000000
25%           0.000000
50%           0.000000
75%           0.933431
max           2.540005
Name: PPC, dtype: float64
In [30]:
# Density Plot of PPC
sns.distplot(data['PPC'], hist=False, kde=True, 
             bins=15, color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 3})
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x2222e3f56d8>
In [31]:
# Log transformation of Impressions
impression_log = np.log(data.Impressions + 1)
data['Impressions'] = np.sqrt(impression_log)
data['Impressions'].describe()
Out[31]:
count    259217.000000
mean          1.050147
std           1.341228
min           0.000000
25%           0.000000
50%           0.000000
75%           2.469882
max           4.070264
Name: Impressions, dtype: float64
In [32]:
# Density Plot of Impressions
sns.distplot(data['Impressions'], hist=False, kde=True, 
             bins=15, color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 3})
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x222306bdba8>

3.2. Word embedding for keyword feature

We will create word embeddings using the pretrained GloVe model for English (https://nlp.stanford.edu/projects/glove/). GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

In [33]:
# path to the 50-dimensional GloVe vectors (after downloading and unzipping glove.6B.zip)
glove_path = "glove.6B.50d.txt"
In [34]:
def load_word_embeddings(file=glove_path):
    embeddings={}
    with open(file,'r', encoding="utf-8") as infile:
        for line in infile:
            values=line.split()
            embeddings[values[0]]=np.asarray(values[1:], dtype='float64')
    return embeddings
In [35]:
embeddings = load_word_embeddings()
In [36]:
len(embeddings.keys())
Out[36]:
400000
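
A quick sanity check on the loaded embeddings (a sketch; 'cloud' is an arbitrary example word assumed to be in the vocabulary):

In [ ]:
# each word in the glove.6B.50d file maps to a 50-dimensional vector
print(embeddings['cloud'].shape)  # expected: (50,)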
In [37]:
# create an average embedding for each keyword phrase
def average_embedding(keyword, embeddings=embeddings, emb_size=50):
    # keep only alphabetic tokens that exist in the vocabulary
    token_keywords = keyword.lower().split()
    token_keywords = [w for w in token_keywords if w.isalpha() and w in embeddings]
    if len(token_keywords) == 0:
        return 0
    # average the token vectors into one 50-d vector...
    keywords_embedding = np.array([embeddings[w] for w in token_keywords])
    sentences_embedding_avg = keywords_embedding.mean(axis=0)
    # ...then collapse that vector to a single scalar (mean over dimensions)
    return sentences_embedding_avg.mean(axis=0)
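
Note that the final mean collapses each keyword to a single scalar, which discards most of the embedding information. A richer alternative (a hypothetical sketch, not part of the original pipeline) would keep all 50 dimensions as separate feature columns:

In [ ]:
# sketch: return the full 50-d average vector instead of a scalar
def average_embedding_vector(keyword, embeddings=embeddings, emb_size=50):
    tokens = [w for w in keyword.lower().split() if w.isalpha() and w in embeddings]
    if not tokens:
        return np.zeros(emb_size)
    return np.array([embeddings[w] for w in tokens]).mean(axis=0)

# expanding into 50 columns (illustrative column names):
# kw_emb = data['Keyword'].apply(lambda k: pd.Series(average_embedding_vector(k)))
# kw_emb.columns = ['kw_emb_{}'.format(i) for i in range(50)]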
In [38]:
# apply the average embedding function to the Keyword column
data['Keyword'] = data['Keyword'].apply(average_embedding)
In [39]:
data
Out[39]:
Date Keyword Market Average.Position PPC Impressions Clicks Year Month Day of the month Day of the week
0 20120524 0.022198 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
1 20120524 0.048223 0 1.0 0.887951 2.358924 1.760964 2012 5 24 3
2 20120524 0.064273 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
3 20120524 0.046857 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
4 20120524 0.031747 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
5 20120524 -0.017084 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
6 20120524 0.034216 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
7 20120524 0.049713 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
8 20120524 0.025787 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
9 20120524 0.049577 0 1.0 0.711911 2.366880 1.786306 2012 5 24 3
10 20120524 0.018618 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
11 20120524 0.078420 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
12 20120524 0.023065 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
13 20120524 0.056338 0 1.0 0.838509 2.510866 1.967113 2012 5 24 3
14 20120524 0.080128 0 1.0 0.912639 2.794991 2.236139 2012 5 24 3
15 20120524 0.080128 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
16 20120524 0.080128 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
17 20120524 0.034538 0 1.0 0.671622 2.358924 1.766897 2012 5 24 3
18 20120524 0.028561 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
19 20120524 0.041374 0 1.0 0.636761 2.408759 1.913305 2012 5 24 3
20 20120524 0.057223 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
21 20120524 0.038984 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
22 20120524 0.034538 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
23 20120524 0.007457 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
24 20120524 0.083472 0 1.0 0.724381 2.464355 1.790642 2012 5 24 3
25 20120524 0.054911 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
26 20120524 0.046519 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
27 20120524 0.033455 0 1.0 1.123075 2.504571 1.782007 2012 5 24 3
28 20120524 -0.037979 0 1.0 0.928919 2.365314 1.713471 2012 5 24 3
29 20120524 0.079269 0 0.0 0.000000 0.000000 0.000000 2012 5 24 3
... ... ... ... ... ... ... ... ... ... ... ...
264007 20130212 -0.046905 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264008 20130212 0.119748 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264009 20130212 -0.091270 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264010 20130212 0.073270 1 1.0 1.747840 1.770733 0.763031 2013 2 12 1
264011 20130212 0.119321 1 1.0 1.243154 2.277583 1.455975 2013 2 12 1
264012 20130212 0.000000 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264013 20130212 0.045349 1 1.0 1.878408 2.554701 1.807036 2013 2 12 1
264014 20130212 0.121727 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264015 20130212 -0.006922 1 1.0 1.672690 1.962179 0.963441 2013 2 12 1
264016 20130212 0.022574 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264017 20130212 0.079855 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264018 20130212 -0.098399 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264019 20130212 -0.022386 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264020 20130212 0.025320 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264021 20130212 -0.031211 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264022 20130212 0.040901 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264023 20130212 0.022664 1 1.0 1.882973 2.724732 1.994820 2013 2 12 1
264024 20130212 0.003459 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264025 20130212 -0.076353 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264026 20130212 0.087444 1 1.0 1.095429 1.770733 0.598059 2013 2 12 1
264027 20130212 0.027520 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264028 20130212 -0.109962 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264029 20130212 0.018037 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264030 20130212 0.013170 1 1.0 1.748244 2.727232 2.035586 2013 2 12 1
264031 20130212 0.057496 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264032 20130212 0.000000 1 1.0 2.455025 1.770733 0.744235 2013 2 12 1
264033 20130212 0.105813 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264034 20130212 0.275269 1 1.0 0.826497 2.054144 1.009380 2013 2 12 1
264035 20130212 0.102241 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1
264036 20130212 0.046458 1 0.0 0.000000 0.000000 0.000000 2013 2 12 1

259217 rows × 11 columns

Step 4 - Create Prediction Model

4.1. Linear Regression

In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define features and label
# x = data[['Keyword', 'Market', 'Average.Position', 'PPC', 'Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']]
x = data[['Keyword', 'Market', 'PPC', 'Year', 'Month', 'Day of the month', 'Day of the week']]
y = data ['Clicks']

# Model initialization
lr_model = LinearRegression()
# Fit the data(train the model)
lr_model.fit(x, y)
# Predict
y_predicted = lr_model.predict(x)

# model evaluation
mse = mean_squared_error(y, y_predicted)  # note: this is the MSE; np.sqrt(mse) would give RMSE
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,lr_model.coef_)
print('Intercept:', lr_model.intercept_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

# plotting values

# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
Slope: [-0.70056741 -0.13729769  1.17123246 -0.04475564 -0.01203647  0.0033549
  0.00259599]
Intercept: 90.34532927085633
Mean squared error:  0.3670406430227459
R2 score:  0.6027011799394391
In [41]:
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x+1)) transform applied earlier: clicks = exp(v**2) - 1
sns.distplot(np.exp(data.Clicks**2) - 1, kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.exp(y_predicted**2) - 1, kde=False, color='#69547C', bins=50, label='click_pred')

plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');

#plt.savefig("y_vs_yhat.png");

4.2. Lasso Regression

In [42]:
from sklearn.linear_model import LassoCV

# Model initialization
lasso_model = LassoCV()
# Fit the data(train the model)
lasso_model.fit(x, y)
# Predict
y_predicted = lasso_model.predict(x)

# model evaluation
mse = mean_squared_error(y, y_predicted)
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,lasso_model.coef_)
print('Intercept:', lasso_model.intercept_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

# plotting values

# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
C:\Users\user\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
Slope: [-0.6120999  -0.13571678  1.16934205 -0.03216596 -0.01090186  0.00338334
  0.00251832]
Intercept: 65.0018013552894
Mean squared error:  0.3670899665200862
R2 score:  0.6026477903008063
In [43]:
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x+1)) transform applied earlier: clicks = exp(v**2) - 1
sns.distplot(np.exp(data.Clicks**2) - 1, kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.exp(y_predicted**2) - 1, kde=False, color='#69547C', bins=50, label='click_pred')

plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');

#plt.savefig("y_vs_yhat.png");

4.3. Ridge Regression

In [44]:
from sklearn.linear_model import RidgeCV

# Model initialization
ridge_model = RidgeCV()
# Fit the data(train the model)
ridge_model.fit(x, y)
# Predict
y_predicted = ridge_model.predict(x)

# model evaluation
mse = mean_squared_error(y, y_predicted)
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,ridge_model.coef_)
print('Intercept:', ridge_model.intercept_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

# plotting values

# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
Slope: [-0.70005009 -0.13729783  1.17121593 -0.04475152 -0.01203618  0.0033549
  0.00259601]
Intercept: 90.33703342985348
Mean squared error:  0.36704064449516854
R2 score:  0.6027011783456326