An advertising company sells a service of buying keywords in search engines on behalf of its customers. It is trying to optimise its keyword and funds allocation. The first step towards the optimal solution is to predict performance by keyword and fund. In this case, the goal is to predict the "Clicks" feature for new keywords that have no historical data.
The data comes from the train.csv file with the following features:
Date : The date the data was collected (yyyymmdd)
Market : The market (US/UK)
Keyword : The keyword
Average.Position : The average position the keyword had in a search engine
PPC : The amount of money agreed to be paid per click on a keyword
Impressions : The number of users who saw the ad
Clicks : The number of clicks a keyword had
Since this is a numerical predictive modelling problem, my original approach is to build a model using a linear regression algorithm.
0.1 Load data
0.2 Categorize features based on types
0.3 Define target variable
1.1 Identify missing values
1.2 Clean text data
1.3 Encode categorical feature
1.4 Convert the Date feature from integer type to a date object
1.5 Expand Date feature into 4 new features
2.1 Descriptive statistics
2.2 Identify correlation between features
2.3 Identify correlation between features and target variable
2.4 Distribution of target variable
2.5 Distribution of feature variables
3.1 Normalize distribution
3.2 Word Embedding for Keyword feature
4.1 Linear Regression
4.2 Lasso Regression
4.3 Ridge Regression
5.1 Prepare unseen data
5.2 Predict unseen data
6.1 Recap and insight
6.2 Future work
6.3 Answer test questions
This is the basic step where we import all the libraries we need and load the data.
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('train.csv')
data.shape
There are 264037 observations and 7 features. Let's see the first 10 rows of the data.
data.head(10)
data.dtypes
We have 5 numerical features and 2 object features, which are text data. We know that Date should not be an integer type; we will explore this further in the next step.
The goal of this task is to predict the number of clicks for new keywords, so it is clear that the 'Clicks' feature will be the target variable.
data.Clicks.value_counts()
Note that the data is heavily skewed: there is a large number of 0.0 values and a wide range of values. We need to normalize it, which we will explore in a later step.
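As a quick numeric check (using pandas' built-in skew method), a large positive skewness confirms the long right tail:
# skewness well above 0 indicates a right-skewed distribution
data['Clicks'].skew()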
# identify missing values in data
data.info()
From the result above, we can see there are missing values in the 'Keyword' feature.
# select data where Keyword is empty
data[data['Keyword'].isna()]
There are 383 missing values in the Keyword feature. Interestingly, these rows have an Average.Position of 1 and gain high impressions, so we will keep them. Another thing we must consider is that in the evaluate.csv file there are no NaN values for Keyword, only empty strings. Therefore, we need to convert the NaN values in Keyword to an empty string.
# Convert NaN value in Keyword to ''
data['Keyword'] = data.Keyword.fillna('')
data['Keyword']
# verify that no NaN values remain in Keyword
data[data['Keyword'].isna()]
When working with text data, we need to check that the text is free of noisy characters. We can do this by checking for non-ASCII characters with the regex pattern '[^\x00-\x7F]+': the range \x00-\x7F covers the ASCII characters 0-127, and the ^ inside the character class negates it, so the pattern matches any run of characters outside the ASCII range anywhere in the text.
# identify non ascii characters
data_non_ascii = data[data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data_non_ascii
Interestingly, all the non-ASCII characters that were flagged as noise come from words in other languages. However, we only have an English-language dictionary for further text processing, so we need to remove all rows containing non-English words, identified here as rows with non-ASCII characters. The total should now be 264037 - 4820 rows.
# total rows after removing non ascii characters
total_rows = 264037 - 4820
total_rows
# data without non ascii characters in Keyword column
# We invert this mask using ~ and use this to mask the data
data = data[~data['Keyword'].str.contains(r'[^\x00-\x7F]+')]
data
We have the feature 'Market', which is categorical. We will encode US-Market as 0 and UK-Market as 1.
data = data.copy()
# map the two market labels to integers
data['Market'] = data['Market'].map({'US-Market': 0, 'UK-Market': 1})
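As a quick sanity check, the encoded column should now contain only 0 and 1:
# sanity check: Market should now hold only the values 0 and 1
data['Market'].unique()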
data
The Date feature records the date the data was collected. We will convert it from integer type to a date object.
from datetime import date
def int2date(argdate: int) -> date:
    """
    If you have a date as an integer, use this method to obtain a datetime.date object.

    Parameters
    ----------
    argdate : int
        Date as a regular integer value (example: 20160618)

    Returns
    -------
    datetime.date
        A date object which corresponds to the given value `argdate`.
    """
    year = argdate // 10000
    month = (argdate % 10000) // 100
    day = argdate % 100
    return date(year, month, day)
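As a quick sanity check, using the example value from the docstring:
# sanity check with the docstring example
assert int2date(20160618) == date(2016, 6, 18)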
# convert integer to date object in Date column
data['New Date'] = data['Date'].apply(int2date)
data
We will expand this feature into 4 new features: year, month, day of the month, and day of the week. In this way we get new variables that an ML model will be able to process to find possible dependencies and correlations. The "Day of the week" variable contains values from 0 to 6, where each number represents a specific day of the week (from Monday to Sunday). We could drop the original "Date" variable now that we have these, but we will keep it for the moment and decide after checking its correlation with the target.
# create temporary column
data['Date Time'] = pd.to_datetime(data['New Date'])
data['Year'] = data['Date Time'].dt.year
data['Month'] = data['Date Time'].dt.month
data['Day of the month'] = data['Date Time'].dt.day
data["Day of the week"] = data['Date Time'].dt.dayofweek
# drop temporary columns
data = data.drop(['New Date'], axis=1)
data = data.drop(['Date Time'], axis=1)
#data = data.drop(['Date'], axis=1)
data
# getting summary stats on Data after preprocessing
data.describe()
A few observations we can see from the descriptive statistics above:
# Visualize trends on Average.Position, PPC, Impressions and Clicks
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Clicks']], palette=customPalette)
A few observations we can see from the scatter plots above:
To be more precise, we can look at the exact correlation scores between features using a heatmap.
# create a correlation heatmap of PPC, Impressions, and Average.Position
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[3:6]].corr()[['Average.Position']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);
We can see that PPC, Impressions and Average.Position are correlated, though not strongly.
# create a correlation heatmap of all features in relation to Clicks
fig, ax = plt.subplots(figsize = (2, 15))
cmap = sns.diverging_palette(130, 275, as_cmap=True)
sns.heatmap(data[data.columns[1:]].corr()[['Clicks']], annot=True, linewidths=.4, fmt=".1f", cmap=cmap, ax=ax);
#sns.set(font_scale=4)
#plt.savefig("heatmap_fe_4.png")
A few observations we can see from the correlation map above:
The Date feature seems to have no correlation with the target variable, but we will not drop it here; by training our model, we will see whether the feature is worth keeping.
We will visualize the data of the target variable to better see and understand its distribution.
from IPython.display import display  # display() is available by default in notebooks

def plot_distribution(df, col):
    print('Unique values of {}:'.format(col))
    display(df[col].unique())
    print('Number of unique values of {}: {}\n'.format(col, len(df[col].unique())))
    fig, ax = plt.subplots(figsize=(15, 5))
    sns.set_style(style='whitegrid')
    feature = df[col]
    _ = feature.plot.hist(ax=ax, bins=15, color=['#d5ede4'])
    mean_line = ax.axvline(x=feature.mean(), linewidth=2, color='blue',
                           label='mean {:.2f}'.format(feature.mean()))
    median_line = ax.axvline(x=feature.median(), linewidth=2, color='red',
                             label='median {:.2f}'.format(feature.median()))
    _ = ax.legend(handles=[mean_line, median_line])
    _ = ax.set_title('Histogram of {} for Bins = 15'.format(col), fontsize='x-large')
    plt.show()
# Visualize distribution of target variable
plot_distribution(data, 'Clicks')
# Density plot of number of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True,
bins=15, color = 'lightblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
It's a right-skewed distribution, so the mean will be larger than the median. Right skewness can be reduced by applying root and logarithm transformations.
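As a small illustration (with arbitrary values, not taken from the data), a log transform compresses large values far more than small ones, which pulls in the right tail:
# log1p(x) = log(x + 1): maps 0 -> 0.0, 10 -> ~2.40, 1000 -> ~6.91
np.log1p([0, 10, 1000])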
# Visualize distribution on features
customPalette = sns.set_palette(sns.color_palette(['#7bc2e0']))
sns.pairplot(data[['Average.Position', 'PPC','Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']], palette=customPalette)
# Visualize distribution of Market
sns.countplot(x ='Market', data = data)
We see that US-Market and UK-Market have a balanced number of observations.
# Log transformation of the target variable
helpful_log = np.log(data.Clicks)
helpful_log.describe()
Many data points are 0, and the log of 0 is undefined (numpy returns -inf). For a quick fix, we can add 1 to each data point before taking the log. This works well since the log of 1 is 0. Furthermore, the spread is retained since all points are increased by the same amount.
click_log = np.log(data.Clicks + 1)
data['Clicks'] = np.sqrt(click_log)
data['Clicks'].describe()
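Since the transform applied above is sqrt(log(x + 1)), the inverse mapping back to the original click scale is exp(x^2) - 1. A quick sketch, which we will also need later when comparing predictions against actual clicks:
# inverse transform: recover the original click scale from sqrt(log(x + 1))
np.expm1(data['Clicks'] ** 2).describe()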
# Density Plot of Clicks
sns.distplot(data['Clicks'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
# Log transformation of PPC
ppc_log = np.log(data.PPC + 1)
data['PPC'] = np.sqrt(ppc_log)
data['PPC'].describe()
# Density Plot of PPC
sns.distplot(data['PPC'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
# Log transformation of Impressions
impression_log = np.log(data.Impressions + 1)
data['Impressions'] = np.sqrt(impression_log)
data['Impressions'].describe()
# Density Plot of Impressions
sns.distplot(data['Impressions'], hist=False, kde=True,
bins=15, color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 3})
We will create word embedding using Glove pretrained model for English https://nlp.stanford.edu/projects/glove/. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
# After downloading:
glove_path = "glove.6B.50d.txt"
def load_word_embeddings(file=glove_path):
    embeddings = {}
    with open(file, 'r', encoding="utf-8") as infile:
        for line in infile:
            values = line.split()
            # first token is the word, the rest is its 50-dim vector
            embeddings[values[0]] = np.asarray(values[1:], dtype='float64')
    return embeddings
embeddings = load_word_embeddings()
len(embeddings.keys())
# create average feature embedding for each sentence
def average_embedding(keyword, embeddings=embeddings, emb_size=50):
    token_keywords = keyword.lower().split()
    # keep only alphabetic tokens that exist in the GloVe vocabulary
    token_keywords = [w for w in token_keywords if w.isalpha() and w in embeddings]
    if len(token_keywords) == 0:
        return 0
    keywords_embedding = np.array([embeddings[w] for w in token_keywords])
    # mean over tokens yields a 50-dim vector; the second mean collapses it to a scalar
    sentences_embedding_avg = keywords_embedding.mean(axis=0)
    return sentences_embedding_avg.mean(axis=0)
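Note that this collapses each keyword's 50-dimensional average vector into a single scalar, so only one number per keyword enters the model. A quick illustration (the example keyword here is arbitrary, assuming the GloVe file loaded above):
# returns a single float for an in-vocabulary keyword, 0 otherwise
average_embedding('cheap flights')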
# apply average embedding function to Keyword column
data['Keyword'] = data['Keyword'].apply(average_embedding)
data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Define features and label
# x = data[['Keyword', 'Market', 'Average.Position', 'PPC', 'Impressions', 'Year', 'Month', 'Day of the month', 'Day of the week']]
x = data[['Keyword', 'Market', 'PPC', 'Year', 'Month', 'Day of the month', 'Day of the week']]
y = data['Clicks']
# Model initialization
lr_model = LinearRegression()
# Fit the data(train the model)
lr_model.fit(x, y)
# Predict
y_predicted = lr_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,lr_model.coef_)
print('Intercept:', lr_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
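Note that the metrics above are computed on the same data the model was trained on, so they are optimistic. A minimal sketch of a more honest evaluation on a held-out split (the 80/20 split and random_state are arbitrary choices, not part of the original analysis):
# (sketch) hold out 20% of the rows and evaluate on them
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
holdout_model = LinearRegression().fit(x_train, y_train)
y_test_pred = holdout_model.predict(x_test)
print('Hold-out RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('Hold-out R2:', r2_score(y_test, y_test_pred))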
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x + 1)) transform to compare on the original click scale
sns.distplot(np.expm1(data.Clicks ** 2), kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.expm1(y_predicted ** 2), kde=False, color='#69547C', bins=50, label='click_pred')
plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');
#plt.savefig("y_vs_yhat.png");
from sklearn.linear_model import LassoCV
# Model initialization
lasso_model = LassoCV()
# Fit the data(train the model)
lasso_model.fit(x, y)
# Predict
y_predicted = lasso_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,lasso_model.coef_)
print('Intercept:', lasso_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()
# show distribution of target variable versus predictions
fig, ax = plt.subplots(figsize = (11, 8))
# invert the sqrt(log(x + 1)) transform to compare on the original click scale
sns.distplot(np.expm1(data.Clicks ** 2), kde=False, color='#A15EDB', bins=50, label='click_actual')
sns.distplot(np.expm1(y_predicted ** 2), kde=False, color='#69547C', bins=50, label='click_pred')
plt.xlabel('Clicks', fontsize=19, labelpad=11)
plt.xticks(fontsize=14)
plt.ylabel('Count', fontsize=19, labelpad=11)
plt.yticks(fontsize=14)
plt.legend(loc='upper right');
#plt.savefig("y_vs_yhat.png");
from sklearn.linear_model import RidgeCV
# Model initialization
ridge_model = RidgeCV()
# Fit the data(train the model)
ridge_model.fit(x, y)
# Predict
y_predicted = ridge_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,ridge_model.coef_)
print('Intercept:', ridge_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
#plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
# predicted values
plt.plot(x, y_predicted, color='b')
plt.show()