Introduction

This is a quick example of the output produced by the https://github.com/anthonypc/phrasePartAnalysis project, as described in the blog post Text Processing, N-Grams And Paid Search.

The initial analysis of the file and the creation of the ngram lists are all handled in the code. Other than R, the process should not need any other tools. For ease of processing, I recommend following the instructions on using PowerShell to handle non-Latin characters and avoid encoding issues.

Please read the readme on GitHub and the comments in the code itself for an explanation of how it works, what the inputs need to be, and an overview of the process.

Summary

This process identifies two and three word combinations in search terms and assigns performance statistics to them. The objective is to identify phrase parts within the account, some of which will be shared between campaigns and ad groups, and to use that information to shape keyword strategy.
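
As a rough illustration of the phrase part extraction described above, the following base R sketch splits each search term into overlapping two word combinations and sums clicks per ngram. The helper make_ngrams and the toy data frame are hypothetical and are not taken from the project scripts; the search terms and click figures are borrowed from the example output later in this document.

```r
# Hypothetical helper: split a search term into overlapping n-word phrases.
make_ngrams <- function(term, n) {
  words <- strsplit(trimws(term), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input with the same column names as the example output.
search_terms <- data.frame(
  Search.term = c("ase are one", "next name word string", "name word"),
  Clicks = c(39, 34, 36),
  stringsAsFactors = FALSE
)

# One row per (search term, ngram) pair; the term's clicks are
# attributed in full to every ngram it contains.
rows <- lapply(seq_len(nrow(search_terms)), function(i) {
  grams <- make_ngrams(search_terms$Search.term[i], 2)
  if (length(grams) == 0) return(NULL)
  data.frame(ngram = grams,
             Clicks = search_terms$Clicks[i],
             stringsAsFactors = FALSE)
})
bigrams <- do.call(rbind, rows)

# Sum clicks for each ngram across all search terms.
clicks_by_ngram <- aggregate(Clicks ~ ngram, data = bigrams, FUN = sum)
print(clicks_by_ngram)
```

Because each search term's statistics are copied to every ngram it contains, the same search term row can appear more than once in the detailed output.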

Output

Two scripts, described in the readme, produce the output in this document. Most of it comes from the ngram-outlier-influence-0-01.R script, though some example output from ngrams-ext-1-00.R is included.

ngrams-ext-1-00.R Output

The tables produced from line 145 in the main R script (ngrams-ext-1-00.R) are useful for identifying ngrams shared across the account. The tables below and others like them with additional statistics would meet this need.

For the most part, the volume of clicks is not evenly split between the labels in the example set.

kable(head(summary2Dcast), digits=2)
ngram campaignLabel campaignOther total
text word 510 449 959
are there 202 313 515
name word 0 463 463
test string 169 293 462
other text 102 233 335
next other 0 289 289
kable(head(summary3Dcast), digits=2)
ngram campaignLabel campaignOther total
one text word 158 55 213
word test string 56 116 172
other text word 39 127 166
there test string 50 95 145
ase text word 63 72 135
other name word 0 129 129
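
A table of this shape can be sketched in base R as follows. This is a minimal illustration, not the project's own code (the object names summary2Dcast and summary3Dcast suggest the scripts use a dcast-style reshape); the long-format data frame here is hypothetical, with click figures copied from the two word table above.

```r
# Hypothetical long-format input: one row per ngram and label.
long <- data.frame(
  ngram  = c("text word", "text word", "name word",
             "test string", "test string"),
  Labels = c("campaignLabel", "campaignOther", "campaignOther",
             "campaignLabel", "campaignOther"),
  Clicks = c(510, 449, 463, 169, 293),
  stringsAsFactors = FALSE
)

# Spread labels into columns, summing clicks per cell.
wide <- tapply(long$Clicks, list(long$ngram, long$Labels), sum)
wide[is.na(wide)] <- 0  # ngrams absent from a label get zero clicks

# Add a total column and sort by it, largest first.
wide <- cbind(wide, total = rowSums(wide))
wide <- wide[order(wide[, "total"], decreasing = TRUE), , drop = FALSE]
print(wide)
```

An ngram with a zero in one column, such as "name word", is one that is not shared between the labels.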

The following output shows conversion rate by ngram and label.

kable(head(summary3Dcasta), digits=2)
ngram campaignLabel campaignOther
are test string 4.78 -Inf
are there do -Inf 8.00
are there dou -Inf NA
are there on NA -Inf
are there one NA -Inf
are there string NA 6.56

The following tables show the main CSV output. One file is produced for two word phrase parts and another for three word phrase parts.

kable(head(arrange(labelNgrams.work_file2, desc(Clicks))), digits=2)
Labels Campaign Keyword Search.term Impressions Clicks Cost Converted.clicks ngram ctr cpc cpa cvr
campaignLabel Campaign 02 Keyword 01 ase are one 325 39 100.49 13 ase are 0.12 2.58 7.73 0.33
campaignLabel Campaign 02 Keyword 01 ase are one 325 39 100.49 13 are one 0.12 2.58 7.73 0.33
campaignLabel Campaign 02 Keyword 28 ase there one 38 38 79.04 0 there one 1.00 2.08 NA 0.00
campaignLabel Campaign 02 Keyword 28 ase there one 38 38 79.04 0 ase there 1.00 2.08 NA 0.00
campaignOther Campaign 04 Keyword 12 name word 1710 36 63.54 0 name word 0.02 1.76 NA 0.00
campaignOther Campaign 04 Keyword 06 next name word string 34 34 20.06 0 word string 1.00 0.59 NA 0.00
kable(head(arrange(labelNgrams.work_file3, desc(Clicks))), digits=2)
Labels Campaign Keyword Search.term Impressions Clicks Cost Converted.clicks ngram ctr cpc cpa cvr
campaignLabel Campaign 02 Keyword 01 ase are one 325 39 100.49 13 ase are one 0.12 2.58 7.73 0.33
campaignLabel Campaign 02 Keyword 28 ase there one 38 38 79.04 0 ase there one 1.00 2.08 NA 0.00
campaignOther Campaign 04 Keyword 06 next name word string 34 34 20.06 0 name word string 1.00 0.59 NA 0.00
campaignOther Campaign 04 Keyword 06 next name word string 34 34 20.06 0 next name word 1.00 0.59 NA 0.00
campaignOther Campaign 04 Keyword 01 phrase name word uno 275 33 85.03 11 name word uno 0.12 2.58 7.73 0.33
campaignOther Campaign 04 Keyword 01 phrase name word uno 275 33 85.03 11 phrase name word 0.12 2.58 7.73 0.33
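
The derived columns in these tables are consistent with the usual paid search definitions. The sketch below is inferred from the figures shown rather than taken from the project code, and add_metrics is a hypothetical helper; the two input rows are copied from the tables above.

```r
# Hypothetical helper: derive rate and cost metrics from raw counts.
add_metrics <- function(df) {
  df$ctr <- df$Clicks / df$Impressions          # click-through rate
  df$cpc <- df$Cost / df$Clicks                 # cost per click
  # Cost per acquisition is undefined when there are no conversions.
  df$cpa <- ifelse(df$Converted.clicks > 0,
                   df$Cost / df$Converted.clicks,
                   NA)
  df$cvr <- df$Converted.clicks / df$Clicks     # conversion rate
  df
}

example <- data.frame(
  Impressions      = c(325, 38),
  Clicks           = c(39, 38),
  Cost             = c(100.49, 79.04),
  Converted.clicks = c(13, 0)
)

result <- add_metrics(example)
print(round(result, 2))
```

The NA in the cpa column for rows without conversions matches the tables above.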

Plots

The graphs are fairly straightforward, though the initial distribution did require some work on the data set to produce the chart below.

A histogram for clicks within the account by ngram demonstrates the distribution of volume. This example also includes a density plot.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Univariate distributions are useful, but in paid search analysis scatter plots are often more useful still, because bivariate outliers tend to be more informative than univariate ones: for example, data points with extreme cost and volume figures, as in this chart. The values are log transformed to produce a slightly more normal distribution.

The following is an example that manipulates the labels on the points to highlight those of interest, and leave the plot ‘fairly’ uncluttered.

The following table is a simple look at the distribution of CVR per ngram within groups of labels and campaigns. The data is not really appropriate for this, but it is an interesting exercise in seeing how the number of observations can affect the error of the mean CVR.

Campaign Labels mcvr n mean sd se ci
Campaign 01 campaignLabel 0.23 4 0.29 0.17 0.09 0.27
Campaign 02 campaignLabel 0.19 12 0.31 0.29 0.08 0.18
Campaign 03 campaignOther 0.22 5 0.34 0.25 0.11 0.31
Campaign 04 campaignOther 0.10 16 0.13 0.09 0.02 0.05
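
The se and ci columns appear consistent with the usual standard error of the mean and a t-based 95% confidence half-width. That is an inference from the figures rather than something the source states, and se_ci below is a hypothetical helper. The sketch recomputes both columns from the rounded n and sd values in the table, so small rounding differences are expected.

```r
# Hypothetical helper: standard error and confidence half-width
# of a mean, given the sample standard deviation and size.
se_ci <- function(sd, n, conf = 0.95) {
  se <- sd / sqrt(n)
  # Half-width of a two-sided t interval with n - 1 degrees of freedom.
  ci <- se * qt(1 - (1 - conf) / 2, df = n - 1)
  data.frame(n = n, sd = sd, se = se, ci = ci)
}

# n and sd as printed for Campaign 01 to Campaign 04.
groups <- se_ci(sd = c(0.17, 0.29, 0.25, 0.09), n = c(4, 12, 5, 16))
print(groups)
```

The group with 16 observations has a much narrower interval than the group with 4, both because its spread is smaller and because the t multiplier shrinks as n grows.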

This is another plot of average CVR, where the actual CVR for the group (the label in this case) is plotted as a line through each facet.

For context, here is a plot of clicks against CVR without a transformation applied to the values.

Extreme Values

ngram-outlier-influence-0-01.R Output

Ideally, keywords grouped together in a paid search account perform more or less the same: the relationship between what is spent and how much traffic is received should be close, and likewise for the number of clicks and the number of conversions.

There are a few techniques used for testing the assumptions of multivariate regression. A number of these are used for identifying outliers, points with high leverage, and influential data points. Both Mahalanobis Distance and Cook’s Distance are used to address these issues. Although the data used here is certainly not appropriate for regression, the two measures mentioned above can still be used to identify points that do not exhibit the same relationship between Clicks and Conversions.

Most distributions of those values would look a little more like this, where over time a relationship should emerge between the two values as optimisation activity in the account stabilises performance, though most likely with a heavier skew towards 0 on either axis. The example set in the repository does not show this relationship.

The following is a simple check for outliers against the fitted linear model with conversions as an outcome by clicks against campaigns and weighted on cost.

The outliers reported below were identified by a Bonferroni Outlier Test from the car package.

##      rstudent unadjusted p-value Bonferonni p
## 123  6.650903         5.7353e-10   8.7177e-08
## 74   6.171982         6.5755e-09   9.9948e-07
## 161  4.338993         2.6896e-05   4.0882e-03
## 55  -4.186763         4.9220e-05   7.4814e-03
## 126 -3.916863         1.3847e-04   2.1047e-02
ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr Observation
next other Campaign 04 campaignOther 1304 639.08 289 0 0.00 126
word uno Campaign 04 campaignOther 637 261.21 122 36 0.30 161
name word Campaign 04 campaignOther 3191 726.32 403 44 0.11 123
text word Campaign 02 campaignLabel 1408 601.17 307 42 0.14 74
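
For readers without the car package, the same kind of test can be approximated in base R. The sketch below is an illustration under stated assumptions, not the project's code: the data is synthetic, the model formula (conversions on clicks and campaign, weighted by cost) is one reading of the description above, and the Bonferroni adjustment simply multiplies each unadjusted p-value by the number of observations.

```r
# Synthetic data: a steady clicks-to-conversions relationship
# with one planted outlier at observation 20.
n <- 40
clicks <- seq(10, 400, length.out = n)
campaign <- factor(rep(c("Campaign 01", "Campaign 02"), length.out = n))
cost <- clicks * 2
conversions <- 0.1 * clicks + 2 * sin(seq_len(n))
conversions[20] <- conversions[20] + 30

# Assumed model: conversions by clicks and campaign, weighted on cost.
fit <- lm(conversions ~ clicks + campaign, weights = cost)

# Studentized residuals follow a t distribution with one fewer
# degree of freedom than the model's residual df.
rstud <- rstudent(fit)
df <- df.residual(fit) - 1
p_unadj <- 2 * pt(abs(rstud), df = df, lower.tail = FALSE)
p_bonf <- pmin(1, p_unadj * length(rstud))  # Bonferroni adjustment

outliers <- data.frame(observation = seq_along(rstud),
                       rstudent = rstud,
                       p_unadj = p_unadj,
                       p_bonf = p_bonf)
outliers <- outliers[order(outliers$p_bonf), ]
print(outliers[outliers$p_bonf < 0.05, ])
```

Observations whose adjusted p-value falls below the chosen threshold (0.05 here, an arbitrary choice) are the candidates to inspect by hand.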

The first table below lists the top ten individual observations by clicks that match the ngrams identified above. Next is a quick review of the performance, by campaign, of the ngrams identified as outliers.

Keyword Search.term Impressions Clicks Cost Converted.clicks cvr
Keyword 12 name word 1710 36 63.54 0 0.00
Keyword 06 next name word string 34 34 20.06 0 0.00
Keyword 01 phrase name word uno 275 33 85.03 11 0.33
Keyword 01 phrase name word uno 275 33 85.03 11 0.33
Keyword 05 next name word uno 50 30 61.90 20 0.67
Keyword 05 text word test string 50 30 61.90 20 0.67
Keyword 05 text word test string 50 30 61.90 20 0.67
Keyword 05 next name word uno 50 30 61.90 20 0.67
Keyword 17 one text word word phrase 126 28 121.94 0 0.00
Keyword 20 one text word word thing 63 27 14.94 0 0.00

The following is a table of clicks by ngram and campaign.

ngram Campaign 01 Campaign 02 Campaign 03 Campaign 04
name word 0 0 60 403
next other 0 0 0 289
test string 104 65 57 236
text word 203 307 76 373
word uno 32 22 18 122

The following is a table of conversion rate by ngram and campaign.

ngram Campaign 01 Campaign 02 Campaign 03 Campaign 04
name word NA NA 0.00 0.11
next other NA NA NA 0.00
test string 0.17 0.45 0.00 0.00
text word 0.00 0.14 0.18 0.03
word uno 0.00 1.00 0.00 0.30

Other Tests

There are a number of different methods included here for identifying extreme points. The next is a look at the rows returned by Cook’s Distance. This test identified fewer points than the previous one, and it also identified observation 28, which has an unusually low number of conversions for the clicks it has received.

ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr Observation
text word Campaign 01 campaignLabel 685 422.79 203 0 0.00 28
text word Campaign 02 campaignLabel 1408 601.17 307 42 0.14 74
name word Campaign 04 campaignOther 3191 726.32 403 44 0.11 123

Next is a plot of residuals against the model where the Cook’s distance of each point is displayed as the size of its circle. The plot has labelled the most influential points, those which have the most effect on the linear model created at the start of this process.

Again the observations identified were different from the outlier test.

##        StudRes        Hat     CookD
## 28  -1.5078970 0.72091052 0.8530673
## 55  -4.1867634 0.24108652 0.7901841
## 74   6.1719823 0.59133772 2.3406968
## 96  -0.5678472 0.73891135 0.3385422
## 123  6.6509026 0.37161567 1.5858359
## 126 -3.9168633 0.13421568 0.5199644
## 138 -3.0210812 0.08818068 0.3231677
## 145 -2.4970387 0.27776862 0.5378105
## 161  4.3389925 0.03111095 0.2593101
ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr Observation
text word Campaign 01 campaignLabel 685 422.79 203 0 0.00 28
text word Campaign 02 campaignLabel 1408 601.17 307 42 0.14 74
other text Campaign 03 campaignOther 411 238.56 135 14 0.10 96
name word Campaign 04 campaignOther 3191 726.32 403 44 0.11 123
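
The three columns in the influence output above (studentized residual, hat value and Cook's distance) can all be computed with base R. The following sketch uses synthetic data and a simplified, unweighted model, so it will not reproduce the figures above; the 4/n cutoff is a common rule of thumb and an assumption here, not something the source specifies.

```r
# Synthetic data: one observation with far more clicks than the rest
# (high leverage) and far fewer conversions than the trend predicts.
n <- 30
clicks <- c(seq(10, 200, length.out = n - 1), 1000)
conversions <- 0.1 * clicks + 2 * sin(seq_len(n))
conversions[n] <- 20

fit <- lm(conversions ~ clicks)

influence_table <- data.frame(
  observation = seq_len(n),
  StudRes = rstudent(fit),       # outlyingness in the response
  Hat     = hatvalues(fit),      # leverage from the predictor values
  CookD   = cooks.distance(fit)  # combined influence on the fit
)

# Keep only points above the rule-of-thumb cutoff, most influential first.
cutoff <- 4 / n
flagged <- influence_table[influence_table$CookD > cutoff, ]
flagged <- flagged[order(flagged$CookD, decreasing = TRUE), ]
print(flagged)
```

A point can have a large studentized residual but little influence if its leverage is low, which is one reason the outlier test and Cook's distance can flag different observations.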

By Campaign

The following are a series of graphs and tables of the relationship between clicks and converted clicks. Both Mahalanobis Distance and Cook’s Distance are included in the tables for each campaign.

In the output below, the first table for each campaign under the scatterplot is sorted by Cook’s Distance and the second by Mahalanobis Distance. It is interesting that the two tables, while sharing observations, order them differently.
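
A minimal base R sketch of the two orderings follows. It uses only the six Campaign 01 rows printed in this document and a simple model of converted clicks on clicks, both of which are assumptions, so the distances will not match the md1 and cd1 columns, which were presumably computed on the full data set.

```r
# Six Campaign 01 rows, copied from the example output.
campaign <- data.frame(
  ngram = c("text word", "test string", "other are",
            "other there", "there test", "ase text"),
  Clicks = c(203, 104, 93, 64, 34, 106),
  Converted.clicks = c(0, 18, 15, 18, 18, 0),
  stringsAsFactors = FALSE
)

# Mahalanobis distance: how far each point sits from the centre of
# the clicks/conversions cloud, scaled by the covariance.
x <- as.matrix(campaign[, c("Clicks", "Converted.clicks")])
campaign$md <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Cook's distance: how much each point changes a fitted linear model.
fit <- lm(Converted.clicks ~ Clicks, data = campaign)
campaign$cd <- cooks.distance(fit)

# The same rows, sorted by each measure in turn.
by_cook <- campaign[order(campaign$cd, decreasing = TRUE), ]
by_md   <- campaign[order(campaign$md, decreasing = TRUE), ]
print(by_cook)
print(by_md)
```

The two measures answer different questions. Mahalanobis distance ignores which variable is the outcome, while Cook's distance depends on the fitted model, so differing orderings are to be expected.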

Campaign 01

ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr md1 cd1
text word Campaign 01 campaignLabel 685 422.79 203 0 0.0000000 21.343686 9.4435401
test string Campaign 01 campaignLabel 410 286.63 104 18 0.1730769 11.741934 0.7495201
other are Campaign 01 campaignLabel 273 226.08 93 15 0.1612903 7.862069 0.2370295
other there Campaign 01 campaignLabel 101 173.90 64 18 0.2812500 11.045909 0.1982519
there test Campaign 01 campaignLabel 57 103.62 34 18 0.5294118 12.214889 0.1109411
ase text Campaign 01 campaignLabel 2964 248.51 106 0 0.0000000 3.541226 0.0333855
ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr md1 cd1
text word Campaign 01 campaignLabel 685 422.79 203 0 0.0000000 21.343686 9.4435401
there test Campaign 01 campaignLabel 57 103.62 34 18 0.5294118 12.214889 0.1109411
test string Campaign 01 campaignLabel 410 286.63 104 18 0.1730769 11.741934 0.7495201
other there Campaign 01 campaignLabel 101 173.90 64 18 0.2812500 11.045909 0.1982519
other are Campaign 01 campaignLabel 273 226.08 93 15 0.1612903 7.862069 0.2370295
ase text Campaign 01 campaignLabel 2964 248.51 106 0 0.0000000 3.541226 0.0333855

Campaign 02

ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr md2 cd2
text word Campaign 02 campaignLabel 1408 601.17 307 42 0.1368078 22.923721 17.0378336
one text Campaign 02 campaignLabel 1138 519.60 232 0 0.0000000 11.304012 1.9416931
word word Campaign 02 campaignLabel 544 425.11 172 0 0.0000000 5.292963 0.2276719
one word Campaign 02 campaignLabel 1405 379.61 147 0 0.0000000 3.474184 0.0914093
ase are Campaign 02 campaignLabel 551 230.61 89 22 0.2471910 3.355504 0.0767080
test string Campaign 02 campaignLabel 300 115.11 65 29 0.4461538 7.467613 0.0550768
ngram Campaign Labels Impressions Cost Clicks Converted.clicks cvr md2 cd2
text word Campaign 02 campaignLabel 1408 601.17 307 42 0.1368078 22.923721 17.0378336
one text Campaign 02 campaignLabel 1138 519.60 232 0 0.0000000 11.304012 1.9416931
test string Campaign 02 campaignLabel 300 115.11 65 29 0.4461538 7.467613 0.0550768
word uno Campaign 02 campaignLabel 44 36.46 22 22 1.0000000 5.382057 0.0068151
word word Campaign 02 campaignLabel 544 425.11 172 0 0.0000000 5.292963 0.2276719
word test Campaign 02 campaignLabel 120 61.90 30 20 0.6666667 4.044613 0.0140388

Campaign 03