This is a quick example of the kind of output produced by the https://github.com/anthonypc/phrasePartAnalysis project, as described in the blog post Text Processing, N-Grams And Paid Search.
The initial analysis of the file and the creation of the ngram lists are all dealt with in the code. Other than R, the process should not need any other tools. For ease of processing, I would recommend following the instructions on using PowerShell to take care of non-Latin characters, which avoids issues with encoding.
Please read the README on GitHub and the comments in the code itself for an explanation of how it works, what the inputs need to be and an overview of the process.
This process identifies two and three word combinations in search terms and assigns performance statistics to them. The objective is to identify phrase parts within the account, some of which will be shared between campaigns and adgroups, and use the information to shape keyword strategy.
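As a rough illustration of the idea, the following base R sketch splits each search term into its two and three word combinations and carries the row's statistics over to every ngram. The function names `make_ngrams` and `expand_ngrams` are hypothetical and are not taken from the repository scripts, which may do this differently.

```r
# Split one search term into its word n-grams (n = 2 or 3 in this write-up).
make_ngrams <- function(term, n) {
  words <- strsplit(trimws(term), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Two search terms and their clicks, as they appear in the example output.
terms <- data.frame(
  Search.term = c("ase are one", "next name word string"),
  Clicks = c(39, 34),
  stringsAsFactors = FALSE
)

# One output row per (search term, ngram) pair; the statistics are repeated
# unchanged for every ngram found in the term.
expand_ngrams <- function(df, n) {
  grams <- lapply(df$Search.term, make_ngrams, n = n)
  out <- df[rep(seq_len(nrow(df)), lengths(grams)), , drop = FALSE]
  out$ngram <- unlist(grams)
  rownames(out) <- NULL
  out
}

bigrams <- expand_ngrams(terms, 2)
trigrams <- expand_ngrams(terms, 3)
```

Because a search term contributes its full statistics to each ngram it contains, the same clicks are counted once per ngram. This is why a single search term can appear on several rows of the CSV output.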
As per the README, two scripts are used to produce the output shown here. Most of the output is produced by the ngram-outlier-influence-0-01.R script, though some of the example output from ngrams-ext-1-00.R is included.
The tables produced from line 145 in the main R script (ngrams-ext-1-00.R) are useful for identifying ngrams shared across the account. The tables below and others like them with additional statistics would meet this need.
For the most part, the volume of clicks is not evenly split between the labels in the example set.
kable(head(summary2Dcast), digits=2)
ngram | campaignLabel | campaignOther | total |
---|---|---|---|
text word | 510 | 449 | 959 |
are there | 202 | 313 | 515 |
name word | 0 | 463 | 463 |
test string | 169 | 293 | 462 |
other text | 102 | 233 | 335 |
next other | 0 | 289 | 289 |
kable(head(summary3Dcast), digits=2)
ngram | campaignLabel | campaignOther | total |
---|---|---|---|
one text word | 158 | 55 | 213 |
word test string | 56 | 116 | 172 |
other text word | 39 | 127 | 166 |
there test string | 50 | 95 | 145 |
ase text word | 63 | 72 | 135 |
other name word | 0 | 129 | 129 |
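Tables of this shape can be approximated in base R with `xtabs`. This is only a sketch: the object names `summary2Dcast` and `summary3Dcast` suggest the script uses a dcast-style reshape, and the input rows below are hypothetical.

```r
# Hypothetical ngram-level rows; the click figures are illustrative only.
ngram_rows <- data.frame(
  ngram  = c("text word", "text word", "name word", "text word"),
  Labels = c("campaignLabel", "campaignOther", "campaignOther", "campaignLabel"),
  Clicks = c(300, 449, 463, 210),
  stringsAsFactors = FALSE
)

# Sum clicks for every ngram/label pair; pairs that never occur become 0.
wide <- as.data.frame.matrix(xtabs(Clicks ~ ngram + Labels, data = ngram_rows))
wide$total <- rowSums(wide)

# Sort by total clicks, largest first.
wide <- wide[order(-wide$total), ]
```

The ngram ends up in the row names of `wide`, so it would need to be copied into a column before being passed to `kable` with the same layout as above.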
The following table displays conversion rate by ngram and label.
kable(head(summary3Dcasta), digits=2)
ngram | campaignLabel | campaignOther |
---|---|---|
are test string | 4.78 | -Inf |
are there do | -Inf | 8.00 |
are there dou | -Inf | NA |
are there on | NA | -Inf |
are there one | NA | -Inf |
are there string | NA | 6.56 |
The following tables show the main CSV output, with one file produced for two word phrase parts and one for three word phrase parts.
kable(head(arrange(labelNgrams.work_file2, desc(Clicks))), digits=2)
Labels | Campaign | Keyword | Search.term | Impressions | Clicks | Cost | Converted.clicks | ngram | ctr | cpc | cpa | cvr |
---|---|---|---|---|---|---|---|---|---|---|---|---|
campaignLabel | Campaign 02 | Keyword 01 | ase are one | 325 | 39 | 100.49 | 13 | ase are | 0.12 | 2.58 | 7.73 | 0.33 |
campaignLabel | Campaign 02 | Keyword 01 | ase are one | 325 | 39 | 100.49 | 13 | are one | 0.12 | 2.58 | 7.73 | 0.33 |
campaignLabel | Campaign 02 | Keyword 28 | ase there one | 38 | 38 | 79.04 | 0 | there one | 1.00 | 2.08 | NA | 0.00 |
campaignLabel | Campaign 02 | Keyword 28 | ase there one | 38 | 38 | 79.04 | 0 | ase there | 1.00 | 2.08 | NA | 0.00 |
campaignOther | Campaign 04 | Keyword 12 | name word | 1710 | 36 | 63.54 | 0 | name word | 0.02 | 1.76 | NA | 0.00 |
campaignOther | Campaign 04 | Keyword 06 | next name word string | 34 | 34 | 20.06 | 0 | word string | 1.00 | 0.59 | NA | 0.00 |
kable(head(arrange(labelNgrams.work_file3, desc(Clicks))), digits=2)
Labels | Campaign | Keyword | Search.term | Impressions | Clicks | Cost | Converted.clicks | ngram | ctr | cpc | cpa | cvr |
---|---|---|---|---|---|---|---|---|---|---|---|---|
campaignLabel | Campaign 02 | Keyword 01 | ase are one | 325 | 39 | 100.49 | 13 | ase are one | 0.12 | 2.58 | 7.73 | 0.33 |
campaignLabel | Campaign 02 | Keyword 28 | ase there one | 38 | 38 | 79.04 | 0 | ase there one | 1.00 | 2.08 | NA | 0.00 |
campaignOther | Campaign 04 | Keyword 06 | next name word string | 34 | 34 | 20.06 | 0 | name word string | 1.00 | 0.59 | NA | 0.00 |
campaignOther | Campaign 04 | Keyword 06 | next name word string | 34 | 34 | 20.06 | 0 | next name word | 1.00 | 0.59 | NA | 0.00 |
campaignOther | Campaign 04 | Keyword 01 | phrase name word uno | 275 | 33 | 85.03 | 11 | name word uno | 0.12 | 2.58 | 7.73 | 0.33 |
campaignOther | Campaign 04 | Keyword 01 | phrase name word uno | 275 | 33 | 85.03 | 11 | phrase name word | 0.12 | 2.58 | 7.73 | 0.33 |
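The derived columns in these tables are consistent with the usual paid search ratios. The sketch below recomputes them for two rows of the output above. The helper name `add_metrics` is hypothetical, and the handling of zero conversions is an assumption based on the `NA` values shown in the `cpa` column.

```r
add_metrics <- function(df) {
  df$ctr <- df$Clicks / df$Impressions       # click-through rate
  df$cpc <- df$Cost / df$Clicks              # cost per click
  # A row with no conversions has no meaningful cost per acquisition.
  df$cpa <- ifelse(df$Converted.clicks > 0, df$Cost / df$Converted.clicks, NA)
  df$cvr <- df$Converted.clicks / df$Clicks  # conversion rate
  df
}

# Two rows copied from the example output ("ase are one" and "ase there one").
rows <- data.frame(
  Impressions = c(325, 38),
  Clicks = c(39, 38),
  Cost = c(100.49, 79.04),
  Converted.clicks = c(13, 0)
)

metrics <- add_metrics(rows)
round(metrics[, c("ctr", "cpc", "cpa", "cvr")], 2)
```

For the first row this gives 0.12, 2.58, 7.73 and 0.33, matching the table.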
The graphs are fairly straightforward, though the initial distribution did require some work on the data set to produce the chart below.
A histogram for clicks within the account by ngram demonstrates the distribution of volume. This example also includes a density plot.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Univariate distributions are useful, but in a lot of paid search analysis scatter plots are more useful still, as bivariate outliers are often more informative than univariate ones. Examples are data points with extreme cost and volume figures, as in this chart. The values are log transformed to produce a slightly more normal distribution.
The following is an example that manipulates the labels on the points to highlight those of interest, and leave the plot ‘fairly’ uncluttered.
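A minimal sketch of the two steps just described, the log transform and the selective labelling, might look like this. The use of `log1p` and the 300 click threshold are assumptions made for illustration. The script may use a different transform and a different rule for choosing labels.

```r
# Cost and click totals for four ngrams, taken from the outlier table in this output.
pts <- data.frame(
  ngram = c("next other", "word uno", "name word", "text word"),
  Cost = c(639.08, 261.21, 726.32, 601.17),
  Clicks = c(289, 122, 403, 307),
  stringsAsFactors = FALSE
)

# log1p(x) is log(1 + x), so rows with zero cost or clicks stay finite.
pts$log_cost <- log1p(pts$Cost)
pts$log_clicks <- log1p(pts$Clicks)

# Keep a label only for points of interest; the rest get an empty string
# so the plot stays uncluttered.
pts$label <- ifelse(pts$Clicks > 300, pts$ngram, "")
```

The `log_clicks`, `log_cost` and `label` columns can then be handed to whichever plotting function is in use.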
The following table is a simple one looking at the distribution of the CVR per ngram within groups of labels and campaigns. The data is not really appropriate for this, but it is an interesting exercise in seeing how the number of observations can affect the error of the mean CVR.
Campaign | Labels | mcvr | n | mean | sd | se | ci |
---|---|---|---|---|---|---|---|
Campaign 01 | campaignLabel | 0.23 | 4 | 0.29 | 0.17 | 0.09 | 0.27 |
Campaign 02 | campaignLabel | 0.19 | 12 | 0.31 | 0.29 | 0.08 | 0.18 |
Campaign 03 | campaignOther | 0.22 | 5 | 0.34 | 0.25 | 0.11 | 0.31 |
Campaign 04 | campaignOther | 0.10 | 16 | 0.13 | 0.09 | 0.02 | 0.05 |
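The `se` and `ci` columns above are consistent with the standard error of the mean and the half-width of a 95% t-based confidence interval. For Campaign 01, for example, 0.17 / sqrt(4) is roughly 0.09, and multiplying by the t quantile for 3 degrees of freedom gives roughly 0.27. The sketch below shows that calculation. The function name `summarise_cvr` and the input values are hypothetical.

```r
summarise_cvr <- function(x, conf = 0.95) {
  n <- length(x)
  s <- sd(x)
  se <- s / sqrt(n)
  # Half-width of the t-based confidence interval for the mean.
  ci <- qt(1 - (1 - conf) / 2, df = n - 1) * se
  c(n = n, mean = mean(x), sd = s, se = se, ci = ci)
}

# Hypothetical per-ngram conversion rates: one small group and one larger
# group with the same mean.
small <- c(0.10, 0.20, 0.35, 0.50)
large <- rep(small, times = 4)

summarise_cvr(small)
summarise_cvr(large)
```

The two groups have the same mean, but the larger one has a much narrower interval, which is the effect of the number of observations that the table illustrates.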
This is another plot of average CVR, where the actual CVR for the group (label in this case) is plotted as a line through each facet.
For context, here is a plot of clicks to CVR without a transformation applied to the values.
Ideally, keywords grouped together in a paid search account perform more or less the same: the relationship between what is spent and how much traffic is received should be close, and the same goes for the number of clicks and the number of conversions.
There are a few techniques used for testing the assumptions of multivariate regression. A number of these identify outliers, points with high leverage and influential data points. Both Mahalanobis Distance and Cook’s Distance are used here for that purpose. The data used here is certainly not appropriate for regression, but the two measures mentioned above can still be used to identify points that do not exhibit the same relationship between Clicks and Conversions.
Most distributions of those values would look a little more like this, where over time there should be a relationship between the two values as optimisation activity in the account stabilises performance, though most likely with a heavier skew towards 0 on either axis. The example set in the repository does not show this pattern.
The following is a simple check for outliers against the fitted linear model with conversions as an outcome by clicks against campaigns and weighted on cost.
The outliers reported below were identified by a Bonferroni Outlier Test as per the car package.
## rstudent unadjusted p-value Bonferonni p
## 123 6.650903 5.7353e-10 8.7177e-08
## 74 6.171982 6.5755e-09 9.9948e-07
## 161 4.338993 2.6896e-05 4.0882e-03
## 55 -4.186763 4.9220e-05 7.4814e-03
## 126 -3.916863 1.3847e-04 2.1047e-02
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | Observation |
---|---|---|---|---|---|---|---|---|
next other | Campaign 04 | campaignOther | 1304 | 639.08 | 289 | 0 | 0.00 | 126 |
word uno | Campaign 04 | campaignOther | 637 | 261.21 | 122 | 36 | 0.30 | 161 |
name word | Campaign 04 | campaignOther | 3191 | 726.32 | 403 | 44 | 0.11 | 123 |
text word | Campaign 02 | campaignLabel | 1408 | 601.17 | 307 | 42 | 0.14 | 74 |
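In the outlier test output above, each Bonferroni p-value is the unadjusted p-value multiplied by 152 (for example 5.7353e-10 × 152 ≈ 8.7177e-08), which suggests 152 observations were tested. The sketch below shows the same calculation in base R on hypothetical data. The model formula is an assumption based on the description of the fitted model, and capping the adjusted value at 1 is a choice made here rather than the behaviour of the car package.

```r
set.seed(1)
# Hypothetical data: conversions roughly proportional to clicks, plus noise.
n <- 60
d <- data.frame(
  Campaign = rep(c("Campaign 01", "Campaign 02"), each = n / 2),
  Clicks = round(runif(n, 10, 400))
)
d$Cost <- d$Clicks * 2
d$Converted.clicks <- 0.1 * d$Clicks + rnorm(n)
# Plant one row that converts far better than the rest.
d$Converted.clicks[7] <- d$Converted.clicks[7] + 25

# Assumed model form: conversions by clicks and campaign, weighted on cost.
fit <- lm(Converted.clicks ~ Clicks + Campaign, data = d, weights = Cost)

rs <- rstudent(fit)
# Two-sided t-test on each studentized residual, with one fewer degree of
# freedom than the fitted model has.
p_unadj <- 2 * pt(abs(rs), df = df.residual(fit) - 1, lower.tail = FALSE)
# Bonferroni: multiply by the number of residuals tested, capped at 1 here.
p_bonf <- pmin(1, p_unadj * length(rs))

res <- data.frame(rstudent = rs, p_unadj = p_unadj, p_bonf = p_bonf)
flagged <- res[res$p_bonf < 0.05, ]
flagged[order(flagged$p_bonf), ]
```

The row names of `flagged` are the observation numbers, which can be used to look up the matching ngram rows in the same way as the Observation column above.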
The first table is the top ten individual observations by clicks that match the ngrams identified above. Next is a quick review of the performance of the ngrams identified as outliers by campaign.
Keyword | Search.term | Impressions | Clicks | Cost | Converted.clicks | cvr |
---|---|---|---|---|---|---|
Keyword 12 | name word | 1710 | 36 | 63.54 | 0 | 0.00 |
Keyword 06 | next name word string | 34 | 34 | 20.06 | 0 | 0.00 |
Keyword 01 | phrase name word uno | 275 | 33 | 85.03 | 11 | 0.33 |
Keyword 01 | phrase name word uno | 275 | 33 | 85.03 | 11 | 0.33 |
Keyword 05 | next name word uno | 50 | 30 | 61.90 | 20 | 0.67 |
Keyword 05 | text word test string | 50 | 30 | 61.90 | 20 | 0.67 |
Keyword 05 | text word test string | 50 | 30 | 61.90 | 20 | 0.67 |
Keyword 05 | next name word uno | 50 | 30 | 61.90 | 20 | 0.67 |
Keyword 17 | one text word word phrase | 126 | 28 | 121.94 | 0 | 0.00 |
Keyword 20 | one text word word thing | 63 | 27 | 14.94 | 0 | 0.00 |
The following is a table of clicks by ngram and campaign.
ngram | Campaign 01 | Campaign 02 | Campaign 03 | Campaign 04 |
---|---|---|---|---|
name word | 0 | 0 | 60 | 403 |
next other | 0 | 0 | 0 | 289 |
test string | 104 | 65 | 57 | 236 |
text word | 203 | 307 | 76 | 373 |
word uno | 32 | 22 | 18 | 122 |
The following is a table of conversion rate by ngram and campaign.
ngram | Campaign 01 | Campaign 02 | Campaign 03 | Campaign 04 |
---|---|---|---|---|
name word | NA | NA | 0.00 | 0.11 |
next other | NA | NA | NA | 0.00 |
test string | 0.17 | 0.45 | 0.00 | 0.00 |
text word | 0.00 | 0.14 | 0.18 | 0.03 |
word uno | 0.00 | 1.00 | 0.00 | 0.30 |
There are a number of different methods included here for identifying extreme points. The next is a look at the rows returned by Cook’s Distance. This test identified fewer points than the previous one, but it also identified observation 28, which has an unusually low number of conversions for the clicks it has received.
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | Observation |
---|---|---|---|---|---|---|---|---|
text word | Campaign 01 | campaignLabel | 685 | 422.79 | 203 | 0 | 0.00 | 28 |
text word | Campaign 02 | campaignLabel | 1408 | 601.17 | 307 | 42 | 0.14 | 74 |
name word | Campaign 04 | campaignOther | 3191 | 726.32 | 403 | 44 | 0.11 | 123 |
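As a sketch of how rows can be selected by Cook's Distance, the example below fits a simple model to hypothetical data with one planted high-volume row that has no conversions, similar in spirit to observation 28. The 4 / n cutoff is a common rule of thumb and is an assumption here, not necessarily the threshold the script uses.

```r
set.seed(2)
# Hypothetical ngram rows: conversions track clicks at roughly 10%.
d <- data.frame(Clicks = round(runif(40, 10, 150)))
d$Converted.clicks <- 0.1 * d$Clicks + rnorm(40, sd = 0.5)
# Plant a high-volume row with no conversions at all.
d <- rbind(d, data.frame(Clicks = 400, Converted.clicks = 0))

fit <- lm(Converted.clicks ~ Clicks, data = d)
d$cooks <- cooks.distance(fit)

# Assumed cutoff: the 4 / n rule of thumb.
cutoff <- 4 / nrow(d)
influential <- d[d$cooks > cutoff, ]
influential[order(-influential$cooks), ]
```

Cook's Distance combines the size of a residual with the leverage of the point, so a row far out on the Clicks axis that also sits well off the fitted line scores highly.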
Next is a plot of residuals against the model, where the Cook’s distance of each point is displayed as the size of its circle. The plot labels the most influential points, those which have the most effect on the linear model created at the start of this process.
Again, the observations identified were different from those of the outlier test.
## StudRes Hat CookD
## 28 -1.5078970 0.72091052 0.8530673
## 55 -4.1867634 0.24108652 0.7901841
## 74 6.1719823 0.59133772 2.3406968
## 96 -0.5678472 0.73891135 0.3385422
## 123 6.6509026 0.37161567 1.5858359
## 126 -3.9168633 0.13421568 0.5199644
## 138 -3.0210812 0.08818068 0.3231677
## 145 -2.4970387 0.27776862 0.5378105
## 161 4.3389925 0.03111095 0.2593101
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | Observation |
---|---|---|---|---|---|---|---|---|
text word | Campaign 01 | campaignLabel | 685 | 422.79 | 203 | 0 | 0.00 | 28 |
text word | Campaign 02 | campaignLabel | 1408 | 601.17 | 307 | 42 | 0.14 | 74 |
other text | Campaign 03 | campaignOther | 411 | 238.56 | 135 | 14 | 0.10 | 96 |
name word | Campaign 04 | campaignOther | 3191 | 726.32 | 403 | 44 | 0.11 | 123 |
The following are a series of graphs and tables of the relationship between clicks and converted clicks. Both Mahalanobis Distance and Cook’s Distance are included in the tables for each campaign.
In the output below, the first table for each campaign under the scatterplot is sorted by Cook’s Distance and the second by Mahalanobis Distance. It is interesting that the two tables, while sharing observations, order them differently.
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | md1 | cd1 |
---|---|---|---|---|---|---|---|---|---|
text word | Campaign 01 | campaignLabel | 685 | 422.79 | 203 | 0 | 0.0000000 | 21.343686 | 9.4435401 |
test string | Campaign 01 | campaignLabel | 410 | 286.63 | 104 | 18 | 0.1730769 | 11.741934 | 0.7495201 |
other are | Campaign 01 | campaignLabel | 273 | 226.08 | 93 | 15 | 0.1612903 | 7.862069 | 0.2370295 |
other there | Campaign 01 | campaignLabel | 101 | 173.90 | 64 | 18 | 0.2812500 | 11.045909 | 0.1982519 |
there test | Campaign 01 | campaignLabel | 57 | 103.62 | 34 | 18 | 0.5294118 | 12.214889 | 0.1109411 |
ase text | Campaign 01 | campaignLabel | 2964 | 248.51 | 106 | 0 | 0.0000000 | 3.541226 | 0.0333855 |
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | md1 | cd1 |
---|---|---|---|---|---|---|---|---|---|
text word | Campaign 01 | campaignLabel | 685 | 422.79 | 203 | 0 | 0.0000000 | 21.343686 | 9.4435401 |
there test | Campaign 01 | campaignLabel | 57 | 103.62 | 34 | 18 | 0.5294118 | 12.214889 | 0.1109411 |
test string | Campaign 01 | campaignLabel | 410 | 286.63 | 104 | 18 | 0.1730769 | 11.741934 | 0.7495201 |
other there | Campaign 01 | campaignLabel | 101 | 173.90 | 64 | 18 | 0.2812500 | 11.045909 | 0.1982519 |
other are | Campaign 01 | campaignLabel | 273 | 226.08 | 93 | 15 | 0.1612903 | 7.862069 | 0.2370295 |
ase text | Campaign 01 | campaignLabel | 2964 | 248.51 | 106 | 0 | 0.0000000 | 3.541226 | 0.0333855 |
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | md2 | cd2 |
---|---|---|---|---|---|---|---|---|---|
text word | Campaign 02 | campaignLabel | 1408 | 601.17 | 307 | 42 | 0.1368078 | 22.923721 | 17.0378336 |
one text | Campaign 02 | campaignLabel | 1138 | 519.60 | 232 | 0 | 0.0000000 | 11.304012 | 1.9416931 |
word word | Campaign 02 | campaignLabel | 544 | 425.11 | 172 | 0 | 0.0000000 | 5.292963 | 0.2276719 |
one word | Campaign 02 | campaignLabel | 1405 | 379.61 | 147 | 0 | 0.0000000 | 3.474184 | 0.0914093 |
ase are | Campaign 02 | campaignLabel | 551 | 230.61 | 89 | 22 | 0.2471910 | 3.355504 | 0.0767080 |
test string | Campaign 02 | campaignLabel | 300 | 115.11 | 65 | 29 | 0.4461538 | 7.467613 | 0.0550768 |
ngram | Campaign | Labels | Impressions | Cost | Clicks | Converted.clicks | cvr | md2 | cd2 |
---|---|---|---|---|---|---|---|---|---|
text word | Campaign 02 | campaignLabel | 1408 | 601.17 | 307 | 42 | 0.1368078 | 22.923721 | 17.0378336 |
one text | Campaign 02 | campaignLabel | 1138 | 519.60 | 232 | 0 | 0.0000000 | 11.304012 | 1.9416931 |
test string | Campaign 02 | campaignLabel | 300 | 115.11 | 65 | 29 | 0.4461538 | 7.467613 | 0.0550768 |
word uno | Campaign 02 | campaignLabel | 44 | 36.46 | 22 | 22 | 1.0000000 | 5.382057 | 0.0068151 |
word word | Campaign 02 | campaignLabel | 544 | 425.11 | 172 | 0 | 0.0000000 | 5.292963 | 0.2276719 |
word test | Campaign 02 | campaignLabel | 120 | 61.90 | 30 | 20 | 0.6666667 | 4.044613 | 0.0140388 |
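The two measures can be computed side by side in base R, as in the sketch below. It uses the clicks and conversions of the six Campaign 01 ngrams shown above. Which variables and rows the script feeds into each measure is not shown in this output, so the values here are not expected to match the `md1` and `cd1` columns.

```r
# Clicks and conversions for six Campaign 01 ngrams from the tables above.
d <- data.frame(
  ngram = c("text word", "test string", "other are",
            "other there", "there test", "ase text"),
  Clicks = c(203, 104, 93, 64, 34, 106),
  Converted.clicks = c(0, 18, 15, 18, 18, 0),
  stringsAsFactors = FALSE
)

xy <- as.matrix(d[, c("Clicks", "Converted.clicks")])
# Squared Mahalanobis distance of each row from the centre of the point cloud.
d$md <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
# Cook's distance from a simple regression of conversions on clicks.
d$cd <- cooks.distance(lm(Converted.clicks ~ Clicks, data = d))

# The same rows, ordered by each measure in turn.
by_cook <- d[order(-d$cd), ]
by_md <- d[order(-d$md), ]
```

Mahalanobis Distance measures how far a point is from the centre of the data and does not depend on a fitted model. Cook's Distance measures how much a point moves the fitted regression. The two orderings therefore need not agree.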