Introduction

This is a quick example of the output produced by the https://github.com/anthonypc/phrasePartAnalysis project, as described in the blog post Text Processing, N-Grams And Paid Search.

The initial analysis of the file and the creation of the ngram lists are all handled in the code. The process should not need any tools other than R. For ease of processing, I recommend following the instructions on using PowerShell to handle non-Latin characters and avoid encoding issues.

Please read the README on GitHub and the comments in the code itself for an explanation of how it works, what the inputs need to be, and an overview of the process.

Summary

This process identifies two- and three-word combinations in search terms and assigns performance statistics to them. The objective is to identify phrase parts within the account, some of which will be shared between campaigns and ad groups, and use the information to shape keyword strategy.
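
As a rough illustration of the idea, and not the code from the repository, the sketch below splits each search term into overlapping two-word phrase parts and totals clicks and conversions per phrase part. The helper name make_ngrams and the toy data frame are assumptions made for this example; only the column names are borrowed from the output further down.

```r
# Hypothetical helper: split a search term into overlapping n-word phrase parts.
make_ngrams <- function(term, n) {
  words <- strsplit(trimws(term), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy search term report (illustrative values only).
terms <- data.frame(
  Search.term = c("ase are one", "next name word string"),
  Clicks = c(39, 34),
  Converted.clicks = c(13, 0),
  stringsAsFactors = FALSE
)

# One row per (search term, ngram); the term's statistics repeat on each row.
rows <- lapply(seq_len(nrow(terms)), function(i) {
  grams <- make_ngrams(terms$Search.term[i], 2)
  if (length(grams) == 0) return(NULL)
  data.frame(ngram = grams,
             Clicks = terms$Clicks[i],
             Converted.clicks = terms$Converted.clicks[i],
             stringsAsFactors = FALSE)
})
bigrams <- do.call(rbind, rows)

# Total the statistics per ngram.
bigram_totals <- aggregate(cbind(Clicks, Converted.clicks) ~ ngram,
                           data = bigrams, FUN = sum)
bigram_totals
```

Because a search term contributes its full statistics to every phrase part it contains, totals summed across ngrams will exceed the account totals. That is expected with this approach.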

Output

As described in the README, two scripts produce the output in this document. Most of it comes from the ngram-outlier-influence-0-01.R script, though some example output from ngrams-ext-1-00.R is included.

ngrams-ext-1-00.R Output

The tables produced from line 145 in the main R script (ngrams-ext-1-00.R) are useful for identifying ngrams shared across the account. The tables below and others like them with additional statistics would meet this need.

For the most part, click volume is not evenly split between the labels in the example set.

kable(head(summary2Dcast), digits=2)

ngram	campaignLabel	campaignOther	total
text word	510	449	959
are there	202	313	515
name word	0	463	463
test string	169	293	462
other text	102	233	335
next other	0	289	289
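
A table of this shape can be built in several ways. The sketch below uses base R's xtabs rather than whatever reshaping call the script uses, which is an assumption on my part; the object names are made up for the example, and the click figures are the first three rows of the table above.

```r
# Long format: one row per ngram and label (figures from the table above).
ngram_rows <- data.frame(
  ngram  = c("text word", "text word", "name word", "are there", "are there"),
  Labels = c("campaignLabel", "campaignOther", "campaignOther",
             "campaignLabel", "campaignOther"),
  Clicks = c(510, 449, 463, 202, 313),
  stringsAsFactors = FALSE
)

# Sum clicks into an ngram x label table; missing combinations become 0.
wide <- as.data.frame.matrix(xtabs(Clicks ~ ngram + Labels, data = ngram_rows))

# Add a row total and sort so the highest-volume ngrams come first.
wide$total <- rowSums(wide)
wide <- wide[order(-wide$total), ]
wide
```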

kable(head(summary3Dcast), digits=2)

ngram	campaignLabel	campaignOther	total
one text word	158	55	213
word test string	56	116	172
other text word	39	127	166
there test string	50	95	145
ase text word	63	72	135
other name word	0	129	129

The following shows conversion rate by ngram and label.

kable(head(summary3Dcasta), digits=2)

ngram	campaignLabel	campaignOther
are test string	4.78	-Inf
are there do	-Inf	8.00
are there dou	-Inf	NA
are there on	NA	-Inf
are there one	NA	-Inf
are there string	NA	6.56

The following tables are the main CSV output, with one file produced for two-word phrase parts and one for three-word phrase parts.

kable(head(arrange(labelNgrams.work_file2, desc(Clicks))), digits=2)

Labels	Campaign	Keyword	Search.term	Impressions	Clicks	Cost	Converted.clicks	ngram	ctr	cpc	cpa	cvr
campaignLabel	Campaign 02	Keyword 01	ase are one	325	39	100.49	13	ase are	0.12	2.58	7.73	0.33
campaignLabel	Campaign 02	Keyword 01	ase are one	325	39	100.49	13	are one	0.12	2.58	7.73	0.33
campaignLabel	Campaign 02	Keyword 28	ase there one	38	38	79.04	0	there one	1.00	2.08	NA	0.00
campaignLabel	Campaign 02	Keyword 28	ase there one	38	38	79.04	0	ase there	1.00	2.08	NA	0.00
campaignOther	Campaign 04	Keyword 12	name word	1710	36	63.54	0	name word	0.02	1.76	NA	0.00
campaignOther	Campaign 04	Keyword 06	next name word string	34	34	20.06	0	word string	1.00	0.59	NA	0.00

kable(head(arrange(labelNgrams.work_file3, desc(Clicks))), digits=2)

Labels	Campaign	Keyword	Search.term	Impressions	Clicks	Cost	Converted.clicks	ngram	ctr	cpc	cpa	cvr
campaignLabel	Campaign 02	Keyword 01	ase are one	325	39	100.49	13	ase are one	0.12	2.58	7.73	0.33
campaignLabel	Campaign 02	Keyword 28	ase there one	38	38	79.04	0	ase there one	1.00	2.08	NA	0.00
campaignOther	Campaign 04	Keyword 06	next name word string	34	34	20.06	0	name word string	1.00	0.59	NA	0.00
campaignOther	Campaign 04	Keyword 06	next name word string	34	34	20.06	0	next name word	1.00	0.59	NA	0.00
campaignOther	Campaign 04	Keyword 01	phrase name word uno	275	33	85.03	11	name word uno	0.12	2.58	7.73	0.33
campaignOther	Campaign 04	Keyword 01	phrase name word uno	275	33	85.03	11	phrase name word	0.12	2.58	7.73	0.33
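
The ctr, cpc, cpa and cvr columns in these tables are consistent with the usual paid search ratios. The sketch below shows one way to derive them; the function name add_metrics and the zero-denominator guard are assumptions for this example, not the script's own code. The two input rows reuse figures from the first table above.

```r
# Hypothetical helper: add the derived ratio columns to a report data frame.
add_metrics <- function(df) {
  # Return NA instead of Inf or NaN when the denominator is zero.
  safe_div <- function(num, den) ifelse(den > 0, num / den, NA_real_)
  df$ctr <- safe_div(df$Clicks, df$Impressions)       # click-through rate
  df$cpc <- safe_div(df$Cost, df$Clicks)              # cost per click
  df$cpa <- safe_div(df$Cost, df$Converted.clicks)    # cost per conversion
  df$cvr <- safe_div(df$Converted.clicks, df$Clicks)  # conversion rate
  df
}

example <- data.frame(
  Impressions = c(325, 38),
  Clicks = c(39, 38),
  Cost = c(100.49, 79.04),
  Converted.clicks = c(13, 0)
)
round(add_metrics(example), 2)
```

The second row has no conversions, so its cpa comes out as NA while its cvr is 0, matching the pattern in the tables.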

Plots

The graphs are fairly straightforward, though the initial distribution required some work on the data set to produce the chart below.

A histogram for clicks within the account by ngram demonstrates the distribution of volume. This example also includes a density plot.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Univariate distributions are useful, but in a lot of paid search analysis scatter plots are more useful still, because bivariate outliers are often more informative than univariate ones. Data points with extreme cost and volume figures, as in this chart, are an example. The values are log transformed to produce a slightly more normal distribution.

The following is an example that manipulates the labels on the points to highlight those of interest while leaving the plot ‘fairly’ uncluttered.

The following simple table looks at the distribution of CVR per ngram within groups of labels and campaigns. The data is not really appropriate for this, but it is an interesting exercise in seeing how the number of observations can affect the error of the mean CVR.

Campaign	Labels	mcvr	n	mean	sd	se	ci
Campaign 01	campaignLabel	0.23	4	0.29	0.17	0.09	0.27
Campaign 02	campaignLabel	0.19	12	0.31	0.29	0.08	0.18
Campaign 03	campaignOther	0.22	5	0.34	0.25	0.11	0.31
Campaign 04	campaignOther	0.10	16	0.13	0.09	0.02	0.05
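
The se and ci columns are consistent with the standard error of the mean and the half-width of a 95% confidence interval based on the t distribution. For the Campaign 01 row, n = 4 and sd = 0.17 give roughly se = 0.09 and ci = 0.27. The sketch below shows that calculation; the function name summarise_cvr is an assumption for this example.

```r
# Hypothetical helper: summary statistics for a vector of per-ngram CVRs.
summarise_cvr <- function(cvr, conf = 0.95) {
  n <- length(cvr)
  sd_cvr <- sd(cvr)
  se <- sd_cvr / sqrt(n)                          # standard error of the mean
  ci <- se * qt(1 - (1 - conf) / 2, df = n - 1)   # CI half-width
  c(n = n, mean = mean(cvr), sd = sd_cvr, se = se, ci = ci)
}

# Illustrative values only.
summarise_cvr(c(0.1, 0.2, 0.3, 0.4))
```

With only a handful of ngrams in a group, both the square root of n and the t multiplier work against precision, which is why the small groups in the table carry much wider intervals.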

This is another plot of average CVR, where the actual CVR for the group (label in this case) is plotted as a line through each facet.

For context, here is a plot of clicks to CVR without a transformation applied to the values.

Extreme Values

ngram-outlier-influence-0-01.R Output

Ideally, keywords grouped together in a paid search account perform more or less the same: the relationship between what is spent and how much traffic is received should be close, and likewise for the number of clicks and the number of conversions.

There are a few techniques used for testing the assumptions of multivariate regression. A number of these identify outliers, points with high leverage, and influential data points. Both Mahalanobis Distance and Cook’s Distance are used here to address these issues. Although the data used here is certainly not appropriate for regression, the two measures can still be used to identify points that do not exhibit the same relationship between Clicks and Conversions.
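
To make the two measures concrete, here is a minimal sketch on made-up data using only base R. Mahalanobis distance measures how far a point sits from the centre of the Clicks and Converted.clicks cloud, allowing for their covariance, and needs no model. Cook’s distance measures how much a fitted model changes when a point is left out. The data values and the simple one-predictor model are assumptions for the example.

```r
# Hypothetical data: conversions rise with clicks, except in the last row.
d <- data.frame(
  Clicks = c(20, 35, 50, 65, 80, 95, 110, 300),
  Converted.clicks = c(4, 7, 10, 13, 16, 19, 22, 0)
)

# Squared Mahalanobis distance of each row from the column means.
xy <- d[, c("Clicks", "Converted.clicks")]
d$md <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Cook's distance from a simple linear model of conversions on clicks.
fit <- lm(Converted.clicks ~ Clicks, data = d)
d$cd <- cooks.distance(fit)

# Rows that break the clicks-to-conversions pattern rise to the top.
d[order(-d$cd), ]
```

Note that base R's mahalanobis() returns the squared distance.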

Most distributions of those values would look a little more like this, where over time there should be a relationship between the two values as optimisation activity in the account stabilises performance, though most likely with a heavier skew towards 0 on either axis. The example set in the repository does not.

The following is a simple check for outliers against the fitted linear model with conversions as an outcome by clicks against campaigns and weighted on cost.

The outliers reported below were identified by a Bonferroni Outlier Test from the car package.

##      rstudent unadjusted p-value Bonferonni p
## 123  6.650903         5.7353e-10   8.7177e-08
## 74   6.171982         6.5755e-09   9.9948e-07
## 161  4.338993         2.6896e-05   4.0882e-03
## 55  -4.186763         4.9220e-05   7.4814e-03
## 126 -3.916863         1.3847e-04   2.1047e-02

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	Observation
next other	Campaign 04	campaignOther	1304	639.08	289	0	0.00	126
word uno	Campaign 04	campaignOther	637	261.21	122	36	0.30	161
name word	Campaign 04	campaignOther	3191	726.32	403	44	0.11	123
text word	Campaign 02	campaignLabel	1408	601.17	307	42	0.14	74
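
For readers without the car package, the sketch below approximates the same kind of check in base R: fit a cost-weighted linear model, take the studentized residuals, and apply a Bonferroni correction to their two-sided p-values. To my understanding this mirrors what the car outlier test reports, but treat that as an assumption. The model formula is my reading of the description above, and the data is made up.

```r
# Hypothetical data: conversions track clicks closely except in row 12.
d <- data.frame(
  Clicks = c(10, 20, 30, 40, 50, 60, 15, 25, 35, 45, 55, 65),
  Campaign = rep(c("Campaign 01", "Campaign 02"), each = 6),
  Cost = c(12, 25, 33, 41, 56, 62, 18, 24, 39, 47, 52, 70),
  Converted.clicks = c(2, 4, 7, 8, 10, 13, 3, 6, 7, 10, 11, 40)
)

# Assumed model: conversions by clicks and campaign, weighted on cost.
fit <- lm(Converted.clicks ~ Clicks + Campaign, data = d, weights = Cost)

# Studentized residuals and their two-sided p-values.
rs <- rstudent(fit)
p_unadj <- 2 * pt(-abs(rs), df = df.residual(fit) - 1)

# Bonferroni: multiply by the number of observations, capped at 1.
p_bonf <- pmin(1, p_unadj * length(rs))

outliers <- data.frame(rstudent = rs, p_unadj, p_bonf)[p_bonf < 0.05, ]
outliers
```

In the output above, each Bonferroni p is 152 times the unadjusted p-value, which is consistent with a correction over 152 observations.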

The table below lists the top ten individual observations by clicks that match the ngrams identified above. It is followed by a quick review, by campaign, of the performance of the ngrams identified as outliers.

Keyword	Search.term	Impressions	Clicks	Cost	Converted.clicks	cvr
Keyword 12	name word	1710	36	63.54	0	0.00
Keyword 06	next name word string	34	34	20.06	0	0.00
Keyword 01	phrase name word uno	275	33	85.03	11	0.33
Keyword 01	phrase name word uno	275	33	85.03	11	0.33
Keyword 05	next name word uno	50	30	61.90	20	0.67
Keyword 05	text word test string	50	30	61.90	20	0.67
Keyword 05	text word test string	50	30	61.90	20	0.67
Keyword 05	next name word uno	50	30	61.90	20	0.67
Keyword 17	one text word word phrase	126	28	121.94	0	0.00
Keyword 20	one text word word thing	63	27	14.94	0	0.00

The following is a table of clicks by ngram and campaign.

ngram	Campaign 01	Campaign 02	Campaign 03	Campaign 04
name word	0	0	60	403
next other	0	0	0	289
test string	104	65	57	236
text word	203	307	76	373
word uno	32	22	18	122

The following is a table of conversion rate by ngram and campaign.

ngram	Campaign 01	Campaign 02	Campaign 03	Campaign 04
name word	NA	NA	0.00	0.11
next other	NA	NA	NA	0.00
test string	0.17	0.45	0.00	0.00
text word	0.00	0.14	0.18	0.03
word uno	0.00	1.00	0.00	0.30

Other Tests

There are a number of different methods included here for identifying extreme points. The next is a look at the rows returned by Cook’s Distance. This test identified fewer points than the previous one, and it also identified observation 28, which has an unusually low number of conversions for the clicks it has received.

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	Observation
text word	Campaign 01	campaignLabel	685	422.79	203	0	0.00	28
text word	Campaign 02	campaignLabel	1408	601.17	307	42	0.14	74
name word	Campaign 04	campaignOther	3191	726.32	403	44	0.11	123

Next is a plot of residuals against the model, where the Cook’s distance of each point is shown as the size of its circle. The plot labels the most influential points, those which have the most effect on the linear model created at the start of this process.

Again, the observations identified differ from those of the outlier test.

##        StudRes        Hat     CookD
## 28  -1.5078970 0.72091052 0.8530673
## 55  -4.1867634 0.24108652 0.7901841
## 74   6.1719823 0.59133772 2.3406968
## 96  -0.5678472 0.73891135 0.3385422
## 123  6.6509026 0.37161567 1.5858359
## 126 -3.9168633 0.13421568 0.5199644
## 138 -3.0210812 0.08818068 0.3231677
## 145 -2.4970387 0.27776862 0.5378105
## 161  4.3389925 0.03111095 0.2593101

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	Observation
text word	Campaign 01	campaignLabel	685	422.79	203	0	0.00	28
text word	Campaign 02	campaignLabel	1408	601.17	307	42	0.14	74
other text	Campaign 03	campaignOther	411	238.56	135	14	0.10	96
name word	Campaign 04	campaignOther	3191	726.32	403	44	0.11	123
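
The StudRes, Hat and CookD columns above can be reproduced for any linear model with base R functions. The sketch below, on made-up data, also rebuilds Cook’s distance from the standardised residual and the leverage. This helps explain why a point with a large residual but low leverage, such as observation 161, can have a smaller Cook’s distance than a point with a modest residual and high leverage, such as observation 28.

```r
# Hypothetical data; the last row has far more clicks than the rest.
d <- data.frame(
  Clicks = c(20, 35, 50, 65, 80, 95, 110, 300),
  Converted.clicks = c(5, 6, 11, 12, 17, 18, 23, 30)
)
fit <- lm(Converted.clicks ~ Clicks, data = d)

infl <- data.frame(
  StudRes = rstudent(fit),       # studentized residuals
  Hat     = hatvalues(fit),      # leverage
  CookD   = cooks.distance(fit)  # influence
)

# Cook's distance rebuilt from standardised residual and leverage.
p <- length(coef(fit))
h <- hatvalues(fit)
cook_manual <- rstandard(fit)^2 * h / (p * (1 - h))

infl[order(-infl$CookD), ]
```

Leverage depends only on the predictor values, so the row with 300 clicks has the highest Hat value regardless of how many conversions it has.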

By Campaign

The following are a series of graphs and tables of the relationship between clicks and converted clicks. Both Mahalanobis Distance and Cook’s Distance are included in the tables for each campaign.

In the output below, the first table for each campaign under the scatterplot is sorted by Cook’s Distance and the second by Mahalanobis Distance. It is interesting that the two tables, while sharing observations, order them differently.

Campaign 01

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md1	cd1
text word	Campaign 01	campaignLabel	685	422.79	203	0	0.0000000	21.343686	9.4435401
test string	Campaign 01	campaignLabel	410	286.63	104	18	0.1730769	11.741934	0.7495201
other are	Campaign 01	campaignLabel	273	226.08	93	15	0.1612903	7.862069	0.2370295
other there	Campaign 01	campaignLabel	101	173.90	64	18	0.2812500	11.045909	0.1982519
there test	Campaign 01	campaignLabel	57	103.62	34	18	0.5294118	12.214889	0.1109411
ase text	Campaign 01	campaignLabel	2964	248.51	106	0	0.0000000	3.541226	0.0333855

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md1	cd1
text word	Campaign 01	campaignLabel	685	422.79	203	0	0.0000000	21.343686	9.4435401
there test	Campaign 01	campaignLabel	57	103.62	34	18	0.5294118	12.214889	0.1109411
test string	Campaign 01	campaignLabel	410	286.63	104	18	0.1730769	11.741934	0.7495201
other there	Campaign 01	campaignLabel	101	173.90	64	18	0.2812500	11.045909	0.1982519
other are	Campaign 01	campaignLabel	273	226.08	93	15	0.1612903	7.862069	0.2370295
ase text	Campaign 01	campaignLabel	2964	248.51	106	0	0.0000000	3.541226	0.0333855

Campaign 02

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md2	cd2
text word	Campaign 02	campaignLabel	1408	601.17	307	42	0.1368078	22.923721	17.0378336
one text	Campaign 02	campaignLabel	1138	519.60	232	0	0.0000000	11.304012	1.9416931
word word	Campaign 02	campaignLabel	544	425.11	172	0	0.0000000	5.292963	0.2276719
one word	Campaign 02	campaignLabel	1405	379.61	147	0	0.0000000	3.474184	0.0914093
ase are	Campaign 02	campaignLabel	551	230.61	89	22	0.2471910	3.355504	0.0767080
test string	Campaign 02	campaignLabel	300	115.11	65	29	0.4461538	7.467613	0.0550768

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md2	cd2
text word	Campaign 02	campaignLabel	1408	601.17	307	42	0.1368078	22.923721	17.0378336
one text	Campaign 02	campaignLabel	1138	519.60	232	0	0.0000000	11.304012	1.9416931
test string	Campaign 02	campaignLabel	300	115.11	65	29	0.4461538	7.467613	0.0550768
word uno	Campaign 02	campaignLabel	44	36.46	22	22	1.0000000	5.382057	0.0068151
word word	Campaign 02	campaignLabel	544	425.11	172	0	0.0000000	5.292963	0.2276719
word test	Campaign 02	campaignLabel	120	61.90	30	20	0.6666667	4.044613	0.0140388

Campaign 03

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md3	cd3
other text	Campaign 03	campaignOther	411	238.56	135	14	0.1037037	10.4506015	2.9831061
other there	Campaign 03	campaignOther	164	196.09	81	16	0.1975309	7.4609246	0.4763125
there string	Campaign 03	campaignOther	24	82.00	24	16	0.6666667	7.8262066	0.4091931
text word	Campaign 03	campaignOther	193	141.22	76	14	0.1842105	5.4479779	0.1231220
are there	Campaign 03	campaignOther	279	128.37	41	0	0.0000000	0.4744398	0.0619276
other are	Campaign 03	campaignOther	279	128.37	41	0	0.0000000	0.4744398	0.0619276

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md3	cd3
other text	Campaign 03	campaignOther	411	238.56	135	14	0.1037037	10.450601	2.9831061
there string	Campaign 03	campaignOther	24	82.00	24	16	0.6666667	7.826207	0.4091931
other there	Campaign 03	campaignOther	164	196.09	81	16	0.1975309	7.460925	0.4763125
word string	Campaign 03	campaignOther	122	24.30	26	14	0.5384615	5.597012	0.0208493
text word	Campaign 03	campaignOther	193	141.22	76	14	0.1842105	5.447978	0.1231220
name word	Campaign 03	campaignOther	557	53.60	60	0	0.0000000	1.179691	0.0196516

Campaign 04

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md4	cd4
name word	Campaign 04	campaignOther	3191	726.32	403	44	0.1091811	101.783405	5.3318903
text word	Campaign 04	campaignOther	2910	671.88	373	12	0.0321716	55.189455	0.6132299
next other	Campaign 04	campaignOther	1304	639.08	289	0	0.0000000	32.141897	0.5732076
test string	Campaign 04	campaignOther	4452	648.50	236	0	0.0000000	17.698671	0.2214219
word uno	Campaign 04	campaignOther	637	261.21	122	36	0.2950820	1.320855	0.1425620
next name	Campaign 04	campaignOther	225	123.76	95	33	0.3473684	32.831751	0.0342466

ngram	Campaign	Labels	Impressions	Cost	Clicks	Converted.clicks	cvr	md4	cd4
name word	Campaign 04	campaignOther	3191	726.32	403	44	0.1091811	101.78341	5.3318903
text word	Campaign 04	campaignOther	2910	671.88	373	12	0.0321716	55.18945	0.6132299
are there	Campaign 04	campaignOther	4855	712.08	272	18	0.0661765	38.16086	0.0151460
next name	Campaign 04	campaignOther	225	123.76	95	33	0.3473684	32.83175	0.0342466
next other	Campaign 04	campaignOther	1304	639.08	289	0	0.0000000	32.14190	0.5732076
are there	Campaign 04	campaignOther	4855	712.08	272	18	0.0661765	23.30375	0.0151460

Ngram Example

Anthony Contoleon

Thursday, August 27, 2015