By Sarianne Gruber
A few months ago I was looking to find an affordable and flexible workspace in New York City. I Googled the key words “workspaces New York” and the seventh organic result caught my interest. It was a New York Times article about a workplace for entrepreneurs called Green Spaces. I immediately clicked on the link from within the article to find a green work space for environmentally-conscious businesses. Even better, the landing page offered a FREE DAY PASS for just submitting some basic demographics. Needless to say I have become a member and a “green” analyst.
I thought it would be apropos to introduce the Chi Square statistic (officially written with the lowercase Greek letter chi, as χ²) using web data from the Green Spaces website. Web analytics tracks just about every click on a website as simple counts, and the Chi Square can only be used for count data! Measurement items such as source of click, choice of offer, and sign-up for more information are variables with several possible categories. With two of these measurement items, the Chi Square statistic indicates whether the observed counts in the categories differ from what we would expect given the totals available. The best way to think about this test is to view the Chi Square as a measure of the relationship between two categorical variables. This is referred to as the test for independence.
Several steps are required.
- First, we work out the distribution of counts we would see if no relationship existed, which we call the “expected” distribution. Then we determine the difference between the observed and expected counts. This is step one, where we do all the calculations.
- Step two is the hypothesis test, where we determine whether a relationship exists. The null hypothesis H0 claims that there is no relationship and that any pattern in the observed counts occurred by chance. The alternative hypothesis Ha is the opposite and claims a relationship exists.
Keep in mind that the Chi Square test statistic is meant for large samples, which is not a problem given the amount of data tracked on the web. I will take you through when and how to use the Chi Square test using website data commonly tracked on a web analytics dashboard.
Impressions and clicks – The Observed and Expected in A/B Testing
Let’s say Green Spaces decides they want to increase membership for the spring. They have had moderate success with the current Free Day Pass offer on their landing page and would like to test a second offer that adds a Complimentary Event Pass to the Free Day. This “split” or “A/B” test is a very popular format for testing ads, landing pages, and even emails to see if the new offer does significantly better than the old one. In our example, each visitor is randomly shown one of the two offers on the landing page. If the visitor is interested, he or she can click through to register for the pass. The hosting site tracks the number of visitor views (or impressions) as well as the number of clicks through to the free day registration webpage.
In our hypothetical example, we want to see which offer is performing better after a week. The observed impression and click counts are tallied for each of the offers, and the data is set into a fourfold table, commonly known as a “2 x 2” contingency table. The two levels of the test item (the offers) are in the rows, while the two actions (click or no click) are in the columns. A third row and a third column hold the totals. Let’s say the New Offer (B) has 489 impressions and 27 registration clicks and the Original Offer (A) has 425 impressions with 38 registration clicks. The “No Clicks” count is the difference between the impressions and actual clicks for each offer: 462 for the New Offer and 387 for the Original Offer. Table 1a contains the count data for each category, which forms the body of the “Observed” counts table.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-Table-1b.jpg]
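As a rough sketch, the observed table can be assembled in a few lines of Python from the counts given in the text. The variable names and the dictionary layout are illustrative assumptions, not part of any analytics tool.

```python
# Observed counts from the example: impressions and registration clicks per offer.
impressions = {"New Offer (B)": 489, "Original Offer (A)": 425}
clicks = {"New Offer (B)": 27, "Original Offer (A)": 38}

# "No Clicks" is impressions minus clicks for each offer.
observed = {
    offer: {"Clicks": clicks[offer], "No Clicks": impressions[offer] - clicks[offer]}
    for offer in impressions
}

# Marginal totals: the third row and third column of the contingency table.
row_totals = {offer: sum(cells.values()) for offer, cells in observed.items()}
col_totals = {
    action: sum(observed[offer][action] for offer in observed)
    for action in ("Clicks", "No Clicks")
}
grand_total = sum(row_totals.values())

for offer, cells in observed.items():
    print(offer, cells, "Total:", row_totals[offer])
print("Totals", col_totals, "Grand total:", grand_total)
```

The column totals (65 clicks, 849 non-clicks) and the grand total of 914 impressions are the figures used in the expected-value calculation.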
To find out if there is an association between the rows and columns in the contingency table, we use the Chi Square statistic. Basically, does the offer make a difference to the volume of clicks? In other words, do the viewers of the Green Spaces website register for a Free Pass more often if you also include a Free Event Pass? Our null hypothesis states that the likelihood of clicking through to the Free Pass registration webpage is the same for both offers (the two variables are independent). Our alternative hypothesis is that the likelihood of clicking to register is not the same for the two offers.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/The-general-formula-for-Chi-Square.jpg]
For starters, how does one calculate the expected values? By using the common click rate of the total sample. Our null hypothesis tells us the click rates for the two offers are the same, so the best estimate of a common click rate is the total number of clicks divided by the total number of impressions, 65/914 (= 0.0711). Next, for the 489 New Offer views, the expected number of clicks is 489(65/914) = 34.78. The expected number of clicks for the 425 Original Offer views is 425(65/914) = 30.22. The calculations are repeated for the expected number of non-clicks, where the common non-click rate is 849/914. The expected values are presented in Table 1b.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-Table-1a.jpg]
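The expected counts can be reproduced with a short Python sketch. It assumes the table is held as a list of rows in the order [clicks, no clicks]; that layout is an illustrative choice.

```python
# Observed 2 x 2 table from the example: [clicks, no clicks] per offer.
observed = [
    [27, 489 - 27],   # New Offer (B)
    [38, 425 - 38],   # Original Offer (A)
]

row_totals = [sum(row) for row in observed]          # impressions per offer
col_totals = [sum(col) for col in zip(*observed)]    # total clicks, total non-clicks
n = sum(row_totals)                                  # total impressions

# Under the null hypothesis both offers share one click rate,
# so each expected cell is row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["New Offer (B)", "Original Offer (A)"], expected):
    print(label, [round(x, 2) for x in row])
```

The expected clicks come out at roughly 34.78 for the New Offer and 30.22 for the Original Offer, and each expected row still sums to that offer's impressions.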
Completing the Chi Square calculation is rather straightforward. The absolute value of (O-E) is the same for all four categories, about 7.78 (7.77 if truncated). We then solve for:
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Completing-the-Chi-Squared-Calculation.jpg]
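For readers who prefer to check the arithmetic in code, here is a minimal Python sketch of the sum of (O-E)²/E over the four cells. It uses the example counts from the text, and the variable names are illustrative.

```python
# Observed 2 x 2 table: rows are New Offer (B) and Original Offer (A),
# columns are clicks and no clicks.
observed = [[27, 462], [38, 387]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if the offer and the click outcome were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Observed minus expected for each cell; all four share the same magnitude.
differences = [
    [o - e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

# Chi Square statistic: sum of (O - E)^2 / E over every cell.
chi_square = sum(
    d ** 2 / e
    for d_row, e_row in zip(differences, expected)
    for d, e in zip(d_row, e_row)
)

print("O - E per cell:", [[round(d, 2) for d in row] for row in differences])
print("Chi Square:", round(chi_square, 2))
```

Without intermediate rounding the statistic is roughly 4.03. Rounding |O-E| down to 7.77 first gives the 4.02 quoted in the article.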
The Sustainable Chi Square Calculation
Very often you will see the Chi Square contingency table appear in an algebraic format. Table 2a shows the substitution of letters for the original numbers. Some analysts prefer using observed and expected values; for others, the algebraic version may be easier to calculate. The results are the same.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-Table-2-and-Chi-Square-formula.jpg]
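One common algebraic shortcut for a 2 x 2 table is N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). The sketch below assumes the cells are lettered a, b across the first row and c, d across the second; the lettering in Table 2a may differ, so treat the assignment as an assumption.

```python
def chi_square_2x2(a, b, c, d):
    """Chi Square for a 2 x 2 table with rows (a, b) and (c, d), no continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# a, b = New Offer (B) clicks and no clicks; c, d = Original Offer (A) clicks and no clicks.
statistic = chi_square_2x2(27, 462, 38, 387)
print(round(statistic, 2))
```

This gives the same value as the observed-versus-expected calculation, since the two formulas are algebraically equivalent for a 2 x 2 table.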
We now have our Chi Square statistic from both calculations: χ² is 4.02 (about 4.03 if the intermediate values are not rounded). On its own, this value has no meaning related to association. The association is expressed as the rejection of a hypothesis of no association. A Chi Square test for independence compares the value with the Chi Square distribution for the table's degrees of freedom, (r − 1)(c − 1), to decide whether to reject the null hypothesis. The degrees of freedom for a 2×2 table are 1.
Our value of 4.02 is above the critical value of 3.84 for rejecting at 95% confidence. We can claim that the click rates for the two offers are not the same and that the new offer had significantly fewer clicks than the original offer (P<.05): 27 of 489 impressions (5.5%) versus 38 of 425 (8.9%). Conclusion: the recommendation would be to stick with the Free Day Pass offer (A) rather than adding the free event pass (B). A customer looking to rent a desk would not necessarily be enticed by attending a lecture or movie. (Disclaimer: the data is fictitious, but our results suggest that for GreenSpaces this may be worth a live test.)
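The comparison with the critical value can also be expressed as a p-value. The sketch below relies on the fact that, with 1 degree of freedom, the upper tail of the Chi Square distribution equals erfc(√(x/2)); the function name is an illustrative assumption, and only Python's standard library is used.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a Chi Square value with 1 degree of freedom."""
    # With 1 degree of freedom, Chi Square is the square of a standard normal,
    # so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))


critical_value = 3.841   # 95% critical value for df = 1
statistic = 4.02         # value from the A/B example

p_value = chi_square_p_value_df1(statistic)
p_at_critical = chi_square_p_value_df1(critical_value)

print("p-value:", round(p_value, 4))
print("tail probability at the critical value:", round(p_at_critical, 4))
print("reject H0 at the 5% level:", statistic > critical_value)
```

The p-value lands a little under .05, which matches the decision reached by comparing 4.02 with 3.84.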
Analyzing Web Traffic – Several Sources and Expanded Categories
You may be interested in knowing how first-time visitors come to your website (by tracking the traffic source) and what they view or do once they are there. For this analysis you would select the Chi Square Test for Independence because it compares the counts of categories between two (or more) independent groups, in a table with r rows and c columns. Traffic Source has three distinct (or independent) buckets. I found the GreenSpaces website from a link in the New York Times article entitled “Green Work Spaces Attract Young Professional”, which makes the article a Referring Site. If I had typed the URL in the address bar, it would have been a Direct Traffic source; had I found the site directly from my keyword search with Google, it would have been a Search Engine source. My first click was on the Free Day Pass. Other website actions that we may want to look at are Membership Options, Member Journal (list of current members) and Events Calendar. Let’s look at what a weekly sample may look like from your dashboard.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-Table-3a.jpg]
We use the Chi Square test to see if there is a relationship between the way visitors get to the site and what they do (their “action”) once they arrive on your landing page. Our null hypothesis would state: H0, Traffic Source and Popular Content are independent; no relationship exists.
The alternative hypothesis would state: Ha, the Popular Content viewed by visitors is associated with the Traffic Source by which they reached the website.
An R x C contingency table format is set up in Table 3b. The Chi Square equation is:
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-R-x-C-contingency-data-table.jpg]
The expected value for cell a would be calculated as (a+b+c)(a+d+g)/N, that is, the cell's row total times its column total, divided by the grand total. Table 3b presents all the calculations for the Chi Square.
[Image: https://www.movedbymetrics.com/wp-content/uploads/2013/11/Chi-Table-3b.jpg]
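The same row-total-times-column-total rule extends to a table of any size. Below is a minimal Python sketch of a general function. The counts from Table 3a are not reproduced here, so the demonstration reuses the 2 x 2 A/B counts from earlier in the article; the function name is a hypothetical choice.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # Degrees of freedom: (rows - 1) * (columns - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Demonstration with the A/B counts: [clicks, no clicks] per offer.
statistic, df = chi_square_independence([[27, 462], [38, 387]])
print(round(statistic, 2), df)
```

Passing a table with three traffic sources and three content categories would return 4 degrees of freedom, in line with the df used for the traffic analysis.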
We can apply this math to get a better idea of web behavior. It is interesting to observe the differences in observed and expected for different levels of the categorical variables ‘traffic source’ and ‘content view.’
The Chi Square value of 1834.02 with df = 4 is greater than the critical value of 9.488 for rejecting at 95% confidence. We reject the null hypothesis and conclude that there is a relationship between Traffic Source and Content Viewed (P<.05). Nothing else can be inferred from this test alone. However, we can pursue further analysis of the frequency of actions for different traffic sources. For example, Direct Traffic sends 172 visitors to Membership Options, 79 more than expected. In other words, people who know the full name of the business might be the most serious.
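As a sanity check on the critical value, the upper tail of the Chi Square distribution with 4 degrees of freedom has the closed form exp(−x/2)(1 + x/2). The sketch below uses that identity with Python's standard library; the function name is an illustrative assumption.

```python
import math


def chi_square_p_value_df4(statistic):
    """Upper-tail probability of a Chi Square value with 4 degrees of freedom."""
    # For df = 4 the tail has a closed form: P(X > x) = exp(-x / 2) * (1 + x / 2).
    half = statistic / 2
    return math.exp(-half) * (1 + half)


p_at_critical = chi_square_p_value_df4(9.488)    # should sit near 0.05
p_value = chi_square_p_value_df4(1834.02)        # value from the traffic table

print("tail probability at 9.488:", round(p_at_critical, 4))
print("p-value for 1834.02:", p_value)

# The Direct Traffic / Membership Options cell quoted in the text:
observed_cell = 172
excess_over_expected = 79
expected_cell = observed_cell - excess_over_expected
print("expected count for that cell:", expected_cell)
```

For a statistic as large as 1834.02 the tail probability is too small to represent as a floating-point number, which is why the result is reported simply as P<.05.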
A Green Summary
Contingency tables are the best way to start looking at count data. Row and column layouts are a compact way to compare counts of two variables with several levels. The Chi Square Test of Independence is a good way to start thinking about the associations between counts. The tables can be made more sophisticated with the insertion of percentages and special formatting. Advanced analysis will glean more insights.
Special thanks to Marissa Feinberg, Owner of GreenSpaces NY, http://www.greenspaceshome.com/, for providing us with access to the web data. The above examples are a mix of actual and simulated data.