How Click Streams are like DNA
When doing web analytics, it's easy to get overwhelmed with data. Using the proper data structures to store and analyze that data is key to keeping things manageable. In this short paper, I present a new idea for a data structure that models user click streams as a series of 'gene-like' events, which lends itself nicely to analysis. A click stream refers to all of the clicks a user makes on a website during a given browsing session. An example is a user opening an e-commerce website, navigating through a few pages, and ultimately exiting the website. Looking at click streams as an ordered series of small events allows us to capture user experiences over time, as opposed to the usual approach of reporting aggregate numbers at specified time points. It also allows us to apply operations we are used to, such as 'string matching', where we simply look for similar sequences of events to calculate the similarity between user experiences, which would otherwise be much more troublesome.
While not necessarily proven across websites, my leading theory is that a user's experience on a given website dictates their likelihood of paying money. For instance, if I am on an e-commerce site and I see the best-selling shirt, I may be more likely to buy it than if I see the least popular shirt. In this view, my probability of paying is dictated by the things I see on the website, and not necessarily by my pre-existing disposition (although that may play a factor as well).
Often, there is too much data to capture for every user. For instance, a user's mouse may move across thousands of pixels in a given session, a user may click hundreds of different times, and each of these events has a time characteristic associated with it.
So let’s start by making the model simple, and we’ll make a few iterations to make it better.
Let's say we took every URL on our website and assigned it a number; let's call this number the URL_ID. For a fictional site, we could make a table of all the URLs something like this:
Now, for each of these URLs, there are many actions a user can take (watch a movie, select text, press buttons, submit forms, etc.), but for simplicity, let's say that a user is limited to clicking links. So we would make another table that assigns every link on our website a specific number as well; let's call this number LINK_ID. We could then create a table like this:
LINK_ID | URL_ID
0       | 0       (mypage.com header link)
1       | 0       (mypage.com footer link)
2       | 0       (mypage.com main link for guys sweaters)
3       | 0       (mypage.com main link for girls sweaters)
4       | 1       etc…
In the table above, we show that a given URL has many links a user can click on.
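As a concrete sketch, the two lookup tables can be built as plain dictionaries. The URL list and the link labels below are hypothetical reconstructions (only the IDs that appear later in the walkthrough are taken from the text), so treat this as an illustration rather than a real site map:

```python
# Sketch: assign sequential integer IDs to URLs and links.
# The URL list is a hypothetical reconstruction of the fictional site.
urls = [
    "mypage.com",                # URL_ID = 0
    "mypage.com/store",          # URL_ID = 1
    "mypage.com/store/girls",    # URL_ID = 2
    "mypage.com/store/guys",     # URL_ID = 3
    "mypage.com/sweaters/guys",  # URL_ID = 4
]
URL_ID = {url: i for i, url in enumerate(urls)}

# Each link is identified by (URL_ID of the page it appears on, label).
# The last label is a made-up placeholder for a link on URL_ID 1.
links = [
    (0, "header link"),
    (0, "footer link"),
    (0, "main link for guys sweaters"),
    (0, "main link for girls sweaters"),
    (1, "banner link"),
]
LINK_ID = {link: i for i, link in enumerate(links)}

print(URL_ID["mypage.com"])            # 0
print(LINK_ID[(0, "footer link")])     # 1
```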
Now, let's model a user's click stream using our fancy new click event IDs. So let's take the scenario where a user comes in and navigates through these pages in this order:
(User types url into address bar and presses enter)
Now, we can translate this user's click stream into a series of integers. We do this simply by substituting in a LINK_ID for each link that the user clicked on.
So we started with:
and each of these click events can be translated into an integer just by looking up the proper URL_ID and LINK_ID in the table that we made previously for our website.
mypage.com (URL_ID = 0)
(LINK_ID = 1)
mypage.com/store/girls (URL_ID = 2)
(LINK_ID = 4)
mypage.com/sweaters/guys (URL_ID = 4)
In this case, if we just take the URL_IDs and LINK_IDs we have something that looks like this:
Let's just write those as an array of numbers, namely:
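As a sketch of this translation step (the `encode_stream` helper and the event-tuple format are my own assumptions, but the IDs match the walkthrough above):

```python
# Encode a click stream as the sequence of URL_IDs and LINK_IDs in click order.
def encode_stream(events):
    """events is a list of ('url', id) / ('link', id) tuples in order."""
    return [event_id for _, event_id in events]

# The session from the walkthrough above.
session = [
    ("url", 0),   # mypage.com (URL_ID = 0)
    ("link", 1),  # (LINK_ID = 1)
    ("url", 2),   # mypage.com/store/girls (URL_ID = 2)
    ("link", 4),  # (LINK_ID = 4)
    ("url", 4),   # mypage.com/sweaters/guys (URL_ID = 4)
]
print(encode_stream(session))  # [0, 1, 2, 4, 4]
```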
So in this case, we have condensed the user's click stream into a series of numbers. Cool! This captures the order of clicks and the order of URLs that the user experienced on our site. Now let's say we had another user, call him user2, who had a click stream for one session like:
Do you see any advantages to representing user clickstreams this way? Many things become apparent when we look at these two streams, here are a few:
- Both users' sessions ended on the 7 event
  - Maybe this is the main drop-off point of our website?
- Both users entered on the home page
  - Does our entrance page affect the experience?
- Both users visited pages 0, 2, 4, 5, 7
  - Are pages 0, 2, 4, 5, 7 the user flow that optimizes revenue for my business?
These are just a few questions that become easily apparent by analyzing click streams in this manner.
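As a sketch, checks like these become one-liners once streams are integer arrays. The two streams below are hypothetical stand-ins consistent with the observations above, not the actual streams:

```python
# Hypothetical click streams: both enter on page 0, exit on event 7,
# and visit pages 0, 2, 4, 5, 7.
user1 = [0, 2, 4, 5, 7]
user2 = [0, 2, 5, 4, 5, 7]

same_exit = user1[-1] == user2[-1]        # did both sessions end on the same event?
same_entry = user1[0] == user2[0]         # did both users enter on the same page?
shared = sorted(set(user1) & set(user2))  # pages visited by both users

print(same_exit, same_entry, shared)  # True True [0, 2, 4, 5, 7]
```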
If we wanted to compare the similarity of two user experiences:
user1 = [0,5,4,3,2]
user2 = [0,5,2,5,1]
We could simply use any of many well-known string-matching algorithms to compare how similar the experiences were: Levenshtein edit distance, Sellers edit distance, Hamming distance, longest common subsequence length, longest common substring length, or the pair distance metric. Each algorithm would give us a different similarity statistic.
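For example, here is Levenshtein edit distance applied to the two arrays above. This is the standard dynamic-programming implementation, nothing specific to click streams; it works on any sequences of comparable items:

```python
# Levenshtein edit distance: minimum number of insertions, deletions,
# and substitutions needed to turn sequence a into sequence b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,        # delete x
                            curr[j - 1] + 1,    # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

user1 = [0, 5, 4, 3, 2]
user2 = [0, 5, 2, 5, 1]
print(levenshtein(user1, user2))  # 3
```

A distance of 3 out of 5 events suggests the two experiences diverged after a shared opening; normalizing by the longer stream's length gives a rough similarity score.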
Imagine we had one event that we prized, such as a user buying an item at our store; let's say that event was 9, and we had two user click sessions:
In this case, we see two unique user experiences on our site that led them to buy an item from our store. One valid way to look at the data may be to find the longest common subsequence, and correlate that with a buying event, namely:
3,2,3,2,5 (longest common subsequence)
If this subsequence were highly correlated with users reaching '9', i.e. purchasing a product, we could then steer new visitors toward a similar experience and hopefully convert more people to buy at our store, increasing our buyer conversion. This leads to the title of the paper, in that we could call this longest common subsequence a 'gene' encoding for paying users.
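As a sketch of the gene-extraction step, here is a recursive longest-common-subsequence implementation. The two buyer sessions are hypothetical, constructed so that both reach the buy event 9 and share the subsequence from the example above:

```python
from functools import lru_cache

# Extract a candidate 'gene': the longest common subsequence (LCS) of the
# events two buying sessions share, in order but not necessarily contiguous.
def lcs(a, b):
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == len(a) or j == len(b):
            return ()
        if a[i] == b[j]:
            return (a[i],) + rec(i + 1, j + 1)
        left, right = rec(i + 1, j), rec(i, j + 1)
        return left if len(left) >= len(right) else right

    return list(rec(0, 0))

# Hypothetical sessions that both end in the prized buy event 9.
buyer1 = [0, 3, 2, 3, 1, 2, 5, 9]
buyer2 = [3, 2, 4, 3, 2, 5, 9]

# The shared events leading up to the purchase are the candidate gene.
gene = lcs(buyer1[:-1], buyer2[:-1])
print(gene)  # [3, 2, 3, 2, 5]
```

With many buying sessions, the same idea extends to counting how often each candidate gene precedes the buy event versus how often it appears in non-buying sessions.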
We could then change the events served to new users based on previous click data.
Now, much of this is being done and has been done many times already. The novel part of this idea is to look at events as a series of ordered integers, making it easy to pick out certain genes and match users to each other through their click streams.
So now we can make the model a little more sophisticated. Let's add multiple sessions, where each user session is a string like the ones we showed:
[user first session] [user second session] [user third session]
could look something like
[0,2,5,3,4,5,4] [2,3,4,5,2,3] [2,3,4,2,3]
In this case, a user may visit the site first on the weekend, not log in for many months, then log into the site on a Wednesday, and many months later log in again on a Tuesday. In other words, the timing of a user's actions would be a critical set of data to add. Let's do that. We use the acronym SBV (seconds between visits) to represent the time elapsed between user visits.
[0,2,5,3,4,5,4] SBV1 [2,3,4,5,2,3] SBV2 [2,3,4,2,3]
And of course, SBV is simply a number of seconds, so let's put that in:
[0,2,5,3,4,5,4] 3000 [2,3,4,5,2,3] 5000 [2,3,4,2,3]
However, instead of one long string like the one above, in some cases we would like to analyze the timing events separately from the click events themselves:
[SBV1, SBV2] = [3000,5000]
[SESSION1] [SESSION2] [SESSION3] = [0,2,5,3,4,5,4] [2,3,4,5,2,3] [2,3,4,2,3]
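A minimal sketch of this layout, using the session and SBV values above (the interleaving helper is my own construction):

```python
# Keep session strings and inter-session timing in parallel lists.
sessions = [[0, 2, 5, 3, 4, 5, 4], [2, 3, 4, 5, 2, 3], [2, 3, 4, 2, 3]]
sbvs = [3000, 5000]  # seconds between consecutive sessions

# Interleave them back into the single flattened form shown above.
flattened = []
for i, s in enumerate(sessions):
    flattened.append(s)
    if i < len(sbvs):
        flattened.append(sbvs[i])

print(flattened)
# [[0, 2, 5, 3, 4, 5, 4], 3000, [2, 3, 4, 5, 2, 3], 5000, [2, 3, 4, 2, 3]]
```

Keeping the two lists separate means string-matching algorithms can run on the session events alone, while the SBV list stays available for timing analysis.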
Adding additional data into the model can easily be done. This is only a start to thinking about user click streams as ‘genes’.
Here are some further ideas:
- It may be easy to generate unique IDs for URLs simply by taking an MD5 hash of the URL
- Finding the optimal subsequences of user experiences online
- Taking into account other events that a user can take on a web page
- Figuring out how to effectively use both timing and click sessions together in the same string
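As a sketch of the MD5 idea from the first bullet (the 8-character truncation is an arbitrary choice of mine; shorter IDs raise the chance of collisions):

```python
import hashlib

# Derive a stable URL ID by hashing the URL, rather than maintaining
# a sequential lookup table that must be kept in sync with the site.
def url_id(url):
    # Truncated to 8 hex chars just to keep IDs short.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:8]

print(url_id("mypage.com"))  # the same URL always yields the same ID
```

The trade-off versus sequential IDs is that hashed IDs are no longer small integers, so the gene strings become strings of hashes rather than arrays of small numbers.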