E-commerce companies use cohort analysis to measure how their brand performs over a period of time and which cohort groups are most valuable to the brand. The outcome of a cohort analysis can be used to spot changes or patterns in customer behavior throughout their journey, and to explain why a change occurred by comparing the results with the marketing events calendar.
A cohort is a group of people who share a certain characteristic, usually (but not always) within a set period of time. The practice of studying the habits or activities of specific cohorts over a set period is called cohort analysis. Most marketers prefer a month-by-month analysis, but it can be performed at the day or week level as well.
Cohort analysis on Apache Spark is comparable to other sophisticated cohort analysis tools: it is extremely powerful, well equipped to tackle huge datasets efficiently, and has emerged as a much more accessible and compelling alternative.
Apache Spark is a unified analytics engine for large-scale data processing. It runs roughly 100x faster than Hadoop MapReduce thanks to in-memory computation and Spark's other core components: the Directed Acyclic Graph (DAG) scheduler, the Catalyst query optimizer, and its physical execution engine.
Two main abstractions in Spark need to be discussed before diving into the implementation of this analysis: RDDs (Resilient Distributed Datasets) and Data Frames.
1. Resilient Distributed Datasets (RDD)
An RDD (Resilient Distributed Dataset) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. These entities live in memory and are immutable by nature, so applying a transformation to an RDD always creates a new RDD. If an action or transformation on an RDD fails during the execution phase, Spark can roll back to the previous functioning state and re-run the same operation on another node, which is how it achieves fault tolerance.
2. Data Frames
A Resilient Distributed Dataset by nature has no schema attached to it, but it can be extended with the help of Data Frames. A Data Frame adds schema information to the dataset it contains, which helps in handling structured data. It is generally faster than an RDD, since Spark has more information about the dataset it holds.
Creating Spark Session and Data Frame
Spark Session provides a single point of entry to interact with Spark functionality and allows programming Spark with the Dataset and DataFrame APIs. There is no need to create separate contexts to use the Hive, SQL, and Streaming APIs, because Spark Session includes all of them.
Join Product View Events and Marketing Events
To find the customers who viewed a product after arriving from marketing, two Data Frames are created: one to store view actions and another to store UTM source events. By joining both Data Frames on the common user IDs, we get a new Data Frame that contains all product view events together with their corresponding marketing medium source.
Spark uses something called lazy evaluation, which means transformations are applied only when an action such as take(), show(), count(), or write() is triggered on the RDD.
When the write action is triggered, all the transformations applied to the RDD are processed according to the logical execution plan. Once the data is partitioned by source, we see a separate folder for each source. With the sources divided, we can further partition each source week by week and check whether the users present in the first week are still returning in the following weeks.