This article is about Apache Hive Retail Analytics and its capabilities.
- The retail industry is big business today and will remain so for the foreseeable future. Recent estimates put worldwide retail sales at USD 7.5 trillion.
- Wal-Mart has long been the global leader. The world's top five retailers are Wal-Mart (USA), Carrefour (France), Royal Ahold (The Netherlands), Home Depot (USA) and Kroger (USA).
- In India, the retail industry is growing rapidly. Major Indian retailers in this league include Future Group, Reliance Industries, Tata Group and Aditya Birla Group.
Let us look at the overall approach to Apache Hive Retail Analytics below.
Apache Flume Solved the Data Transfer Problem
- Apache Flume is a reliable, distributed, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
- The primary use case for Apache Flume is as a logging system that gathers a set of log files from every machine in a cluster and aggregates them into a centralized, persistent HDFS store.
- Flume’s typical dataflow is as follows: a Flume Agent is installed on each node of the cluster that produces log messages. The streams of log messages from every node are then sent to the Flume Collector.
- Logs from every node can be sent to HDFS on a real-time, daily, weekly, or monthly basis. We chose to send some important logs weekly, mainly because the requirement was analytical in nature, so a daily transfer was not warranted.
Apache Hive Retail Analytics
- Hive is mainly a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis.
- It provides an SQL-like language called HiveQL and converts queries into MapReduce tasks.
- Hive uses Schema on Read, unlike a traditional database, which uses Schema on Write. Schema on Write means that a table’s schema is enforced at data load time: if the data being loaded does not conform to the schema, it is rejected. With Schema on Read, the data is checked against the schema only when a query is run.
- When reading log files, the simplest approach during Hive table creation is to use a RegexSerDe.
- A RegexSerDe uses regular expressions to serialize and deserialize data: it deserializes each record with a regex and extracts the capture groups as columns.
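As a sketch of what such a table definition might look like, the DDL below maps a simple access-log layout onto columns using Hive's built-in RegexSerDe. The column names, regex, and HDFS path are illustrative assumptions, not taken from the article:

```sql
-- Sketch only: log layout, column names, and location are assumed for illustration.
CREATE EXTERNAL TABLE access_logs (
  host         STRING,
  request_time STRING,
  request      STRING,
  status       STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column, matched against each log line
  "input.regex" = "([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\" ([0-9]*)"
)
LOCATION '/flume/logs/';
```

Lines that fail to match the regex yield NULL columns rather than load-time errors, which is Schema on Read in action.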
Overall Solution Architecture using Flume + Hive
- Flume was used to transfer and collect log files to HDFS
- Hive was used for analysis
The user information, merchandise details, time of transaction, city/area/state information, coupon codes, customer data, and other related details were collected and aggregated from various back-end servers.
Flume was installed on these back-end servers to ship the various log files into HDFS. Flume was configured to transfer data on a weekly basis in this case.
Flume Configuration: Following Flume parameters were configured (sample)
- flume.event.max.size.bytes uses the default value of 32 KB.
- flume.agent.logdir was changed to point to an appropriate HDFS directory
- flume.master.servers: 3 Flume Masters – flumeMaster1, flumeMaster2, flumeMaster3
- flume.master.store uses the default value – zookeeper
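Putting the parameters above together, a configuration fragment along these lines could be used; the HDFS path is a placeholder, and the properties left at their defaults (flume.event.max.size.bytes, flume.master.store) are simply omitted:

```xml
<!-- Sketch of a Flume site configuration using the parameters listed above.
     The HDFS directory is an illustrative placeholder. -->
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>flumeMaster1,flumeMaster2,flumeMaster3</value>
  </property>
  <property>
    <name>flume.agent.logdir</name>
    <value>hdfs://namenode/flume/agent-logs</value>
  </property>
</configuration>
```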
Hive Configuration: Following Hive parameters were configured (sample)
- javax.jdo.option.ConnectionDriverName: set the value to “com.mysql.jdbc.Driver”
By default, Hive metadata is stored in an embedded Derby database, which allows only one user to issue queries at a time. This is not ideal for production use.
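A metastore backed by MySQL, as implied by the driver setting above, might be configured roughly as follows; the host name, database name, and credentials are placeholders, not values from the article:

```xml
<!-- Sketch of a hive-site.xml fragment pointing the metastore at MySQL.
     Host, database name, and credentials below are illustrative placeholders. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```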
Using the Hadoop system, the weekly log transfer time was reduced to 3 hours, and querying time also improved significantly.
- Count the number of Transactions
SELECT COUNT(*) FROM facts;
- Count the number of Distinct users by Gender
SELECT gender, COUNT(DISTINCT customer_id) FROM customers GROUP BY gender;
Only equality joins are supported in Hive: inner joins, outer joins, semi joins, and map joins. Hive doesn't support join conditions that are not equality conditions, because such conditions are very difficult to express as a MapReduce job.
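When a non-equality condition is genuinely needed, one common workaround is a cross join followed by a WHERE filter, which is feasible only when one side is small. The price-band tables and columns below are hypothetical, invented purely to illustrate the pattern:

```sql
-- Hypothetical example: classifying each transaction into a price band
-- requires a range condition, which cannot go in an ON clause.
-- Workaround: cross join the small lookup table, then filter in WHERE.
SELECT f.transaction_id, b.band_name
FROM facts f
CROSS JOIN price_bands b
WHERE f.amount >= b.min_amount
  AND f.amount <  b.max_amount;
```

Because the cross join materializes every row pairing before filtering, this only scales when the lookup table (here, price_bands) has few rows.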
- List the Category to which the Product Belongs
SELECT products.product_id, products.product_name, categories.category_name FROM products JOIN categories ON (products.product_category_id = categories.category_id);
- Count of the Number of Transactions from every Location
SELECT locations.location_name, COUNT(*) FROM facts JOIN locations ON (facts.location_id = locations.location_id) GROUP BY locations.location_name;
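Building on the per-location count above, a natural follow-up (sketched here using the same table and column names; the limit of ten is an arbitrary choice) is to rank locations by transaction volume:

```sql
-- Ten busiest locations by transaction count, reusing the join above.
SELECT locations.location_name, COUNT(*) AS txn_count
FROM facts
JOIN locations ON (facts.location_id = locations.location_id)
GROUP BY locations.location_name
ORDER BY txn_count DESC
LIMIT 10;
```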
Locus IT understands the unique requirements of the retail sales industry and provides Apache Hive Retail support services, Apache Hive Retail implementation services, Apache Hive Retail training, and Apache Hive Retail customization.
Our support staff are experienced, dedicated, and ready to respond quickly to your needs. For more details, please contact us.