A tutorial using seaborn to produce insights from customer data
Whether you work in a B2C or D2C company, there is a good chance you will be asked to look at a sales dataset in order to create predictive models or to understand consumer behavior. In this post we’ll explore some questions we might want to ask of our data, and how to extract those answers as quickly as possible. To do that, we will write two functions to make our lives easier, as we will be plotting quite a few bar charts to answer different questions.
We’ll be using the Black Friday dataset, which can be found on the Analytics Vidhya website. The tools and the questions for this analysis are listed below.
You can find all the code I developed on my GitHub page.
Tools we will be using:
- Seaborn and matplotlib for data visualization
- Pandas and numpy for data analysis
Questions we want to answer:
Age Group Analysis:
- Total of Purchases by Age Group
- Purchase Mean by Age Group and City
Product Category Analysis:
- Number of Purchases of Each Product Category by City
- Purchase Mean of Product_Category 1 by City
- Purchase Mean of Product_Category 2 by City
- Purchase Mean of Product_Category 3 by City
Gender Analysis:
- Purchase Mean by Gender
- Purchase Mean by Gender and Age Group
- Purchase Mean by Gender and City
Occupation Analysis:
- Purchase Mean by Occupation
- Purchase Mean by Occupation and City
Correlation Between Ordinal Variables:
Although our first impression is that the data is well organized, there are still some missing values. The Product_Category columns (2 and 3) contain many NaN values instead of numbers. What this tells us is that each NaN means the product in that row does not belong to that specific category. For instance, the product in the first row (Product_ID = P00069042) is labeled 3 in category 1 and belongs only to that category. On the other hand, the product in the second row is labeled 1 in category 1, 6 in category 2 and 14 in category 3, which means that it belongs to all three categories.
In addition, we should be aware that groupby and pivot_table ignore NaNs automatically: aggregations such as count and mean skip missing values, and rows whose grouping key is NaN are left out by default. That makes our job significantly easier, since we will use these two techniques to group the data and compute statistics on it. It is also worth mentioning that, even though our data might contain some outliers, we will assume it has already been cleaned in this respect.
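To make the NaN behavior concrete, here is a minimal sketch on a toy DataFrame. The column names follow the dataset, but the rows and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the dataset's layout (values are made up).
toy = pd.DataFrame({
    "City_Category": ["A", "A", "B", "B"],
    "Product_Category_2": [6.0, np.nan, np.nan, 14.0],
    "Purchase": [100, 200, 300, 400],
})

# count() only tallies non-missing entries, so NaNs drop out of the result.
counts = toy.groupby("City_Category")["Product_Category_2"].count()

# Rows whose grouping key is NaN are left out by default.
by_category = toy.groupby("Product_Category_2")["Purchase"].mean()

# pivot_table treats NaN keys the same way.
pivot = toy.pivot_table(index="Product_Category_2", values="Purchase", aggfunc="mean")

print(counts)
print(by_category)
print(pivot)
```

Only the two rows with a real Product_Category_2 value take part in the grouped results, which is why we don’t need to fill or drop the NaNs by hand before aggregating.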
Age Group Analysis:
Total of Purchases by Age Group
We will begin this project by identifying the total number of purchases by age group and the purchase mean by age group and city. To do that, we first use the value_counts() method to count the number of purchases in each group and plot a simple bar chart. Second, we create a multilevel pivot table DataFrame whose index is made of the columns Age and City_Category, and then plot a bar chart.
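The two steps just described can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code behind the charts: the DataFrame is a toy stand-in with made-up values, and only the column names Age, City_Category and Purchase come from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "Age": ["26-35", "26-35", "36-45", "18-25", "26-35", "36-45"],
    "City_Category": ["A", "B", "C", "A", "C", "B"],
    "Purchase": [8000, 9000, 9500, 8700, 10000, 9100],
})

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(12, 4))

# Step 1: count the purchases (rows) in each age group and plot them.
purchases_by_age = df["Age"].value_counts().sort_index()
sns.barplot(x=purchases_by_age.index.tolist(), y=purchases_by_age.to_numpy(), ax=ax_count)
ax_count.set(xlabel="Age group", ylabel="Number of purchases")

# Step 2: multilevel pivot table indexed by Age and City_Category.
mean_by_age_city = df.pivot_table(
    index=["Age", "City_Category"], values="Purchase", aggfunc="mean"
)
mean_by_age_city["Purchase"].plot(kind="bar", ax=ax_mean)
ax_mean.set(xlabel="(Age group, City)", ylabel="Purchase mean")

fig.tight_layout()
```

In a notebook the figure renders on its own; in a script, call plt.show() at the end.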
The plot below shows that people between 26 and 35 years old take the top spot in number of purchases, followed by the 36–45 and 18–25 age groups, each of which has about half as many purchases as the first group.
Purchase Mean by Age Group and City
If we plot a straightforward visualization to answer this question, we get the graph below.
Even though the plot above gives us the information we want, it’s a bit confusing. We have to bear in mind that the easier the plot is to understand, the ‘happier’ the person reading it will be. As Data Scientists or Data Analysts, we have an obligation to deliver the message as clearly as possible, so that those who look at the plot or dashboard understand everything straight away. So, in order to make a clear comparison between those groups, we will plot each of them separately.
You might be thinking: “I don’t want to write code for each different plot, though!” You are right! Who has time for that?! Remember, time is money! So, let’s write one function to do it for us, and another one to build a list of axes with exactly the length we need for each visualization.
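Here is one way such a pair of helpers could look. Treat it as a sketch: the function names make_axes and plot_groups, their parameters, and the toy data are all hypothetical choices for this illustration, not necessarily what the final notebook uses.

```python
import itertools
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def make_axes(n_plots, n_cols=3, size=4):
    """Return a figure and a flat list with exactly n_plots axes."""
    n_rows = math.ceil(n_plots / n_cols)
    fig, grid = plt.subplots(
        n_rows, n_cols, figsize=(size * n_cols, size * n_rows), squeeze=False
    )
    axes = list(grid.ravel())
    # Delete the unused axes so the list length matches the number of plots.
    for extra in axes[n_plots:]:
        fig.delaxes(extra)
    return fig, axes[:n_plots]


def plot_groups(table, axes, ylabel="Purchase mean"):
    """Draw one bar chart per row of `table` (rows = groups, columns = bars)."""
    for ax, (group, row) in zip(axes, table.iterrows()):
        sns.barplot(x=row.index.tolist(), y=row.to_numpy(), ax=ax)
        ax.set_title(str(group))
        ax.set_ylabel(ylabel)


# Toy data (made-up values): one purchase per age group and city.
ages = ["0-17", "18-25", "26-35", "55+"]
cities = ["A", "B", "C"]
rows = [
    {"Age": age, "City_Category": city, "Purchase": 9000 + 100 * i}
    for i, (age, city) in enumerate(itertools.product(ages, cities))
]
df = pd.DataFrame(rows)

# One row per age group, one column per city.
table = df.pivot_table(
    index="Age", columns="City_Category", values="Purchase", aggfunc="mean"
)
fig, axes = make_axes(len(table))
plot_groups(table, axes)
fig.tight_layout()
```

Because plot_groups only needs a table with groups as rows and bars as columns, the same two calls can be reused for any later question by changing the index and columns passed to pivot_table.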
Aren’t those plots easier to understand than the one we made before? Now we can clearly see that, for example, from age 0 to 55 the mean purchase in city C is greater than in the other cities. We can also see that city B takes the first spot only in the 55+ age group. In short, we can get a lot of information from those plots, and those who need them to make business decisions will probably be glad.
Product Category Analysis:
Our second task is to analyse the categories and the products. We will start off by plotting a bar chart with the total number of purchases of each category in each city. Then, we will use the functions we have built to plot individual charts of all products for each category. It’s important to mention that one product may belong to different categories and groups. For example, the product P00248942 belongs to all categories, but is labeled as group 1 in the first category, group 6 in the second and group 14 in the third. With that said, let’s start our analysis.
Number of Purchases of Each Product Category by City:
The chart below shows a clear pattern in the number of purchases. As we can see, product category 1 takes the top spot, followed by product category 2 and product category 3, in all of the cities. What sets the cities apart is that consumers in city B buy more products in every category than consumers in cities A and C.
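One way to get the counts behind a chart like this is to count the non-missing entries of each category column per city. The snippet below is a minimal sketch on toy rows with made-up values; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); NaN means "not in this category".
df = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", "A"],
    "Product_Category_1": [3, 1, 12, 8, 5, 1],
    "Product_Category_2": [np.nan, 6, np.nan, 14, 2, np.nan],
    "Product_Category_3": [np.nan, 14, np.nan, np.nan, 5, np.nan],
})

category_cols = ["Product_Category_1", "Product_Category_2", "Product_Category_3"]

# count() skips NaNs, so each cell is the number of purchases in that
# category for that city.
counts = df.groupby("City_Category")[category_cols].count()

# Grouped bar chart: one cluster of bars per city, one bar per category.
ax = counts.plot(kind="bar", rot=0)
ax.set_ylabel("Number of purchases")
```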
Purchase Mean of Product_Category 1 by City:
From now on, you will notice that for each visualization we will be grouping the DataFrame to select the data we want to visualize, and then applying the two functions we have created.
Well, we won’t be scrutinizing all those plots. Instead, let’s just focus on the main information we can get from them. The first thing that catches my eye is product 19, with a much lower purchase mean than the other products (less than 40 USD on average). In addition, we can notice that product 4 has the greatest purchase mean, with city C taking the top spot, followed by city B, which has a slight advantage over city A.
Purchase Mean of Product_Category 2 by City:
Let’s now analyse all products of category 2. As mentioned in the description of the data, a product may belong to more than one category, and we can actually notice this by looking at the plots below. We have a group of 17 products that also belong to the first category and, as can be seen, product 10 has the greatest purchase mean. Once again, city C takes the first position, followed by cities B and A (similar to product category 1).
Purchase Mean of Product_Category 3 by City:
The plots below show information regarding the third product category. We can see that there are 15 products in this category and product 3 has the greatest purchase mean in all of the cities.
Gender Analysis:
Well, we all know how relevant it is to understand the impact of gender on purchase decision-making. In fact, many studies have raised questions about, for example, who makes the purchase decisions in a household. It also seems important to inspect gender differences in purchase decision-making styles and patterns with respect to product categories, price and other characteristics.
That said, let’s work on our data analysis and get some simple but relevant insights into customer behavior. First, we will plot the total number of purchases and the mean purchase of each gender. Then, we will plot the purchase mean by age group, and finally we will see how men and women differ from each other in terms of product acquisition in cities A, B and C.
Number of Purchases and Purchase Mean by Gender:
Now, looking at the graphs below, we can see that the number of purchases made by male consumers (414259) is about three times the number made by female consumers (135809). We can also notice that, surprisingly, there’s no big difference in the purchase mean of the two genders. Despite the evident disparity in the numbers for men and women, the purchase mean of male consumers is only slightly higher than that of female consumers (roughly 700 USD). Based on that, I would say that female consumers are really significant for the business and, for instance, could be the target of special marketing campaigns.
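As a quick check on the ratio quoted above, 414259 / 135809 ≈ 3.05. Both numbers in this comparison can come from a single groupby; the sketch below uses a toy DataFrame whose values are made up (chosen so the ratio and the gap are easy to read), with only the column names Gender and Purchase taken from the dataset.

```python
import pandas as pd

# Toy purchases (made-up values), three by men and one by a woman.
df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F"],
    "Purchase": [9000, 10000, 9500, 8800],
})

# Number of purchases and purchase mean per gender in one pass.
summary = df.groupby("Gender")["Purchase"].agg(["count", "mean"])

# How many times more purchases men made, and the gap between the means.
count_ratio = summary.loc["M", "count"] / summary.loc["F", "count"]
mean_gap = summary.loc["M", "mean"] - summary.loc["F", "mean"]

print(summary)
print(count_ratio, mean_gap)
```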
Purchases Mean by Gender and Age Group:
To compare the purchase patterns of different age groups and genders, we can look at the bar charts below. They show that, for customers aged 18 to 25, the purchase mean of men is more than 1000 USD greater than the purchase mean of women. We can also see that male consumers take the first spot in all age groups, though the difference is not as noticeable as in the 18–25 age group.
Purchase Mean by Gender and City:
The last gender analysis consists of understanding how purchases were made by both genders in each city. To do that, let’s take a moment to analyse the bar charts below, which depict exactly what we are looking for. They show that men spend more money than women in all of the cities (the difference is significant only in city C), and that city C takes the first spot in purchase mean for both genders. It is also worth noting that men based in city B bought more than those based in city A, whereas women from city A acquired more than those who live in city B.
Occupation Analysis:
Do you think that a consumer’s occupation has an impact on their buying behavior? Well, I would expect some association between occupation and the amount of money consumers spend. Let’s examine this by plotting bar charts.
Let’s start by plotting the purchase mean of each occupation. This bar chart gives us general information about the purchase habits of consumers according to their occupation.
Looking at the plot below, we can notice that occupations 12, 15 and 17 take the top three spots in purchase mean. We can also see that occupations 9, 19 and 20 spend less money than the others. Is this information really helpful? Can we get more insight by looking at the amount of money each occupation spends by city? Well, this is exactly what we see in the subsequent visualization.
Although the first bar chart gives us important information, the second visualization gives us a better understanding of the money spent by each occupation in cities A, B and C.
Occupation by City:
The bar charts below depict the purchase mean of each occupation in each city. It’s interesting that city C has the greatest purchase mean in almost all of the occupations. We see a change in this pattern only for occupations 8, 9 and 19. Another interesting fact is that the greatest amount of money overall was spent in city A by people whose occupation is number 8.
So, the point is that if we separate each occupation into its own chart, we are more likely to find insights easily. This also makes it especially easy for colleagues on other teams (e.g. the Marketing department) to use our findings. We can simply export the charts to a plots folder, and people can browse the images and drag and drop them right into a PowerPoint presentation or other report.
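Exporting a chart takes only a couple of lines with matplotlib. The sketch below assumes a folder named plots next to the script and uses made-up purchase means; the folder name, file name and values are all hypothetical.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Hypothetical output folder; created if it does not exist yet.
plots_dir = Path("plots")
plots_dir.mkdir(exist_ok=True)

# Made-up purchase means per city, for illustration only.
means = {"A": 8900, "B": 9100, "C": 9700}

fig, ax = plt.subplots(figsize=(4, 4))
ax.bar(list(means.keys()), list(means.values()))
ax.set_title("Occupation 8")
ax.set_ylabel("Purchase mean")

# Save as a PNG that can be dropped into a slide deck or report.
out_path = plots_dir / "occupation_8.png"
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```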
Correlation Between Ordinal Variables:
A heatmap is a graphical representation of data that uses a color-coding scheme to represent different values. In our case, we are generating a heatmap of the correlations between the features of our dataset.
In order to generate a heatmap, we have to numerically encode the categorical data. In other words, we have to convert the columns Product_ID, Gender, Age, City_Category and Stay_In_Current_City_Years into numerical columns, as shown in the lines of code below.
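A minimal sketch of this kind of encoding, followed by the heatmap itself, could look like the following. It uses pandas category codes, which is one of several possible encodings and an assumption on my part; the rows are toy data with made-up values, and only the column names and the two product IDs come from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows (made-up values) with the same column names as the dataset.
df = pd.DataFrame({
    "Product_ID": ["P00069042", "P00248942", "P00069042",
                   "P00248942", "P00069042", "P00248942"],
    "Gender": ["F", "F", "M", "M", "M", "F"],
    "Age": ["0-17", "0-17", "55+", "26-35", "46-50", "26-35"],
    "City_Category": ["A", "A", "C", "B", "B", "A"],
    "Stay_In_Current_City_Years": ["2", "2", "4+", "1", "0", "3"],
    "Marital_Status": [0, 0, 1, 1, 0, 1],
    "Purchase": [8400, 15200, 8000, 1400, 19200, 15200],
})

categorical = ["Product_ID", "Gender", "Age", "City_Category",
               "Stay_In_Current_City_Years"]

encoded = df.copy()
for col in categorical:
    # Labels are sorted before being numbered 0, 1, 2, ..., so ordered
    # labels such as "0-17" < "26-35" < "55+" keep their order.
    encoded[col] = encoded[col].astype("category").cat.codes

# Pairwise correlations between all (now numeric) columns.
corr = encoded.corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

Category codes are convenient here because the Age and Stay_In_Current_City_Years labels happen to sort in their natural order. For a column like Product_ID the codes are arbitrary, so its correlations should be read with caution.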
After the encoding, we end up with the graph below. It’s interesting to note that there is a positive correlation between the product categories, which is exactly what we might expect to see, since a product may belong to several categories. In addition, the correlation between age and marital status is another interesting point to mention.
We will finish this analysis by plotting a clustermap. Clustermaps use hierarchical clustering to group features together by how closely related they are. This makes the correlations between variables especially informative when we are analyzing the relationships between them. In our case, we have 11 features that we want to investigate, so instead of eyeballing the heatmap to see which variables are positively or negatively associated, we let the graph be segmented into clusters, which is easier to analyse.
Looking at the plot above, we can see that the clustering algorithm groups product_category 2 and 3 together, while age and marital status form another strongly associated cluster. We can observe that by looking at the links between them. These links are formed first and have the shortest branches, which indicates that the features they join are more similar to each other than those joined by taller branches.
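To see how the branch order relates to the clusters, here is a small sketch with seaborn’s clustermap (which needs SciPy installed). The data is synthetic: random numbers built so that two pairs of columns are strongly correlated. The column names only echo the dataset; none of the values come from it.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two pairs of columns share a common signal, so each
# pair is strongly correlated; "Purchase" is independent noise.
rng = np.random.default_rng(0)
n = 200
signal_a = rng.normal(size=n)
signal_b = rng.normal(size=n)
toy = pd.DataFrame({
    "Age": signal_a + rng.normal(scale=0.3, size=n),
    "Product_Category_2": signal_b + rng.normal(scale=0.3, size=n),
    "Purchase": rng.normal(size=n),
    "Marital_Status": signal_a + rng.normal(scale=0.3, size=n),
    "Product_Category_3": signal_b + rng.normal(scale=0.3, size=n),
})

corr = toy.corr()

# Hierarchical clustering reorders rows and columns so that similar
# features end up next to each other, with dendrograms on the margins.
grid = sns.clustermap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                      vmin=-1, vmax=1, figsize=(6, 6))

# Feature order after clustering, read from the row dendrogram.
reordered = [corr.index[i] for i in grid.dendrogram_row.reordered_ind]
print(reordered)
```

In the reordered list, each correlated pair sits side by side even though the columns were interleaved in the input, which is the same effect we read off the short branches in the real clustermap.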
I hope you enjoyed this analysis as much as I did. We’ve seen how writing functions to quickly generate visualizations of our findings makes our lives easier when we need to produce insights as quickly as possible. Feel free to comment if you have an even more efficient way of doing this.