Project: Pet First
Sales Analysis and Visualization for Pet First: Uncovering the Impact of Repeat Purchases
In this comprehensive sales analysis project for Pet First, a leading pet supply retailer, I explore the transactional data to find insights and patterns. As requested by the stakeholder, I focused on understanding the crucial role of repeat purchases in driving the company’s revenue.
Methodology
Click to access directly
Technologies Used
- Workspace: Datacamp Workspace and Microsoft Excel
- Data cleaning and analyzing: DuckDB SQL, Excel
- Visualizations: Created in Excel, and Datacamp Workspace. In the Datacamp Workspace I used the workspace’s basic graph creator and Python (which included Matplotlib and Seaborn libraries).
Data Source and Management
I received a dataset of 1500 records, along with a schema document that included addition data cleaning requirements. Pet First requested an analysis of sales particularly in terms of repeat purchases and how they impact overall sales. They had been testing this program ‘for the last year’ and now want to know if it should be continued.
Imported the dataset to both my SQL workspace (which uses DuckDB) and into Excel to clean and organize data.
in Datacamp Workspace using DuckDB SQL
--clean column category
UPDATE p_sales
SET category = 'Unknown'
WHERE category = '-';
--clean column size
UPDATE p_sales
SET size = UPPER(SUBSTR(size, 1, 1)) ||
LOWER(SUBSTR(size, 2));
--clean column price
UPDATE p_sales
SET price = (SELECT QUANTILE(price, 0.5) FROM p_sales WHERE price NOT LIKE 'unlisted')
WHERE price = 'unlisted';
--clean column rating
UPDATE p_sales
SET rating = 0
WHERE rating = 'NA';
in Excel with Pivot Tables
Sales grouped by category for repeat purchase, animal based on size, and animal filtered with repeat purchase. Pivot table for sale counts on repeat purchase and overall totals. Lastly, pivot tables showing percentages based on repeat purchases and single purchases.
I began with a query to look at sales by category. I then ran standard descriptive statistics: median, mean, standard deviation, min and max on sales and price. Below are some examples.
Even though the columns price and sales are set as numeric, I had to CAST the columns, accordingly, to decimal. Which you don’t always have to do in other SQLs. However, this type of error code in DDB was easy to identify and fix.
-- correlated query to get sales by category based on single/repeat purchases and total sales
Select category, (SELECT ROUND(SUM(in_p.sales), 2)
FROM p_sales in_p WHERE repeat_purchase = 0 AND in_p.category = out_p.category) AS category_single_sales,
(SELECT ROUND(SUM(in_p.sales), 2)
FROM p_sales in_p WHERE repeat_purchase = 1 AND in_p.category = out_p.category) AS category_repeat_sales,
ROUND(SUM(out_p.sales), 2) AS category_total_sales
FROM p_sales out_p
GROUP BY out_p.category
ORDER BY category_total_sales DESC;
-- median and mean price
Select *
FROM p_sales;
SELECT QUANTILE(price, 0.5) AS median_pr, ROUND(AVG(CAST(price AS DECIMAL(10, 2))), 2) AS median_price
FROM p_sales;
-- standard deviation on sales and price
SELECT STDDEV(CAST(sales AS DECIMAL(10, 2))) AS sd_sales,
STDDEV(CAST(price AS DECIMAL(10, 2))) AS sd_price
FROM p_sales;
--then grouped by category
SELECT STDDEV(CAST(sales AS DECIMAL(10, 2))) AS sd_sales,
STDDEV(CAST(price AS DECIMAL(10, 2))) AS sd_price
FROM p_sales
GROUP BY category;
--then rounded
SELECT ROUND(STDDEV(CAST(sales AS DECIMAL(10, 2))),3) AS sd_sales,
ROUND(STDDEV(CAST(price AS DECIMAL(10, 2))),2) AS sd_price
FROM p_sales
GROUP BY category
Sales based on single and repeat purchases by category.
--sales by category based on repeat purchases
Select category, (SELECT ROUND(SUM(in_p.sales), 2)
FROM p_sales in_p WHERE repeat_purchase = 0 AND in_p.category = out_p.category) AS category_single_sales,
(SELECT ROUND(SUM(in_p.sales), 2)
FROM p_sales in_p WHERE repeat_purchase = 1 AND in_p.category = out_p.category) AS category_repeat_sales,
ROUND(SUM(out_p.sales), 2) AS category_total_sales
FROM p_sales out_p
GROUP BY out_p.category
ORDER BY category_total_sales DESC;
--highest sales in repeat purchase
SELECT category,
ANY_VALUE(animal) AS animal,
ANY_VALUE(size) AS size,
ROUND(AVG(CAST(price AS DECIMAL(10, 2))), 2) AS average_price,
SUM(sales) AS total_sales,
ROUND(AVG(CAST(rating AS INT)), 2) AS average_rating,
ANY_VALUE(repeat_purchase) AS repeat_purchase
FROM p_sales
WHERE repeat_purchase = 1
GROUP BY category, animal
ORDER BY total_sales DESC;
Using the Correlation Coefficient, results indicate a positive linear relationship between the two variables: mean of repeat and single purchase customers. The strong correlation indicates that when repeat or single purchase is high, the other is also.
--correlation of single vs repeat sales
WITH repeat AS (SELECT category,
AVG(CAST(sales AS DECIMAL(10, 2))) AS avg_sales
FROM p_sales
WHERE repeat_purchase = 1
GROUP BY category),
single AS (SELECT category,
AVG(CAST(sales AS DECIMAL(10, 2))) AS avg_sales
FROM p_sales
WHERE repeat_purchase = 0
GROUP BY category)
SELECT CORR(rp.avg_sales, sp.avg_sales) AS correlation_coefficient
FROM repeat rp
JOIN single sp ON rp.category = sp.category
Visualizations and Insights
Excel Dashboard
The dashboard and visualization data were taken from the below pivot charts.
Bar Chart
Mirroring the Excel visualizations, this bar chart from the DuckDB SQL workspace, shows overall Sales is led by the Equipment category .
Histograms
Python code using matplotlib and seaborn libraries.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv('petfirst_clean_pivot_csv.csv')
plt.figure(figsize=(10, 6))
sns.kdeplot(df['sales'], color='navy')
sns.histplot(df['sales'], kde=True, bins=35, color='limegreen')
plt.title('Histogram of Sales with Kernel Density Estimation')
plt.xlabel('Sales Transaction Amount')
plt.ylabel('Sales Count')
plt.grid(axis='y')
plt.show()
Added the Kernel Density Estimation line where the unimodal data shows that the underlying distribution is close to, but not exactly normal.
import numpy as np
import pandas as pd # Added this line to import pandas
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('petfirst_clean_pivot_csv.csv')
plt.figure(figsize=(10, 6))
mean_value = np.mean(df['sales'])
std_dev_value = np.std(df['sales'])
sns.histplot(df['sales'], kde=True, bins=40, color='darkorange') # Histogram with Kernel Density Estimation
plt.axvline(mean_value, color='r', linestyle='--', label=f'Mean: {mean_value:.2f}')
plt.axvline(mean_value + std_dev_value, color='g', linestyle='--', label=f'Std Dev: {std_dev_value:.2f}')
plt.axvline(mean_value - std_dev_value, color='g', linestyle='--')
plt.title('Histogram of Sales with Kernel Density Estimation')
plt.xlabel('Sales Transaction Amount')
plt.ylabel('Sales Count')
plt.grid(axis='y')
plt.legend()
plt.show()
Observations and Insights
The overriding insight between repeat purchases and total sales is positive; indicating a strong correlation between the two. Repeat sales are often a sign of customer satisfaction and customer loyalty. This is shown almost universally by the greater sales between single and repeat purchases regardless of whether observing by category or by animal.
Observations across the categories are imbalanced. However, as a group, the categories of medical, Housing, Food, and Toys, are fairly balanced. Equipment stands out as an outlier, being significantly higher and skewing the distribution towards this category (right-skewed). While Accessory and Unkown categories are low, adding to the imbalance.
Overall, repeat purchases account for 60% of sales, underscoring their successful and pivotal contribution to Pet First’s revenue stream. However, acquiring a more extensive set of sales transactions over a broader time span would be advantageous in understanding consumer habits over time, and sales trends. Based on the findings from this dataset, it can be inferred that the Repeat Purchase program is effective, and it is recommended to continue its implementation.
Additional technical notes:
Both histograms, the bar chart and the Kernel Density Estimation overlay show a single peak. The median closely follows the peak, which points to a left skew. However when combined with Skewness, calculating to .57, and Kurtosis, calculating to .65, the result is near-normal distribution, with a slight right skew.
Since this data did not come with dates, it’s uncertain if it is the cumulative sales for a given period. This is relevant because more sales transactions over a longer period might exhibit different distribution characteristics, potentially aligning the data with the Central Limit Theory. Thus, leading to a more normal distribution of sales figures, and to better interpretation and insights.