Part II: Analysis of New Data

books.csv contains information on customer purchases from amazon.com and barnesandnoble.com in 2007 (see variable domain). Various variables on customer characteristics are also in the dataset; information available on these variables is below. There are a few other variables in the dataset – the date of each purchase (date), the product purchased (product), the number of copies purchased (qty), and the price paid (price); we will not use date, product, and price for this project.

| Variable  | Description |
|-----------|-------------|
| education | ordered categorical (range: 0 to 5; higher values → higher education level) |
| age       | ordered categorical (range: 1 to 11; higher values → older) |
| income    | ordered categorical (range: 1 to 7; higher values → higher income) |
| region    | categorical (values: 1 to 4) |
| race      | categorical (values: 1 to 5) |
| country   | binary |
| child     | binary (1 → children in the household; 0 → no children in the household) |
| hhsz      | numeric (household size; range: 1 to 6) |

Suppose you are working for Barnes and Noble and would like to understand the factors that affect customer purchasing behavior there. In particular, you are interested in the questions below. Your objective is to leverage the modeling skills you have learned so far in this class to answer these business questions.
To achieve the objectives outlined, the following Python code uses the pandas library to read the books.csv file and manipulate the data into the required datasets.
```python
import pandas as pd

# Read the books.csv file
books_data = pd.read_csv('books.csv')

# Create the books01 dataset: number of purchases vs. number of people
books01 = books_data['qty'].value_counts().reset_index()
books01.columns = ['number_purchases', 'num_people']
books01.to_csv('books01.csv', index=False)

# Create the books02 dataset: one row per customer with total books purchased
books02 = books_data.drop(['date', 'product', 'price'], axis=1)
books02['count_books_purchased'] = books02.groupby('customer_id')['qty'].transform('sum')
books02 = books02.drop_duplicates(subset='customer_id').reset_index(drop=True)
books02.to_csv('books02.csv', index=False)

# Print the first and last 10 records of both new datasets
print("First 10 records of books01.csv:")
print(books01.head(10))
print("\nLast 10 records of books01.csv:")
print(books01.tail(10))
print("\nFirst 10 records of books02.csv:")
print(books02.head(10))
print("\nLast 10 records of books02.csv:")
print(books02.tail(10))
```
This code performs the following steps:

1. Reads books.csv into a DataFrame.
2. Builds books01 by tabulating how many people correspond to each number of purchases.
3. Builds books02 by dropping date, product, and price, summing qty per customer (count_books_purchased), and keeping one row per customer.
4. Saves both datasets to CSV and prints their first and last ten records.
This code should be run in a Python environment with the pandas library installed and the 'books.csv' file in the working directory.
To address each of the tasks specified, Python code and explanations for each step are provided below.
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Read the books01.csv file
books01 = pd.read_csv('books01.csv')

# Poisson regression
y = books01['number_purchases']
X = books01['num_people']
X = sm.add_constant(X)
poisson_model = sm.Poisson(y, X).fit()

# Print estimated parameters and maximum value of log-likelihood
print(poisson_model.summary())
print("\nMaximum value of log-likelihood:", poisson_model.llf)
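As written, sm.Poisson(y, X) regresses the purchase counts on the group sizes. If the goal is instead to fit a single Poisson rate to the grouped frequency data in books01, a minimal sketch (one possible reading of the intended model, not a claim about the code above) is an intercept-only GLM with num_people as frequency weights; the MLE of λ is then simply the weighted mean number of purchases. This reuses the imports and books01 from the block above:

```python
# Intercept-only Poisson on grouped data: each row of books01 represents
# num_people customers who each made number_purchases purchases.
grouped_fit = sm.GLM(
    books01['number_purchases'],
    np.ones(len(books01)),
    family=sm.families.Poisson(),
    freq_weights=books01['num_people'].astype(float),
).fit()
lam = np.exp(grouped_fit.params[0])  # exp(intercept) = estimated Poisson mean

# Equivalently, lambda-hat = sum(x * w) / sum(w)
lam_direct = (books01['number_purchases'] * books01['num_people']).sum() / books01['num_people'].sum()
print("lambda:", lam, lam_direct, "log-likelihood:", grouped_fit.llf)
```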
```python
# Develop a Poisson model using books02.csv (ignoring the independent variables)
books02 = pd.read_csv('books02.csv')
y2 = books02['count_books_purchased']   # purchases per customer
X2 = np.ones((len(y2), 1))              # intercept only
poisson_model_02 = sm.Poisson(y2, X2).fit()

# Compare estimated parameters and maximum value of log-likelihood
print(poisson_model_02.summary())
print("\nMaximum value of log-likelihood (books02):", poisson_model_02.llf)

# Predict the number of people with 0 to 20+ purchases from the fitted rate
from scipy.stats import poisson
lam = np.exp(poisson_model_02.params[0])           # fitted Poisson mean
k = np.arange(21)
predicted_people = len(y2) * poisson.pmf(k, lam)   # N * P(X = k)

# Graph the original and predicted number of purchases
import matplotlib.pyplot as plt
observed = y2.value_counts().sort_index()
plt.plot(observed.index, observed.values, label='Original')
plt.plot(k, predicted_people, label='Predicted (Poisson)')
plt.xlabel('Number of purchases')
plt.ylabel('Number of people')
plt.legend()
plt.show()
```
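The assignment asks to show the calculation for the case of 2 exposures. Under the Poisson model, the predicted number of people with exactly 2 purchases is N·P(X = 2) = N·e^(−λ)·λ²/2!. A minimal check, reusing lam and y2 from the block above:

```python
# N * e^(-lam) * lam^2 / 2!  -- the Poisson pmf at x = 2, scaled by N
n_two = len(y2) * np.exp(-lam) * lam**2 / 2
assert np.isclose(n_two, len(y2) * poisson.pmf(2, lam))
print("Predicted people with 2 purchases:", n_two)
```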
The lifetimes package does not expose a generic NBD fitter (NegativeBinomialFitter is not part of its API; its fitters target BG/NBD-style transaction models), so here the NBD parameters (r, α) are estimated by maximizing the log-likelihood directly with SciPy:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import nbinom

# NBD model using books01.csv
books01 = pd.read_csv('books01.csv')
x = books01['number_purchases'].values   # observed purchase counts
w = books01['num_people'].values         # people with each count

# NBD pmf: P(X=x) = Gamma(r+x)/(Gamma(r) x!) * (alpha/(alpha+1))^r * (1/(alpha+1))^x,
# which is scipy's nbinom with n = r and p = alpha/(alpha+1).
def nbd_negll(params):
    r, alpha = params
    return -np.sum(w * nbinom.logpmf(x, r, alpha / (alpha + 1.0)))

nbd_fit = minimize(nbd_negll, x0=[1.0, 1.0], bounds=[(1e-6, None), (1e-6, None)])
r_hat, alpha_hat = nbd_fit.x

# Estimated parameters and maximum value of log-likelihood
print("r:", r_hat, "alpha:", alpha_hat)
print("Maximum value of log-likelihood (NBD model):", -nbd_fit.fun)
```
```python
# NBD model using books02.csv (one observation per customer, weight 1)
books02 = pd.read_csv('books02.csv')
x2 = books02['count_books_purchased'].values

def nbd_negll_02(params):
    r, alpha = params
    return -np.sum(nbinom.logpmf(x2, r, alpha / (alpha + 1.0)))

nbd_fit_02 = minimize(nbd_negll_02, x0=[1.0, 1.0], bounds=[(1e-6, None), (1e-6, None)])
r2, a2 = nbd_fit_02.x

# Confirm estimated parameters and maximum value of log-likelihood
print("r (books02):", r2, "alpha (books02):", a2)
print("Maximum value of log-likelihood (NBD model - books02):", -nbd_fit_02.fun)

# Predict the number of people with 0 to 20+ purchases based on the NBD model
k = np.arange(21)
predicted_people_nbd = len(x2) * nbinom.pmf(k, r2, a2 / (a2 + 1.0))

# Graph the original and predicted number of purchases
import matplotlib.pyplot as plt
observed = pd.Series(x2).value_counts().sort_index()
plt.plot(observed.index, observed.values, label='Original')
plt.plot(k, predicted_people_nbd, label='Predicted (NBD)')
plt.xlabel('Number of purchases')
plt.ylabel('Number of people')
plt.legend()
plt.show()
```
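As with the Poisson model, the predicted value for 2 exposures can be shown explicitly. For the NBD, P(X = 2) = [r(r+1)/2] · α^r / (α+1)^(r+2), and the predicted number of people is N·P(X = 2). A quick check against scipy, reusing r2, a2, and x2 from above:

```python
# NBD pmf at x = 2: Gamma(r+2)/(Gamma(r) 2!) = r(r+1)/2
p_two = (r2 * (r2 + 1) / 2) * a2**r2 / (a2 + 1) ** (r2 + 2)
assert np.isclose(p_two, nbinom.pmf(2, r2, a2 / (a2 + 1.0)))
print("Predicted people with 2 purchases:", len(x2) * p_two)
```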
```python
# Reach: share of customers expected to make at least one purchase
p_zero = nbinom.pmf(0, r2, a2 / (a2 + 1.0))
reach = 1.0 - p_zero

# Average frequency: mean purchases among those reached; E[X] = r/alpha for the NBD
average_frequency = (r2 / a2) / reach

# GRPs = reach (in percent) x average frequency
grps = 100.0 * reach * average_frequency
print("Reach:", reach, "Average Frequency:", average_frequency, "GRPs:", grps)
```
To identify variables with missing values:
```python
# Identify independent variables with missing values
missing_values = books_data.isnull().sum()
print("Missing values in each variable:")
print(missing_values)
```
To handle missing values, if any: drop any variable with "many" missing values (e.g., more than 30% of rows), delete the affected rows when a variable has only "few" missing values (e.g., under 1% of rows), and replace the remaining missing values with the means of the corresponding variables.
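A minimal sketch of that logic, assuming the 30%/1% thresholds above (the cutoffs are a judgment call, not fixed by the assignment):

```python
n_rows = len(books02)
for col in ['education', 'age', 'income', 'region', 'race', 'country', 'child', 'hhsz']:
    n_missing = books02[col].isnull().sum()
    if n_missing == 0:
        continue
    if n_missing > 0.30 * n_rows:        # "many": drop the whole variable
        books02 = books02.drop(columns=col)
    elif n_missing < 0.01 * n_rows:      # "few": drop the affected rows
        books02 = books02.dropna(subset=[col])
    else:                                # otherwise: mean-impute
        books02[col] = books02[col].fillna(books02[col].mean())
```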
This Python code generates the new datasets from books.csv, structures them accordingly, and prints the first and last ten records of each.
```python
import pandas as pd

books_data = pd.read_csv('books.csv')

# books01: number of purchases vs. number of people, barnesandnoble.com only
books01 = books_data[books_data['domain'] == 'barnesandnoble.com']['qty'].value_counts().reset_index()
books01.columns = ['number_purchases', 'num_people']
books01.to_csv('books01.csv', index=False)

print("First 10 records of books01.csv:")
print(books01.head(10))
print("\nLast 10 records of books01.csv:")
print(books01.tail(10))

# books02: one row per customer with total books purchased and demographics
books02 = books_data.loc[books_data['domain'] == 'barnesandnoble.com'].drop(['date', 'product', 'price'], axis=1)
books02['count_books_purchased'] = books02.groupby('customer_id')['qty'].transform('sum')
books02 = books02.drop_duplicates(subset='customer_id').reset_index(drop=True)
books02.to_csv('books02.csv', index=False)

print("\nFirst 10 records of books02.csv:")
print(books02.head(10))
print("\nLast 10 records of books02.csv:")
print(books02.tail(10))
```
In the above code:

- books01.csv is generated by counting the number of purchases and the number of customers making the corresponding number of purchases at barnesandnoble.com.
- books02.csv is created by counting the number of books bought from barnesandnoble.com per customer while retaining the demographic variables from the original data. The columns 'date', 'product', and 'price' are dropped, as per the instructions.
- For both new datasets, the first and last ten records are printed.
2. Develop a Poisson model using books01.csv. Report your code, the estimated parameters and the maximum value of the log-likelihood (and any other information you believe is relevant).
3. Develop a Poisson model using books02.csv, i.e., by ignoring the independent variables available. Report your code and confirm that the estimated parameters and the maximum value of the log-likelihood are identical to those obtained with the Poisson model developed using books01.csv. Predict the number of people with 0, …, 20, 20+ visits based on the Poisson model. Explain how the predicted values are obtained using the case of 2 exposures (show your calculations). Graph the original and predicted number of visits.
4. Develop an NBD model using books01.csv. Report your code, the estimated parameters and the maximum value of the log-likelihood (and any other information you believe is relevant).
5. Develop an NBD model using books02.csv (again, ignoring the variables available). Report your code, and confirm that the estimated parameters and the maximum value of the log-likelihood are identical to those obtained with the NBD model developed using books01.csv. Predict the number of people with 0, …, 20, 20+ visits based on the NBD model. Explain how the predicted values are obtained using the case of 2 exposures (show your calculations). Graph the original and predicted number of visits.
6. Calculate the values of (i) reach, (ii) average frequency, and (iii) gross ratings points (GRPs) based on the NBD model. Show your work.
7. Identify all independent variables with missing values. How many values are missing in each? Drop any variable with many missing values (specify how you are defining "many"). If the number of missing values is very few (again, specify how you are defining "few"), delete the rows involved. For the remaining variables (if any), replace the missing values with the means of the corresponding variables. Explain the steps taken; report your code.
8. Incorporate all the available customer characteristics and estimate all relevant parameters for Poisson regression using MLE. Report your code, the estimated parameters and the maximum value of the log-likelihood (and any other information you believe is relevant). What are the managerial takeaways – which customer characteristics seem to be important? Predict the number of people with 0, …, 20, 20+ visits based on the Poisson regression. Explain how the predicted values are obtained using the case of 2 exposures (show your calculations). Graph the original and predicted number of visits.
To perform the requested analyses, I'll provide Python code along with explanations for each task:
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Read books01.csv
books01 = pd.read_csv('books01.csv')

# Create Poisson model
X = books01['number_purchases']
y = books01['num_people']
X = sm.add_constant(X)
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Output results
print(poisson_model.summary())
print("Maximum value of log-likelihood:", poisson_model.llf)
```
Values in books01.csv are used for y and X. The 'sm.add_constant' function adds an intercept to the model, which is typical practice for regression models. A Poisson model is then fitted using a GLM (generalized linear model) with the family set to Poisson.
```python
# Read books02.csv
books02 = pd.read_csv('books02.csv')

# Rebuild the purchase-frequency table from books02 so the model
# sees the same data structure as books01
freq = books02['count_books_purchased'].value_counts().reset_index()
freq.columns = ['number_purchases', 'num_people']

# Prepare data for the Poisson model
X2 = sm.add_constant(freq['number_purchases'])
y2 = freq['num_people']

# Fit Poisson model
poisson_model2 = sm.GLM(y2, X2, family=sm.families.Poisson()).fit()

# Output results
print(poisson_model2.summary())
```
The process here mimics the previous task, but with books02.csv. If the frequency table rebuilt from books02 matches books01 (see the caveat about per-transaction versus per-customer counts above), the coefficients and log-likelihood of the two models will match.
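A quick programmatic check of that claim, reusing the two fitted models above:

```python
# Both the coefficients and the maximized log-likelihood should coincide
print(np.allclose(poisson_model.params.values, poisson_model2.params.values))
print(np.isclose(poisson_model.llf, poisson_model2.llf))
```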
```python
import matplotlib.pyplot as plt

# Predict values for 0..20 purchases
pred_y = poisson_model.get_prediction(sm.add_constant(np.arange(0, 21)))
pred_mean = pred_y.predicted_mean

# Predicted number of people at 2 exposures
exposure_2 = pred_mean[2]

# Print predicted values and the 2-exposures value
print(pred_y.summary_frame())
print("2-exposures prediction:", exposure_2)

# Plot original vs predicted
plt.scatter(np.arange(0, 21), pred_mean, color='blue', label='Predicted')
plt.scatter(books01['number_purchases'], books01['num_people'], color='red', label='Original')
plt.xlabel('Number of purchases')
plt.ylabel('Number of people')
plt.legend()
plt.show()
```
The 'get_prediction()' method is used to predict the number of people with 0 to 20+ visits. The value for the case of 2 exposures is read off the fitted curve at x = 2. The 'scatter()' function plots the original values against the predicted values.
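The "20+" bucket is not produced by the fitted curve itself; under an explicit Poisson rate λ it comes from the tail of the distribution. A sketch, assuming books02 from the earlier block and taking the sample mean of the per-customer counts as the Poisson MLE:

```python
from scipy.stats import poisson

N = len(books02)                                  # number of customers
lam = books02['count_books_purchased'].mean()     # Poisson MLE of the mean
expected = N * poisson.pmf(np.arange(21), lam)    # expected people with 0..20 purchases
expected_20_plus = N * poisson.sf(20, lam)        # expected people with more than 20
```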
The same methods carry over to tasks 5, 6, 7, and 8. The NBD (negative binomial distribution) model can be fitted by maximizing its log-likelihood directly (e.g., with scipy.optimize, as sketched earlier). To handle missing data, the pandas functions isna(), dropna(), and fillna() are typically used. For the Poisson regression with customer characteristics, a GLM with a Poisson family and log link is used.
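For task 8, a minimal sketch of that regression, assuming books02 keeps the customer characteristics listed in the assignment and treating region and race as unordered categoricals via dummy coding (one reasonable specification, not the only one):

```python
import pandas as pd
import statsmodels.api as sm

books02 = pd.read_csv('books02.csv')

# Dummy-code the unordered categoricals; keep the ordered variables as numeric scores
X = pd.get_dummies(
    books02[['education', 'age', 'income', 'region', 'race', 'country', 'child', 'hhsz']],
    columns=['region', 'race'], drop_first=True, dtype=float,
)
X = sm.add_constant(X)
y = books02['count_books_purchased']

poisson_reg = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_reg.summary())
print("Maximum value of log-likelihood:", poisson_reg.llf)
# Coefficients that are large and statistically significant point to the
# customer characteristics that matter most for purchase counts.
```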