Tag: Data Analysis

  • Data Science Project: Predictive Modeling Explained


    As a data enthusiast, I often find myself diving into the world of predictive modeling and machine learning. Recently, I embarked on a project that involved creating and refining various regression models using Python. The journey not only enhanced my technical skills but also deepened my understanding of how different approaches to modeling can impact results. In this blog, I’ll share my experiences, insights, and the Python code I used to achieve these results.

    Understanding the Data

    The first step in any data science project is understanding the data at hand. For this project, I worked with a dataset that included various features of cars, such as year, mileage, tax, mpg, and engineSize. My goal was to predict the price of the cars based on these features.

    Data Preparation

    Before jumping into modeling, I needed to prepare my data. This involved cleaning and transforming it, then splitting it for training and evaluation. Here’s how I approached this task in Python:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV
    
    # Load the dataset
    df = pd.read_csv('car_data.csv')
    
    # Split the data into features and target
    features = ['year', 'mileage', 'tax', 'mpg', 'engineSize']
    target = 'price'
    
    X = df[features]
    y = df[target]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    In the code above, I used the train_test_split function to divide the dataset into training (80%) and testing (20%) sets. This is crucial for evaluating the model’s performance later.
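
    The cleaning and transformation mentioned above are dataset-specific and not shown in the snippet. As a rough sketch (the exact handling of missing or implausible rows here is an assumption, not necessarily the steps I ran), it might look like this, placed between loading the CSV and building X and y:

    # Rough sketch of a cleaning step (assumed handling)
    cols = ['year', 'mileage', 'tax', 'mpg', 'engineSize', 'price']
    
    # Coerce everything to numeric and drop rows that are missing or fail to parse
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    df = df.dropna(subset=cols)
    
    # Drop obviously implausible rows, e.g. non-positive prices
    df = df[df['price'] > 0]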

    Model Development

    Base Model

    I began my modeling journey with a simple linear regression model to establish a baseline.

    from sklearn.linear_model import LinearRegression
    
    # Create and fit the linear regression model
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = lin_reg.predict(X_test)
    
    # Calculate R² and MSE
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f'R²: {r2}, Mean Squared Error (MSE): {mse}')

    Running this code produced the following results:

    • R²: 0.6917
    • Mean Squared Error (MSE): 6912744.91

    These metrics indicated that the linear regression model did a decent job of predicting the car prices based on the features.
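
    Because a single train/test split can be noisy, a quick k-fold cross-validation of the same baseline is a useful sanity check. This snippet is a supplementary sketch rather than part of the original workflow:

    from sklearn.model_selection import cross_val_score
    
    # 5-fold cross-validated R² for the baseline linear regression
    cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring='r2')
    print(f'Cross-validated R²: {cv_r2.mean():.4f} ± {cv_r2.std():.4f}')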

    Polynomial Features

    Next, I decided to explore the impact of polynomial features to see if I could enhance the model’s performance.

    # Create a pipeline with polynomial features and linear regression
    poly_pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    
    # Fit the model
    poly_pipeline.fit(X_train, y_train)
    
    # Predict on the test set
    y_poly_pred = poly_pipeline.predict(X_test)
    
    # Calculate R² and MSE
    poly_r2 = r2_score(y_test, y_poly_pred)
    poly_mse = mean_squared_error(y_test, y_poly_pred)
    
    print(f'Polynomial Model R²: {poly_r2}, Mean Squared Error (MSE): {poly_mse}')

    The results showed a clear improvement over the baseline:

    • R²: 0.7667
    • MSE: 5234038.0655

    The polynomial terms captured non-linear relationships that the plain linear model missed, lifting R² from 0.6917 to 0.7667 and cutting the MSE by roughly a quarter, at the cost of a more complex model.
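
    To make that extra complexity concrete, PolynomialFeatures(degree=2) expands the five original features into 21 columns: a bias term, the five linear terms, and all squares and pairwise products. A quick illustration:

    # Illustration: what a degree-2 polynomial expansion does to the feature space
    poly = PolynomialFeatures(degree=2)
    poly.fit(X_train)
    
    print(f'Original features: {X_train.shape[1]}')        # 5
    print(f'Expanded features: {poly.n_output_features_}')  # 21
    print(poly.get_feature_names_out())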

    Ridge Regression

    Next, I ventured into regularization with Ridge regression, aiming to prevent overfitting.

    # Create and fit a Ridge regression model
    ridge_model = Ridge(alpha=0.1)
    ridge_model.fit(X_train, y_train)
    
    # Predict on the test set
    ridge_pred = ridge_model.predict(X_test)
    
    # Calculate R² and MSE
    ridge_r2 = r2_score(y_test, ridge_pred)
    ridge_mse = mean_squared_error(y_test, ridge_pred)
    
    print(f'Ridge Regression R²: {ridge_r2}, Mean Squared Error (MSE): {ridge_mse}')

    The Ridge regression yielded:

    • R²: 0.6917
    • MSE: 6912725.8010

    The performance was essentially identical to the baseline linear regression. This is not surprising: with alpha=0.1, the penalty is tiny relative to the scale of the unscaled features, so the coefficients are barely shrunk at all.
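
    A small diagnostic sketch (not part of the original workflow) makes this visible: the size of the fitted coefficients barely moves until alpha becomes very large.

    import numpy as np
    
    # Compare the size of the fitted coefficients as the Ridge penalty grows
    for alpha in [0.1, 10, 1000, 100000]:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        print(f'alpha={alpha:>8}: ||coef|| = {np.linalg.norm(model.coef_):.2f}')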

    Ridge Polynomial Regression

    To combine the benefits of polynomial features and regularization, I implemented Ridge Polynomial Regression.

    # Create a pipeline with polynomial features and Ridge regression
    ridge_poly_pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', StandardScaler()),
        ('regressor', Ridge(alpha=0.1))
    ])
    
    # Fit the model
    ridge_poly_pipeline.fit(X_train, y_train)
    
    # Predict on the test set
    y_ridge_poly_pred = ridge_poly_pipeline.predict(X_test)
    
    # Calculate R² and MSE
    ridge_poly_r2 = r2_score(y_test, y_ridge_poly_pred)
    ridge_poly_mse = mean_squared_error(y_test, y_ridge_poly_pred)
    
    print(f'Ridge Polynomial Model R²: {ridge_poly_r2}, Mean Squared Error (MSE): {ridge_poly_mse}')

    This model gave the following results:

    • R²: 0.6733
    • MSE: 7326174.8781

    The Ridge Polynomial Regression actually performed worse than both the plain polynomial model and the simple Ridge baseline, suggesting that the fixed alpha=0.1 was not well matched to the standardized polynomial feature space. Tuning alpha inside the pipeline, as sketched below, would be the natural next step.
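
    The following is a supplementary sketch (not part of my original run) of how GridSearchCV can tune the alpha of the Ridge step inside the pipeline, using scikit-learn's step__parameter naming:

    # Supplementary sketch: tune alpha for the Ridge step inside the pipeline
    param_grid = {'regressor__alpha': [0.01, 0.1, 1, 10, 100]}
    ridge_poly_search = GridSearchCV(ridge_poly_pipeline, param_grid,
                                     scoring='neg_mean_squared_error', cv=5)
    ridge_poly_search.fit(X_train, y_train)
    print(f'Best alpha for the pipeline: {ridge_poly_search.best_params_["regressor__alpha"]}')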

    Grid Search for Hyperparameter Tuning

    To ensure that I was using the best regularization parameter, I employed Grid Search for tuning the alpha parameter.

    # Define a grid of alpha values
    alpha_values = [0.01, 0.1, 1, 10, 100]
    
    # Create a Ridge regression model
    ridge = Ridge()
    
    # Set up Grid Search
    grid_search = GridSearchCV(estimator=ridge, param_grid={'alpha': alpha_values}, scoring='neg_mean_squared_error', cv=4)
    grid_search.fit(X_train, y_train)
    
    # Get the best alpha value
    best_alpha = grid_search.best_params_['alpha']
    print(f'Best Alpha: {best_alpha}')
    
    # Predict with the best model
    best_ridge = Ridge(alpha=best_alpha)
    best_ridge.fit(X_train, y_train)
    best_ridge_pred = best_ridge.predict(X_test)
    
    # Calculate R² and MSE
    best_ridge_r2 = r2_score(y_test, best_ridge_pred)
    best_ridge_mse = mean_squared_error(y_test, best_ridge_pred)
    
    print(f'Grid Search Ridge R²: {best_ridge_r2}, Mean Squared Error (MSE): {best_ridge_mse}')

    The results from the Grid Search revealed:

    • Best Alpha: 0.01
    • MSE: 13840985.99
    • R²: 0.3827

    Despite selecting the alpha that minimized cross-validated error, this Ridge model still underperformed the base model on the held-out test set.
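
    One way to dig into this gap (a supplementary diagnostic, not part of the original code) is to inspect the mean cross-validation error recorded for every candidate alpha in cv_results_:

    # Inspect the mean cross-validated MSE for each candidate alpha
    for alpha, mean_neg_mse in zip(grid_search.cv_results_['param_alpha'],
                                   grid_search.cv_results_['mean_test_score']):
        print(f'alpha={alpha}: mean CV MSE = {-mean_neg_mse:.2f}')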

    Visualizing the Results

    To better understand and compare the results of the various models, I created visualizations.

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Model performance data
    models = ['Linear Regression', 'Polynomial Model', 'Ridge Regression', 'Ridge Polynomial', 'GridSearch']
    r2_scores = [0.6917, 0.7667, 0.6917, 0.6733, 0.3827]
    mse_scores = [6912744.91, 5234038.0655, 6912725.801, 7326174.8781, 13840985.99]
    
    # Plot R² Scores
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.barplot(x=models, y=r2_scores, palette='coolwarm')
    plt.title('R² Scores of Different Models')
    plt.ylabel('R² Score')
    plt.ylim(0, 1)
    plt.xticks(rotation=15)
    
    # Plot MSE Scores
    plt.subplot(1, 2, 2)
    sns.barplot(x=models, y=mse_scores, palette='viridis')
    plt.title('Mean Squared Error (MSE) of Different Models')
    plt.ylabel('MSE')
    plt.ylim(0, max(mse_scores) + 1000000)
    plt.xticks(rotation=15)
    
    plt.tight_layout()
    plt.show()

    Conclusion on Model Performance

    Reflecting on the results, several conclusions emerged:

    1. Model Comparison:
    • The base linear regression model provided a solid baseline with an R² of 0.6917, indicating a reasonable fit to the data.
    • The polynomial model added complexity but was the strongest performer, improving R² to 0.7667 and cutting the MSE by roughly a quarter.
    • Ridge regression with alpha=0.1 performed essentially the same as the baseline (R²: 0.6917); the small penalty barely changed the fitted coefficients.
    • The Ridge Polynomial Regression (R²: 0.6733) fell short of both the plain polynomial model and the Ridge baseline, suggesting that the fixed alpha was not well suited to the standardized polynomial feature space.
    • The GridSearch Ridge model underperformed all of the others on the held-out test set (R²: 0.3827), a reminder that the configuration that wins in cross-validation does not always transfer cleanly to unseen data.