<a href="https://colab.research.google.com/github/mirsazzathossain/compbio-bracu/blob/main/day_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Day 03: Interdisciplinary Computational Biology Workshop 2025**

### **1. Import Libraries**

To get started, we need to import a set of essential libraries that will help us perform various tasks:

- **`numpy`**: For efficient numerical computations.
- **`pandas`**: To handle and manipulate structured data.
- **`matplotlib` and `seaborn`**: For creating insightful visualizations.
- **`sklearn`**: To build and evaluate machine learning models.

Let's import them now!


In [1]:
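# TODO: Import any further libraries you need (e.g., from sklearn)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", None)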

### **2. Load Data**

For this task, you are provided with data files containing information about different proteins. Each group is assigned a specific protein dataset, and they will focus on analyzing and processing their assigned file.

Protein Files for Groups:

- Group 1: `akt1.csv`
- Group 2: `casp3.csv`
- Group 3: `pa2ga.csv`
- Group 4: `cxcr4.csv`
- Group 5: `cp3a4.csv`

Each group should load their assigned protein file into a pandas DataFrame for further analysis. Use the `pd.read_csv()` function to load the data, and display the first few rows of the DataFrame to understand the available columns and data points.
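
As a rough sketch of the loading step, the snippet below reads a small in-memory CSV instead of a real file so that it runs anywhere. The column names (`Name`, `MW`, `LogP`) and values are made up for illustration and may not match the workshop files; with the real data you would pass your group's file name, for example `pd.read_csv("akt1.csv")`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a protein CSV file; the real columns may differ.
csv_text = """Name,MW,LogP
mol_act1,310.4,2.1
mol_decoy1,295.2,3.4
mol_decoy1,402.7,1.8
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
print(data.head())
```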


In [None]:
# TODO: Load the data and print the first 5 rows

### **3. Explore the Dataset**

To begin working with the dataset, it is essential to understand its structure and key statistics.

**Tasks for Exploration**:

1. Use `.info()` to check the dataset's structure, including the number of rows, columns, and data types.
2. Use `.describe()` to summarize the statistics of all numerical features.

**Things to Observe**:

- Are there any missing values in the dataset?
- What are the ranges of the numerical features?
- Are the data types appropriate for analysis?
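
The exploration calls can be tried on a toy DataFrame first. This is only a sketch: the columns below are invented, and `isna().sum()` is one possible way (not the only one) to answer the missing-values question.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value; column names are illustrative only.
toy = pd.DataFrame(
    {
        "Name": ["a_act1", "b_decoy1", "c_decoy1", "d_decoy1"],
        "MW": [310.4, 295.2, np.nan, 402.7],
        "HBD": [2, 1, 3, 0],
    }
)

toy.info()             # row count, dtypes, and non-null counts per column
print(toy.describe())  # count, mean, std, min, quartiles, max (numeric columns)

# Count missing values per column.
missing = toy.isna().sum()
print(missing)
```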


In [None]:
# TODO: Display the basic information about the dataset

In [None]:
# TODO: Display some basic statistics of the dataset

### **4. Visualizing Numerical Features**

In this part, we will visualize the distribution and spread of numerical features using histograms and boxplots to better understand the statistics from the `describe()` function.

#### **4.1 Histograms**

Histograms help visualize the distribution of numerical data. By adding a Kernel Density Estimate (KDE), we can see the shape and spread of the data.

**Instructions:**

1. Complete the code to loop through each numerical column.
2. Plot a histogram with a KDE for each numerical feature.
3. Observe the shape and spread of the distributions.


In [None]:
# set the style to whitegrid
sns.set(style="whitegrid")

# create a 6x4 grid of histograms
fig, axes = plt.subplots(6, 4, figsize=(20, 20))
axes = axes.ravel()

# iterate over each column and plot the histogram
for i, column in enumerate(data.columns):
    # plot only if the numeric column
    if data[column].dtype != "object":
        # TODO:
        # use sns.histplot to plot the histogram
        # set kde=True to plot the kernel density estimate
        # set bins to the square root of the number of samples
        # set ax=axes[i] to plot on the correct axis
        pass  # placeholder so the cell runs; replace with your code

plt.tight_layout()
plt.show()

#### **4.2 Boxplot**

Boxplots provide a summary of the data's distribution by displaying the median, quartiles, and potential outliers. They help us visually assess the spread of values and highlight any extreme values.

![boxplot](https://miro.medium.com/v2/resize:fit:700/0*XG2sFucPoFMg6NeV.png)

**Instructions**:

1. Complete the code to loop through each numerical column.
2. Plot a boxplot for each numerical feature.
3. Identify the median, IQR, and any outliers in the data.
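
The quantities a boxplot draws can also be computed directly, which helps when reading the plot. The sketch below uses made-up values and the 1.5 × IQR whisker rule, which is the default convention in matplotlib and seaborn boxplots.

```python
import pandas as pd

# Illustrative values; 40 is deliberately extreme.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={median}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```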


In [None]:
# set the style to whitegrid
sns.set(style="whitegrid")

# create a 6x4 grid of boxplots
fig, axes = plt.subplots(6, 4, figsize=(20, 20))
axes = axes.ravel()

# iterate over each column and plot the boxplot
for i, column in enumerate(data.columns):
    # plot only if the numeric column
    if data[column].dtype != "object":
        # TODO:
        # use sns.boxplot to plot the boxplot
        # set ax=axes[i] to plot on the correct axis
        pass  # placeholder so the cell runs; replace with your code

plt.tight_layout()
plt.show()

### **5. Data Preprocessing**

#### **5.1 Data Cleaning**

Based on the observations from the dataset, you might need to clean the data before building the machine learning model. Here are some important observations you might consider:

1. **Convert Categorical Variables:** The `Name` column is the target variable and is of type `object`. If the name ends with `act1` or `ac1`, it belongs to one class (Class 1), and if it ends with `decoy1`, it belongs to another class (Class 0). Convert this to numerical values (0 and 1).

2. **Handle Constant Features:** From the histograms and the `describe()` output, we notice that certain features have a constant value (i.e., std = 0), so they carry no information for the model. We'll drop these features using the `drop()` function.
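
Both cleaning steps can be sketched on a toy DataFrame. The column names `MW` and `const` are invented for the example, and matching the suffix with a regular expression is just one possible approach.

```python
import pandas as pd

# Toy frame; names follow the suffix convention described above.
toy = pd.DataFrame(
    {
        "Name": ["m1_act1", "m2_ac1", "m3_decoy1", "m4_decoy1"],
        "MW": [310.4, 295.2, 402.7, 350.0],
        "const": [1, 1, 1, 1],  # constant feature (std = 0)
    }
)

# 1 for actives (name ends with act1 or ac1), 0 for everything else (decoys).
is_active = toy["Name"].str.contains(r"(?:act1|ac1)$")
toy["Name"] = is_active.astype(int)
print(toy["Name"].value_counts())

# Columns with a single unique value carry no information.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)
print(toy.columns.tolist())
```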


In [None]:
# TODO:
# Fix the Name column by setting the value to 0 if the name ends with "decoy1" and 1 if the name ends with "act1" or "ac1"
# Print the value counts of the Name column using value_counts() method

In [9]:
# TODO: 
# Find the columns with only one unique value
# Drop the columns with only one unique value

#### **5.2 Feature Selection**

To identify the most relevant features for your machine learning model, you can start with a correlation matrix and then visualize the feature relationships.

1. **Correlation Matrix**:  
   A correlation matrix helps identify the relationships between features. Correlation values range from -1 to 1:

   - A value close to **1** indicates a strong positive correlation.
   - A value close to **-1** indicates a strong negative correlation.
   - A value close to **0** indicates weak or no correlation.

   Use `data.corr()` to calculate the correlation matrix and `sns.heatmap()` to visualize it. When two features are highly correlated, remove one of them to reduce multicollinearity.

2. **Pairplot**:  
   To visualize the relationships between these features further, you can create a pairplot using the `sns.pairplot()` function. This plot displays the pairwise relationships in a dataset, making it easier to identify patterns.
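
One common way to act on the correlation matrix is to scan its upper triangle and flag one feature from each highly correlated pair. The sketch below uses synthetic features and an assumed cutoff of 0.9; neither comes from the workshop data, so treat the threshold as a starting point only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
# Synthetic features: "b" is almost a copy of "a", "c" is independent.
toy = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.01, size=200), "c": rng.normal(size=200)}
)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff; tune for your dataset
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

The same `corr` matrix can be passed to `sns.heatmap()` for a visual check before dropping anything.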


In [None]:
# TODO:
# Calculate the correlation matrix of the dataset using the corr() method
# Plot the correlation matrix using a heatmap


In [None]:
# TODO:
# Plot a pairplot of the dataset using sns.pairplot
# Set the hue to the Name column

Now based on the correlation matrix and pairplot, you can decide which features to keep for the machine learning model. Drop the unnecessary features from the dataset.


In [12]:
columns_to_drop = [
    # TODO: Add the columns to drop
]
# TODO: Drop the columns in the columns_to_drop list

#### **5.3 Standardizing the Data**

Standardization rescales each feature to a mean of 0 and a standard deviation of 1. This puts all features on a comparable scale, which matters for algorithms that are sensitive to feature magnitudes, such as `KNeighborsClassifier`, `SVC`, and regularized `LogisticRegression`.

**Action**:

1. Use `StandardScaler` from `sklearn.preprocessing` to standardize the features.
2. Apply the scaler to all features, excluding the target variable.
3. Replace the original features with the standardized values in the DataFrame.
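
The three actions can be sketched as follows on a toy DataFrame. The feature names are invented; `Name` plays the role of the already-encoded 0/1 target.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; "Name" is the 0/1 target and must not be scaled.
toy = pd.DataFrame(
    {
        "MW": [300.0, 320.0, 340.0, 360.0],
        "LogP": [1.0, 2.0, 3.0, 4.0],
        "Name": [1, 0, 0, 1],
    }
)

feature_cols = toy.columns.drop("Name")
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[feature_cols])

# Rebuild a DataFrame with the original column names and index,
# then attach the untouched target column.
toy_scaled = pd.DataFrame(scaled, columns=feature_cols, index=toy.index)
toy_scaled["Name"] = toy["Name"]
print(toy_scaled.round(3))
```

Keeping the fitted `scaler` object around is useful later, because the same transformation can then be applied to other data.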


In [13]:
# TODO:
# Instantiate the StandardScaler object
# Fit and transform the data using the scaler
# Create a DataFrame of the scaled data
# Add the Name column back to the DataFrame


#### **5.4 Separating the Target and Features**

Before building the machine learning model, you need to separate the target variable from the feature variables.

**Action**:

1. Create a variable `X` containing all feature columns and another variable `y` containing the target column.
2. Check the shape of `X` and `y` to ensure they are correct.


In [14]:
# TODO:
# Separate the features and the target variable
# Keep the features in a DataFrame called X
# Keep the target variable in a Series called y

### **6. Model Building**

Now that the data is preprocessed and ready, you can proceed with building a machine learning model.

#### **6.1 Train-Test Split**

Before training the model, split the data into training and testing sets. This allows you to train the model on one set of data and test its performance on unseen data.

- Use `train_test_split` from `sklearn.model_selection` to split the data. The typical split ratio is 80% training and 20% testing.
- Set `stratify=y` to ensure that the class distribution is similar in both training and testing sets.
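
To see what `stratify=y` does, here is a small sketch with an artificial 80/20 class imbalance. The feature values are placeholders and `random_state=42` is an arbitrary choice made for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80 decoys (0) and 20 actives (1).
X = pd.DataFrame({"f1": np.arange(100), "f2": np.arange(100) % 7})
y = pd.Series([0] * 80 + [1] * 20, name="Name")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y both splits keep the 80/20 class ratio.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```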


In [15]:
# TODO:
# Split the data into training and testing sets

#### **6.2 Model Selection**

For this task, you can choose any classification algorithm to build the model. Here are a few popular algorithms you can consider:

**Traditional Classifiers:**

- `LogisticRegression`
- `KNeighborsClassifier`
- `SVC` (Support Vector Classifier)
- `DecisionTreeClassifier`
- `RandomForestClassifier`
- `GradientBoostingClassifier`
- `AdaBoostClassifier`
- `XGBClassifier` (if using `XGBoost`)

**Deep Neural Network Classifier:**

- `MLPClassifier` (Multi-layer Perceptron from `sklearn.neural_network`)

Initialize the chosen model using appropriate parameters.


In [16]:
# TODO: Instantiate a Classifier object of your choice

#### **6.3 Model Training**

Train the initialized model on the training data using the `fit()` function.


In [17]:
# TODO: Fit the classifier object on the training data

#### **6.4 Predictions and Evaluation**

After training the model, make predictions on the test data and evaluate its performance using various metrics.

**1. Predictions:**

Use the `predict()` function to predict the target values on the test set, and save them as `y_pred`.
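
A minimal end-to-end sketch of instantiating, fitting, and predicting is shown below. It uses synthetic data from `make_classification` and `LogisticRegression` purely as an example; any classifier from the list above follows the same `fit`/`predict` pattern. The variable `y_score` is an extra introduced here to show how class-1 probabilities can be obtained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the protein data.
X, y = make_classification(
    n_samples=300, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)  # example choice, not a recommendation
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard 0/1 labels
y_score = clf.predict_proba(X_test)[:, 1]  # probability of class 1
print(y_pred[:10], y_score[:3].round(2))
```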


In [18]:
# TODO: Predict the labels of the testing data

**2. Confusion Matrix:**

Write a function to plot the confusion matrix to visualize the model's performance. Use `confusion_matrix` from `sklearn.metrics` and `heatmap` from `seaborn` to plot the matrix.
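
Before plotting, it helps to know how the matrix is laid out. The sketch below uses hand-made labels (not workshop results) to show the layout returned by `confusion_matrix` for a binary problem.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Passing `cm` to `sns.heatmap` with `annot=True` and `fmt="d"` prints the integer counts inside the cells.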


In [19]:
def plot_confusion_matrix(y_true, y_pred):
    # TODO:
    # Use the confusion_matrix function to calculate the confusion matrix
    # Plot the confusion matrix using sns.heatmap
    # Set the x-axis label to "Predicted" and the y-axis label to "Actual"
    # Set the title to "Confusion Matrix"
    pass  # placeholder so the cell runs; replace with your code

**3. ROC-AUC Curve:**

Plot the ROC-AUC curve to visualize the trade-off between the True Positive Rate and False Positive Rate. Use `roc_curve` and `auc` functions from `sklearn.metrics` to plot the curve.
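
The two functions can be tried on a tiny hand-made example first. Note the assumption made here: the second argument holds predicted scores (for instance from `predict_proba`) rather than hard labels, which gives the curve more than one intermediate point.

```python
from sklearn.metrics import auc, roc_curve

# Hand-made example: true labels and predicted scores for class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")
```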


In [20]:
def plot_roc_curve(y_true, y_pred):
    # TODO:
    # Use roc_curve from sklearn.metrics to get the fpr, tpr, thresholds
    # Use auc from sklearn.metrics to get the auc
    # Plot the fpr vs tpr
    # Plot the line y=x in red as reference for random classifier
    # Add labels and title to the plot
    pass  # placeholder so the cell runs; replace with your code

**4. Evaluate the Model:**

Use the `classification_report` function from `sklearn.metrics` to print a summary of the model's performance, including precision, recall, F1-score, and accuracy. Plot the ROC-AUC curve and confusion matrix to visualize the model's performance.


In [None]:
# TODO:
# Print the classification report using classification_report from sklearn.metrics
# Use plot_confusion_matrix to plot the confusion matrix
# Use plot_roc_curve to plot the ROC curve

### **7. Improving Model Performance**

In this section, we will focus on improving the model's performance by addressing the issues related to the classification of the active class, which has fewer data points. There are several techniques to handle imbalanced datasets:

- **Find the Optimal Threshold:** Adjust the classification threshold to balance precision and recall.
- **Resampling Techniques:** Oversample the minority class (SMOTE) or undersample the majority class to balance the class distribution.
- **Data Augmentation:** Generate synthetic samples by adding noise to the existing data points.
- **Balanced Bagging:** Use ensemble methods like BalancedRandomForestClassifier or EasyEnsemble to handle imbalanced datasets.

#### **7.1 Find the Optimal Threshold**

To find the optimal threshold, you can plot the Precision-Recall curve and identify the threshold that balances precision and recall. Use the `precision_recall_curve` function from `sklearn.metrics` to plot the curve.

**Steps**:

- Use `predict_proba` to get the probabilities of the positive class.
- Call the `precision_recall_curve` function to get the precision, recall, and threshold values.
- Calculate the F1-score using the precision and recall values.
  $$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
- Find the threshold at which the F1-score is highest.
- Print the optimal threshold value.
- Compute `y_pred` from the probabilities using the optimal threshold: predict 1 if the probability is greater than or equal to the threshold, and 0 otherwise.
- Evaluate the model using the confusion matrix, classification report, and ROC-AUC curve.
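
The threshold search in the steps above can be sketched as follows. The labels and scores are hand-made for illustration; in the notebook, `y_score` would come from `predict_proba` on the test set. The small constant in the denominator is an added guard against division by zero.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hand-made labels and class-1 probabilities for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
# Guard against 0/0 where precision + recall is zero.
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_threshold = thresholds[np.argmax(f1)]
# precision_recall_curve counts scores >= threshold as positive.
y_pred_best = (y_score >= best_threshold).astype(int)
print(best_threshold, y_pred_best)
```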


In [None]:
# TODO: Write your code here to find the best threshold

# TODO: Write your code here to predict the labels based on the best threshold

# TODO: Write your code here to plot the confusion matrix, classification report, and ROC curve for the best threshold

#### **7.2 Fix using Resampling Techniques**

If the model performance is still not satisfactory, you can try resampling to balance the class distribution. Use the `resample` function from `sklearn.utils` to oversample the minority class or undersample the majority class.

**Steps**:

- Split the data `X` and `y` into training and testing sets using `train_test_split`.
- Concatenate `X_train` and `y_train` to create a training dataset.
- Find the indices of the minority and majority classes. Use boolean indexing on the target variable.
- Use `resample` to oversample the minority class.
- Concatenate the resampled minority class with the majority class.
- Separate the target variable from the features.
- Train the model on the resampled data and evaluate its performance.


In [None]:
from sklearn.utils import resample

# TODO: Write your code here to split the data into training and testing sets

# Find the majority and minority classes
train_majority = pd.concat([X_train, y_train], axis=1)[y_train == 0]
train_minority = pd.concat([X_train, y_train], axis=1)[y_train == 1]

# Upsample the minority class by resampling with replacement
train_minority_upsampled = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)

# Combine the majority class with the upsampled minority class and shuffle the data
train_upsampled = pd.concat([train_majority, train_minority_upsampled])
train_upsampled = train_upsampled.sample(frac=1, random_state=42)

# Separate the features and the target variable
X_train = train_upsampled.drop("Name", axis=1)
y_train = train_upsampled["Name"]


# TODO: 
# Fit the classifier object on the new training data
# Predict the labels of the testing data

# TODO:
# Print the classification report using classification_report from sklearn.metrics
# Use plot_confusion_matrix to plot the confusion matrix
# Use plot_roc_curve to plot the ROC curve

#### **7.3 Fix using SMOTE resampling**

Another popular technique to handle imbalanced datasets is the Synthetic Minority Over-sampling Technique (SMOTE). It generates synthetic samples for the minority class by interpolating between existing samples.
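
The interpolation idea can be illustrated with plain NumPy. This is only the core arithmetic under simplified assumptions (two made-up points, one random draw), not the full SMOTE algorithm, which also selects the neighbor from the k nearest minority-class samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class points in a 2-D feature space (made-up values).
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

# A synthetic sample lies at a random position on the segment between them.
lam = rng.uniform(0, 1)
synthetic = x + lam * (neighbor - x)
print(synthetic)
```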

**Steps**:

- Split the data `X` and `y` into training and testing sets using `train_test_split`.
- Import `SMOTE` from `imblearn.over_sampling`.
- Initialize the SMOTE object with appropriate parameters.
- Call `fit_resample` on the training data only to oversample the minority class.
- Train the model on the resampled data and evaluate its performance on the untouched test set.




In [None]:
from imblearn.over_sampling import SMOTE

# TODO: Split the data into training and testing sets

# Instantiate the SMOTE object
smote = SMOTE(sampling_strategy="minority", random_state=42)

# Use SMOTE to oversample the minority class
X_train, y_train = smote.fit_resample(X_train, y_train)

# Shuffle the data
oversampled = pd.concat([X_train, y_train], axis=1)
oversampled = oversampled.sample(frac=1, random_state=42)

# Separate the features and the target variable
X_train = oversampled.drop("Name", axis=1)
y_train = oversampled["Name"]

# TODO:
# Fit the classifier object on the new training data
# Predict the labels of the testing data

# TODO:
# Print the classification report using classification_report from sklearn.metrics
# Use plot_confusion_matrix to plot the confusion matrix
# Use plot_roc_curve to plot the ROC curve

#### **7.4 Balanced Bagging**

Balanced Bagging is an ensemble method that combines multiple classifiers trained on balanced bootstrap samples. It helps improve the model's performance on imbalanced datasets. Use `BalancedBaggingClassifier` from `imblearn.ensemble` to train the model.

**Steps**:

- Split the data `X` and `y` into training and testing sets using `train_test_split`.
- Initialize the `BalancedBaggingClassifier` with the base classifier (e.g., DecisionTreeClassifier, RandomForestClassifier).
- Train the model on the training data.
- Evaluate the model's performance using the confusion matrix, classification report, and ROC-AUC curve.



In [None]:
# TODO: Split the data into training and testing sets

# TODO: Instantiate a BalancedBaggingClassifier object with estimator as your chosen classifier

# TODO:
# Fit the new classifier object on the training data
# Predict the labels of the testing data

# TODO:
# Print the classification report using classification_report from sklearn.metrics
# Use plot_confusion_matrix to plot the confusion matrix
# Use plot_roc_curve to plot the ROC curve


#### **7.5 Augmenting Data**

Data augmentation is another technique to handle imbalanced datasets by generating synthetic samples. You can add noise to the existing data points to create new samples.

**Steps**:

- Split the data `X` and `y` into training and testing sets using `train_test_split`.
- Concatenate `X_train` and `y_train` to create a training dataset.
- Find the indices of the minority and majority classes. Use boolean indexing on the target variable.
- Use `resample` to oversample the minority class.
- Use `np.random.normal` to generate noise with the same shape as the upsampled minority class.
- Add the noise to the upsampled minority class to create augmented data.
- Concatenate the augmented data with the majority class.
- Separate the target variable from the features.
- Train the model on the augmented data and evaluate its performance.


In [None]:
import numpy as np
from sklearn.utils import resample

# TODO: Split the data into training and testing sets

# Find the majority and minority classes
train_majority = pd.concat([X_train, y_train], axis=1)[y_train == 0]
train_minority = pd.concat([X_train, y_train], axis=1)[y_train == 1]

# Upsample the minority class by resampling with replacement
train_minority_upsampled = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)

# Add random noise to the upsampled data
train_minority_upsampled.iloc[:, :-1] += np.random.normal(
    0, 0.01, train_minority_upsampled.iloc[:, :-1].shape
)

# Combine the majority class with the upsampled minority class and shuffle the data
train_upsampled = pd.concat([train_majority, train_minority_upsampled])
train_upsampled = train_upsampled.sample(frac=1, random_state=42)

# Separate the features and the target variable
X_train = train_upsampled.drop("Name", axis=1)
y_train = train_upsampled["Name"]

# TODO:
# Fit the classifier object on the new training data
# Predict the labels of the testing data

# TODO:
# Print the classification report using classification_report from sklearn.metrics
# Use plot_confusion_matrix to plot the confusion matrix
# Use plot_roc_curve to plot the ROC curve


### **8. Evaluate Performance on Another Dataset**

- Load another protein dataset (e.g., `casp3.csv`) and preprocess it similarly to the previous dataset.
- Use the trained model to make predictions on this new dataset.
- Evaluate the model's performance using the confusion matrix, classification report, and ROC-AUC curve.
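
The key point when reusing a model on new data is to apply the already-fitted preprocessing rather than refitting it. The sketch below uses two toy DataFrames with invented columns (`f1`, `drop_me`) and `LogisticRegression` as a stand-in for whichever classifier you trained.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy "training" and "new" datasets with the same columns (made-up values).
train = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0, 4.0], "drop_me": [9, 9, 8, 8], "Name": [0, 0, 1, 1]}
)
new = pd.DataFrame({"f1": [1.5, 3.5], "drop_me": [7, 7], "Name": [0, 1]})
columns_to_drop = ["drop_me"]  # same list as used for the training data

# Fit the scaler and the model on the training data only.
X_train = train.drop(columns=columns_to_drop + ["Name"])
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), train["Name"])

# Reuse both on the new data: transform(), not fit_transform().
X_new = new.drop(columns=columns_to_drop + ["Name"])
y_new_pred = clf.predict(scaler.transform(X_new))
print(y_new_pred)
```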

In [None]:
# TODO:
# Load another dataset
# Remove the columns you removed in the previous dataset
# Use the scaler you used in the previous dataset to scale the new dataset
# Separate the features and the target variable
# Test the classifier you trained on the previous dataset on the new dataset
# Evaluate the classifier using the classification report, confusion matrix, and ROC curve
