<a href="https://colab.research.google.com/github/mirsazzathossain/compbio-bracu/blob/main/day_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Day 02: Interdisciplinary Computational Biology Workshop 2025**

### **Problem Statement:**

In this hands-on workshop, you'll explore how machine learning can be applied to computational biology. You'll work with Gene Ontology (GO) terms that describe biological functions and processes. The task is to perform binary classification to predict the presence or absence of specific biological functions in a dataset.

- You are provided with a list of GO terms in [this Google Sheet](https://docs.google.com/spreadsheets/d/1Sc_1Sfi4pKxQ6VGf4j7i7wp0u4kzREzyHb0qwj6AVbM/edit?usp=sharing).
- Open the spreadsheet and select the GO terms that your group will work on. Before selecting, explore each term in the [QuickGO database](https://www.ebi.ac.uk/QuickGO/term/GO:0000001) to understand the biological context of the term.
- You can use ChatGPT to help you understand the terms.


In this notebook, you'll learn how to retrieve a dataset related to your selected GO terms and apply machine learning techniques to predict the presence or absence of these biological functions.


### **Setup codebase**

We will use a Python package called `ProFAB` to retrieve the GO dataset. Run the following code to install the package.

In [1]:
import os
if not os.path.exists('profab'):
    !git clone https://github.com/kansil/ProFAB.git
    !cp -r ProFAB/profab .
    !rm -rf ProFAB

### **Import Libraries**

To get started, we need to import a set of essential libraries that will help us perform various tasks:

- **`numpy`**: For efficient numerical computations.
- **`pandas`**: To handle and manipulate structured data.
- **`matplotlib` and `seaborn`**: For creating insightful visualizations.
- **`sklearn`**: To build and evaluate machine learning models.

In Python, it is best practice to import all necessary libraries at the beginning of your script or notebook. In this workshop, however, you may add each import to the cell below as you first need it.

In [12]:
# TODO: Import the necessary libraries

import pandas as pd
pd.set_option("display.max_columns", None)

### **Load Data**

The `ProFAB` package provides the `GOID()` function, which allows you to retrieve datasets related to your selected Gene Ontology (GO) terms.

**Parameters of the `GOID()` Function:**

- **`ratio`**: Specifies the proportions for splitting the data into training, testing, and validation sets.
  - If `ratio` is a single float (e.g., `0.2`), it defines the test set size, with the remaining data allocated to training.
  - If `ratio` is a list of two floats (e.g., `[0.2, 0.1]`), it defines the test and validation set sizes, with the remaining data allocated to training.

- **`protein_feature`**: Determines the numerical feature representation of proteins derived from their sequences. Options include:
  - `'paac'`: Pseudo Amino Acid Composition (50 dimensions)
  - `'aac'`: Amino Acid Composition (20 dimensions)
  - `'gaac'`: Grouped Amino Acid Composition (5 dimensions)
  - `'ctdt'`: CTD Transition (39 dimensions)
  - `'socnumber'`: Quasi-Sequence-Order Coupling Number (60 dimensions)
  - `'ctriad'`: Conjoint Triad (343 dimensions)
  - `'kpssm'`: k-Separated-Bigrams POSSUM Vector (400 dimensions)

  *Default: `'paac'`*

- **`pre_determined`**: Indicates whether the data has been pre-split into training and testing sets.
  - `False`: Data will be split according to the specified `ratio`.
  - `True`: Pre-split data will be used.

- **`set_type`**: Defines the method for splitting the data.
  - `'random'`: Random splitting.
  - `'similarity'`: Splitting based on protein similarity.
  - `'temporal'`: Splitting based on the temporal aspect of data collection.

    *Default: `'random'`*

Use the `GOID()` function to initialize a data module and save it to a variable. 
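
To make the meaning of `ratio` concrete, here is a minimal sketch of what a `[0.2, 0.1]` split amounts to. It uses scikit-learn's `train_test_split` on synthetic data and is only an illustration of the proportions; it is not how `ProFAB` performs the split internally, and the array names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 "proteins" with 20 features each.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)

test_ratio, val_ratio = 0.2, 0.1  # same meaning as ratio=[0.2, 0.1]

# Convert the fractions to sample counts so both refer to the full dataset.
n_test = round(test_ratio * len(X))
n_val = round(val_ratio * len(X))

# First carve out the test set, then the validation set from what remains.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X, y, test_size=n_test, random_state=0
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=0
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 70 10 20
```

With a single float such as `0.2`, only the first split applies and the remaining 80% is used for training.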


**Retrieving Data for a Specific GO Term:**

To retrieve data for a specific GO term, use the `get_data()` function. It takes the following parameter:

- **`data_name`**: The identifier of the GO term you selected, written with an underscore (e.g., `'GO_0000001'`).


**Purpose of Training and Testing Sets:**

- **Training Set (`X_train`, `y_train`)**: Used to train the machine learning model, allowing it to learn the patterns and relationships within the data.
- **Testing Set (`X_test`, `y_test`)**: Used to evaluate the model's performance on unseen data, providing an estimate of how well the model generalizes to new, unseen examples.

Does this look a bit complex? Don't worry! Just run the code below with your selected GO term. The data will be loaded and you can start exploring it.


In [3]:
from profab.import_dataset import GOID

# Replace 'GO_0000018' with the GO term your group selected
data_model = GOID(ratio=0.2, protein_feature='aac', pre_determined=False, set_type='random')
X_train, X_test, y_train, y_test = data_model.get_data(data_name='GO_0000018')

### **Explore Data**

Now that the data is loaded, let's explore it by looking at its shape and its first few rows. The `ProFAB` package returns the data as lists, so we will convert it into a pandas DataFrame for easier inspection.

- Preview the first few rows of the data using the `.head()` method.
- Use `.info()` method to get the information about the data.
- Use `.describe()` method to get the summary statistics of the data.
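
As a rough illustration of these three methods, the sketch below builds a tiny DataFrame from synthetic lists (the names `X_demo`, `y_demo`, and `demo` are made up for this example) and inspects it. The real dataset will have many more rows, but the calls are the same.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical): 6 samples, 20 features, stored as plain lists.
rng = np.random.default_rng(0)
X_demo = rng.random((6, 20)).tolist()
y_demo = [1, 0, 1, 0, 0, 1]

demo = pd.DataFrame(X_demo)
demo["label"] = y_demo

print(demo.shape)       # (rows, columns): 20 features + 1 label column
print(demo.head())      # first 5 rows
demo.info()             # column dtypes and non-null counts (prints directly)
print(demo.describe())  # count, mean, std, min, quartiles, max per column

# How many positive and negative examples are there?
print(demo["label"].value_counts())
```

Checking the label counts early is useful because a heavily imbalanced dataset changes how metrics such as accuracy should be interpreted.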



In [None]:
data = pd.DataFrame(X_train)
data['label'] = y_train

# TODO: Display the first 5 rows of the dataset

In [None]:
# TODO: Display the basic information about the dataset

In [None]:
# TODO: Display some basic statistics of the dataset

### **Data Visualization**

Visualizing data is a crucial step in the data analysis process. It helps you understand the data distribution, relationships between features, and more. Use the `pairplot()` function from the `seaborn` library to create a pairplot of the data. The pairplot shows the relationship between different features in the dataset.

In [None]:
# TODO: Write code to plot a pairplot of the dataset

### **Feature Selection**

Feature selection is the process of selecting a subset of relevant features for use in model construction. It can improve the model's performance by reducing overfitting, and it makes the model easier to interpret. Use a correlation matrix to identify the features that are highly correlated with the target variable.

 **Correlation Matrix**:  
   A correlation matrix helps identify the relationships between features. Correlation values range from -1 to 1:

   - A value close to **1** indicates a strong positive correlation.
   - A value close to **-1** indicates a strong negative correlation.
   - A value close to **0** indicates weak or no correlation.

   Use `df.corr()` to calculate the correlation matrix and `sns.heatmap()` to visualize it. Remove highly correlated features to reduce multicollinearity.
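
The following sketch shows one possible way to go from the correlation matrix to a list of redundant features. It runs on synthetic data in which `f5` is deliberately constructed as a near copy of `f0`; the column names and the 0.9 threshold are assumptions chosen for illustration, not values prescribed by the workshop.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in (hypothetical): 5 independent features plus one redundant one.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 5)), columns=[f"f{i}" for i in range(5)])
demo["f5"] = 2 * demo["f0"] + rng.normal(0, 0.01, size=200)  # near copy of f0
demo["label"] = rng.integers(0, 2, size=200)

corr = demo.corr()
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)

# Features ranked by absolute correlation with the target.
target_corr = corr["label"].drop("label").abs().sort_values(ascending=False)
print(target_corr)

# Keep only the upper triangle of the feature-feature block so each pair is
# checked once, then flag the second member of every highly correlated pair.
features = corr.drop(index="label", columns="label")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.9).any()]
print(to_drop)

reduced = demo.drop(columns=to_drop)
```

Here the redundant column `f5` is flagged while `f0` is kept, so no information is lost by dropping it.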

In [None]:
# TODO:
# Calculate the correlation matrix of the dataset using the corr() method
# Plot the correlation matrix using a heatmap

### **Standardizing the Data**

Standardization is important to scale the features so that they have a mean of 0 and a standard deviation of 1. This ensures that each feature contributes equally to the model, especially for algorithms sensitive to feature scales.

**Action**:

1. Use `StandardScaler` from `sklearn.preprocessing` to standardize the features.
2. Apply the scaler to all features, excluding the target variable.
3. Replace the original features with the standardized values in the DataFrame.
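
The steps above can be sketched as follows on synthetic arrays (the `_demo` names are made up for this example). The key point is that the scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): features with mean 5 and standard deviation 3.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(80, 20))
X_test_demo = rng.normal(loc=5.0, scale=3.0, size=(20, 20))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

# After scaling, every training feature has mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0).round(2)[:3])
print(X_train_scaled.std(axis=0).round(2)[:3])
```

The test set will be close to, but not exactly, mean 0 and standard deviation 1, because it is scaled with the training statistics.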

In [11]:
# TODO:
# Instantiate a StandardScaler object
# Fit the scaler object on the training data
# Transform the training and testing data using the scaler object


### **Model Building**

Now that the data is preprocessed and ready, you can proceed with building a machine learning model.

#### **Model Selection**

For this task, you can choose any classification algorithm to build the model. Here are a few popular algorithms you can consider:

**Traditional Classifiers:**

- `LogisticRegression`
- `KNeighborsClassifier`
- `SVC` (Support Vector Classifier)
- `DecisionTreeClassifier`
- `RandomForestClassifier`
- `GradientBoostingClassifier`
- `AdaBoostClassifier`
- `XGBClassifier` (if using `XGBoost`)

**Deep Neural Network Classifier:**

- `MLPClassifier` (Multi-layer Perceptron from `sklearn.neural_network`)

Initialize the chosen model using appropriate parameters.

In [13]:
# TODO: Instantiate a Classifier object of your choice

#### **Model Training**

Train the initialized model on the training data using the `fit()` function.

In [None]:
# TODO: Fit the classifier object on the training data 

#### **Predictions**

Use the `predict()` function to predict the target values on the test set, and save them as `y_pred`.
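
Putting the three steps together (instantiate, `fit()`, `predict()`), a minimal sketch on synthetic data might look like this. `LogisticRegression` is used only as an example choice from the list above, and the data and variable names are invented so that the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

model_demo = LogisticRegression(max_iter=1000)  # 1. instantiate
model_demo.fit(X_tr, y_tr)                      # 2. train
y_pred_demo = model_demo.predict(X_te)          # 3. predict on unseen data

print(y_pred_demo[:10])
print("test accuracy:", model_demo.score(X_te, y_te))
```

Every scikit-learn classifier in the list follows the same `fit()` / `predict()` interface, so swapping in another model only changes the line that instantiates it.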

In [15]:
# TODO: Predict the labels of the testing data

#### **Model Evaluation**

Evaluate the model's performance using various metrics such as accuracy, precision, recall, F1-score, and confusion matrix.

##### **Confusion Matrix**

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. It helps visualize the performance of an algorithm.

Use the `confusion_matrix()` function from `sklearn.metrics` to calculate the confusion matrix. Save the confusion matrix in a variable and visualize it using `sns.heatmap()`.
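
To see how the matrix is laid out, here is a small sketch with ten hand-made labels (hypothetical values, chosen so the counts are easy to verify by eye). In scikit-learn, rows correspond to the true class and columns to the predicted class.

```python
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hand-made example labels (hypothetical).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[4 2]    row 0: true negatives (4), false positives (2)
#  [1 3]]   row 1: false negatives (1), true positives (3)

ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
```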

In [None]:
# TODO:
# Calculate the confusion matrix from the actual labels and the predicted labels
# Plot the confusion matrix using a heatmap

##### **Classification Metrics**

There are several metrics to evaluate the performance of a classification model. In the formulas below, $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives. Here are a few common metrics:

- **Accuracy**: The proportion of correct predictions to the total number of predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- **Precision**: The proportion of true positive predictions to the total positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$
- **Recall**: The proportion of true positive predictions to the total actual positive instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$
- **F1-Score**: The harmonic mean of precision and recall.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Use the `classification_report()` function from `sklearn.metrics` to calculate these metrics and print the classification report.
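
As a worked check of the formulas, the sketch below computes each metric by hand from the confusion-matrix counts of ten hand-made labels (hypothetical values) and compares the results with scikit-learn's own functions.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made example labels (hypothetical).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 7 / 10 = 0.70
precision = tp / (tp + fp)                          # 3 / 5  = 0.60
recall = tp / (tp + fn)                             # 3 / 4  = 0.75
f1 = 2 * precision * recall / (precision + recall)  # ~0.667

print(accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))

# The report lists precision, recall, and F1 for each class, plus averages.
print(classification_report(y_true_demo, y_pred_demo))
```

Note that the hand-computed values refer to the positive class (label 1); the report also shows the same metrics with label 0 treated as the class of interest.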

In [None]:
# TODO: Use the classification_report method to calculate the precision, recall and f1-score of the model and print it

##### **ROC Curve**

The Receiver Operating Characteristic (ROC) curve is a graphical representation of the true positive rate against the false positive rate. It shows the trade-off between sensitivity and specificity. The Area Under the Curve (AUC) is a metric that quantifies the overall performance of the model.

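Use the `roc_curve()` function from `sklearn.metrics` to calculate the points of the ROC curve and the `roc_auc_score()` function to calculate the AUC (equivalently, pass the curve to `auc()`). Plot the ROC curve using `matplotlib`.
For today's workshop, simply run the code below to get the results. It assumes your trained classifier is stored in a variable named `model` and supports `predict_proba()`.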
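
Before running it on real predictions, it can help to see the functions on a four-sample toy example (hypothetical scores). The sketch shows that integrating the curve with `auc()` and calling `roc_auc_score()` directly give the same number.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy example (hypothetical): true labels and predicted positive-class scores.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns one (FPR, TPR) point per decision threshold.
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)

area_from_curve = auc(fpr_demo, tpr_demo)                # area under the plotted curve
area_direct = roc_auc_score(y_true_demo, y_score_demo)   # computed from labels and scores

print(area_from_curve, area_direct)  # both 0.75
```

An AUC of 0.75 here means that in 3 of the 4 possible (positive, negative) pairs the positive sample received the higher score.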

In [None]:
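from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predicted probability of the positive class (the classifier must support predict_proba)
y_pred_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)

plt.plot([0, 1], [0, 1], "k--")  # diagonal: performance of random guessing
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid()
plt.show()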

### **Additional Tasks**

- **Use Different Features**: While loading the data, we chose `aac` as the `protein_feature`. For your GO term, this feature may not be the best one. Your task is to try different features and see whether the model performance improves.
- **Choose Different Model**: Try different models and see whether the model performance improves.
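
One way to organize such experiments is a loop that trains each candidate and records a score. The sketch below compares three classifiers on synthetic data; the data, the variable names, and the choice of F1 as the comparison metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale-sensitive models are wrapped in a pipeline that standardizes first.
candidates = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNeighborsClassifier": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = f1_score(y_te, clf.predict(X_te))
    print(f"{name}: F1 = {results[name]:.3f}")

best = max(results, key=results.get)
print("best model on this split:", best)
```

The same pattern applies to the feature comparison: loop over the `protein_feature` options, reload the data with `GOID(...)` for each one, and record the score of a fixed model.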

In [None]:
# Write your code here