
Grid Search with MLflow

Hyperparameters in machine learning models are optimized using the grid search approach. Because MLflow keeps track of the results of grid search experiments, a user can swiftly compare multiple hyperparameter settings and find which ones perform best. Grid search experiments can be shared with other team members quickly, and the most effective model from a grid search experiment can be deployed using MLflow.

Benefits of Grid Search

  • Automated Hyperparameter Tuning: Grid search automates hyperparameter tuning, allowing systematic exploration without manual trial and error.
  • Reproducibility: Grid search produces reproducible outcomes, which makes experiments easier to validate, share, and trust.
  • Exhaustive Search: Grid search finds the optimal hyperparameters for a model by exhaustively evaluating every combination in the grid.
  • Robustness: Grid search is a robust technique that is less sensitive to data noise, which helps to reduce overfitting.
  • Simple to Use: Grid search is simple to use and understand, which makes it a viable method for hyperparameter tuning.
  • Model Comparisons: Grid search simplifies model comparison and the selection of evaluation metrics.

Drawbacks of Grid Search

  • Computational cost: Grid search is computationally expensive for tuning a large number of hyperparameters.
  • Time-consuming: It is time-consuming for complex hyperparameter adjustments.
  • Not always necessary: It is not always required; random search is often an effective alternative.

Example: Finding the Best Model Settings for University Admission System

Let’s look at a grid search example for hyperparameter tuning within the framework of an online university admission system. In this example, we use scikit-learn and a straightforward Gradient Boosting Classifier (GBC) to forecast a student’s likelihood of being accepted to a university based on factors like GPA points, SAT scores, ACT scores, and extracurricular activities. Multiple options are available for grid search instead of GBC, including Logistic Regression (LR), SVM (Support Vector Machine), etc., as sketched below.
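To illustrate how another estimator could be swapped in, here is a minimal, hypothetical sketch of a grid search over a Logistic Regression model (the parameter values are illustrative and not part of the admission example that follows):

# Hypothetical sketch: grid search over Logistic Regression instead of GBC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr_param_grid = {
    'C': [0.1, 1.0, 10.0],            # inverse regularization strength
    'solver': ['lbfgs', 'liblinear']  # candidate optimization algorithms
}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), lr_param_grid, cv=5)
# lr_search.fit(features, target) would then evaluate all 3 x 2 = 6 combinations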

Generate Random Data for the Online Admission System

Python’s Pandas and random packages can be used to create a fictitious dataset for the admission system. With random values for the APP_NO, GPA, SAT Score, ACT Score, Extracurricular Activities, and Admission Status columns, this code generates a synthetic admission dataset. The students_records variable controls how many rows the dataset contains.

The admission status is randomly set based on a 70% acceptance rate, and the random module is used to produce random values for several columns. For demonstration purposes, the following code creates a fake admission dataset with random values and saves it to the std_admission_dataset.csv file:

Code Snippet:

# Import the Pandas and random libraries
import pandas as panda_obj
import random as random_obj

# Set the number of records for the student dataset to generate
students_records = 1000

# Create lists to store data
std_application_numbers = ['APP-' + str(random_obj.randint(1000, 9999)) for _ in range(students_records)]
std_gpa = [round(random_obj.uniform(2.5, 4.0), 2) for _ in range(students_records)]
std_sat_scores = [random_obj.randint(900, 1600) for _ in range(students_records)]
std_act_scores = [random_obj.randint(20, 36) for _ in range(students_records)]
std_extra_curriculars = [random_obj.choice(['Yes', 'No']) for _ in range(students_records)]

# Calculate admission status based on random acceptance rate
std_admission_status = [1 if random_obj.random() < 0.7 else 0 for _ in range(students_records)]

# Create a dictionary to hold the student data
std_data = {
  'APPLICATION_NO': std_application_numbers,
  'GPA': std_gpa,
  'SAT_Score': std_sat_scores,
  'ACT_Score': std_act_scores,
  'Extracurricular_Activities': std_extra_curriculars,
  'Admission_Status': std_admission_status
}

# Create a DataFrame DataFrame_Student from the dictionary
DataFrame_Student = panda_obj.DataFrame(std_data)

# Save the DataFrame DataFrame_Student to a CSV file named std_admission_dataset.csv
DataFrame_Student.to_csv('std_admission_dataset.csv', index=False)
print("Student Data Successfully Export to CSV File!")

Code Execution:

Run the code with the python command. If you encounter a module error, install the missing library with pip (for example, pip install mlflow scikit-learn pandas); use the pip3 install command if Python is version 3.X or higher.

Successful Execution:

Sample Data Screenshot:

Step 1: Import the Libraries

  • The mlflow library for machine learning experiment tracking, along with the mlflow.sklearn module for integrating the Scikit-Learn models
  • The Pandas library for handling the data processing and analysis
  • The warnings library to suppress warnings during execution
  • The train_test_split and ParameterGrid classes from the sklearn.model_selection module for splitting the data and iterating over the hyperparameter grid
  • GridSearchCV and GradientBoostingClassifier from sklearn.model_selection and sklearn.ensemble, respectively, for the grid search and the gradient boosting classifier model
  • The accuracy_score and classification_report functions from the sklearn.metrics module to calculate the model accuracy and generate classification reports
  • The os module, used to set the GIT_PYTHON_REFRESH environment variable to "quiet" to silence Git-related warnings

Code Snippet:

# Step-I Import Required Libraries
import mlflow
import mlflow.sklearn
import warnings as warn
import pandas as panda_obj
from sklearn.model_selection import train_test_split as tts, ParameterGrid as pg, GridSearchCV as gscv
import os
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.metrics import accuracy_score as acs, classification_report as cr
os.environ["GIT_PYTHON_REFRESH"] = "quiet"

Step 2: Set the Tracking URI

The MLflow tracking URI is set using the mlflow.set_tracking_uri() function, pointing the client at a tracking server on the local machine at port 5000, where the experiments and models are recorded. The server must already be running, e.g., started with the mlflow server --port 5000 command.

# Step-2: Point the MLflow client at the local tracking server
mlflow.set_tracking_uri("http://localhost:5000")

Step 3: Load and Prepare the Admission Dataset

The Pandas library, imported as panda_obj in Step 1, is used for data manipulation and analysis. The read_csv() function loads the admission dataset; the path to the dataset is the only argument that it requires, which in this instance is std_admission_dataset.csv. The dataset is loaded into a Pandas DataFrame.

The code first removes the Admission_Status column from the std_admissions_data DataFrame, since this column contains the target variable rather than a feature.

Then, the code creates two new variables: "F" and "t". The features are contained in the "F" variable, while the target variable is contained in the "t" variable. The categorical Extracurricular_Activities column in "F" is converted to numerical columns using one-hot encoding with the get_dummies() function.

The data is then split into training and testing sets. This is accomplished using the tts() function from the sklearn.model_selection package. The features, the target variable, the test size, and the random state are the four arguments that are required by the tts() function. The test_size parameter stipulates the portion of the data that is used for testing. Since the test size in this instance is set to 0.2, 20% of the data will be used for the test.

The random_state option specifies the random number generator seed, which ensures that the random split is reproducible. The training and testing sets are stored in the F_training, F_testing, t_training, and t_testing variables. These sets can be used to evaluate and train the machine learning models.

Code Snippet:

# Step-3: Load the admission dataset
std_admissions_data = panda_obj.read_csv('std_admission_dataset.csv')

# Preprocess the data and split into features (F) and target (t)
F = std_admissions_data.drop(['Admission_Status'], axis=1)
t = std_admissions_data['Admission_Status']

# Convert categorical variables to numerical using one-hot encoding
F = panda_obj.get_dummies(F)
F_training, F_testing, t_training, t_testing = tts(F, t, test_size=0.2, random_state=42)
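
As a quick sanity check that the split behaves as described, the shapes of the resulting sets can be printed (a minimal sketch using the variables defined above):

# Verify the 80/20 split: with 1000 records, expect 800 training and 200 testing rows
print(F_training.shape, F_testing.shape)
print(t_training.shape, t_testing.shape)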

Step 4: Set the MLflow Experiment Name

# Step-4: Define the experiment name and make it the active MLflow experiment
adm_experiment_name = "University_Admission_Experiment"
mlflow.set_experiment(adm_experiment_name)
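
The set_experiment() function makes the named experiment the active one for all subsequent runs and creates it on the tracking server if it does not already exist. As an optional sanity check (a small sketch, assuming the tracking server from Step 2 is reachable), the experiment can be looked up by name:

# Confirm that the experiment is registered on the tracking server
adm_experiment = mlflow.get_experiment_by_name(adm_experiment_name)
print(adm_experiment.experiment_id if adm_experiment else "Experiment not found")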

Step 5: Define the Gradient Boosting Classifier

The gradient boosting classifier model is now stored in the gbc_obj variable. The admission dataset can be employed to test and train this model. The value of the random_state argument is 42. This guarantees that the model is trained using the exact same random number generator seed which makes the outcomes repeatable.

# Step-5: Create the gradient boosting classifier with a fixed seed for reproducibility
gbc_obj = GBC(random_state=42)

Step 6: Define the Hyperparameter Grid

The code initially creates the param_grid dictionary. The hyperparameters that are adjusted via the grid search are contained in this dictionary. Three keys make up the param_grid dictionary: n_estimators, learning_rate, and max_depth. These are the gradient-boosting classifier model’s hyperparameters. The number of trees in the model is specified by the hyperparameter n_estimators. The model’s learning rate is specified via the learning_rate hyperparameter. The hyperparameter max_depth defines the highest possible depth of the model’s trees.

Code Snippet:

# Step-6: Define the hyperparameter grid for the grid search
param_grid = {
  'n_estimators': [100, 150, 200],
  'learning_rate': [0.01, 0.1, 0.2],
  'max_depth': [4, 5, 6]
}
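
Since each of the three hyperparameters has three candidate values, the grid contains 3 x 3 x 3 = 27 combinations, and Step 7 therefore produces 27 MLflow runs. This can be confirmed with the ParameterGrid class imported earlier (a small sketch using the pg alias from Step 1):

# ParameterGrid enumerates every combination in the grid
print(len(pg(param_grid)))  # prints 27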

Step 7: Perform the Grid Search with MLflow Tracking

The code then iterates over the param_grid dictionary. For each set of hyperparameters in the dictionary, the code does the following:

  • Starts a new MLflow run
  • Converts the hyperparameters to a list if they are not already a list
  • Logs the hyperparameters to MLflow
  • Trains a grid search model with the specified hyperparameters
  • Gets the best model from the grid search
  • Makes predictions on the testing data using the best model
  • Calculates the accuracy of the model
  • Prints the hyperparameters, accuracy, and classification report
  • Logs the accuracy and the model to MLflow

Code Snippet:

with warn.catch_warnings():
    warn.filterwarnings("ignore", category=UserWarning, module='.*distutils.*')
    for params in pg(param_grid):
        with mlflow.start_run(run_name="Admissions_Status Run"):
            # Convert single values to lists
            params = {key: [value] if not isinstance(value, list) else value for key, value in params.items()}
            mlflow.log_params(params)
            grid_search = gscv(gbc_obj, param_grid=params, cv=5)
            grid_search.fit(F_training, t_training)
            std_best_model = grid_search.best_estimator_
            model_predictions = std_best_model.predict(F_testing)
            model_accuracy_score = acs(t_testing, model_predictions)
            print("Hyperparameters:", params)
            print("Accuracy:", model_accuracy_score)
            # Explicitly ignore the UndefinedMetricWarning
            with warn.catch_warnings():
                warn.filterwarnings("ignore", category=Warning)
                print("Classification Report:")
                print(cr(t_testing, model_predictions, zero_division=1))
            mlflow.log_metric("accuracy", model_accuracy_score)
            mlflow.sklearn.log_model(std_best_model, "gb_classifier_model")
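
Once all the runs are logged, the best combination can also be retrieved programmatically instead of browsing the UI. The following is a minimal sketch using MLflow's search_runs() function, assuming the experiment name from Step 4:

# Fetch all runs of the experiment, sorted by the logged accuracy metric
best_runs = mlflow.search_runs(
    experiment_names=[adm_experiment_name],
    order_by=["metrics.accuracy DESC"]
)
# The first row now holds the best-performing hyperparameter combination
print(best_runs[['params.n_estimators', 'params.learning_rate',
                 'params.max_depth', 'metrics.accuracy']].head(1))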

Step 8: Execute the Program Using Python

Run the script with Python; each hyperparameter combination appears as a separate run in the MLflow UI at http://localhost:5000. Here is the output on the MLflow server:

Conclusion

Combining grid search with MLflow automates hyperparameter tuning, tracks the results, and makes the experiments reproducible. It helps to determine the ideal hyperparameters and ensures reliable results, but it can be computationally expensive for large hyperparameter grids.


Source: linuxhint.com
