Building MLflow Pipelines

September 20, 2023 | by Arround The Web

MLflow Pipelines (renamed MLflow Recipes in MLflow 2.0) is a framework for building and managing machine learning pipelines. A pipeline is a series of steps that carries out an end-to-end machine learning task: loading data, preprocessing, training, evaluating, and predicting. MLflow Pipelines offers a declarative API and a cache-aware executor, which make pipelines simpler to create and maintain.

Before creating an MLflow pipeline, its steps must be defined. A pipeline step performs one specific activity, such as loading data, preparing data, training a model, or assessing a model. Steps are written in Python and may use libraries such as scikit-learn, Pandas, and NumPy.

Declarative API

The MLflow Pipelines Declarative API allows you to define pipelines in a human-readable way.
You can define the steps in your pipeline, the dependencies between the steps, and the parameters for each step.
For example, a pipeline might define steps to load data, preprocess data, and train a model, as in the fraud detection example in the following sections.
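
To make the idea of declared steps and dependencies concrete, here is a toy sketch in plain Python. It is not the MLflow Pipelines API: the names `run_pipeline`, `steps`, and `cache` are hypothetical, and the in-memory cache only illustrates why a cache-aware executor can skip work that has already been done.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical declarative spec: step name -> (function, names of upstream steps).
Step = Tuple[Callable[..., object], List[str]]


def run_pipeline(steps: Dict[str, Step], target: str, cache: Dict[str, object]) -> object:
    """Run `target` after its dependencies, reusing any cached results."""
    if target in cache:
        return cache[target]
    func, deps = steps[target]
    inputs = [run_pipeline(steps, dep, cache) for dep in deps]
    cache[target] = func(*inputs)
    return cache[target]


calls = []  # records which steps actually executed


def load():
    calls.append("load")
    return [3.0, 1.0, 2.0]


def preprocess(rows):
    calls.append("preprocess")
    return sorted(rows)


def train(rows):
    calls.append("train")
    return sum(rows) / len(rows)  # stand-in "model": the mean of the data


steps = {
    "load": (load, []),
    "preprocess": (preprocess, ["load"]),
    "train": (train, ["preprocess"]),
}
cache = {}
model = run_pipeline(steps, "train", cache)
run_pipeline(steps, "train", cache)  # second request is served from the cache
```

Each step runs once even though `train` is requested twice. A real cache-aware executor would also invalidate cached results when a step's code or inputs change, which this toy does not attempt.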

Example of Fraud Detection using Pipeline:

Scenario:

Suppose that, as data scientists, we need to automate the detection of fraudulent transactions with MLflow. Data from a financial institution is loaded, preprocessed, and used to train a machine learning model that identifies fraud; the model is then logged so that it can be deployed. This is a deliberately simple illustration of automating the ML workflow, written as plain Python functions that are tracked through MLflow's run and logging calls. A step-by-step explanation of the code follows.

Pipeline Steps:

The pipeline consists of the following steps:

Load the data from a file.
Preprocess the data.
Train a random forest classifier model.
Evaluate the model.
Log the model, data, and evaluation report to MLflow.

Imports

The first few lines import the relevant modules. The Pandas library reads and edits DataFrames. The mlflow module keeps track of experiments and artifacts. The sklearn.model_selection module provides the function that splits the data into training and test sets. The RandomForestClassifier class from the sklearn.ensemble package trains the random forest classifier (RFC) model, and the classification_report function from the sklearn.metrics package produces the classification report.

Pipeline Functions:

Load & Pre-Process the Fraud Data

The load_fraud_data() function loads the fraud data from a CSV file. The preprocess_fraud_data() function is where the data would be pre-processed. In this instance, no pre-processing is required, so it returns the data unchanged.

Python Code:

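def load_fraud_data():
    # Read the raw transactions; pd_obj is the alias used for pandas.
    fraud_data = pd_obj.read_csv("fraud_data.csv")
    return fraud_data


def preprocess_fraud_data(fraud_data):
    # No pre-processing is needed here, so the data passes through unchanged.
    return fraud_data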

Train the Model using Random Forest Classifier

To detect fraud, the train_fraud_model() function trains a random forest classifier (RFC) model. The training set is used to build the model, and the test set is used to evaluate its performance. The function returns the trained model together with the classification report.

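def train_fraud_model(fraud_data):
    # Separate the features from the target column.
    Fraud_X = fraud_data.drop(columns=["fraud_label"])
    Fraud_y = fraud_data["fraud_label"]
    # Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
    X_fraud_train, X_fraud_test, y_fraud_train, y_fraud_test = tts(Fraud_X, Fraud_y, test_size=0.2, random_state=42)
    # Fit a random forest with 100 trees on the training set.
    fraud_detection_model = rnd_classifier(n_estimators=100, random_state=42)
    fraud_detection_model.fit(X_fraud_train, y_fraud_train)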

Evaluate the Model

The fraud detection model predicts fraud labels for the test data, and classification_report() summarizes the result as per-class precision, recall, and F1 score, together with the overall accuracy. Precision is the share of transactions flagged as fraud that really are fraud, recall is the share of actual fraud that the model catches, and the F1 score is the harmonic mean of the two.

    # Predict on the held-out test set and summarize the results.
    fraud_model_predictions = fraud_detection_model.predict(X_fraud_test)
    fraud_report = cls_report(y_fraud_test, fraud_model_predictions)
    return fraud_detection_model, fraud_report
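
As a worked example of the numbers in a classification report, the sketch below derives precision, recall, and F1 for the fraud class by hand from a confusion matrix. The four labels are toy values assumed for illustration, with 1 meaning fraud.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (assumed for illustration): 1 = fraud, 0 = legitimate.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]  # one legitimate transaction is wrongly flagged as fraud

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # flagged transactions that really are fraud
recall = tp / (tp + fn)     # actual fraud that the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

On imbalanced fraud data, accuracy alone can look good while recall for the fraud class is poor, which is why the per-class figures matter.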

Log Data & Parameters

The MLflow run is started with the run name "Fraud Detection Run"; the run ID is generated by MLflow itself. The fraud data is then loaded from a file, and its row count is logged to MLflow as a parameter. This information helps track the experiment and reproduce results. The mlflow.log_param("data_rows", len(fraud_data)) statement records the dataset size, which helps when comparing model performance on different datasets.

def fraud_detection_pipeline():
    with mlflow.start_run(run_name="Fraud Detection Run") as run:
        # Load the data and log the row count as a parameter
        fraud_data = load_fraud_data()
        mlflow.log_param("data_rows", len(fraud_data))

Process the Data

The preprocess_fraud_data() function is the place to make the fraud data more consistent, easier to understand, and less noisy before the model is trained and assessed. Typical preprocessing methods include missing value imputation, feature normalization, feature selection, and data transformation. In this example, the function returns the data unchanged.

        preprocessed_fraud_data = preprocess_fraud_data(fraud_data)
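
Because the example's preprocess_fraud_data() passes the data through untouched, here is a hypothetical sketch of what a non-trivial version might look like. The function name `preprocess_fraud_data_sketch`, the toy columns `amount` and `n_items`, and the assumption that every feature is numeric are all illustrative and not part of the original code.

```python
import pandas as pd_obj
from sklearn.preprocessing import StandardScaler


def preprocess_fraud_data_sketch(fraud_data, label_col="fraud_label"):
    """Hypothetical variant: impute missing values, then standardize numeric features."""
    processed = fraud_data.copy()
    feature_cols = processed.columns.drop(label_col)
    # Median imputation is less sensitive to extreme values than the mean.
    processed[feature_cols] = processed[feature_cols].fillna(processed[feature_cols].median())
    # Scale each feature to zero mean and unit variance; the label is left untouched.
    processed[feature_cols] = StandardScaler().fit_transform(processed[feature_cols])
    return processed


# Toy data with gaps (assumed for illustration).
toy = pd_obj.DataFrame(
    {
        "amount": [10.0, None, 30.0, 20.0],
        "n_items": [1.0, 2.0, None, 3.0],
        "fraud_label": [0, 0, 1, 0],
    }
)
clean = preprocess_fraud_data_sketch(toy)
print(clean)
```

Random forests do not need scaled features, so the scaling step matters mainly if the model is later swapped for one that does. Fitting the scaler on the full dataset before the train/test split also lets test-set statistics leak into training; fitting it on the training split only would avoid that.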

Train the Model and Log Metrics

The train_fraud_model() function trains a random forest classifier on the preprocessed fraud data and returns fraud_detection_model and fraud_evaluation_report. A holdout dataset is used to evaluate the model, and the evaluation report can later be used to compare the performance of several models.

        fraud_detection_model, fraud_evaluation_report = train_fraud_model(preprocessed_fraud_data)

Log the Trained Model

The fraud detection model is logged to MLflow using the mlflow.sklearn.log_model() function, which saves the model with the run in MLflow's model format so that it can later be loaded back or deployed.

        mlflow.sklearn.log_model(fraud_detection_model, "fraud_detection_model")

Log the Dataset Artifact

The mlflow.log_artifact() function logs a local file to MLflow as an artifact of the run. Here it logs fraud_data.csv, so the exact data that was used is stored alongside the model and the parameters.

        mlflow.log_artifact("fraud_data.csv")

Log Evaluation Report

The mlflow.log_text() method records the fraud evaluation report to MLflow as a text artifact for later use, and the print("Run ID:", run.info.run_id) call displays the run ID on the terminal screen.

        mlflow.log_text(fraud_evaluation_report, "fraud_evaluation_report.txt")
        print("Run ID:", run.info.run_id)


if __name__ == "__main__":
    fraud_detection_pipeline()

Calling the fraud_detection_pipeline() function will start the pipeline. This function will carry out the pipeline steps sequentially as it launches an MLflow run. The evaluation report, model, and data will be logged to MLflow. The console will print the run ID.

The Output of the Code

To run the code on a Windows computer, open the command prompt and go to the directory that contains the Python code file. Run the file with the Python interpreter from the shell or command prompt and hit Enter, updating the file name to include the location of the required Python file. A successful execution prints the run ID to the console.

In the run's directory, the code saves the data file and model to the artifacts folder. Similarly, parameters are stored in the params folder, each in its own file.

Conclusion

Building MLflow pipelines can help automate the machine learning workflow, making it easier to reproduce results, track experiments, and deploy models to production. MLflow provides a variety of tools to help build pipelines, including:

The mlflow.start_run() function starts an MLflow run, a unit of work that can be tracked and managed.
The mlflow.log_param() function logs a parameter to an MLflow run.
The mlflow.log_metric() function logs a metric to an MLflow run.
The mlflow.log_artifact() function logs an artifact to an MLflow run.

Source: linuxhint.com