TensorFlow Jobs¶

TensorFlow, like PyTorch, is a widely used open-source deep learning framework that provides a flexible environment for model training and deployment.

AI Lab provides built-in support for the TensorFlow framework, so you can quickly create TensorFlow jobs and train models through the graphical interface.

Job Configuration¶

Two job types are supported: Tensorflow Single and Tensorflow Distributed.
The runtime image already includes the TensorFlow framework by default, so no additional installation is required.

Job Runtime Environment¶

The examples here use the baize-notebook base image and its associated environment as the base runtime environment for jobs.

For information on how to create an environment, refer to Environment List.

Creating a Job¶

Example TFJob Single¶

Log in to the AI Lab platform and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Single and click Next.
Fill in the job name and description, then click OK.

Pre-warming the Code Repository¶

Use AI Lab -> Dataset List to create a dataset and pull the code from a remote GitHub repository into the dataset. This way, when creating a job, you can directly select the dataset and mount the code into the job.

Demo code repository address: https://github.com/d-run/training-sample-code/
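
Before launching a job, it can help to confirm that the script is where the launch command expects it. The sketch below is not part of the demo repository: `find_script` is a hypothetical helper, and it assumes the dataset is mounted at a root such as /code (inferred from the command parameters used in this guide). A temporary directory stands in for the mount point so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_script(mount_root: str, relative_path: str) -> Path:
    """Return the script path under the mounted dataset, or raise if it is missing."""
    script = Path(mount_root) / relative_path
    if not script.is_file():
        raise FileNotFoundError(
            f"{script} not found; is the dataset mounted at {mount_root}?"
        )
    return script


# Simulate a mounted dataset in a temporary directory (stand-in for /code)
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "tensorflow" / "tf-single.py"
    target.parent.mkdir(parents=True)
    target.write_text("print('hello')\n")
    print(find_script(root, "tensorflow/tf-single.py").name)
```

Inside a running job, the same check would be called with the real mount path instead of the temporary directory.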

Parameters¶

Launch command: Use bash
Command parameters: Use python /code/tensorflow/tf-single.py

"""
  pip install tensorflow numpy
"""

import tensorflow as tf
import numpy as np

# Create some random data
x = np.random.rand(100, 1)
y = 2 * x + 1 + np.random.rand(100, 1) * 0.1

# Create a simple model: one Dense unit on a single input feature
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model, setting epochs to 10
history = model.fit(x, y, epochs=10, verbose=1)

# Print the loss recorded for the last epoch
print(f"Final loss: {history.history['loss'][-1]}")

# Use the model to make predictions
test_x = np.array([[0.5]])
prediction = model.predict(test_x)
print(f'Prediction for x=0.5: {prediction[0][0]}')
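
To judge the loss value printed by this script, it helps to know what a perfect fit on the same kind of data looks like. The following sketch is an illustration only, not part of the demo repository: it regenerates data of the same form (y = 2x + 1 plus uniform noise in [0, 0.1)) with Python's standard `random` module instead of NumPy, and solves the least-squares line in closed form. The helper names `fit_line` and `mse` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Same data shape as the training script: 100 points, noise uniform in [0, 0.1)
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [2 * x + 1 + rng.random() * 0.1 for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, "
      f"mse={mse(xs, ys, slope, intercept):.5f}")
```

The best line has a slope close to 2 and an intercept close to 1.05, because the noise has a mean of 0.05. With Adam's default learning rate and only ten epochs, the Keras model will most likely still be far from this optimum, so a much larger final loss in the job log is expected and does not indicate a failed job.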

Results¶

After the job is submitted successfully, open the job details to view its resource usage. To see the logs produced during training, go to Workload Details from the upper right corner.

Example TFJob Distributed¶

Log in to AI Lab and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Distributed and click Next.
Fill in the job name and description, then click OK.

Example Job Introduction¶

This job includes three roles: Chief, Worker, and Parameter Server (PS).

Chief: Responsible for coordinating the training process and saving model checkpoints.
Worker: Executes the actual model training.
PS: Used in asynchronous training to store and update model parameters.

Different resources are allocated to different roles. Chief and Worker use GPUs, while PS uses CPUs and larger memory.
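
In a TFJob, each Pod normally learns its role from the TF_CONFIG environment variable, a JSON document with a `cluster` section (the addresses of every role) and a `task` section (the type and index of the current Pod). The sketch below shows how a script might read it. The host names, ports, and replica counts in `SAMPLE_TF_CONFIG` are made up for illustration, and `describe_task` is a hypothetical helper rather than a TensorFlow API.

```python
import json
import os

# Hypothetical TF_CONFIG for one worker Pod; host names and ports are made up
SAMPLE_TF_CONFIG = {
    "cluster": {
        "chief": ["demo-chief-0:2222"],
        "worker": ["demo-worker-0:2222", "demo-worker-1:2222"],
        "ps": ["demo-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
}


def describe_task(raw):
    """Parse a TF_CONFIG string and summarize the role of the current Pod."""
    config = json.loads(raw or "{}")
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    return {
        "type": task.get("type"),
        "index": task.get("index"),
        "is_chief": task.get("type") == "chief",
        "replicas": {role: len(hosts) for role, hosts in cluster.items()},
    }


# Use the real variable when running in a job, otherwise fall back to the sample
raw_config = os.environ.get("TF_CONFIG") or json.dumps(SAMPLE_TF_CONFIG)
print(describe_task(raw_config))
```

A flag such as `is_chief` is a common way to restrict checkpoint saving or log output to a single Pod.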

Parameters¶

Launch command: Use bash
Command parameters: Use python /code/tensorflow/tensorflow-distributed.py

import os
import json
import tensorflow as tf

class SimpleModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # A subclassed model infers the input size on the first call
        self.fc = tf.keras.layers.Dense(1)

    def call(self, x):
        return self.fc(x)

def train():
    # Print environment information
    # (tf.test.is_gpu_available() is deprecated; list the physical devices instead)
    gpus = tf.config.list_physical_devices('GPU')
    print(f"TensorFlow version: {tf.__version__}")
    print(f"GPU available: {bool(gpus)}")
    if gpus:
        print(f"GPU device count: {len(gpus)}")

    # Retrieve distributed training information from TF_CONFIG;
    # the role of the current Pod is stored under the "task" key
    tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
    task = tf_config.get('task', {})
    job_type = task.get('type')
    job_id = task.get('index')

    print(f"Job type: {job_type}, Job ID: {job_id}")

    # Set up the distributed strategy
    # (the tf.distribute.experimental alias is deprecated)
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = SimpleModel()
        loss_fn = tf.keras.losses.MeanSquaredError()
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)

    # Generate some random data
    data = tf.random.normal((100, 10))
    labels = tf.random.normal((100, 1))

    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            predictions = model(inputs)
            loss = loss_fn(labels, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        return loss

    for epoch in range(10):
        loss = train_step(data, labels)
        if job_type == 'chief':
            print(f'Epoch {epoch}, Loss: {loss.numpy():.4f}')

if __name__ == '__main__':
    train()

Results¶

As with the single job, open the job details to view the resource usage and log output of each Pod.