Neural Networks - Language Model

Junjie Wang

8/9/2024 · 4 min read

This project focuses on building a language model using PyTorch Lightning to predict and generate text based on character-level sequences. The model is trained on sequences of characters from names, where it learns to predict the next character in the sequence, given a preceding sequence of characters. The project involves preprocessing the data, defining the model architecture, training the model, and evaluating its performance. The goal is to create a robust language model that can generate realistic character-level sequences, which can be applied to various natural language processing tasks.

Custom Neural Network Layers

In this section, custom neural network layers are implemented using PyTorch. Each layer is designed to perform specific operations:

  1. Embedding Layer (Input Layer): Maps character indices to dense vectors of fixed size, providing learned representations of words or characters.

  2. FlattenConsecutive Layer (Hidden Layer): Reshapes the input tensor by flattening consecutive elements along the time dimension, which is useful for processing sequences hierarchically.

  3. BatchNorm1d Layer (Hidden Layer): Normalizes inputs across the specified dimensions, using batch statistics during training and running averages during evaluation. It also updates its running statistics using momentum.

  4. Tanh Layer (Hidden Layer): Applies the hyperbolic tangent activation function to the input.

  5. Linear Layer (Hidden & Output Layer): Initializes weights using Kaiming initialization and includes an optional bias. The forward pass computes the output by applying a linear transformation to the input.

Sequential Container: Chains multiple layers together, applying each layer sequentially to the input.
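
Below is a minimal sketch of how layers like these can be implemented from scratch in PyTorch; the exact signatures and defaults are illustrative assumptions rather than the project's actual code.

```python
import torch

class Embedding:
    # Maps integer indices to dense vectors via a lookup table.
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))
    def __call__(self, ix):
        self.out = self.weight[ix]
        return self.out
    def parameters(self):
        return [self.weight]

class FlattenConsecutive:
    # Flattens n consecutive time steps into the channel dimension.
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:          # drop a singleton time dimension
            x = x.squeeze(1)
        self.out = x
        return self.out
    def parameters(self):
        return []

class BatchNorm1d:
    # Normalizes with batch statistics during training and running
    # averages during evaluation; running stats updated with momentum.
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)
    def __call__(self, x):
        if self.training:
            dims = (0, 1) if x.ndim == 3 else 0
            xmean = x.mean(dims, keepdim=True)
            xvar = x.var(dims, keepdim=True)
        else:
            xmean, xvar = self.running_mean, self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:
    # Hyperbolic tangent non-linearity.
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []

class Linear:
    # Kaiming-initialized linear transformation with optional bias.
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class Sequential:
    # Chains layers, applying each one in order.
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```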

Hierarchical Network

This section defines a hierarchical neural network model for character-level learning. The model consists of several layers:

  • Embedding Layer: Converts characters into dense vectors.

  • FlattenConsecutive and Linear Layers: These layers process and transform the embeddings, with BatchNorm1d and Tanh activations interspersed to normalize and introduce non-linearity.

  • Final Linear Layer: Outputs predictions for the vocabulary size.

After the weights are initialized, the final layer's weights are scaled down to make the initial predictions less confident, which helps improve learning stability.
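
As a rough sketch, the hierarchical model could be assembled from the layers above like this; the hyperparameters (vocab_size, block_size, n_embd, n_hidden) are illustrative placeholders, not the project's exact settings.

```python
# Illustrative hyperparameters (assumed, not the project's exact values).
vocab_size = 27    # e.g. 26 letters plus a terminator token
block_size = 8     # context length in characters
n_embd = 24        # embedding dimension
n_hidden = 128     # hidden units per layer

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),   # final layer outputs logits over the vocabulary
])

# Scale down the final layer so initial predictions are less confident.
with torch.no_grad():
    model.layers[-1].weight *= 0.1

parameters = model.parameters()
for p in parameters:
    p.requires_grad = True
```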

2 Million Training Steps · 24 Hours Training Time

Model Setting

The model is configured with specific parameters:

  • Max Steps: The total number of training steps is set to 2,000,000.

  • Batch Size: Training uses a batch size of 128.

  • Learning Rate: Starts at 0.1 with decay applied halfway through the training process.

The model’s parameters are tracked to ensure proper updates during training.
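
A minimal configuration sketch of these settings follows; the post-decay learning rate of 0.01 is an assumption for illustration, since only the starting rate and the decay point are stated above.

```python
max_steps = 2_000_000   # total number of training steps
batch_size = 128

def get_lr(step):
    # Learning rate starts at 0.1 and decays halfway through training;
    # the post-decay value of 0.01 is an assumed placeholder.
    return 0.1 if step < max_steps // 2 else 0.01
```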

Training & Optimization

Training involves several steps:

  1. Minibatch Construction: Randomly selects batches from the training data.

  2. Forward Pass: Computes predictions and loss using cross-entropy.

  3. Backward Pass: Computes gradients and updates weights.

  4. Learning Rate Decay: Reduces the learning rate as training progresses to refine learning.

  5. Tracking: Loss and gradient statistics are tracked and printed periodically.

The training process is timed, and the total training duration is reported at the end.
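
A sketch of a training loop that follows these steps is shown below; the training tensors Xtr and Ytr are assumed names, and the logging interval is illustrative. The lossi and ud lists feed the plots discussed in the next section.

```python
import time
import torch.nn.functional as F

lossi = []  # per-step log10 loss history for the loss plot
ud = []     # update-to-data ratios for the update ratios plot
start = time.time()

for step in range(max_steps):
    # 1. minibatch construction (Xtr, Ytr are the assumed training tensors)
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))
    Xb, Yb = Xtr[ix], Ytr[ix]

    # 2. forward pass with cross-entropy loss
    logits = model(Xb)
    loss = F.cross_entropy(logits, Yb)

    # 3. backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # 4. parameter update with learning-rate decay
    lr = get_lr(step)
    for p in parameters:
        p.data += -lr * p.grad

    # 5. tracking: periodic printout plus per-step statistics
    if step % 10_000 == 0:
        print(f"{step:7d}/{max_steps:7d}: {loss.item():.4f}")
    lossi.append(loss.log10().item())
    with torch.no_grad():
        ud.append([(lr * p.grad.std() / p.data.std()).log10().item() for p in parameters])

print(f"total training time: {(time.time() - start) / 3600:.2f} hours")
```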

Plot Visualizations

In the Update Ratios Plot, we track how large the gradient updates to the model parameters are relative to the parameter values themselves. This is done by calculating the ratio of the gradient's standard deviation to the parameter's standard deviation, scaled by the learning rate, and taking the base-10 logarithm to better visualize how these ratios are distributed over training steps. The plot includes a horizontal line at -3 to indicate the expected ratio level, which serves as a reference for evaluating whether the gradient updates are within an appropriate range. This analysis helps in monitoring the stability and effectiveness of the parameter updates during training.

Update Ratios
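
A sketch of how such a plot could be produced with matplotlib from the ud statistics collected during training; restricting the plot to 2-dimensional weight tensors and the legend labels are assumptions.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 4))
legends = []
for i, p in enumerate(parameters):
    if p.ndim == 2:  # plot weight matrices only, for readability
        plt.plot([ud[j][i] for j in range(len(ud))])
        legends.append(f"param {i}")
# Reference line at -3: log10 of the expected update-to-data ratio (~1e-3).
plt.plot([0, len(ud)], [-3, -3], "k")
plt.legend(legends)
plt.xlabel("training step")
plt.ylabel("log10(lr * grad.std / data.std)")
plt.title("Update Ratios")
plt.show()
```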

Tracking Loss

The Loss Track Plot visualizes the model's loss progression over training steps. The plot averages the loss values over intervals of 200 training steps, smoothing out fluctuations to reveal the overall trend. This helps monitor the effectiveness of the model's learning process, with a focus on how the loss decreases as training progresses, and confirms that the model's performance improves over time.
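
A sketch of the smoothed loss plot, averaging the logged losses over windows of 200 steps as described above.

```python
import torch
import matplotlib.pyplot as plt

# Average the per-step log10 losses over consecutive windows of 200 steps.
loss_t = torch.tensor(lossi)
smoothed = loss_t[: len(lossi) // 200 * 200].view(-1, 200).mean(dim=1)

plt.plot(smoothed)
plt.xlabel("training step (x200)")
plt.ylabel("mean log10(loss)")
plt.title("Tracking Loss")
plt.show()
```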

Generating New Company Names

To generate new names, the model is trained on a dataset of company names, learning the character sequences they contain. After training, the model generates new names by sampling from the learned distributions.
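
A sketch of how sampling might look; the itos index-to-character mapping and the use of index 0 as a terminator token are assumptions consistent with character-level models of this kind, not confirmed details of the project.

```python
import torch
import torch.nn.functional as F

# Put BatchNorm layers into evaluation mode so running statistics are used.
for layer in model.layers:
    if isinstance(layer, BatchNorm1d):
        layer.training = False

for _ in range(10):  # generate 10 new names
    out = []
    context = [0] * block_size  # start from a context of terminator tokens
    while True:
        logits = model(torch.tensor([context]))
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        context = context[1:] + [ix]   # slide the context window
        if ix == 0:                    # terminator token ends the name
            break
        out.append(itos[ix])           # itos: assumed index-to-character map
    print("".join(out))
```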

Python Working Code

In this blog, I showcase my language model project, where I trained a neural network to generate new company names. The project involved building and fine-tuning the model to create unique, creative names. For those interested in the technical details, I've included a PDF of the Python code used in this project.