HuggingFace - NLP playaround

Nish · August 26, 2023


Intro: What is HuggingFace

HuggingFace is a great open-source platform that enables people to develop state-of-the-art models, datasets & applications and share them with the rest of the community, with the aim of accelerating open-source development in AI.

Over time, the range of models and datasets hosted on the platform has grown, and it now supports a wide variety of tasks, not just those within NLP.

HuggingFace itself consists of two main parts:

  1. HuggingFace Hub
    • This is the platform which hosts the model weights, datasets, documentation etc.
    • You can access it by visiting the site.
  2. Programming Libraries
    • Libraries such as transformers, datasets, tokenizers & accelerate all fall under the HuggingFace name.
    • These provide the tools needed to interact with the hosted models, data etc.

With the focus here being on playing around with existing models within the NLP domain, it’s worth noting that a few different API interfaces exist.

  1. Pipeline API
    • This is the highest level API for performing inference with existing models.
    • It’s a very intuitive, standardised interface which, with a few lines of code, can perform very advanced tasks across a wide range of areas in NLP.
  2. Autoclasses API
    • This is a slightly lower level API than the pipeline one.
    • Typically the go-to API for many when it comes to working with existing models (it balances convenience with capability).

We will have a playaround with both.

Notable Technical Terms

To work smoothly with the models on HuggingFace it’s worthwhile having a grasp of a few key ideas:

  • Architecture:
    • This is the skeleton of the model — the definition of each layer and each operation that happens within the model (forward pass).
  • Checkpoints:
    • These are the weights that will be loaded in a given architecture.
  • Model:
    • This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This post will specify architecture or checkpoint when it matters, to reduce ambiguity.
  • Language Modeling Head:
    • Sometimes also referred to as the adaptation head.
    • It represents the final layers of the model which modify the model for the particular task it’s been trained to work on.
  • Base Model:
    • Another term to refer to the architecture.
    • Essentially a model which hasn’t been pretrained.
  • Pretrained-Model:
    • The result of taking a base model and training it on a very large corpus of text for days using some (usually hefty 💰) compute power.
  • Finetuned-Model:
    • A model formed by taking an already pretrained model and training it on a smaller subset of data for your specific task using (usually) less compute.
    • Incorporates the idea of transfer learning, meaning less training time is needed to achieve good results.
    • Helps keep environmental costs to a minimum.
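
To make the architecture vs checkpoint distinction a bit more concrete, here’s a minimal sketch (using BERT purely as an example): building the architecture from a config gives you randomly initialised weights, whereas from_pretrained loads a checkpoint into that same skeleton.

from transformers import BertConfig, BertModel

# Architecture only: build the skeleton from a config, weights are randomly initialised
config = BertConfig()
base_model = BertModel(config)

# Architecture + checkpoint: load pretrained weights into the same skeleton
pretrained_model = BertModel.from_pretrained("bert-base-uncased")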

Playaround: Welcome to the pipeline

The pipeline is great and allows you to perform a wide variety of tasks. It abstracts away much of the underlying complexity, making it a good starting point for working with transformers.

What’s also great is that it doesn’t really require you to know the specific model type to use; instead you can lead with “What is the problem I am trying to solve?” and it can then automatically select an appropriate model for you.

Below are examples for different NLP tasks, using the pipeline class.

Sentiment Analysis

Sentiment analysis aims to determine the emotional tone behind a piece of text. The sentiment-analysis pipeline returns the sentiment (positive/negative/neutral) along with a confidence score.

from transformers import pipeline

# Initialize the pipeline for sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis")

# Analyze sentiment
result = sentiment_analyzer("I love programming!")
print(result)

Named Entity Recognition (NER)

NER identifies entities like names of persons, organizations, locations, etc., in a given text. The ner pipeline will tag these entities and provide their type and position in the text.

# Initialize the pipeline for NER
ner_pipeline = pipeline("ner")

# Run NER
result = ner_pipeline("Elon Musk is the CEO of SpaceX.")
print(result)

Text Generation

The text-generation pipeline can generate text based on a given prompt. You can customize this with parameters like max_length and num_return_sequences.

# Initialize the pipeline for text generation
text_generator = pipeline("text-generation")

# Generate text
result = text_generator("Once upon a time", max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])

Text Summarization

Text summarization aims to generate a concise summary of a longer text. The summarization pipeline takes the text and returns a summarized version.

# Initialize the pipeline for text summarization
summarizer = pipeline("summarization")

# Summarize text
result = summarizer("HuggingFace is creating a tool that democratizes AI.", min_length=5, max_length=20)
print(result[0]['summary_text'])

Translation

The translation_xx_to_yy pipeline can translate text from one language (xx) to another (yy). For example, translating English to French:

# Initialize the pipeline for translation from English to French
translator = pipeline("translation_en_to_fr")

# Translate text
result = translator("Hello, how are you?")
print(result[0]['translation_text'])

Zero-shot Classification

This pipeline allows you to classify text into categories that the model has not been specifically trained on.

# Initialize the pipeline for zero-shot classification
zero_shot_classifier = pipeline("zero-shot-classification")

# Classify text
result = zero_shot_classifier(
    "The stock market is doing well.",
    candidate_labels=["economy", "health", "politics"]
)
print(result)

These examples should give a good introduction to the pipeline class for various NLP tasks. Each example initializes a pipeline tailored to a specific task and then demonstrates basic usage of that pipeline. For more information on this, check out the pipeline docs.
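
It’s also worth knowing that, while the pipeline will pick a sensible default model for a task, you can pin it to a specific checkpoint from the Hub via the model argument. A quick sketch (the checkpoint below is just an example):

from transformers import pipeline

# Pin the pipeline to a specific checkpoint rather than the task default
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
print(sentiment_analyzer("I love programming!"))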

Playaround: AutoClasses here we go 🔥

The AutoClasses API in the HuggingFace Transformers library offers a more flexible but slightly more complex way to work with models. These classes automatically infer the correct architecture and can be used for custom pipelines, fine-tuning, or more advanced use-cases.

Below are examples for different NLP inference tasks using AutoClasses. To make comparison easy, the same like-for-like examples for each task from the pipelines section have been used.

Sentiment Analysis

For sentiment analysis, you can use AutoModelForSequenceClassification and AutoTokenizer.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Tokenize and analyze sentiment
inputs = tokenizer("I love programming!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.argmax(dim=1).item())  # Output class index (0: negative, 1: positive)
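
If you’d rather have a human-readable label and a probability than the raw class index, the checkpoint’s config carries an id2label mapping you can combine with a softmax. A small sketch continuing on from the snippet above:

import torch.nn.functional as F

# Convert logits to probabilities and map the predicted index to its label
probs = F.softmax(outputs.logits, dim=1)
predicted_idx = probs.argmax(dim=1).item()
print(model.config.id2label[predicted_idx], probs[0, predicted_idx].item())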

Named Entity Recognition (NER)

For NER, AutoModelForTokenClassification and AutoTokenizer are used.

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Tokenize and run NER
inputs = tokenizer("Elon Musk is the CEO of SpaceX.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.argmax(dim=2))  # Output class indices for each token
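
The raw output here is one class index per token. To make sense of it you can map each index back to its entity label via the model config and line the labels up against the tokens. A rough sketch continuing on from the snippet above:

# Map each token's predicted class index back to its entity label
predictions = outputs.logits.argmax(dim=2)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
for token, prediction in zip(tokens, predictions):
    label = model.config.id2label[prediction.item()]
    if label != "O":  # skip tokens not tagged as part of an entity
        print(token, label)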

Text Generation

For text generation tasks, AutoModelForCausalLM and AutoTokenizer are suitable.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize and generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=50)
print(tokenizer.decode(outputs[0]))  # Generated text

Text Summarization

For text summarization, AutoModelForSeq2SeqLM and AutoTokenizer can be used.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Tokenize and summarize
inputs = tokenizer("HuggingFace is creating a tool that democratizes AI.", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # Summarized text

Translation

For translation, you’d typically use AutoModelForSeq2SeqLM and AutoTokenizer.

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Tokenize and translate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(inputs.input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # Translated text

Zero-shot Classification

Zero-shot classification can also be done using AutoModelForSequenceClassification.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import torch.nn.functional as F

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

# Prepare the prompt and candidate labels
prompt = "The stock market is doing well."
candidate_labels = ["economy", "health", "politics"]

# Tokenize and create input tensors
input_pairs = [f"{prompt} This example is about {label}." for label in candidate_labels]
inputs = tokenizer(input_pairs, padding=True, truncation=True, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# For an MNLI model like facebook/bart-large-mnli the classes are ordered
# [contradiction, neutral, entailment]; take the entailment logit for each
# hypothesis and softmax across the candidate labels
entailment_logits = logits[:, 2]
probs = F.softmax(entailment_logits, dim=0).tolist()

# Pair labels with their probabilities
result = {label: prob for label, prob in zip(candidate_labels, probs)}

print(result)

The AutoClasses API provides more control compared to the pipeline class, allowing you to customize the workflow and even fine-tune models. Each example above demonstrates initializing a tokenizer and a model using AutoClasses, then applying them to a specific NLP task.

In each example we have provided an explicit path to the particular model that we want to use. You can do this by grabbing the model id (path) for the desired model from the HuggingFace Hub.
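
You can also mix the two APIs: load the tokenizer and model yourself with the AutoClasses and then hand them to a pipeline, keeping the convenient interface on top. A quick sketch reusing the sentiment checkpoint from above:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the components explicitly, then wrap them in a pipeline
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(sentiment_analyzer("I love programming!"))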

HuggingFace & AWS

HuggingFace have a partnership with AWS, meaning you can now utilise powerful SOTA models easily across a wide range of AWS services. HuggingFace have also collated useful material for working with AWS.

Performing Model Inference using AWS SageMaker Notebook instances

When it comes to performing model inference, in theory the snippets above should work. However, you may find yourself in a scenario where your notebook instance has networking/firewall restrictions, meaning you can’t access the internet. In such cases you’ll need a workaround to use the models; here are the steps you can take:

  1. Download the files associated with your model of choice to your local file system (a sketch for scripting steps 1 and 2 is provided after the helper function below).
  2. Upload those files to some directory inside S3.
    • S3 is Amazon’s cloud storage service in which you can create buckets (root-level directories inside your S3 instance).
    • Create a folder inside a bucket that your notebook will have access to.
  3. Download those files from S3 to your working notebook instance.
    • You can automate this with a helper function, as provided below.
  4. Instantiate the models as before, but instead specify a local path to the files stored within your notebook instance.
  5. Use your models.
    • Examples below showcase a few ways of using the models.
import boto3
from pathlib import Path

def download_model_files_from_s3(bucket_name, model_folder, local_folder, download_all=False, specific_files=None):
    """
    Download files from an S3 bucket to a local folder.
    
    Parameters
    -----------
    bucket_name : str 
        Name of the S3 bucket where the files are stored.
    model_folder : str
        Folder path in the S3 bucket where the files are located.
    local_folder : str
        Local folder where the files should be downloaded.
    download_all : bool
        Flag to download all files from the S3 folder.
    specific_files : list
        List of specific files to download. Used if download_all is False.

    Returns
    -------
    None : 
        Downloads files in specified local folder, no explicit return items.
    """
    s3 = boto3.client('s3')
    
    # Ensure the local folder exists; if not, create it.
    Path(local_folder).mkdir(parents=True, exist_ok=True)
    
    # Get list of all objects in the S3 folder
    s3_objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=model_folder)
    
    if 'Contents' not in s3_objects:
        raise Exception(f"No files found in {model_folder} in bucket {bucket_name}")

    # Get list of all file names in the S3 folder
    s3_files = [obj['Key'].split('/')[-1] for obj in s3_objects['Contents']]
    
    if download_all:
        # Download all files
        for file_name in s3_files:
            s3.download_file(bucket_name, f"{model_folder}/{file_name}", f"{local_folder}/{file_name}")
    else:
        # Download only essential or specific files
        files_to_download = specific_files if specific_files else [
            "config.json",
            "pytorch_model.bin",
            "vocab.txt",
            "vocab.json",
            "merges.txt",
            "special_tokens_map.json",
            "tokenizer_config.json",
            "tokenzier.json",
            "generation_config.json"
        ]
        
        for file_name in files_to_download:
            if file_name in s3_files:
                s3.download_file(bucket_name, f"{model_folder}/{file_name}", f"{local_folder}/{file_name}")
            else:
                print(f"Warning: {file_name} not found in S3 folder. Skipping download.")

Text Generation

Firstly, you can download the files to your notebook instance using the helper function provided.

BUCKET_NAME = '<insert_bucket_name>'
MODEL_FILES_PATH = '<insert_textgen_model_file_path>'
LOCAL_FILES_PATH = '<insert_desired_local_path>'

download_model_files_from_s3(bucket_name=BUCKET_NAME, model_folder=MODEL_FILES_PATH, local_folder=LOCAL_FILES_PATH)

When specifying the paths, I tend to name the folders inside S3 and locally inside the notebook instance the same as (or very close to) the model id from the model page. This makes it easier to remember what each one is when working with multiple different types of models (plus it saves you having to think of a different name).

From there you just want to pull in your model. Here I am using the GPT-2 model, but ultimately the model you choose is up to you.

from transformers import GPT2Config, GPT2Tokenizer, GPT2LMHeadModel

def load_gpt2(local_model_folder):
    """
    Load the various parts of the model: config, tokenizer and the model itself.
    """
    config = GPT2Config.from_pretrained(f"{local_model_folder}/config.json")
    tokenizer = GPT2Tokenizer.from_pretrained(f"{local_model_folder}/", config=config)
    model = GPT2LMHeadModel.from_pretrained(f"{local_model_folder}/", config=config)

    return model, tokenizer, config

local_model_path = '<path_to_gpt2_model>'

gpt2_model, gpt2_tokenizer, gpt2_config = load_gpt2(local_model_folder=local_model_path)

input_text = '<Add your text here>'
encoded_input = gpt2_tokenizer(input_text, return_tensors='pt')
output = gpt2_model.generate(**encoded_input,
                            num_beams=5,
                            max_new_tokens=5,
                            num_return_sequences=5,
                            top_k=50,
                            top_p=0.95,
                            temperature=0.7,
                            do_sample=True
                        )

# prints out generated text to the console
for generated_ids in output:
    generated_text = gpt2_tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)

Some points to bear in mind from the above snippet:

  1. I am NOT using model-agnostic API signatures for calling the model since I know the model I want and am being explicit for demo purposes.
    • In reality, most of the time you’ll want to use model-agnostic signatures and specify the model you want by passing through the appropriate arguments.
  2. The model classes are smart and don’t need a direct path to each specific initialisation file; pointing them at the folder containing the files is enough. I have only pointed the config directly at config.json for additional clarity.
    • They can detect which files they need within the folder.
  3. The choices of generation arguments are also for demo purposes. You can find the docs here, which run over what the various strategies are and how to tweak them to suit your liking.
  4. Batching your input sequences together into a list is also possible and will work (see the sketch after this list).
    • You’d want to ensure you understand the shape of the generated output so you can keep track of which output corresponds to which input sequence.
  5. You can package up the output however you want; for this playaround I was just printing it to the console for quick viewing.
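
On the batching point, here’s a rough sketch of how that could look with the GPT-2 model loaded above (it assumes setting a pad token and left-sided padding, which decoder-only models like GPT-2 generally need; the example sentences are placeholders):

# GPT-2 has no pad token by default, so reuse the EOS token and pad on the left
# so generation continues from the real end of each prompt
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
gpt2_tokenizer.padding_side = "left"

batch_inputs = ["Once upon a time", "The weather today is"]
encoded_batch = gpt2_tokenizer(batch_inputs, return_tensors="pt", padding=True)

batch_output = gpt2_model.generate(
    **encoded_batch,
    max_new_tokens=20,
    do_sample=True,
    pad_token_id=gpt2_tokenizer.eos_token_id,
)

# Each row of batch_output lines up with the corresponding input sequence
for input_text, generated_ids in zip(batch_inputs, batch_output):
    print(input_text, "->", gpt2_tokenizer.decode(generated_ids, skip_special_tokens=True))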

Sentiment Analysis

This takes a very similar form to the text-generation example above, however you’d want to swap the model used and tweak the related post-processing around it.

Here I have gone with a Twitter-RoBERTa model, which is based on the original RoBERTa base model but finetuned on a bunch of Twitter data; it outputs whether some input text is negative, neutral or positive.

import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

def load_sentiment_analysis_model(local_model_folder):
    """
    Load the various parts of the model: config, tokenizer and the model itself.
    """
    config = AutoConfig.from_pretrained(f"{local_model_folder}/")
    tokenizer = AutoTokenizer.from_pretrained(f"{local_model_folder}/", config=config)
    model = AutoModelForSequenceClassification.from_pretrained(f"{local_model_folder}/", config=config)

    return model, tokenizer, config

local_model_files_path = '<insert_path>'
twitter_roberta_model, twitter_roberta_tokenizer, twitter_roberta_config = load_sentiment_analysis_model(local_model_folder=local_model_files_path)

input_text = '<insert_desired_text>'
encoded_input = twitter_roberta_tokenizer(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = twitter_roberta_model(**encoded_input)

logits = outputs.logits
probs = F.softmax(logits, dim=-1)
probs_percent = probs * 100

# defining class labels; depending on the checkpoint these may also be available from the model config (e.g. id2label)
class_labels = ['negative', 'neutral', 'positive']

results = []
for prob in probs_percent:
    result = {class_labels[i]: f"{prob[i].item():.2f}%" for i in range(len(class_labels))}
    results.append(result)

for i, result in enumerate(results, 1):
    print(f"Predicted sentiments for sentence {i} in the batch are {result}")

A few noteworthy points here are:

  1. Decided to use the model-agnostic API since the other example didn’t; also didn’t specify specific files for the initialisation and am relying on the smart initialisation features.
  2. Making use of the torch.no_grad() context manager to prevent gradient tracking since we are performing inference, which can give a slight performance improvement.
  3. Post-processing the output so that it works for multiple sequences in a batch & you can see the percentages for every class, not just the label of the highest-scoring class.

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {HuggingFace - NLP playaround},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/HuggingFace-Playaround/}},
  note = {[Online; accessed]},
  year = {2023}
}
