As prerequisites we need Python, pip, and venv installed; for Git we also need Git LFS, because Hugging Face supports files larger than 5 GB in repos. To work with your account you can use Anaconda with Jupyter Notebook, Google Colab (then there is nothing to install), or JetBrains tools (I use IntelliJ with plugins; many people use PyCharm). The token and the downloaded models are kept in the ~/.cache/huggingface/ folder.
pip install huggingface-hub
# Log in using a token from huggingface.co/settings/tokens (this only needs to be done once)
huggingface-cli login
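Since the cache folder mentioned above can grow quickly, here is a minimal sketch (using the huggingface_hub package installed in the previous step) of how to check what is stored under ~/.cache/huggingface/:
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()  # scans the local Hugging Face cache (by default under ~/.cache/huggingface/)
print("total size on disk:", cache_info.size_on_disk, "bytes")
for repo in cache_info.repos:  # one entry per cached model or dataset repo
    print(repo.repo_id, repo.size_on_disk)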
When working with Hugging Face objects you can use the built-in help() function, which displays detailed information about what an object contains, e.g. its parameters and their allowed values. Try help(pipeline) to see all the possible tasks (a short usage sketch follows the list), for example:
– “question-answering”: will return a [`QuestionAnsweringPipeline`].
– “summarization”: will return a [`SummarizationPipeline`].
– “table-question-answering”: will return a [`TableQuestionAnsweringPipeline`].
– “text2text-generation”: will return a [`Text2TextGenerationPipeline`].
– “text-classification” (alias “sentiment-analysis” available): will return a [`TextClassificationPipeline`].
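A minimal sketch of two of the tasks from the list above; without an explicit model=... argument the pipeline downloads whatever default checkpoint the library picks, so both the model choice and the exact outputs here are illustrative only:
from transformers import pipeline

summarizer = pipeline("summarization")
print(summarizer("Hugging Face pipelines wrap a tokenizer and a model behind a single task-oriented call, so a few lines of code are enough to summarize text, answer questions or classify sentences."))

qa = pipeline("question-answering")
print(qa(question="Where are downloaded models kept?",
         context="The token and the downloaded models are kept in the ~/.cache/huggingface/ folder."))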
Each model page on Hugging Face also has sample code showing how to start using the model, under the "Use this model" menu.
Using transformers from code
# in Google Colab use ! for pip
!pip install transformers
# log in to the Hub on Colab; not needed locally if you already logged in earlier
from huggingface_hub import notebook_login
notebook_login()
# -------------------------------------------
# for most objects you can use help(), e.g. help(pipeline)
# using transformers and pipelines
import transformers
from transformers import AutoTokenizer
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
classifier = pipeline("text-classification", model="philschmid/tiny-bert-sst2-distilled")
# training data load (the datasets package is installed separately: pip install datasets)
from datasets import load_dataset
dataset = load_dataset("smth")
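# a hedged sketch of inspecting whatever dataset you actually load
# (kept as comments because "smth" above is only a placeholder id):
# print(dataset)                # shows the available splits and their row counts
# print(dataset["train"][0])    # shows the first training example, if a "train" split exists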
# to get a tensor of token ids
print(tokenizer("Some random sentences here", return_tensors='pt').input_ids)
# to get classifier analysis
classifier("I love that thing!")
classifier(["I love that thing!", "I hate your style."])
# to decode a single token id, usually taken from input_ids
tokenizer.decode(1024)
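# hedged round-trip sketch: encode one word without special tokens,
# then decode each id separately to see the sub-word pieces
ids = tokenizer("tokenization", add_special_tokens=False).input_ids
print(ids, [tokenizer.decode(i) for i in ids])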
input_ids = tokenizer("Some random sentences here", return_tensors='pt').input_ids
# torch and AutoModelForCausalLM are needed for the model and the torch.topk calls below
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
# optional - example only
# response_logits is a torch tensor, so response_logits.shape shows its dimensions;
# with a single sequence and 3 dimensions, [0, -1] picks the logits of the last token
response_logits = model(input_ids).logits
best_find_index = response_logits[0, -1].argmax()
tokenizer.decode(best_find_index)
# e1 - plain top-k logits, e2 - top-k after softmax, i.e. with probabilities
e1 = torch.topk(response_logits[0, -1], 10)
e2 = torch.topk(response_logits[0, -1].softmax(dim=0), 10)
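# hedged sketch: decode the top-k candidate tokens together with their probabilities
# (torch.topk returns a named tuple with .values and .indices)
for tok_id, prob in zip(e2.indices.tolist(), e2.values.tolist()):
    print(repr(tokenizer.decode(tok_id)), round(prob, 4))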
# TEXT generation
out = model.generate(input_ids, max_new_tokens=20, repetition_penalty=1.4, do_sample=True, top_k=5, top_p=0.9, temperature=0.5)
print(tokenizer.decode(out[0]))
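As a hedged follow-up (not part of the original snippet): Phi-3.5-mini-instruct is an instruct model, so prompts are normally wrapped in its chat template with apply_chat_template before calling generate(); the message content below is just an example.
messages = [{"role": "user", "content": "Explain tokenization in one sentence."}]
chat_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
chat_out = model.generate(chat_ids, max_new_tokens=40, do_sample=True, temperature=0.7)
# decode only the newly generated part, skipping the prompt tokens
print(tokenizer.decode(chat_out[0][chat_ids.shape[-1]:], skip_special_tokens=True))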