Preprocessing images

In this exercise, you'll use the flickr dataset, which has 30,000 images and associated captions, to perform preprocessing operations on images. This preprocessing is needed to make the image data suitable for inference with Hugging Face model tasks, such as text generation from images. In this case, you'll generate a text caption for this image:

Photo of 2 people with 1 playing the guitar

The dataset (dataset) has been loaded with the following structure:

Dataset({
    features: ['image', 'caption', 'sentids', 'split', 'img_id', 'filename'],
    num_rows: 10
})

The image captioning model (model) has been loaded.

This exercise is part of the course

Multi-modal models with Hugging Face


Exercise instructions

Load the image from the element at index 5 of the dataset.
Load the image processor (BlipProcessor) of the pretrained model: Salesforce/blip-image-captioning-base.
Run the processor on image, specifying that PyTorch tensors (pt) are required.
Use the .generate() method to generate a caption with the model.
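
To give a feel for what the processor step does to an image, here is a minimal NumPy-only sketch, not the library's actual implementation. It assumes a 384x384 target size and the mean and standard deviation values shown below, which are believed to match the BLIP image processor defaults. It uses nearest-neighbour resizing as a simplification, where the real processor is expected to use a smoother interpolation. The function name preprocess_image and the synthetic input are hypothetical.

```python
import numpy as np

# Assumed values (not stated in the exercise): target size and per-channel
# normalization statistics believed to match the BLIP image processor defaults.
TARGET_SIZE = 384
IMAGE_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
IMAGE_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def preprocess_image(pixels: np.ndarray, size: int = TARGET_SIZE) -> np.ndarray:
    """Turn an (H, W, 3) uint8 RGB array into a (1, 3, size, size) float32 array."""
    height, width, _ = pixels.shape
    # Nearest-neighbour resize: pick one source row and column per target pixel.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = pixels[rows][:, cols]
    # Rescale integer intensities from 0..255 to floats in 0..1.
    scaled = resized.astype(np.float32) / 255.0
    # Normalize each colour channel with the assumed mean and std.
    normalized = (scaled - IMAGE_MEAN) / IMAGE_STD
    # Move channels first and add a batch dimension, as vision models expect.
    return normalized.transpose(2, 0, 1)[np.newaxis, ...]


# Usage with a synthetic image standing in for dataset[5]["image"].
fake_image = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
pixel_values = preprocess_image(fake_image)
print(pixel_values.shape, pixel_values.dtype)
```

Requesting return_tensors="pt" from the real processor yields the same kind of batched, channels-first array, only as a PyTorch tensor rather than a NumPy array.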

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Load the image from index 5 of the dataset
image = dataset[5]["____"]

# Load the image processor of the pretrained model
processor = ____.____("Salesforce/blip-image-captioning-base")

# Preprocess the image
inputs = ____(images=____, return_tensors="pt")

# Generate a caption using the model
output = ____(**inputs)
print(f'Generated caption: {processor.decode(output[0])}')
print(f'Original caption: {dataset[5]["caption"][0]}')
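
The preprocess, generate, decode flow in the template can also be wrapped in a small helper. This is a sketch under assumptions, not the course's solution: generate_caption is a hypothetical name, and it only relies on the processor being callable with images and return_tensors, on the model exposing .generate(), and on the processor exposing .decode(), as the exercise describes. Passing skip_special_tokens=True is an optional addition that drops marker tokens from the decoded text.

```python
from typing import Any


def generate_caption(image: Any, processor: Any, model: Any) -> str:
    """Preprocess one image, generate token IDs, and decode them to text."""
    # The processor returns a dict-like object of PyTorch tensors.
    inputs = processor(images=image, return_tensors="pt")
    # generate() returns a batch of token-ID sequences, one per input image.
    output = model.generate(**inputs)
    # Decode the first (and only) sequence; special tokens are dropped.
    return processor.decode(output[0], skip_special_tokens=True)


# Intended usage with the objects from the exercise (not executed here):
# caption = generate_caption(dataset[5]["image"], processor, model)
```

Keeping the processor and model as parameters means the same helper works for any element of the dataset, and for any captioning model that follows the same processor and generate conventions.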

