Week 2

Week 2 (July 4th - July 8th)

July 5th:

Professor Ordóñez Román and I met on July 5th after returning from a 3-day weekend (due to the 4th of July). On the 5th we discussed the tests I ran with the CLIP model. CLIP models are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance. For further context, the CLIP model jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. I ran the CLIP model on Google Colab by importing it from its original GitHub repository. I then used the CLIP model to run some experiments with my own images and text prompts, and observed how the model responded to each image. For these experiments I used a picture of a water bottle, a bracelet, and a computer mouse.
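
Below is a minimal sketch of the kind of zero-shot test I ran, using the original openai/CLIP repository. The image file name and the exact prompts here are placeholders standing in for my own photos and captions.

```python
# Minimal zero-shot CLIP sketch (pip install git+https://github.com/openai/CLIP.git).
# "water_bottle.jpg" and the prompts are placeholders for my own images/captions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("water_bottle.jpg")).unsqueeze(0).to(device)
prompts = ["a photo of a water bottle", "a photo of a bracelet", "a photo of a computer mouse"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)        # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1).cpu()  # probabilities over the prompts

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.4f}")
```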

My results are as follows:

CLIP table

I displayed my data in a table that lists each text prompt given with the image and the probability the CLIP model assigned to it. The closer the probability value is to 1, the better the text prompt matches the image according to the CLIP model. In the first round of tests I used easier text prompts, while in the second round I used text prompts that I believed might be more confusing for the CLIP model. Based on these results you can see that in the first round essentially all of the probability went to the correct text prompt, whereas in the second round the probability values were more spread out. From these tests I gathered that the CLIP model can identify conventional everyday objects.

July 6th:

After performing these tests I used some of Professor Ordóñez Román’s lectures and slides from his COMP 646: Deep Learning for Vision and Language class to get a better understanding of integral concepts in Vision and Language. His lectures and slides gave me a lot of insight into how vision and language models function, as well as the different features each contains and what they mean.

Along with the lectures, I began some more tutorials on Python, including PyTorch and image classification. These tutorials are on Google Colab. Since I have not used Python for vision and language before, it is important that I am able to execute the tasks in the Colab notebooks correctly and comprehend them. Some things that I learned were the difference between supervised and unsupervised learning, as well as some examples of supervised learning and how it works within machine learning (e.g. classifiers and regression models).
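
As a quick toy illustration of that distinction (not taken from the tutorial notebooks themselves, just a sketch using scikit-learn on synthetic data):

```python
# Supervised vs. unsupervised learning on a tiny synthetic dataset.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# 2D points grouped into two clusters, with labels y.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the classifier is trained with the labels y.
clf = LogisticRegression().fit(X, y)
print("classifier accuracy:", clf.score(X, y))

# Unsupervised: KMeans only sees X and has to discover structure on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments for first 5 points:", kmeans.labels_[:5])
```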

July 7th:

On the 7th, Professor Ordóñez Román, Eleanor (the other DREU intern working with Professor Ordóñez Román), and I got lunch at the POST Houston, which is a food mall with a lot of variety. Since I haven’t been around Houston much it was a cool thing to get to see, and the terrace view was amazing. We all got sushi and then headed back to Rice University.

July 8th:

On Friday I began running some experimental behavioral tests on the CLIP model (CLIP-ViT-B/32) using the Winoground challenge dataset, as previously discussed with my mentor. Before I began running these tests I read the Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality article.

Here are some key takeaways from the article:

  • Invention: Winoground, a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning
  • Goal: given two images and two captions, the model must match them correctly
    • Challenge: both captions contain a completely identical set of words, only in a different order
  • Good performance on Winoground means models need to not only encode text and images but also synthesize information across the two modalities
  • Tags split into: objects, relations, swaps, one predicate, two predicates, pragmatics, symbolic, and series
  • Metrics included: text score, image score, and group score

Basically, the Winoground dataset serves vision and language models two captions with identical words that are supposed to be matched to two separate images. For example,

Example Winoground

Now, it was my turn to run tests on the CLIP model using the Winoground dataset. Basically, we upload both the CLIP model and the Winoground challenge dataset into a Google Colab notebook and then use the images in the Winoground dataset to test the CLIP model (a rough sketch of this setup is shown below the image count). Here is the count for the images in the Winoground dataset:

Image Count Winoground
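
Here is a rough sketch of how the Winoground examples can be fed to CLIP in Colab. I am assuming the Hugging Face copy of the dataset (facebook/winoground), which is gated and requires an access token; the field names like "image_0" and "caption_0" follow its documented schema, and the token below is a placeholder.

```python
# Score every Winoground example with CLIP ViT-B/32 (sketch, not the exact notebook).
import torch
import clip
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Gated dataset: requires accepting the terms on the Hugging Face Hub and a token.
winoground = load_dataset("facebook/winoground", split="test",
                          use_auth_token="YOUR_HF_TOKEN")  # placeholder token

scores = []
for ex in winoground:
    images = torch.stack([preprocess(ex["image_0"].convert("RGB")),
                          preprocess(ex["image_1"].convert("RGB"))]).to(device)
    texts = clip.tokenize([ex["caption_0"], ex["caption_1"]]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(images, texts)  # 2x2 image-caption similarity matrix
    scores.append(logits_per_image.cpu())
```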

After you test each example on the CLIP model you can calculate the text score, which measures whether a model can select the correct caption given an image; the image score, which measures whether a model can select the correct image given a caption; and the group score, which measures whether every combination for a given example {(C0, I0), (C0, I1), (C1, I0), (C1, I1)} is correctly scored by the model.
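
The three metrics boil down to simple comparisons over each 2x2 similarity matrix. Here is a sketch following the paper's definitions, assuming `sim[i][j]` is the similarity between image i and caption j (as produced by the loop in the previous sketch):

```python
# Winoground metrics over a list of 2x2 similarity matrices (rows = images, cols = captions).
def text_correct(sim):
    # For each image, the matching caption must score higher than the other caption.
    return bool(sim[0][0] > sim[0][1]) and bool(sim[1][1] > sim[1][0])

def image_correct(sim):
    # For each caption, the matching image must score higher than the other image.
    return bool(sim[0][0] > sim[1][0]) and bool(sim[1][1] > sim[0][1])

def group_correct(sim):
    # All four pairings must be ranked correctly.
    return text_correct(sim) and image_correct(sim)

text_score  = sum(text_correct(s)  for s in scores) / len(scores)
image_score = sum(image_correct(s) for s in scores) / len(scores)
group_score = sum(group_correct(s) for s in scores) / len(scores)
print(f"text: {text_score:.3f}  image: {image_score:.3f}  group: {group_score:.3f}")
```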

Winoground on CLIP results

Upcoming

  • Keep checking resources discussed in previous meetings
  • More practice with machine learning in Python
  • Watch more lectures + read slides from mentor’s class on Vision and Language
Written on July 4, 2022