Week 1

Week 1 (June 20th - June 23rd)

June 20th:

I began my research this week in person, I got to campus on Tuesday June 20th. This day I took to settle in to campus and my dorm room. I am very excited to spend another Summer at the vislang lab.

June 21st

This day, I was able to meet one of the PhD students working in Vicente’s lab over the Summer, Jefferson. He was king enough to give me a tour of the Rice campus and also introduce me to the other PhD student Ernesto. I was also able to get my Rice ID from the RUPD and get my desk setup in Duncan Hall. After this, I was able to meet up with the other intern Sindhu and Cecilia Fernandez who is the DEI chair in the School of Engineering. We met up for coffee, it was nice to talk to Cecilia about what it’s like being a women in Computer Science as well as start to learn some things to do in Houston while I’m here.

June 23rd

Vicente will be returning to campus today but until he returns he gave me two articles to look over and start cultivating ideas on how to go about the research this Summer. The first article was about LAION-5B: An open large-scale dataset for training next generation image-text models. Here are some quick notes I took over the article:

LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs of which 2.32B (billion) contain English language.
With this dataset there’s been successful replication and fine fine tuning of models likes CLIP, GLIDE, and Stable Diffusion
Provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection
CLIP has achieved large gains in zero-shot classification on ImageNet
Pre-training dataset is important!!
A lot of the advances seen recently in image-text models has been through large datasets containing hundreds of millions or even billions of image-text pairs (for example 400 million for CLIP and 6.6 billion for BASIC)
- None of the datasets being used on these models are publicly available making it hard to follow along
The CLIP models have been released by OpenAI to the public but neither the pre-training dataset or the resulting models are available
Challenge: research is not open to AI community impeding progression in labs
Used the data source Common Crawl and filtered through the data with an existing and public CLIP model
- New Dataset
  - 2.32 billion English image-text examples
  - 2.26 billion multilingual examples
  - 1.27 billion examples that are not specific to a particular language (e.g. places, products, etc.)
Exploring its ethical implications and flaws that emerge with large-scale data collection
We strongly recommend that LAION-5B should only be used for academic research purposes in its current form. We advise against any applications in deployed systems without carefully investigating behavior and possible biases of models trained on LAION-5B
Data Assembly Pipeline
- Feed in common crawl
  - Parsed the HTML IMG tag from common crawl webisre files
  - Made sure they had alt-text
- Webpage filtering
  - Perform detection using CLD3 with three possible outputs
    - English
    - Another language
    - No language (products and places)
  - Store data in PostgreSQL server
  - 500M image URLs in server at all times
- Download image-text pairs
  - Download raw images from the parsed URLs
  - Use roughly 300 workers in parallel all with 10,000 links and 2vCPUs, 1GB of RAM, and 10Mbps download bandwitdth
- Content filtering
  - Remove data with less than 5 characters of text
  - Less than 5KB of image data
  - Potential malicious, large, or redundant images
  - Compute cosine similarities between image-text pairs
    - Use CLIP models for different outputs
  - Remove all English image-text pairs with cosine similarity below 0.28, and all other pairs with similarity below 0.26
  - This removed around 90% of the original 50 billion images, leaving almost 6 billion examples
- Store data

Screen Shot 2023-07-11 at 4 49 44 PM

TO-DO FOR NEXT WEEK

Keep reading articles
Think of how to improve work from last summer
Start familiarizing with updated code from the GitHub repository

Written on June 20, 2023