E6: BLIP/BLIP2 & E7: Perplexity
Understanding images with AI, Image Aesthetics, Google's Innovator's Dilemma and quotes from both episodes.
Hey Cogs,
Today we’ll discuss:
Understanding images with AI
Image Aesthetics
Google’s Innovator’s Dilemma
Notable quotes from Episodes 6 & 7
A look ahead and “homework” for Episode 8
Understanding Images with AI
Multimodal models like Google's Flamingo hold great promise for processing visual data as natively as text. Until that day arrives, though, reducing visual data to text is the standard way to feed images into language models. Options for using image data to guide language models include captioning, VQA, OCR, tagging, image selection, aesthetics scoring, and text-image matching. Image captioning is the most popular and most requested option, and the best bet today is BLIP; other options include BLIP VQA, Amazon Lambda, Hugging Face, and Microsoft. Visual QA and aesthetics scoring are also viable. Tagging misses out on alt text and narrative assistance, but still has utility for image-generation prompting. Segmentation, a combination of tagging and spatial carve-up, is powerful for masking in image-to-image workflows. Image-generation fine-tuning, depth estimation, and video processing round out the current options.
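For a sense of what "reducing an image to text" looks like in practice, here is a minimal sketch using the Hugging Face transformers BLIP checkpoints for captioning and visual question answering. The checkpoint names, the local image path, and the example question are illustrative choices on our part, not recommendations from the episode.

```python
# Minimal BLIP captioning + VQA sketch (Hugging Face transformers).
# Assumes: pip install transformers torch pillow
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

# Replace with your own image file.
image = Image.open("photo.jpg").convert("RGB")

# 1) Captioning: image -> short text description.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
cap_inputs = cap_processor(images=image, return_tensors="pt")
caption_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
print("caption:", cap_processor.decode(caption_ids[0], skip_special_tokens=True))

# 2) VQA: image + question -> answer text.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(
    images=image, text="How many people are in the photo?", return_tensors="pt"
)
answer_ids = vqa_model.generate(**vqa_inputs, max_new_tokens=10)
print("answer:", vqa_processor.decode(answer_ids[0], skip_special_tokens=True))
```

The caption can then be passed straight into a language-model prompt as a stand-in for the image itself.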
Image Aesthetics
The quality of user-uploaded images is a challenge for any application that accepts them: social media platforms, e-commerce sites, photo editing tools. Poor-quality images degrade the user experience, reduce sales, and drag down the overall aesthetic of a platform. AI can help by evaluating the aesthetic appeal of images and ranking them accordingly, and a variety of datasets, commercial services, and open-source models are available for the job. In this post, the authors share their findings and recommendations for the best tools for scoring image aesthetics. Despite only modest correlations between model scores and human evaluators' scores, every model tested tended to bring the best images to the top and push the worst to the bottom. Specific recommendations include Everypixel's proprietary commercial model and LAION's open-source aesthetic model.
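As a rough illustration of the open-source route: LAION's aesthetic predictor is a small regression head trained on top of CLIP ViT-L/14 image embeddings, producing a score roughly on a 1-10 scale. The sketch below assumes you have installed open_clip and downloaded a predictor checkpoint from LAION's repo; the `aesthetic_head.pt` filename and the single linear layer are placeholders, so match whatever architecture the checkpoint you download actually uses.

```python
# Rank images by aesthetic score: CLIP image embedding -> small regression head.
# Assumes: pip install open_clip_torch torch pillow
# "aesthetic_head.pt" is a placeholder path for a LAION-style predictor checkpoint.
import torch
import torch.nn as nn
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-L/14 (OpenAI weights) as the embedding backbone.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
clip_model = clip_model.to(device).eval()

# Placeholder regression head: 768-dim CLIP embedding -> scalar score.
# Swap in the real predictor weights/architecture you downloaded.
head = nn.Linear(768, 1).to(device).eval()
head.load_state_dict(torch.load("aesthetic_head.pt", map_location=device))

def aesthetic_score(path: str) -> float:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = clip_model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize before the head
        return head(emb.float()).item()

paths = ["a.jpg", "b.jpg", "c.jpg"]
ranked = sorted(paths, key=aesthetic_score, reverse=True)
print(ranked)  # best-looking images first
```

Even with imperfect absolute scores, this kind of ranking is usually enough to decide which upload to feature and which to hide.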
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Google’s Innovator’s Dilemma
“Do we plan to build our own models?…it doesn't matter if we use GPT-3.5 or 4 or 2 or our own models or Anthropic models. Nobody cares. They just want the answer. As a company, who should you obsess about? The user. So that's why we started off with GPT-3.5 - it was the best model in the market.”
The Innovator's Dilemma (Clayton Christensen) suggests that established companies often struggle to innovate because they are focused on improving their existing products or services to meet the needs of their current customers. They may overlook or dismiss new technologies or business models that don't fit their current strategy or that offer lower margins, leaving them vulnerable to disruption by smaller, more agile competitors who can create new markets with innovative solutions. Aravind argues that companies should prioritize the needs of their users when deciding whether to build their own models or use existing ones: doing so produces a product that meets user needs, which becomes a competitive advantage. He also acknowledges that there may eventually be a case for building their own models to reduce cost per query, which could help them disrupt the market or compete more effectively with established players. Throughout, Aravind emphasizes staying customer-focused and adaptable to changes in the market.
Notable Quotes from Episode 6 with Junnan Li and Dongxu Li
“Last year, we saw a very interesting use case where captions were generated for Pokemon images. Stable Diffusion was fine-tuned to generate Pokemon based on the text, which was quite fascinating. That's one of the interesting use cases I've seen before.”
“I found another demo that uses captions for image searches. We have also been exploring this idea of text being a concentrated gravitation of an image, which is humanly interpretable. If you translate every image to text, the information is condensed into a very small space of text tokens. This space is much smaller compared to the original image. You can use techniques such as sentence embedding to perform fast similarity searches across a wide database. This approach provides an alternative way to perform image-to-image and even image-to-text searches.”
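Here is a rough sketch of the caption-then-embed search idea Dongxu describes, assuming captions have already been generated for an image library (for example with BLIP, as in the sketch above) and using sentence-transformers for the embedding and similarity step. The model name and sample captions are just illustrative.

```python
# Search images by first reducing them to captions, then embedding the captions.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Captions previously generated for each image in the library (e.g. with BLIP).
captions = {
    "img_001.jpg": "a golden retriever catching a frisbee on a beach",
    "img_002.jpg": "a plate of spaghetti with tomato sauce and basil",
    "img_003.jpg": "a snow-covered mountain at sunrise",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
paths = list(captions)
caption_embs = model.encode([captions[p] for p in paths], convert_to_tensor=True)

def search(query: str, top_k: int = 2):
    """Return the images whose captions are most similar to the query text."""
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, caption_embs)[0]
    best = scores.topk(k=min(top_k, len(paths)))
    return [(paths[int(i)], float(s)) for s, i in zip(best.values, best.indices)]

print(search("dog playing outside"))  # text-to-image search via the caption space
```

Because the search happens entirely in the compact caption/embedding space, it stays fast even over large image databases, which is exactly the appeal Dongxu points to.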
“I want to emphasize that our pre-training strategy is the reason for the success of our approach. It's something unique to BLIP-2 because we have a two-stage pre-training strategy. First, we connect the connector to the vision model and perform pre-training to ensure that the connector is well-aligned with the vision model. This way, we can understand the vision information in terms of how the text can correlate with the image features. Only after this stage, we plug in the language model and adapt the connector to work as a bridge between the vision model and language model. Our paper shows that if we skip the first stage and directly connect the two models while using an image captioning loss, the performance becomes much worse.”
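Most readers won't re-run that two-stage pre-training themselves, but the released checkpoints are easy to try. Below is a minimal sketch using the Hugging Face transformers BLIP-2 integration; the Salesforce/blip2-opt-2.7b checkpoint pairs the pre-trained Q-Former connector with a frozen OPT language model. The checkpoint choice, image path, and prompt are our own examples, not the guests'.

```python
# BLIP-2 sketch: frozen vision encoder + Q-Former connector + frozen LLM.
# Assumes: pip install transformers torch pillow (transformers >= 4.27)
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("photo.jpg").convert("RGB")  # replace with your own image

# No prompt -> plain caption; a "Question: ... Answer:" prompt -> zero-shot VQA.
prompt = "Question: what is unusual about this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```

On a GPU, loading the model in half precision or 8-bit will make the 2.7B-parameter language model much lighter to run.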
“I don't believe we're currently facing imminent danger from models like BLIP-2. However, if we extrapolate this trend a bit, we can see the potential to connect various sensors to language models in a similar way. This could lead to the creation of truly multimodal systems that can perform a wide range of tasks. However, what concerns me is the lack of understanding regarding what is being injected into the language model. We need to be cautious and ensure that we fully understand the implications of these types of connections.”
“You can fine-tune prompts to guide a language model for certain downstream tasks, resulting in better performance. However, interpreting what these prompts learn has proven difficult. The soft prompt is essentially a black box, and it's unclear what it really captures. Due to the massive size of the language model, there are many hidden limitations that can guide it toward certain things. In terms of our soft prompt for embedding vision information, we don't know exactly what that information is. However, we do know that the vision model can make use of certain limitations of the image. Our confidence in the relationship between the prompt and image stems from our first-stage pre-training approach.”
“Our vision is to build the ultimate multimodal system that can perform a variety of tasks, not just limited to images. We want to be able to understand and integrate data from various modalities, such as coding and actions. This is what we're striving towards. Although it's unfortunate that we don't have access to larger models like ChatGPT, at Salesforce research, we aim to open-source the models we have developed. Almost all of our research will be delivered to the community with open-source code and models.”
“These models are far from sentient or perfect. They are essentially just big neural networks and are far from what we expect from humans. While we're looking forward to that day in the future, in the meantime, we need to pay attention to related issues with these models. We have an ethical team who helps us review our use cases, especially with some of the demos we're making. We have seen a few examples of potentially threatening user interactions with recent large language chatbots, so it's important to address these concerns.”
Notable Quotes from Episode 7 with Aravind Srinivas
“Do we plan to build our own models?…it doesn't matter if we use GPT-3.5 or 4 or 2 or our own models or Anthropic models. Nobody cares. They just want the answer. You deliver the best answer in the shortest amount of time and make the product user experience amazing. They don't care at all. Like VCs care but the actual user doesn't care.”
“As a company, who should you obsess about? You should obsess about the user. So that's why we started off with GPT-3.5 - it was the best model in the market. Now, there is a reason to build your own models and it’s not to impress the VCs. The primary reason is to reduce the cost per query. As you know, Google is doing search in a different way not because they don't want to do it in this manner, but because their search volume is like billions of queries a day. And you cannot use a billion for every query. If you're Google, you're just gonna burn billions of dollars a year for that.”
“The real thing that people are already missing out on is so much economically valuable work is being done with LLMs right now - marketing, sales, research. Some people view Perplexity as a research assistant that writes a summary of anything with references. So there's so much valuable economic work that you would hire an actual human to do. Data cleaning and data labeling are being done with LLM forward passes and API calls, and are already changing the world as we speak today. And that is AGI in some sense to me. OpenAI defines AGI as all remote work that's being done.”
“If you keep focusing on a Turing test or Montezuma's Revenge, you're actually missing out on what's really happening at a fundamental GDP level. And I feel like that's always been the two different camps in this. DeepMind is more of the academic, classic style thinking of an AGI, and OpenAI is more practical, implementational, and economically thinking about what's going to happen to the industry itself. I'm more in the second camp. Montezuma's Revenge matters less than all the programmers getting replaced.”
“I’m worried about the income inequality that can happen through this (AGI), at least in the short term. In the long term, I think it should be fine, but in the short term, it's already happening. A lot of people don't have jobs right now. They got laid off, but it's not like you have an immediate need to hire them either. For example, in our situation as a startup, a lot of people compliment us for achieving a lot with only seven or eight people. That’s because we use a lot of AI tools. We don't need to hire a marketing person. We don't need to hire many engineers. Our existing engineers can work with Copilot or ChatGPT and write code. As these AI tools get better and better, the need for hiring more people will go down. Companies will be a lot smaller and get more things done.”
“Only the best engineers are needed who can do things AI cannot do. That's going to put a lot of people out of jobs or make their role in society much less prominent, or they just have to innovate on being useful until everybody's not useful.”
“We need to build AGI so that humans can just go back to living, just live a nice life. Not everybody needs to work so hard. AI can do most of the work that we think is hard work for us. And I think this is not new. Google wanted to do this too. Larry Page always wanted to let computers do the hard thing so that humans can just go live life. A lot of people don't appreciate what's going to happen to them once we have a proto-AGI or even whatever we accept worldwide as an AGI. It'll almost be like you get to live the life of a millionaire or a billionaire.”
“You right now are already living a higher quality life than the President of the United States 50 years ago. You just have access to technology that they could only dream about. And your iPhone is the same phone as Elon Musk, the richest person on the planet. So technology is the biggest leveler to making humanity equitable. A lot of people don't get it, they just keep complaining about wealth inequalities and AGI being dangerous, blah, blah, blah. But they're going to benefit tremendously from it, just like they benefited from every technological revolution.”
“And if intelligence is in abundance, you no longer have to compete to be the highest IQ person in your class or something like that, you can try to do stuff that's interesting and creative to you and learn from the AI.”
“OpenAI has earned the most credit among any organization to build it, just by their track record of progress. And Sam Altman is probably the best CEO in the world right now to do these things. I feel like I would trust him to make the right judgment here. Whoever's concerned about it, should earn the right to control it too. And you can see if Elon just keeps complaining about it and doesn't act on it, he doesn't get the right to decide things.”
Meaningful AI News
OpenAI released GPT-4 - research paper here - and named Nathan as a Red Teamer
Ezra Klein wrote an AI article in the NYT called "This Changes Everything," which echoes our podcast's name and tagline, The Cognitive Revolution: How AI Changes Everything
On the next episode of The Cognitive Revolution, we have Mahmoud Felfel of play.ht (the AI-powered text-to-voice generator). The episode will drop tomorrow, Thursday, March 16th.
If you want to open a few browser tabs in advance of our next episode to prepare:
Listen to this Lex Fridman interview of Richard Feynman created by play.ht. It was generated entirely by AI, both the text and the voices: the text by fine-tuning OpenAI's GPT-3 on Feynman's books, and the voices by fine-tuning play.ht's own speech model on the voices of both Lex and Feynman.
Until next time.