Meta’s large language models (LLMs) can now see.
Today at Meta Connect, the company rolled out Llama 3.2, its first major vision models that understand both images and text.
Llama 3.2 includes small and medium-sized models (at 11B and 90B parameters), as well as more lightweight text-only models (1B and 3B parameters) that fit onto select mobile and edge devices.
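As a rough illustration of why the 1B and 3B models are candidates for phones and edge hardware, the sketch below estimates how much memory the weights alone would take at different precisions. It uses the nominal parameter counts from the announcement and treats the bit widths as assumptions. Meta has not been quoted here on how the models are stored on device, and real memory use also includes activations and runtime overhead.

```python
# Back-of-the-envelope estimate: memory for model weights only.
# Parameter counts are the nominal sizes named in the article; the
# 16-bit and 4-bit precisions are illustrative assumptions.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

MODEL_SIZES = {"1B": 1e9, "3B": 3e9, "11B": 11e9, "90B": 90e9}

for name, params in MODEL_SIZES.items():
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Under these assumptions, the lightweight models land in the low single-digit gigabytes, while the 90B model needs tens of gigabytes even at 4-bit precision.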
“This is our first open-source multimodal model,” Meta CEO Mark Zuckerberg said in his opening keynote today. “It’s going to enable a lot of applications that will require visual understanding.”
Like its predecessor, Llama 3.2 has a 128,000-token context length, meaning users can input large amounts of text (on the scale of hundreds of pages of a textbook). Higher parameter counts also typically indicate that models will be more accurate and can handle more complex tasks.
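To put the 128,000-token figure in perspective, here is a minimal sketch that converts tokens to textbook pages. The words-per-token and words-per-page values are rule-of-thumb assumptions rather than figures from Meta, and the real ratio depends on the tokenizer and the text.

```python
# Rough conversion from a token budget to printed pages.
CONTEXT_TOKENS = 128_000   # context length cited in the article
WORDS_PER_TOKEN = 0.75     # assumption: common rule of thumb for English text
WORDS_PER_PAGE = 500       # assumption: a dense textbook page

def estimated_pages(tokens: int,
                    words_per_token: float = WORDS_PER_TOKEN,
                    words_per_page: int = WORDS_PER_PAGE) -> float:
    """Estimate how many pages of text fit in a given number of tokens."""
    return tokens * words_per_token / words_per_page

print(round(estimated_pages(CONTEXT_TOKENS)))  # prints 192
```

With a lighter page of 300 words, the same budget works out to 320 pages, so "hundreds of pages" holds across a range of reasonable assumptions.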
Also today, Meta is sharing official Llama Stack distributions for the first time so that developers can work with the models in a variety of environments, including on-prem, on-device, cloud and single-node.
“Open source is going to be, already is, the most cost-effective, customizable, trustworthy and performant option out there,” said Zuckerberg. “We’ve reached an inflection point in the industry. It’s starting to become an industry standard, call it the Linux of AI.”
Rivaling Claude, GPT-4o
Meta released Llama 3.1 a little over two months ago, and the company says the model has so far achieved 10X growth.
“Llama continues to improve quickly,” said Zuckerberg. “It’s enabling more and more capabilities.”
Now, the two largest Llama 3.2 models (11B and 90B) support image use cases: they can understand charts and graphs, caption images and pinpoint objects from natural language descriptions. For example, a user could ask in what month their company saw the best sales, and the model will reason out an answer based on available graphs. The larger models can also extract details from images to create captions.
The lightweight models, meanwhile, can help developers build personalized agentic apps in a private setting — such as summarizing recent messages or sending calendar invites for follow-up meetings.
Meta says that the Llama 3.2 vision models are competitive with Anthropic’s Claude 3 Haiku and OpenAI’s GPT-4o mini on image recognition and other visual understanding tasks. The company also says the lightweight 3B model outperforms Gemma and Phi 3.5-mini in areas such as instruction following, summarization, tool use and prompt rewriting.
Llama 3.2 models are available for download on llama.com and Hugging Face, and across Meta’s partner platforms.
Talking back, celebrity style
Also today, Meta is expanding its business AI so that enterprises can use click-to-message ads on WhatsApp and Messenger and build out agents that answer common questions, discuss product details and finalize purchases.
The company claims that more than 1 million advertisers use its generative AI tools, and that 15 million ads were created with them in the last month. On average, ad campaigns using Meta gen AI saw an 11% higher click-through rate and a 7.6% higher conversion rate than those that didn’t use gen AI, Meta reports.
Finally, for consumers, Meta AI now has “a voice,” or more like several. Llama 3.2 supports new multimodal features in Meta AI, most notably the ability to talk back in celebrity voices including Dame Judi Dench, John Cena, Keegan-Michael Key, Kristen Bell and Awkwafina.
“I think that voice is going to be a way more natural way of interacting with AI than text,” Zuckerberg said during his keynote. “It is just a lot better.”
The model will respond to voice or text commands in celebrity voices across WhatsApp, Messenger, Facebook and Instagram. Meta AI will also be able to reply to photos shared in chat and edit them by adding, removing or changing elements and adding new backgrounds. Meta says it is also experimenting with new translation, video dubbing and lip syncing tools for Meta AI.
Zuckerberg boasted that Meta AI is on track to be the most-used assistant in the world — “it’s probably already there.”