Apple Researchers Working on MM1, a Family of Multimodal AI Models With Up to 30 Billion Parameters


Apple researchers have shared their work on building multimodal artificial intelligence (AI) large language models (LLMs) in a pre-print paper. Published on an online portal on March 14, the paper highlights how the team achieved advanced multimodal capabilities by training the foundation model on both text-only data and images. The new AI advancements from the Cupertino-based tech giant come after CEO Tim Cook's comments during the company's earnings call, where he said AI features could arrive later this year.

A pre-print version of the research paper has been published on arXiv, an open-access online repository of scholarly papers. However, papers posted there are not peer-reviewed. While the paper does not mention Apple itself, most of the researchers listed are affiliated with the company's machine learning (ML) division, leading to the belief that the project is also affiliated with the iPhone maker.

According to the researchers, they are working on MM1, a family of multimodal models with up to 30 billion parameters. Calling it a family of "performant multimodal LLMs (MLLMs)", the paper's authors explain that image encoders, vision-language connectors, and other architecture components and data choices were combined to build AI models that can understand both text-based and image-based input.

Giving an example, the paper states, "We demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results."

To break it down, the AI model is currently in the pre-training stage, meaning it has not yet been trained enough to produce the desired output. This is the stage where algorithms and AI architecture are used to design the workflow of the model and, ultimately, how it processes data. The team of Apple researchers added computer vision to the model using image encoders and a vision-language connector. Then, when tested on a mix of image-only, interleaved image-and-text, and text-only data sets, the team found that the results were competitive with existing models at the same stage.
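The architecture the paragraph describes, an image encoder whose output is projected by a vision-language connector into the same embedding space as text tokens, can be sketched in a few lines. This is a minimal toy illustration of the general pattern, not Apple's actual code; all dimensions, function names, and the random stand-ins for the encoder and embedding table are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; MM1's real models are far larger).
IMG_FEAT = 64    # width of the image encoder's patch features
TXT_EMB = 128    # width of the language model's token embeddings

def image_encoder(image):
    """Stand-in for a vision backbone: image -> per-patch features."""
    num_patches = 16
    return rng.standard_normal((num_patches, IMG_FEAT))

# Vision-language connector: a learned projection that maps image
# features into the language model's embedding space.
W_connector = rng.standard_normal((IMG_FEAT, TXT_EMB)) * 0.02

def connect(image_feats):
    return image_feats @ W_connector

def embed_text(token_ids):
    """Stand-in token-embedding lookup for the language model."""
    table = rng.standard_normal((1000, TXT_EMB)) * 0.02
    return table[token_ids]

# Interleaving: the LLM consumes one sequence mixing image "tokens"
# (projected patch features) with ordinary text-token embeddings.
img_tokens = connect(image_encoder(image=None))      # shape (16, 128)
txt_tokens = embed_text(np.array([5, 42, 7]))        # shape (3, 128)
sequence = np.concatenate([img_tokens, txt_tokens])  # shape (19, 128)
print(sequence.shape)
```

Because the connector places image features in the text-embedding space, the same transformer can attend over both modalities in a single sequence, which is what allows interleaved image-text pre-training data to be used at all.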

Although the result is significant, this research paper alone is not enough to conclude that a multimodal AI chatbot will be added to Apple's operating system. At this stage, it is difficult to say whether the AI model is multimodal only in the input it takes or also in the output it gives (that is, whether it can generate AI images or not). But if the results hold up under peer review, it can be said that the tech giant has taken another big step towards building a native generative AI foundation model.

