diff --git a/content/literature-notes/Articles/MINT-1T Scaling Open-Source Multimodal Data by 10x A Multimodal Dataset With One Trillion Tokens.md b/content/literature-notes/Articles/MINT-1T Scaling Open-Source Multimodal Data by 10x A Multimodal Dataset With One Trillion Tokens.md
new file mode 100644
index 000000000..573aac1cb
--- /dev/null
+++ b/content/literature-notes/Articles/MINT-1T Scaling Open-Source Multimodal Data by 10x A Multimodal Dataset With One Trillion Tokens.md
@@ -0,0 +1,22 @@
+---
+author: [[Anas Awadalla]]
+title: "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset With One Trillion Tokens"
+date: 2024-08-04
+tags:
+- articles
+- literature-note
+---
+![rw-book-cover](https://blog.salesforceairesearch.com/content/images/2024/07/Screenshot-2024-07-22-at-3.02.30-PM.png)
+
+## Metadata
+- Author: [[Anas Awadalla]]
+- Full Title: MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset With One Trillion Tokens
+- URL: https://blog.salesforceairesearch.com/mint-1t/?s=09
+
+## Highlights
+- We are excited to open-source 🍃MINT-1T, the first trillion-token multimodal interleaved dataset and a valuable resource for the community to study and build large multimodal models. ([View Highlight](https://read.readwise.io/read/01j3rn6bdeyr6b0q9e48szws52))
+- Multimodal interleaved documents are sequences of images interspersed in text. This structure allows us to train large multimodal models that can reason across image and text modalities. Some of the most capable multimodal models like [MM1](https://machinelearning.apple.com/research/mm1-methods-analysis-insights?ref=blog.salesforceairesearch.com), [Chameleon](https://arxiv.org/abs/2405.09818?ref=blog.salesforceairesearch.com), and [Idefics2](https://huggingface.co/blog/idefics2?ref=blog.salesforceairesearch.com) have shown the importance of training on interleaved data to attain the best performance. ([View Highlight](https://read.readwise.io/read/01j3rn75xx5tve38m418fj5jtr))