Recent studies have demonstrated the ability of auto-regressive and sequence-to-sequence generative models to reach state-of-the-art performance on various Natural Language Understanding (NLU) and Natural Language Processing (NLP) tasks. They operate by framing all tasks in a single formulation: text auto-completion or text-to-text encoding-decoding. Such models can be trained on a product corpus to understand the information in e-commerce product listings. In this paper, we present a new generative model that incorporates multiple modalities (e.g., text and vision). The proposed model is an encoder-decoder model built on the T5 (Text-To-Text Transfer Transformer) foundation, in which the non-text components are fused with the text tokens. Dedicated relative positional and token-type embeddings are used in the encoder, while the decoder generates new text corresponding to diverse tasks. Hence, we name the proposed model MMT4: Multi Modality To Text Transfer Transformer. The experiments are conducted on our proprietary e-commerce catalog of images and text, with the rationale that the image of a product provides additional information about the product. One of the main advantages of this model is its ability to generate product attributes (product specifications) that can be inferred from the text alone, the image alone, or both. In the experiments, we pre-train and fine-tune MMT4 to solve a number of downstream tasks: attribute generation, image-text matching (ITM), and title (product name) generation from the product's image (captioning). The experimental results show up to 35% accuracy improvement over the fine-tuned T5 on the attribute generation task. Product title generation also achieves more than 3% higher ROUGE-1 recall than the fine-tuned state-of-the-art captioning model. Although we fine-tuned our model on fewer than 2M samples in a generative mode, its area under the precision-recall curve is only 2% lower than that of the state-of-the-art ITM model.
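The fusion described above, in which non-text features are merged with text token embeddings and distinguished by token-type embeddings before entering a T5-style encoder, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the names (`type_table`, `n_image`, etc.) and dimensions are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8             # embedding width (illustrative; T5-base uses 768)
n_text, n_image = 5, 3  # number of text tokens and image patch features

# Stand-ins for a T5 token-embedding lookup and a vision backbone's
# patch features, both projected to the shared model dimension.
text_emb = rng.normal(size=(n_text, d_model))
image_emb = rng.normal(size=(n_image, d_model))

# Token-type embeddings: one learned vector per modality, added so the
# encoder can tell text positions apart from image positions.
type_table = rng.normal(size=(2, d_model))  # row 0: text, row 1: image

# Fused input sequence: text tokens followed by image tokens.
fused = np.concatenate(
    [text_emb + type_table[0], image_emb + type_table[1]],
    axis=0,
)

print(fused.shape)  # (n_text + n_image, d_model) = (8, 8)
```

The fused sequence would then feed a T5-style encoder; note that T5's relative position biases are computed inside each attention layer rather than added to the input embeddings, which is why no absolute positional vectors appear in this sketch.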