An illustration of our Social Media Cognitive Framework. We build a cognitive pyramid based on Bloom's Taxonomy, comprising the cognitive levels of Knowledge & Comprehension, Application, Analysis, Evaluation, and Creation. These cognitive abilities are derived from different types of users on social media and represent different levels of demand for information processing.
The growth of social media, characterized by its multimodal nature, has given rise to diverse phenomena and challenges, which call for an effective approach to solving automated tasks in a unified manner.
Powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, general-domain models often fall short of aligning with the unique speaking style and context of social media tasks.
In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), a cognitive framework equipped with five key capabilities: knowledge & comprehension, application, analysis, evaluation, and creation.
We identify three major challenges faced by general-domain models in addressing the nuances of social media: limitations in social multimedia understanding (Figure (a)), challenges in informal language understanding (Figure (b)), and an inability to cope with the complex cognitive demands of social media tasks (Figure (c)). Overall, the contributions of our paper are as follows:
We design a cognitive pyramid according to Bloom's Taxonomy, a classic educational taxonomy proposed by Benjamin Bloom in 1956. The pyramid contains five cognitive levels: Knowledge & Comprehension, Application, Analysis, Evaluation, and Creation.
We develop SoMeData, a 654k social media dataset consisting of five cognitive modules and various CSS task categories (an illustrative record layout is sketched below, after the contributions).
We conduct both classification and generation tasks in both the plain-text and multimodal domains. Specifically, for tasks containing images, we choose Blip2, InstructBlip (both Vicuna-based and FlanT5xl-based), Llava, and Minigpt4 as our baseline models. For tasks involving plain text, we select Llama-2-7b-chat-hf, Vicuna-7b-v1.1, and ChatGLM2-6B as our baseline models.
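To illustrate how SoMeData instances could be organized by cognitive module and task category, the sketch below shows one hypothetical record; the field names, the task label, and the example content are assumptions for illustration, not the released schema.

```python
# Hypothetical layout of a single SoMeData instruction-tuning record.
# Field names, the task label, and the example content are illustrative
# assumptions; they do not reflect the released schema.
example_record = {
    "cognitive_level": "Analysis",        # one of the five levels in the pyramid
    "task": "stance_detection",           # an assumed CSS task category name
    "modality": "text",                   # "text" or "image+text"
    "instruction": "Identify the stance of the post toward the given target.",
    "input": "Post: Masks should be optional indoors. Target: mask mandates",
    "output": "against",
}

if __name__ == "__main__":
    print(example_record["cognitive_level"], "-", example_record["task"])
```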
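For the plain-text baselines listed above, the following is a minimal sketch of issuing a zero-shot query with Hugging Face Transformers; the prompt template, example post, and decoding settings are assumptions, not the paper's exact evaluation setup.

```python
# Minimal zero-shot query sketch for a plain-text baseline
# (e.g., Llama-2-7b-chat-hf). The classification prompt and decoding
# settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # one of the plain-text baselines
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical zero-shot classification prompt for a social media post.
prompt = (
    "Classify the sentiment of the following social media post as "
    "positive, negative, or neutral.\n"
    "Post: I can't believe they cancelled the show AGAIN...\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)

# Decode only the newly generated tokens (the model's answer).
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer.strip())
```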
Cognitive Abilities Analysis: We aggregate results according to the cognitive abilities defined in our framework. Specifically, we collect the in-domain performance of the multimodal tasks (overall accuracy) and the out-of-domain (OOD) performance of the plain-text tasks at the dataset level, and categorize them into the five cognitive levels: Knowledge & Comprehension, Application, Analysis, Evaluation, and Creation.
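As a minimal sketch of this aggregation step, assuming per-dataset accuracy scores have already been computed, the snippet below averages scores within each cognitive level; the dataset names, level assignments, and numbers are placeholders, not results from the paper.

```python
# Sketch of rolling dataset-level accuracy up into the five cognitive levels.
# The (dataset, level, accuracy) triples below are placeholders.
from collections import defaultdict

dataset_results = [
    ("hate_speech",    "Knowledge & Comprehension", 0.81),
    ("topic_cls",      "Application",               0.74),
    ("stance",         "Analysis",                  0.69),
    ("misinformation", "Evaluation",                0.66),
    ("hashtag_gen",    "Creation",                  0.12),
]

scores_by_level = defaultdict(list)
for _, level, acc in dataset_results:
    scores_by_level[level].append(acc)

# Report the average accuracy per cognitive level.
for level, scores in scores_by_level.items():
    print(f"{level}: {sum(scores) / len(scores):.3f}")
```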
SoMeLVLM shows greater cognitive ability than the baseline models at every cognitive level. At the multimodal Creation level, all models perform poorly, as they are required to generate the three hashtags that best describe the post, a task that is not easy even for humans.
In this work, we introduce SoMeLVLM, a multimodal language model for social media processing, in which we design five cognitive capabilities, each mapped to different levels of social media tasks.
Building on this, we collect related plain-text and multimodal datasets and enhance the capabilities of vision-language models on relevant tasks through instruction tuning. Additionally, we construct an evaluation based on cognitive levels and test our model under zero-shot conditions, comparing it with other advanced LLMs and LVLMs. The experimental results demonstrate the superiority of our model. Our work contributes to the computational social science field by providing methods for modeling and evaluating various tasks on social media, as well as a large-scale, high-quality multimodal social media dataset.
@article{zhang2024somelvlm,
author = {Xinnong Zhang and Haoyu Kuang and Xinyi Mou and Hanjia Lyu and Kun Wu and Siming Chen and Jiebo Luo and Xuanjing Huang and Zhongyu Wei},
title = {SoMeLVLM: A Large Vision Language Model for Social Media Processing},
year = {2024},
journal = {arXiv preprint arXiv:2402.13022}
}