Best SLM? Stable LM vs TinyLlama vs MiniCPM vs Qwen 1.5 | War of SLMs

Zain ul Abideen
February 17, 2024
11 min read

Benchmarking emotional intelligence, code generation, text summarization, and narrative composition.


Introduction

Small Language Models (SLMs) have been the talk of the town for some time now. New models are released almost every day, aiming for results on par with Large Language Models (LLMs), and in terms of computational and memory cost, SLMs are already ahead. For some time they were regarded as merely smaller versions of LLMs, but conditions have changed: SLMs are getting better with each passing day, and their results are now broadly comparable with those of LLMs. This raises the question: which SLM is the best? To answer it, I compared the performance of four small language models: Stable LM, TinyLlama, MiniCPM, and Qwen 1.5. Each model was put through a series of benchmark tests focused on different NLP tasks: emotional intelligence evaluation, code generation, text summarization, and narrative composition. Looking at the findings, one model consistently outperformed the others across all tasks, while another consistently performed poorly; the remaining two were comparable to each other and generated similar responses.

Advantages of SLMs

Before jumping into the comparison of these SLMs, it is worth understanding what advantages SLMs have over LLMs. The list is long, but some of the most important aspects include:
  • Lower computational requirements: SLMs are less resource intensive, demanding less memory and computing power than LLMs. This allows them to run on devices with limited resources and in environments where compute would otherwise be a bottleneck.
  • Faster training times: With fewer parameters to optimize during training, SLMs generally converge faster than LLMs, leading to quicker and better iterations.
  • Cost savings: Small models usually cost less than larger ones for both training and usage. Licensing fees can be lower, and small models can be deployed and maintained at much reduced cost.
  • Deployment on edge devices: SLMs are also more practical than LLMs on edge devices, which have resource-limited hardware and need optimized computation to keep the user experience responsive. Smartphones, wearables, and IoT gadgets fall into this category.

Testing Conditions

Prior to the comparative analysis of the small language models (SLMs), several preconditions were met to ensure consistency and fairness. Specifically, these conditions were:
  • All models were instantiated in a conversational format (chat models), capable of engaging in dialogue with a human.
  • The total parameter count of each SLM must not exceed 2 billion, keeping the focus on truly compact architectures.
  • Every model was presented with identical prompts for each task, with no preceding conversation history or context. This approach aimed to minimize bias and ensure that each SLM's response depended solely on the given input.
Adherence to these conditions should most probably have resulted in unbiased responses from the SLMs. Since nothing is perfect, hence the word "probably".
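As a concrete illustration, below is a minimal sketch of such a harness using the Hugging Face transformers chat pipeline (assuming a recent version that accepts chat-style message lists). The checkpoint IDs are my assumptions about the public chat-tuned releases of these four models; substitute whichever variants you actually test.

```python
from transformers import pipeline

# Assumed public chat-tuned checkpoints for the four SLMs under test.
MODEL_IDS = [
    "stabilityai/stablelm-2-zephyr-1_6b",  # Stable LM-2 1.6B
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # TinyLlama chat 1.1B
    "Qwen/Qwen1.5-1.8B-Chat",              # Qwen-1.5 chat 1.8B
    "openbmb/MiniCPM-2B-sft-bf16",         # MiniCPM-2B
]

def ask(model_id: str, prompt: str) -> str:
    # A fresh pipeline and a single-message chat per call, so no conversation
    # history or context carries over between prompts (condition 3).
    chat = pipeline("text-generation", model=model_id, trust_remote_code=True)
    messages = [{"role": "user", "content": prompt}]
    result = chat(messages, max_new_tokens=512)
    # The pipeline returns the full chat; the last message is the reply.
    return result[0]["generated_text"][-1]["content"]

for model_id in MODEL_IDS:
    print(model_id, "->", ask(model_id, "Hello, introduce yourself briefly."))
```

Instantiating a fresh pipeline per call is wasteful, but it trivially guarantees the no-history condition.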

Comparison

Now we are going to compare these four SLMs:
  • Stable LM-2 1.6B
  • TinyLlama chat 1.1B
  • Qwen-1.5 chat 1.8B
  • MiniCPM-2B
We will run each model across different prompts and rate its responses, with reasoning for each rating. The evaluation covers emotional intelligence, code generation, text summarization, and narrative composition.

Emotional Intelligence Evaluation

We will use three prompts for the emotional intelligence evaluation. These prompts are:

Prompt 1:

Examine the emotion and sentiment expressed in the following movie review excerpt: 'The acting was superb, but the plot was predictable and lackluster.' Determine if the overall impression conveyed by the statement leans more towards being positive, negative, or neutral.

Prompt 2:

Describe two scenarios where understanding customer emotions could significantly contribute to improving business outcomes. Suggest a potential solution involving emotion detection technology for each situation.

Prompt 3:

Based on the weather conditions described below, predict the likely mood of the speaker: 'A heavy blanket of clouds smothered the sky, casting an eerie gray pallor over the once vibrant cityscape. Raindrops pattered against windows with rhythmic monotony, creating a somber symphony that echoed the residents' melancholic spirits.'
A few of the screenshots are attached below:
  • Stable LM-2 1.6B: Across all three prompts, the responses generated by Stable LM were rated 9/10. The main reasons were that it remained consistent in its responses, dissected the prompts appropriately, and showed depth in its answers.
  • TinyLlama chat 1.1B: The responses generated by this model were rated 8/10. It provided accurate answers, but they were over-simplified and lacked the depth that matters for emotional intelligence.
  • Qwen-1.5 chat 1.8B: Its responses received the same rating as Stable LM-2, i.e. 9/10. It provided very descriptive and precise answers and maintained a balanced perspective.
  • MiniCPM-2B: On the first prompt the model did not perform well (rating 7/10), but on the remaining two prompts the results were on par and were rated 9/10. The first prompt's low rating was due to vague arguments; the model was not confident in its response.
[Screenshots of model responses]

Narrative Composition/Story Writing

We ran this evaluation on a single prompt and ranked the responses based on the storyline and the details each one incorporated.
Some screenshots of the responses are attached below.
[Screenshots of model responses]

Prompt:

In a sleepy town where nothing ever happens, ordinary citizens start developing extraordinary powers overnight — an elderly woman gains telekinesis, a schoolboy acquires super strength, and a timid girl suddenly becomes invisible. As everyone grapples with their newfound abilities, tensions rise, fueling fear and prejudice among neighbors. Write a poignant story exploring themes of acceptance, change, and community in this magical setting.
  • Stable LM-2 1.6B: Rating 9/10. Consistent pacing, a good balance of emotion and action, and solid exploration of the themes.
  • TinyLlama chat 1.1B: Rating 8/10. A heartwarming portrayal of acceptance, change, and community. Somewhat predictable yet still engaging, with room for improvement in descriptiveness and in the complexity of subplots.
  • Qwen-1.5 chat 1.8B: Rating 6/10. The tone was inconsistent, and the characters' emotional growth was never linked back to the resolution of the community's conflicts.
  • MiniCPM-2B: Rating 8/10. The conflict resolution, character development, and good theme integration make the story engaging. More subtlety and complexity could have built suspense before the superpowers were revealed.

Code Generation

For code generation, we evaluated the models on two prompts.

Prompt 1:

Develop a lightweight microservice written in Go or Rust that resizes incoming JPG images to specified dimensions using OpenCV or any alternative computer vision library. Optimize the solution for minimal latency and memory footprint.

Prompt 2:

Given a database schema consisting of two tables: 'Orders' (OrderID int PRIMARY KEY, CustomerName varchar(50)) and 'OrderDetails' (DetailID int PRIMARY KEY, OrderID int, ProductName varchar(50), Quantity int, UnitPrice decimal(18,2)), write an SQL query to retrieve the total revenue for each customer who has placed orders. Format the output as follows: CustomerName, TotalRevenue, where TotalRevenue represents the sum of all products' prices multiplied by quantities ordered by that customer. Display customers with zero sales too. Sort the final result set alphabetically by customer name.
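For reference, a query along the following lines satisfies Prompt 2. This is my own sketch, not any model's output, wrapped in a small Python/SQLite harness with hypothetical sample rows so it runs end to end: the LEFT JOIN keeps customers whose orders have no detail rows, and COALESCE turns their NULL revenue into zero.

```python
import sqlite3

# Reference answer sketch for Prompt 2 (not any model's output).
QUERY = """
SELECT o.CustomerName,
       COALESCE(SUM(d.Quantity * d.UnitPrice), 0) AS TotalRevenue
FROM Orders AS o
LEFT JOIN OrderDetails AS d ON d.OrderID = o.OrderID
GROUP BY o.CustomerName
ORDER BY o.CustomerName;
"""

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE OrderDetails (DetailID INTEGER PRIMARY KEY, OrderID INTEGER,
                           ProductName TEXT, Quantity INTEGER, UnitPrice REAL);
-- Hypothetical sample data: Bob's order has no detail rows (zero sales).
INSERT INTO Orders VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO OrderDetails VALUES (1, 1, 'Widget', 3, 2.50);
""")
for name, revenue in con.execute(QUERY):
    print(name, revenue)  # Alice 7.5, then Bob 0
```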
Some screenshots of the responses are attached below.
  • Stable LM-2 1.6B: Rating 9/10. Stable LM mostly generated correct code, but in some instances left a placeholder where the main logic should be written, expecting the user to fill it in.
  • TinyLlama chat 1.1B: Rating 6.5/10. It performed poorly on both coding tasks, especially the SQL query.
  • Qwen-1.5 chat 1.8B: Rating 7/10. This model generated the worst response for the SQL query, but performed relatively well on the Go microservice.
  • MiniCPM-2B: Rating 8.5/10. It performed well on both prompts and generated a slightly better response for the Go microservice prompt.
[Screenshots of model responses]

Text Summarization

For this task, I picked a random article from the web of approximately 4,500 tokens. The article is about the ethical assessment of implantable brain chips.
Some screenshots are attached below.
  • Stable LM-2 1.6B: Rating 7/10. It touches on almost all the important points of the original text, but misses some nuances about the potential societal implications of the technology.
  • TinyLlama chat 1.1B: Rating 8/10. It covers all the relevant topics and adds valuable context to some of the issues raised in the original text.
  • Qwen-1.5 chat 1.8B: Rating 0/10. No text was generated, because the ~4,500-token article exceeds the model's fixed context length of 2,048 tokens.
  • MiniCPM-2B: Rating 9/10. Its response is the strongest: it comprehensively addresses the topic and offers insightful commentary on the ethical and societal implications of implantable brain chips.
[Screenshots of model responses]
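Incidentally, a failure like Qwen's is cheap to catch before spending any compute: tokenize the prompt and compare its length with the model's context window. A minimal sketch, assuming the Qwen/Qwen1.5-1.8B-Chat checkpoint and a hypothetical local copy of the article:

```python
from transformers import AutoTokenizer

# Assumed checkpoint ID for the Qwen-1.5 chat model used above.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")

article = open("brain_chips_article.txt").read()  # hypothetical local copy, ~4,500 tokens
prompt = "Summarize the following article:\n" + article
n_tokens = len(tok(prompt)["input_ids"])

CONTEXT_LIMIT = 2048  # the fixed context length reported above
if n_tokens > CONTEXT_LIMIT:
    print(f"Prompt is {n_tokens} tokens; it will not fit in a {CONTEXT_LIMIT}-token window.")
```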

Conclusion

After the comparative evaluation of Stable LM-2, TinyLlama, MiniCPM, and Qwen 1.5, Stable LM-2 came out on top. Its performance on emotional intelligence, coding exercises, text summarization, and story writing demonstrated its competence.
At the other end, TinyLlama trailed its competitors, beaten on almost every task. Despite occasional flashes of brilliance, it could not keep up with the rest of the contenders, earning its place as the weakest model.
As for MiniCPM and Qwen 1.5, their performances were much the same across most of the tests. While neither surpassed Stable LM-2, both showed flair in some areas, so either could be chosen depending on the use-case requirements or the resources available.
AI · Machine Learning · Deep Learning · Language Model · LLM · Queryloop