Research
Everything you need to know about Google's new Gemma 7B and 2B Models
Zain ul Abideen
February 29, 2024
6 min read
Also releasing Gemma-7B-Openhermes and Gemma-2B-Openhermes
Introduction
Google has been in the LLM space for quite some time now, yet Gemma is their first open LLM. The release of Gemma has stirred the community, and everyone is excited to try it out. I am no exception. But how good is this model really? To answer this question, I have compared the performance of the different variants of the Gemma family and reported their results, and I have also released two more variants of Gemma-it. Before moving on, let's introduce Gemma.
Gemma is a collection of lightweight open models from Google, built from the same research and technology used to develop the Gemini models. They are text-to-text, decoder-only language models, available in English. The main highlight of this family is that the weights are open, and both pre-trained and instruction-tuned variants are offered. Gemma models can be used for a range of natural language processing tasks such as question answering, summarization, and reasoning. Thanks to their relatively small sizes, running them in resource-constrained environments such as laptops, desktops, or one's own cloud infrastructure is realistic, which opens up experimentation to everybody. The four variants released are:
- Gemma-2B
- Gemma-2B-it
- Gemma-7B
- Gemma-7B-it
The difference between the 'it' (instruction-tuned) variants and the base models is that the 'it' variants are better suited for chat, since they have been fine-tuned to follow instructions and generate better answers, while the base variants have not undergone any fine-tuning. The base models can still generate answers, just not as well as the 'it' ones.
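As a concrete illustration, the 'it' variants are trained on a specific turn-based prompt format. The sketch below reproduces that format by hand (the control tokens are Gemma's published `<start_of_turn>`/`<end_of_turn>` markers; in practice you would let `tokenizer.apply_chat_template` from `transformers` build this string for you):

```python
def format_gemma_chat(messages):
    """Render a chat history into the turn-based prompt format Gemma-it expects.

    Each turn is wrapped in <start_of_turn>{role} ... <end_of_turn> markers,
    where the only roles Gemma knows are 'user' and 'model'.
    """
    prompt = ""
    for message in messages:
        role = "model" if message["role"] == "assistant" else "user"
        prompt += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    # Leave the prompt open at a model turn so generation continues from here.
    prompt += "<start_of_turn>model\n"
    return prompt

prompt = format_gemma_chat([{"role": "user", "content": "What is QLoRA?"}])
print(prompt)
```

A base Gemma model has never seen these control tokens during fine-tuning, which is why it cannot be prompted as a chat assistant the way the 'it' variants can.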
Performance
Now, coming to the performance side: Gemma performs well on the Open LLM Leaderboard. But if we compare Gemma-2B (2.51B parameters) with Phi-2 (2.7B) on the same benchmarks, Phi-2 easily beats Gemma-2B.


Phi-2's results are nearly comparable to Gemma-7B's. The numbers are even worse on the Nous and EQ benchmarks: Gemma-2B(-it) (2.51B parameters) severely underperforms Phi-2 (2.78B parameters) on Nous' benchmark suite. This is quite surprising, since both AGIEval and BigBench are closely tied to human evaluation.

To view and analyze the results across various benchmarks, visit the model card for Google's Gemma. It is tempting, to say the least, to suspect that Gemma may have overfit the test benchmarks.
Release of Gemma-7B-Openhermes and Gemma-2B-Openhermes
Gemma-7B-Openhermes is a variant of the Gemma 7B language model, further fine-tuned on the OpenHermes-2.5 preference dataset using QLoRA. It is built from the following model and dataset:
- google/gemma-7b-it
- mlabonne/chatml-OpenHermes2.5-dpo-binarized-alpha
Similarly, Gemma-2B-Openhermes has been fine-tuned from:
- google/gemma-2b-it
- mlabonne/chatml-OpenHermes2.5-dpo-binarized-alpha
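For context, DPO (Direct Preference Optimization) trains the policy directly on chosen/rejected preference pairs instead of fitting a separate reward model. Below is a minimal sketch of the per-example DPO loss; scalar log-probabilities stand in for the per-sequence log-probs that a real trainer (e.g. trl's `DPOTrainer`, which I used together with QLoRA) computes, and `beta=0.1` is an assumed, typical value:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the summed log-probability of the chosen or rejected
    completion under either the policy being trained or the frozen reference
    model (here, the original Gemma-it checkpoint).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# At initialization the policy equals the reference, so the loss is log(2).
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Once the policy favors the chosen completion more than the reference does,
# the loss drops below log(2).
loss_better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Minimizing this loss nudges the model toward the preferred completions while `beta` keeps it anchored to the reference policy, which is also why the refusals baked into the 'it' reference carry over.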
Since the Gemma base and 'it' models did not show satisfactory performance, I tried to steer the model by releasing DPO'd variants trained on the OpenHermes-2.5 preference dataset. Both Gemma-2B and Gemma-7B are available to try in their OpenHermes variants. Compared with the 'it' variants, the models improved a bit, but some of the results were still lower than the originals. To give you a little context, Gemma-2B-Openhermes improved from 23.76 to 23.80 on AGIEval, and from 29.41 to 44.75 on BigBench. But on GPT4All and TruthfulQA, the model severely underperforms.


What is wrong with Gemma?
Gemma's hype train did not last long, largely due to excessive RLHF. Specifically, the instruction-tuned versions did not show great results. The alignment/RLHF process adjusts the model's outputs so that they align with certain ethical guidelines and values, which can sometimes result in overly cautious or conservative responses. While this approach is important for responsible AI practice, in the case of the instruction-tuned Gemma variants the alignment appears to have been too restrictive, leaving the models without the flexibility to provide informative and engaging answers. Responses were littered with refusals of the form 'I can't answer X'. The main reason behind this behavior is the excessive alignment the 'it' models have undergone, and these restrictions carried over into the DPO'd variants as well. My verdict, for now, is that the instruction-tuned variants are stripped of basic facts and censored more than necessary, and are not leaving much of an impact on the community.
Next step?
So, the next step will be to fine-tune the base models of the Gemma family and check whether they show any improvement over the instruction-tuned variants. The truth is, if we want to get something useful out of Gemma, we need to start with the foundation model and train it with a lot of basic information about reality.
In the days to come, Google may realize that excessive RLHF is counterproductive, as it is driving away enthusiasts; sooner or later they will need to address the model's habit of refusing to answer.
Conclusion
Overall, while the idea of instruction-tuned models holds potential, the current implementations appear to have fallen short of expectations, particularly when it comes to providing a state-of-the-art open LM to the wider community. Further research and development will be needed to address these challenges and create more effective and engaging instruction-tuned models.
Tags: AI, Google, Gemma Model, Machine Learning, Deep Learning, Queryloop