Introducing SmallThinker-3B-Preview, an o1-like reasoning SLM!

Today we release SmallThinker-3B-Preview, a reasoning model fine-tuned from Qwen2.5-3B-Instruct.

[Demo: SmallThinker-3B-Preview running on an NVIDIA RTX 2080 Ti]

[Table: benchmark scores]

SmallThinker is designed for the following use cases:

  1. Edge Deployment: Its small size makes it ideal for deployment on resource-constrained devices.
  2. Draft Model for QwQ-32B-Preview: SmallThinker can serve as a fast, efficient draft model for the larger QwQ-32B-Preview model. In our tests with llama.cpp, speculative decoding with SmallThinker as the draft yields over a 70% speedup (from 40 tokens/s to 70 tokens/s on an NVIDIA RTX 4090); see the sketch after this list.
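To illustrate the draft-model (speculative decoding) idea, here is a minimal sketch using Hugging Face transformers' assisted generation rather than llama.cpp; the Hugging Face model IDs are assumptions, so substitute the actual repos. Both models must share a tokenizer, which holds here since SmallThinker is fine-tuned from Qwen2.5-3B-Instruct.

```python
# Minimal sketch: speculative (assisted) decoding with SmallThinker as the draft.
# Model IDs below are assumptions; replace them with the actual Hugging Face repos.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "Qwen/QwQ-32B-Preview"               # large target model
draft_id = "PowerInfer/SmallThinker-3B-Preview"  # small draft model (assumed ID)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("How many prime numbers are there below 100?", return_tensors="pt").to(target.device)

# Passing assistant_model turns on assisted generation: the draft proposes
# several tokens at a time and the target verifies them in one forward pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```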

We believe that achieving reasoning capability hinges on generating long chain-of-thought (CoT) traces. Therefore, based on QwQ-32B-Preview, we used various synthesis techniques (such as PersonaHub) to create the QWQ-LONGCOT-500K dataset. Compared to similar datasets, over 75% of our samples have outputs exceeding 8K tokens. To encourage research in the open-source community, we have also made the dataset publicly available; feel free to use it!
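As a quick sanity check of the length claim, here is a hedged sketch that estimates the share of outputs above 8K tokens on a random sample; the dataset ID and the "output" column name are assumptions, so consult the dataset card for the actual schema.

```python
# Sketch: estimate what fraction of dataset outputs exceed 8K tokens.
# The dataset ID and "output" column are assumptions; check the dataset card.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("PowerInfer/QWQ-LONGCOT-500K", split="train")  # assumed ID
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Tokenize a 1,000-sample subset to keep the check cheap.
sample = ds.shuffle(seed=0).select(range(1000))
lengths = [len(tok(row["output"]).input_ids) for row in sample]

frac = sum(n > 8192 for n in lengths) / len(lengths)
print(f"Sampled outputs over 8K tokens: {frac:.1%}")
```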

Limitations:
This is only our first step, and the model currently has some issues: it tends to produce repetitive outputs. Increasing the repetition penalty mitigates this problem; a minimal example follows below.
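The snippet below is a minimal sketch of raising the repetition penalty with Hugging Face transformers; the model ID and the starting value of 1.2 are assumptions to tune for your workload.

```python
# Sketch: a higher repetition penalty to curb repetitive outputs.
# Model ID and penalty value are assumptions; tune them for your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PowerInfer/SmallThinker-3B-Preview"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Prove that the sum of two even numbers is even.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,  # values > 1.0 penalize already-generated tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```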
We will continue to iterate on similar models, and we hope that in the future, everyone will have their own reasoning model!

Although our demo runs on PC GPUs, we are currently developing an inference framework for SLMs that is specifically optimized for Qualcomm NPUs. Stay tuned!