Sailor

Serving the Underserved in South-East Asia with Open LLMs

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

In this blog, we introduce Sailor2, a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA). Our research highlights a strong demand for models in the 8B and 20B parameter range for production use, alongside a 1B model for specialized applications, such as speculative decoding and research purposes. These models, released under the Apache 2.0 license, provide enhanced accessibility to advanced language technologies across the region. ...

December 2, 2024 · 11 min · 2154 words · Sailor Team

SailCraft: Data Toolkit for Sailor Language Models

Today, we are thrilled to introduce the SailCraft data processing pipeline, a comprehensive open-source solution for large language model dataset curation. Built with meticulous attention to detail, SailCraft represents a significant advancement in data preprocessing techniques for machine learning models. The pipeline encompasses a four-stage data cleaning approach: (1) initial data cleaning, (2) near deduplication, (3) exact deduplication, and (4) second-round data cleaning. With a particular emphasis on linguistic diversity, SailCraft provides specialized cleaning capabilities for a wide range of languages, including Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, along with optimized processing for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay. ...

May 1, 2024 · 2 min · 275 words · Sailor Team

Sailor: Open Language Models for South-East Asia

Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao. Developed with careful data curation, Sailor models are designed to understand and generate text across the diverse linguistic landscapes of the SEA region. Built from Qwen 1.5, Sailor comes in a range of sizes, from 0.5B to 14B, to suit different requirements. ...

May 1, 2024 · 6 min · 1155 words · Sailor Team