
Introduction

Today, we are thrilled to introduce the SailCraft data processing pipeline, a comprehensive open-source solution for large language model dataset curation. Built with careful attention to detail, SailCraft aims to advance data preprocessing techniques for machine learning models.

The pipeline follows a four-stage data cleaning approach (a rough sketch follows the list):

  1. Initial data cleaning
  2. Near deduplication
  3. Exact deduplication
  4. Second-round data cleaning
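
Each stage operates on the output of the previous one. As a rough illustration only, the sketch below uses hypothetical helper names (run_pipeline and placeholder stage functions, which are not SailCraft's actual API) to show the ordering of the four stages and how per-stage document counts can be tracked.

# Minimal sketch of the four-stage ordering. The helper names and placeholder
# rules below are illustrative assumptions, not SailCraft's actual interface.
from typing import Callable, List, Tuple


def run_pipeline(docs: List[str],
                 stages: List[Tuple[str, Callable[[List[str]], List[str]]]]) -> List[str]:
    """Apply each cleaning stage in order and report how many documents survive it."""
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        print(f"{name}: kept {len(docs)} of {before} documents")
    return docs


stages = [
    ("initial_cleaning", lambda d: [x for x in d if x.strip()]),  # placeholder rule
    ("near_dedup", lambda d: d),                        # e.g. MinHash-style fuzzy deduplication
    ("exact_dedup", lambda d: list(dict.fromkeys(d))),  # drop byte-identical documents
    ("second_cleaning", lambda d: d),                   # placeholder for second-round rules
]

cleaned = run_pipeline(["hello", "hello", "  ", "world"], stages)

Running near deduplication before exact deduplication matches the order listed above, and recording the count after each stage is what makes the per-stage removals auditable.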

With a particular emphasis on linguistic diversity, SailCraft provides specialized cleaning rules for a wide range of languages, including Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, along with optimized processing for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay.

Key Capabilities

SailCraft distinguishes itself through its robust and flexible data processing framework. Researchers and developers can leverage this tool to:

  • Obtain granular filtered data counts at each processing stage
  • Implement language-specific cleaning rules with ease (see the sketch after this list)
  • Conduct detailed investigations into data removal processes
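
A language-specific rule can be as simple as a per-language keep/drop predicate. The sketch below is a hypothetical illustration: the registry, the register_rule decorator, and the 50% Thai-character threshold are assumptions made for this example, not SailCraft's actual interface.

# Hypothetical sketch of a per-language cleaning rule registry; names and
# thresholds are illustrative assumptions, not SailCraft's actual API.
import re
from typing import Callable, Dict, List

LANG_RULES: Dict[str, Callable[[str], bool]] = {}


def register_rule(lang: str):
    """Associate a keep/drop predicate with a language code."""
    def wrapper(fn: Callable[[str], bool]) -> Callable[[str], bool]:
        LANG_RULES[lang] = fn
        return fn
    return wrapper


@register_rule("th")
def keep_thai(doc: str) -> bool:
    # Keep a document only if at least half of its characters fall in the Thai Unicode block.
    thai_chars = len(re.findall(r"[\u0E00-\u0E7F]", doc))
    return len(doc) > 0 and thai_chars / len(doc) >= 0.5


def clean(docs: List[str], lang: str) -> List[str]:
    """Apply the rule registered for `lang` and report how many documents were removed."""
    rule = LANG_RULES.get(lang, lambda _doc: True)
    kept = [d for d in docs if rule(d)]
    print(f"{lang}: removed {len(docs) - len(kept)} of {len(docs)} documents")
    return kept

Keeping one small predicate per language makes it straightforward to add or adjust rules for an individual language without touching the rest of the pipeline.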

The pipeline’s design reflects our commitment to transparency and open scientific research. By providing a comprehensive, adaptable data processing solution, we aim to empower the machine learning community with high-quality dataset curation tools.

Acknowledgements

This project stands as a testament to the collaborative spirit of the open-source community, drawing inspiration from and building on the following key projects:

  • BigScience data cleaning tool
  • Chenghao Mou’s all-in-one deduplication tool
  • Google’s deduplication project

Looking Forward

We invite the community to explore, utilize, and provide feedback on SailCraft. Your insights will be crucial in refining and expanding this data processing framework.

Citation

@article{dou2024sailor,
  title={Sailor: Open Language Models for South-East Asia},
  author={Dou, Longxu and Liu, Qian and Zeng, Guangtao and Guo, Jia and Zhou, Jiahui and Lu, Wei and Lin, Min},
  journal={arXiv preprint arXiv:2404.03608},
  year={2024}
}