Meet Argilla: An Open-Source Data Curation Platform for Large Language Models (LLMs) and MLOps for Natural Language Processing

May 19, 17:43



Generative Artificial Intelligence has drawn worldwide attention, especially in the past few months. ChatGPT, the popular chatbot developed by OpenAI, has more than a million users, ranging from AI researchers to students. Built on the GPT architecture, this Large Language Model (LLM) can answer questions, generate content, summarize long passages of text, complete code, and more. With OpenAI’s release of GPT-4, ChatGPT now also supports multimodal input. Other well-known models such as DALL-E, BERT, and LLaMA have also contributed to major advances in AI.

Argilla is a recently introduced open-source data curation platform for Large Language Models. It is designed to support the full lifecycle of developing, evaluating, and improving Natural Language Processing models, from initial experimentation to deployment in production. The platform combines human and machine feedback to build more robust LLMs through faster data curation.

Argilla supports each step of the MLOps cycle, from data labeling to model monitoring. Data labeling is a crucial step in training supervised NLP models, because annotating raw text produces the high-quality labeled datasets those models need. Model monitoring is equally important: it tracks the performance and behavior of deployed models in real time, which helps maintain their reliability and consistency.

The developers have shared a few principles on which Argilla is based:

  1. Open-source – Argilla is open-source, so it is free for everyone to use and modify. It supports major NLP libraries such as Hugging Face transformers, spaCy, Stanford Stanza, and Flair, and users can combine their preferred libraries without implementing any specific interface.
  2. End-to-end – Argilla provides an end-to-end solution for ML model development by bridging the gap between data collection, model iteration, and production monitoring. It treats data collection as an ongoing process for continuously improving the model and enables iterative development throughout the Machine Learning lifecycle.
  3. Better user and developer experience – Argilla aims for an environment where domain experts can easily interpret, annotate, and experiment with data, while engineers keep complete control over the data pipelines.
  4. Beyond traditional hand-labeling – Argilla goes beyond traditional hand-labeling workflows by offering a range of data annotation approaches. Users can combine hand labeling with active learning, bulk labeling, and zero-shot models, which makes annotation workflows more efficient and cost-effective.
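
The idea of combining hand labeling with model predictions can be illustrated with a small, library-independent sketch. This is not Argilla's API: the `Prediction` type, the `split_by_confidence` helper, the 0.9 threshold, and the sample predictions are all hypothetical. The sketch accepts confident predictions in bulk and queues the least confident ones for a human annotator, which is the basic idea behind uncertainty-based active learning.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    text: str
    label: str
    score: float  # model confidence in [0, 1]


def split_by_confidence(preds, threshold=0.9):
    """Separate predictions into auto-accepted and needs-review groups."""
    accepted = [p for p in preds if p.score >= threshold]
    review = [p for p in preds if p.score < threshold]
    # Least confident first, so annotators start with the most uncertain items.
    review.sort(key=lambda p: p.score)
    return accepted, review


# Hypothetical zero-shot predictions, for illustration only.
predictions = [
    Prediction("Great battery life", "positive", 0.97),
    Prediction("It arrived on Tuesday", "neutral", 0.55),
    Prediction("Not what I expected", "negative", 0.71),
    Prediction("Terrible support", "negative", 0.95),
]

accepted, review = split_by_confidence(predictions)
print("bulk-labeled:", [p.text for p in accepted])
print("for human review:", [p.text for p in review])
```

In a real workflow the reviewed items would be fed back into training, and the threshold would be tuned against the error rate observed on the bulk-labeled group.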

Argilla is a production-ready framework that supports data curation, evaluation, model monitoring, debugging, and explainability. It automates human-in-the-loop workflows and integrates with the tools of the user’s choice. It can be deployed locally with the Docker command 'docker run -d --name argilla -p 6900:6900 argilla/argilla-quickstart:latest'.
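
For scripted setups, the same quickstart container could be started from Python. The following is a minimal sketch, assuming Docker is installed and on the PATH. The helper names `quickstart_command`, `start_quickstart`, and `is_port_open` are hypothetical; only the image name, container name, and port are taken from the Docker command in the article.

```python
import shlex
import socket
import subprocess


def quickstart_command(name="argilla", port=6900):
    """Build the docker argument list for the Argilla quickstart image."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", f"{port}:6900",  # host port -> container port 6900
        "argilla/argilla-quickstart:latest",
    ]


def start_quickstart():
    """Start the container; raises CalledProcessError if docker fails."""
    subprocess.run(quickstart_command(), check=True)


def is_port_open(host="localhost", port=6900, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(quickstart_command()))
```

Calling `start_quickstart()` and then polling `is_port_open()` is one way to wait until the container is accepting connections on port 6900.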


Tanya Malhotra

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
