Stanford Researchers Introduce Sophia: A Scalable Second-Order Optimizer For Language Model Pre-Training

May 26, 07:45




Given the high up-front cost of training a language model, any non-trivial improvement to the optimization process can substantially reduce the time and money needed to complete training. Adam and its variants have long been the state of the art, while second-order (Hessian-based) optimizers have rarely been used because of their greater per-step overhead.

The researchers propose Sophia (Second-order Clipped Stochastic Optimization), a second-order optimizer that uses a lightweight estimate of the diagonal Hessian as its pre-conditioner. According to the researchers, Sophia can train LLMs twice as fast as Adam. The update is computed by taking the mean of the gradients and dividing it by the mean of the estimated Hessian, and is then clipped element by element. The clipping limits the size of the worst-case update and mitigates the effect of the trajectory’s non-convexity and rapid Hessian changes. Adding a few new lines of code might cut a $2M training budget to the $1M range (assuming scaling laws apply).
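The update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `sophia_step`, the hyperparameters `lr`, `beta1`, `rho`, and `eps`, and their default values are all hypothetical, and the "mean of the gradients" is assumed here to be an exponential moving average.

```python
def clip(x, bound):
    """Clamp a scalar to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def sophia_step(params, grads, m, h, lr=0.1, beta1=0.9, rho=1.0, eps=1e-12):
    """One clipped, Hessian-preconditioned step (illustrative sketch only).

    params, grads, m, h are equal-length lists of floats:
      m - running average of the gradients (assumed to be an EMA)
      h - current estimate of the diagonal Hessian, supplied by the caller
    Returns the updated parameters and the updated gradient average.
    """
    new_params, new_m = [], []
    for p, g, m_i, h_i in zip(params, grads, m, h):
        # Update the running average of the gradient.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Divide by the curvature estimate; the floor eps guards against
        # zero or negative Hessian entries (an assumption of this sketch).
        raw = m_i / max(rho * h_i, eps)
        # Element-wise clipping bounds the worst-case update size.
        step = clip(raw, 1.0)
        new_params.append(p - lr * step)
        new_m.append(m_i)
    return new_params, new_m
```

In this sketch, a coordinate with a tiny or negative curvature estimate produces a very large raw ratio, which the clip reduces to a step of at most `lr` in the direction opposite the averaged gradient.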

The average per-step time and memory overhead are low because Sophia estimates the diagonal Hessian only every few iterations. On language modeling with GPT-2 models ranging from 125 million to 770 million parameters, Sophia is reported to be twice as fast as Adam in number of steps, total compute, and wall-clock time. The researchers also show that Sophia adapts to the large variations in curvature across parameter dimensions that underlie language modeling tasks, and that its runtime bound is independent of the loss’s condition number.
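To make the "estimate only every few iterations" idea concrete, here is a small sketch using Hutchinson's estimator, one standard way to estimate a Hessian diagonal from Hessian-vector products. The article does not say which estimator is used, so treat that choice as an assumption. The helper names (`hutchinson_diag`, `refresh_schedule`, `ema_update`), the refresh interval `k`, the decay `beta2`, and the toy 2x2 Hessian are all hypothetical.

```python
import random


def hutchinson_diag(hvp, dim, rng):
    """Single-sample estimate of diag(H): u * (H u) with random signs u."""
    u = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    hu = hvp(u)  # Hessian-vector product, supplied by the caller
    return [u_i * hu_i for u_i, hu_i in zip(u, hu)]


def refresh_schedule(num_steps, k):
    """Steps at which the Hessian estimate is recomputed (every k-th step)."""
    return [t for t in range(num_steps) if t % k == 0]


def ema_update(h, estimate, beta2=0.99):
    """Blend a fresh estimate into the running diagonal-Hessian average."""
    return [beta2 * h_i + (1.0 - beta2) * e_i for h_i, e_i in zip(h, estimate)]


# Toy quadratic loss 0.5 * x^T H x with a known Hessian, so H u is exact.
H = [[4.0, 1.0], [1.0, 0.5]]


def toy_hvp(u):
    """Exact Hessian-vector product for the toy quadratic."""
    return [sum(H[i][j] * u[j] for j in range(2)) for i in range(2)]


rng = random.Random(0)
samples = [hutchinson_diag(toy_hvp, 2, rng) for _ in range(2000)]
# Averaging many single-sample estimates approaches the true diagonal (4.0, 0.5).
avg = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
```

Each individual estimate is noisy because of the off-diagonal terms, which is why the sketch folds estimates into a running average. Recomputing only on the steps returned by `refresh_schedule` spreads the cost of the extra Hessian-vector product over `k` ordinary steps.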

Key features

  • Sophia is straightforward to implement with PyTorch: it uses a lightweight estimate of the diagonal Hessian as a pre-conditioner on the gradient (see pseudo-code in the first picture) and then clips each element individually.
  • Sophia also helps with pre-training stability. Gradient clipping is triggered much less often than with Adam and Lion, and the re-parameterization trick, in which the attention temperature varies with the layer index, is unnecessary.
  • Sophia ensures a consistent loss reduction across all parameter dimensions by penalizing updates more heavily in sharp dimensions (with large Hessian) than in flat dimensions (with small Hessian). In a two-dimensional example, Adam converges more slowly.
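The point about sharp and flat dimensions can be illustrated on a toy two-dimensional quadratic. This sketch is not from the paper. It compares plain gradient descent (used as a simple stand-in baseline, not Adam) with a clipped step that divides the gradient by the exact curvature. The curvature values, learning rates, and function names are all hypothetical.

```python
def clip(x, bound):
    """Clamp a scalar to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


# Hypothetical curvatures: one sharp dimension and one flat dimension.
CURVATURE = (100.0, 0.01)


def loss(p):
    """Toy loss 0.5 * sum(c_i * x_i^2)."""
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, p))


def grad(p):
    """Gradient of the toy loss."""
    return [c * x for c, x in zip(CURVATURE, p)]


def gd_step(p, lr):
    """Plain gradient descent: lr must stay small to suit the sharp dimension."""
    return [x - lr * g for x, g in zip(p, grad(p))]


def preconditioned_step(p, lr, rho=1.0, eps=1e-12):
    """Divide each gradient entry by its curvature, then clip element-wise."""
    return [
        x - lr * clip(g / max(rho * c, eps), 1.0)
        for x, g, c in zip(p, grad(p), CURVATURE)
    ]


def run(step, p, lr, n):
    """Apply the given step function n times."""
    for _ in range(n):
        p = step(p, lr)
    return p


start = [1.0, 1.0]
gd_final = run(gd_step, start, lr=0.01, n=50)
pc_final = run(preconditioned_step, start, lr=0.1, n=50)
```

With these settings, gradient descent resolves the sharp coordinate quickly but barely moves the flat one, while the preconditioned step shrinks both coordinates at the same rate. This mirrors, in a very simplified form, the claim of consistent progress across dimensions.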

Important aspects of this undertaking 

  • This shows that even with limited resources, academics can study LLM pre-training and develop novel, effective algorithms.
  • In addition to drawing on material from earlier optimization courses, the researchers relied heavily on theoretical reasoning throughout the research process.

In the code, which is scheduled for release tomorrow, the researchers use a slightly modified version of the commonly accepted definition of the learning rate (LR). The paper’s LR definition is tidier for writing, but it may not be the best fit for code.

Check out the Paper.

Dhanshree Shripad Shenwai

Dhanshree Shenwai is a computer science engineer with experience in FinTech companies across the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.
