1/n GPT-3 is very expensive to train, costing an estimated $5M (even when you know exactly what to do).
2/n We are so far away from building AGI, but I see a powerful language model like GPT-3 as "table stakes". Language is the substrate of thought, after all.
3/n So $5M is a lower bound on the cost of training an AGI. If you are Nvidia, Google, Huawei, or any other lab that owns a supercomputer, you might be in the enviable position of being able to train it for ~$1M-$2.5M. Again, assuming you know exactly what to do.
4/n Nobody else can afford to do this right now. Let's pretend Moore's law is not over for NN accelerators - how long do we need to wait for things to be affordable?
5/n Factoring in R&D costs, let's assume computation cost halves every 2 years. At a discounted price of $2.5M today, we'll have to wait 16 years before a single training run costs *only* $10k.
6/n If you want a giant transformer to ingest video and other modalities beyond text, maybe the model will be 100x bigger than today's, i.e. ~$1M even at that future price. So it'll be 4 more years (2 decades total) before training it costs *only* the annual salary of a senior Bay Area SWE (~$250k).
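As a sanity check on the napkin math above, here is a minimal sketch under the stated assumption (compute cost halving every 2 years; the function name and dollar figures are mine, not from the thread):

```python
import math

def years_until_affordable(cost_today, target_cost, halving_years=2):
    """Years until a training run's cost falls to `target_cost`,
    assuming compute cost halves every `halving_years` years."""
    halvings = math.log2(cost_today / target_cost)
    return halvings * halving_years

# $2.5M today down to $10k: log2(250) halvings * 2 years/halving
print(years_until_affordable(2.5e6, 1e4))   # ~15.9 years, i.e. ~16

# A 100x larger model costs ~$1M at that point; down to a ~$250k
# senior-SWE salary is 2 more halvings:
print(years_until_affordable(1e6, 2.5e5))   # 4.0 more years
```

The two results reproduce the "16 years" and "4 more years" figures in tweets 5 and 6.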
7/n Some rightly question the cost/benefit tradeoff of an endeavor such as "building AGI", but humor me - if one's goal was to build AGI, how can we drastically reduce the cost in < 10 years?
8/n I don't want to hear something generic like "algorithms will get better". I work on algorithms and I am not optimistic about any of them right now. One would need to invent a new, groundbreaking, 50% cost reduction technique every 2 years.
9/n I understand that O($1M) is small for some megacorps, given the potential upside. But I would be personally conflicted about running a job that has a 95% chance of failure and costs a life-changing amount of money that could otherwise go toward important things.
10/n Suggestions welcome! I currently mentor some junior researchers getting into ML who don't have any compute resources and this is a problem weighing heavily on me lately.
11/n My hope is that we are going to have to get very clever with "dynamic programming". Never throwing away checkpoints, memoizing the result of every FLOP like no tomorrow, making hparams differentiable, etc.
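To make the "never throwing away checkpoints, memoizing every FLOP" idea concrete, here is a toy sketch of one way it might look: content-addressing each expensive training stage by its hyperparameters and lineage, and reusing the cached checkpoint on a hit. This is my own illustrative sketch (the cache layout and function names are hypothetical), not a description of any existing system:

```python
import hashlib
import json
import os
import pickle

CACHE_DIR = "./ckpt_cache"  # hypothetical local checkpoint store

def stage_key(config, parent_key=""):
    """Content-address a training stage by its hparams and the
    checkpoint it resumes from."""
    blob = json.dumps(config, sort_keys=True) + parent_key
    return hashlib.sha256(blob.encode()).hexdigest()

def run_stage(config, train_fn, parent_key=""):
    """Memoize an expensive training stage: if this exact
    (config, parent checkpoint) pair has been run before,
    load the cached result instead of recomputing it."""
    key = stage_key(config, parent_key)
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return key, pickle.load(f)
    result = train_fn(config)  # the costly part
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return key, result
```

Chaining stages by passing the returned key as the next stage's `parent_key` gives the dynamic-programming flavor: shared prefixes of a hyperparameter sweep are computed once.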
12/n To play devil's advocate on the cost item: my discomfort with spending O($1k) per experiment is a reflection of my frugal values. People who run large teams/orgs have to get comfortable with directing cash piles that far exceed what they will ever own.
13/n Still, the number of people in the world with the rationality and experience to direct large-cap-company-scale resources at a problem like this is very small. That makes it difficult for pretty much the entire research community to contribute.
14/n Another hope: because LMs have no explicit symbol grounding in the world, they push a lot of entropy into the model and require it to be huge. Better grounding (e.g. a robotic vision system) might reduce the conditional entropy of the target distribution and make training cheaper.
15/n Finally, I should clarify that I am *not* an expert on GPT-3, or CAPEX planning, or compute cost estimation in general. Would appreciate fact checks from people who can do better than my napkin calculations.
You can follow @ericjang11.