The Seventy Maxims of Effective Machine Learning Engineers

* Preprocess, then train.
* A training loop in motion outranks a perfect architecture that isn’t implemented.
* A debugger with a stack trace outranks everyone else.
* Regularization covers a multitude of overfitting sins.
* Feature importance and data leakage should be easier to tell apart.
* If increasing model complexity wasn’t your last resort, you failed to add enough layers.
* If the accuracy is high enough, stakeholders will stop complaining about the compute costs.
* Harsh critiques have their place—usually in the rejected pull requests.
* Never turn your back on a deployed model.
* Sometimes the only way out is through… through another epoch.
* Every dataset is trainable—at least once.
* A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.
* Do unto others’ hyperparameters as you would have them do unto yours.
* “Innovative architecture” means never asking, “What’s the worst thing this could hallucinate?”
* Only you can prevent vanishing gradients.
* Your model is on the leaderboards: be sure it has dropout.
* The longer training goes without overfitting, the bigger the validation-set disaster.
* If the optimizer is leading from the front, watch for exploding gradients in the rear.
* The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.
* If you’re not willing to prune your own layers, you’re not willing to deploy.
* Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised,” and it’ll generate new ones for you to validate tomorrow.
* If you’re manually labeling data, somebody’s done something wrong.
* Training loss and validation loss should be easier to tell apart.
* Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.
* If your model’s failure is covered by the SLA, you didn’t test enough edge cases.
* “Fire-and-forget training” is fine, provided you never actually forget to monitor drift.
* Don’t be afraid to be the first to try a random seed.
* If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.
* The enemy of my bias is my variance. No more. No less.
* A little dropout goes a long way. The less you use, the further the gradient backpropagates.
* Only overfitters prosper (temporarily).
* Any model is production-ready if you can containerize it.
* If you’re logging metrics, you’re being audited.
* If you’re seeing NaN, you need a smaller learning rate.
* That which does not break your model has made a suboptimal adversarial example.
* When the loss plateaus, the wise call for more data.
* There is no “overkill.” There is only “more epochs” and “CUDA out of memory.”
* What’s trivial in Jupyter can still crash in production.
* There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.
* Not all NaN is a bug—sometimes it’s a feature.
* “Do you have a checkpoint?” means “I can’t fix this training run.”
* “They’ll never expect this activation function” means “I want to try something non-differentiable.”
* If it’s a hack and it works, it’s still a hack and you’re lucky.
* If it can parallelize inference, it can double as a space heater.
* The size of the grant is inversely proportional to the reproducibility of the results.
* Don’t try to save money by undersampling.
* Don’t expect the data to cooperate in the creation of your dream benchmark.
* If it ain’t overfit, it hasn’t been trained on enough epochs.
* Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.
* If it only works on the training set, it’s defective.
* Let them see you tune the hyperparameters before you abandon the project.
* The framework you’ve got is never the framework you want.
* The data you’ve got is never the data you want.
* It’s only too many layers if you can’t fit them in VRAM.
* It’s only too many parameters if they’re multiplying NaNs.
* Data engineers exist to format tables for people with real GPUs.
* Reinforcement learning exists to burn through compute budgets on simulated environments.
* The whiteboard is mightiest when it sketches architectures for more transformers.
* “Two dropout layers is probably not going to be enough.”
* A model’s inference time is inversely proportional to the urgency of the demo.
* Don’t bring BERT into a logistic regression.
* Any tensor labeled “output” is dangerous at both ends.
* The CTO knows how to do it by knowing who Googled it.
* An ounce of precision is worth a pound of recall.
* After the merge, be the one with the main branch, not the one with the conflicts.
* Necessity is the mother of synthetic data.
* If you can’t explain it, cite the arXiv paper.
* Deploying with confidence intervals doesn’t mean you shouldn’t also deploy with a kill switch.
* Sometimes SOTA is a function of who had the biggest TPU pod.
* Failure is not an option—it is mandatory. The option is whether to let failure be the last epoch or a learning rate adjustment.
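
A few of the maxims above (the gentle learning rate, the NaN sightings, the exploding gradients) describe habits that fit in a handful of lines. The sketch below is only an illustration, not part of the maxims: it assumes PyTorch, an invented toy regression task, and arbitrary choices for the warmup length, clip norm, and learning-rate cut.

```python
# Minimal sketch: warmup the learning rate, clip gradients, and back off when the loss goes non-finite.
# Toy data and all constants are made up for illustration.
import math
import torch

torch.manual_seed(0)  # "Don't be afraid to be the first to try a random seed."
x = torch.randn(256, 8)
y = x @ torch.randn(8, 1) + 0.1 * torch.randn(256, 1)

model = torch.nn.Linear(8, 1)
base_lr = 1e-2
opt = torch.optim.SGD(model.parameters(), lr=base_lr)

for step in range(200):
    # Gentle warmup: start small, crank it up once the loss has had a chance to stabilize.
    warmup = min(1.0, (step + 1) / 50)
    for group in opt.param_groups:
        group["lr"] = base_lr * warmup

    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)

    if not math.isfinite(loss.item()):
        # "If you're seeing NaN, you need a smaller learning rate."
        base_lr *= 0.5
        continue

    loss.backward()
    # Exploding-gradient guard before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```

The halving-on-NaN rule is deliberately blunt; restoring from a checkpoint before retrying is the gentler habit.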