diff --git a/the_seventy_maxims_of_effective_machine_learning_engineers.myco b/the_seventy_maxims_of_effective_machine_learning_engineers.myco
index d63398a..6a47890 100644
--- a/the_seventy_maxims_of_effective_machine_learning_engineers.myco
+++ b/the_seventy_maxims_of_effective_machine_learning_engineers.myco
@@ -1,70 +1,70 @@
-* Preprocess, then train.
-* A training loop in motion outranks a perfect architecture that isn’t implemented.
-* A debugger with a stack trace outranks everyone else.
-* Regularization covers a multitude of overfitting sins.
-* Feature importance and data leakage should be easier to tell apart.
-* If increasing model complexity wasn’t your last resort, you failed to add enough layers.
-* If the accuracy is high enough, stakeholders will stop complaining about the compute costs.
-* Harsh critiques have their place—usually in the rejected pull requests.
-* Never turn your back on a deployed model.
-* Sometimes the only way out is through… through another epoch.
-* Every dataset is trainable—at least once.
-* A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.
-* Do unto others’ hyperparameters as you would have them do unto yours.
-* “Innovative architecture” means never asking, “What’s the worst thing this could hallucinate?”
-* Only you can prevent vanishing gradients.
-* Your model is in the leaderboards: be sure it has dropout.
-* The longer training goes without overfitting, the bigger the validation-set disaster.
-* If the optimizer is leading from the front, watch for exploding gradients in the rear.
-* The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.
-* If you’re not willing to prune your own layers, you’re not willing to deploy.
-* Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised,” and it’ll generate new ones for you to validate tomorrow.
-* If you’re manually labeling data, somebody’s done something wrong.
-* Training loss and validation loss should be easier to tell apart.
-* Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.
-* If your model’s failure is covered by the SLA, you didn’t test enough edge cases.
-* “Fire-and-forget training” is fine, provided you never actually forget to monitor drift.
-* Don’t be afraid to be the first to try a random seed.
-* If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.
-* The enemy of my bias is my variance. No more. No less.
-* A little dropout goes a long way. The less you use, the further backpropagates.
-* Only overfitters prosper (temporarily).
-* Any model is production-ready if you can containerize it.
-* If you’re logging metrics, you’re being audited.
-* If you’re seeing NaN, you need a smaller learning rate.
-* That which does not break your model has made a suboptimal adversarial example.
-* When the loss plateaus, the wise call for more data.
-* There is no “overkill.” There is only “more epochs” and “CUDA out of memory.”
-* What’s trivial in Jupyter can still crash in production.
-* There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.
-* Not all NaN is a bug—sometimes it’s a feature.
-* “Do you have a checkpoint?” means “I can’t fix this training run.”
-* “They’ll never expect this activation function” means “I want to try something non-differentiable.”
-* If it’s a hack and it works, it’s still a hack and you’re lucky.
-* If it can parallelize inference, it can double as a space heater.
-* The size of the grant is inversely proportional to the reproducibility of the results.
-* Don’t try to save money by undersampling.
-* Don’t expect the data to cooperate in the creation of your dream benchmark.
-* If it ain’t overfit, it hasn’t been trained on enough epochs.
-* Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.
-* If it only works on the training set, it’s defective.
-* Let them see you tune the hyperparameters before you abandon the project.
-* The framework you’ve got is never the framework you want.
-* The data you’ve got is never the data you want.
-* It’s only too many layers if you can’t fit them in VRAM.
-* It’s only too many parameters if they’re multiplying NaNs.
-* Data engineers exist to format tables for people with real GPUs.
-* Reinforcement learning exists to burn through compute budgets on simulated environments.
-* The whiteboard is mightiest when it sketches architectures for more transformers.
-* “Two dropout layers is probably not going to be enough.”
-* A model’s inference time is inversely proportional to the urgency of the demo.
-* Don’t bring BERT into a logistic regression.
-* Any tensor labeled “output” is dangerous at both ends.
-* The CTO knows how to do it by knowing who Googled it.
-* An ounce of precision is worth a pound of recall.
-* After the merge, be the one with the main branch, not the one with the conflicts.
-* Necessity is the mother of synthetic data.
-* If you can’t explain it, cite the arXiv paper.
-* Deploying with confidence intervals doesn’t mean you shouldn’t also deploy with a kill switch.
-* Sometimes SOTA is a function of who had the biggest TPU pod.
-* Failure is not an option—it is mandatory. The option is whether to let failure be the last epoch or a learning rate adjustment.
\ No newline at end of file
+*. Preprocess, then train.
+*. A training loop in motion outranks a perfect architecture that isn’t implemented.
+*. A debugger with a stack trace outranks everyone else.
+*. Regularization covers a multitude of overfitting sins.
+*. Feature importance and data leakage should be easier to tell apart.
+*. If increasing model complexity wasn’t your last resort, you failed to add enough layers.
+*. If the accuracy is high enough, stakeholders will stop complaining about the compute costs.
+*. Harsh critiques have their place—usually in the rejected pull requests.
+*. Never turn your back on a deployed model.
+*. Sometimes the only way out is through… through another epoch.
+*. Every dataset is trainable—at least once.
+*. A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.
+*. Do unto others’ hyperparameters as you would have them do unto yours.
+*. “Innovative architecture” means never asking, “What’s the worst thing this could hallucinate?”
+*. Only you can prevent vanishing gradients.
+*. Your model is in the leaderboards: be sure it has dropout.
+*. The longer training goes without overfitting, the bigger the validation-set disaster.
+*. If the optimizer is leading from the front, watch for exploding gradients in the rear.
+*. The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.
+*. If you’re not willing to prune your own layers, you’re not willing to deploy.
+*. Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised,” and it’ll generate new ones for you to validate tomorrow.
+*. If you’re manually labeling data, somebody’s done something wrong.
+*. Training loss and validation loss should be easier to tell apart.
+*. Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.
+*. If your model’s failure is covered by the SLA, you didn’t test enough edge cases.
+*. “Fire-and-forget training” is fine, provided you never actually forget to monitor drift.
+*. Don’t be afraid to be the first to try a random seed.
+*. If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.
+*. The enemy of my bias is my variance. No more. No less.
+*. A little dropout goes a long way. The less you use, the further backpropagates.
+*. Only overfitters prosper (temporarily).
+*. Any model is production-ready if you can containerize it.
+*. If you’re logging metrics, you’re being audited.
+*. If you’re seeing NaN, you need a smaller learning rate.
+*. That which does not break your model has made a suboptimal adversarial example.
+*. When the loss plateaus, the wise call for more data.
+*. There is no “overkill.” There is only “more epochs” and “CUDA out of memory.”
+*. What’s trivial in Jupyter can still crash in production.
+*. There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.
+*. Not all NaN is a bug—sometimes it’s a feature.
+*. “Do you have a checkpoint?” means “I can’t fix this training run.”
+*. “They’ll never expect this activation function” means “I want to try something non-differentiable.”
+*. If it’s a hack and it works, it’s still a hack and you’re lucky.
+*. If it can parallelize inference, it can double as a space heater.
+*. The size of the grant is inversely proportional to the reproducibility of the results.
+*. Don’t try to save money by undersampling.
+*. Don’t expect the data to cooperate in the creation of your dream benchmark.
+*. If it ain’t overfit, it hasn’t been trained on enough epochs.
+*. Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.
+*. If it only works on the training set, it’s defective.
+*. Let them see you tune the hyperparameters before you abandon the project.
+*. The framework you’ve got is never the framework you want.
+*. The data you’ve got is never the data you want.
+*. It’s only too many layers if you can’t fit them in VRAM.
+*. It’s only too many parameters if they’re multiplying NaNs.
+*. Data engineers exist to format tables for people with real GPUs.
+*. Reinforcement learning exists to burn through compute budgets on simulated environments.
+*. The whiteboard is mightiest when it sketches architectures for more transformers.
+*. “Two dropout layers is probably not going to be enough.”
+*. A model’s inference time is inversely proportional to the urgency of the demo.
+*. Don’t bring BERT into a logistic regression.
+*. Any tensor labeled “output” is dangerous at both ends.
+*. The CTO knows how to do it by knowing who Googled it.
+*. An ounce of precision is worth a pound of recall.
+*. After the merge, be the one with the main branch, not the one with the conflicts.
+*. Necessity is the mother of synthetic data.
+*. If you can’t explain it, cite the arXiv paper.
+*. Deploying with confidence intervals doesn’t mean you shouldn’t also deploy with a kill switch.
+*. Sometimes SOTA is a function of who had the biggest TPU pod.
+*. Failure is not an option—it is mandatory. The option is whether to let failure be the last epoch or a learning rate adjustment.
\ No newline at end of file