Create ‘the_seventy_maxims_of_effective_machine_learning_engineers’
@@ -0,0 +1,70 @@
* Preprocess, then train.
* A training loop in motion outranks a perfect architecture that isn’t implemented.
* A debugger with a stack trace outranks everyone else.
* Regularization covers a multitude of overfitting sins.
* Feature importance and data leakage should be easier to tell apart.
* If increasing model complexity wasn’t your last resort, you failed to add enough layers.
* If the accuracy is high enough, stakeholders will stop complaining about the compute costs.
* Harsh critiques have their place—usually in the rejected pull requests.
* Never turn your back on a deployed model.
* Sometimes the only way out is through… through another epoch.
* Every dataset is trainable—at least once.
* A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.
* Do unto others’ hyperparameters as you would have them do unto yours.
* “Innovative architecture” means never asking, “What’s the worst thing this could hallucinate?”
* Only you can prevent vanishing gradients.
* Your model is on the leaderboards: be sure it has dropout.
* The longer training goes without overfitting, the bigger the validation-set disaster.
* If the optimizer is leading from the front, watch for exploding gradients in the rear.
* The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.
* If you’re not willing to prune your own layers, you’re not willing to deploy.
* Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised,” and it’ll generate new ones for you to validate tomorrow.
* If you’re manually labeling data, somebody’s done something wrong.
* Training loss and validation loss should be easier to tell apart.
* Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.
* If your model’s failure is covered by the SLA, you didn’t test enough edge cases.
* “Fire-and-forget training” is fine, provided you never actually forget to monitor drift.
* Don’t be afraid to be the first to try a random seed.
* If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.
* The enemy of my bias is my variance. No more. No less.
* A little dropout goes a long way. The less you use, the further the gradients backpropagate.
* Only overfitters prosper (temporarily).
* Any model is production-ready if you can containerize it.
* If you’re logging metrics, you’re being audited.
* If you’re seeing NaN, you need a smaller learning rate.
* That which does not break your model has made a suboptimal adversarial example.
* When the loss plateaus, the wise call for more data.
* There is no “overkill.” There is only “more epochs” and “CUDA out of memory.”
* What’s trivial in Jupyter can still crash in production.
* There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.
* Not all NaN is a bug—sometimes it’s a feature.
* “Do you have a checkpoint?” means “I can’t fix this training run.”
* “They’ll never expect this activation function” means “I want to try something non-differentiable.”
* If it’s a hack and it works, it’s still a hack and you’re lucky.
* If it can parallelize inference, it can double as a space heater.
* The size of the grant is inversely proportional to the reproducibility of the results.
* Don’t try to save money by undersampling.
* Don’t expect the data to cooperate in the creation of your dream benchmark.
* If it ain’t overfit, it hasn’t been trained on enough epochs.
* Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.
* If it only works on the training set, it’s defective.
* Let them see you tune the hyperparameters before you abandon the project.
* The framework you’ve got is never the framework you want.
* The data you’ve got is never the data you want.
* It’s only too many layers if you can’t fit them in VRAM.
* It’s only too many parameters if they’re multiplying NaNs.
* Data engineers exist to format tables for people with real GPUs.
* Reinforcement learning exists to burn through compute budgets on simulated environments.
* The whiteboard is mightiest when it sketches architectures for more transformers.
* “Two dropout layers is probably not going to be enough.”
* A model’s inference time is inversely proportional to the urgency of the demo.
* Don’t bring BERT into a logistic regression.
* Any tensor labeled “output” is dangerous at both ends.
* The CTO knows how to do it by knowing who Googled it.
* An ounce of precision is worth a pound of recall.
* After the merge, be the one with the main branch, not the one with the conflicts.
* Necessity is the mother of synthetic data.
* If you can’t explain it, cite the arXiv paper.
* Deploying with confidence intervals doesn’t mean you shouldn’t also deploy with a kill switch.
* Sometimes SOTA is a function of who had the biggest TPU pod.
* Failure is not an option—it is mandatory. The option is whether to let failure be the last epoch or a learning rate adjustment.