At this scale, a supercomputer would likely need terabytes of working memory just to store the model. The memory problem gets even worse when you bring GPUs into the picture. GPUs can process neural network workloads orders of magnitude faster than general-purpose CPUs can, but each GPU has a relatively small amount of RAM; even the most expensive Nvidia Tesla GPUs have only 32GB of RAM. Medini says, "Training such a model is prohibitive due to massive inter-GPU communication."
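To get a feel for the scale, here is a back-of-envelope sketch in Python. The hidden-layer width and weight precision are illustrative assumptions, not figures from the researchers; the point is only that a dense output layer over 100 million classes is enormous on its own.

```python
# Back-of-envelope only; layer sizes here are assumptions for illustration,
# not figures reported by the researchers.
num_classes = 100_000_000      # one output per product
hidden_dim = 2_000             # assumed width of the final hidden layer
bytes_per_weight = 4           # float32

output_layer_bytes = num_classes * hidden_dim * bytes_per_weight
print(f"{output_layer_bytes / 1e12:.1f} TB just for the output layer's weights")
# -> 0.8 TB, before gradients, optimizer state, or the rest of the network
```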
Instead of training on the entire 100 million outcomes (product purchases, in this example), MACH divides them into three "buckets," each containing roughly 33.3 million randomly selected outcomes; that partition is the first "world." MACH then creates another world, and in that world the same 100 million outcomes are again randomly sorted into three buckets. Crucially, the random sorting is done separately in World One and World Two: they each contain the same 100 million outcomes, but their random distribution into buckets differs from world to world.
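Here is a minimal Python sketch of that bucketing, using a simple per-world hash so that the same product generally lands in different buckets in different worlds. The hash constants and bucket counts below are illustrative, not the ones used by the researchers.

```python
NUM_BUCKETS = 3   # buckets per world
NUM_WORLDS = 2    # independent random partitions ("worlds")

# One cheap hash per world: h_w(x) = ((a_w * x + b_w) mod p) mod NUM_BUCKETS.
# The constants are arbitrary, chosen only for illustration.
P = 2_147_483_647  # a large prime
WORLD_PARAMS = [(1_234_577, 98_765), (7_654_321, 13_331)]  # (a_w, b_w) per world

def bucket_of(product_id: int, world: int) -> int:
    """Pseudo-randomly assign a product to one of NUM_BUCKETS buckets in a world."""
    a, b = WORLD_PARAMS[world]
    return ((a * product_id + b) % P) % NUM_BUCKETS

# The same product is sorted independently in each world.
pid = 42_123_456
print([bucket_of(pid, w) for w in range(NUM_WORLDS)])  # one bucket index per world
```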
With each world instantiated, a search is fed to both a "World One" classifier and a "World Two" classifier, with only three possible outcomes apiece. "What is this person thinking about?" asks Shrivastava. "The most probable class is something that is common between these two buckets."
At this point, there are nine possible outcomes: three buckets in World One times three buckets in World Two. But MACH only needs to train six classes (World One's three buckets plus World Two's three buckets) to model that nine-outcome search space. The advantage grows as more "worlds" are created: a three-world approach covers 27 outcomes with only nine classes, a four-world setup covers 81 outcomes with 12 classes, and so forth. "I am paying a cost linearly, and I am getting an exponential improvement," Shrivastava says.
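The arithmetic behind that quote is easy to check: the total number of classifier outputs grows linearly with the number of worlds, while the number of distinguishable bucket combinations grows exponentially. A quick sketch, using the article's example of three buckets per world:

```python
NUM_BUCKETS = 3  # buckets per world, as in the example above

for num_worlds in range(2, 5):
    classes_trained = NUM_BUCKETS * num_worlds   # linear cost: total classifier outputs
    combinations = NUM_BUCKETS ** num_worlds     # exponential coverage: distinct bucket combos
    print(f"{num_worlds} worlds: {classes_trained} classes -> {combinations} combinations")
# 2 worlds: 6 classes -> 9 combinations
# 3 worlds: 9 classes -> 27 combinations
# 4 worlds: 12 classes -> 81 combinations
```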
Better yet, MACH lends itself to distributed computing on smaller individual instances. The worlds "don't even have to talk to one another," Medini says. "In principle, you could train each [world] on a single GPU, which is something you could never do with a non-independent approach." In the real world, the researchers applied MACH to a 49-million-product Amazon training database, randomly sorting it into 10,000 buckets in each of 32 separate worlds. That reduced the required parameters in the model by more than an order of magnitude; according to Medini, training the model required both less time and less memory than some of the best reported training runs on models with comparable parameter counts.
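Because each world's classifier only ever predicts its own buckets, training is embarrassingly parallel, and inference merges the per-world bucket probabilities for each candidate product. Below is a hedged sketch of that merging step: it assumes each world's classifier returns a probability distribution over its 10,000 buckets, reuses the illustrative bucket_of function from the earlier sketch, and combines worlds by simple averaging; the researchers' exact scoring rule is simplified here.

```python
import numpy as np

NUM_WORLDS = 32
NUM_BUCKETS = 10_000

def score_products(bucket_probs, candidate_ids, bucket_of):
    """Score candidates by averaging, across worlds, the probability that
    each world's classifier assigned to the bucket the product hashes into.

    bucket_probs: list of NUM_WORLDS arrays, each of shape (NUM_BUCKETS,),
                  produced independently (e.g., on separate GPUs).
    bucket_of:    function (product_id, world) -> bucket index for that world.
    """
    scores = {}
    for pid in candidate_ids:
        per_world = [bucket_probs[w][bucket_of(pid, w)] for w in range(NUM_WORLDS)]
        scores[pid] = float(np.mean(per_world))
    return scores

# The combined output layer across all worlds is 32 * 10,000 = 320,000 units,
# versus 49 million outputs for a single flat classifier over every product.
```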
Of course, this wouldn't be an Ars article on deep learning if we didn't close it out with a cynical reminder about unintended consequences. The unspoken reality is that the neural network isn't actually learning to show shoppers what they asked for. Instead, it's learning how to turn queries into purchases. The neural network doesn't know or care what the human was actually searching for; it just has an idea what that human is most likely to buy—and without sufficient oversight, systems trained to increase outcome probabilities this way can end up suggesting baby products to women who've suffered miscarriages, or worse.