Saturday, 8 June 2013

Facebook introduces new Presto engine for querying 250 PB data warehouse

When Facebook launched in 2004, Mark Zuckerberg could hardly have imagined that it would one day grow to 1 billion users worldwide and become a publicly traded company.  All of those people bring with them a seemingly endless supply of images, videos and other assorted data.  To find faster ways of analyzing all of that data, Facebook has built a new query engine called Presto.
Previously, Facebook used Hive for all of their data analysis
On June 6, 2013, Facebook held a developers conference at its headquarters, where company engineers unveiled the new Presto query engine, which is designed to enable faster interactive analysis.  The company currently houses 250 petabytes of data, a figure that will keep growing with user activity and new account creation.
According to engineer Martin Traverso, nearly 900 Facebook employees use Presto, and together they scan approximately 320 terabytes of data each day.
“Historically, our data scientists and analysts have relied on Hive for data analysis.  The problem with Hive is it’s designed for batch processing,” Traverso said. “We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”
Facebook servers
Image credit: Flickr / Robert Scoble
Facebook created Hive a few years ago, when the company's data set was much smaller.  Hive's purpose was to give Hadoop a data warehouse layer with SQL-like query abilities.  Now, with 1 billion users' worth of data piling up, Hive is simply not fast enough to query all of it, because it relies on MapReduce to run each query.  That is the gap Presto was built to fill.
According to Facebook, scanning an entire dataset could sometimes take hours, which was costly and a real problem when answers were needed quickly.  Presto addresses this: smaller queries can run in a few hundred milliseconds instead of minutes, and larger queries can finish in minutes instead of hours.
Artist's rendering of the Iowa Facebook data center. In April of 2013, Facebook announced they would begin construction on a server farm in Altoona, Iowa. The construction was estimated to cost 300 million dollars and it would be the largest data center of its kind.
According to Traverso, Presto is Facebook’s counterpart to Cloudera’s Impala SQL query engine, and Facebook plans to release it as open source in the fall of this year.
Ravi Murthy, an engineering manager at Facebook, says that the company’s data storage is 4,000 times bigger than it was four years ago.  Murthy also said that it could eventually grow to the scale of one exabyte, which is 10^18 bytes (in decimal units), or 1,000 petabytes.
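To put those figures side by side, here is a rough back-of-the-envelope sketch in Python.  It assumes decimal (SI) units throughout and uses only the numbers quoted in this article; the variable names are made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
EB = 10**18

# Figures quoted in the article (assumed here to be decimal units).
warehouse_bytes = 250 * PB     # current size of the data warehouse
daily_scan_bytes = 320 * TB    # data scanned through Presto each day

# How many petabytes make up one exabyte.
petabytes_per_exabyte = EB // PB

# Today's warehouse as a fraction of one exabyte.
warehouse_share_of_exabyte = warehouse_bytes / EB

# Daily Presto scan volume as a fraction of the whole warehouse.
daily_scan_share_of_warehouse = daily_scan_bytes / warehouse_bytes

print(petabytes_per_exabyte)
print(warehouse_share_of_exabyte)
print(daily_scan_share_of_warehouse)
```

By this arithmetic the warehouse is already a quarter of the way to an exabyte, while a day of Presto scanning touches only a small fraction of a percent of what is stored.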
