Source: The New York Times
The business of Big Data, which involves collecting large amounts of data and then searching it for patterns and new revelations, is the result of cheap storage, abundant sensors and new software. It has become a multibillion-dollar industry in less than a decade. With growth at that speed, it is easy to miss how much remains to be done before the industry has proven standards. Until then, lots of customers are probably wasting much of their money.
There is essential work to be done training a core of people in very hard problems, like advanced statistics and software that ensures data quality and operational efficiency. Broad-based literacy in the uses of data is probably needed too, along with new kinds of management, better tools for reading the information, and privacy safeguards for corporate and personal information.
That such a huge number of tasks are taking place is a good indicator that, even with the hype, Big Data is a big deal. Last Friday, a number of technologists gathered at a forum hosted by the University of California, Berkeley, iSchool and talked about ways many of these jobs are being done. (Disclosure: I lecture at the iSchool, which is the school of information science, and moderated several panels there.) They talked about the progress so far, and identified a number of good ideas and businesses left to pursue.
In some ways, Big Data is about managing all kinds of weird new data, like social media updates from a mobile phone. It is hard to categorize in the first place, and may be used in lots of different ways, from advertising to traffic management. The so-called unstructured database of choice is by now pretty clearly Hadoop.
Cloudera, a leading software producer, is now training 1,500 people a month, mostly online, in how to use both the database and associated applications. According to Amr A. Awadallah, Cloudera’s chief technical officer, over 10,000 people have been trained on its system.
Data quality from new diverse sources is still a big problem, as is persuading companies and organizations to let others see data that might be more valuable in a commonly shared algorithm. “I’ve tried paying money for it, but it’s easier for companies to decide not to share,” said Gil Elbaz, the founder of Factual, a company that seeks to hold lots of online data. “The only way that works is to get them to take risks in exchange for data that is valuable to them.”
Much of the fear about exposing data, he said, has to do with competitors learning secrets. Mr. Elbaz thinks there is a good business in developing “de-identifiers” that can make data anonymous, and privacy insurers specializing in covering the costs of exposure.
On a personal level, others think the government or a trusted private institution will hold the personal identifiers of things like medical data, releasing it to trusted parties. “It’s a little scary that right now a cab driver using Uber knows more about you than a doctor, who has to take all of your information for the first time,” said Peter Skomoroch, principal data scientist at LinkedIn.
Another data-improving business consists of moving the world’s older data online. A company called Captricity aims to couple image capture from things like cellphone cameras with cheap workers in Amazon.com’s Mechanical Turk service, in order to put older handwritten documents into digital databases. The company’s early business is from government and charity sites in Africa and India, but there is no reason why it shouldn’t be valuable for most medical records. If someone took the trouble to write it down, the company figures, that is a good reason to assume it is valuable data.
There are other businesses trying to take the arcane side of Big Data into the mainstream, with easy-to-use statistical tools and new ways of visualizing data that make it easier to understand. Companies like ClearStory and Platfora “want to make it possible for businesses, for history majors, to use,” said Ben Werther, chief executive of Platfora. “We’re in the pre-industrial age of Big Data.” Martin Wattenberg, the creator of a well-known wind map, who is now at Google, talked about a necessary revolution in the design of data outcomes that has yet to become widespread.
Away from big online companies like Google, some of the earliest data-driven businesses include hedge funds and insurance companies, which have the money and a tradition of using lots of math. Another, Mr. Skomoroch said, is the Mormon Church. “It is the focus on genealogy data,” he said. “If you move, they’ll always find you.”
What Big Data is seeing now looks like the classic industrial curve. First comes the discovery of something big, which leads to the establishment of principles, like scientific rules. Science moves toward engineering as a means to manufacturing, resulting in mass deployment. Then things really change.