Data Science: What Data?
Data science, data analytics, and big data are all topics that have generated rising buzz over the last few years. As with many “new” tech topics, much of what these terms encompass is not new at all. There are clear ties to existing activity in areas like data mining, decision support systems, business intelligence, and visualization. So what’s new, and why the new terms and growing buzz?
One key to the shift in discussion clearly is the data itself. Several categories of data are simply exploding in size and importance. In trying to get your head around data science, it seems useful to categorize the types of data involved. My current mental model is that three broad categories of data are relevant to discussions of data science. They are:
Human Generated Data
The volume of data published on the Web by individuals is truly one of the amazing features of our time. And the publication rate and variety of this data continue to accelerate. For anyone interested in what people are doing and thinking, this is a total game changer. Some examples of data in this category are:
- Clickstreams and navigation histories of Web activity
- Tweets – person-to-person message interactions
- Facebook, LinkedIn, and other semi-public records of people’s lives and interactions
- Citizen science – data gathering in support of science by interested non-scientists
Device Generated Data
Devices have generated massive amounts of data for decades, with areas like medicine, lab science, and aerospace providing ready examples. But the number and type of devices that create large data streams accessible via the Web is rising sharply. Projecting forward to a fully instrumented intelligent infrastructure, the device-generated data produced so far is barely a trickle compared to what is coming. Some examples of data in this category are:
- Scientific devices – e.g., medical and molecular imaging
- Sensors – intelligent infrastructure
- Video and audio capture – traffic cams; security cams
Newly Accessible Data
As more and more of the world’s data shifts online, legacy data sources take on new meaning. Much of this is data that previously existed only on paper, or that was computerized but kept off the Net. It includes data that may have been available before, but that was prohibitively expensive and time-consuming to access and aggregate. Examples of data in this category are:
- Real estate transactions
- Legal filings
- Price data
The iSchool at Drexel has active research efforts that address a variety of topics related to data science. Our degree programs increasingly address these topics too. And clearly the development of education for data science has just begun.