Data volumes have grown sharply in the past five to ten years. Teams that once managed gigabytes of data now manage petabytes or even exabytes. Much of this growth comes from the ability to stream data into systems in real time, along with the wider variety of data sources now available.
At a high level, there are three areas in the data space:
- Data capture / integration. This is the core of how you’re getting the data and processing it.
- Analysis for insights. Using algorithmic methods can help uncover patterns and anomalies.
- Data visualization. These visualizations let you illustrate the results of your analysis in a way that’s easy to understand.
Data Initiatives at MIT
CSAIL has a big data initiative that’s keyed into research, looking at the effects of how you process and analyze data. MIT Sloan has multiple initiatives on big data and just announced a new MBA program with a concentration in data analysis. The Office of Sustainability and MIT Facilities are using data to chart energy use on campus.
IS&T’s Data Science Team is about a year old and has eight members. The team is working on several fronts, starting with the basics: What are MIT’s needs – as a campus – for data analysis?
On the security side, the team is working on Domain Name System (DNS) anomaly detection on the network: analyzing network logs and figuring out which areas to zero in on for anomalies. Security analysts can then pull apart that data and decide: “We should take a closer look at this” or “It’s an anomaly, but no big deal.”
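As a rough illustration of what "zeroing in" on anomalies in network logs can look like, here is a minimal Python sketch. It is not the team's actual method. The log format (timestamp, client address, queried domain, separated by spaces), the function name, and the threshold are all assumptions made for this example. It flags clients whose DNS query volume is far above the median, using a modified z-score based on the median absolute deviation.

```python
from collections import Counter
from statistics import median


def flag_anomalous_clients(log_lines, threshold=3.5):
    """Return {client: query_count} for clients with unusually high DNS volume.

    Assumes a hypothetical log format: "<timestamp> <client_ip> <domain>".
    Uses the modified z-score, 0.6745 * (x - median) / MAD, which is less
    sensitive to the outliers themselves than a mean-based z-score.
    """
    counts = Counter(line.split()[1] for line in log_lines if line.strip())
    if not counts:
        return {}
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All clients look alike; nothing stands out by this measure.
        return {}
    return {
        client: n
        for client, n in counts.items()
        if 0.6745 * (n - med) / mad > threshold
    }


# Made-up sample data: five ordinary clients and one very chatty one.
sample_counts = {
    "10.0.0.1": 10,
    "10.0.0.2": 11,
    "10.0.0.3": 9,
    "10.0.0.4": 10,
    "10.0.0.5": 12,
    "10.0.0.99": 200,
}
sample_log = []
for client, n in sample_counts.items():
    sample_log += [f"2016-01-01T00:00:00 {client} example.org"] * n

flagged = flag_anomalous_clients(sample_log)
```

A result like this only marks candidates. As described above, an analyst still decides whether a flagged client deserves a closer look or is no big deal.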
Sustainability is a major focus for MIT, one of the themes of MIT 2030. MIT is constructing several buildings, both commercial and research-related, in Kendall Square and elsewhere. Energy efficiency is a key component for these new buildings, as well as for the campus as a whole. To move ahead on this front, research and administrative areas around campus need data.
So IS&T is building an MIT Data Hub. It’s a platform for aggregating data from many different sources around campus and making it available at the appropriate level: to the public, only to MIT researchers and staff, or, in some cases, only to a specific project.
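The three levels of availability can be pictured as a simple visibility check. The Python sketch below is purely illustrative: the class names, fields, and `can_read` function are hypothetical and say nothing about how the Data Hub actually enforces access.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PUBLIC = "public"    # anyone may read
    MIT = "mit"          # MIT researchers and staff only
    PROJECT = "project"  # members of one named project only


@dataclass
class Dataset:
    name: str
    visibility: Visibility
    project: Optional[str] = None  # only meaningful for PROJECT visibility


@dataclass
class User:
    is_mit: bool = False
    projects: frozenset = frozenset()


def can_read(user: User, dataset: Dataset) -> bool:
    """Decide whether a user may read a data set under the three-tier model."""
    if dataset.visibility is Visibility.PUBLIC:
        return True
    if dataset.visibility is Visibility.MIT:
        return user.is_mit
    # PROJECT visibility: the user must belong to the data set's project.
    return dataset.project is not None and dataset.project in user.projects
```

For example, a default `User()` stands for an anonymous visitor, who can read only data sets marked `Visibility.PUBLIC`.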
IS&T is working with Facilities to capture energy and utility data from their systems and from sensors installed on buildings around campus. The department is also collecting data on parking and transportation, waste, and a range of other categories.
The Data Science Team is streaming that data and processing it as it comes in, using R to develop algorithms for analysis. [R is a programming language and software environment for statistical computing and graphics.] The team can integrate these data sources and perform data blends, bringing it all together. In this way IS&T is helping Sustainability and Facilities figure out where MIT’s dollars would best be spent in terms of improving energy efficiency.
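To make the idea of a "data blend" concrete, here is a small sketch that joins incoming meter readings with building records and ranks buildings by energy use per square foot. The team does this work in R; Python is used here only for illustration. The field names, the sample figures, and the kWh-per-square-foot ranking are all assumptions for the example, not the team's actual analysis.

```python
from collections import defaultdict


def blend_energy_with_buildings(readings, buildings):
    """Join meter readings with building records and rank by kWh per sq ft.

    readings:  iterable of (building_id, kwh) pairs, consumed one at a time
               the way a stream would deliver them.
    buildings: dict mapping building_id to a record with a "sqft" field.
    """
    totals = defaultdict(float)
    for building_id, kwh in readings:
        totals[building_id] += kwh

    blended = [
        {
            "building": building_id,
            "kwh": kwh,
            "kwh_per_sqft": kwh / buildings[building_id]["sqft"],
        }
        for building_id, kwh in totals.items()
        if building_id in buildings  # skip readings with no matching record
    ]
    # Highest energy intensity first: candidates for efficiency work.
    return sorted(blended, key=lambda row: row["kwh_per_sqft"], reverse=True)


# Made-up sample data for two hypothetical buildings.
sample_buildings = {"A": {"sqft": 1000}, "B": {"sqft": 4000}}
sample_readings = [("A", 50.0), ("B", 100.0), ("A", 50.0), ("Z", 5.0)]

ranked = blend_energy_with_buildings(iter(sample_readings), sample_buildings)
```

Neither data source answers the spending question alone. Combining the two is what makes a ranking like this possible.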
The Data Science Team has also partnered with CSAIL to do more efficient data discovery. It’s developing programs with them that can look at data sources dynamically and pull out associated metadata on the fly. You can write a query into a web application and get information about all of the data sources you’re potentially interested in.
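A much-simplified picture of pulling out metadata on the fly: the sketch below reads the header of each CSV source at query time, builds a small metadata record, and returns the sources whose name or column names match a keyword. The source names, sample contents, and function names are hypothetical. The programs developed with CSAIL are not described in enough detail here to reproduce.

```python
import csv
import io


def extract_metadata(name, csv_text):
    """Build a small metadata record for one CSV source."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])           # empty source gives an empty column list
    row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return {"source": name, "columns": header, "rows": row_count}


def search_sources(sources, keyword):
    """Return metadata for every source whose name or columns match keyword."""
    keyword = keyword.lower()
    hits = []
    for name, csv_text in sources.items():
        meta = extract_metadata(name, csv_text)  # computed at query time
        if keyword in name.lower() or any(
            keyword in column.lower() for column in meta["columns"]
        ):
            hits.append(meta)
    return hits


# Made-up sample sources standing in for real campus data sets.
sample_sources = {
    "energy_meters": "building,timestamp,kwh\nA,2016-01-01,5\nB,2016-01-01,7\n",
    "parking": "lot,date,occupied\nN,2016-01-01,120\n",
}

kwh_hits = search_sources(sample_sources, "kwh")
```

In a web application, the keyword would come from the query a user types in, and the hits would be shown as the list of potentially interesting data sources.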
With respect to data visualization, the team is using RStudio and the JavaScript library D3.js to build a visualization gallery. As the team members work on different projects, they post any shareable output to the Data Hub. People can look at the data visualizations the team has developed, along with the back-end code.
More about the Data Hub
IS&T is using Amazon Web Services (AWS) to manage the Data Hub infrastructure; Amazon S3 (Simple Storage Service) for data set storage; Amazon EMR (Elastic MapReduce, a managed service for running Hadoop, an open-source framework) to do the data processing; and a variety of tools (Hadoop, Spark, SparkR) to augment this work.
IS&T’s vision is to have a central place for data discovery and storage, but also the ability to spin up analytical clusters on the fly using AWS. Your group can fire up a cluster with data sets of interest and keep it for as long as you need. The cluster will perform your analysis and store the results.
IS&T has a version of this running right now. In classic engineer fashion, it’s not as user-friendly as it could be: you have to do some programming to get started. The Data Science Team is working hard to make it more accessible; it’s developing a user interface for the Data Hub, so that using it will be as simple as click and go.
In some ways the task is daunting, but it’s also fun because the team gets to work with a lot of different groups around campus with different types of data issues. CSAIL, for example, has many programmers and data scientists on staff. They’re not coming to the Data Science Team for that. But they do come to IS&T for infrastructure. They can build on top of it and do their own analysis. The Office of Sustainability has only one data analyst on staff, so the Data Science Team can provide some value in helping them with analysis.
The current model for the team is that it will partner with you in whatever way you need. It’s worked well for the past year and there are three or four new projects lined up for this coming fiscal year (starting July 1). It’s a very dynamic space.
If you or your department, lab or center have questions about the Data Hub or working with IS&T’s Data Science Team, you can reach them at firstname.lastname@example.org.