Where Art Thou Big Data? Insights from CSAIL's Kalyan Veeramachaneni
April 30, 2014
Robyn Fizz
IMAGE: KOTIST / BIGSTOCK

Kalyan Veeramachaneni thinks big. Together with Dr. Una-May O'Reilly, he leads multiple Data Science efforts in the Anyscale Learning for All (ALFA) group in CSAIL. One primary focus is on knowledge discovery for education, where the frame of reference is Massive Open Online Courses (MOOCs). MOOCs provide millions of data points; finding patterns in this data could help MOOC instructors teach more effectively and improve student engagement.

Veeramachaneni will discuss his work at an upcoming xTalk, Where Art Thou Big Data? Identifying & Harnessing Sources of Data for MOOC Data Science, on May 14. He'll cover the predictive models ALFA is trying to build and the challenges involved in building them. Here’s a high-level preview of his presentation. Note: Dr. O'Reilly presented a related xTalk last December on Taming MOOC Big Data.

Predicting Stopout
One model ALFA has built predicts “stopout” – that is, when students leave a MOOC and don’t come back or when they stop participating. This model predicts ahead of time when a student is going to leave a course or is showing signs that he or she will leave. The group’s definition of leaving a course is “stopped engaging in problem-solving.”
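
As a rough illustration of that definition, the sketch below labels a student's stopout point from the weeks in which he or she submitted problems. The weekly granularity, the function name `stopout_week`, and the treatment of students who never submit are assumptions made for this example; the article does not describe how ALFA encodes the label.

```python
from typing import Iterable, Optional


def stopout_week(submission_weeks: Iterable[int], course_weeks: int) -> Optional[int]:
    """Return the first week after the student's last problem submission.

    Weeks are numbered 1..course_weeks (an assumption for this sketch).
    Returns None when the student was still submitting in the final week.
    """
    weeks = [w for w in submission_weeks if 1 <= w <= course_weeks]
    if not weeks:
        # Never attempted a problem: treated here as stopping out in week 1.
        return 1
    last_active = max(weeks)
    return None if last_active == course_weeks else last_active + 1


# A student who submitted in weeks 1-3 of a 10-week course.
print(stopout_week([1, 2, 3], course_weeks=10))
```

A predictive model would then try to anticipate this label from behavior observed in earlier weeks.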

Building predictive models requires machine learning and involves feature definition and engineering. (In the field of machine learning, features are what we think of as variables.) To build a predictive model for stopout, you need to design variables that can characterize student behavior. Veeramachaneni notes that, for stopout prediction, it is ideal to identify students who are similar to each other. This requires capturing a number of complex variables and then matching students based on those variables.

Examples of variables include “How many times on average does a student go to a forum in between attempting problems?” and “How close to a deadline does a student submit homework?” One of the biggest challenges for successful modeling is coming up with these variables or hypotheses.
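
To make the second example concrete, here is a minimal sketch of a deadline-proximity variable: the average number of hours between a student's submissions and the corresponding deadlines. The input layout (a list of `(problem_id, timestamp)` pairs and a dictionary of deadlines) and the function name are hypothetical, chosen only for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_hours_before_deadline(submissions, deadlines):
    """Average hours between each submission and its problem's deadline.

    submissions: list of (problem_id, datetime) pairs (hypothetical layout).
    deadlines:   dict mapping problem_id -> datetime.
    Positive values mean the work was submitted before the deadline.
    Returns None if no submission has a known deadline.
    """
    gaps = [
        (deadlines[pid] - ts).total_seconds() / 3600.0
        for pid, ts in submissions
        if pid in deadlines
    ]
    return mean(gaps) if gaps else None


deadlines = {"hw1": datetime(2014, 3, 10, 23, 0)}
submissions = [
    ("hw1", datetime(2014, 3, 10, 21, 0)),  # 2 hours early
    ("hw1", datetime(2014, 3, 10, 17, 0)),  # 6 hours early
]
print(mean_hours_before_deadline(submissions, deadlines))
```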

The Importance of Input
Veeramachaneni recalls how he and his colleagues approached deciding on the variables for predicting stopout. First, they sat around a table and discussed what made sense to capture.

Next, they got input from faculty and students in an EECS UROP course, 6.MITx, where the focus was on building web apps for educational technology. They asked the 6.MITx cohort what would predict stopout and got persuasive arguments about why certain variables would be effective predictors. While there was some overlap with the variables ALFA had come up with, an interesting set of new variables emerged.

Some of these variables measured students in relation to other students, rather than in absolute terms. One example: “What percentile does this student fall into in terms of time spent on the course?” ALFA moved forward with its modeling after analyzing the feedback from the 6.MITx faculty and students.
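
A relative variable of this kind can be sketched as a percentile rank. The version below uses the mid-rank convention (ties count as half below, half above), which is one common definition and an assumption here; the article does not say which definition ALFA uses.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value, population):
    """Mid-rank percentile of `value` within `population`, from 0 to 100."""
    if not population:
        raise ValueError("population must not be empty")
    ordered = sorted(population)
    below = bisect_left(ordered, value)           # entries strictly less than value
    equal = bisect_right(ordered, value) - below  # entries equal to value
    return 100.0 * (below + 0.5 * equal) / len(ordered)


# Hours spent on the course by four hypothetical students.
hours = [10, 20, 30, 40]
print(percentile_rank(30, hours))
```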

Now Veeramachaneni wants to open up the opportunity to hypothesize variables to a much wider audience. To do this, ALFA has been developing a web-based interactive platform where people can submit their ideas and others can comment. He hopes to release it in time for his talk on May 14, so that those in attendance can participate.

The Challenges of Feature Engineering
Veeramachaneni and his group have encountered three challenges when doing feature engineering with MOOC databases.

  1. The data sets are large. In examining a MOOC with 150,000 students, the researchers have to parse about 200 million clickstream events and approximately 8 million submissions.
     
  2. Data is complex and intricately connected. To answer one question – for example, the number of times a student goes to a forum between trying problems during a week – researchers have to go to three different places in the dataset. They have to extract time stamps for a student’s attempts and also go to the forums to see if a student participated in the discussion in between those time stamps. Computationally, it’s a complicated query to write. Says Veeramachaneni, “The complexity of extracting data from a huge MOOC dataset takes on a life of its own.”
     
  3. Variables of interest vary widely. When you ask people from different groups and disciplines “What are your variables of interest?” everyone has a different answer. So choosing variable definitions is an exercise in itself.
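
The forum question from the second challenge can be sketched in a few lines once the relevant timestamps have been pulled out of the dataset. This is a simplified, in-memory illustration: it assumes the attempt times and forum-visit times for one student and one week are already available as two lists, whereas the hard part described above is extracting and joining them from the raw data. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def avg_forum_visits_between_attempts(attempt_times, forum_times):
    """Average number of forum visits falling between consecutive attempts.

    Both arguments are lists of comparable timestamps (numbers or datetimes).
    Returns None when there are fewer than two attempts, since no gap exists.
    """
    attempts = sorted(attempt_times)
    visits = sorted(forum_times)
    if len(attempts) < 2:
        return None
    counts = []
    for start, end in zip(attempts, attempts[1:]):
        # Count visits strictly after `start` and strictly before `end`.
        in_gap = bisect_left(visits, end) - bisect_right(visits, start)
        counts.append(max(0, in_gap))
    return sum(counts) / len(counts)


# Timestamps in minutes: three attempts, four forum visits.
print(avg_forum_visits_between_attempts([0, 10, 20], [1, 5, 12, 25]))
```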

Building Successful Predictive Models
Based on well-defined variables and carefully developed coding, ALFA is now able to predict MOOC stopout with very good accuracy. In doing so, it has taken advantage of what’s known as scalable and agile machine learning.

For ALFA, this means two things. In terms of big data, it means building a machine learning model by breaking it down to run in parallel on many computers.

In the context of data science, it means being able to try out many different models simultaneously in a very agile fashion, then summarizing and seeing which models provide the best explanation. ALFA develops anywhere from 150 to 800 models before it hits on one that really captures what it's trying to predict.
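
The idea of scoring many candidate models side by side and keeping the best one can be illustrated with a toy example. Everything in it is an assumption made for illustration: the "models" are simple threshold rules on a single variable, a thread pool on one machine stands in for many computers, and the data is made up. It is not ALFA's software or method.

```python
from concurrent.futures import ThreadPoolExecutor


def accuracy(threshold, rows):
    """Fraction of rows where the rule 'feature < threshold' matches the label.

    rows: list of (feature_value, stopped_out) pairs, ideally held-out data.
    """
    hits = sum((x < threshold) == label for x, label in rows)
    return hits / len(rows)


def best_threshold(thresholds, rows, workers=4):
    """Score every candidate rule concurrently; return (threshold, accuracy)."""
    thresholds = list(thresholds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda t: accuracy(t, rows), thresholds))
    return max(zip(thresholds, scores), key=lambda pair: pair[1])


# Hypothetical data: (hours per week on the course, stopped out?)
rows = [(1.0, True), (2.0, True), (6.0, False), (8.0, False)]
print(best_threshold([0.5, 4.0, 9.0], rows))
```

In a real setting each candidate would be a full model trained on one partition of the data and scored on another, but the pattern of fanning out candidates and summarizing their scores is the same.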

Veeramachaneni hopes to release tools and software for analyzing MOOC data in the next few months. Ultimately, ALFA plans to release a platform where users can contribute and share scripts. With standardized schemas and crowdsourcing of data analytics, he hopes to help unlock the full potential of MOOCs.