Shashank Srikant


Full list on Google Scholar

Machine learning

Automatic grading of computer programs

Work done with Varun Aggarwal and Gursimran Singh.
Published at: KDD 2014 [link], KDD 2016 [link]
[Blog post]    [Report on IEEE spectrum]

We designed Automata, a tool to holistically grade computer programs. A critical component was determining whether a program was semantically 'close' to a correct answer. This was traditionally done either by evaluating test suites or by solving constraint equations using SAT solvers. The former is an error-prone metric, while the latter is infeasible for real-time systems. We designed a grammar for generating features from the abstract syntax trees of programs. These features captured data and control dependencies between variables, which we showed to provide significant additional information. Supervised models learnt on these features had a correlation of ~0.9 with expert raters, rivaling inter-rater correlations.
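To give a flavour of what features from an abstract syntax tree can look like, here is a minimal sketch using Python's built-in `ast` module. It is not the Automata grammar: the feature names (`node:...`, `dep:...`) and the notion of control context used here are assumptions made purely for illustration.

```python
import ast
from collections import Counter


def ast_features(source: str) -> Counter:
    """Count simple structural features of a Python program.

    The feature names are hypothetical; this is not the Automata grammar.
    """
    tree = ast.parse(source)
    features = Counter()

    def visit(node, context):
        # Control context: the chain of enclosing loop/branch constructs.
        if isinstance(node, (ast.For, ast.While)):
            context = context + ("loop",)
        elif isinstance(node, ast.If):
            context = context + ("if",)

        # Node-type counts (e.g. how many loops or branches appear).
        features["node:" + type(node).__name__] += 1

        # A crude data-dependency feature: which variables feed an
        # assignment, and under which control context it happens.
        if isinstance(node, (ast.Assign, ast.AugAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            used = sorted({n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)})
            where = ">".join(context) or "top"
            for target in targets:
                if isinstance(target, ast.Name):
                    for name in used:
                        features["dep:{}<-{}@{}".format(target.id, name, where)] += 1

        for child in ast.iter_child_nodes(node):
            visit(child, context)

    visit(tree, ())
    return features


# A made-up student program that sums the positive entries of a list.
program = """
total = 0
for x in values:
    if x > 0:
        total += x
"""
features = ast_features(program)
```

Here `features` records, among other counts, that `total` depends on `x` inside an `if` nested in a loop. Counts like these could then be fed to any supervised model trained against expert grades.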

A particular bottleneck in our KDD '14 work was building predictive models for every programming question we designed, which is a time-consuming and human-intensive effort. Extending our work, we conceived a way to transform program features from each question onto a question-independent space. This question-independent space quantified a program's correctness without the help of labeled data. Such a metric allowed us to compare programs attempting different questions on the same scale. We showed that supervised models learnt on these transformed features were able to predict as well as those learnt on question-specific features.

Analyzing test case statistics

Work done with Varun Aggarwal. Used internally in the group as a tool
[Blog post]

Test cases evaluate whether a computer program is doing what it's supposed to do. There are various ways to generate them: automatically from specifications (say, by ensuring code coverage), or manually by subject matter experts (SMEs) who think through conditions based on the problem specification. We asked ourselves whether there was something we could learn by looking at how student programs responded to test cases. Could this help us design better test cases or find flaws in them? By looking at such responses from a data-driven perspective, we wanted to know whether we could (a) design better test cases, (b) find clusters in the way programs responded to test cases, and (c) discover the salient concepts needed to solve a particular programming problem, which would in turn inform the right pedagogical interventions. We built a tool to look at statistics on over 2500 test cases spread across over fifty programming problems, attempted by nearly 18,000 students and job-seekers in a span of four weeks!
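As a small illustration of the kind of statistics one can compute from such responses, the sketch below works on a made-up pass/fail matrix. The data and the two checks (test cases that do not discriminate between submissions, and pairs that always agree) are assumptions for illustration, not a description of the internal tool.

```python
from itertools import combinations

# Hypothetical response matrix: one row per submission, one column per
# test case; 1 means the submission passed that test case.
responses = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]


def pass_rates(matrix):
    """Fraction of submissions passing each test case."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]


def agreement(matrix, i, j):
    """Fraction of submissions with the same outcome on test cases i and j."""
    return sum(row[i] == row[j] for row in matrix) / len(matrix)


rates = pass_rates(responses)

# Test cases that everyone passes (or fails) tell submissions apart poorly.
non_discriminating = [i for i, r in enumerate(rates) if r in (0.0, 1.0)]

# Pairs of test cases that always agree are candidates for redundancy.
redundant_pairs = [
    (i, j)
    for i, j in combinations(range(len(rates)), 2)
    if agreement(responses, i, j) == 1.0
]
```

On this toy matrix, test case 0 is passed by every submission, and test cases 1 and 2 always agree, so both would be flagged for a closer look.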

Learning models for job selection

Work done with Vinay Shashidhar and Varun Aggarwal
Published at: Workshop on ML for education, ICML 2015 [link]
[Blog post]
We addressed the problem of designing classifiers to match skills (as measured by our standardized tests) to the right job roles. The models needed to follow these constraints -
  • They needed to be theoretically plausible. For instance, a candidate with a higher test score cannot be rejected while accepting one with a lower score. We identified coordinate-wise monotonicity as the weakest structure needed for this to happen.
  • They needed to be simple and human interpretable. This is important because we cannot completely depend on data given its non-causal nature and the sample bias it may contain.
  • They needed to provide a suite of trade-off models with different type-1 and type-2 errors, allowing HR personnel to accommodate prevalent market conditions and their companies' standards.
Our formal experimental setup answered the following questions -
  • Could we build employability benchmarks with acceptable type-1 and type-2 errors using our techniques? If organizations use our benchmark for hiring, will they be able to reduce hiring unsatisfactory employees without adverse selection?
  • Does an ensemble of linear models provide better prediction accuracy than a single, linear classification model?
  • What insight and knowledge discovery may happen by studying our models? Could we discover what combination of parameters and variables determine employability for a job sector?
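
The coordinate-wise monotonicity constraint mentioned above can be illustrated with a short sketch. It assumes linear scoring rules with non-negative weights, which is one simple way (not necessarily the one used in the paper) to guarantee that a higher-scoring candidate is never rejected while a lower-scoring one is accepted. The weights, thresholds, and score grid are made up.

```python
from itertools import product


def make_linear_rule(weights, threshold):
    """Accept a candidate when the weighted score reaches the threshold.

    Non-negative weights are sufficient for coordinate-wise monotonicity:
    raising any single test score can never turn an accept into a reject.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return lambda scores: sum(w * s for w, s in zip(weights, scores)) >= threshold


def is_monotone(rule, grid, dims):
    """Brute-force monotonicity check over a small grid of score vectors."""
    points = list(product(grid, repeat=dims))
    for a in points:
        for b in points:
            dominates = all(x >= y for x, y in zip(a, b))
            if dominates and rule(b) and not rule(a):
                return False
    return True


# Hypothetical rules over two test scores, each on a 0-100 scale.
strict = make_linear_rule([0.7, 0.3], threshold=60)
lenient = make_linear_rule([0.5, 0.5], threshold=45)

# Accepting when either rule accepts keeps the ensemble monotone.
either = lambda scores: strict(scores) or lenient(scores)

# A rule with an effectively negative weight, to show a violation.
bad_rule = lambda scores: scores[0] - scores[1] >= 10

grid = range(0, 101, 20)
```

Moving the threshold, or switching between `strict` and `lenient`, is one way to obtain the suite of trade-off models with different type-1 and type-2 errors.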

A framework to apply machine learning to subjective assessment tasks

Work done with Varun Aggarwal and Vinay Shashidhar
Published at: Workshop on data driven education, NIPS 2012 [link]
[PDF]     [Related whitepaper]
This is a neat position paper which describes the broader framework for grading open-ended assessments using machine learning. It gives an idea of how one should pick out a problem in open assessments and cast it as a problem in ML. This work highlights how solving open-ended assessment tasks generally entails some very hard problems which have no good answers, such as -
  • Getting labeled data - The more complex the domain, the harder, more time-consuming and noisier this is going to get. There have been clever workarounds which we demonstrated in areas like speech processing and computer programming. Nevertheless, this is something to reckon with.
  • Low data - Say goodbye to big data. Getting large samples of labeled data will usually be impractical, resulting in small labeled data-sets to work with. This work will generally require applying and evaluating the basics of ML. Be prepared.
  • Sparse data - Say hello to sparse data. The number of features generated in these problems will typically exceed the number of labeled data points you have. This is very similar to problems in the domain of biology and genetics. Again, no good answers; the first-principles approach of penalizing your models appropriately is the way ahead.
Also check out the white-paper [link] we came out with which discusses the state of the art in the data-driven assessments space. This was released as part of ASSESS 2014.
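
To make the point about penalizing models concrete, here is a minimal sketch of ridge (L2-penalized) regression on a toy problem with more features than labeled points. Ridge is chosen only as a familiar example of a penalty; the position paper does not prescribe it, and the data below is made up.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination."""
    p = len(X[0])
    # Augmented system [A | b] for the penalized normal equations.
    A = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0) for j in range(p)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (A[i][p] - sum(A[i][j] * w[j] for j in range(i + 1, p))) / A[i][i]
    return w


def l2_norm(w):
    return sum(v * v for v in w) ** 0.5


def predict(X, w):
    return [sum(a * b for a, b in zip(row, w)) for row in X]


# Toy data: 3 labeled points, 5 features. Without a penalty, X^T X is
# singular and the least-squares weights are not unique.
X = [
    [1.0, 0.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 3.0, 1.0],
    [2.0, 1.0, 0.0, 0.0, 0.0],
]
y = [3.0, 1.0, 2.0]

weak = ridge_fit(X, y, lam=0.01)    # light penalty: nearly interpolates y
strong = ridge_fit(X, y, lam=10.0)  # heavy penalty: weights shrink towards 0
```

The penalty is what makes the system solvable at all here, and increasing `lam` trades fit on the few labeled points for smaller weights. In practice `lam` would be chosen by cross-validation.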

Predicting web-surfers' interests from their interaction with web browsers

Work done with Kunal Sangwan, Sohil Arora and Dr. Jitender Chhabra.
We designed a browser add-on which would sniff the URLs that a user visited. Based on the keywords and meta information found on each webpage, we would infer the user's interests. Based on these interests, relevant pages from the internet would be shown to the user to increase serendipitous knowledge acquisition. This work had the following key components -
  • Probabilistic LSA - To determine topics corresponding to the text in a webpage
  • Feature engineering - Once we had the topics, we had a layer computing some more informative features like the user's cumulative interest list (see figure below), time spent on a page etc.
  • Linear regression - To learn the weights of an equation which modeled a user's interests, given our engineered features
  • Scraping engine - To pull out webpages given a set of keywords - used for recommendation
Unfortunately, the codebase is not easily accessible right now. If anyone's interested in this, please write me a note and I'll put in the effort to dig it out and host it on GitHub.

Predicting user movement in public transport networks

Work done with Kunal Sangwan, Sohil Arora and Dr. Jitender Chhabra.
Submitted to a student olympiad organized at ICSE 2011
[Report]    [Slides]
We participated in SCORE 2011, a student olympiad organized at ICSE, a top-tier conference in software engineering (see CORE ranking). The problem statement we picked required us to design a smart transportation system. The idea was to demonstrate sound software engineering principles in tackling a fairly complicated problem. In particular, we solved the problem of building personalized predictive models for users traveling through a network. Some highlights -
  • Learn user-specific models and store thresholds against each user
  • Learn thresholds to determine if a path is frequently traversed by a user
  • Scan each learnt path against paths reported to have downtimes
  • Use Floyd-Warshall to recompute alternate paths and suggest them to the user
The system design allowed us to scale predictions in real-time to the user-base. We validated the entire setup on synthetic datasets we generated.
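
The rerouting step in the highlights above can be sketched as follows. This is a minimal Floyd-Warshall implementation with path reconstruction, not our original code; the stop names, travel times, and the reported downtime are all made up, and links are assumed to be usable in both directions.

```python
import math


def floyd_warshall(nodes, edges):
    """All-pairs shortest paths; returns distance and next-hop tables."""
    dist = {u: {v: (0.0 if u == v else math.inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        # Assumption: every link can be traveled in both directions.
        for a, b in ((u, v), (v, u)):
            if w < dist[a][b]:
                dist[a][b] = w
                nxt[a][b] = b
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def path(nxt, src, dst):
    """Rebuild the route from the next-hop table; [] if unreachable."""
    if src != dst and nxt[src][dst] is None:
        return []
    route = [src]
    while src != dst:
        src = nxt[src][dst]
        route.append(src)
    return route


# Hypothetical network: stops and travel times in minutes.
stops = ["A", "B", "C", "D"]
links = {("A", "B"): 4, ("B", "D"): 3, ("A", "C"): 5, ("C", "D"): 6}

_, nxt = floyd_warshall(stops, links)
usual_route = path(nxt, "A", "D")

# Suppose the B-D link is reported down: drop it and recompute.
down = {("B", "D")}
working = {edge: w for edge, w in links.items() if edge not in down}
_, nxt_alt = floyd_warshall(stops, working)
alternate_route = path(nxt_alt, "A", "D")
```

With the B-D link down, the suggested route from A to D switches from going via B to going via C. Floyd-Warshall runs in cubic time in the number of stops, so recomputing all pairs like this suits small to mid-sized networks.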

Education & Society

Introducing data science to kids

Work done with Varun Aggarwal
Published at: SIGCSE 2017 [link]
[PDF]    [GitHub]    [My opinion piece]    [Talk slides]    [Tutorial slides]
This was a fun project from the word go. We came up with a carefully designed tutorial to teach kids the basics of data science. It involved hands-on exercises designed in MS Excel. We also designed a super simple decision-tree based classifier which could provide a measure of classifier accuracy.
This led to co-founding Data science for kids. Check it out pronto!

Why Bihar can save only 0.3% of its malnourished children

Work done with Nishant Ojha, Director, Seva Setu
[Press report]
Bihar, a state in India, does poorly on global metrics measuring infant and maternal health. The state reports 50% of its children below the age of 5 to be malnourished. What is being done about it? Can the current infrastructure ensure this issue is tackled swiftly and efficiently? We dug up reams of government documents (obtained through the very helpful Right to Information Act) and realized the situation is far worse than imagined. Our results were published in India's first data-journalism initiative.

Seva Setu's Each one, Reach one program

Contributions by Yogesh Kaushik and other open source enthusiasts
[Live site]    [FAQs]    [Source code]
In our work at Seva Setu, we focused on beneficiaries of maternal healthcare provided by the state. We observed that mothers were expected to understand and keep track of non-trivial schedules regarding pregnancy-related medical attention. To add to this, medical officials were not proactive in engaging with communities and hand-holding them where required. As a result, we noticed in these communities a general ignorance towards maternity-related medical care. The Each one Reach one program is an attempt at modularly separating information dissemination from actionable steps. We connect young rural mothers to women from urban India who provide them with timely information on the medical care they need. They also follow up on the actionable steps taken by their rural peers and advise them if they are found to be off-track. The call and maternity schedules of each of these mothers are automated and managed via a web-based tool.

If you're into web-based software, write to me to contribute to this tool.
If you're an urban woman, register on the site to get started!


Mirth and aggression

Work done with Aditya Singh, M.Sc. IIT Gandhinagar
[Preliminary report]
In humans, the act of laughter generally conveys happiness, is innate, and has little variability across individuals. But can something as opposite as pain induce laughter? Try this simple experiment - surprise a person you know very well by sitting on him/her while s/he is lying down. The pain you cause the person shall be very real and agonizing. However, you'll observe that the person responds by incessantly laughing! Is there an explanation? Is there a larger theory here which can explain this unusual neural pathway and response? We devised a series of experiments to prove a set of hypotheses we came up with. Work yet to be published.


Interactive skill-demand maps

Work done with Bhanu Pratap and Varun Aggarwal
[Blog post]     [Live site]
We developed a neat interactive map of the US where one can check which states demand particular skills. Consider these use-cases:
  • You've moved into a new town and want to understand what the "hot" skills in that town are. Is it something you can pick up to eventually land a job?
  • You want to understand which city/state pays the highest for being a <insert your skill/profession>. You want to make an informed choice and head to that city right away.
The map that we developed helps answer these questions! I contributed by analyzing, vetting and correcting the scraped data. It's a fun tool - check it out.

Parallelization of weighted sequence comparison using EBWT

Work done with Binay Pandey and Dr. Rajdeep Niyogi.
We were exploring whether we could improve the performance of a then-new compression algorithm. By the end of my term, I was unable to come up with a neat parallelized algorithm which would give me a 200X gain (I firmly believed then that I had superhuman skills). I had to settle for a 5X speed-up of core functionality and a 2X speed-up of overall performance as compared to the performance on a high-end CPU.