Interactive AI Graduate Intern (Data Condensation)
Internship: Evaluation of dataset condensation methods in (biodiversity) datasets
Intel is developing solutions for interactively developing AI models and deploying them efficiently in practice. You will be part of the incremental learning and data exploration research group. There will be daily and weekly meetings with your supervisors; as the teams are distributed, some of these meetings will be held remotely.
Intel has an office in Groningen, Netherlands (Interactive AI), but you are welcome to work from other sites within the Netherlands, depending on the current COVID situation.
The incremental human-in-the-loop learning cycle consists of several steps: annotation of an "active set", model training, model evaluation, and sampling of a new active set. The last step is the active learning step and aims at finding interesting samples to be annotated by the user. What counts as interesting can be defined in many ways, including uncertainty, outliers, distance to known samples, etc. Once interesting images have been annotated and added to the labelled dataset, they become less interesting: they are now part of the training set and are more likely to be correctly recognized. The labelled set will therefore grow over time, which increases training times.
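The active learning step above can be sketched in a few lines. This is a minimal illustration, not code from the actual project: it assumes entropy-based uncertainty as the "interestingness" measure, and the names (`sample_active_set`, `toy_predict`) are purely hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy of a predicted class distribution; higher means
    # the model is more uncertain about the sample.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_active_set(unlabelled, predict, k):
    # Rank unlabelled samples by model uncertainty and pick the top-k
    # as the next active set to be annotated by the user.
    ranked = sorted(unlabelled, key=lambda x: entropy(predict(x)), reverse=True)
    return ranked[:k]

# Toy stand-in for a model: confident on even numbers, unsure on odd ones.
def toy_predict(x):
    return [0.95, 0.05] if x % 2 == 0 else [0.5, 0.5]

active_set = sample_active_set(range(10), toy_predict, k=3)
# The uncertain (odd) samples are selected first: [1, 3, 5]
```

Once these samples are annotated and added to the training set, the model becomes confident on them and they drop out of contention in the next iteration, which is exactly the dynamic described above.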
This internship is concerned with reducing the size of the training set to speed up training. The goal is to shrink the training set while achieving similar or better performance measures. The general term for reducing the size of the training set is condensation. Successful outcomes of these experiments could lead to a recommendation to include condensation as an integral part of the incremental learning loop, reducing GPU time and/or improving performance. Reducing GPU time saves both costs and energy.
Condensation can use the same methods as active learning, but applied to the training set instead of the unlabelled set. The trainee is also encouraged to come up with new methods for finding interesting samples, possibly ones geared specifically towards condensation, for example class balancing (which cannot be done on an unlabelled set). Examples of other methods are Shapley values and core-sets.
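Class balancing is mentioned above as a condensation-specific option because it needs labels. A minimal sketch, assuming a labelled set of (sample, label) pairs and a simple keep-at-most-N-per-class rule (the function name and the selection rule are illustrative only):

```python
from collections import defaultdict

def condense_balanced(labelled, per_class):
    # Class-balanced condensation: keep at most `per_class` samples per
    # label. This exploits the label information that active learning on
    # an unlabelled pool does not have access to.
    by_class = defaultdict(list)
    for sample, label in labelled:
        by_class[label].append((sample, label))
    condensed = []
    for label, items in by_class.items():
        # Placeholder rule: keep the first `per_class` items. A real method
        # would rank items here, e.g. by uncertainty or a core-set criterion.
        condensed.extend(items[:per_class])
    return condensed

# Toy imbalanced set: 6 samples of class "a", 2 of class "b".
data = [(i, "a") for i in range(6)] + [(i, "b") for i in range(2)]
subset = condense_balanced(data, per_class=2)
# The condensed set has 2 samples of each class instead of a 6:2 imbalance.
```

The placeholder first-N rule would naturally be swapped for one of the ranking criteria the posting mentions (uncertainty, core-sets, Shapley values) while keeping the per-class cap.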
In this internship you will compare different approaches to condensation on several datasets, including complex biodiversity datasets. Biodiversity datasets are characterized by many fine-grained classes, a hierarchical class structure, a large imbalance in the number of samples per class, and partially overlapping classes. These challenging conditions make biodiversity datasets a good test case for other real-world applications. While the focus is on implementing and evaluating existing methods, there is room for developing your own approaches. A successful internship could result in a scientific publication.
Internship duration: Minimum 6 months
This internship assumes:
You must be studying for an MSc in computer science, data science, computational biology, cognitive science, artificial intelligence, or a related field
Experience with training deep learning models using one of the common frameworks (PyTorch, TensorFlow)
Good Python skills
Good understanding of statistical analysis and testing
Inside this Business Group
Employees of the Internet of Things Solutions Group (IOTG) have an exciting opportunity before them: To grow Intel's leadership position in the rapidly evolving IoT market by delivering the best silicon, software and services that meet a wide range of customer requirements - from Intel® Xeon® to Intel® Quark®. The group, a fresh, dynamic collaboration between Intel's Intelligent Solutions Group and Wind River Systems, utilizes assets from across all of Intel in such areas as industrial automation, retail, automobiles and aerospace. The IOTG team is dedicated to helping Intel drive the next major growth inflection through productivity and new business models that are emerging as a result of IoT.