Requirements: Machine Learning, Optimization, Advanced Algorithms
Program requirements: Continuous assessment + exam
Teacher: Stéphane Boucheron
Weekly hours: 2 h lectures
Years: M2 Logos

Syllabus

Use of randomized methods for processing massive datasets and data streams (streaming). Introduction to Spark. Interplay between estimation and optimization.

Contents

  1. Nearest neighbors in high dimension
    • Locality-sensitive hashing and beyond
    • Applications to text data (Spark ML Feature Extraction)
  2. Compressed sensing
    • Exact recovery of sparse signals via $\ell_1$ penalization
    • Algorithms (LASSO, ADMM, coordinate descent, ...)
  3. Streaming data
    • Sampling
    • Approximate counting (HyperLogLog, Spark SQL)
  4. Robust estimation
    • Challenges
    • Median of Means
    • SDP relaxation
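
As an illustration of the Median of Means item in the robust estimation part, here is a minimal sketch, not taken from the course material. The function name `median_of_means`, the parameters `n_blocks` and `seed`, and the toy data are all assumptions chosen for this example: the sample is shuffled, split into blocks, and the median of the block averages is returned.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Estimate the mean of `values` by the median of block averages.

    Hypothetical helper for illustration only; `n_blocks` and `seed`
    are assumed parameter names, not part of the course.
    """
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    shuffled = list(values)
    # Shuffle with a fixed seed so the random blocks are reproducible.
    random.Random(seed).shuffle(shuffled)
    # Strided slicing splits the sample into n_blocks nearly equal blocks.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    block_means = [statistics.fmean(block) for block in blocks]
    return statistics.median(block_means)


# Toy data (assumed): 99 well-behaved points plus one gross outlier.
data = [1.0] * 99 + [1000.0]
empirical_mean = statistics.fmean(data)
robust_estimate = median_of_means(data, n_blocks=10)
```

On this toy sample the outlier can contaminate only one of the ten blocks, so the median of the block averages stays at the bulk of the data while the empirical mean is pulled away from it.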

Bibliography

  • Arnold, T., & Tilton, L. (2015). Humanities data in R: exploring networks, geospatial data, images, and text. Springer.
  • Bandeira, A. S. (2015). Ten lectures and forty-two open problems in the mathematics of data science. Lecture Notes.
  • Blum, A., Hopcroft, J., & Kannan, R. (2016). Foundations of data science. Preliminary version of a textbook.
  • Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
  • Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.
  • Foucart, S., & Rauhut, H. (2013). A mathematical introduction to compressive sensing. Birkhäuser.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.
  • Lugosi, G. (2017). Lectures on Combinatorial Statistics. St. Flour.
  • Moitra, A. (2018). Algorithmic aspects of machine learning. Cambridge University Press.
  • Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge University Press.