Instructor

Lisa Lendway, PhD
Office Hours: See Moodle page

Email: llendway@macalester.edu

Preceptors: See Moodle page

Course Description

This capstone course will expand on R skills you learned in COMP/STAT 112: Introduction to Data Science and STAT 253: Statistical Machine Learning. The biggest part of the course is a project, which you’ll start working on early in the course. Last spring, class met on Mondays, Wednesdays, and Fridays, so I described the course using cute names (Modeling Mondays, Wisdom Wednesdays, and Function Fridays). We’re meeting on Tuesdays and Thursdays now, so the names don’t match the days, but I decided to keep them because I think they do a good job of describing the main themes of the course. Just know that the topics will not only occur on those days. I have a calendar at the end you can reference for more detail.

  • Modeling Mondays are focused on … modeling, extending the machine learning tools and skills you learned in STAT 253. This will definitely include learning new modeling methods and tools. You will also learn about topics that are important parts of the modeling process but aren’t directly related to building a model: using git and GitHub for sharing code and collaborating, using R and SQL to extract data from a database, building shiny apps to put a model in a place where people can use it, exploring new ways to aid model interpretation, and evaluating the impact of the model. The learning material for these days can be found on the Course Materials tab of the course website. Specifics of which material to read, and when, will be found on the Moodle page. You will do the reading/watching/prep before class and come to class prepared to work on problem sets that reinforce these topics.

  • Wisdom Wednesdays are guest speaker days. We will have speakers from a variety of industries visiting our class. They will tell you about what they do (i.e., pass on some wisdom) and you will have an opportunity to ask them questions. There will also be time on Wednesdays for you to share wisdom with one another while working on problem sets.

  • Function Fridays are when all of you are the teachers! You will work in groups of 2-3 students giving a ~20 minute presentation/tutorial on an R package or useful set of R functions. In addition to the presentation, you will also write a problem or short set of problems to be added to a problem set. Some ideas for potential topics are listed below. You will meet with me roughly a week before your presentation to talk about your topic, and we will meet again a couple of days before to ensure you’re ready to present.

    • Manipulating data with functional programming using purrr
    • Manipulating strings with regular expressions - I recommend starting with stringr, including its cheatsheet.
    • Creating an R package (doesn’t have to be anything too fancy)
    • Using python in R with reticulate
    • Text analysis with tidytext
    • Mapping
      • rayshader
      • other mapping topics not covered in the intro course
    • Creating generative art using R. I can provide some resources.
    • Using the data.table library to manipulate data.
    • Nice ways to display tables: DT, gt, kable, kableExtra, etc.
    • Have something else you want to talk about? Great! Check in with me, and it will likely work.
  • NEW! Tidy Tuesdays are days to practice your data wrangling and visualization skills! If you took Intro Data Science with me last year, you know what Tidy Tuesday is about. If not, you can read more about it on their website. We’ll work on this in class for roughly 30-40 minutes every other week, and you will spend some more time outside of class. See more details below and in the separate document I’ll provide.

Additionally, Data Ethics/Justice will be a key component of the class. I am not an expert in this area, so I will be doing a lot of learning along with you.

Learning Goals

  • Make using git and GitHub a habit. Get more comfortable with using it when collaborating with others - branching, pull requests, etc.
  • Create a website with work from the course that you can share with future employers, advisers, collaborators, etc.
  • Become familiar with more machine learning models and methods and the associated R code.
  • Improve your data wrangling and visualization skills.
  • Gain confidence in learning new R functions on your own and experience teaching others to use them.
  • Use SQL within RStudio.
  • Understand what it means to “put a model into production” (even though we may not execute all the pieces in class).
  • Create a shiny app to interact with a model.
  • Appreciate the importance of data ethics/justice and integrate it into all aspects of data science.
  • Practice working in a group to achieve a shared goal.

Course environment

Academic Integrity: Students are expected to maintain the highest standards of honesty in their college work; violations of academic integrity are serious offenses. Students found guilty of any form of academic dishonesty – including, for instance, forgery, cheating, and plagiarism – are subject to disciplinary action. Examples of behavior that violates this policy, as well as the process and sanctions involved, can be found on the Academic Programs website.

Accessibility: I am committed to ensuring access to course content for students. Reasonable accommodations are available for students with documented disabilities. Contact the Disability Services Office at 651-696-6874 to schedule an appointment and discuss your individual circumstances. It is important to meet as early in the semester as possible; this will ensure that your accommodations can be implemented early on. The Director of Disability Services coordinates services for students seeking accommodations.

Diversity: At Macalester, we embrace diversity of age, background, beliefs, ethnicity, gender, gender identity, gender expression, national origin, religious affiliation, sexual orientation, and other visible and non-visible categories. I do not tolerate discrimination. We are all here because we deserve to be here.

Names/pronouns: You deserve to be addressed in the manner you prefer. To guarantee that I address you properly, you are welcome to tell me your pronoun(s) and/or preferred name at any time, either in person or via email.

Grading and Evaluation

Problem sets: You will complete ~6 problem sets, mostly in the first half of the course. These will reinforce the modeling concepts I cover and will include problems from the topics covered by students. You are encouraged to work in groups, but each person will turn in their own assignment. These will be graded by the preceptors and me.

Function Friday teaching: You will work in groups to present a topic of the group’s choosing (see my suggestions above). Presentations will occur in class and will be about 20 minutes. As part of this, your group should:

  • Provide some background for why this package or function or set of functions is useful.
  • Go through examples of the function(s)/packages being used and create a resource that we can reference later. The resource you create could be the same thing you go through in class for your presentation. I would recommend posting this on your website or creating a website for it. This resource should be created in R! The easiest thing to do would be to create an R Markdown document with the code download button at the top.
  • Write a problem or short set of problems to be added to a problem set, with solutions. You will not post the solutions publicly, at least not at first, but turn those in to me. The problem or set of problems should take less than an hour for a student to complete (aim for ~30 mins).

Attending and participating in guest speaker sessions: You are expected to attend class when we have a speaker and participate in conversation with the speaker. This is your opportunity to learn about what people do in their work as data scientists! The majority of the speakers will be on Zoom, so I will try to give you some time to get to other places if you don’t want to be in the classroom during that time.

Tidy Tuesdays: You will do about 6 of these throughout the semester. You’ll spend 30-40 minutes in class working on them, discussing ideas with the people around you. You will post your work, including the final graph, to your website. I will provide more detail in a separate document on the Moodle page.

Project: This is a HUGE part of this course! You will start working on the project during the 3rd-4th week of the course and much of the 2nd half will be dedicated to that project. I will provide more details in a separate document.

Your grade will be determined mostly by you. I will provide you with written and oral feedback, and you will have opportunities to reflect on your learning and evaluate how you are doing. In the end, you will decide your letter grade. If I feel your choice is very different from the grade I would have assigned, I can change it, but we will have plenty of opportunities to talk about this.

Weekly Schedule

Week  Start date  Topics                          Notes
1     2020-08-30  git/GitHub, creating a website
2     2020-09-06  ML review with tidymodels
3     2020-09-13  Model stacking
4     2020-09-20  Boosting
5     2020-09-27  SQL, Shiny
6     2020-10-04  Interpretable ML
7     2020-10-11  Interpretable ML
8     2020-10-18  Catch-up                        Fall Break
9     2020-10-25  H2O
10    2020-11-01  Deep Learning
11    2020-11-08  plumber & Docker
12    2020-11-15
13    2020-11-22
14    2020-11-29                                  Thanksgiving Break
15    2020-12-06
16    2020-12-13  Project presentations

Calendar of Due Dates