Instructor

Lisa Lendway, PhD
Office Hours: See Moodle page

Email: llendway@macalester.edu

Preceptor: See Moodle page

Course Description

This capstone course will expand on the R skills you learned in COMP/STAT 112: Introduction to Data Science and STAT 253: Statistical Machine Learning. The biggest part of the course is a project, which you’ll start working on very early in the module. Since we meet on Mondays, Wednesdays, and Fridays, I’ve created a category of topics/skills for each of those days.

  • Modeling Mondays are focused on … modeling, extending the machine learning tools and skills you learned in STAT 253. This will definitely include learning new modeling methods and tools. You will also learn about topics that are important parts of the modeling process but aren’t directly related to building a model: using git and GitHub for sharing code and collaborating, using R and SQL to extract data from a database, building shiny apps to put a model in a place where people can use it, exploring new ways to aid model interpretation, and evaluating the impact of the model. I will create the learning materials for these days. They will probably be a mix of my videos, other people’s videos, and live demonstrations. You will work on problem sets that reinforce these topics.

  • Wisdom Wednesdays are guest speaker days. We will have seven different speakers from a variety of industries visiting our class. They will tell you about what they do (i.e., pass on some wisdom) and you will have an opportunity to ask them questions. There will also be time on Wednesdays for you to share wisdom with one another while working on problem sets or your project (more about that later).

  • Function Fridays are when all of you are the teachers! You will work in groups of 2-3 students to give a roughly 20-minute presentation/tutorial on an R package or useful set of R functions. In addition to the presentation, you will also write a problem or short set of problems to be added to a problem set. Some ideas for potential topics are listed below.

    • Mapping (topics NOT covered in the intro course)
      • geom_sf()
      • tmap
      • tidycensus
      • tidygeocoder
      • countrycode
      • osmdata
      • rayshader
    • Manipulating data with functional programming using purrr
    • Manipulating strings with regular expressions - I recommend starting with stringr, including its cheatsheet.
    • Using Python in R with reticulate
    • Text analysis with tidytext
    • Analyzing missing data with naniar
    • Using the data.table package to manipulate data
    • Nice ways to display tables: DT, gt, kable, kableExtra, etc.
    • Have something else you want to talk about? Great! Check in with me, and it will likely work.

Additionally, Data Ethics/Justice will be a key component of the class. I am not an expert in this area, so I will be doing a lot of learning along with you.

Learning Goals

  • Make using git and GitHub a habit. Get more comfortable using them when collaborating with others (branching, pull requests, etc.).
  • Create a website with work from the course that you can share with future employers, advisers, collaborators, etc.
  • Become familiar with more machine learning models and methods and the associated R code.
  • Gain confidence in learning new R functions on your own and experience teaching others to use them.
  • Use SQL within RStudio.
  • Understand what it means to “put a model into production” (even though we may not execute all the pieces in class).
  • Create a shiny app to interact with a model.
  • Appreciate the importance of data ethics/justice and integrate it into all aspects of data science.
  • Practice working in a group to achieve a shared goal.

Course environment

Academic Integrity: Students are expected to maintain the highest standards of honesty in their college work; violations of academic integrity are serious offenses. Students found guilty of any form of academic dishonesty – including, for instance, forgery, cheating, and plagiarism – are subject to disciplinary action. Examples of behavior that violates this policy, as well as the process and sanctions involved, can be found on the Academic Programs website.

Accessibility: I am committed to ensuring access to course content for students. Reasonable accommodations are available for students with documented disabilities. Contact the Disability Services Office (651-696-6874) to schedule an appointment and discuss your individual circumstances. It is important to meet as early in the semester as possible; this will ensure that your accommodations can be implemented early on. The Director of Disability Services coordinates services for students seeking accommodations.

Diversity: At Macalester, we embrace diversity of age, background, beliefs, ethnicity, gender, gender identity, gender expression, national origin, religious affiliation, sexual orientation, and other visible and non-visible categories. I do not tolerate discrimination. We are all here because we deserve to be here.

Names/pronouns: You deserve to be addressed in the manner you prefer. To guarantee that I address you properly, you are welcome to tell me your pronoun(s) and/or preferred name at any time, either in person or via email.

Grading and Evaluation

Problem sets: You will complete ~4 problem sets, mostly in the first half of the course. These will reinforce the modeling concepts I cover and will include problems on the topics students present on Function Fridays. You are encouraged to work in groups, but each person will turn in their own assignment.

Function Friday teaching: You will work in groups to present a topic of the group’s choosing (see my suggestions above). I will give you A LOT of flexibility in how you want to do this; some may want to do a live presentation and some may want to create a pre-recorded video or something like a blog post tutorial. There are some requirements:

  • Provide some background for why this package or function or set of functions is useful.
  • Create a resource with examples of the package being used. I would recommend posting this on your website or creating a website for it. This could also be something you go through in your presentation.
  • Write a problem or short set of problems, with solutions, to be added to a problem set. You will not post this publicly, at least not at first. The problem or set of problems should take a student less than an hour to complete (aim for 30-45 minutes).

Attending and participating in guest speaker sessions: You are expected to attend class (on Zoom) when we have a speaker and participate in conversation with the speaker. This is your opportunity to learn about what people do in their work as data scientists! You will be writing briefly about each speaker, either in its own assignment or as part of the weekly exercises.

Project: This is a HUGE part of this course! You will start working on the project during the 2nd week of the course and most of the 2nd half will be dedicated to that project. I will provide more details in a separate document.

Your grade will be determined mostly by you. I will provide you with written and oral feedback, and you will have opportunities to reflect on your learning and evaluate how you are doing. In the end, you will decide your letter grade. If I feel your choice is very different from the grade I would have assigned, I can change it, but we will have plenty of opportunities to talk about this.

Schedule