Staff
- Instructor: Kenneth C. Arnold, Calvin University, North Hall NH298
- Meeting Time: MWF 9-9:50am
- Contact: Preferred: ka37@calvin.edu, Office: 616-526-8723, Cell: 443-310-4002.
- Teaching Assistant / Grader: Yejae Kim
- Office Hours
- Prof Arnold: Wed 3-4pm in this Teams meeting or in NH 295. See my Google Calendar for exceptions. I’m also available by appointment.
- Yejae: Wed 4:30-5:30pm in Teams (General channel)
Objectives
This course integrates the data acumen and visualization skills you developed in DATA 101 or 175 with the computational and mathematical/statistical skills you have been developing in other classes. We will emphasize data wrangling, predictive modeling, and visualization.
Upon completing this class, you should be able to work through each part of the data science lifecycle, including:
- Planning, e.g., refine a data-based question based on the characteristics of available data and client needs.
- Data Acquisition, e.g., evaluate potential data sources based on utility and biases.
- Generating Hypotheses, e.g., identify potential issues of bias and representation in a dataset based on exploratory analyses.
- Wrangling, e.g., apply Grammar of Data operations to transform data into a form appropriate for analysis.
- Modeling, e.g., contrast the suitability of linear vs. nearest-neighbor models for a given modeling task.
- Validation, e.g., identify whether a reported accuracy number might be misleading because of overfitting.
- Visualization, e.g., refine a given visualization to improve its effectiveness and faithfulness.
- Communication, e.g., write intelligible and faithful interpretations of the process and results of a predictive modeling process.
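To make the Validation objective above concrete, here is a minimal sketch (in Python, standard library only) of how a reported accuracy number can mislead. It is an illustration under stated assumptions, not course material: the data are synthetic points with coin-flip labels, and the helper names (`make_noise`, `nearest_label`, `accuracy`) are hypothetical. A 1-nearest-neighbor model memorizes its training data, so accuracy measured on that same data looks perfect even though there is nothing to learn.

```python
import random


def make_noise(rng, n):
    """Synthetic data: random 2-D points with coin-flip labels (no real signal)."""
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]


def nearest_label(train, point):
    """Predict the label of the training example closest to `point` (1-NN)."""
    nearest = min(
        train,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, data):
    """Fraction of examples in `data` whose label the 1-NN model predicts correctly."""
    correct = sum(nearest_label(train, x) == y for x, y in data)
    return correct / len(data)


rng = random.Random(0)        # fixed seed so the run is repeatable
train = make_noise(rng, 200)  # hypothetical training set
test = make_noise(rng, 200)   # fresh data the model has never seen

train_acc = accuracy(train, train)  # each point is its own nearest neighbor
test_acc = accuracy(train, test)    # labels are pure noise, so expect roughly chance
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the warning sign: accuracy reported on the data a model was fit to says little about how it will do on new data.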
Communication
We will use the following communication tools:
- Email (ka37@calvin.edu)
- When in doubt, email me. I may redirect you to one of these other places.
- Expect a weekly “heads-up” email from me.
- MS Teams
- All synchronous meetings will happen on our class Team (except, we hope, for some lectures; see below).
- Please use Teams for Q&A that may be relevant to others in the class. If I get such a question by email, I will probably direct you to post it on Teams so I can answer there.
- Moodle is the main source of course material.
- Many Moodle segments will actually link to this website.
- Some material will be duplicated in other places. If there are inconsistencies, please let the instructors know and treat Moodle as the reference.
- Moodle will also be the home of our discussion forums.
Materials
Technology
- GitHub
- As part of our objective of reproducibility, we will be using git for distributing assignments, collaboration, and tracking progress.
- RStudio Server
Textbooks
We will use the following materials. All are available freely online, but some may also be purchased in hardcover if desired.
Weekly expectations
Each week is the same(ish). Each week you will be expected to:
- Be ready: Prep readings due Monday
- Be present at Monday/Wednesday lectures and Friday labs
- Lab reports are due on Monday
- Practice: Homework (or project milestone) due Wednesday
- Check in: Quizzes on Thursday
- Look around: Discussion forum posts due Tuesday
Lecture and Lab
- Lectures are held Mondays and Wednesdays at 9-9:50am. If you have a laptop, please bring it.
- On the first day of class (9/2), we will meet in SB 010. Please bring your laptops.
- We will decide whether to hold class in person based on how well the Calvin community is managing COVID-19. We may make more conservative decisions than the university requires overall.
- So please check Teams for class location before coming in person. You may always join online.
- Labs will be held Fridays at 9-9:50am, always on Teams.
Lab reports will be due on the following Monday.
Projects
You will complete two multi-week projects in this class.
In the Midterm Project, you will practice some parts of the data science lifecycle by reproducing a published visualization of your choice from source data. In the Final Project, you will additionally apply predictive analytics. You may use the same dataset for both projects or choose a different one. Details about the projects
Final projects may be completed in teams of up to 3. Teams will have the following additional expectations:
- Teams must submit a team contract about how they will work together
- Teams must convince the instructional staff that each team member learned something substantial from completing the project.
- Each team member must submit an assessment of how they and other team members fulfilled their contract.
Details of these expectations are forthcoming.
Grading
Unless otherwise arranged, grades will be weighted as follows:
- 5% Prep and Participation
- 10% Discussion forums
- 10% Lab exercises
- 20% Homework
- 20% Quizzes
- 10% Midterm Exam
- 10% Midterm Project
- 15% Final Project
Your lowest quiz score will get thrown out.
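As an illustration of how the weights above combine, here is a minimal sketch in Python. It rests on assumptions the syllabus does not state: every category is scored as a percentage from 0 to 100, quizzes count equally within their category, and all scores in the example are made up. The function names are hypothetical.

```python
# Weights copied from the list above (percent of the final grade).
WEIGHTS = {
    "Prep and Participation": 5,
    "Discussion forums": 10,
    "Lab exercises": 10,
    "Homework": 20,
    "Quizzes": 20,
    "Midterm Exam": 10,
    "Midterm Project": 10,
    "Final Project": 15,
}


def quiz_average(scores):
    """Average the quiz scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two quiz scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def final_grade(category_scores):
    """Weighted average of per-category percentages (each 0-100)."""
    assert sum(WEIGHTS.values()) == 100  # the weights must cover the whole grade
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS) / 100


# Hypothetical student, for illustration only.
example_scores = {
    "Prep and Participation": 100,
    "Discussion forums": 90,
    "Lab exercises": 80,
    "Homework": 85,
    "Quizzes": quiz_average([60, 80, 90, 100]),  # the 60 is dropped, leaving 90.0
    "Midterm Exam": 70,
    "Midterm Project": 90,
    "Final Project": 95,
}
print(final_grade(example_scores))  # 87.25
```

Because quizzes carry 20% of the grade, dropping the lowest score can move the final number noticeably; in this made-up case the quiz average rises from 82.5 to 90.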
As the Calvin Academic Integrity Policy says, “At Calvin, the student-faculty relationship is based on trust and mutual respect.”
Data science is a fundamentally collaborative endeavor. Collaboration brings the benefits of multiple perspectives, needed to tackle complex problems faithfully and responsibly. But teamwork also brings the risk of one person doing all of the “learning” for the other. Thus:
- Collaboration on homework and labs is encouraged.
- Some assignments will be pair or team work. The assignment will indicate which.
- The write-ups for homework and lab submissions should be exclusively the work of the named authors.
- Every named author should be able to point to a substantive contribution to the write-up. Trade off whose computer you’re working on.
- All of the work should be your own words and code.
- It is okay and sometimes encouraged to look up how to do something online! But if you do:
- Record the exact URL that had the information that helped you. (This will help improve our instructional materials for next year.)
- Retype any code yourself, from memory, even if you have to switch back and forth a lot. (This will help you internalize what you’re borrowing.)
- Beware that there is lots of bad R code out there. Strive to do better.
- A similar policy applies for asking other students for help.
- When asking for help (and everyone should ask for help when they need it), try to solve the problem on your own first. This is critical. Then, when you ask for help, share what you’ve tried and what leads you to think it’s not working (not just “It’s not working!!”).
- List what help you received for each problem. (e.g., “Joe helped me understand what this question is asking.”)
Diversity and Inclusion
I came to Calvin because I wanted to explore what our Christian calling to “act justly” means in the context of data and the technologies that we use with it. Engaging that question wholeheartedly requires that each of us, me included, engage respectfully with perspectives very different from our own. For example, we must question those who abuse data for selfish gain, but we also must question the perspectives of those who challenge those abuses on purely secular grounds.
I intend for this class to be an environment where we equally respect people of every ethnicity, gender, socioeconomic background, political leaning, religious background, etc. I will try to create that community by having us read diverse voices, engage with issues of importance to people unlike ourselves, and structure discussions that require students to engage respectfully with perspectives different from their own. I invite your help.
We will not always do this well. If you or someone else in this class is hurt by something I say or do in class, I would like to work to remedy it. I’ll welcome this feedback in whatever way is comfortable for you: in public, in private, via another person (such as our TA or my department chair, Keith Vander Linden), or via a report to Safer Spaces or the provost’s office.
Special Circumstances
Occasionally there are special circumstances that require that course policies be adjusted for a particular student. In such cases, it is the responsibility of the student to inform me of the situation as soon as possible, so that the appropriate arrangements can be made. This includes, but is not limited to, students with documented disabilities.
Calvin University has a continuing commitment to providing reasonable accommodations for students with documented disabilities. Like so many things this fall, the need for accommodations and the process for arranging them may be altered by the COVID-19 changes we are experiencing and the safety protocols currently in place. Students with disabilities who may need some accommodation in order to fully participate in this class are urged to contact Disability Services in the Center for Student Success (disabilityservices@calvin.edu) as soon as possible to explore what arrangements need to be made to assure access. The three of us (student, instructor, and Disability Services) will work together to come up with an appropriate solution.
We will give an incomplete grade (I) only in unusual circumstances, and only if those circumstances have been confirmed by the Student Life office.
Topics
About two weeks will be dedicated to each of the following topics:
- Visualization: start with seeing
- Wrangling: because data doesn’t come clean
- Predictive Modeling: because we want to make good guesses
- Model Validation: because it’s easy to get it wrong
- EDA: because datasets don’t come with instruction manuals
- Data Acquisition and Management: because getting data isn’t as easy as opening a file your instructor hands you
- Experiment Design: because it helps to have a plan
- Communication
See Moodle for details. We will also be discussing a variety of human contexts and ethics topics. We will curate this list of topics together in the first week of class.
Acknowledgments
A substantial amount of content for the first few weeks of this course is based on material from the “Data Science in a Box” (abbreviated “dsbox” in the materials here) project led by Dr. Mine Çetinkaya-Rundel.