Project Guidelines
Your Task
Each student has been allocated to a project group of three, and each group has been assigned a specific research topic. Your goal is to complete the required project deliverables and checkpoints, following the guidelines in the remainder of this document.
Deliverables
This project has the following three key deliverables.
1. Report
[template] (right click and choose “Save Link As…” to download)
DUE THURSDAY, JULY 24 AT 11:59PM ET
Your report should be written using Quarto and submitted as a rendered .html file. We recommend using an IDMRaD (Introduction, Data, Methods, Results, and Discussion) report format, with details provided in the report template.
2. Poster
[template] (Google Slides link)
DUE TUESDAY, JULY 22 AT 11:59PM ET [HARD DEADLINE—so that we have enough time for poster printing]
Your poster should be submitted as a .pdf file. We will then make a printed copy for the poster session on the final day. (Note: the recommended size is 48 inches wide by 36 inches tall.)
3. Slides
DUE THURSDAY, JULY 24 AT 11:59PM ET
Each group will give a 7-minute presentation on the final day (July 25). The presentation should follow essentially the same structure as your report: an introduction, a data description, and an overview of methods, followed by results, recommendations, and discussion. Your slides may be created in any software, but we only accept submissions in the form of a .pdf file, a Google Slides link, or a Quarto presentation (self-contained .html file or hosted online).
Checkpoints
Checkpoint 1: 5-minute presentation during lab on June 30
Note: It is perfectly fine if you don’t have any results at this point
No notes/scripts are allowed
Your first checkpoint presentation should be structured as follows.
Introduction (1 slide): Describe your project topic/question(s) and why it is important
Data (1 slide): Data description and any relevant data pre-processing steps (e.g., whether you consider specific observations, create any meaningful features, etc.; don’t mention minor steps like column type conversion or filtering out unnecessary rows)
EDA (2 slides max): 1–2 EDA plots related to your question(s) of interest
- Design the slides using the assertion-evidence model
Methods (1 slide): Early thoughts on methods/modeling strategy. Justify why it might be appropriate to answer your question(s) of interest
Plan of action (1 slide): List all the steps needed to complete your project (be specific). Highlight the completed steps. What are the next steps?
Checkpoint 2: 7-minute presentation during lab on July 16
No notes/scripts are allowed
Your second checkpoint presentation should be structured as follows.
Introduction (1 slide): Describe your project topic/question(s) and why it is important
Data (1 slide): Data description and any relevant data pre-processing steps (e.g., whether you consider specific observations, create any meaningful features, etc.; don’t mention minor steps like column type conversion or filtering out unnecessary rows)
Plan of action (1 slide): List all the steps needed to complete your project (be specific). Highlight the completed steps.
Present the completed steps (5 slides max): methods, plots, findings, recommendations, etc.
- Design the slides using the assertion-evidence model
Plan of action (1 slide, use the same one as before): What steps are still to be completed?
Project Topics
Project 1: Premature deaths (TAs: James, Hao)
How is the socioeconomic status of a county (e.g., income inequality, unemployment, high school completion rates, etc.) associated with the number of premature deaths of certain racial groups at the county level?
Project 2: Healthcare access & preventable hospital stays (TAs: Princess, James)
Does healthcare access (e.g., primary care physicians, uninsured rate, etc.) affect the number of preventable hospital stays of certain racial groups at the county level?
Project 3: Mental health (TAs: Hao, Julian)
Does the number of mental health professionals per county affect the number of poor mental health days?
Project 4: Substance abuse (TAs: James, Julian)
Are there demographic and social factors that are predictors of substance abuse outcomes (e.g., drug- and alcohol-related deaths)?
Project 5: Influences on childhood outcomes (TAs: Princess, Julian)
How are juvenile healthcare outcomes impacted by adult health-related practices (e.g., smoking, drinking, diet)?
Project 6: Obesity (TAs: Princess, Hao)
Do physical inactivity levels and access to healthy food affect the obesity rate at the county level?
Analysis
Your analysis should focus on both:
Exploratory data analysis: Create visualizations to explore the underlying structure of the data and gain insights about distributions and relationships between variables. Ideally, these should be based on reasoned hypotheses.
Statistical modeling: Demonstrate the use of statistical and machine learning modeling techniques. This may involve justifying your choice of model (e.g., by comparing alternative model specifications that use different predictors, or by comparing with other methods), followed by any relevant interpretation of the model with regard to your project’s topic. Depending on your project, the model(s) you rely on may be used for either an inference task (i.e., interpreting coefficients) or a prediction task. The model you choose just needs to be motivated by your question of interest.
Data
Required: County Health Rankings Data
The County Health Rankings Data, collected by the University of Wisconsin Population Health Institute, ranks every county in each state on its Health Outcomes and Health Factors.
This dataset also contains the measurements used to calculate the rankings for each county. More information can be found here.
You must (at minimum) use the 2025 ranking measures, which can provide more insight into the most recent health ranking outcomes and priorities. You can use these to better shape your project topic and related hypotheses.
Your analysis should mainly be done at the national (entire United States) scale, where feasible. However, you are welcome to focus on specific counties or states to test more granular spatial hypotheses.
Optional: additional suggestions and data sources
Consider a temporal or trend analysis, as the County Health Rankings Data are typically collected over time.
- For predictive modeling, consider adding time-varying features or forecasting an outcome, with suitable uncertainty quantification.
Consider merging the County Health Rankings Data with other publicly available datasets.
- Example sources include US Census data (accessed via the tidycensus R package) and COVID-19 data (accessible via the covidcast R package).
(Note: All data used must be publicly available.)
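When merging county-level datasets as suggested above, the join key is typically the 5-digit county FIPS code, and a common pitfall is that leading zeros are lost when the code is read as a number. The sketch below, in Python with only the standard library, normalizes the key before joining and reports unmatched counties. The column names (`fips`, `poor_mental_health_days`, `median_income`) and all values are hypothetical placeholders, not the actual column names or contents of any dataset mentioned here.

```python
def normalize_fips(code):
    """Return a county FIPS code (int or str) as a zero-padded 5-character string."""
    return str(code).strip().zfill(5)


def merge_on_fips(health_rows, extra_rows):
    """Inner-join two lists of dict rows on the 'fips' key.

    Returns (merged_rows, unmatched_fips) so dropped counties can be audited.
    """
    extra_by_fips = {normalize_fips(row["fips"]): row for row in extra_rows}
    merged, unmatched = [], []
    for row in health_rows:
        fips = normalize_fips(row["fips"])
        extra = extra_by_fips.get(fips)
        if extra is None:
            unmatched.append(fips)  # keep track of counties with no match
            continue
        # Combine both rows; store the normalized key in the result.
        merged.append({**extra, **row, "fips": fips})
    return merged, unmatched


# Hypothetical rows with placeholder values (NOT real data).
health_rows = [
    {"fips": 1001, "poor_mental_health_days": 5.1},     # leading zero lost on import
    {"fips": "42003", "poor_mental_health_days": 4.6},
    {"fips": "99999", "poor_mental_health_days": 4.0},  # no counterpart below
]
extra_rows = [
    {"fips": "01001", "median_income": 60000},
    {"fips": "42003", "median_income": 70000},
]

merged, unmatched = merge_on_fips(health_rows, extra_rows)
print(len(merged), "rows merged; unmatched FIPS codes:", unmatched)
```

Checking the unmatched list after every merge is a cheap way to catch key-format mismatches before they silently shrink the analysis sample.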