• 3. Reproducibility summary

Links to: Summary, Chatbot Tutor, Practice Questions, Glossary, R functions and R packages, and Additional resources.

Chapter Summary

Doing science is hard work. It is therefore important to keep your hard-won data in a stable form that is less likely to be corrupted, and to provide clean, reproducible code to analyze these data. By following the best practices for collecting, storing, documenting, and analyzing data laid out in this chapter, you make your work reproducible and trustworthy.

Chatbot tutor

Please interact with this custom chatbot (ChatGPT link here, Gemini link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping once you feel you have gotten what you need from it.

Practice Questions

Try these questions!

Q1) What is the biggest mistake in the table below?
ID     weight   date_collected_empty_means_same_as_above
1-A1   104      2024-03-01
1-1B   210
3-7    150
2-B    176      2024-03-15
1-A5   110

While some of the other problems (like the long column name for date) are clearly shortcomings, the biggest mistake is leaving values implied. A spreadsheet should state every value explicitly in its own cell.



Q2) What would you expect in a data dictionary accompanying the table above? (select all correct)

Q3) How do you read data from an Excel sheet called raw_data, in an Excel file named bird_data.xlsx located inside the R project you are working in?

Q4) What should you do to make code reproducible? (pick the best answer)

Q5) As we saw in this chapter, R has a built-in dataset called iris. You can look at it or give it to functions by typing iris. What type of variable is Species in the iris dataset?

Q6) Consider the plot generated by the code in the previous section. The plot consists of “small multiples” (or, in ggplot language, “facets”). The facet on the far right is pink. What is the facet on the far right?

For the following question consider the diabetes dataset available at: https://raw.githubusercontent.com/ybrandvain/datasets/refs/heads/master/diabetes.csv

Q7) What is the first column in the diabetes dataset that will change its name after going through the clean_names() function?


For the next questions, consider the R script below, which is (sadly) not far from bad code I often have to handle as a data editor.

# My R script

setwd("~/ybrandva/Desktop/silphium_project")

#### load data 
silphium_data <- read_csv("mydata.csv")
View(silphium_data)


ggplot(silphium_data, aes(x = location, y = yield))+
  geom_point()

Q8) The script above worked fine on my computer yesterday, but today I opened up a new R session on the very same computer and it failed. What went wrong?


Q9) What about the script would make it work on my computer (once the thing above is fixed) but not on another computer?


Q10) What about the script is annoying for someone using this code (once it works) and should be removed, but doesn't stop our code from working?


Glossary of Terms

Absolute Path – A file location specified from the root directory (e.g., /Users/username/Documents/data.csv), which can cause issues when sharing code across different computers. Using relative paths instead is recommended.

Data Dictionary – A structured document that defines each variable in a dataset, including its name, description, units, and expected values. It helps ensure data clarity and consistency.

Data Validation – A method for reducing errors in data entry by restricting input values (e.g., dropdown lists for categorical variables, ranges for numerical values).

Field Sheet – A structured data collection form used in the field or lab, designed for clarity and ease of data entry.

Metadata – Additional information describing a dataset, such as when, where, and how data were collected, the units of measurement, and details about the variables.

R Project – A self-contained environment in RStudio that organizes files, code, and data in a structured way, making analysis more reproducible.

Raw Data – The original, unmodified data collected from an experiment or survey. It should always be preserved in its original form, with any modifications performed in separate scripts.

README File – A text file that provides an overview of a dataset, including project details, data sources, file descriptions, and instructions for use.

Reproducibility – The ability to re-run an analysis and obtain the same results using the same data and code. This requires careful documentation, structured data storage, and clear coding practices.

Relative Path – A file path that specifies a location relative to the current working directory (e.g., data/my_file.csv), making it easier to share and reproduce analyses.

Tidy Data – A dataset format where each variable has its own column, each observation has its own row, and each value is in its own cell.
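To make the Data Dictionary entry above a bit more concrete, here is a minimal sketch of a data dictionary stored as a data frame, with a few simple checks of the data against it. Everything in it is made up for illustration: the dataset, the column names (plant_id, weight_g, site), and the expected ranges are assumptions, not data from this chapter. It uses only base R.

```r
# Hypothetical measurements (names and values are invented for illustration)
measurements <- data.frame(
  plant_id = c("A1", "A2", "B1"),
  weight_g = c(104, 210, 150),
  site     = c("north", "south", "north")
)

# A data dictionary: one row per variable in the dataset
data_dictionary <- data.frame(
  variable    = c("plant_id", "weight_g", "site"),
  description = c("Unique plant identifier",
                  "Fresh weight of the plant",
                  "Collection site"),
  units       = c(NA, "grams", NA),
  expected    = c("letter followed by a number",
                  "0 to 1000",
                  "north or south")
)

# Every column in the data should appear in the dictionary
undocumented <- setdiff(names(measurements), data_dictionary$variable)

# Simple checks of the data against the expected values written above
weight_ok <- all(measurements$weight_g >= 0 & measurements$weight_g <= 1000)
site_ok   <- all(measurements$site %in% c("north", "south"))

undocumented  # an empty character vector means every column is documented
weight_ok
site_ok
```

In practice the dictionary would usually live in its own file next to the raw data (or in the README), so that anyone opening the dataset can find it.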


Key R functions

📥 Data import


🔍 Inspecting data

  • glimpse(): from the dplyr package; shows the structure of your data.
  • View(): opens the data viewer.
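As a quick illustration of inspecting data, here is a minimal sketch that runs glimpse() on mtcars, another dataset built into R. The choice of mtcars is mine (it is not a dataset from this chapter), and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# glimpse() prints one line per column: its name, its type, and the first few values
glimpse(mtcars)

# Base R checks that report the size of the dataset
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)
```

Compared with printing the whole data frame, glimpse() makes it easy to spot a column that was read in as the wrong type.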

🧹 Cleaning & renaming


🔧 Data wrangling


R Packages Introduced

Additional resources

R Recipes:

Other web resources:

Videos: