Reproducibility and references

1. Reproducibility and references

Good job working on those reports! Understanding how to structure a report is tremendously important.

2. Written report

A fundamental part of communicating our findings is making sure a report is clear and reproducible.

3. Reproducibility example

For example, say we have a friend who always bakes an amazing chocolate cake. We go to his place and follow his exact steps using his equipment: that's reproducibility. We can then trust that he really can bake this delicious cake. On a data project, if we run our Jupyter notebook today and again tomorrow, our results should be identical (provided the dataset hasn't changed).

4. Replicability example

In baking, replicability would be the ability to bake that cake again ourselves, following the recipe but using our own utensils and ingredients. In a data project, it would be the ability to obtain similar results using the same general approach but a different environment (for example, the data is processed in the same way, but in a different language on a different OS).

5. Reproducibility and replicability virtues

This scientific approach is crucial because it prevents duplication of effort and allows other data scientists to build upon previous work and focus on new challenges. Last but not least, it allows peer review. Although we tend to focus on coding examples in this course, we should always make sure that our work is easily reproducible and replicable, whether we use Excel spreadsheets, Tableau visualizations, or Jupyter Notebooks. What matters is keeping track of how the results were produced.

6. Best practices

We should document all the scripts we used to obtain our results, for example with comments in our code, and list all the packages used in our environment. One helpful technique is to use a version control system (like Git) that tracks all the changes and versions of our scripts.
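As a minimal sketch of listing the packages in our environment, the snippet below collects the Python version, the operating system, and the version of every installed package using only the standard library. The function name `snapshot_environment` is our own choice for illustration; the output could be saved next to the analysis scripts and committed to version control with them.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Return the Python version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


# Print the snapshot as JSON so it can be redirected to a file.
print(json.dumps(snapshot_environment(), indent=2))
```

Anyone rerunning the analysis later can compare their own snapshot with the stored one to spot version differences.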

7. Best practices

We should avoid any manual manipulation of the data. We must never change the dataset directly in an editor. If possible, we should save all versions of our dataset. We should also save the raw data together with a script that records the intermediate steps. This helps us tell the story of our data transformation and create a narrative around it. Such a clear view ensures we know at any point what is happening with the data, and can therefore adapt and resolve problems. Take data imputation as an example. There was a bug in the data pipeline, and you're told you can use the average values to impute missing values for a specific product. You go ahead and make the edits manually, then close your editor. Right after, your colleague informs you that those products actually weren't available for sale on those dates, so the missing values should be zero. The data is already overwritten, so it's going to be hard to know which values were changed in the first place. Had we versioned the changes, it would be much simpler to undo them.
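The imputation story can be sketched in code. In this hedged example, the records in `RAW_SALES` and the helper names `impute_units` and `mean_units` are made up for illustration. The raw data is never modified: each imputation produces a new copy and reports which rows were filled, so switching from the mean to zero is just a rerun.

```python
import copy

# Hypothetical raw records; None marks a missing value.
RAW_SALES = [
    {"date": "2024-01-01", "product": "A", "units": 10},
    {"date": "2024-01-02", "product": "A", "units": None},
    {"date": "2024-01-03", "product": "A", "units": 20},
]


def impute_units(rows, fill_value):
    """Return an imputed copy of rows and the indices that were filled."""
    cleaned = copy.deepcopy(rows)  # the raw rows are left untouched
    imputed_idx = []
    for i, row in enumerate(cleaned):
        if row["units"] is None:
            row["units"] = fill_value
            imputed_idx.append(i)
    return cleaned, imputed_idx


def mean_units(rows):
    """Mean of the observed (non-missing) unit counts."""
    observed = [r["units"] for r in rows if r["units"] is not None]
    return sum(observed) / len(observed)


# First attempt: fill missing values with the mean of the observed ones.
with_mean, filled = impute_units(RAW_SALES, mean_units(RAW_SALES))

# After the colleague's correction: rerun from the raw data with zero.
with_zero, filled_again = impute_units(RAW_SALES, 0)

print(filled)  # indices of the rows that were imputed
```

Because the script records which rows it changed, we always know which values are original and which are imputed.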

8. Best practices

We usually use machine learning algorithms or pipelines to create our workflow. Some of them involve randomness. We can usually set a random seed to make our model outputs reproducible. Fixing the seed removes run-to-run randomness as a source of variation: it lets us check that a change in the results was due to a change in the model and not just chance.
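Here is a minimal sketch of the idea using Python's standard `random` module. The function `noisy_score` is a hypothetical stand-in for any model run that involves randomness; machine learning libraries typically expose their own seed parameter for the same purpose.

```python
import random


def noisy_score(seed):
    """Hypothetical stand-in for a model run that involves randomness."""
    rng = random.Random(seed)  # local generator, independent of global state
    sample = [rng.gauss(0, 1) for _ in range(100)]
    return sum(sample) / len(sample)


first = noisy_score(seed=42)
second = noisy_score(seed=42)

# The same seed yields the same sequence of random numbers,
# so both runs produce exactly the same score.
print(first == second)
```

Using a local `random.Random` instance rather than the global generator keeps the result from depending on other code that also draws random numbers.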

9. Best practices

Interpretability is the degree to which a human can understand the cause of a decision or can consistently predict a model's result. Telling a story with a compelling narrative helps our stakeholders understand and interpret our findings. Because they can understand how we reached our conclusions, those conclusions are easier to reproduce.

10. Best practices

Lastly, we should always correctly cite other people's work used in our analysis.

11. References

Citations are the basic information required to identify and locate a specific publication.

12. References

There are different citation styles, but they all follow the same underlying logic, with slight variations. One of the most common styles is APA, whose in-text citations give the author's name first, followed by the date of publication.
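To make the author-then-date pattern concrete, here is a small sketch that formats a parenthetical in-text citation. The function name `apa_in_text` is our own, the surnames are placeholders rather than real references, and the handling of two authors and of three or more authors follows the APA 7th edition convention as we understand it; an official style guide remains the authority.

```python
def apa_in_text(authors, year):
    """Format a parenthetical in-text citation from surnames and a year."""
    if not authors:
        raise ValueError("at least one author surname is required")
    if len(authors) == 1:
        names = authors[0]
    elif len(authors) == 2:
        names = f"{authors[0]} & {authors[1]}"
    else:
        names = f"{authors[0]} et al."
    return f"({names}, {year})"


# Placeholder surnames, not real references.
print(apa_in_text(["Doe"], 2020))         # (Doe, 2020)
print(apa_in_text(["Doe", "Roe"], 2020))  # (Doe & Roe, 2020)
```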

13. References

There are a number of reference management tools that can ease the burden of tracking all the citations. They help us switch styles automatically and search online sources for references. EndNote, Mendeley, and RefWorks are some examples, but there are many more.

14. References

In a business context, these citation rules are much more relaxed. Most people will simply include a hyperlink to the source. What matters in the end is that the information is easy to retrieve if the reader wants to refer to the source material.

15. Let's practice!

Let's see how you can make your work more reproducible!