# Map This Algorithm

Let’s map out an algorithm together.

As in previous discussion exercises, write a response, then respond to someone else’s response, noting what they did differently. Please try to respond to a response that hasn’t yet received a reply.

In this exercise, imagine that we have a single directory containing a very large number of datafiles, each of which is a plain-text CSV table containing the data for only a few samples. Each datafile, however, has a different number of variables (columns): many of the variables are shared, but any variable that a datafile has no values for is simply absent from that file. You don’t even know how many variables are contained across all the datafiles to begin with. Thankfully, the same variable in one datafile will have the same column header in another. Finally, to be utterly troublesome, a number of the datafiles contain no data: they just have an empty value in them. You don’t know which these are, or how many of them there are.

Map out the algorithm for how you might handle this situation, building a single table that contains all the data of the smaller datafiles, and then how you would crop the whole dataset to retain the most samples (at the expense of losing variables).

When I say map, I mean map. I’d love it if you could just draw a picture of a flowchart, but sadly, I don’t think Canvas supports students attaching images to discussion responses. So, in the absence of that capability, just write out the steps like you would write out your favorite recipe for banana bread[^1]. You do not need to mention R code at all: you can even suggest functionalities that, to your knowledge, do not exist in R. But try to be atomic: break actions down until they are as fine-grained as R would need them to be to perform the steps of your algorithm. We’ll worry about how we turn algorithms into code in other assignments.

(Note, though, as this might prove useful to you conceptually, that there are R functions like list.files, which will list all the files in a given directory, as well as functions like scan and readLines, which will obtain the contents of a file as if it were a plain-text string, rather than trying to interpret the input file as a table.)

[^1]: My secret is using brown sugar, never plain sugar, in my banana bread.
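
If it helps to see what list.files and readLines actually hand back, here is a minimal sketch. The directory, file names, and column names are all made up for illustration, and the snippet only looks at the files; it is not a step of the algorithm you are being asked to map.

```r
# Build a throwaway directory with three toy datafiles (all names hypothetical).
data_dir <- file.path(tempdir(), "toy_datafiles")
dir.create(data_dir, showWarnings = FALSE)

writeLines(c("sample,height,weight", "s1,10,20"),
           file.path(data_dir, "a.csv"))
writeLines(c("sample,height,colour", "s2,11,red", "s3,12,blue"),
           file.path(data_dir, "b.csv"))
writeLines("", file.path(data_dir, "c.csv"))  # a "no data" file: just an empty value

# list.files returns the names (here, full paths) of the matching files.
paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# readLines returns each file's contents as plain text, one string per line,
# without trying to interpret the file as a table.
contents <- lapply(paths, readLines)
names(contents) <- basename(paths)

print(contents)
```

Notice that the two files with data share some column headers but not others, and that the empty file comes back as a single empty string rather than as an error.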