We are filling in the exercises below
to make the lesson plan more concrete.
Contributions, whether pull requests with filled-in exercises
or comments on specific exercises, ordering, and timings, are greatly appreciated.
Process Used
Michael Pollan’s advice if he taught R or Python programming:
This lesson was developed using a slimmed-down variant of the “Understanding by Design” process.
The main sections are:
Assumptions about audience, time, etc.
(The current draft also includes some conclusions and decisions in this
section; these should be moved elsewhere.)
Desired results:
overall goals, summative assessments at half-day granularity, what learners
will be able to do, what learners will know.
Learning plan:
each episode has a heading that summarizes what will be covered,
followed by estimates of the time to be spent on teaching and on exercises,
with the exercises given as bullet points.
Stage 1: Assumptions
Audience
Graduate students in numerate disciplines from cosmology to archaeology
Who have manipulated data in spreadsheets and with interactive tools like SAS
But have not programmed beyond CPD (copy-paste-despair)
Constraints
One full day 09:00-16:30
6:15 class time
0:45 lunch
0:30 total for two coffee breaks
Learners use native installs on their own machines
May use VMs or cloud resources at instructor’s discretion
But must keep native local install as an option
No dependence on other Carpentry modules
In particular, does not require knowledge of shell or version control
Use the Jupyter Notebook
Authentic tool used by many instructors
There isn’t really an alternative
And means that even people who have seen a bit of Python before
will probably learn something
Motivating Example
Creating 2D plots suitable for inclusion in papers
Appeals to almost everyone
Makes lesson usable by both Carpentries
And means that even people who have seen a bit of Python before
will probably learn something
Data
Use the gapminder data throughout
But break into multiple files by continent
To make display of output from examples tidier
(e.g., use Australia/New Zealand, which is only two lines)
And allow examples showing use of multiple data sets
Focus on Pandas instead of NumPy
Makes lesson usable by both Data Carpentry and Software Carpentry
Genuine novices are likely to want data analysis
And people with some prior experience:
will accept data analysis as an authentic task,
and are unlikely to have encountered Pandas,
so they’ll still get something useful out of the lesson
Challenges will mostly not be “write this code from scratch”
Want lots of short exercises that can reliably be finished in allotted time
So use MCQs, fill-in-the-blanks, Parsons Problems, “tweak this code”, etc.
Stage 2: Desired Results
Questions
How do I…
…read tabular data?
…plot a single vector of values?
…create a time series plot?
…create one plot for each of several data sets?
…get extra data from a single data set for plotting?
…write programs I can read and re-use in future?
Skills
I can…
…write short scripts using loops and conditionals.
…write functions with a fixed number of parameters that return a single result.
…import libraries using aliases and refer to those libraries’ contents.
…do simple data extraction and formatting using Pandas.
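A minimal sketch of the kind of short script these skills add up to. The function name `classify`, the readings, and the choice of the `statistics` module are illustrative assumptions, not part of the lesson material.

```python
import statistics as stats  # import a library under an alias


def classify(value, threshold):
    """Return "high" if value is above threshold, otherwise "low"."""
    if value > threshold:
        return "high"
    else:
        return "low"


# Made-up readings, used only to exercise the function.
readings = [1.5, 4.0, 2.5]
cutoff = stats.mean(readings)  # refer to the library's contents via the alias

labels = []
for reading in readings:
    labels.append(classify(reading, cutoff))

print(labels)
```

The function takes a fixed number of parameters and returns a single result, which matches the scope stated above.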
Concepts
I know…
…that a program is a piece of lab equipment that implements an analysis
Needs to be validated/calibrated before/during use
Makes analysis reproducible, reviewable, shareable
…that programs are written for people, not for computers
Meaningful variable names
Modularity for readability as well as re-use
No duplication
Document purpose and use
…that there is no magic: the programs I use are no different
in principle from those I build
…how to assign values to variables
…what integers, floats, strings, NumPy arrays, and Pandas dataframes are
…how to trace the execution of a for loop
…how to trace the execution of if/else statements
…how to create and index lists
…how to create and index NumPy arrays
…how to create and index Pandas dataframes
…how to create time series plots
…the difference between defining and calling a function
…where to find documentation on standard libraries
…how to find out what else scientific Python offers
Stage 3: Learning Plan
Summative Assessment
Midpoint: create time-series plot for each file in a directory.
Final: extract data from Pandas dataframe
and create comparative multi-line time series plot.
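One possible shape for a solution to the midpoint assessment is sketched below. It assumes each CSV file has a `country` column plus columns named like `gdpPercap_1952`; that layout, the function name `plot_each_file`, the file names, and the numbers in the demo are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_each_file(directory):
    """Save one time-series plot per CSV file in directory; return the image paths."""
    saved = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        data = pd.read_csv(csv_path, index_col="country")
        # Assumed column naming: "gdpPercap_1952" becomes 1952 so the x-axis shows years.
        data.columns = data.columns.str.replace("gdpPercap_", "", regex=False).astype(int)
        ax = data.T.plot()  # transpose: years become rows, one line per country
        ax.set_xlabel("Year")
        ax.set_ylabel("GDP per capita")
        png_path = csv_path.with_suffix(".png")
        ax.figure.savefig(png_path)
        plt.close(ax.figure)
        saved.append(png_path)
    return saved


# Demo with placeholder numbers (not real gapminder values).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "region_a.csv").write_text(
    "country,gdpPercap_1952,gdpPercap_1957\nAland,10.0,12.0\nBorduria,20.0,21.0\n"
)
(demo_dir / "region_b.csv").write_text(
    "country,gdpPercap_1952,gdpPercap_1957\nCarpathia,30.0,33.0\n"
)
images = plot_each_file(demo_dir)
```

The final assessment would build on the same pieces, selecting a few rows from one dataframe before plotting them together.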
Select entire rows or entire columns from a dataframe.
Select a subset of both rows and columns from a dataframe in a single operation.
Select a subset of a dataframe by a single Boolean criterion.
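The three selection objectives above can be sketched as follows. The country names, column names, and values are placeholders chosen for illustration, not real gapminder data.

```python
import pandas as pd

# Placeholder values only, not real gapminder figures.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [10.0, 20.0, 30.0],
        "gdpPercap_1982": [15.0, 25.0, 35.0],
        "gdpPercap_2007": [18.0, 40.0, 50.0],
    },
    index=pd.Index(["Aland", "Borduria", "Carpathia"], name="country"),
)

# An entire row, selected by its index label.
row = data.loc["Borduria"]

# An entire column, selected by its name.
column = data["gdpPercap_1982"]

# A subset of rows and columns in a single operation.
subset = data.loc["Aland":"Borduria", "gdpPercap_1982":"gdpPercap_2007"]

# Rows that satisfy a single Boolean criterion.
above_30 = data[data["gdpPercap_2007"] > 30.0]
```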
Challenges: 15 min
Write an expression to find the GDP per capita of Serbia in 2007.
What rule governs what is (or isn’t) included in numerical and named slices in Pandas?
What does each line in the following short program do?
What do idxmin and idxmax do?
Write expressions to get the GDP per capita for all countries in 1982,
for all countries after 1985,
etc.
Given the way its borders have changed since 1900,
what would you do if asked to create a table of GDP per capita for Poland
for the twentieth century?
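As a hedged sketch of what filled-in answers to the code-based challenges above might look like, the snippet below uses a small placeholder table. The country names, values, and the `gdpPercap_<year>` column naming are assumptions; the same expressions would apply to the real data with real labels such as "Serbia".

```python
import pandas as pd

# Placeholder values, not real gapminder figures.
data = pd.DataFrame(
    {
        "gdpPercap_1982": [15.0, 25.0, 35.0],
        "gdpPercap_1987": [16.0, 45.0, 38.0],
        "gdpPercap_2007": [18.0, 50.0, 40.0],
    },
    index=pd.Index(["Aland", "Borduria", "Carpathia"], name="country"),
)

# A single value: row label first, then column label.
single_value = data.loc["Borduria", "gdpPercap_2007"]

# Positional slices exclude the end; label slices include it.
by_position = data.iloc[0:2]              # rows 0 and 1 only
by_label = data.loc["Aland":"Carpathia"]  # all three rows

# idxmin and idxmax return, for each column, the index label of the
# smallest and largest value.
lowest = data.idxmin()
highest = data.idxmax()

# All countries in 1982, then all countries after 1985.
all_1982 = data["gdpPercap_1982"]
years = data.columns.str.replace("gdpPercap_", "", regex=False).astype(int)
after_1985 = data.loc[:, years > 1985]
```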