
This page is a (revisited) mirror page published on August 4, 1997 by marc.meurrens@acm.org (http://homepages.ulb.ac.be/~meurrens).
Original URL: http://wizard.ucr.edu/~nagler/coding_style.html, by Jonathan Nagler (nagler@wizard.ucr.edu)
Coding Style and Good Computing Practices
http://wizard.ucr.edu/~nagler/coding_style.html
by Jonathan Nagler (nagler@wizard.ucr.edu)
The Political Methodologist, Spring 1995, Volume 6, No. 2.
This article appeared in The Political Methodologist, Spring 1995, Volume 6, No. 2.
I thank Charles Franklin, Bob Hanneman, Gary King, and Burt Kritzer for useful
comments and suggestions. Jonathan Nagler is associate professor of political
science at the University of California, Riverside and can be reached at
nagler@wizard.ucr.edu
Recently The Political Methodologist presented the case for
replication. Replication depends
upon individual researchers being able to explain exactly what they have done.
And being able to explain exactly what one has done requires keeping good
records of it. This article describes basic good computing practices. It
contains advice for writing clear code, as well as advice on general
computing practices to maintain. The goals are simple. First, the
researcher should be able to replicate his or her own work six hours
later, six months later, and even six years later. Second, others should
be able to look at the code and understand what was being done (and
preferably why it was being done). In addition, following good computing
practices encourages the researcher to maintain a thorough grasp of what
is being done with the data, and thus makes it easier to perform
additional analyses. Good coding allows for more efficient research. One
is not always re-reading one's own work and retracing one's own steps to
perform the smallest bit of additional analysis.
The sequence this article is written in leads the reader from more general themes to more specific ones. I encourage readers who don't have the patience for the big picture to skip ahead rather than skip the whole article. Even learning basic conventions about variable names will put you ahead of most coders! And this article is not meant only for sophisticated statistical researchers. In fact the statistical procedures you will ultimately use have absolutely nothing to do with the topic of this article. These practices should be used even if you are doing nothing more complex than producing 2x2 tables with a particular data-set.
First, what do I mean by computing practices and `code'? Computing practices covers everything you do from the time you open the codebook to a dataset, or begin to enter your own data, to the time you produce the actual numbers that will be placed in a table of an article. By code I mean the actual computer syntax (or computer program) used to perform the computations. This most likely means the set of commands issued to a higher-level statistics package such as SAS or SPSS. I will refer to a given file of commands as a `program.' Most political scientists do not think of themselves as computer programmers. But when you write a line of syntax in SAS or SPSS, that is exactly what you are doing: programming a computer. It is coincidental that the language you use is SAS instead of Fortran or C. The paradox is that most political scientists are not trained as computer programmers, and so they never learned basic elements of programming style, nor are they practiced in the art of writing clean, readable, maintainable code. The classic text on programming style remains Kernighan and Plauger, The Elements of Programming Style (1974), and most books on programming include a chapter on style. I recommend including a section on coding style in every graduate methods sequence.
This article starts from the point when a raw data set exists somewhere on a computer. It breaks analysis down into two basic parts: coding the data to put it into a useable form, and computing with the data. The first part can be broken down into two component steps: reading the essential data from a larger data-set; and recoding it and computing new variables to be used in data analysis.
Here are the basic points that will be covered below; at the end of the article I list the set of `rules' laid out.
Date: Oct 11, 1994
File: mnp152.g
Author: JN
Purpose: Analysis - this file does multinomial probits on our basic model.
Results: The results were used in Table 1 of the Midwest paper.
Machine: Run on billandal (IBM/RS6000).
It is a good idea to have a template that you follow for each labbook entry; this encourages you to avoid getting careless in your entries. The above template includes: Date, File, Author, Purpose, Results, and Machine. You might have a set of `Purposes' (Re-Coding, Data-Extraction, Data-Analysis) that you feel each file fits into. It may seem superfluous to indicate what machine the file was executed on. But should you develop the habit of computing on several machines, or should you move from one machine to another in the course of a project, this information becomes invaluable in making sure you can locate all of your files.
It makes a lot of sense to have the labbook online. It can be in a WordPerfect file, a plain ASCII text file, or whatever you are most comfortable writing in. First, it is easy to search for particular events if you remember their names. Second, it can be accessed by more than one researcher if you are doing joint work.
Rule: Maintain a labbook from the beginning of a project to the end.
Labbooks should provide an `audit-trail' of any of your results. The labbook should contain all the information necessary to take a given piece of data analysis and trace back where all of the data came from, including its original source and all recode steps. So while there are many different sorts of entries you might want to keep in a labbook, and many different styles in which to keep them, the central point to keep in mind is that whatever style you choose must meet this purpose.
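One way to hold yourself to the labbook template is to let a small program refuse incomplete entries. The following is only a sketch in Python rather than a statistics package; the function names and the idea of appending to a plain-text file are my own assumptions, and the example values are taken from the entry shown earlier.

```python
# Field order follows the labbook template described above. The function
# names are illustrative assumptions, not part of any statistics package.
LABBOOK_FIELDS = ("Date", "File", "Author", "Purpose", "Results", "Machine")


def format_labbook_entry(entry):
    """Return one labbook entry as text, refusing incomplete entries."""
    missing = [field for field in LABBOOK_FIELDS if not entry.get(field)]
    if missing:
        raise ValueError("labbook entry is missing: " + ", ".join(missing))
    lines = [f"{field}: {entry[field]}" for field in LABBOOK_FIELDS]
    # A blank line separates this entry from the next one.
    return "\n".join(lines) + "\n\n"


def append_labbook_entry(path, entry):
    """Append a formatted entry to a plain-text labbook file."""
    with open(path, "a", encoding="utf-8") as labbook:
        labbook.write(format_labbook_entry(entry))


example = {
    "Date": "Oct 11, 1994",
    "File": "mnp152.g",
    "Author": "JN",
    "Purpose": "Analysis - this file does multinomial probits on our basic model.",
    "Results": "The results were used in Table 1 of the Midwest paper.",
    "Machine": "Run on billandal (IBM/RS6000).",
}
print(format_labbook_entry(example), end="")
```

Because entries are only ever appended, the file keeps the chronological order that an audit trail needs.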
You generally want a large set of comments at the beginning of each file indicating what the file is intended to do. And each file should not do too much. Comments on the top of a file should list the following:
\* File-Name: mnp152.g
Date: Feb 2, 1994
Author: JN
Purpose: This file does multinomial probits on our basic model.
Data Used: nes921.dat (created by mkasc1c.cmd)
Output File: mnp152.out
Data Output: None
Machine: billandal (IBM/RS6000)
*\
You could keep a template with the fields above left blank, and read in
the template to start each new command file. You should treat the
comments at the top of a file the way you would the notes on a table; they
should allow the file to stand alone and be interpreted by anyone opening
the file without access to other files.
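If you keep many command files, it may help to check mechanically that each one begins with a complete header. The sketch below, in Python, looks for the fields of the header shown above near the top of a file. The checker is a hypothetical helper of my own, and the 30-line limit is an arbitrary assumption.

```python
import re

# Fields taken from the header example above. The checker itself is a
# hypothetical helper, not something any statistics package provides.
HEADER_FIELDS = ("File-Name", "Date", "Author", "Purpose", "Data Used",
                 "Output File", "Data Output", "Machine")


def missing_header_fields(text, max_lines=30):
    """Return the header fields not found in the first lines of a file."""
    head = "\n".join(text.splitlines()[:max_lines])
    missing = []
    for field in HEADER_FIELDS:
        # A field counts as present only if some text follows the colon
        # on the same line; leading comment markers are skipped.
        pattern = r"^\W*" + re.escape(field) + r":[ \t]*\S"
        if not re.search(pattern, head, flags=re.MULTILINE):
            missing.append(field)
    return missing


sample = r"""\* File-Name: mnp152.g
Date: Feb 2, 1994
Author: JN
Purpose: This file does multinomial probits on our basic model.
Data Used: nes921.dat (created by mkasc1c.cmd)
Output File: mnp152.out
Data Output: None
Machine: billandal (IBM/RS6000)
*\
"""
print(missing_header_fields(sample))  # prints []
```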
voted in 1992, not voted in 1988 = 1
voted in 1992, voted in 1988 = 0
Rule: Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
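The new-voter coding above can be written so that the code reads like the verbal description. This is a minimal sketch in Python, not SST syntax. Treating respondents who did not vote in 1992 as missing is my assumption, since the coding above defines only the two cases among 1992 voters.

```python
def new_voter(voted_1992, voted_1988):
    """Code the new-voter dummy for one respondent.

    Returns 1 if the respondent voted in 1992 but not in 1988, 0 if the
    respondent voted in both years, and None (missing) otherwise.
    """
    if voted_1992 is None or voted_1988 is None:
        return None  # unknown turnout stays missing
    if not voted_1992:
        return None  # assumption: 1992 non-voters fall outside the coding
    return 0 if voted_1988 else 1


print(new_voter(True, False))  # prints 1
print(new_voter(True, True))   # prints 0
```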
Rule: Errors in code should be corrected where they occur and the code re-run.
Rule: Separate tasks related to data-manipulation vs data-analysis into separate files.
Thus if you will be producing some tables before multivariate analysis, you might have a series of programs: descrip1.cmd, descrip2.cmd, ..., descrip9.cmd. Following this, you might produce: reg1.cmd, reg2.cmd, ..., reg99.cmd. You need not constrain yourself to one regression per file. But the regressions in each file should constitute a coherent set. For instance, one file might contain your three most likely models of vote-choice, each disaggregated by sex. This does tend to lead to a proliferation of files. One can start with reg1.cmd and finish with reg243.cmd. But disk space is cheap these days, and the files can easily be compressed and stored on floppies if things are getting tight.
Rule: Each program should perform only one task.
Rule: Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
Rule: Use a consistent style regarding lower and upper case letters.
When possible a variable name should reveal subject and direction. The simplest case is probably a dummy variable for a respondent's gender; imagine it is coded so that 0 = men, 1 = women. We could either call the variable `SEX', or `WOMEN.' It is pretty clear that `WOMEN' is the better name since it indicates the direction of the variable. When we see our coefficients in the output we won't have to guess whether we coded men=1 or women=1.
Rule: Use variable names that have substantive meaning.
Rule: Use variable names that indicate direction where possible.
Similarly, value labels are very useful things for packages that permit them. The examples of computer syntax I use in this article are written in SST (Dubin/Rivers 1992), but they can be easily translated into SAS, SPSS, or most statistical packages. Here is a simple example. The variable natlecr indicates the respondent's retrospective view of performance of the national economy. Notice that the variable name can only indicate so much information in 8 characters. But its label and its value labels tell us what we need. And the fact that the label tells us where to look the variable up in the code-book is further protection.
label var[natlecr] lab[v3531:national economy - retro] \
val[1 gotbet 3 same 5 gotworse]
Some people using NES data (or any data produced by someone else
accompanied by a codebook) follow the convention of naming the
variable by its codebook number (e.g., V3531), and using labels for
substantive meaning. I think this is a poor practice. Consider which of
the following statements is easier to read:
logit dep[preschc] ind[one educ women partyid]
or:
logit dep[V5609] ind[one V3908 V4201 V3634]
The codebook name for the variable should definitely be
retained; but it can be retained in the label statement. Without
the codebook name one would not know which of the several
party-identification variables the variable partyid refers to.
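The same idea, substantive names for analysis with the codebook number kept in the label, can be sketched in Python. The pairing of names and codebook numbers below is read off the two logit statements above by position, so treat it as illustrative. The helper names and the sample record are hypothetical.

```python
# Pairing of substantive names and codebook numbers, read off the two
# logit statements above by position; treat it as illustrative only.
CODEBOOK = {
    "preschc": "V5609",
    "educ": "V3908",
    "women": "V4201",
    "partyid": "V3634",
}


def rename_from_codebook(record, codebook=CODEBOOK):
    """Return a copy of a record keyed by substantive variable names."""
    renamed = {}
    for name, number in codebook.items():
        if number not in record:
            raise KeyError(f"{number} ({name}) not found in record")
        renamed[name] = record[number]
    return renamed


def variable_label(name, description, codebook=CODEBOOK):
    """Build a label that keeps the codebook number in front."""
    return f"{codebook[name]}:{description}"


raw = {"V5609": 1, "V3908": 4, "V4201": 0, "V3634": 2}  # made-up values
print(rename_from_codebook(raw))
print(variable_label("partyid", "party identification"))
```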
To someone who does not know the syntax of SST, it might be completely opaque that the following code saves all observations of the variables preschc ... deficit for which all of the variables contain a valid response. However, the spacing can be a big help towards figuring this out.
rem *******************************************************************
rem *******************************************************************
recode var[preschc educ east south west \
           women respfinp natlec resplib bclibdis gblibdis rplibdis \
           dem rep respgjob resphlth respblk respab \
           age1829 age3044 age4559 newvoter termlim deficit] map[md=-9]
write var[preschc educ east south west \
          women respfinp natlec bclibdis gblibdis rplibdis \
          dem rep respgjob resphlth respblk respab \
          termlim age1829 age3044 age4559 newvoter deficit] \
      file[nes9212r.asc] \
      if[(preschc!=-9)&(educ!=-9)&(east!=-9)&(south!=-9)&(west!=-9)& \
         (women!=-9)&(respfinp!=-9)&(natlec!=-9)&(bclibdis!=-9)& \
         (gblibdis!=-9)&(rplibdis!=-9)& \
         (dem!=-9)&(rep!=-9)&(respgjob!=-9)&(resphlth!=-9)& \
         (respblk!=-9)&(respab!=-9)&(termlim!=-9)& \
         (age1829!=-9)&(age3044!=-9)&(age4559!=-9)&(newvoter!=-9)& \
         (deficit!=-9)]
rem *******************************************************************
rem *******************************************************************
Kernighan and Plauger (1974) suggest the telephone test, perhaps a bit anachronistic since you will more likely email the code than read it to someone, but useful nonetheless. Read your code to someone over the phone. If they can't understand it, try writing the code again.
Rule: Use appropriate white-space in your programs, and do so in a consistent fashion to make them easy to read.
Remember though, the comments should add to the clarity of the code. Don't put a comment before each line repeating the content of the line. Put comments in before specific blocks of code. Only add a comment for a line where the individual line might not be clear. And remember, if the individual line is not clear without a comment - maybe you should rewrite it.
Rule: Include comments before each block of code describing the purpose of the code.
Rule: Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
Rule: Rewrite any code that is not clear.
Following is a case where a single comment lets us know what is going on:
rem *******************************************************************
rem Create party-id dummy variables
rem *******************************************************************
rem Missing values are handled correctly here by SST.
rem In other statistics packages these three variables might
rem have to be initialized as missing first.
set dem = (pid < 3)
set ind = (pid == 3)
set rep = (pid > 3)
rem *******************************************************************
rem *******************************************************************
Now here is a case where a longer comment is essential:
rem *******************************************************************
rem Create ideological-distance variables
rem *******************************************************************
rem Ideological distance is computed as the distance between the
rem respondent, and the mean placement of the candidate by all
rem respondents used in our multivariate analysis who could place
rem the candidate.
rem The mean-values used here are generated by making a first pass
rem at the data set for the subsample we will use here. See the code
rem at the end of this file for that pass.
rem This hard-wiring of numbers is poor style; but it is very
rem difficult to automate this given the way in which SST handles
rem missing data.
set gblibdis = (resplib - 5.32)^2
set bclibdis = (resplib - 2.98)^2
set rplibdis = (resplib - 4.49)^2
rem *******************************************************************
rem *******************************************************************
Most programmers think that well-written code should be self-documenting. This is partly true. But no matter how well-written your code is, some comments can make it much clearer.
In some statistics packages you may be best served by initializing all new variables as missing data, and allowing them to become legitimate values only when they are assigned a legitimate value. The best advice is to recode and create new variables defensively.
Rule: Verify that missing data is handled correctly on any recode or creation of a new variable.
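Here is what defensive creation of the party-id dummies might look like outside SST. It is a sketch in Python. The cut points come from the SST example earlier, but the 0 to 6 range for a valid pid score is my assumption about how the variable is scored, and None stands in for the package's missing-data code. The frequency count at the end is the kind of check worth running after every recode.

```python
from collections import Counter

MISSING = None  # stand-in for the package's missing-data code


def party_dummies(pid):
    """Return (dem, ind, rep) for one respondent's party-id score."""
    # Start every new variable as missing ...
    dem = ind = rep = MISSING
    # ... and assign real values only when the input is a valid score.
    # Assumption: valid scores run from 0 to 6.
    if pid is not MISSING and 0 <= pid <= 6:
        dem = int(pid < 3)
        ind = int(pid == 3)
        rep = int(pid > 3)
    return dem, ind, rep


# Made-up scores, including a missing value and an out-of-range code.
pids = [0, 3, 6, None, 9, 2]
dems = [party_dummies(pid)[0] for pid in pids]

# Frequencies of the new variable, with missing counted separately.
freq = Counter(dems)
print(freq[1], freq[0], freq[MISSING])  # prints 2 2 2
```

Without the range check, the out-of-range code 9 would quietly be counted as a Republican.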
Rule: After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
Generally it is poor style to `hard-wire' values into your code. Any specific values are likely to change when some related piece of code somewhere else is altered or when the data-set changes. The ideological-distance variables in the comment example above are a case where values are hard-wired into the code. This is a case where a choice had to be made. The values 5.32, 2.98, and 4.49 represent means of candidate placement by a selected set of respondents. Computing these with the means command over the appropriate set of respondents and then doing the assignment was complicated enough that, rather than write code that could not be simple, it was decided to hard-wire the values. If the following code produced the desired results (it does not), it would be far preferable:
rem *******************************************************************
rem Create ideological-distance variables
rem *******************************************************************
set gblibdis = (resplib - mean(bushlib))^2
set bclibdis = (resplib - mean(clinlib))^2
set rplibdis = (resplib - mean(perolib))^2
rem *******************************************************************
rem *******************************************************************
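In a general-purpose language the automated version is not hard to write. The sketch below, in Python, computes the mean placement from the data instead of hard-wiring it. The variable names follow the SST example, the data values are made up, and the caller is assumed to have already restricted the placements to the subsample used in the analysis.

```python
def mean_valid(values):
    """Mean of the non-missing values, or None if there are none."""
    valid = [value for value in values if value is not None]
    return sum(valid) / len(valid) if valid else None


def squared_distances(resplib, placements):
    """Squared distance from each respondent to the mean placement."""
    center = mean_valid(placements)
    if center is None:
        raise ValueError("no valid candidate placements")
    # A respondent with no self-placement keeps a missing distance.
    return [None if r is None else (r - center) ** 2 for r in resplib]


# Made-up data: None marks a respondent who could not give a placement.
resplib = [4, None, 6]
bushlib = [5, 6, None, 4]
gblibdis = squared_distances(resplib, bushlib)
print(gblibdis)  # prints [1.0, None, 1.0]
```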
Rule: When possible, automate things and avoid placing hard-wired values (those computed `by-hand') in code.
Finally, after you have done recodes and created new variables it is a good idea to list all of the variables. This way you can confirm that you and your statistics package agree on what data is available, and how many observations are available for each variable. In SST this would be done with a list command; in SAS, PROC CONTENTS will produce a clean list of variables. Most statistics packages offer similar commands.
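Where no such command is available, the variable listing described above is easy to approximate. This Python sketch counts the non-missing observations for each variable. The row layout (one dict per respondent, with None for missing) and the data values are my own assumptions.

```python
def valid_counts(rows):
    """Count non-missing observations for each variable.

    rows is a list of dicts, one per respondent, with None for missing.
    """
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is not None:
                counts[name] += 1
    return counts


rows = [
    {"women": 1, "educ": 3},
    {"women": None, "educ": 2},
    {"women": 0, "educ": None},
]
for name, count in valid_counts(rows).items():
    print(f"{name}: {count} of {len(rows)} observations valid")
```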
Most people are in a huge hurry when they write their code. They are either excited about getting the results and want them as fast as possible, or they figure the code will be run once and then thrown out. If your program is not worth documenting, it probably isn't worth running. The time you save by writing clean code and commenting it carefully may be your own.
published by marc.meurrens@acm.org
( http://homepages.ulb.ac.be/~meurrens)
original URL: http://wizard.ucr.edu/~nagler/coding_style.html
published in January 1995 by Jonathan Nagler (nagler@wizard.ucr.edu)
current URL: http://www.ulb.ac.be/esp/ip-Links/Java/joodcs/Nagler.html