
I teach political economy and statistics at a large public university in the US. You can find my academic website here.

The Waste Book very occasionally collects my passing thoughts on politics, economics, statistics, data visualization, life, culture, and everything.

We aim for funny, will settle for intriguing, and, the times and the Internet being what they are, resign ourselves to a certain amount of bemoaning.

cadolph@thewastebook.com




computing · March 20, 2012 · comments

Social Science Computing for the Mac in 15 Steps and $29

by Chris Adolph

as a uantitative social scientist, I use software every day that many people have never even heard of. At the same time, I don’t even have a copy of Microsoft Word or PowerPoint on my laptop. As you might imagine, just finding, installing, and configuring scientific software can be quite a distracting quest. Students embarking on quantitative social science careers may be interested in how my own computing environment is set up, so they can get on with the business of using these tools instead of searching for them.

I’ll assume you’re a Ph.D. student in political science, sociology, economics, statistics, or a cognate discipline, and that you want to develop solutions for at least a few of the following tasks: modern statistical computing, editing code, typesetting scientific papers, making lecture slides, producing publication quality scientific graphics, and developing websites. I will also assume that you use a Mac. Most of the applications below exist for Windows in some form, and many exist for Linux/Unix, but I will leave advice for finding these resources to another day and to commenters.

Where do you get started? I’ll begin with step-by-step recommendations for essentials, and then follow with some optional extras. Naturally, you should feel free to pick and choose: I wouldn’t expect anyone to install and learn all of these tools in a day, or even a single month. It took me much longer than that, so pace yourself. And if you have your own suggestions, alternatives, or corrections, feel free to leave a comment.



1.   Essentials
1.1  Statistics with R
1.2  Typesetting with LaTeX
1.3  Editing with Emacs

2.   Extras
2.1  Editing vector graphics
2.2  Web development
2.3  “Productivity” software
2.4  Internet software
2.5  Bayesian methods
2.6  Text processing

3.   Getting more out of LaTeX
3.1  Making slides with LaTeX
3.2  Papers with LaTeX and XeTeX
3.3  Cool fonts in XeTeX and R

4.   Some advice
4.1  Backing up
4.2  File systems
4.3  The desktop


1.  Essentials


1.1 Statistics with R. This powerful, free, and ubiquitous statistical computing environment is your ticket to cutting edge statistical skills. And while it’s not perfect, R is clearly the best game in town for social scientists aspiring to make the most of their data and willing to learn how to write a little bit of code. R is open-source, and uncounted statistical scientists are making it better by adding their code to it as you read this. As a result, the gap between the capabilities of R and other statistics packages will most likely continue to grow, so don’t get left behind with another stat package!

To install R on your machine, simply download and install the latest binaries from CRAN. After you install R, you will notice that you can add new packages – that is, new tools – to R by downloading them from CRAN or from elsewhere on the web. My students should go ahead and install two packages, tile and simcf, that are available from my webpage; see my software page for the package binaries and installation instructions.

The best way to learn R is through a statistics course or two. When a beginner hits an “R problem”, much of the time R is doing exactly what was asked, but what was asked amounts to an incoherent application of statistics. Less flexible statistics packages force you to do statistics right (or at least not creatively wrong), but also force you to do it a certain (possibly inappropriate or suboptimal) way. The more flexible R lets you do almost anything, but also reveals, through unfortunately cryptic error messages, whether you know what you are doing. That said, for new users with a solid statistics background, there are at last some good introductory and intermediate R books now on the market.


1.2 Typesetting with LaTeX. If your papers include graphics and math, you will quickly discover that Microsoft Word makes your graphics look blurry or even squashed, and that typesetting equations in Microsoft products leads to daydreams of whether the appropriate collective noun for mouse clicks is a “frustration”. There is a better way, but you’ll need to leave the safety of WYSIWYG word processing behind. Most scientists use a typesetting system known as LaTeX to produce beautiful papers with clear mathematics and razor-sharp vector graphics. LaTeX is a language, though to use it you won’t need to learn as many commands as you will for R. For the most part, you can let LaTeX format things the way it wants to by default, and you will get a more readable document than Word could ever produce. With experience, and a little coding, you can also make LaTeX produce fully typeset books ready to go to the printer.

LaTeX is free, and like R, open-source. To get started, you should install the MacTeX system, including TeXLive, available here. This will take a while – there’s a lot to install. And after the 1.7 gigabyte installation is complete, you are in for a surprise. You might expect that LaTeX consists of a GUI that lets you edit files, but that’s not the case. Rather, LaTeX is an engine that you run on your marked-up “document” to create a beautifully typeset pdf file. What you edit – your “document” – is the .tex file, a plain text file in which your paper’s text is intermixed with the LaTeX commands to format it into a final product. After you finish editing the document, you run the LaTeX program on it, and out pops your pdf.

The fastest way to learn LaTeX is to read “The Not So Short Introduction to LaTeX”. You will also want to learn about BibTeX, which will manage your bibliography for you, and the natbib package, which provides easy commands for adding in-text citations. (LaTeX, like R, can add to itself through packages, and niftily, the full MacTeX distribution already includes nearly every package you are likely to need, so most packages you request – through a \usepackage{mypackage} command in your preamble – will be available without any extra installation. So you already have BibTeX and natbib available, if you’ve installed LaTeX.)
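To make the idea of a plain-text “document” concrete, here is a minimal sketch in Python that builds the source of a tiny .tex file using natbib. The citation key hypothetical2012 and the bibliography file refs.bib are invented for illustration; you would need a real refs.bib containing that key before the citation resolves.

```python
# Build the plain-text source of a very small LaTeX document.
# The citation key and the refs.bib file named below are hypothetical.

TEMPLATE = r"""\documentclass{article}
\usepackage{natbib}

\title{@TITLE@}
\author{@AUTHOR@}

\begin{document}
\maketitle

@BODY@

\bibliographystyle{plainnat}
\bibliography{refs}
\end{document}
"""


def minimal_tex(title, author, body):
    """Return LaTeX source with the title, author, and body filled in."""
    return (
        TEMPLATE.replace("@TITLE@", title)
        .replace("@AUTHOR@", author)
        .replace("@BODY@", body)
    )


source = minimal_tex(
    "A Short Paper",
    "A. Student",
    r"Prior work \citep{hypothetical2012} motivates this paper.",
)
print(source)
```

Saving the printed text as, say, paper.tex and compiling it from Aquamacs is then the “run LaTeX on your document” step described above.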


1.3 Editing with Emacs. To use R and LaTeX, you’ll need to edit code. You can’t use a word processor to write code, though: it will try to helpfully “correct” your spelling, capitalization, and punctuation, and will save your work in a proprietary format nothing else can read, which makes it awfully hard for other packages to “run” your program. Instead, to write code, you need a good text editor: one that will automatically color-code and indent your script in accordance with the conventions of the language you are using (whether it is R, LaTeX, HTML, or something else), one that will let you easily flip between lots of different text files in a tabbed environment, and one that won’t ever crash and take your files with it. You have several choices of text editors, but my favorite is Emacs, available for Mac in the elegant Aquamacs implementation.

Aquamacs is free (notice a pattern?), and easy to install. If you save your R scripts with a .r or .R extension, and your LaTeX files with a .tex extension, Aquamacs will automatically recognize them for you. You can even run R scripts or blocks of code from Aquamacs (which I don’t usually bother with), and you can compile your LaTeX files directly from Aquamacs’ LaTeX menu (which is amazingly handy).

Emacs-based editors have lots of features, but even with just a few basics you’ll be up and running. There’s a built in tutorial to get you started (within Aquamacs, press Control-h, then t to access it), and an entire blog devoted to learning Emacs. Finally, here is a handy cheatsheet of Emacs commands.

You can customize Aquamacs in many ways, a few of which I recommend you look into soon. To minimize eyestrain, I like to set Aquamacs to have a black background, white foreground, and Inconsolata text. To do this, you’ll need to place the Inconsolata .otf file in your /Library/Fonts folder and restart Aquamacs. Then to customize Aquamacs’ behavior for each programming mode (R, LaTeX, etc.), while visiting a file of that type, select from the menu Options -> Appearance -> Font for mode and choose your desired options. Of course, the display font is up to you; just be sure to pick a fixed width font to keep the code tidy. Be sure to choose a comfortable font size, which you can adjust later using Command + and Command -. You will notice the optimal size drifts upwards with your age.



2.  Extras


2.1 Editing vector graphics. I’ll assume you already have a pdf viewer like Adobe Acrobat. Thus the most useful graphics program you probably don’t yet have is Adobe Illustrator, with which you can edit any vector graphic (pdf or postscript) to satisfy even the most obsessive heart. Adobe Illustrator is not free (gasp), and is, in fact, quite expensive, even under an academic license. However, UW affiliates can now use Adobe Illustrator remotely through ViDA.

Finally, if you find yourself making postscript files (i.e., graphics in the native language of printers, and the basis for pdfs and much of publishing), you will need a postscript viewer. MacGSview will do the trick, but you’ll need to install ghostscript as well to get it to work (though if you’ve installed MacTeX, you should already have it).


2.2 Web development. Suppose you want to make a webpage to share all the work you're doing with your new tools. You already have the text editor you’ll need to edit code, but what kind(s) of code should you write? You’ll definitely need to learn some html, as that’s the fundamental language of the web, but writing in html only can make for a lot of work: if your website is all html, you will need to write a new file for each page that can be displayed, and you may need to edit a dozen files every time you update your site.

Once you have more coding experience, there are better solutions. php is a language that literally writes html for you: a single, efficiently designed php file can create any number of pages, including pages requested by the user. This weblog, for example, and my academic webpage each consist of a single php file of fewer than 1000 lines of code, coupled, in each case, with a css file, or Cascading Style Sheet. css defines formats that apply to the whole website, and thus allows global changes to the look and feel of the site with a single edit. The self-assembling nature of php, in turn, means that I can focus on updating content, through light-weight text files, and usually don’t even need to look at code when adding an entry to my sites.
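The idea of one file writing many pages can be sketched outside php too. The following is an illustration in Python, not the php code behind this site: a single template is filled in once per entry, so changing the template changes every page. The entry titles, the text, and the style.css filename are all made up for the example.

```python
from html import escape
from string import Template

# One template for every page; style.css is a hypothetical shared stylesheet.
PAGE = Template(
    """<html>
<head><title>$title</title><link rel="stylesheet" href="style.css"></head>
<body>
<h1>$title</h1>
<p>$body</p>
</body>
</html>
"""
)


def render_page(title, body):
    """Fill the template for one entry, escaping characters special to html."""
    return PAGE.substitute(title=escape(title), body=escape(body))


def render_site(entries):
    """Map each entry title to a complete page of html."""
    return {title: render_page(title, body) for title, body in entries.items()}


# Made-up content standing in for light-weight text files.
entries = {
    "First entry": "Some thoughts on R & LaTeX.",
    "Second entry": "Why 2 < 3 matters.",
}
pages = render_site(entries)
print(pages["First entry"])
```

Adding an entry means adding one item to the content, with no new html file to write, which is the labor-saving point made above.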

Aquamacs has editor modes for html, php, and css, so you won’t need any new software. There are tutorials on html, css, and an especially good one on php. You will also want to test php code on your computer’s localhost before uploading it to your webserver; this site will explain how to turn php on in MacOS (it’s actually already part of your system, but “turned off” by default). Finally, if you are one of my students, and would like to use the php code for my webpages as a template, you need only ask.


2.3 “Productivity” software. I said I don’t have a copy of Microsoft Office. Indeed, the only functionality of Office I need is an Excel-like spreadsheet, for basic data manipulation and easy LaTeX table construction. My spreadsheet of choice is OpenOffice, which is free, has been able to do anything Excel ever did for me, and appears to respond more quickly to feedback from statisticians on getting statistical formulas right than a certain Redmond developer. OpenOffice will also let you read the Word documents and PowerPoint slides of other users, though not always faithfully – but really, they should be sending you pdfs!
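For readers who would rather script the LaTeX table step than build it in a spreadsheet, here is a rough Python sketch of the same idea: join the cells of each row with ampersands and end each row with a double backslash. The function name and the numbers are invented for illustration, and the sketch does not escape characters such as & or % that LaTeX treats specially.

```python
def latex_table(header, rows):
    """Format a header and rows of cells as a LaTeX tabular environment."""

    def fmt(cells):
        # Cells are separated by " & " and each row ends with "\\".
        return " & ".join(str(cell) for cell in cells) + r" \\"

    # First column left-aligned, remaining columns right-aligned.
    colspec = "l" + "r" * (len(header) - 1)
    lines = [r"\begin{tabular}{" + colspec + "}", r"\hline", fmt(header), r"\hline"]
    lines += [fmt(row) for row in rows]
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


# Made-up numbers, purely to show the layout.
table = latex_table(
    ["Term", "Estimate", "Std. Error"],
    [("Intercept", 1.25, 0.40), ("Income", 0.03, 0.01)],
)
print(table)
```

The printed text can be pasted directly into a .tex file.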


2.4 Internet software. Aside from your trusty browser (most likely Chrome), you will need an ftp program to move files up to your webserver. Fetch ($29) works so well and so elegantly I consider it well worth the money.

You’ll need one other application, a replacement for MacOS’s clunky terminal. I recommend the free and wonderful iTerm, which has cut-and-paste, tabbed terminal sessions, and other bells-and-whistles. iTerm will make it easier to run local scripts, but its other great use is to remotely connect to computers to run R on a cluster. To do this, you'll probably need to open an ssh connection to your server; I recommend using a separate tab in iTerm for each remote session.


2.5 Simple Bayesian computing with BUGS. If you are exploring Bayesian methods, you may find the WinBUGS package helpful. WinBUGS is a standalone package which implements the Gibbs Sampler for a large class of hierarchical and multilevel regression models, and can be called from R. The downside is that it doesn’t play very well with MacOS. You may want to look into the OpenBUGS project as a (somewhat) more Mac friendly alternative. (If you do, let me know how it goes.)

The best way to learn WinBUGS is through Gelman and Hill’s renowned and readable text.


2.6 Languages for text processing: perl, python, and regular expressions. At some point, you may need to learn a programming language that’s good at working with text data. All such languages rely on so-called regular expressions (regexp) to match, extract, and/or substitute patterns of arbitrary complexity in texts of arbitrary length. Regular expressions are really a language of their own, embedded in other languages, in which a complete program might be 20 characters long – and take half an hour to write. Chapter 5 of the classic Programming Perl is an excellent, comprehensive introduction to regular expressions.
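As a small illustration of the three jobs named above (match, extract, substitute), here is a sketch using python’s built-in re module. The sample sentence and the date pattern are made up for the example.

```python
import re

# A made-up sentence containing two ISO-style dates.
text = "Meetings: 2011-03-20 in Seattle, 2012-03-26 in Chicago."

# Match: is there any date of the form YYYY-MM-DD in the text?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# Extract: pull out every (year, month, day) triple as strings.
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# Substitute: rewrite each date as month/day/year using backreferences.
rewritten = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", text)

print(has_date)
print(dates)
print(rewritten)
```

The parentheses mark the groups to capture, and \1, \2, and \3 in the replacement refer back to them in order.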

Once you’re up to speed with regular expressions, you’ll need a programming language that uses them well and plays with the internet nicely, so you can grab and process text. I use perl, only because I learned it years ago; if I were starting today, I’d use python, which is easier to write. If you are coding a large amount of text which is either available electronically, or susceptible to OCR, allow me to introduce you to your new research assistant.

Mac users already have both languages installed as part of the OS. And if you installed the essentials above, you already have an editor: Aquamacs will treat files that end in .pl as perl scripts, and files that end in .py as python scripts, and format them accordingly. To run one of these scripts, just open an iTerm session, and run perl myprog.pl or python myprog.py while in the folder containing your script.
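For a sense of what a myprog.py might contain, here is one possible sketch: a script that prints the most common words in a text file named on the command line. The word-counting task and the function names are my own choices for illustration.

```python
# myprog.py: print the most common words in a text file.
import re
import sys
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent lowercase words in text, with counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


def main(argv):
    # Expect exactly one argument after the script name: the file to read.
    if len(argv) != 2:
        print("usage: python myprog.py FILE")
        return
    with open(argv[1], encoding="utf-8") as handle:
        for word, count in top_words(handle.read()):
            print(word, count)


if __name__ == "__main__":
    main(sys.argv)
```

From the folder containing the script, you would run it as python myprog.py followed by the name of a text file.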

The aforementioned Programming Perl likely contains all the perl knowledge you will ever need (though there are gentler introductions, if you like). I solicit recommendations in the comments for the best introduction(s) to python.



3.  Getting more out of LaTeX


To add.



4.  Some advice


4.1 Back up daily or hourly. Nothing is more frustrating than losing work, whether that work is data, code, or writing. And eventually, you will lose data. Hard disks die, computers get lost, power surges happen. If you never back up, eventually one of these disasters will erase all your work – maybe years of it. I’ve worked with data for 15 years now, and I back up hourly.

If you keep your work locally on your Mac, then you really should use Time Machine to automate backup. It’s best to back up to a disk connected to your router, or built in (as with Apple’s Time Capsule). External usb drives are cheaper, but in my experience they tend to fail before your computer’s disk ever does, especially if not removed properly every time you back up. I’ve lost more backup usb drives than I can count, and no longer trust them as primary backup devices. That said, they can play a role as a second backup in case your computer and first backup meet a common disaster (a fire, for instance). Just keep a monthly backup on an offsite external drive, and then you’re safe against everything but the kind of disaster that will leave you unconcerned with data.

Even if you keep your data and work files on a server, it’s worth making frequent backups of your own. Even a good server administrator probably backs up no more often than once a day. Why take a chance?
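If you want an extra copy of one project folder on top of Time Machine or a server backup, a few lines of python can make a timestamped zip archive. This is only a sketch of a supplementary snapshot, not a substitute for automated backup; the function name is hypothetical, and the backup folder should sit outside the folder being archived.

```python
import shutil
import time
from pathlib import Path


def snapshot(source_dir, backup_dir):
    """Zip source_dir into backup_dir under a timestamped name.

    Returns the path of the archive that was written.
    """
    source = Path(source_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)

    # A name such as projects-20120326-141500.zip keeps snapshots in date order.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = target / f"{source.name}-{stamp}"

    archive = shutil.make_archive(str(base), "zip", root_dir=str(source))
    return Path(archive)
```

Calling snapshot with a project folder and the path of an external or network drive writes one new archive per call, so older snapshots are never overwritten unless two calls land in the same second.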


4.2 File system structure. Now that you are writing code, you are, without even taking an extra step, creating a permanent archive of your work that you can and will draw on for years to come. This means it’s imperative you choose a file structure now that can expand with your projects and teaching, and which won’t ever confuse you or let you lose work. With LaTeX and R, little files proliferate fast, and you could get lost or accidentally overwrite files if you don’t have a carefully constructed and pruned tree with a separate folder for each task. There are many viable ways to organize your folders, but I offer some tips from my own system to get you started.

In my root directory, I have one folder for course materials, and another for research projects. Within each of these folders, I create a separate subfolder for each course and project, with potentially many sub-subfolders, sub-sub-subfolders, and so on. For example, I might use /courses/mle/lec/topic1 to store the files used to make the first week’s lecture for my Maximum Likelihood course, including the tex file, any example data and R code, and any graphics files for figures.

On the research side, diving deep into my file structure you might find a folder such as /projects/china/revise/runs/arma/pc12/m1, which stores the replication code and results for model 1 specification as run on the 12th Party Congress data using an arma-modeling approach for the revision of a paper I wrote on the Chinese Communist Party’s leadership. This might seem excessively compartmentalized at first, but thinking through a careful set of subfolders saves time and avoids confusion in the long run. Folders with data processing code for the revised paper might go under /projects/china/revise/data; the actual revised paper, and associated bibliography files and final graphics, might go in /projects/china/revise/write. If you set up similar trees for each project you pursue or course you teach, you’ll find you can instantly find even files you haven’t touched for five years or more, which can save a lot of time when you need to cannibalize old code or lecture notes for something new.
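If you would like to set up a tree like this in one go, the following python sketch creates the example folders named above under a root folder of your choice. The list of folders simply repeats the paths from the text, and the function name is hypothetical.

```python
from pathlib import Path

# The example folders from the text, relative to a root you choose.
TREE = [
    "courses/mle/lec/topic1",
    "projects/china/revise/data",
    "projects/china/revise/runs/arma/pc12/m1",
    "projects/china/revise/write",
]


def make_tree(root, folders=TREE):
    """Create each folder (and any missing parents) under root."""
    created = []
    for folder in folders:
        path = Path(root) / folder
        # exist_ok=True makes the call safe to repeat on an existing tree.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_tree on an empty folder builds the whole hierarchy; calling it again later with a longer list adds new branches without touching the existing ones.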


4.3 The desktop. Once upon a time, personal space meant a place of rest, relaxation, or hygiene. Today, when the internet and work intrude on every moment, there’s no more private or personalized place than one’s computer desktop. If you don’t believe me, just sit down at someone else’s laptop and try to be as productive as you are on your own machine, where every program and window is just how you like it. Better still, try this experiment without asking permission first!

I expect you will rightly ignore any advice I have to give on setting up your desktop. Still, in case I can inspire improvement of some kind, I offer my own setup.

The most useful feature of my work environment is the wondrous Hyperspaces, which runs on Snow Leopard and earlier Mac OSes, and allows you to create a tiled arrangement of workspaces for different tasks, with specific applications zoned to each workspace. I have 8 spaces, in a 4 by 2 arrangement, with the following applications attached to each space:


1. Communications            2. Web
   Thunderbird, Skype           Chrome

3. Editing                   4. Statistics
   Aquamacs, OpenOffice         R

5. System                    6. Media
   Finder, Activity Monitor     iTunes, QuickTime

7. Remote                    8. Library
   iTerm, Fetch                 Acrobat

The key to using these spaces is the Exposé feature of Mac OS (prior to Lion), which lets you zoom out to see all the windows open in a specific space. Thus if I’m running a program in R in the Statistics space, and need to peek at some data I have open in the Editing space, I just zip over by pressing Command-Left Arrow. Alternatively, I could click OpenOffice in my Toolbar, which I keep on the left of the screen as vertical space is at a premium on a laptop. Once in my new space, I press F3 to have Exposé show all my open spreadsheets, then click the one I want to bring it to the foreground. When I’m finished, I click the R icon in the Toolbar to go back to my Statistics desktop. If a new email pops up during all this back-and-forth, I can go check it out in the Communications space (by clicking the Thunderbird icon or pressing Command-Up Arrow) without losing my place in any other workspace. Every space has a different desktop image and a title in the menu bar, so even switching desktops at high speed, I know where I am. And all this happens without ever needing to move or resize a window.

I would highly recommend Hyperspaces, but for a catch: Lion’s Mission Control attempts to replicate it, but mostly breaks its functionality and efficiency, as part of Apple’s unfortunate decision to make its heavy-duty laptops function more like cell phones.



Last updated March 26, 2012. If that was a long time ago, you might find some broken links above. Feel free to let me know if you do…


tags: technical



This page is privately hosted. Its content reflects my views and the views of persons quoted or commenting, and not those of any other individuals or groups. I reserve the right to identify and delete obscene, bigoted, disruptive, threatening, or uninteresting comments.

Content © 2011–4
Chris Adolph

Artwork © 2011–4
Erika Steiskal

Jefferson (2007-2011)