* Date: 2.1.18, Updated: 2.2.18
* Name: Taylor Carlson
* Purpose: Introduction to Stata for POLI 30
* Data: Example Dataset anes2016.dta
* Welcome to Stata! This document is called a do file. It lets you write,
* save, and execute the code you use to analyze data in Stata.
* Before we begin, I want to point out a few things:
* 1. Note the header at the top of this file that includes the date the file
* was started and updated, name of who created the file, and the purpose of
* the file. This is really important! It will help keep your code organized
* and help you know what the goal of your do file was.
* 2. You'll notice a lot of asterisks! An asterisk at the start of a line
* marks that line as a comment. Stata ignores comment lines instead of
* interpreting them as code. I strongly encourage you to comment
* your code so that you know what you were trying to accomplish with a
* given command. It will make things easier as you come back to it, AND it
* will help you learn!
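* A short sketch of the other comment styles Stata accepts in a do file, in
* addition to starting a line with an asterisk. The display lines are only
* illustrations and are not part of the analysis.
display "Hello, POLI 30"   // two slashes: Stata ignores the rest of the line
/* Slash-star opens a comment that can run
   across several lines until star-slash closes it */
display 2 + 2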
* In this tutorial, you will learn how to:
* (A) Open a dataset (I will be using the anes2016.dta and then you can use
* a dataset of your choice)
* (B) Use some basic "point and click" functions in Stata
* (C) Visually inspect your data
* (D) Rename variables
* (E) Examine the distribution of a variable
* (F) Generate summary statistics, such as mean, median, and standard
* deviation
* (G) Create graphs of variables: histogram, scatterplot, barplot
* (H) Recode variable values as missing data
* PART (A): Opening Datasets
* When you open Stata, it is a good idea to clear out your previous work
* so that you are starting fresh. You can do so with the clear command
clear
*load datafile
use "/Users/Taylor/Desktop/anes2016.dta"
* Note, you can do this by clicking the "Open" button at the top of your
* Stata window and then browsing to the file. Make sure it is a .dta file.
* OR, if you know the file path (where the dataset is saved), you can
* use the code above which just tells Stata where to find your data file.
* The code as written above will NOT work on your computer (unless your name
* is Taylor and your dataset is saved on your desktop). So, to run this code
* you would need to change the file path in red above to the file path where
* your dataset is saved.
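* A minimal sketch of a more portable way to load the data. It assumes
* (hypothetically) that anes2016.dta is saved in Stata's current working
* directory, which may not be true on your computer.
* pwd prints the working directory. confirm file checks whether the file
* exists there, and capture keeps the do file running if it does not.
pwd
capture confirm file "anes2016.dta"
if _rc == 0 {
    use "anes2016.dta", clear
}
* The clear option after the comma replaces whatever data is in memory.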
* PART (B): Point and Click
* Find the Variables, Command, and Review windows in the main Stata window.
* Find the Data Browser.
* Find the Do-file Editor (where this file is open).
* PART (C): Visualizing Your Dataset
* One of the first things I like to do is to take a look at my dataset
* Click on the Data Browser icon at the top of your Stata window
* You can see each observation (a survey respondent in this case) is a row
* and each variable is a column. You can see that each column is labeled with a
* variable name as you would see it in the codebook. The numbers or words in
* each cell make up the VALUES of the variables.
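* You can also inspect the data with code instead of clicking. A minimal
* sketch, assuming the dataset loaded in PART (A) is still in memory and has
* at least five observations:
* describe lists every variable with its storage type and label
describe
* list in 1/5 prints the first five rows (observations) of the dataset
list in 1/5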
* PART (D): Renaming Variables
* You'll notice that your variables are named funny things like V10, V4, etc.
* It can get confusing to know what you're working with if you keep variables
* named like this. So, one thing you can do is to rename the variables so that
* you can easily remember what they are. For instance, instead of having to
* remember that V4 means age every time you want to analyze age, you can just
* rename that variable to be age instead of V4.
* NOTE: Stata code and variable names are CASE SENSITIVE. That is, Stata thinks
* that v4 and V4 are two different things. So, it is very important to keep
* the case consistent. I recommend naming your variables with all lowercase
* letters.
* To rename a variable, you'll use the rename command. All you have to do is:
* type rename
* [space]
* the variable whose name you want to change
* [space]
* the new name you want for that variable
* Let's say we want to rename V4 (which we know from our codebook is how old our
* respondents are) as age
rename V4 age
* Now look at your Variables window and in the Data Browser. V4 no longer
* exists -- instead it shows up as age. Nice!
* Let's say we want to rename V11 handling_economy because V11 refers to
* approval or disapproval of how the president is handling the economy (which
* we can see from our codebook)
rename V11 handling_economy
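* One way to confirm that the renames worked. This is a sketch that assumes
* both rename commands above ran without error.
describe age handling_economy
* confirm variable returns an error if a name does not exist, so wrapping it
* in capture lets us check that the old name V4 is really gone. The exact
* option stops Stata from matching V4 as an abbreviation of a longer name.
capture confirm variable V4, exact
* A nonzero return code here means there is no variable named V4
display _rc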
* --- PRACTICE --- Try to rename the following variables: V2R, V12, V60, and V75B
* PART (E): Variable Distributions
* Now that we've prepared our data for analysis by renaming the variables, let's
* take a look at the common trends in our data by checking out the distributions
* We'll use the tabulate command, which is going to show us how many AND what
* percentage of our observations fell into each value of our variable
*look at variable distributions
*demo
* To use the tabulate command, simply type
* tabulate
* [space]
* variable name
tabulate handling_economy
* We can see here that 1,159 of our respondents strongly approved of the way
* the president was handling the economy, 702 approved, but not strongly,
* 348 disapproved, but not strongly, and 1,394 strongly disapproved.
* These are all in the "Freq" column
* We can also see the percentage of respondents taking on each value of our
* variable (giving each response option) in the "Percent" column.
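* Two options that are often useful with tabulate. This is a sketch using
* the handling_economy variable renamed in PART (D).
* sort lists the values from most to least frequent, which makes the modal
* value easy to spot
tabulate handling_economy, sort
* missing adds a row for missing values, which tabulate leaves out by default
tabulate handling_economy, missing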
* --- PRACTICE --- Pick one of the variables you renamed before in PART D.
* Describe the distribution as I did above.
* What was the modal (most common) value of the variable?
* What were the observed values of the variable?
* Is this a nominal, ordinal, or interval variable?
* PART (F): Generate Summary Statistics
* Now we might be interested in measures of central tendency (mean, median)
* and dispersion (standard deviation). We often call these basic statistics
* "Summary Statistics" because they summarize our data quite well.
* To view the summary statistics for a variable in Stata, you'll use the
* summarize command.
* To use the summarize command, simply type:
* summarize
* [space]
* variable name
*demo
summarize age
* We can see that the average (mean) age is 49.5 years old (Mean column) and
* the standard deviation is 17.6 (Std. Dev. column)
* Obs tells us the number of observations we have
* Min tells us the minimum value in our data (so our youngest respondent
* was 18 years old)
* Max tells us the maximum value in our data (so our oldest respondent
* was 90 years old)
summarize age, detail
* You can get even more information by adding the detail option after a comma
* Here you can see the percentiles (the 50% row is the median), the variance,
* and some other statistics that we won't get to in this class (e.g.
* skewness, kurtosis)
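* A sketch of another way to get the mean, median, and standard deviation in
* one compact table, using the tabstat command with the age variable from
* PART (D):
tabstat age, statistics(n mean median sd min max)
* summarize also stores its results, so you can reuse them right after
* running it. Here r(mean) holds the mean from the most recent summarize.
summarize age
display "Mean age: " r(mean)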
* --- PRACTICE --- Choose one of the variables you renamed earlier. Generate
* the summary statistics for that variable to tell us:
* What is the mean?
* What is the standard deviation?
* What is the minimum?
* What is the maximum?
* NOTE: mean and standard deviation make the most sense for interval variables
* PART (G): Generating Graphs
* Visualizing our data in graphs is really important! It can help us see
* patterns in our data and better communicate those patterns to others.
* Let's start by creating a histogram, which helps us see the distribution
* of our data by plotting how frequently each value of our variable is
* observed in our data.
* To create a histogram, use the histogram command. Simply type:
* histogram
* [space]
* variable name
* When you do this, a new graph window will pop up.
*demo
histogram age
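* By default, Stata's histogram puts density (not counts) on the vertical
* axis. A sketch of some common options follows. The bin width of 5 years
* and the title text are illustrative choices, not requirements.
histogram age, frequency width(5) title("Age of respondents")
* frequency shows the number of respondents in each bar
* width(5) groups ages into bins that are 5 years wide
* title() adds a title above the graph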
* --- PRACTICE --- Choose one of the variables you renamed earlier. Generate
* a histogram for that variable.
* What do you see?
* But what if we have two variables we're interested in?
* One thing we might want to see is if different groups of people have
* different attitudes. For instance, do Democrats and Republicans have
* different feelings toward President Trump? What if we want to know
* the average feeling thermometer score toward Trump (V17) among Democrats and
* among Republicans? To visualize this, we might create a bar chart
* that shows these mean values, grouped by party ID (V36)
* To do this, we're going to use the graph bar command with the (mean)
* statistic and an option. Let me show you the code first and then explain it:
graph bar (mean) V17, over(V36)
* graph is the command telling Stata we're going to make a graph
* bar tells Stata what kind of graph we're going to make (a bar graph!)
* (mean) V17 tells Stata that we want to graph the mean of variable V17,
* which is the feeling thermometer toward President Trump
* , the comma separates the main command from its options
* over(V36) tells Stata the groups we want to examine the means in.
* In our case, it tells us that we want to examine the mean
* feelings toward Trump (V17) --over-- the party ID (V36) variable.
* What this will do is tell us the mean feeling thermometer score
* AMONG respondents with each value of the party ID variable
* (strong democrats, not very strong democrats, etc.)
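* A sketch of two optional additions to the bar graph above. The axis title
* text is my own wording, not something taken from the codebook.
* ytitle() labels the vertical axis and blabel(bar) prints each mean on its
* bar. The three slashes continue one command onto the next line.
graph bar (mean) V17, over(V36) ///
    ytitle("Mean feeling thermometer: Trump") ///
    blabel(bar, format(%4.1f))
* To see the same group means as numbers rather than bars:
tabulate V36, summarize(V17)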
* --- PRACTICE --- Create a graph that plots the average feeling thermometer
* score toward Democrats by gender
* What do you see?
* Another way to visualize our data, especially if we have two continuous
* variables we are interested in, is to use a scatterplot. Let's try a
* scatterplot of the feeling thermometer toward the Democratic Party (V18)
* and the feeling thermometer toward the Republican Party (V19)
* We'll use the scatter command, which we can use by simply typing:
* scatter
* [space]
* variable name 1 (plotted on the vertical, or y, axis)
* [space]
* variable name 2 (plotted on the horizontal, or x, axis)
scatter V18 V19
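* A sketch of the same scatterplot with axis titles added, so a reader can
* tell which thermometer is which. The title text is my own wording.
* ytitle() labels the vertical axis, where V18 is plotted
* xtitle() labels the horizontal axis, where V19 is plotted
scatter V18 V19, ///
    ytitle("Feeling thermometer: Democratic Party") ///
    xtitle("Feeling thermometer: Republican Party")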
* --- PRACTICE --- Create a graph that plots the feeling thermometer
* score toward Obama, compared to the feeling thermometer score toward Clinton
* PART (H): Recode Variable Values as Missing Data
* Sometimes the variable you are studying has values that complicate your
* analysis or aren't needed for your hypothesis.
* Example 1: Suppose you have an ordinal variable of level of support for a
* policy (e.g. Strongly support, support, oppose, strongly oppose). But,
* suppose that there was also a "don't know" category that gets listed in
* the data as (Strongly support, support, oppose, strongly oppose, don't know)
* Does Don't Know make sense in this ordering? Is "don't know" MORE than
* strongly oppose? Is it LESS than strongly oppose? Where would you put it?
* It's hard to think about where to place Don't Know. So, you might want to
* recode the variable to set all of the Don't Know responses as missing data
* so that they are not included in your analysis.
* NOTE: This is common in the Eurobarometer dataset! So watch out for this
* if you work with that dataset
* Example 2: Suppose you are interested in vote choice, but you're only
* interested in those who voted for Trump or Clinton, not Johnson, Stein, or
* other. You could recode the Johnson, Stein, and Other values as missing data.
* To recode a variable, we'll use the recode command.
* All you have to do is type:
* recode
* [space]
* variable name
* [space]
* (value you want to change = new value you want that to be)
* For example, let's say we are interested in V34: "Which party is better at
* handling the economy?" The values of this variable are:
* 1. Democrats
* 2. Republicans
* 3. Not much difference between them
* 4. Neither party
* Suppose we're only interested in those who chose Democrats or Republicans, not
* those who chose option 3 (not much difference between them) or option 4
* (neither party). In this case, we want to recode options 3 and 4 to missing.
* In Stata, a period (.) means missing.
* Here's the code we'd use:
recode V34 (4=.)
recode V34 (3=.)
* Let's walk through this. We've typed our command: recode
* then we've typed the VARIABLE NAME that has a value we want to change
* then we've clarified in parentheses which value we want to change: 4
* and what we want to change it to . which means missing
* Then we did the same thing again but changing 3 to missing instead of 4
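* recode also accepts several values in one rule. Here is a sketch of a rule
* that does the same job as the two commands above in a single step. The
* generate() option stores the result in a new variable instead of
* overwriting V34. The name V34_two_party is hypothetical, and the command
* will stop with an error if a variable with that name already exists.
recode V34 (3 4 = .), generate(V34_two_party)
tabulate V34_two_party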
* How can we check to make sure we did it right?
tabulate V34
* If we check out the distribution of our data again, we should
* only see 1 (Democrats) and 2 (Republicans)
misstable summarize V34
* Another way to check is to see how many missing values you have.
* The command above, misstable summarize V34, will generate a table that
* counts the observations that are missing (the Obs=. column) and the
* observations that are not missing (the Obs<. column)
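* A sketch of one more check, assuming V34 only ever held the values 1 to 4
* listed above. count reports how many observations meet a condition, and
* assert stops the do file with an error if the condition is ever false.
count if missing(V34)
count if !missing(V34)
* Every observation should now be 1 (Democrats), 2 (Republicans), or missing
assert inlist(V34, 1, 2) | missing(V34)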