* Purpose: Introduction to Stata for POLI 30
* Date: 1/21/19, Updated: 1/22/19
* Name: Liesel Spangler
*******************************************************************************
* Welcome to Stata!
* This document is called a "do file" that will allow you to write, save, and
* execute your code to analyze data in Stata.
* Before we begin, I want to point out a few things:
* 1) Note the header at the top of this file that includes the date the file
* was started and updated, name of who created the file, and the purpose
* of the file. This is really important! It will help keep your code
* organized and help you know what the goal of your do file was.
* 2) Always use a do file. It helps you keep track of what you've done.
* And it will help you get help and feedback from the TAs.
* 3) You can run code directly from the do file.
* 4) You'll notice a lot of asterisks! An asterisk at the start of a line
* marks that line as a comment. Stata ignores the rest of the line
* instead of interpreting it as code.
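* A minimal sketch of the comment styles a do file accepts. The two display
* lines are only illustrations and are not part of the analysis below.
* An asterisk at the start of a line comments out that whole line.
display "Stata runs this line" // two slashes comment out the rest of a line
/* A slash-asterisk pair can comment out
   several lines at once */
display 2 + 2 /*this style also works at the end of a command*/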
* In this tutorial, you will learn how to:
* A) Open a dataset
* B) Use some basic "point and click" functions in Stata
* C) Visually inspect your data
* D) Rename variables
* E) Recode variables
* F) Examine the distribution of a variable
* G) Generate summary statistics, such as mean, median, and standard deviation
* H) Create graphs of variables: histogram, scatterplot, barplot
* I) Apply survey weights
*******************************************************************************
* PART A: OPENING DATASETS
* 1) Clearing your workspace: When you open Stata, it is a good idea to clear out
* your previous work so that you are starting fresh.
* To do so, you can use the clear command:
clear
* Note that, unlike other statistical software (e.g., R, Python), Stata by
* default works with only one dataset in memory at a time, which is why the
* clear command can be helpful.
* 2) Load the data:
* Note, there are three different ways to open a Stata (.dta) file:
* 1) You can double-click the .dta file.
* 2) You can use the point and click "Open" button at the top of the
* Stata window.
* 3) You can use the following code; you will need to type the
* appropriate file path (the part in red) for where the file
* is located on your computer.
* The advantage of this method is that you can easily rerun
* the code in the future. If you choose one of the other
* ways to open the file, it helps to copy the resulting command from
* the Stata Results window into your do file to
* make your life easier in the future.
use "/Users/lispang/Desktop/anes2016.dta"
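* A quick check that the file loaded, sketched with two built-in commands.
* The output depends on your copy of the data.
describe, short /*number of observations and variables in memory*/
count /*number of observations (rows)*/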
* Note also that the above is how you open native Stata files (.dta). If you
* want to open other file types, such as .sav, .csv, .tsv, or .xlsx,
* you will need to import the file. For now, we will not go into this,
* but if you ever need to do this, please use this reference:
* https://www.stata.com/manuals13/dimportdelimited.pdf
*******************************************************************************
* PART B: POINT AND CLICK
* 1) point out variable/code/review boxes
* 2) point out data browser
* 3) point out do-file
*******************************************************************************
* PART C: BROWSING YOUR DATA
* 1) One of the first things I like to do is take a look at my dataset.
* 2) Click on the Data Browser icon at the top of your Stata window.
* Each observation (a survey respondent in this case) is a row.
* Each variable is a column. You can see that each column is labeled with a
* variable name as you would see it in the codebook.
* The numbers or words in each cell make up the VALUES of the variables.
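* The same inspection can be done with commands instead of clicking. This is
* a small sketch: the browse command opens the Data Browser window, and
* the two commands below print similar information in the Results window.
list in 1/5 /*shows every variable for the first 5 observations*/
describe /*lists each variable's name, storage type, and label*/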
*******************************************************************************
* PART D: RENAMING VARIABLES
* 1) You'll notice that your variables are named funny things like V10, V4, etc.
* You can easily rename your variables to something that is meaningful.
* To rename a variable, you'll use the rename command. All you have to do is:
* type rename
* [space]
* the variable whose name you want to change
* [space]
* the new name you want for that variable
*** EXAMPLE: Let's say we want to rename V4 (which we know from our codebook is
* how old our respondents are) as age
rename V4 age
* Now look at your Variables window and in the Data Browser. V4 no longer
* exists -- instead it shows up as age.
* 2) Stata code and variable names are CASE SENSITIVE. That is, Stata thinks
* that v4 and V4 are two different things. So, it is very important to keep
* the case consistent. I recommend naming your variables with all lowercase
* letters.
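* A small sketch of case sensitivity, assuming V4 was renamed to age above
* and that the data contain no variable called AGE.
confirm variable age /*no error: age exists*/
capture confirm variable AGE /*capture suppresses the error so the do file keeps running*/
display _rc /*a nonzero return code means AGE was not found*/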
* 3) Stata variable names cannot have spaces in them, but you can separate words
* by using an underscore (called snake case) or by strategically using
* upper and lower case letters (called camel case)
*** EXAMPLE: Let's say we want to rename V11 handling_economy because V11
* refers to approval or disapproval of how the president is handling the
* economy (which we can see from our codebook) [Note, this is snake case]
rename V11 handling_economy /*snake case*/
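* The camel case version of the same name would be handlingEconomy. The line
* below is commented out because the rest of this do file refers to the
* variable as handling_economy; running it would make PART F fail.
* rename handling_economy handlingEconomy /*camel case*/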
* Whichever case you prefer, just be consistent.
* --- PRACTICE --- Try to rename the following variables: V2R, V12, V60, and V75B
* use the codebook to identify the variables and values
*******************************************************************************
* PART E: RECODING VARIABLE VALUES
* There are many circumstances when you would want to recode the values
* taken by a variable, such as:
* 1) To group values
* 2) To exclude observations from the analysis based on the value given
* 1) Grouping Values
* Suppose we want to know how many survey respondents favored Clinton in the
* 2016 election (V16). We might want to recode the values into groups: those
* respondents who did not favor Clinton, those who were neutral, and those
* who favored Clinton to some degree.
rename V16 clinton_support /*renaming variable to make sense*/
* The original value coding is on a 0-100 scale, with 0 indicating no support
* and 100 indicating complete support. So we may consider recoding the values
* as the following:
* Anyone answering 49 and below is coded as a 1 (low support)
* Anyone answering 50 is coded as a 2 (neutral)
* Anyone answering 51 and above is coded as a 3 (high support)
* Let's look at the variable before recoding
tabulate clinton_support /*frequency distribution of cases before recoding*/
list clinton_support in 1/5 /*this lists the first 5 observations*/
* To recode a variable, we'll use the recode command.
* All you have to do is type:
* recode
* [space]
* variable name
* [space]
* (value you want to change = new value you want that to be)
recode clinton_support (0/49 = 1)(50 = 2)(51/100 = 3)
* The forward slashes indicate a range that includes both endpoints, so 0/49
* represents all values from 0 to 49, and 51/100 all values from 51 to 100.
* Note that recode overwrites the original values of clinton_support.
tabulate clinton_support /*frequency distribution of cases after recoding*/
list clinton_support in 1/5 /*this lists the first 5 observations*/
* Which group is largest: respondents with low support, neutral, or high
* support for Clinton?
* How did the values for the first 5 observations change with the recoding?
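* After recoding, the table shows only the numbers 1, 2, and 3. One optional
* way to make it easier to read is to attach value labels. This is a sketch;
* the label name support_lbl and the label text are our own choices, not
* part of the codebook.
label define support_lbl 1 "Low support" 2 "Neutral" 3 "High support", replace
label values clinton_support support_lbl
tabulate clinton_support /*same counts, now shown with labels*/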
* 2) Excluding Values: Sometimes the variable you are studying has values that
* complicate your analysis or aren't needed for your hypothesis, for instance
* "Don't Know" or "Prefer Not to Answer". This is extremely common in data sets.
* Using the codebook will help you identify such values.
* For instance, suppose you have an ordinal variable of level of support for a
* policy, such that the values of the variable are:
* Strongly Support
* Support
* Oppose
* Strongly Oppose
* Don't Know
* Does Don't Know make sense in this ordering? Is "don't know" more or
* less than strongly oppose? Where would you put it?
* It's hard to think about where to place Don't Know in the ordering.
* So, you might want to recode the variable to set all of the "don't
* know" responses as missing data so that they are not included in your
* analysis.
* Most Stata commands automatically exclude observations with missing data.
* Example: let's say we are interested in V34: "Which party is better at
* handling the economy?" The values of this variable are:
* 1. Democrats
* 2. Republicans
* 3. Not much difference between them
* 4. Neither party
* Suppose we're only interested in those who chose Democrats or Republicans,
* not those who chose option 3 (not much difference between them) or option 4
* (neither party). In this case, we want to recode options 3 and 4 to missing.
* In Stata a period indicates missing.
* Here's the code we'd use:
recode V34 (4=.)
recode V34 (3=.)
* Let's walk through this. We've typed our command: recode
* then we've typed the VARIABLE NAME that has a value we want to change
* then we've clarified in parentheses which value we want to change: 4
* and what we want to change it to . which means missing
* Then we did the same thing again but changing 3 to missing instead of 4
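* The two recode commands can also be written as one. A sketch: listing both
* values before the equals sign sends them to the same new value. Because
* options 3 and 4 were already set to missing above, running this line should
* not change anything further.
recode V34 (3 4 = .)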
* 3) Double Checking Work.
* To make sure we've recoded the values correctly we can use the tabulate
* command.
tabulate V34 /*this shows a frequency table of the data*/
* If we check out the distribution of our data again, we should
* only see 1 (Democrats) and 2 (Republicans)
* Another way is to see how many missing values you have.
* The command below, misstable summarize V34, will generate a table
* of observations that are missing (the Obs=. column) and the
* observations that are not missing (the Obs<. column)
misstable summarize V34
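* Two further checks, sketched with built-in options. The missing option
* adds a row for missing values to the frequency table, and count if
* missing() reports how many observations are missing.
tabulate V34, missing
count if missing(V34)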
*******************************************************************************
* PART F: VARIABLE DISTRIBUTIONS
* Now that we've prepared our data for analysis by renaming and recoding
* variables, let's take a look at the common trends in our data by checking
* out the distributions.
* We'll use the tabulate command, which is going to show us how many AND what
* percentage of our observations fell into each value of our variable.
* To use the tabulate command, simply type
* tabulate
* [space]
* variable name
tabulate handling_economy
* We can see here that 1,159 of our respondents strongly approved of the way
* the president was handling the economy, 702 approved, but not strongly,
* 348 disapproved, but not strongly, and 1,394 strongly disapproved.
* These counts are all in the "Freq." column.
* We can also see the percentage of respondents taking on each value of our
* variable (giving each response option) in the "Percent" column.
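* A sketch of two tabulate options that help when describing a distribution,
* shown with the clinton_support variable recoded in PART E.
tabulate clinton_support, sort /*most common value listed first*/
tabulate clinton_support, nolabel /*shows the numeric codes instead of any value labels*/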
* --- PRACTICE --- Pick one of the variables you renamed before in PART D.
* Describe the distribution as I did above.
* What was the modal (most common) value of the variable?
* What were the observed values of the variable?
* Is this a nominal, ordinal, or interval variable?
*******************************************************************************
* PART G: GENERATE SUMMARY STATISTICS
* Now we might be interested in measures of central tendency (mean, median)
* and dispersion (standard deviation). We often call these basic statistics
* "Summary Statistics" because they summarize our data quite well.
* 1) To view the summary statistics for a variable in Stata, you'll use the
* summarize command. Simply type:
* summarize
* [space]
* variable name
summarize age
* We can see that the average (mean) age is 49.5 years old (Mean column)
* the standard deviation is 17.6 (Std. Dev. column)
* Obs tells us the number of observations we have
* Min tells us the minimum value in our data (so our youngest respondent
* was 18 years old)
* Max tells us the maximum value in our data (so our oldest respondent
* was 90 years old)
* You can get even more information by adding , detail
summarize age, detail
* Here you can see the percentiles (the 50% row is the median), variance, and
* some other statistics that we won't get to in this class (e.g. skewness,
* kurtosis)
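* summarize also stores its results, so you can pull out single numbers.
* A minimal sketch using the stored results r(mean), r(p50), and r(sd);
* r(p50) is the median and is only stored after the detail option.
quietly summarize age, detail
display "Mean age: " r(mean)
display "Median age: " r(p50)
display "Std. dev. of age: " r(sd)
* tabstat prints the same statistics in one compact table.
tabstat age, statistics(mean median sd min max)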
* 2) --- PRACTICE --- Choose one of the variables you renamed earlier. Generate
* the summary statistics for that variable to tell us:
* What is the mean?
* What is the standard deviation?
* What is the minimum?
* What is the maximum?
* NOTE: mean and standard deviation make the most sense for interval variables
*******************************************************************************
* PART H: GENERATING GRAPHS
* 1) Visualizing our data in graphs helps us see patterns in our data and
* better communicate those patterns to others.
* Let's start by creating a histogram, which helps us see the distribution
* of our data by plotting how frequently each value of our variable is
* observed in our data.
* To create a histogram, use the histogram command. Simply type:
* histogram
* [space]
* variable name
* When you do this, a new graph window will pop up.
histogram age
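* A sketch of some common histogram options. The bin width of 5 years and
* the title text are illustrative choices. The frequency option puts counts
* on the vertical axis instead of densities.
histogram age, width(5) frequency title("Age of respondents") xtitle("Age in years")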
* --- PRACTICE --- Choose one of the variables you renamed earlier. Generate
* a histogram for that variable.
* What do you see?
* 2) But what if we have two variables we're interested in?
* One thing we might want to see is if different groups of people have
* different attitudes.
* For instance, do Democrats and Republicans have different feelings toward
* President Trump? What if we want to know the average feeling thermometer
* score toward Trump (V17) among Democrats and among Republicans?
* To visualize this, we might create a bar chart that shows these mean
* values, grouped by party ID (V36)
* To do this, we're going to use the graph bar command with the (mean)
* statistic and an option. Let me show you the code first and then explain it:
graph bar (mean) V17, over(V36)
* graph is just the command telling Stata we're going to make a graph
* bar tells Stata what kind of graph we're going to make (a bar graph!)
* (mean) V17 tells Stata that we want to graph the mean of variable V17,
* which is the feeling thermometer toward President Trump
* , the comma separates the main command from its options
* over(V36) tells Stata the groups we want to examine the means in.
* In our case, it says that we want to examine the mean
* feelings toward Trump (V17) --over-- the party ID (V36) variable.
* What this will do is tell us the mean feeling thermometer score
* AMONG respondents with each value of the party ID variable
* (strong democrats, not very strong democrats, etc.)
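* Two optional additions, sketched under the assumption that V17 and V36 are
* coded as described above. tabstat prints the means that the bars display
* along with the number of respondents in each group, and the extra graph
* options add an axis title and angle the group labels so they do not
* overlap. The title text is our own wording.
tabstat V17, by(V36) statistics(mean n)
graph bar (mean) V17, over(V36, label(angle(45))) ytitle("Mean feeling thermometer toward Trump")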
* --- PRACTICE --- Create a graph that plots the average feeling thermometer
* score toward Democrats by gender
* What do you see?
* 3) Another way to visualize our data, especially if we have two continuous
* variables we are interested in, is to use a scatterplot. Let's try a
* scatterplot of the feeling thermometer toward the Democratic Party (V18)
* and the feeling thermometer toward the Republican Party (V19)
* We'll use the scatter command, which we can use by simply typing:
* scatter
* [space]
* variable name 1 (plotted on the vertical axis)
* [space]
* variable name 2 (plotted on the horizontal axis)
scatter V18 V19
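* A sketch of the same scatterplot with a few options. Because many
* respondents give the same round-number answers, points stack on top of
* each other; jitter() adds a little random noise so stacked points become
* visible. The jitter amount and axis titles are illustrative choices.
scatter V18 V19, jitter(2) ytitle("Feeling toward Democratic Party") xtitle("Feeling toward Republican Party")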
* --- PRACTICE --- Create a graph that plots the feeling thermometer
* score toward Obama, compared to the feeling thermometer score toward
* Clinton
*******************************************************************************
* PART I: SURVEY WEIGHTS
* Even when a survey draws its sample at random, the sample may not
* adequately match the population in terms of demographic variables. That is,
* the survey is not representative of the population (and therefore lacks
* external validity).
* To use the survey data, you may need to use the weights that have been calculated
* to make the survey data representative so that the summary statistics and
* results of analyses can be generalized to the whole population.
* 1) Thankfully, it is easy to apply weights. You have to call the "svyset" command,
* and in brackets tell Stata what variable represents the weights (in this
* case, PW2016_FULL)
svyset [pweight=PW2016_FULL]
* 2) Let's compare means with and without the weights:
svy: mean age /*this shows the mean age with the survey weights*/
summarize age /*this shows the mean age without the survey weights*/
* What happens to the mean with and without the weighting?
* What does this mean about our sample?
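* A sketch of two related commands, assuming svyset was run as above.
* Typing svyset alone displays the current survey settings, and
* svy: tabulate gives weighted proportions for a categorical variable.
svyset
svy: tabulate V34 /*weighted proportions*/
tabulate V34 /*unweighted counts and percentages, for comparison*/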
* NOTE: You will need to use the survey weights in the first HW (Part B,
* Question 1) in order to get the 62.3% for category 1.