Stata in IPython


Stata is a popular data analytics tool that researchers use for statistical analysis. Nowadays there are numerous tools for data analysis, including popular open-source programming languages such as R and Python. Even though R and Python are open source and easy to get started with, in my experience their statistical tooling is still not fully mature. What I found after learning both is that R has numerous libraries, but the syntax is not consistent across them, which sometimes makes research work harder. In Python the syntax is consistent, but many statistical methods and modelling approaches are either unavailable or still under development. Thus, I often have to depend on paid software products that are feature-rich and mature. Stata is one that I often use for research-related analysis. Stata 17 now offers integration with Python, which makes the analysis process super easy and fun.

Let's say you are doing some analysis in Python and need to fit a statistical model. You search the internet and realize that the model is not implemented in Python, or not in the exact form you want, so you have to turn to paid software to perform the analysis.

Stata's new Jupyter notebook support makes this super easy. You can now send data from Python to Stata and vice versa. For example, you can send part of your data from Python to Stata, run the analysis there, and return the output to Python for further work. All of this can be done from within a Jupyter notebook.

Aim of the Article

The aim of the article is to illustrate how we could utilise Python and Stata together to perform statistical analysis directly from Jupyter notebook.

Article Outline

  1. Stata in IPython Notebook
  2. Loading a Dataset Into Python and Transferring it to Stata for Analysis
  3. Transferring Predictions from Stata to Python

1. Stata in IPython Notebook

Loading Stata

To use Stata in an IPython notebook, you first need to set up Python. Here, I'm using the Anaconda distribution with Python 3.7. You also need Stata 17, which provides the integration between Stata and Python in IPython/Jupyter notebooks.

To get started, install the stata-setup package using pip.

pip install stata-setup

Next, open an IPython notebook and import the stata_setup module. Then call stata_setup.config( ) with the directory where Stata is installed on your machine and the edition of Stata. In my case I'm using the Basic Edition, so I pass "be".

Once you run it, you will see the following Stata splash output, indicating that the notebook is now connected to Stata.

import stata_setup
# Raw string so the backslashes in the Windows path are kept literally
stata_setup.config(r"D:\Application Installation\STATA", "be")
Stata 17 Basic Edition in IPython notebook
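
If the notebook has to run on machines where Stata may be missing or installed somewhere else, the configuration call can be wrapped in a small guard. The following is a minimal sketch, not part of stata-setup itself: configure_stata is a hypothetical helper name, and the only library call it relies on is stata_setup.config(path, edition) with the edition codes "mp", "se" and "be".

```python
import importlib.util
from pathlib import Path

VALID_EDITIONS = {"mp", "se", "be"}  # edition codes accepted by stata_setup.config


def configure_stata(install_dir: str, edition: str = "be") -> bool:
    """Hypothetical helper: configure Stata if possible, else return False."""
    if edition not in VALID_EDITIONS:
        raise ValueError(f"edition must be one of {sorted(VALID_EDITIONS)}")
    # Bail out early when the folder or the stata-setup package is missing.
    if not Path(install_dir).is_dir():
        return False
    if importlib.util.find_spec("stata_setup") is None:
        return False
    import stata_setup  # imported lazily so the checks above work without it

    stata_setup.config(install_dir, edition)
    return True


# A raw string keeps the backslashes of a Windows path literal.
connected = configure_stata(r"D:\Application Installation\STATA", "be")
print("Connected to Stata:", connected)
```

When the helper returns False, the rest of the notebook can skip the Stata cells instead of failing halfway through.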

Load auto Data in Stata

First, I'm going to permanently set the white_tableau scheme, which is a wonderful plot scheme.

You can enable it by installing the schemepack package developed by Asjad Naqvi. Follow the installation instructions on the package's page.

In a Jupyter notebook, any instruction sent to Stata must be in a cell that starts with the %%stata magic command. Most Stata snippets in this article show only the commands themselves, so put %%stata on the first line of the cell when you run them.

set scheme white_tableau, perm

Once the scheme is set, we can start analysing data. Let's load the auto data.

Here, we load the built-in auto dataset and summarize it.

sysuse auto, clear
summarize
Auto data summary table

Generating a Scatter Plot

Let's generate a scatter plot of mpg against weight, separately for domestic and foreign cars, using the twoway command.

twoway (scatter mpg weight, msize(vlarge)), by(foreign)
Scatter plot between mpg and weight by foreign

2. Loading a Dataset Into Python and Transferring it to Stata for Analysis

Let's load the built-in tips data from Python's Seaborn library.

import seaborn as sns
tips = sns.load_dataset("tips")
tips.head()
Top 5 observations

We can also check the value counts of a categorical column, for example with tips["time"].value_counts().

Value counts of time variable labels

Before we send this data to Stata, we need to make sure that no other data is in Stata's memory. Thus, it is good practice to clear the memory first using the clear command.


Transferring Data from Python to Stata

To transfer the tips data to Stata, we pass the -d option followed by the name of the DataFrame (here, -d tips) to the %%stata magic.

We can then use list in 1/5 to print the first five observations.

%%stata -d tips
list in 1/5
Top 5 observations view in Stata

Let's summarize the data using the summarize command. It produces summary statistics only for the numeric variables, i.e., total_bill, tip and size.

Tips data summary

Let's see the data format/type using the describe command.

Tips data description

You can observe that sex, smoker, day and time are stored as strings. The next step is to encode these four variables as labelled categorical (numeric) variables.

Label sex

We code sex as 0: Male and 1: Female and store the result in a new variable called sex_enc.

label define sex_lab 0 "Male" 1 "Female"
encode sex, gen(sex_enc) label(sex_lab)
tab sex_enc
Gender frequency table

Label smoker

We code smoker status as 0: No and 1: Yes and store the result in a new variable called smoker_enc.

label define smoker 0 "No" 1 "Yes"
encode smoker, gen(smoker_enc) label(smoker)
tab smoker_enc
Smoker frequency table

Label time

We code time as 0: Lunch and 1: Dinner and store the result in a new variable called time_enc.

label define time_lab 0 "Lunch" 1 "Dinner"
encode time, gen(time_enc) label(time_lab)
tab time_enc
Time frequency table

Label Day

We code day as 0: Sat, 1: Sun, 2: Thur and 3: Fri and store the result in a new variable called day_enc.

label define day_lab 0 "Sat" 1 "Sun" 2 "Thur" 3 "Fri"
encode day, gen(day_enc) label(day_lab)
tab day_enc
Day frequency table
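
The reason for defining each value label before calling encode is that, by default, encode numbers the distinct strings alphabetically starting from 1, which would code Female as 1 and Male as 2. The plain-Python sketch below mimics the two behaviours to make the difference visible. The function names are made up for illustration, and nothing here calls Stata.

```python
def encode_alphabetical(values):
    """Mimic encode's default: codes 1, 2, ... in alphabetical order."""
    mapping = {text: code for code, text in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values], mapping


def encode_with_label(values, label):
    """Mimic encode with a predefined value label (text -> code)."""
    unknown = sorted(set(values) - set(label))
    if unknown:
        # Stata would add new codes to the label here; this sketch just refuses.
        raise KeyError(f"values missing from label: {unknown}")
    return [label[v] for v in values]


sex = ["Female", "Male", "Male", "Female"]
sex_lab = {"Male": 0, "Female": 1}  # same mapping as: label define sex_lab 0 "Male" 1 "Female"

default_codes, default_map = encode_alphabetical(sex)
labelled_codes = encode_with_label(sex, sex_lab)

print(default_map)     # {'Female': 1, 'Male': 2}
print(default_codes)   # [1, 2, 2, 1]
print(labelled_codes)  # [1, 0, 0, 1]
```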

Chi-square Test of Independence

Once all the categorical variables are labelled, let's check that they behave as expected in Stata. Let's run a chi-square test of independence to see whether sex and smoker are related. The test result (p > 0.05) shows no evidence of an association between sex and smoker status.

tab sex_enc smoker_enc, chi2
Chi-square contingency table
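
To make the statistic behind tab ..., chi2 concrete, here is a minimal pure-Python sketch of Pearson's chi-square for a 2 x 2 table. The cell counts are assumed for illustration (they are meant to resemble the sex by smoker table, so substitute the cells from your own output), and the p-value shortcut via erfc is valid only for one degree of freedom.

```python
import math


def pearson_chi2_2x2(table):
    """Return (chi2, p) for a 2x2 table of counts, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts: rows = Male, Female; columns = non-smoker, smoker.
observed = [[97, 60], [54, 33]]
chi2, p = pearson_chi2_2x2(observed)
print(f"chi2 = {chi2:.4f}, p = {p:.3f}")
```

A p-value above 0.05 means the data give no evidence against independence; it does not prove that the two variables are independent.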

Fit a Linear Regression Model

Let's fit a linear regression using Stata's reg command. It worked as expected.

reg tip total_bill ib(0).smoker_enc ib(0).sex_enc ib(0).time_enc ib(0).day_enc
Tip regression summary table

Compute margins

Let's generate a margins plot by evaluating the predicted tip at total bill values from 3 to 50 in steps of 5, while leaving the other variables at their observed values.

quietly margins, at(total_bill=(3(5)50))
marginsplot
Margin plot
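
The numlist 3(5)50 in the at() option means start at 3 and step by 5 without going past 50. A quick Python equivalent, shown only to make the evaluation points explicit:

```python
# Stata numlist 3(5)50: start at 3, step by 5, stop at or before 50.
start, step, stop = 3, 5, 50
at_values = list(range(start, stop + 1, step))
print(at_values)       # [3, 8, 13, 18, 23, 28, 33, 38, 43, 48]
print(len(at_values))  # 10
```

So margins evaluates ten points, and 50 itself is not one of them.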

3. Transferring Predictions from Stata to Python

Sometimes we need to transfer estimates from Stata to Python for further computation. Say we want to transfer the margin estimates computed previously. We can pass the -doutd option to the %%stata magic, which saves the dataset in Stata's memory after the cell has run as a pandas DataFrame, here named preddata. We will use preddata in the next step.

For now, let's compute the margins again and save them in a Stata dataset called predictions. If we list the saved data, we can see that the total_bill values are stored as _at1 and the margins as _margin. Let's rename these columns to total_bill and pr_tip.

%%stata -doutd preddata
quietly margins, at(total_bill=(3(5)50)) saving(predictions, replace)
use predictions, clear
list _at1 _margin in 1/5
rename _at1 total_bill
rename _margin pr_tip
Margin estimates

If we now select the two columns from preddata and print the first five observations in the IPython notebook, the data is displayed as a pandas DataFrame.

preddata[['total_bill', 'pr_tip']].head()
Predicted margins as pandas data frame
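
The %%stata magic only works in notebook cells. Inside a function or a loop, the same round trip can be written with the pystata function API that ships with Stata 17. The sketch below assumes stata_setup.config has already been run and has not been checked against any particular installation. roundtrip is a hypothetical helper name, while pdataframe_to_data, run and pdataframe_from_data are the pystata calls it relies on.

```python
def roundtrip(df, stata_code):
    """Hypothetical helper: send df to Stata, run commands, return Stata's data.

    Assumes stata_setup.config(...) was called earlier in the session,
    otherwise the pystata import below fails.
    """
    from pystata import stata  # only importable after Stata is configured

    # Replace whatever is in Stata's memory with the DataFrame (like -d).
    stata.pdataframe_to_data(df, force=True)
    # Run one or more Stata commands, given as a single string.
    stata.run(stata_code)
    # Return the dataset now in Stata's memory as a DataFrame (like -doutd).
    return stata.pdataframe_from_data()
```

For example, roundtrip(tips, "generate tip_share = tip / total_bill") would return the tips data with one extra column computed in Stata.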

Stata is a wonderful piece of software for statistical analysis. Similarly, Python is a wonderful general-purpose programming language. We can use the two side by side to solve both statistical and machine learning problems.

Click here for the data and code

I hope you've found this article useful!
