Just a blog

Empirical study preparation based on Wooldridge, Introductory Econometrics

December 29, 2007

What preparation do we need before starting an empirical study? As we know, an empirical study deals with the relationship between variables, that is, the relationship between a dependent variable and one or more explanatory variables. You might be interested in how education affects wages: does more education raise wages? Or you might be interested in whether there is wage discrimination between men and women, or in how much police expenditure reduces the crime rate. These examples capture the core concern of econometric studies. In what follows, I summarize what I read over two semesters of the master's course at Nagoya University, based on Wooldridge's textbook, Chapters 1 to 19.

So what preparation is needed before we can run a study like this?
1-First, we need data. Collecting data might sound simple, just a matter of cost. However, technical problems with data are sometimes hard to remedy. Here I would like to present some data problems discussed in Wooldridge's Introductory Econometrics. One of them is measurement error: we cannot obtain the variable we actually want, only an imprecise observed version of it. For example, when we collect data on income or saving, some households may be unwilling to report their true income, perhaps because they want to avoid paying high taxes, or for some other reason. In that situation OLS can give a biased estimator if the measurement error is correlated with the explanatory variables. Measurement error comes in two kinds: measurement error in the dependent variable and measurement error in an explanatory variable. The error itself is the gap between the reported and the true value: if a household's income is 1000 dollars but it reports 900 dollars, the error is 100 dollars. Measurement error in the dependent variable is not serious as long as the error is uncorrelated with the explanatory variables: OLS is still unbiased and consistent, only with a larger error variance. However, if the error is correlated with the explanatory variables, OLS is biased, and often the only remedy is to collect better data. Measurement error in an explanatory variable is a little trickier. There are two possibilities. If the error is uncorrelated with the reported (observed) variable, we still obtain an unbiased and consistent estimator, at the cost of a larger variance and standard error.
If instead the error is uncorrelated with the true, unobserved variable (the classical errors-in-variables case), it is necessarily correlated with the reported variable, and the problem is quite serious: OLS is biased and inconsistent, and the estimate is attenuated, that is, pulled toward zero. So what is the solution in this case? We can collect better data, or we can use instrumental variables (IV) and two-stage least squares (2SLS) to obtain a consistent estimator.
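
To make the attenuation and the IV remedy concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the data are simulated, the true slope is set to 2, the reporting error has the same variance as the true variable, and the instrument is a hypothetical second, independent report of the same variable. The function names are mine, not the textbook's.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def simulate_measurement_error(n=20000, beta=2.0, seed=0):
    """Return (OLS on true x, OLS on reported x, IV using a second report)."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, 1) for _ in range(n)]        # unobserved true variable
    y = [1.0 + beta * x + rng.gauss(0, 1) for x in true_x]
    reported = [x + rng.gauss(0, 1) for x in true_x]    # first noisy report
    second = [x + rng.gauss(0, 1) for x in true_x]      # independent second report

    ols_true = cov(true_x, y) / cov(true_x, true_x)
    ols_reported = cov(reported, y) / cov(reported, reported)
    # Simple IV estimator: the second report is correlated with the first one
    # only through the true variable, so it can serve as an instrument here.
    iv = cov(second, y) / cov(second, reported)
    return ols_true, ols_reported, iv


ols_true, ols_reported, iv = simulate_measurement_error()
print(f"OLS with the true variable:     {ols_true:.3f}")
print(f"OLS with the reported variable: {ols_reported:.3f}")
print(f"IV with a second report:        {iv:.3f}")
```

With equal variances for the true variable and the reporting error, theory says the OLS slope on the reported variable should be close to half of the true slope, while the IV estimate should be close to the true slope, although noisier.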

2-The second preparation is choosing which explanatory variables to include in the model. From basic econometrics we know that if we omit a relevant variable that is correlated with the included ones, OLS is biased (omitted variable bias). Therefore we try to include the relevant explanatory variables to make the model complete. However, adding variables has to be weighed against multicollinearity: highly correlated regressors inflate the variance of the estimators. In any case, what we want is a good estimate of the coefficient on the explanatory variable of interest, so we try to control for as many relevant factors as we can. The difficulty is that sometimes a variable is unobserved, and then we have to use a proxy instead. Omitting a relevant variable is one source of what is called the endogeneity problem: an explanatory variable is correlated with the error term. We can actually test whether an explanatory variable is endogenous or exogenous with a statistical test. In some severe situations we may not be able to find a suitable proxy, and OLS will then give a biased estimator. In that case we can use IV or 2SLS, which give a consistent estimator provided we have a valid instrument.
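
A rough sketch of omitted variable bias, under assumptions I chose for illustration: wages depend on education and on an unobserved "ability" variable, ability is positively correlated with education, and all numbers are simulated. The short regression leaves ability out; its slope should then be close to the true coefficient plus the ability coefficient times the slope from regressing ability on education.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_omitted_variable(n=20000, b_educ=1.0, b_ability=0.5, seed=1):
    """Return (slope with ability omitted, slope predicted by the bias formula)."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]           # unobserved in practice
    educ = [0.8 * a + rng.gauss(0, 1) for a in ability]     # correlated with ability
    wage = [b_educ * e + b_ability * a + rng.gauss(0, 1)
            for e, a in zip(educ, ability)]

    short_slope = ols_slope(educ, wage)      # regression that omits ability
    delta = ols_slope(educ, ability)         # regression of ability on educ
    predicted = b_educ + b_ability * delta   # omitted variable bias formula
    return short_slope, predicted


short_slope, predicted = simulate_omitted_variable()
print(f"slope with ability omitted: {short_slope:.3f}")
print(f"true slope plus bias term:  {predicted:.3f}")
```

Both numbers should sit clearly above the true coefficient of 1.0, because ability raises wages and is positively correlated with education.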

3-Problem of model misspecification. One of my friends, who studies at the Graduate School of International Development (GSID), always asks me why in some cases we use the logarithmic form while in others we use a simple linear one. These are questions of model specification. In the econometrics classes I have taken with Prof. Sonota and with Prof. Fujikawa, I asked about this very often. According to Wooldridge's textbook, logs are often preferred for several reasons: the slope coefficients do not depend on the units of a logged variable (rescaling, say, from dollars to thousands of dollars only shifts the intercept); taking logs often makes the distribution closer to normal and reduces heteroskedasticity (which we can test for); and logs are convenient for strictly positive variables such as income. More importantly, the textbook explains that we can test the functional form of a model (log, quadratic or linear) using tests such as the Ramsey RESET test or the Davidson-MacKinnon test. We can also use the adjusted R-squared to choose between non-nested models that share the same dependent variable, or build a comprehensive model that nests both and use F-tests.
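
The point about units can be checked with a short sketch. The data are simulated and the numbers (an 8 percent return to a year of education, wages measured in two different units) are assumptions for illustration only. The slope of the log-wage regression should be identical in both units, while the slope of the level regression is multiplied by the rescaling factor.

```python
import math
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_wages(n=5000, seed=2):
    """Simulated years of education and wages from a log-linear wage equation."""
    rng = random.Random(seed)
    educ = [rng.randint(8, 20) for _ in range(n)]
    wage = [math.exp(0.5 + 0.08 * e + rng.gauss(0, 0.3)) for e in educ]
    return educ, wage


def slopes_in_logs_and_levels(educ, wage):
    """Return (slope of log(wage) on educ, slope of wage on educ)."""
    slope_log = ols_slope(educ, [math.log(w) for w in wage])
    slope_level = ols_slope(educ, wage)
    return slope_log, slope_level


educ, wage = simulate_wages()
log_a, level_a = slopes_in_logs_and_levels(educ, wage)
log_b, level_b = slopes_in_logs_and_levels(educ, [1000 * w for w in wage])

print(f"log model slope, original units: {log_a:.4f}")
print(f"log model slope, rescaled units: {log_b:.4f}")
print(f"level model slopes: {level_a:.4f} versus {level_b:.4f}")
# Exact percentage change in wage for one more year of education
print(f"implied percent change per year: {100 * (math.exp(log_a) - 1):.2f}")
```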

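4-Problem of outliers. OLS minimizes squared residuals, so a few outliers can change the estimates substantially, especially in small samples, where the law of large numbers and the central limit theorem (CLT) offer little protection. It is worth re-estimating the model without the suspicious observations and checking whether the results change; if an outlier is clearly a recording mistake, it should be corrected or dropped. Some statistical software packages offer user-friendly tools for this.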
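
A tiny made-up example shows how much one bad observation can matter in a small sample. The ten data points lie exactly on a line with slope 2, and I assume the last one was mis-recorded by adding 100 to it.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def outlier_demo():
    """Return (slope with the outlier, slope after dropping it)."""
    x = list(range(1, 11))
    y = [1 + 2 * xi for xi in x]     # exact line with slope 2
    y[-1] += 100                     # hypothetical recording mistake
    slope_all = ols_slope(x, y)
    slope_trimmed = ols_slope(x[:-1], y[:-1])
    return slope_all, slope_trimmed


slope_all, slope_trimmed = outlier_demo()
print(f"slope with the outlier:    {slope_all:.3f}")
print(f"slope without the outlier: {slope_trimmed:.3f}")
```

Dropping the observation recovers the slope of 2 here only because the remaining points are exact; with real data the comparison is a sensitivity check, not a proof that the outlier is wrong.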

5-Some problems with the Gauss-Markov assumptions for OLS. The textbook lists six assumptions for ordinary least squares (OLS) regression: the first five are the Gauss-Markov assumptions, and the sixth, normality, completes the classical linear model. Let me go through each assumption and mention how it can fail and what the solution is.
Assumption 1: The model must be linear in the parameters. This is a very basic assumption. It does not require the variables themselves to be linearly related: logs, squares and interactions of the variables are all allowed. What if the relationship is nonlinear in the parameters? Then we cannot apply OLS directly. However, a smooth function can be approximated by a linear one in a small neighborhood of a point, so a linearized model often still makes sense for our purpose.
Assumption 2: Random sampling. It is quite important that the data are a random sample from the population. In the natural sciences we can run randomized experiments, but in the social sciences this assumption may be violated, especially with time-series and panel data. Why might data not be random, and what are the consequences? Data are not random when we select the sample based on some criterion; in survey and sampling methods this is related to stratification. The criterion may be defined by the dependent variable, by an independent (explanatory) variable, or by something else related to these two. For example, we may want to study the effect of training on worker productivity but select only firms that have a training program, or only firms with large capital. This also leads to a small sample. Or we may want to study the effect of education on wage discrimination but select only people whose income is above some level. According to the textbook (with a proof that I omit here), exogenous sample selection, that is, selection based on the explanatory variables, is not a serious problem: OLS is still unbiased. However, endogenous sample selection, where the sample is selected based on the dependent variable, is serious and leads to a biased estimator.
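
The difference between the two kinds of selection can be seen in a simulation sketch. The model, the true slope of 2 and the two selection rules are assumptions I picked for illustration: one subsample keeps observations with a positive explanatory variable, the other keeps observations whose dependent variable is above 1.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_selection(n=20000, seed=3):
    """Return slopes for (full sample, selection on x, selection on y)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
    data = list(zip(x, y))

    exogenous = [(xi, yi) for xi, yi in data if xi > 0]      # selected on x
    endogenous = [(xi, yi) for xi, yi in data if yi > 1.0]   # selected on y

    def slope(sample):
        xs, ys = zip(*sample)
        return ols_slope(xs, ys)

    return slope(data), slope(exogenous), slope(endogenous)


full, exog, endog = simulate_selection()
print(f"full sample:    {full:.3f}")
print(f"selected on x:  {exog:.3f}")
print(f"selected on y:  {endog:.3f}")
```

Selecting on the explanatory variable should leave the slope near 2 with a larger standard error, while selecting on the dependent variable should pull it noticeably below 2.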

Assumption 3: No perfect collinearity among the explanatory variables. "Perfect" means that no explanatory variable is an exact linear function of the others. The variables may still be correlated, and they may be related in a nonlinear way, for example a variable and its square.

Assumption 4: Exogeneity (zero conditional mean). This is quite important for OLS. We distinguish strict exogeneity from the weaker contemporaneous exogeneity. As I mentioned earlier, we can test for exogeneity. The core idea is that the error term has mean zero given the explanatory variables, so each explanatory variable is uncorrelated with the error term. Assumptions 1 to 4 together are what we need for OLS to be unbiased.

Assumption 5: Homoskedasticity. This is important for obtaining the standard error of each estimator. Without this assumption the usual formulas for the variance and standard error of the estimators are not valid, so the usual confidence intervals, t-tests, F-tests and even LM tests cannot be trusted. The assumption says that the error term has constant variance given the explanatory variables. We can test whether the data are homoskedastic or heteroskedastic using the Breusch-Pagan (BP) test or the White test. If the test rejects the null hypothesis of homoskedasticity, we conclude that there is heteroskedasticity, and we have two solutions. First, we can use heteroskedasticity-robust standard errors and the corresponding robust t, F and LM statistics. This is the simplest method and most software supports it. Second, if we know the form of the error variance as a function of the explanatory variables, we can use weighted least squares, a special case of generalized least squares (GLS); if the form has to be estimated, we use feasible GLS (FGLS) instead of OLS. Most software packages provide these tools.
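
As a rough sketch of both ideas for a regression with a single explanatory variable: the code below computes the usual and the heteroskedasticity-robust (White) standard error of the slope, plus the LM form of the Breusch-Pagan statistic (sample size times the R-squared from regressing the squared residuals on x). The simulated data, in which the error standard deviation grows with the square of x, and all function names are assumptions for illustration.

```python
import math
import random


def fit_line(x, y):
    """Return (slope, residuals, mean of x, sum of squared deviations of x)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return slope, resid, x_bar, sxx


def slope_standard_errors(x, y):
    """Return (usual SE, heteroskedasticity-robust SE) of the OLS slope."""
    n = len(x)
    _, resid, x_bar, sxx = fit_line(x, y)
    sigma2 = sum(u * u for u in resid) / (n - 2)
    usual = math.sqrt(sigma2 / sxx)
    robust = math.sqrt(sum((xi - x_bar) ** 2 * u * u
                           for xi, u in zip(x, resid))) / sxx
    return usual, robust


def breusch_pagan_lm(x, y):
    """LM statistic n * R^2 from regressing squared residuals on x."""
    n = len(x)
    _, resid, x_bar, sxx = fit_line(x, y)
    u2 = [u * u for u in resid]
    u2_bar = sum(u2) / n
    sxu = sum((xi - x_bar) * (v - u2_bar) for xi, v in zip(x, u2))
    suu = sum((v - u2_bar) ** 2 for v in u2)
    return n * sxu * sxu / (sxx * suu)


def simulate_heteroskedastic(n=10000, seed=4):
    """Simulated data whose error standard deviation equals x squared."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 2) for _ in range(n)]
    y = [1.0 + 2.0 * xi + xi ** 2 * rng.gauss(0, 1) for xi in x]
    return x, y


x, y = simulate_heteroskedastic()
usual, robust = slope_standard_errors(x, y)
print(f"usual SE:  {usual:.4f}")
print(f"robust SE: {robust:.4f}")
# With one regressor, compare the LM statistic with the 5% chi-square(1)
# critical value of about 3.84.
print(f"Breusch-Pagan LM: {breusch_pagan_lm(x, y):.1f}")
```

In this design the usual standard error understates the uncertainty, so the robust one should come out larger, and the LM statistic should be far above the critical value.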

Assumption 6: Normality of the error term. This assumption is needed for the t-test and F-test to be exact in finite samples. However, it can be relaxed if the number of observations in the sample is large enough: then we can rely on the asymptotic t-test and F-test.

6-Problems with time series data and panel data. With time series data we have to consider additional assumptions on top of the six we have seen so far for cross-sectional data. The first one I am going to discuss is serial correlation.
Serial correlation (SC) is the problem where the error term in one period is correlated with the error term in another period. It arises mainly in time series data (and in the time dimension of panel data). We can test whether the errors are serially correlated using statistical tests; the Durbin-Watson (DW) test is one of them. Why is serial correlation such an important issue in time series regression? Because under SC the usual standard errors and test statistics are not valid and OLS is no longer efficient; and if the model contains a lagged dependent variable, SC also makes OLS inconsistent. The simplest form of SC is first-order autocorrelation, AR(1); the extreme case is a unit root, in which the error term follows a random walk. If the tests show that our time series suffers from SC, we have two solutions. First, we can use serial-correlation-robust standard errors. Alternatively, we can correct for AR(1) serial correlation by quasi-differencing the data and using feasible GLS to obtain the estimator.
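
Here is a minimal sketch of these steps on simulated data: compute the Durbin-Watson statistic from the OLS residuals, estimate the AR(1) coefficient by regressing the residuals on their own lag, then quasi-difference the data and re-estimate (a one-step Cochrane-Orcutt style correction that drops the first observation). The AR(1) coefficient of 0.7, the true slope of 2 and the function names are assumptions for illustration.

```python
import random


def fit_line(x, y):
    """Return (slope, intercept) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar


def simulate_ar1(n=5000, rho=0.7, seed=5):
    """Simulated regression data whose error term follows an AR(1) process."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    errors, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0, 1)
        errors.append(prev)
    y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, errors)]
    return x, y


def durbin_watson(resid):
    """Sum of squared changes in the residuals over their sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)


def quasi_difference(series, rho):
    """Return series[t] - rho * series[t-1] for t = 1, ..., n-1."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


def analyze(x, y):
    """Return (DW statistic, estimated rho, slope after quasi-differencing)."""
    slope, intercept = fit_line(x, y)
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    dw = durbin_watson(resid)
    rho_hat, _ = fit_line(resid[:-1], resid[1:])   # residual on lagged residual
    fgls_slope, _ = fit_line(quasi_difference(x, rho_hat),
                             quasi_difference(y, rho_hat))
    return dw, rho_hat, fgls_slope


dw, rho_hat, fgls_slope = analyze(*simulate_ar1())
print(f"Durbin-Watson: {dw:.3f}")
print(f"estimated rho: {rho_hat:.3f}")
print(f"slope after quasi-differencing: {fgls_slope:.3f}")
```

A DW statistic near 2 suggests no first-order serial correlation; since DW is roughly 2 times (1 minus rho), a value well below 2 points to positive serial correlation.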

7-Problems with panel data: Data that combine a cross-section and a time dimension fall into two categories. The first is pooled cross sections, where a new random sample is drawn in each time period. For example, in 2007 we select 300 households and record their income, and in 2008 we select another 300 households and record their income again (some of the earlier households might happen to be included). With pooled cross sections we can apply OLS much as we did with a single cross section. A simple and powerful way to handle them is to include a dummy variable for each time period (except a base period) in the population model.
The second category is panel data proper, where the same individuals in the sample are surveyed again in every time period. Because observations on the same individual are not independent over time, and because unobserved individual characteristics may be correlated with the explanatory variables, we cannot simply apply pooled OLS as we did before. However, we have solutions for this kind of data: first differencing (FD), fixed effects (FE) or random effects (RE).
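
To illustrate why first differencing helps, here is a sketch with a simulated two-period panel. The assumptions are mine: each individual has an unobserved, time-constant effect that raises the outcome and is correlated with the explanatory variable, and the true slope is 1.5. Pooled OLS ignores the effect; first differencing removes it because it does not change between periods.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_panel(n=5000, beta=1.5, seed=6):
    """Return rows (x1, y1, x2, y2): two periods for each of n individuals."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        effect = rng.gauss(0, 1)                 # unobserved individual effect
        x1 = effect + rng.gauss(0, 1)            # x is correlated with the effect
        x2 = effect + rng.gauss(0, 1)
        y1 = 2.0 + beta * x1 + effect + rng.gauss(0, 1)
        y2 = 2.0 + beta * x2 + effect + rng.gauss(0, 1)
        rows.append((x1, y1, x2, y2))
    return rows


def pooled_and_first_difference(rows):
    """Return (pooled OLS slope, first-difference slope)."""
    pooled_x = [r[0] for r in rows] + [r[2] for r in rows]
    pooled_y = [r[1] for r in rows] + [r[3] for r in rows]
    dx = [r[2] - r[0] for r in rows]             # change in x between periods
    dy = [r[3] - r[1] for r in rows]             # change in y; the effect cancels
    return ols_slope(pooled_x, pooled_y), ols_slope(dx, dy)


pooled, fd = pooled_and_first_difference(simulate_panel())
print(f"pooled OLS slope:       {pooled:.3f}")
print(f"first-difference slope: {fd:.3f}")
```

The pooled slope should be biased upward here, while the first-difference slope should be close to 1.5. With two periods, FD and FE give the same estimate; RE would not be appropriate in this design because the individual effect is correlated with the explanatory variable.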

Everything described above is the preparation we should think about whenever we want to run an empirical analysis project. Most of the ideas are summarized from Wooldridge, Introductory Econometrics, 3rd edition, 2006.

Categories: Economics
