---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- name: log: /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_1/lecture1.log log type: text opened on: 15 Oct 2024, 17:23:43 . . ***************** . * 2. EXPLORATORY ANALYSIS OF THE DATABASE RPER.DTA . * open the rper.dta database in the Stata memory . use rper.dta, clear . . * basic commands to describe and look at the data . des Contains data from rper.dta Observations: 279,092 Variables: 30 10 Oct 2022 16:00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- nquest long %12.0g NQUEST nord byte %8.0g NORD anno int %8.0g ANNO yl2 double %10.0g YL2 yl1 double %10.0g YL1 ytp1 double %10.0g YTP1 ytp2 double %10.0g YTP2 ym3 double %10.0g YM3 yl double %10.0g YL ytp double %10.0g YTP yta double %10.0g YTA yt double %10.0g YT ym1 double %10.0g YM1 ym2 double %10.0g YM2 ym double %10.0g YM yca1 double %10.0g YCA1 yca2 double %10.0g YCA2 yca double %10.0g YCA ycf2 double %10.0g YCF2 ycf3 double %10.0g YCF3 ycf4 double %10.0g YCF4 y1 double %10.0g Y1 ycf1 double %10.0g YCF1 ycf double %10.0g YCF yc double %10.0g YC y double %10.0g Y ycf1l double %10.0g YCF1L ycf2l double %10.0g YCF2L ycf3l double %10.0g YCF3L ycfl double %10.0g YCFL ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by: . sum Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- nquest | 279,092 349349 363942.7 1 975477 nord | 279,092 1.615145 .8777305 1 10 anno | 279,092 1997.751 12.15582 1977 2020 yl2 | 262,952 43.86919 584.8588 0 120000 yl1 | 262,952 5422.036 8432.58 0 500000 -------------+--------------------------------------------------------- ytp1 | 279,092 3449.802 6571.752 0 429000 ytp2 | 279,092 20.94316 557.3951 0 130000 ym3 | 268,216 217.7969 3007.96 0 413165.5 yl | 279,092 5225.628 8384.776 0 500000 ytp | 279,092 3470.745 6623.66 0 429000 -------------+--------------------------------------------------------- yta | 279,092 66.13861 1499.497 -100000 400000 yt | 279,092 3536.883 6692.396 -77468.53 426000 ym1 | 204,053 1885.229 8854.259 -30000 1000000 ym2 | 110,182 235.7133 2003.379 0 335697 ym | 279,092 2043.796 12477.79 -200000 2000000 -------------+--------------------------------------------------------- yca1 | 279,092 185.8693 2202.674 0 210000 yca2 | 279,092 2399.193 5175.826 0 347441.2 yca | 279,092 2585.062 6134.111 0 348706.5 ycf2 | 214,929 122.8378 1048.987 0 188304.5 ycf3 | 214,929 182.938 2573.009 0 686814 -------------+--------------------------------------------------------- ycf4 | 214,929 147.4948 889.3302 0 80920.22 y1 | 279,092 13391.37 17574.86 -182600 2092225 ycf1 | 214,929 140.0877 746.1364 0 74252.61 ycf | 214,929 298.3687 3207.304 -63260.31 686865 yc | 279,092 2814.836 7266.31 -57085.9 785865 -------------+--------------------------------------------------------- y | 214,929 16272.94 20016.51 -182517.2 2350705 ycf1l | 214,929 142.6338 775.7876 0 76335.31 ycf2l | 214,929 123.0479 1051.112 0 188304.5 ycf3l | 214,929 192.699 3118.722 0 928127.1 ycfl | 214,929 310.886 3693.229 -63260.31 928195.9 . browse . edit . . ***************** . * 3. SELECTION OF VARIABLES AND OBSERVATIONS OF INTEREST, RENAMING AND LABELLING OF VARIABLES . . * this is a large dataset, but we are interested only on some variables and some observations . . * I want to keep only a subset of variables in my data, since I do not need the others . keep nquest nord anno yl yt ytp . . * I can also drop a variable (or list of variables) . drop yt . . * now I can check again how the dataset has been changed . des Contains data from rper.dta Observations: 279,092 Variables: 5 10 Oct 2022 16:00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- nquest long %12.0g NQUEST nord byte %8.0g NORD anno int %8.0g ANNO yl double %10.0g YL ytp double %10.0g YTP ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved. . . * we can make the data more understandable by using labels or changing variable names . rename nquest id_household . rename nord member_household . . label var id_household "household identifier" . label var member_household "identifier of houshold member" . label var yl "income from employment" . label var ytp "income from pension" . . * now I can check again how the dataset has been changed . des Contains data from rper.dta Observations: 279,092 Variables: 5 10 Oct 2022 16:00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- id_household long %12.0g household identifier member_househ~d byte %8.0g identifier of houshold member anno int %8.0g ANNO yl double %10.0g income from employment ytp double %10.0g income from pension ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved. . . *there are many years in the database . ta anno ANNO | Freq. Percent Cum. ------------+----------------------------------- 1977 | 5,177 1.85 1.85 1978 | 5,683 2.04 3.89 1979 | 5,280 1.89 5.78 1980 | 5,488 1.97 7.75 1981 | 7,344 2.63 10.38 1982 | 7,203 2.58 12.96 1983 | 7,511 2.69 15.65 1984 | 7,199 2.58 18.23 1986 | 13,278 4.76 22.99 1987 | 12,390 4.44 27.43 1989 | 13,833 4.96 32.39 1991 | 13,868 4.97 37.35 1993 | 14,381 5.15 42.51 1995 | 14,641 5.25 47.75 1998 | 12,717 4.56 52.31 2000 | 14,321 5.13 57.44 2002 | 14,031 5.03 62.47 2004 | 13,937 4.99 67.46 2006 | 13,428 4.81 72.27 2008 | 13,702 4.91 77.18 2010 | 13,733 4.92 82.10 2012 | 13,644 4.89 86.99 2014 | 13,539 4.85 91.84 2016 | 11,888 4.26 96.10 2020 | 10,876 3.90 100.00 ------------+----------------------------------- Total | 279,092 100.00 . . * if I want to keep only observations belonging to the years 2014, 2016 and 2020 I can run the following . keep if inrange(anno, 2014,2020) (242,789 observations deleted) . . * I can also drop some observations. For example, to drop the year 2020 . drop if anno==2020 (10,876 observations deleted) . . * check how many years we have now . ta anno ANNO | Freq. Percent Cum. ------------+----------------------------------- 2014 | 13,539 53.25 53.25 2016 | 11,888 46.75 100.00 ------------+----------------------------------- Total | 25,427 100.00 . . . ***************** . * 4. HYPOTHESIS TESTING . . * is labor income higher in 2014 or in 2016? . tab anno, sum(yl) | Summary of income from employment ANNO | Mean Std. dev. Freq. ------------+------------------------------------ 2014 | 6373.8818 9961.9438 13,539 2016 | 6422.3281 9901.7432 11,888 ------------+------------------------------------ Total | 6396.5321 9933.6776 25,427 . . * do we miss something? these are nominal wages, to compare real wages I need to adjust for inflation . * on "ISTAT rivaluta" website I can check that 1 eur. in 2014 is worth 0.998 eur. of 2016 . * let's adjust wages of 2014 to their 2016 value . . * I can generate new variables in stata, for example the consumer price index . gen cpi=1 if anno==2016 (13,539 missing values generated) . replace cpi=0.998 if anno==2014 (13,539 real changes made) . . gen yl_real=yl/cpi . gen ytp_real=ytp/cpi . . * now let's check whether real labor income was higher in 2016 than 2014 . tabstat yl_real, by(anno) s(mean sd min max n) Summary for variables: yl_real Group variable: anno (ANNO) anno | Mean SD Min Max N ---------+-------------------------------------------------- 2014 | 6386.655 9981.907 0 135270.5 13539 2016 | 6422.328 9901.743 0 130000 11888 ---------+-------------------------------------------------- Total | 6403.333 9944.329 0 135270.5 25427 ------------------------------------------------------------ . . * notice that we can do similar things using tab or using tabstat. Often the same analysis can be performed in many ways on STATA. What matters is that the final result is correct... . . * notice that there are many workers with 0 labor earnings. We can assume that they are not working . * I want to compute earnings for indivduals that work . tabstat yl_real if yl_real>0, by(anno) s(mean sd min max n) Summary for variables: yl_real Group variable: anno (ANNO) anno | Mean SD Min Max N ---------+-------------------------------------------------- 2014 | 16868.69 9292.38 100.2004 135270.5 5126 2016 | 17087.88 8866.281 100 130000 4468 ---------+-------------------------------------------------- Total | 16970.77 9096.612 100 135270.5 9594 ------------------------------------------------------------ . . * TO FORMALLY TEST THE SIGNIFICANCE OF THE DIFFERENCE IN AVERAGE EARNINGS ACROSS YEARS . ttest yl_real if yl_real>0, by(anno) Two-sample t test with equal variances ------------------------------------------------------------------------------ Group | Obs Mean Std. err. Std. dev. [95% conf. interval] ---------+-------------------------------------------------------------------- 2014 | 5,126 16868.69 129.7889 9292.38 16614.25 17123.13 2016 | 4,468 17087.88 132.6432 8866.281 16827.83 17347.92 ---------+-------------------------------------------------------------------- Combined | 9,594 16970.77 92.87093 9096.612 16788.72 17152.82 ---------+-------------------------------------------------------------------- diff | -219.1843 186.1765 -584.1296 145.761 ------------------------------------------------------------------------------ diff = mean(2014) - mean(2016) t = -1.1773 H0: diff = 0 Degrees of freedom = 9592 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.1196 Pr(|T| > |t|) = 0.2391 Pr(T > t) = 0.8804 . . * now let's run another ttest: is average pension income higher than labor income? . ttest yl_real==ytp_real Paired t test ------------------------------------------------------------------------------ Variable | Obs Mean Std. err. Std. dev. [95% conf. interval] ---------+-------------------------------------------------------------------- yl_real | 25,427 6403.333 62.36313 9944.329 6281.098 6525.569 ytp_real | 25,427 6550.399 54.85021 8746.33 6442.889 6657.908 ---------+-------------------------------------------------------------------- diff | 25,427 -147.0653 100.3433 16000.59 -343.744 49.61331 ------------------------------------------------------------------------------ mean(diff) = mean(yl_real - ytp_real) t = -1.4656 H0: mean(diff) = 0 Degrees of freedom = 25426 Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0 Pr(T < t) = 0.0714 Pr(|T| > |t|) = 0.1428 Pr(T > t) = 0.9286 . . ***************** . * 5. HYPOTHESIS TESTING FOR CATEGORICAL VARIABLES . . * is the employment rate in 2016 higher than in 2014? . gen employed=yl_real>0 . ta employed employed | Freq. Percent Cum. ------------+----------------------------------- 0 | 15,833 62.27 62.27 1 | 9,594 37.73 100.00 ------------+----------------------------------- Total | 25,427 100.00 . * employed is a binary variable (dummy). We should perform a different type of test than the ttest . * Pearson chi2 test (under H0 anno and employed are independent) . ta anno employed, row chi2 +----------------+ | Key | |----------------| | frequency | | row percentage | +----------------+ | employed ANNO | 0 1 | Total -----------+----------------------+---------- 2014 | 8,413 5,126 | 13,539 | 62.14 37.86 | 100.00 -----------+----------------------+---------- 2016 | 7,420 4,468 | 11,888 | 62.42 37.58 | 100.00 -----------+----------------------+---------- Total | 15,833 9,594 | 25,427 | 62.27 37.73 | 100.00 Pearson chi2(1) = 0.2065 Pr = 0.649 . * prtest is designed for binary variables, and has a similar interpretation than ttest . prtest employed, by(anno) Two-sample test of proportions 2014: Number of obs = 13539 2016: Number of obs = 11888 ------------------------------------------------------------------------------ Group | Mean Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- 2014 | .3786099 .0041685 .3704397 .3867801 2016 | .3758412 .0044422 .3671347 .3845477 -------------+---------------------------------------------------------------- diff | .0027688 .0060918 -.0091709 .0147084 | under H0: .0060924 0.45 0.649 ------------------------------------------------------------------------------ diff = prop(2014) - prop(2016) z = 0.4545 H0: diff = 0 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(Z < z) = 0.6753 Pr(|Z| > |z|) = 0.6495 Pr(Z > z) = 0.3247 . . * is the proportion of retirees higher in 2014 or 2016? . gen retired=ytp_real>0 . prtest retired, by(anno) Two-sample test of proportions 2014: Number of obs = 13539 2016: Number of obs = 11888 ------------------------------------------------------------------------------ Group | Mean Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- 2014 | .4576409 .0042817 .449249 .4660328 2016 | .4756898 .0045804 .4667124 .4846672 -------------+---------------------------------------------------------------- diff | -.0180489 .00627 -.0303378 -.00576 | under H0: .00627 -2.88 0.004 ------------------------------------------------------------------------------ diff = prop(2014) - prop(2016) z = -2.8786 H0: diff = 0 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(Z < z) = 0.0020 Pr(|Z| > |z|) = 0.0040 Pr(Z > z) = 0.9980 . . ***************** . * 6. LONGITUDINAL ANALYSIS: INDIVIDUAL INCOME GROWTH . . * in this dataset we can potentially observe the same individual both in 2014 and 2016 . . * first let's create a unique individual identifier . * an individual is given by the combination of the houshold id and the position in the of houshold id . * this command assigns a unique value to the variable id_individual for each combination of the variables id_household and member_household . egen id_indivdiual=group(id_household member_household) . . * how many individuals are observed more than once (in 2014 and 2016) . duplicates report id_indivdiual Duplicates in terms of id_indivdiual -------------------------------------- Copies | Observations Surplus ----------+--------------------------- 1 | 13685 0 2 | 11742 5871 -------------------------------------- . * another way to check this . * _N is the total number of observations. If I compute this by individual, I have the number of times the same individual is observed in the data . bys id_indivdiual: ge N=_N . ta N N | Freq. Percent Cum. ------------+----------------------------------- 1 | 13,685 53.82 53.82 2 | 11,742 46.18 100.00 ------------+----------------------------------- Total | 25,427 100.00 . * are there 11,742/2=5871 individuals observed two times . . * compute the growth in labor income for the same individual across time . * only for individuals that work both in 2016 and 2014 . * since the command is long, I split it in two rows using the "///" symbol (to execute the command, I have to select both rows) . . bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] /// > if _n==2&employed[2]==1&employed[1]==1 (23,540 missing values generated) . . * NOTICE . * bys id_indivdiual (anno) means that the command is executed by individuals, but before executing the command the data is also sorted by year . /* > in practice, the data will be sorted as follows > id_ind anno > 1 2014 > 1 2016 > 2 2014 > 2 2016 > 3 2014 > 3 2016 > ... > and the command "gen" + "if" will be executed separately for each individual > > the result doesn't change if I did write this command instead (why?) > > bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] /// > if _n==_N&employed[_N]==1&employed[1]==1 > > */ . . . su yl_growth Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- yl_growth | 1,887 578.9448 5375.168 -37715.43 34889.78 . * was the individual growth statistically different from zero on average? . ttest yl_growth==0 One-sample t test ------------------------------------------------------------------------------ Variable | Obs Mean Std. err. Std. dev. [95% conf. interval] ---------+-------------------------------------------------------------------- yl_gro~h | 1,887 578.9448 123.7388 5375.168 336.2654 821.6242 ------------------------------------------------------------------------------ mean = mean(yl_growth) t = 4.6788 H0: mean = 0 Degrees of freedom = 1886 Ha: mean < 0 Ha: mean != 0 Ha: mean > 0 Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000 . . * we can perform the same computations using the xtset stata command . xtset id_indivdiual anno Panel variable: id_indivdiual (unbalanced) Time variable: anno, 2014 to 2016, but with gaps Delta: 1 unit . . * "L2." means the two period lag (stata understands that the time variable is year, and that there are two years of distance between 2016 and 2014) . gen yl_growth_bis=yl_real-L2.yl_real if employed==1&L2.employed==1 (23,540 missing values generated) . * check that the two variables are identical! . su yl_growth yl_growth_bis Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- yl_growth | 1,887 578.9448 5375.168 -37715.43 34889.78 yl_growth_~s | 1,887 578.9448 5375.168 -37715.43 34889.78 . . ***************** . * 7. LONGITUDINAL ANALYSIS: THE REPLACEMENT RATE . . * some workers in 2014 become retirees in 2016 . * what is the replacement rate? . count if L2.employed==1&retired==1 163 . . gen replacement_rate=ytp_real/L2.yl_real if L2.employed==1&retired==1 (25,264 missing values generated) . sum replacement_rate Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- replacemen~e | 163 2.169096 10.20209 .0386323 128.1831 . * if I run summarize with the option ",de" I get more info about the variable . sum replacement_rate, de replacement_rate ------------------------------------------------------------- Percentiles Smallest 1% .0589727 .0386323 5% .1365684 .0589727 10% .23952 .0840421 Obs 163 25% .6487 .0855429 Sum of wgt. 163 50% .9510353 Mean 2.169096 Largest Std. dev. 10.20209 75% 1.43233 7.819152 90% 2.33532 10.61509 Variance 104.0827 95% 3.7924 25.948 Skewness 11.74982 99% 25.948 128.1831 Kurtosis 144.7568 . . * notice that quite a lot of individuals have a replacement rate above 1 . * why? . * - pensions usually depend on the entire career of workers, not just on their very last job spell . * - measurement error/low representativeness of the sample... . . * let's check if labor income in the last job is positively correlated with pension income... . gen last_job=L2.yl_real if L2.employed==1&retired==1 (25,264 missing values generated) . . corr ytp_real last_job (obs=163) | ytp_real last_job -------------+------------------ ytp_real | 1.0000 last_job | 0.5481 1.0000 . . * we can check this with a regression analysis as well . reg ytp_real last_job Source | SS df MS Number of obs = 163 -------------+---------------------------------- F(1, 161) = 69.15 Model | 4.1728e+09 1 4.1728e+09 Prob > F = 0.0000 Residual | 9.7148e+09 161 60340083.7 R-squared = 0.3005 -------------+---------------------------------- Adj R-squared = 0.2961 Total | 1.3888e+10 162 85725343.8 Root MSE = 7767.9 ------------------------------------------------------------------------------ ytp_real | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- last_job | .5329378 .0640867 8.32 0.000 .4063788 .6594968 _cons | 6018.424 1183.701 5.08 0.000 3680.841 8356.007 ------------------------------------------------------------------------------ . . * and also with a scatterplot . twoway (scatter ytp_real last_job) (lfit ytp_real last_job) . . . . ***************** . * 8. FINALLY, LET'S CLOSE THE LOG FILE . log close name: log: /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_1/lecture1.log log type: text closed on: 15 Oct 2024, 17:23:46 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------