----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_1/lecture1.log
  log type:  text
 opened on:  15 Oct 2024, 17:23:43

. 
. *****************
. * 2.    EXPLORATORY ANALYSIS OF THE DATABASE RPER.DTA
. * open the rper.dta database in the Stata memory
. use rper.dta, clear

. 
. * basic commands to describe and look at the data
. des

Contains data from rper.dta
 Observations:       279,092                  
    Variables:            30                  10 Oct 2022 16:00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nquest          long    %12.0g                NQUEST
nord            byte    %8.0g                 NORD
anno            int     %8.0g                 ANNO
yl2             double  %10.0g                YL2
yl1             double  %10.0g                YL1
ytp1            double  %10.0g                YTP1
ytp2            double  %10.0g                YTP2
ym3             double  %10.0g                YM3
yl              double  %10.0g                YL
ytp             double  %10.0g                YTP
yta             double  %10.0g                YTA
yt              double  %10.0g                YT
ym1             double  %10.0g                YM1
ym2             double  %10.0g                YM2
ym              double  %10.0g                YM
yca1            double  %10.0g                YCA1
yca2            double  %10.0g                YCA2
yca             double  %10.0g                YCA
ycf2            double  %10.0g                YCF2
ycf3            double  %10.0g                YCF3
ycf4            double  %10.0g                YCF4
y1              double  %10.0g                Y1
ycf1            double  %10.0g                YCF1
ycf             double  %10.0g                YCF
yc              double  %10.0g                YC
y               double  %10.0g                Y
ycf1l           double  %10.0g                YCF1L
ycf2l           double  %10.0g                YCF2L
ycf3l           double  %10.0g                YCF3L
ycfl            double  %10.0g                YCFL
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

. sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      nquest |    279,092      349349    363942.7          1     975477
        nord |    279,092    1.615145    .8777305          1         10
        anno |    279,092    1997.751    12.15582       1977       2020
         yl2 |    262,952    43.86919    584.8588          0     120000
         yl1 |    262,952    5422.036     8432.58          0     500000
-------------+---------------------------------------------------------
        ytp1 |    279,092    3449.802    6571.752          0     429000
        ytp2 |    279,092    20.94316    557.3951          0     130000
         ym3 |    268,216    217.7969     3007.96          0   413165.5
          yl |    279,092    5225.628    8384.776          0     500000
         ytp |    279,092    3470.745     6623.66          0     429000
-------------+---------------------------------------------------------
         yta |    279,092    66.13861    1499.497    -100000     400000
          yt |    279,092    3536.883    6692.396  -77468.53     426000
         ym1 |    204,053    1885.229    8854.259     -30000    1000000
         ym2 |    110,182    235.7133    2003.379          0     335697
          ym |    279,092    2043.796    12477.79    -200000    2000000
-------------+---------------------------------------------------------
        yca1 |    279,092    185.8693    2202.674          0     210000
        yca2 |    279,092    2399.193    5175.826          0   347441.2
         yca |    279,092    2585.062    6134.111          0   348706.5
        ycf2 |    214,929    122.8378    1048.987          0   188304.5
        ycf3 |    214,929     182.938    2573.009          0     686814
-------------+---------------------------------------------------------
        ycf4 |    214,929    147.4948    889.3302          0   80920.22
          y1 |    279,092    13391.37    17574.86    -182600    2092225
        ycf1 |    214,929    140.0877    746.1364          0   74252.61
         ycf |    214,929    298.3687    3207.304  -63260.31     686865
          yc |    279,092    2814.836     7266.31   -57085.9     785865
-------------+---------------------------------------------------------
           y |    214,929    16272.94    20016.51  -182517.2    2350705
       ycf1l |    214,929    142.6338    775.7876          0   76335.31
       ycf2l |    214,929    123.0479    1051.112          0   188304.5
       ycf3l |    214,929     192.699    3118.722          0   928127.1
        ycfl |    214,929     310.886    3693.229  -63260.31   928195.9

. browse

. edit

. 
. *****************
. * 3.    SELECTION OF VARIABLES AND OBSERVATIONS OF INTEREST, RENAMING AND LABELLING OF VARIABLES
. 
. * this is a large dataset, but we are interested only on some variables and some observations
. 
. * I want to keep only a subset of variables in my data, since I do not need the others
. keep nquest nord anno yl yt ytp

. 
. * I can also drop a variable (or list of variables)
. drop yt

. 
. * now I can check again how the dataset has been changed
. des

Contains data from rper.dta
 Observations:       279,092                  
    Variables:             5                  10 Oct 2022 16:00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nquest          long    %12.0g                NQUEST
nord            byte    %8.0g                 NORD
anno            int     %8.0g                 ANNO
yl              double  %10.0g                YL
ytp             double  %10.0g                YTP
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. 
. * we can make the data more understandable by using labels or changing variable names
. rename nquest id_household

. rename nord member_household

. 
. label var id_household "household identifier"

. label var member_household "identifier of houshold member"

. label var yl "income from employment"

. label var ytp "income from pension"

. 
. * now I can check again how the dataset has been changed
. des

Contains data from rper.dta
 Observations:       279,092                  
    Variables:             5                  10 Oct 2022 16:00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id_household    long    %12.0g                household identifier
member_househ~d byte    %8.0g                 identifier of houshold member
anno            int     %8.0g                 ANNO
yl              double  %10.0g                income from employment
ytp             double  %10.0g                income from pension
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. 
. *there are many years in the database
. ta anno

       ANNO |      Freq.     Percent        Cum.
------------+-----------------------------------
       1977 |      5,177        1.85        1.85
       1978 |      5,683        2.04        3.89
       1979 |      5,280        1.89        5.78
       1980 |      5,488        1.97        7.75
       1981 |      7,344        2.63       10.38
       1982 |      7,203        2.58       12.96
       1983 |      7,511        2.69       15.65
       1984 |      7,199        2.58       18.23
       1986 |     13,278        4.76       22.99
       1987 |     12,390        4.44       27.43
       1989 |     13,833        4.96       32.39
       1991 |     13,868        4.97       37.35
       1993 |     14,381        5.15       42.51
       1995 |     14,641        5.25       47.75
       1998 |     12,717        4.56       52.31
       2000 |     14,321        5.13       57.44
       2002 |     14,031        5.03       62.47
       2004 |     13,937        4.99       67.46
       2006 |     13,428        4.81       72.27
       2008 |     13,702        4.91       77.18
       2010 |     13,733        4.92       82.10
       2012 |     13,644        4.89       86.99
       2014 |     13,539        4.85       91.84
       2016 |     11,888        4.26       96.10
       2020 |     10,876        3.90      100.00
------------+-----------------------------------
      Total |    279,092      100.00

. 
. * if I want to keep only observations belonging to the years 2014, 2016 and 2020 I can run the following
. keep if inrange(anno, 2014,2020)
(242,789 observations deleted)

. 
. * I can also drop some observations. For example, to drop the year 2020
. drop if anno==2020
(10,876 observations deleted)

. 
. * check how many years we have now
. ta anno

       ANNO |      Freq.     Percent        Cum.
------------+-----------------------------------
       2014 |     13,539       53.25       53.25
       2016 |     11,888       46.75      100.00
------------+-----------------------------------
      Total |     25,427      100.00

. 
. 
. *****************
. * 4.    HYPOTHESIS TESTING
. 
. * is labor income higher in 2014 or in 2016?
. tab anno, sum(yl)

            |  Summary of income from employment
       ANNO |        Mean   Std. dev.       Freq.
------------+------------------------------------
       2014 |   6373.8818   9961.9438      13,539
       2016 |   6422.3281   9901.7432      11,888
------------+------------------------------------
      Total |   6396.5321   9933.6776      25,427

. 
. * do we miss something? these are nominal wages, to compare real wages I need to adjust for inflation
. * on "ISTAT rivaluta" website I can check that 1 eur. in 2014 is worth 0.998 eur. of 2016
. * let's adjust wages of 2014 to their 2016 value
. 
. * I can generate new variables in stata, for example the consumer price index
. gen cpi=1 if anno==2016
(13,539 missing values generated)

. replace cpi=0.998 if anno==2014
(13,539 real changes made)

. 
. gen yl_real=yl/cpi

. gen ytp_real=ytp/cpi

. 
. * now let's check whether real labor income was higher in 2016 than 2014
. tabstat yl_real, by(anno) s(mean sd min max n)

Summary for variables: yl_real
Group variable: anno (ANNO)

    anno |      Mean        SD       Min       Max         N
---------+--------------------------------------------------
    2014 |  6386.655  9981.907         0  135270.5     13539
    2016 |  6422.328  9901.743         0    130000     11888
---------+--------------------------------------------------
   Total |  6403.333  9944.329         0  135270.5     25427
------------------------------------------------------------

. 
. * notice that we can do similar things using tab or using tabstat. Often the same analysis can be performed in many ways on STATA. What matters is that the final result is correct...
. 
. * notice that there are many workers with 0 labor earnings. We can assume that they are not working
. * I want to compute earnings for indivduals that work
. tabstat yl_real if yl_real>0, by(anno) s(mean sd min max n)

Summary for variables: yl_real
Group variable: anno (ANNO)

    anno |      Mean        SD       Min       Max         N
---------+--------------------------------------------------
    2014 |  16868.69   9292.38  100.2004  135270.5      5126
    2016 |  17087.88  8866.281       100    130000      4468
---------+--------------------------------------------------
   Total |  16970.77  9096.612       100  135270.5      9594
------------------------------------------------------------

. 
. * TO FORMALLY TEST THE SIGNIFICANCE OF THE DIFFERENCE IN AVERAGE EARNINGS ACROSS YEARS
. ttest yl_real if yl_real>0, by(anno)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
    2014 |   5,126    16868.69    129.7889     9292.38    16614.25    17123.13
    2016 |   4,468    17087.88    132.6432    8866.281    16827.83    17347.92
---------+--------------------------------------------------------------------
Combined |   9,594    16970.77    92.87093    9096.612    16788.72    17152.82
---------+--------------------------------------------------------------------
    diff |           -219.1843    186.1765               -584.1296     145.761
------------------------------------------------------------------------------
    diff = mean(2014) - mean(2016)                                t =  -1.1773
H0: diff = 0                                     Degrees of freedom =     9592

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1196         Pr(|T| > |t|) = 0.2391          Pr(T > t) = 0.8804

. 
. * now let's run another ttest: is average pension income higher than labor income?
. ttest yl_real==ytp_real

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
 yl_real |  25,427    6403.333    62.36313    9944.329    6281.098    6525.569
ytp_real |  25,427    6550.399    54.85021     8746.33    6442.889    6657.908
---------+--------------------------------------------------------------------
    diff |  25,427   -147.0653    100.3433    16000.59    -343.744    49.61331
------------------------------------------------------------------------------
     mean(diff) = mean(yl_real - ytp_real)                        t =  -1.4656
 H0: mean(diff) = 0                              Degrees of freedom =    25426

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0714         Pr(|T| > |t|) = 0.1428          Pr(T > t) = 0.9286

. 
. *****************
. * 5.    HYPOTHESIS TESTING FOR CATEGORICAL VARIABLES
. 
. * is the employment rate in 2016 higher than in 2014?
. gen employed=yl_real>0

. ta employed

   employed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     15,833       62.27       62.27
          1 |      9,594       37.73      100.00
------------+-----------------------------------
      Total |     25,427      100.00

. * employed is a binary variable (dummy). We should perform a different type of test than the ttest
. * Pearson chi2 test (under H0 anno and employed are independent)
. ta anno employed, row  chi2 

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |       employed
      ANNO |         0          1 |     Total
-----------+----------------------+----------
      2014 |     8,413      5,126 |    13,539 
           |     62.14      37.86 |    100.00 
-----------+----------------------+----------
      2016 |     7,420      4,468 |    11,888 
           |     62.42      37.58 |    100.00 
-----------+----------------------+----------
     Total |    15,833      9,594 |    25,427 
           |     62.27      37.73 |    100.00 

          Pearson chi2(1) =   0.2065   Pr = 0.649

. * prtest is designed for binary variables, and has a similar interpretation than ttest
. prtest employed, by(anno)

Two-sample test of proportions                  2014: Number of obs =    13539
                                                2016: Number of obs =    11888
------------------------------------------------------------------------------
       Group |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        2014 |   .3786099   .0041685                      .3704397    .3867801
        2016 |   .3758412   .0044422                      .3671347    .3845477
-------------+----------------------------------------------------------------
        diff |   .0027688   .0060918                     -.0091709    .0147084
             |  under H0:   .0060924     0.45   0.649
------------------------------------------------------------------------------
        diff = prop(2014) - prop(2016)                            z =   0.4545
    H0: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.6753         Pr(|Z| > |z|) = 0.6495          Pr(Z > z) = 0.3247

. 
. * is the proportion of retirees higher in 2014 or 2016?
. gen retired=ytp_real>0

.  prtest retired, by(anno)

Two-sample test of proportions                  2014: Number of obs =    13539
                                                2016: Number of obs =    11888
------------------------------------------------------------------------------
       Group |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        2014 |   .4576409   .0042817                       .449249    .4660328
        2016 |   .4756898   .0045804                      .4667124    .4846672
-------------+----------------------------------------------------------------
        diff |  -.0180489     .00627                     -.0303378     -.00576
             |  under H0:     .00627    -2.88   0.004
------------------------------------------------------------------------------
        diff = prop(2014) - prop(2016)                            z =  -2.8786
    H0: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0020         Pr(|Z| > |z|) = 0.0040          Pr(Z > z) = 0.9980

. 
. *****************
. * 6.    LONGITUDINAL ANALYSIS: INDIVIDUAL INCOME GROWTH
. 
. * in this dataset we can potentially observe the same individual both in 2014 and 2016
. 
. * first let's create a unique individual identifier
. * an individual is given by the combination of the houshold id and the position in the of houshold id
. * this command assigns a unique value to the variable id_individual for each combination of the variables id_household and member_household
. egen id_indivdiual=group(id_household member_household)

. 
. * how many individuals are observed more than once (in 2014 and 2016)
. duplicates report id_indivdiual

Duplicates in terms of id_indivdiual

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |        13685             0
        2 |        11742          5871
--------------------------------------

. * another way to check this
. * _N is the total number of observations. If I compute this by individual, I have the number of times the same individual is observed in the data
. bys id_indivdiual: ge N=_N

. ta N

          N |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     13,685       53.82       53.82
          2 |     11,742       46.18      100.00
------------+-----------------------------------
      Total |     25,427      100.00

. * are there 11,742/2=5871 individuals observed two times
. 
. * compute the growth in labor income for the same individual across time
. * only for individuals that work both in 2016 and 2014
. * since the command is long, I split it in two rows using the "///" symbol (to execute the command, I have to select both rows)
. 
. bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] ///
> if _n==2&employed[2]==1&employed[1]==1
(23,540 missing values generated)

. 
. * NOTICE
. * bys id_indivdiual (anno) means that the command is executed by individuals, but before executing the command the data is also sorted by year
. /*
> in practice, the data will be sorted as follows
>         id_ind  anno
>         1               2014
>         1               2016
>         2               2014
>         2               2016
>         3               2014
>         3               2016
>         ...
> and the command "gen" + "if" will be executed separately for each individual
> 
> the result doesn't change if I did write this command instead (why?)
> 
> bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] ///
> if _n==_N&employed[_N]==1&employed[1]==1
> 
> */
. 
. 
. su yl_growth

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   yl_growth |      1,887    578.9448    5375.168  -37715.43   34889.78

. * was the individual growth statistically different from zero on average?
. ttest yl_growth==0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
yl_gro~h |   1,887    578.9448    123.7388    5375.168    336.2654    821.6242
------------------------------------------------------------------------------
    mean = mean(yl_growth)                                        t =   4.6788
H0: mean = 0                                     Degrees of freedom =     1886

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

. 
. * we can perform the same computations using the xtset stata command
. xtset id_indivdiual anno

Panel variable: id_indivdiual (unbalanced)
 Time variable: anno, 2014 to 2016, but with gaps
         Delta: 1 unit

. 
. * "L2." means the two period lag (stata understands that the time variable is year, and that there are two years of distance between 2016 and 2014)
. gen yl_growth_bis=yl_real-L2.yl_real if employed==1&L2.employed==1
(23,540 missing values generated)

. * check that the two variables are identical!
. su yl_growth yl_growth_bis

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   yl_growth |      1,887    578.9448    5375.168  -37715.43   34889.78
yl_growth_~s |      1,887    578.9448    5375.168  -37715.43   34889.78

. 
. *****************
. * 7.    LONGITUDINAL ANALYSIS: THE REPLACEMENT RATE
. 
. * some workers in 2014 become retirees in 2016
. * what is the replacement rate?
. count if L2.employed==1&retired==1
  163

. 
. gen replacement_rate=ytp_real/L2.yl_real if L2.employed==1&retired==1
(25,264 missing values generated)

. sum replacement_rate

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
replacemen~e |        163    2.169096    10.20209   .0386323   128.1831

. * if I run summarize with the option ",de" I get more info about the variable
. sum replacement_rate, de

                      replacement_rate
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0589727       .0386323
 5%     .1365684       .0589727
10%       .23952       .0840421       Obs                 163
25%        .6487       .0855429       Sum of wgt.         163

50%     .9510353                      Mean           2.169096
                        Largest       Std. dev.      10.20209
75%      1.43233       7.819152
90%      2.33532       10.61509       Variance       104.0827
95%       3.7924         25.948       Skewness       11.74982
99%       25.948       128.1831       Kurtosis       144.7568

. 
. * notice that quite a lot of individuals have a replacement rate above 1
. * why?
. * - pensions usually depend on the entire career of workers, not just on their very last job spell
. * - measurement error/low representativeness of the sample...
. 
. * let's check if labor income in the last job is positively correlated with pension income...
. gen last_job=L2.yl_real if L2.employed==1&retired==1
(25,264 missing values generated)

. 
. corr ytp_real last_job
(obs=163)

             | ytp_real last_job
-------------+------------------
    ytp_real |   1.0000
    last_job |   0.5481   1.0000


. 
. * we can check this with a regression analysis as well
. reg ytp_real last_job

      Source |       SS           df       MS      Number of obs   =       163
-------------+----------------------------------   F(1, 161)       =     69.15
       Model |  4.1728e+09         1  4.1728e+09   Prob > F        =    0.0000
    Residual |  9.7148e+09       161  60340083.7   R-squared       =    0.3005
-------------+----------------------------------   Adj R-squared   =    0.2961
       Total |  1.3888e+10       162  85725343.8   Root MSE        =    7767.9

------------------------------------------------------------------------------
    ytp_real | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    last_job |   .5329378   .0640867     8.32   0.000     .4063788    .6594968
       _cons |   6018.424   1183.701     5.08   0.000     3680.841    8356.007
------------------------------------------------------------------------------

. 
. * and also with a scatterplot
. twoway (scatter ytp_real last_job) (lfit ytp_real last_job)

. 
. 
. 
. *****************
. * 8. FINALLY, LET'S CLOSE THE LOG FILE
. log close
      name:  <unnamed>
       log:  /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_1/lecture1.log
  log type:  text
 closed on:  15 Oct 2024, 17:23:46
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------