Introduction

Submission for the Jan 2021 Citi Tech Hackathon. Univest is a two-sided marketplace for Income Share Agreements (ISAs). Students sign up, indicate how much money they need for tuition, receive offers from investors, and choose the best one. Investors submit offers to fund the ISAs and can expect low risk with medium-to-high returns.

In our MVP, a student can create a request for an ISA. The request is posted to our backend and stored in a SQLite database. Our machine learning model then calculates the expected yearly ROI over a 10-year span after graduation. Investors can search these ISA requests and see the expected ROI for each.
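
A minimal sketch of what that request flow could look like on the backend (Flask, the endpoint path, and the table schema are assumptions; only the SQLite storage is stated above):

```python
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/isa-requests", methods=["POST"])
def create_isa_request():
    data = request.get_json()  # e.g. {"student": "...", "amount": 20000}
    # Store the ISA request so investors can later search it.
    with sqlite3.connect("univest.db") as conn:
        conn.execute(
            "INSERT INTO isa_requests (student, amount) VALUES (?, ?)",
            (data["student"], data["amount"]),
        )
    return jsonify({"status": "created"}), 201
```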

Data Analysis

WallStreetBets

Aim:

To predict the investor's return on investment (ROI).

Strategy:

Use data that can be collected for college students (e.g. major, gender, number of siblings, political views) to train a model that will predict the student’s future income after graduation. Use that prediction to calculate the expected ROI for the investor.
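
The write-up does not spell out the ISA terms, so the exact ROI formula is left open; one illustrative calculation, assuming the investor receives a fixed share of the predicted income over the 10-year window (the 5% share and the formula itself are assumptions for illustration only):

```python
def expected_yearly_roi(predicted_income, principal, share=0.05, years=10):
    """Illustrative ROI: the investor receives `share` of the predicted
    income for `years` years in exchange for `principal` up front."""
    total_repaid = share * predicted_income * years
    # Annualised return over the repayment window.
    return (total_repaid / principal) ** (1 / years) - 1

expected_yearly_roi(predicted_income=60000, principal=20000)
# -> ~0.041, i.e. roughly 4.1% per year under these assumed terms
```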

Data:

The data used for the analysis and modelling was collected from the General Social Survey (GSS). Ten datasets, covering the years 2000 to 2018, were combined, and only individuals who fit the target market were kept (ages 21 to 45 who attended college or above). The idea was to see what salaries these individuals earn and then use back-datable features in the dataset (e.g. gender, parents' highest level of education) to build a model that predicts salary from those features. Because the features are back-datable, we can ask our student clients for the same information and predict their future salary. That prediction is then used to determine the expected ROI.
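
A minimal sketch of that combining and filtering step, assuming the yearly GSS extracts were saved as CSV files (the file names are assumptions; the column names match the data overview below):

```python
import glob
import pandas as pd

# Combine the yearly GSS extracts into one dataframe (file names are assumptions).
frames = [pd.read_csv(path) for path in glob.glob("data/gss_*.csv")]
df = pd.concat(frames, ignore_index=True)

# Keep only the target market: ages 21-45 who attended college or above.
college_or_above = ["JUNIOR COLLEGE", "BACHELOR", "GRADUATE"]
df = df[df["AGE"].between(21, 45) & df["DEGREE"].isin(college_or_above)]
```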

Breakdown:

EDA, Visualization and Data Wrangling: a notebook containing the initial analysis of the data.

Modelling, ML Regressors, Keras Feed Forward Neural Network: a notebook containing the model training and evaluation.


EDA, Visualization and Data Wrangling

Data Overview

First five rows of the combined dataframe, shown transposed (one record per column):

| Column | Row 0 | Row 1 | Row 2 | Row 3 | Row 4 |
| --- | --- | --- | --- | --- | --- |
| OCC10 | Broadcast and sound engineering technicians an... | Advertising and promotions managers | First-line supervisors of office and administr... | Dispatchers | Software developers, applications and systems ... |
| SIBS | 1.0 | 6.0 | 0.0 | 8.0 | 7.0 |
| AGE | 26.0 | 44.0 | 44.0 | 40.0 | 37.0 |
| EDUC | 16.0 | 14.0 | 18.0 | 16.0 | 16.0 |
| PAEDUC | 16.0 | 12.0 | 11.0 | 10.0 | NaN |
| MAEDUC | 16.0 | 12.0 | 11.0 | 10.0 | 13.0 |
| DEGREE | BACHELOR | JUNIOR COLLEGE | GRADUATE | HIGH SCHOOL | BACHELOR |
| PADEG | BACHELOR | HIGH SCHOOL | HIGH SCHOOL | LT HIGH SCHOOL | NaN |
| MADEG | GRADUATE | HIGH SCHOOL | HIGH SCHOOL | LT HIGH SCHOOL | HIGH SCHOOL |
| MAJOR1 | NaN | NaN | NaN | NaN | NaN |
| MAJOR2 | NaN | NaN | NaN | NaN | NaN |
| DIPGED | NaN | NaN | NaN | NaN | NaN |
| SECTOR | NaN | NaN | NaN | NaN | NaN |
| BARATE | NaN | NaN | NaN | NaN | NaN |
| SEX | MALE | FEMALE | MALE | MALE | MALE |
| RACE | WHITE | WHITE | WHITE | BLACK | WHITE |
| RES16 | CITY GT 250000 | BIG-CITY SUBURB | TOWN LT 50000 | TOWN LT 50000 | COUNTRY,NONFARM |
| REG16 | W. SOU. CENTRAL | E. NOR. CENTRAL | W. SOU. CENTRAL | W. SOU. CENTRAL | W. SOU. CENTRAL |
| FAMILY16 | MOTHER & FATHER | MOTHER & FATHER | MOTHER & FATHER | MOTHER & FATHER | MOTHER |
| MAWRKGRW | YES | YES | YES | YES | YES |
| INCOM16 | NaN | NaN | NaN | NaN | NaN |
| BORN | YES | YES | YES | YES | YES |
| PARBORN | BOTH IN U.S | BOTH IN U.S | BOTH IN U.S | BOTH IN U.S | BOTH IN U.S |
| GRANBORN | 1.0 | 1.0 | ALL IN U.S | ALL IN U.S | NaN |
| POLVIEWS | SLGHTLY CONSERVATIVE | LIBERAL | SLIGHTLY LIBERAL | MODERATE | LIBERAL |
| INCOME | $8 000 TO 9 999 | $7 000 TO 7 999 | $50000 TO 59999 | $25000 TO 29999 | $75000 TO $89999 |

Income Breakdown

Income is reported in ranges, so we recoded it into a numeric variable by taking the lower bound of each range. We used the lower bound rather than the midpoint because some ranges have no upper bound in the data.
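
A hedged sketch of that recoding, assuming INCOME is stored as strings such as "$50000 TO 59999" (the helper name is ours):

```python
import re
import pandas as pd

def income_lower_bound(income_str):
    """Lower bound of an income range string, e.g. '$50000 TO 59999' -> 50000.0."""
    if pd.isna(income_str):
        return float("nan")
    digits = re.sub(r"[^\d]", "", str(income_str).split("TO")[0])
    return float(digits) if digits else float("nan")

df["INCOME"] = df["INCOME"].apply(income_lower_bound)
```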

After recoding, the income distribution looks as follows:
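
The distribution plot itself is not reproduced here; a small sketch of how it can be drawn from the recoded column (the plotting choices are ours):

```python
import matplotlib.pyplot as plt

# Histogram of the recoded (lower-bound) incomes.
plt.hist(df["INCOME"].dropna(), bins=30)
plt.xlabel("Income (lower bound of reported range, USD)")
plt.ylabel("Respondents")
plt.title("Income distribution after recoding")
plt.show()
```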

Income by other features

Correlation Matrix

Majors

Degrees

Diplomas

Father’s Education

Mother’s Education

Siblings

Guardians

Parents were born in the US

Grandparents were born in the US

Sex

Political Views

Feature Engineering

| Feature | No. of Missing |
| --- | --- |
| MAJOR1 | 2492 |
| DIPGED | 2438 |
| PADEG | 762 |
| POLVIEWS | 465 |
| MADEG | 350 |
| GRANBORN | 301 |
| SIBS | 207 |
| PARBORN | 203 |
| FAMILY16 | 202 |
| DEGREE | 0 |
| SEX | 0 |

Methods:

Polviews
import pandas as pd

# Drop MAJOR1 (handled separately) and any rows with remaining missing values.
df_dummies_polviews = df.drop(columns="MAJOR1")
df_dummies_polviews = df_dummies_polviews.dropna()

# POLVIEWS is the target; one-hot encode the remaining categorical features.
tmp_y = df_dummies_polviews["POLVIEWS"]
df_dummies_polviews = df_dummies_polviews.drop(columns="POLVIEWS")
df_dummies_polviews = pd.get_dummies(df_dummies_polviews)
After get_dummies the frame has 43 columns: SIBS stays numeric, and the remaining 42 are 0/1 indicators such as DEGREE_BACHELOR, PADEG_GRADUATE, MADEG_HIGH SCHOOL, SEX_FEMALE, DIPGED_GED, FAMILY16_MOTHER & FATHER, PARBORN_BOTH IN U.S and GRANBORN_ALL IN U.S (the full df_dummies_polviews.head() output is omitted here).
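
The classifier snippets below use X_train, X_test, y_train and y_test; a sketch of the split that produces them (the 80/20 ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Rows with known POLVIEWS become the train/test pool for the imputation classifiers.
X_train, X_test, y_train, y_test = train_test_split(
    df_dummies_polviews, tmp_y, test_size=0.2, random_state=42)
```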

SVM

from sklearn import svm

model_polviews = svm.SVC(random_state=42)
model_polviews.fit(X_train, y_train)
model_polviews.score(X_test, y_test)

0.2909090909090909

Logistic Regression

from sklearn.linear_model import LogisticRegression

model_polviews = LogisticRegression(random_state=42)
model_polviews.fit(X_train, y_train)
model_polviews.score(X_test, y_test)

0.28484848484848485

Decision tree

from sklearn import tree

model_polviews = tree.DecisionTreeClassifier(random_state=42, max_depth=5)
model_polviews.fit(X_train, y_train)
model_polviews.score(X_test, y_test)

0.27575757575757576

Random forest

from sklearn.ensemble import RandomForestClassifier

model_polviews = RandomForestClassifier(max_depth=6, random_state=42)
model_polviews.fit(X_train, y_train)
model_polviews.score(X_test, y_test)

0.2909090909090909

SVM and random forest gave us similar results. We used the SVM model to predict the missing values of political views.

We went through the same process for MAJOR1 and ultimately used a random forest to predict the missing values of majors.
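
A sketch of how the fitted classifier can then fill those gaps (the variable names for the rows with missing POLVIEWS are assumptions, not part of the submitted notebooks):

```python
import pandas as pd

# Rows where POLVIEWS is missing, encoded with the same dummy columns as the training frame
# (this assumes the other features in these rows have already been filled in).
missing_mask = df["POLVIEWS"].isna()
X_missing = pd.get_dummies(df.loc[missing_mask].drop(columns=["MAJOR1", "POLVIEWS"]))
X_missing = X_missing.reindex(columns=df_dummies_polviews.columns, fill_value=0)

# Impute the missing political views with the trained SVM classifier.
df.loc[missing_mask, "POLVIEWS"] = model_polviews.predict(X_missing)
```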

Modelling, ML Regressors, Keras Feed Forward Neural Network

Train/test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train shape: (3010, 128) | X_test shape: (753, 128)
y_train mean: 43348.02 | y_test mean: 44745.69
128 features

ML models

Linear Regression
from sklearn import linear_model

LR = linear_model.LinearRegression().fit(X_train, y_train)
LR.score(X_test, y_test)

-2.7751720305407978e+17

Ridge Regression
RR = linear_model.Ridge(alpha=85, random_state=42).fit(X_train, y_train)
RR.score(X_test, y_test)

0.15757294603748428

Lasso Regression
LAS = linear_model.Lasso(alpha=85, random_state=42).fit(X_train, y_train)
LAS.score(X_test, y_test)

0.15091859140242336

Random Forest
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor(max_depth=6, random_state=42).fit(X_train, y_train)
RF.score(X_test, y_test)

0.14687449437771827

XGBoost
import xgboost as xgb

XG = xgb.XGBRegressor(objective='reg:squarederror', colsample_bynode=0.5, colsample_bylevel=0.5,
                      learning_rate=0.05, max_depth=5, alpha=10, n_estimators=100, gamma=0.5)

XG.fit(X_train, y_train)
XG.score(X_test, y_test)

0.15566510863040006

Hyperparameter tuning

A grid search over the XGBoost hyperparameters gave the following best parameters:

best_params={'colsample_bylevel': 0.5, 'colsample_bynode': 0.5, 'colsample_bytree': 0.5, 'gamma': 0, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 100, 'objective': 'reg:squarederror', 'subsample': 0.9}
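
The grid-search code itself is not shown above; a hedged sketch of what it could look like (the candidate values in the grid are assumptions, only the reported best parameters come from the source):

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Candidate values are illustrative; the actual grid used in the notebook is not documented.
param_grid = {
    "colsample_bylevel": [0.5, 1.0],
    "colsample_bynode": [0.5, 1.0],
    "colsample_bytree": [0.5, 1.0],
    "gamma": [0, 0.5],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "min_child_weight": [1, 4],
    "n_estimators": [100, 200],
    "subsample": [0.7, 0.9],
}
search = GridSearchCV(
    xgb.XGBRegressor(objective="reg:squarederror"),
    param_grid, cv=3, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)
best_params = search.best_params_
```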

The XGBoost model is updated:

XG = xgb.XGBRegressor(**best_params)
XG.fit(X_train, y_train) 
XG.score(X_test, y_test)

0.16381963132089417

XGBoost gave us the best result so far.

TF/Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Feed-forward network: 128 input features -> 512 -> 256 -> 128 -> 1 output (predicted income).
modelff = Sequential()
modelff.add(Dense(units=512, activation='relu', input_dim=128))
modelff.add(Dropout(0.5))
modelff.add(Dense(units=256, activation='relu'))
modelff.add(Dropout(0.5))
modelff.add(Dense(units=128, activation='linear'))
modelff.add(Dropout(0.5))
modelff.add(Dense(1, activation='linear'))
# Note: 'accuracy' is not meaningful for a regression loss; kept here as in the original run.
modelff.compile(loss='mse', optimizer="adam", metrics=['accuracy'])
modelff.summary()

Model: "sequential"

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| dense (Dense) | (None, 512) | 66048 |
| dropout (Dropout) | (None, 512) | 0 |
| dense_1 (Dense) | (None, 256) | 131328 |
| dropout_1 (Dropout) | (None, 256) | 0 |
| dense_2 (Dense) | (None, 128) | 32896 |
| dropout_2 (Dropout) | (None, 128) | 0 |
| dense_3 (Dense) | (None, 1) | 129 |

Total params: 230,401
Trainable params: 230,401
Non-trainable params: 0

```python
num_epochs = 5000  # defined but never passed to fit(), so training ran for the default single epoch
history = modelff.fit(X_train, y_train.values)
```

95/95 [==============================] - 0s 1ms/step - loss: 2388068864.0000 - accuracy: 0.0123

```python
from sklearn.metrics import r2_score

y_pred = modelff.predict(X_test)
r2_score(y_test, y_pred)
```

-0.1031647121676047

The neural network's result is not very good. We will use the XGBoost model for our final prediction.

Exporting the Model

```python
import pickle

with open('model_pickle', 'wb') as f:
    pickle.dump(XG, f)
```
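
For completeness, a sketch of how the exported model could later be loaded (for example, by the backend) to score an encoded student record; the file name mirrors the export above, everything else is an assumption:

```python
import pickle

# Load the pickled XGBoost regressor saved above.
with open('model_pickle', 'rb') as f:
    model = pickle.load(f)

# Predict expected income for one encoded student record (same 128 features as training).
predicted_income = model.predict(X_test[:1])
```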