Portfolio Webpage

Kucherov Ivan Portfolio Projects

Main Tools:

python jupyter power_bi tableau SQLite online excel

You can contact me via email at unequivocally.ivan@gmail.com


Project 1: Chess Dataset Analysis Dashboard

    Data Analysis tableau

    I analyzed a Kaggle Online Chess Games dataset by creating an interactive dashboard in Tableau. You can find this project on my Tableau Public.
    After loading the dataset from Kaggle, I wanted to know whether the pairings on Lichess are fair. To answer that, I needed to know how the rating difference between opponents is distributed. The empirical PDF (probability density function) looked roughly bell-shaped, so I first fit a normal curve to the data using maximum likelihood estimates, without success. After some research I came across the Laplace distribution, which, given the sharp peak and heavy tails of the data, is far more appropriate here. I wrote a 7-page PDF document with an in-depth explanation of the Laplace distribution. There are 4 visualizations on this dashboard:
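    The normal-vs-Laplace comparison can be sketched as follows. This is a minimal illustration using synthetic Laplace-distributed data in place of the real rating differences; the location/scale values are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for the Lichess rating differences:
# Laplace-distributed around 0 (loc and scale are illustrative).
diffs = rng.laplace(loc=0.0, scale=60.0, size=10_000)

# Maximum likelihood fits for both candidate distributions.
norm_params = stats.norm.fit(diffs)        # (mean, std)
laplace_params = stats.laplace.fit(diffs)  # (location, scale)

# Compare goodness of fit via total log-likelihood: higher is better.
ll_norm = stats.norm.logpdf(diffs, *norm_params).sum()
ll_laplace = stats.laplace.logpdf(diffs, *laplace_params).sum()
print(ll_laplace > ll_norm)
```

    On sharply peaked, heavy-tailed data like this, the Laplace fit attains a higher log-likelihood than the normal fit.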


Project 2: ML Stock Price Prediction

    Data Science python jupyter

    I have created a flexible predictive neural network model with LSTM (Long Short-Term Memory) layers in python to forecast stock prices.
    First, I got a list of all NASDAQ-traded ticker symbols using yahoo_fin, since retrieving this information with yfinance proved difficult. I then dropped the symbols that would not contain enough data for comparative analysis. I sampled 10 random ticker symbols from the filtered list (with a fixed random seed so that all results are reproducible) and, using the Yahoo! Finance API, gathered daily Adjusted Close prices for each ticker from 2018-01-02 to 2023-06-30. I split the sample into training (80%) and test (20%) subsamples: the training data is used to tune model parameters, while the test subsample is reserved for performance evaluation. The model is trained on labeled data: it predicts the Adjusted Close price one day into the future from the 60 previous days (both window sizes are configurable). Performance is evaluated with RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error). If we treat a 5% upper bound on MAPE as the rejection threshold, then all 10 stock forecasts are acceptably accurate, since their MAPEs on test data are below 5%, and many are even below 2%. The accuracy results can be seen below:
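    The windowing and the two error metrics described above can be sketched like this. The `make_windows` helper and the toy price series are illustrative, not the actual model code:

```python
import numpy as np

def make_windows(prices, lookback=60):
    """Turn a 1-D price series into (X, y) pairs: each sample is
    `lookback` consecutive prices, labeled with the next day's price."""
    X = np.array([prices[i:i + lookback] for i in range(len(prices) - lookback)])
    y = prices[lookback:]
    return X, y

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Toy series standing in for a ticker's Adjusted Close prices.
prices = np.linspace(100, 120, 200)
X, y = make_windows(prices, lookback=60)
print(X.shape, y.shape)  # (140, 60) (140,)
```

    A model's forecasts would then be checked against the 5% MAPE threshold on the test windows.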


Project 3: Cryptocurrency Dashboard

    Data Analysis power_bi

    I have created an interactive, dynamic cryptocurrency dashboard with Power BI, connecting directly to the Cryptowatch API from Power BI. I have no affiliation with Cryptowatch; note that the API allows only a limited number of requests unless you register or pay. You can find the web version of the dashboard here. Unfortunately, I couldn't upload the .pbix project to GitHub because of its size.
    I loaded the data with Power BI and then cleaned and formatted it. After that, the dataset consisted of Open, High, Low and Close prices as well as Volume data for 257 cryptocurrencies since 2015, in 14 granularities, totalling about 3 million rows. I created the backgrounds for the menu and the other pages of the dashboard in PowerPoint and uploaded the slides to Power BI as images, which you can find here. There are 3 pages in the dashboard:
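    The cleaning step amounts to flattening the API's nested OHLC payload into one row per candle. A rough sketch in python: the payload structure below is an assumption modeled on Cryptowatch's OHLC response shape (candles keyed by period in seconds), and all numbers are invented:

```python
# Hypothetical Cryptowatch-style OHLC payload: {"result": {period_seconds:
# [[close_time, open, high, low, close, volume, quote_volume], ...]}}.
sample = {
    "result": {
        "3600": [
            [1609462800, 29000.0, 29100.0, 28900.0, 29050.0, 12.5, 362000.0],
            [1609466400, 29050.0, 29200.0, 29000.0, 29180.0, 10.1, 294000.0],
        ]
    }
}

# Flatten into rows resembling the table loaded into Power BI.
rows = []
for period, candles in sample["result"].items():
    for close_time, o, h, l, c, vol, _quote_vol in candles:
        rows.append({
            "period_seconds": int(period),
            "close_time": close_time,
            "open": o, "high": h, "low": l, "close": c,
            "volume": vol,
        })

print(len(rows))  # 2
```

    In the actual project this shaping is done in Power Query rather than python, but the transformation is the same.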

    Each page also contains navigation links to all the other pages of the dashboard


Project 4: Financial Statements KPI Analysis

    Data Analysis python jupyter power_bi excel

    I have created a dynamic dashboard of the S&P 500 companies' annual balance sheets and income statements for the years 2017-2022 using python and Power BI. Below you can find the HTML version of the dashboard (for a web version you can click here):

    The dashboard contains 2 filters: ticker symbol (company) and report year. It updates dynamically, shows the company's industry and sector, and computes financial KPIs:

    First, using yfinance and yahoo_fin I gathered the ticker symbols in the S&P 500 index along with their full company names, industries and sectors. The reasoning is simple: the S&P 500 contains the companies with the highest market caps, and such large companies are the most likely to have complete financial statements available for analysis. I loaded the income statements and balance sheets via the SimFin API, which requires an API key that can be obtained for free here. I have no affiliation with SimFin; I simply found their product and python support great, though the free API does not include the most recent statements. You can view the full code for loading, cleaning and reshaping the data here.
    The default shape was not suited to how I wanted to visualize the statements in Power BI, so I wrote a function to reshape the data accordingly. I then exported the DataFrames to Excel files and loaded them into Power BI, set up the necessary relationships and created all the measures with DAX, including every line item you see in the statements as well as the KPIs. I designed the layout and theme of the dashboard myself, without any helper tools. To download the project as a .pbix file, click here.
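    The reshaping step can be illustrated with pandas' `melt`, going from one column per line item to one row per (ticker, year, line item), which is easier to relate and aggregate with DAX measures. The column names and figures below are hypothetical stand-ins for the real statement data:

```python
import pandas as pd

# Hypothetical wide-format statement: one row per ticker/year,
# one column per line item (values are illustrative).
wide = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL"],
    "Year": [2021, 2022],
    "Revenue": [365_817, 394_328],
    "Net Income": [94_680, 99_803],
})

# Long format: one row per (ticker, year, line item).
long = wide.melt(
    id_vars=["Ticker", "Year"],
    var_name="Line Item",
    value_name="Value",
)
print(long.shape)  # 2 rows x 2 line items -> (4, 4)
```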


Project 5: Credit Card Fraud Detection

    Data Science python jupyter

    I have created a binary classification model in python to detect whether a given transaction is fraudulent. I used a Kaggle dataset of credit card transactions made in 2013, with anonymized features (principal components are provided instead of the raw features). To view this project on Kaggle, click here.
    I loaded the data, performed exploratory data analysis (EDA) and prepared the data for modeling: I reshaped one feature, shuffled the order of the samples and split the data into train, test and validation subsamples (90%-5%-5%). I then implemented model metrics to be used later for comparing the candidate models, and created the following binary classification models:
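    The shuffle-and-split step can be sketched with numpy alone; the array sizes and the ~1% positive rate below are invented to mimic the class imbalance of a fraud dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.01).astype(int)  # ~1% "fraud" labels (illustrative)

# Shuffle once, then take 90% / 5% / 5% slices for train / test / validation.
idx = rng.permutation(len(X))
n_train = int(0.90 * len(X))
n_test = int(0.05 * len(X))
train_idx = idx[:n_train]
test_idx = idx[n_train:n_train + n_test]
val_idx = idx[n_train + n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
X_val, y_val = X[val_idx], y[val_idx]
print(len(X_train), len(X_test), len(X_val))  # 900 50 50
```

    With such heavy class imbalance, a stratified split (preserving the fraud rate in each subsample) is often preferable to a plain shuffle.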

    Model 5 turned out to be the best one in terms of the implemented metrics. I retrained this model on the train and test subsamples combined and evaluated its performance on the validation subsample. The final classification can be seen in the confusion matrix below:


Project 6: AppStore Dataset Analysis

    Data Analysis SQLite online

    I have analyzed a 2017 AppStore dataset from Kaggle using SQL (SQLite Online). You can find the full SQL code here.
    I loaded the data from CSV files into SQLite Cloud. Then, I performed exploratory data analysis (EDA) to check for missing values, both across tables and in key fields. I also answered some basic questions about the dataset, like “What are the top 5 app genres in terms of %?” and “What are the descriptive statistics of the user ratings?”. After that I went straight into data analytics. I had a couple of questions in mind that I wanted to answer. These questions are:

    It turned out that paid apps do indeed outperform free ones on average. The bottom 5 categories, in ascending order, are Catalogs, Finance, Book, Navigation and Lifestyle. There is also a positive correlation between description length and average app rating (joins were used to answer this question). The answer to the 4th question is a large table, so I cannot reproduce it here; to see it you will have to run the code, which you can find here (window functions and nested queries were used to answer this question).
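    One of the EDA queries above (“top 5 app genres in terms of %”) can be sketched with Python's built-in sqlite3. The table and column names follow the Kaggle dataset's AppleStore.csv, but the rows here are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the AppStore table (toy rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AppleStore (id INTEGER, prime_genre TEXT, "
    "user_rating REAL, price REAL)"
)
conn.executemany(
    "INSERT INTO AppleStore VALUES (?, ?, ?, ?)",
    [(1, "Games", 4.0, 0.0), (2, "Games", 4.5, 2.99),
     (3, "Finance", 3.0, 0.0), (4, "Education", 4.5, 4.99)],
)

# Top genres by share of all apps, as a percentage.
query = """
SELECT prime_genre,
       ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM AppleStore), 1) AS pct
FROM AppleStore
GROUP BY prime_genre
ORDER BY pct DESC
LIMIT 5;
"""
results = conn.execute(query).fetchall()
for genre, pct in results:
    print(genre, pct)
```

    The same query runs unchanged against the full dataset in SQLite Online.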