Steam games dataset cleaning/analysis

In this project, we will be working with a Steam dataset that contains information about more than 55,000 games. The dataset can be downloaded for free from Kaggle, either in CSV or JSON format.

As always, the full Python code will not be shown here (only the most relevant snippets), but you can check the original ipynb file (with outputs) on my GitHub page.

Let’s get started!

Data Cleaning/Wrangling

First, we have to import the JSON file with Pandas using the correct parameters:

# Importing the json file with the correct formart
steam_games = pd.read_json('/content/steam_games.json', 
                           orient='index').reset_index(drop='True')

After checking the data types, the number of nulls/NAs, and duplicated rows, we identified that the price column needs to be corrected since it is missing the decimal separator. We can build a simple function to apply the correction to each row of both price columns:

# Prices are missing the decimal separator
def correct_prices(price):
  if price != 0:
    price = str(price)
    price = price[:-2] + '.' + price[-2:]

  return float(price)

# Applying the fix to both prices columns
steam_games['price'] = steam_games['price'].apply(correct_prices)
steam_games['initialprice'] = steam_games['initialprice'].apply(correct_prices)

We also noticed that the required_age column is a string, so we can convert it to an integer column by applying the necessary transformations. The release_date column needs to be transformed from string to datetime as well:

steam_games.required_age.unique()
# Changes needed:
# Change MA 15+ to 15
# 7+ to 7, 21+ to 21
# 180 to 18

steam_games.loc[steam_games.required_age == '180', 'required_age'] = '18'

# Extracting only the digits
steam_games['required_age'] = steam_games['required_age'].astype('str').str.extract('(\d+)').astype('int64')

# The release_date column should be of the datetime type
steam_games['release_date'] = pd.to_datetime(steam_games['release_date'], format='%Y/%m/%d')

If we take a look at the owners column, we can see that it is a string that contains an estimation of the minimum and the maximum number of owners for each game. For example, a value of ‘200.000 .. 500.000’ indicates that the number of owners for that specific game is between 200,000 and 500,000. Let’s extract that information into two separate new columns (of the integer type):

# Extracting the number of minimun and maximun estimated owners for each game
steam_games['min_est_owners'] = steam_games['owners'].apply(lambda x: int(x.split(' .. ')[0].replace(',', '')))
steam_games['max_est_owners'] = steam_games['owners'].apply(lambda x: int(x.split(' .. ')[1].replace(',', '')))

# Making sure that the columns have been created correctly
steam_games.loc[:5, ['owners', 'min_est_owners', 'max_est_owners']]

index	owners	min_est_owners	max_est_owners
0	10,000,000 .. 20,000,000	10000000	20000000
1	0 .. 20,000	0	20000
2	200,000 .. 500,000	200000	500000
3	100,000 .. 200,000	100000	200000
4	0 .. 20,000	0	20000
5	100,000 .. 200,000	100000	200000

Moving on, we identified that the platforms column is a dictionary that lets us know if each game is available on Windows, Mac, and Linux. It would be better to have a column for each OS, so let’s do that:

# Creating new columns for each OS so we can quickly check if a game is
# for a specific platform
steam_games = steam_games.join(steam_games['platforms'].apply(pd.Series), 
                                                        how='left')

As a final data cleaning/wrangling step, we can compute the total number of reviews and the percentage of the total that were positive with the positive and negative columns:

# Calculating the total number of reviews and the percentage of all reviews that
# are positive
steam_games['total_reviews'] = (steam_games['positive'] + steam_games['negative'])

steam_games['perc_positive_reviews'] = (steam_games['positive'] / steam_games['total_reviews']) * 100

Now we can proceed to perform some data analysis on this interesting dataset!

Exploratory Data Visualization/Analysis

These are the questions that we are going to explore in this article:

What percentage of games is available for each platform/OS?
What is the distribution of the price column?
What is the relationship between price and rating (% of positive reviews)?
How does the number of average owners vary for different price categories?
How have the average price and average rating of games evolved over time?

Percentage of games available for each platform

In order to obtain this visual, we need to perform some data wrangling steps first:

# Getting the data in the correct format to obtain the percentage of games
# that are available for each OS
os_df = steam_games[['name', 'windows', 'mac', 'linux']]

os_df = pd.melt(os_df, id_vars='name', value_vars=['windows', 'mac', 'linux'], var_name='os',
                value_name='available')

os_grouped = os_df.groupby('os').agg(perc_available=('available', 
                                     lambda x: sum(x)/len(x))).reset_index()

os_grouped

index	os	perc_available
0	linux	0.15187373184176978
1	mac	0.22930096424916055
2	windows	0.9997306566590652

Now we can plot that information:

# % of games available for each OS
ax, figure = plt.subplots(figsize=(10, 8), dpi=300)

sns.set_style('darkgrid')

sns.barplot(data=os_grouped, x='os', y='perc_available',
            order=['windows', 'mac', 'linux'])

plt.title('Percentage of games that are available on each OS')
plt.xlabel('OS')
plt.ylabel('% of games available')

ticks = [x/100 for x in range(0, 125, 25)]

plt.yticks(ticks=ticks,
           labels=[str(int(x * 100)) + ' %' for x in ticks])

plt.show()

As expected, almost every game is available on Windows, while mac and Linux (below 25%) are still not considered gaming platforms for the vast majority of videogame producers/publishers.

Price distribution

Let’s check the distribution of the initialprice column:

# Filtering on the initial price column to check the distribution of prices
# for games that cost 80 dollars or less
no_price_outlier = steam_games[steam_games.initialprice <= 80]

# A kernel density estimation makes it easier to see the underlying distribution,
# as a histogram would be too messy with so many data points
f, ax = plt.subplots(figsize=(10, 8), dpi=300)

sns.kdeplot(data=no_price_outlier, x='initialprice', fill=True, ax=ax,
            bw_adjust=1.5)

plt.xlim((0, 80))

plt.xlabel('Launch price ($)')
plt.title('Distribution of price')

plt.show()

We can see that the vast majority of games are either free or below $20, so this distribution is highly right-skewed. There are many indie producers/publishers on Steam that release inexpensive games in comparison with the amount of triple-A games (which have a large production budget and usually a launch price of $50 or above).

Relationship between price and rating

One would expect that the more expensive a game is, the more sophisticated it should be and therefore the higher the rating as well. We can check that assumption by analyzing the bivariate distribution of price and rating:

# Bivariate distribution of price and rating for games that have more than
# 100 reviews and cost 80 dollars or less
sns.set_style('darkgrid')

more_100_rat = steam_games.loc[(steam_games['total_reviews'] > 100) &
                               (steam_games['initialprice'] <= 80), :]

figure, ax = plt.subplots(figsize=(10, 8), dpi=100)

sns.histplot(data=more_100_rat, x='initialprice', y='perc_positive_reviews',
             cbar=True, binwidth=(5,5), binrange=[(0, 80), (0, 100)],
             stat='count', ax=ax, cbar_kws={'label': 'Number of games'})

sns.regplot(data=more_100_rat, x='initialprice', y='perc_positive_reviews',
            scatter=False, ax=ax, color='red')

plt.yticks(ticks=[x for x in range(5, 105, 5)])
plt.ylim((0, 100))
plt.xticks(ticks=[x for x in range(0, 85, 5)])
plt.xlim((0, 80))

plt.xlabel('Price at launch ($)')
plt.ylabel('% of positive ratings')

_ = plt.show()

The correlation is positive, although not a strong one (the Pearson correlation coefficient is only 0.06). Nowadays, an expensive game does not equal a good game. The competition is fierce in the videogame industry and many free or cheap productions are still highly valued within the gaming community.

Number of owners per price category

A reasonable assumption would be that the cheaper a game is, the higher the number of owners should be as well. We are going to explore that topic using price categories for the initialprice column:

# Creating custom categories for prices of games
steam_games['price_cat'] = pd.cut(steam_games.initialprice, 
                                  bins=[-1, 0, 20, 40, 60, steam_games.initialprice.max()],
                                  labels=['Free', 'Cheap', 'Normal', 'Expensive', 'Very Expensive'], 
                                  include_lowest=True)

# Relationship between price category and number of owners
price_owners = steam_games.groupby('price_cat').agg(avg_owners=('min_est_owners', np.mean)).reset_index()

# One might expect that, the lower the price, the higher the number of owners.
# Let's check that assumption
figure, ax = plt.subplots(figsize=(10, 8), dpi=300)

sns.barplot(data=price_owners, x='price_cat', y='avg_owners', color='indigo')

ticks = [x for x in range(0, 600000, 100000)]
plt.yticks(ticks=ticks,
           labels=[f'{x:,}' for x in ticks])

plt.ylabel('Average number of owners')
plt.xlabel('Price category')

plt.show()

Surprisingly, expensive games (between $40 and $60) have the highest number of average owners. It seems that triple-A games still dominate the market, as those are usually the most anticipated and purchased types of video games.

Evolution of price and rating over the years

Lastly, let’s explore how the average price and rating of games have evolved over the years to see if we can spot any interesting trends:

# We are now going to analyze the evolution of the average price and average
# rating of games per year. Games with more than 100 reviews
import datetime as dt
import numpy as np

steam_games['year'] = steam_games.release_date.dt.strftime('%Y')

steam_games_filt = steam_games.loc[(steam_games['release_date'].notnull()) &
                                   (steam_games['total_reviews'] > 100), :]

# Grouping by year and calculating the averages for price and rating
month_year = steam_games_filt.groupby('year').agg(avg_price=('initialprice', np.mean),
                                                  avg_rating=('perc_positive_reviews', np.mean),
                                                  n_games=('appid', len),).\
                                                  reset_index()

month_year = month_year.loc[month_year['n_games'] >= 50, :]

month_year.sort_values('year', inplace=True)

# Evolution of rating and concurrent users for games based on price category

figure, ax = plt.subplots(figsize=(10, 8), dpi=300)

sns.lineplot(data=month_year, x='year', y='avg_price',
             label='Average price')

plt.legend(bbox_to_anchor=(0.992, 1))

sns.pointplot(data=month_year, x='year', y='avg_price',
              color='blue')

plt.ylabel('Average price of games ($)')
plt.xlabel('Year')
plt.ylim((0, 30))

plt.xticks(fontsize=10)

ax2 = plt.twinx()

sns.lineplot(data=month_year, x='year', y='avg_rating', ax=ax2,
             color='red', label='Average rating')

sns.pointplot(data=month_year, x='year', y='avg_rating', ax=ax2,
              color='red')

plt.legend(bbox_to_anchor=(1, 0.95))

plt.title('Evolution of price and rating of games by year')
plt.ylabel('Average rating (% of reviews that are positive)')
plt.ylim((0, 100))

plt.show()

It seems that the rating of games hasn’t seen many variations in the last 16 years, hovering around 80%.

On the other hand, the average price of games has been increasing at a steady pace since 2016 and has reached its maximum value in 2022 at $18.46. It would be an interesting topic to explore more deeply and discover why that is the case, but that is going to be all for this article.