Avocado Prices

Introduction

This notebook is a short introduction to doing data analysis with Python and Nextjournal. Along the way, we will highlight some features of Nextjournal compared to other computational notebooks such as Jupyter or Google Colab. For this purpose, we use some example data from a Kaggle competition and will go through some basic analysis tasks.

For a deeper introduction on how to use Python in Nextjournal, visit our Tutorial.

One major feature of Nextjournal is the fact that you are able to Remix any notebook. Just click the Remix button at the top right of this page. This will create a full copy of this notebook in your profile, enabling you to execute the code and play around with everything while following this tutorial. You will probably get the most out of this tutorial if you remix it and follow along.

We use historical data on avocado prices and sales volume in multiple US markets. Here, you can find the related Kaggle link.

The dataset is described as follows:

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Already at this point, we can highlight the simplicity of using Nextjournal. Once the file is downloaded, one can simply drag and drop it here. We could also choose to add content, then the file, and upload it. Once the file is in Nextjournal's file storage, you can access it from any other notebook and reuse it with a simple copy-and-paste.

avocado.csv
1.99 MB

Exploratory data analysis

Another bonus is that many Python libraries are already installed in the default Python runtime.

import pandas as pd
0.2s
Avocado (Python)
Python Libraries
data = pd.read_csv('avocado.csv')

To avoid losing time, we can make sure that we do not accidentally re-run the cell above by locking it. We can lock the cell by clicking the “•••” button next to the cell, or by using Command/Ctrl + Shift + O to open the Options menu, and choosing Lock Cell. The small green lock at the bottom left of the cell indicates that it is locked. Locking a cell prevents it from being executed when you run all cells in a runtime or the entire notebook. This can come in handy if one changes some function but does not need to download a large file again.

Let's have a look at the data.

data.head()

These are the column descriptions provided by Kaggle:

  • Date - The date of the observation

  • AveragePrice - the average price of a single avocado

  • type - conventional or organic

  • year - the year

  • Region - the city or region of the observation

  • Total Volume - Total number of avocados sold

  • 4046 - Total number of avocados with PLU 4046 sold

  • 4225 - Total number of avocados with PLU 4225 sold

  • 4770 - Total number of avocados with PLU 4770 sold
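As a quick sanity check, we can confirm this layout programmatically and parse the Date column into a proper datetime type. A minimal sketch, assuming the column layout described above (the two data rows here are hypothetical, used only to illustrate the parsing step):

```python
import io
import pandas as pd

# A tiny sample in the same layout as avocado.csv (hypothetical rows).
csv = io.StringIO(
    "Date,AveragePrice,Total Volume,4046,4225,4770,type,year,region\n"
    "2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,conventional,2015,Albany\n"
    "2015-12-20,1.35,54876.98,674.28,44638.81,58.33,conventional,2015,Albany\n"
)
sample = pd.read_csv(csv, parse_dates=['Date'])

print(sample.dtypes['Date'])   # datetime64[ns]
print(list(sample.columns))
```

With `parse_dates=['Date']`, the Date column comes back as a datetime type, so the weekly grouping below works on real dates rather than strings.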

Let's first plot the total sales of avocados per week (the Date column is per week):

sales_columns = ['Total Volume', '4046', '4225', '4770']
weekly_data = data \
  .groupby(['Date'])[sales_columns] \
  .sum() \
  .reset_index()
import plotly.graph_objs as go
plot_data = [{
  'x': weekly_data['Date'],
  'y': weekly_data[c],
  'name': c
} for c in sales_columns]
layout = {
  'xaxis': {'title': 'Date'},
  'yaxis': {'title': 'Avocados sold'}
}
go.Figure(data=plot_data, layout=layout)

The first thing we notice is the large spikes at the beginning of every year. Looking at the locations of the spikes, we find the following dates:

  • Feb. 1, 2015

  • Feb. 7, 2016

  • Feb. 5, 2017

  • Feb. 4, 2018

After some research, we found that the Super Bowl took place on each of these dates, significantly increasing the demand for avocados to make guacamole.

XLIX - Feb. 1, 2015 - New England 28 : Seattle 24

50 - Feb. 7, 2016 - Denver 24 : Carolina 10

LI - Feb. 5, 2017 - New England 34 : Atlanta 28

LII - Feb. 4, 2018 - Philadelphia 41 : New England 33
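The spike dates above can also be found programmatically by taking, for each year, the week with the highest total volume. A sketch on a toy frame (hypothetical volumes; the same idxmax pattern applies directly to weekly_data):

```python
import pandas as pd

# Toy weekly totals; the real weekly_data has the same two columns.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-25', '2015-02-01', '2015-02-08',
                            '2016-01-31', '2016-02-07', '2016-02-14']),
    'Total Volume': [40e6, 55e6, 42e6, 43e6, 58e6, 44e6],
})

# For each year, locate the row index of the maximum volume, then look it up.
peak_idx = toy.groupby(toy['Date'].dt.year)['Total Volume'].idxmax()
peaks = toy.loc[peak_idx, ['Date', 'Total Volume']]
print(peaks['Date'].dt.strftime('%Y-%m-%d').tolist())
# → ['2015-02-01', '2016-02-07']
```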

The plot shows some inconsistency: the sales numbers do not seem to add up. Let's check that.

weekly_data['check_sum'] = (
  weekly_data['4046'] +
  weekly_data['4225'] + 
  weekly_data['4770']
)
weekly_data['total_diff'] = (
  weekly_data['Total Volume'] - weekly_data['check_sum']
)
weekly_data['ratio_diff'] = (
  weekly_data['total_diff'] / weekly_data['Total Volume'] * 100
)
plot_data = [{
  'x': weekly_data['Date'],
  'y': weekly_data['ratio_diff']
}]
layout = {
  'title': 'Deviation of Sales',
  'xaxis': {'title': 'Date'},
  'yaxis': {'title': 'Deviation in %'}
}
go.Figure(data=plot_data, layout=layout)

The difference between the total volume and the three PLUs is increasing over time and reaches almost 40% at the end of the time period of the provided dataset. Perhaps a new PLU was introduced that is not included in the dataset.
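If a new PLU were indeed introduced, we would expect the unexplained share to grow from a particular point onward. One way to probe this is to average the deviation per year; a sketch on toy numbers (hypothetical values — the same expression applies to weekly_data['ratio_diff']):

```python
import pandas as pd

# Toy deviation series: dates and deviation in percent (hypothetical values).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2015-12-01',
                            '2017-06-01', '2017-12-01']),
    'ratio_diff': [10.0, 12.0, 30.0, 38.0],
})

# Mean deviation per year: a growing value supports the new-PLU hypothesis.
yearly = toy.groupby(toy['Date'].dt.year)['ratio_diff'].mean()
print(yearly.to_dict())
# → {2015: 11.0, 2017: 34.0}
```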

Organic vs. conventional prices

Next, let's have a look at the differences between conventional and organic avocados.

weekly_price_data = data \
  .groupby(['Date', 'type'])[['AveragePrice']] \
  .mean() \
  .reset_index()
weekly_price_data_o = weekly_price_data[
  weekly_price_data['type'] == 'organic'
]
weekly_price_data_c = weekly_price_data[
  weekly_price_data['type'] == 'conventional'
]
weekly_price_data_trans = weekly_price_data_o.merge(weekly_price_data_c, on='Date')
weekly_price_data_trans['ratio'] = (
  weekly_price_data_trans['AveragePrice_x'] / 
  weekly_price_data_trans['AveragePrice_y']
)
from plotly.subplots import make_subplots
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Scatter(
      x=weekly_price_data_trans['Date'], 
      y=weekly_price_data_trans['AveragePrice_x'], 
      name='Organic'),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
      x=weekly_price_data_trans['Date'], 
      y=weekly_price_data_trans['AveragePrice_y'], 
      name='Conventional'),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
      x=weekly_price_data_trans['Date'], 
      y=weekly_price_data_trans['ratio'], 
      name='Price Ratio'),
    secondary_y=True,
)
fig.update_xaxes(title_text="Date")
fig.update_layout(title_text="Weekly price data")
fig.update_yaxes(title_text="Price", secondary_y=False)
fig.update_yaxes(title_text="Price Ratio", secondary_y=True)
fig

Organic and conventional avocados show a similar price development. However, at the beginning of March 2016, the ratio drops and becomes very volatile, ranging from 1.1 to almost 1.7 between March 2017 and January 2018.
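That volatility can be quantified with a rolling standard deviation of the ratio; a minimal sketch on hypothetical ratio values (the 4-week window is an arbitrary choice):

```python
import pandas as pd

# Hypothetical organic/conventional price ratios over eight weeks:
# stable at first, then volatile.
ratio = pd.Series([1.10, 1.12, 1.11, 1.13, 1.40, 1.65, 1.20, 1.55])
rolling_std = ratio.rolling(window=4).std()

# The rolling std is much larger in the volatile second half.
print(rolling_std.iloc[3] < rolling_std.iloc[7])  # → True
```

Applied to weekly_price_data_trans['ratio'], this would make the post-2016 regime change visible as a jump in the rolling standard deviation.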

Regional differences

Let's turn towards the regional differences.

regional_data = data \
  .groupby(['region'])[['AveragePrice', 'Total Volume']] \
  .mean() \
  .sort_values(by=['AveragePrice']).reset_index()

Let's remove the US total numbers, since this is just the sum of everything.

regional_data = regional_data[regional_data['region'] != 'TotalUS']
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
  go.Bar(x=regional_data['region'],
         y=regional_data['AveragePrice'],
         name='Price'),
  secondary_y=False
)
fig.add_trace(
  go.Bar(x=regional_data['region'],
         y=regional_data['Total Volume'],
         name='Volume Sold'),
  secondary_y=True
)
fig.update_yaxes(title_text="Price", secondary_y=False)
fig.update_yaxes(title_text="Volume", secondary_y=True)
fig

Hartford Springfield, San Francisco, and New York are the most expensive avocado markets between 2015 and 2018. Houston, Dallas, and South Central are the cheapest. The Northeast sells a lot of 🥑, even at a comparably high price.

If you were wondering how the avocado got here: since the editor is rich text, you can simply use your system to add emojis. E.g., on macOS hit Command + Ctrl + Space, or on a Windows machine Windows + . to access the system's emojis 😀.

How do the regions develop over time, and are there differences?

weekly_regional_data = data \
  .groupby(['Date', 'region'])[['AveragePrice']] \
  .mean() \
  .reset_index()
max_region = weekly_regional_data[
  weekly_regional_data['region'] == 'HartfordSpringfield']
min_region = weekly_regional_data[
  weekly_regional_data['region'] == 'Houston']
plot_data = [{
  'x': max_region['Date'],
  'y': max_region['AveragePrice'],
  'name': 'HartfordSpringfield'
},
{
  'x': min_region['Date'],
  'y': min_region['AveragePrice'],
  'name': 'Houston'
}]
layout = {
  'title': 'Weekly price data',
  'xaxis': {'title': 'Date'},
  'yaxis': {'title': 'Avg. Price'}
}
go.Figure(data=plot_data, layout=layout)

We see some similarities, e.g., rather stable prices until June 2016, followed by some fluctuations. Both time series have a clear peak, however at different dates. This might be something interesting to investigate.

Let's find the maximum price for each region.

max_avg_price_region = weekly_regional_data.groupby('region')[['AveragePrice']].idxmax()
max_avg_price_region = pd.merge(
  max_avg_price_region, 
  weekly_regional_data, 
  left_on='AveragePrice',
  right_index=True,
  how='left'
)
max_avg_price_region.sort_values(by='Date')
max_avg_price_histo = max_avg_price_region.groupby('Date').count().reset_index()
layout = {
  'title': 'Regions at their maximum price',
  'xaxis': {'title': 'Date'},
  'yaxis': {'title': 'Number of regions'}
}
go.Figure(go.Bar(
  x=max_avg_price_histo['Date'],
  y=max_avg_price_histo['region']), layout=layout)

The plot above shows the number of regions that had their maximum avocado price in that specific week. We see that all but one region have their maxima between October 2016 and October 2017. It seems that prices rose first in the Northeast and California, and then spread across the country within a couple of months. The peak of the most expensive prices was reached at the beginning of October 2017, after which they decreased. This answers the question "Was the Avocadopocalypse of 2017 real?" with a clear: Yes.

We should check out that one outlier, Pittsburgh, where the most expensive avocados were sold in the first month of the provided dataset.

outlier_region = weekly_regional_data[
  weekly_regional_data['region'] == 'Pittsburgh']
plot_data = [{
  'x': outlier_region['Date'],
  'y': outlier_region['AveragePrice'],
  'name': 'Pittsburgh'
}]
layout = {
  'title': 'Weekly price data',
  'xaxis': {'title': 'Date'},
  'yaxis': {'title': 'Avg. Price'}
}
go.Figure(data=plot_data, layout=layout)

The prices look very different from the rest, and there is no peak between 2015 and 2018. The maximum price also seems to be part of a larger price decrease that happened prior to January 2015, i.e., prices were even higher before the provided time period.

Time-series prediction

The last step is to use the dataset, perform a time-series analysis, and make some predictions. We use the Facebook Prophet library for that.

At this point, we would like to highlight another feature of Nextjournal that sets Nextjournal apart from other notebooks. When we want to use a library that is not part of the Nextjournal standard Python runtime, we can create another runtime, install the library in this runtime, and then export it. This will create an environment within Nextjournal that can be used from any other runtime, meaning there is no need to install that library ever again! An additional advantage of this feature is that it ensures that we always use the exact same version of that library. Within the Table of Contents on the sidebar, you can see that there are two runtimes, one being called Python Libraries. It is this runtime where Prophet lives, and our Avocado runtime uses it as the base environment. You can read more about this feature in our Python Tutorial.

from fbprophet import Prophet

We focus only on the Total Volume for now.

df = weekly_data[['Date', 'Total Volume']]

Prophet needs specific column names.

df = df.rename(columns={'Date': 'ds', 'Total Volume': 'y'})

Since we discovered the outliers produced by the Super Bowl events, let's incorporate them into our model. We also need to add the future Super Bowl date in order to include that event in our predictions. The effect seems to be limited to that day, so we set lower_window and upper_window to 0.

superbowls = pd.DataFrame({
  'holiday': 'superbowl',
  'ds': pd.to_datetime(
    ['2015-02-01', '2016-02-07', '2017-02-05', '2018-02-04', '2019-02-03']
  ),
  'lower_window': 0,
  'upper_window': 0,
})
m = Prophet(holidays=superbowls)
m.fit(df)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
m.plot(forecast)

The model fits the data pretty well. The uncertainty is in a very reasonable range compared to the fluctuations of the original data. The Super Bowl days are also modelled quite accurately. The prediction one year into the future also looks confident; the uncertainty interval is not increasing.

m.plot_components(forecast)

Plotting the components reveals the long-term increasing trend. Around February 2016 the trend slows down, and it picks up again in August 2017. In between, the sales volume is almost constant. The model is also pretty confident about the continuation of the trend into the future.

The yearly seasonality shows a peak around May, which is also clearly visible in the overall time series. After May the sales slowly decrease and reach the yearly minimum around November. The seasonality plays a significant role in the sales development throughout the year.
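The seasonal shape Prophet finds can be cross-checked without any model at all, by averaging the raw volumes per calendar month. A sketch on toy data with an artificial May peak (on the real data one would group weekly_data by the month of the Date column):

```python
import pandas as pd

# Toy weekly volumes across one year, with a hypothetical bump in May.
dates = pd.date_range('2017-01-01', periods=52, freq='W')
volume = [40 + (10 if d.month == 5 else 0) for d in dates]
df = pd.DataFrame({'Date': dates, 'Total Volume': volume})

# Average volume per calendar month; the peak month should be May (5).
monthly = df.groupby(df['Date'].dt.month)['Total Volume'].mean()
print(monthly.idxmax())  # → 5
```

If this simple monthly profile agrees with Prophet's yearly component, that is good evidence the fitted seasonality reflects the data rather than an artifact of the model.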

Summary

At the end of this analysis we want to highlight the real strengths of Nextjournal: sharing and collaboration. Once we are done with our analysis we can do several steps:

Sharing

  • Share this notebook in a read-only mode, and create a custom and human-readable, public-facing URL

  • Share a link that will always give read-only access to the latest version of the notebook

  • Publish the notebook so that everyone can read and remix it (see below)

  • Export the notebook as Markdown

  • Export the notebook as Jupyter notebook

Collaborating

  • Remix this notebook, meaning that someone can copy the current notebook with all its data and libraries and can start working with that state immediately. This is the real power of Nextjournal.

  • Add collaborators to this notebook and work simultaneously on this notebook

History

In case we produced a bug, or we are no longer sure what the definition of a function was this morning, we can simply go back in history anytime, to every state. Within one session, we can hit Command/Ctrl-Z, or use Commands and choose Undo, to go back step by step. After a shutdown, one can simply choose Notebook / Show history from the Commands Palette and see every version of this notebook ... ever. We can check out every version and, if we want to, restore it. No work will ever be lost again.

Next steps

We hope you enjoyed this introduction on how to do data analysis with Python and Nextjournal.

Nextjournal is free for private use, so you can sign up and start using it right away.

Was there something we missed in our Avocado Analysis? Please feel free to remix it, add your work, and share it with us and the world!

Read more awesome notebooks created in Nextjournal related to Data Science, Machine Learning, and Scientific Publishing here.

If you have questions about us, the pricing, or how we can help you to set up Nextjournal for your team, get in touch!

Happy collaborating ❤️

Appendix

pip install fbprophet
Python Libraries (Bash)
