Statistics of Spice Usage

What's this about?

In this small project, I am interested in the analysis of cullinary data with the aim of characterizing statistical relationships between different spices. I am asking the question of which spices are most often used together for cooking, with the specific aim for finding meaningful combinations.
Ultimately, this boils down in characterizing cooccurences of spices in individual recipes. One cheap way of doing this consists of gathering for each spice a binary vector that codes its presences in a given recipe. And later create a join count matrix based on this binary vectors, resulting in a symmetric matrix of size that equals to the total number of spices I am interested in.

Methods

I will be using Pickle to load the database. Regular expressions to find the occurence of spices in the recipes. Numpy to make matrix multiplication necessary for joint count matrix. Pandas DataFrame will be the main work horse that will store and manipulate data. And finally Matplotlib to visualize the results. And finally, OS libraries to find where the home directory is located.
In [2]:
import pickle
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
from os.path import expanduser
In [3]:
home  = expanduser("~");
data  =  pickle.load(file(''.join([home,'/Documents/Code/Python/JupyterNotebooks/SpiceStatistics/RecipeFinal'])));
data  = data.ingredient_name; #extract the ingredients
print(type(data))
print("There are {} recipes found in this database".format(len(data)))
<type 'list'>
There are 47724 recipes found in this database

Approach

In this project, I will be focusing only on a predetermined subset of spices that I prepared as a Python list.
To find the presence of different spices in the ingredents of a given dish recipe, I first create a regular expression that will match these spices.
I keep in mind that I can use the powerful tools that Pandas DataFrame objects offer to easily convert an existing DataFrame (of ingredients) onto another DataFrame (of spice occurences) while keeping track of matches. I achieve this by simply creating regular expressions with group names. This will come extremely handy as DataFrames will generates columns based on the group names of individual regular expressions patterns.
In the following, I create a regular expression string that will match the spices of interest. Each regular expression has name specified within the brackets which identifies the regular expression group.
In [4]:
pattern = ['salt', 'anis', 'ajowan', 'allspice', 'black pepper', 'cardamon', 'cayenne pepper', 'chili', 'cinnamon', 'cloves', 'coriander', 'cumin', 'fenugreek', 'galangal', 'ginger', 'juniper', 'mace', 'mustard seeds', 'nutmeg', 'paprika', 'saffron', 'sichuan pepper', 'turmeric', 'white pepper'];
def str2REgroup(x): return '(?P<' + x.replace(" ","") + '>' + x + ')|'
R = "".join(map(str2REgroup,pattern))
print(R)
(?P<salt>salt)|(?P<anis>anis)|(?P<ajowan>ajowan)|(?P<allspice>allspice)|(?P<blackpepper>black pepper)|(?P<cardamon>cardamon)|(?P<cayennepepper>cayenne pepper)|(?P<chili>chili)|(?P<cinnamon>cinnamon)|(?P<cloves>cloves)|(?P<coriander>coriander)|(?P<cumin>cumin)|(?P<fenugreek>fenugreek)|(?P<galangal>galangal)|(?P<ginger>ginger)|(?P<juniper>juniper)|(?P<mace>mace)|(?P<mustardseeds>mustard seeds)|(?P<nutmeg>nutmeg)|(?P<paprika>paprika)|(?P<saffron>saffron)|(?P<sichuanpepper>sichuan pepper)|(?P<turmeric>turmeric)|(?P<whitepepper>white pepper)|
Here is how the DataFrame looks like. The index (rows) codes for different recipes, whereas the columns are used to store the ingredients. At this point there is no specific meaning associated to the columns. They simply indicate for each recipe the rank (which is totally arbitrary) of the ingredient. Some recipes have more ingredients than others, which results in None values for other recipes with fewer ingredients.
In [5]:
df = pd.DataFrame(data)
df.head(5)
Out[5]:
0123456789...29303132333435363738
0sweetened condensed milkNoneNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
1macaroni and cheesecondensed cream of chicken soupNoneNoneNoneNoneNoneNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
2LUCKY LEAF® Premium Peach Pie Fillingunsalted butter, softenedgranulated sugareggsall-purpose floursaltcinnamonbaking powderground nutmegground ginger...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
3fat-free cream cheesepart-skim ricotta cheesefat-free sour creamegg whitesall-purpose flourgranular sucrolose sweetener (such as Splenda®)vanilla extractNoneNoneNone...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
4packed dark brown sugarwhite sugarall-purpose flourbaking powderbuttershorteningeggsmilkvanilla extractchopped pecans...NoneNoneNoneNoneNoneNoneNoneNoneNoneNone
5 rows × 39 columns
Using the extract method with the grouped regular expression pattern I created above, I transform a given recipe row onto another binary DataFrame object which contains the occurence of ingredients.
In [6]:
df2 = df.iloc[2,:].str.extract(R, expand=False).any().astype(int)
print('Here is the occurence of spices in recipe index 2')
df2
Here is the occurence of spices in recipe index 2
Out[6]:
salt             1
anis             0
ajowan           0
allspice         0
blackpepper      0
cardamon         0
cayennepepper    0
chili            0
cinnamon         1
cloves           0
coriander        0
cumin            0
fenugreek        0
galangal         0
ginger           0
juniper          0
mace             0
mustardseeds     0
nutmeg           0
paprika          0
saffron          0
sichuanpepper    0
turmeric         0
whitepepper      0
dtype: int64
I simply need to repeat this for all recipes. Here we do not really need to use groupby.apply combo as the analysis doesn't require us to group data. Hence, we can simply run the same routine for all rows. This is what the code below does using list comprehension.
In [7]:
dummy  = [df.iloc[i,:].str.extract(R, expand=False).any() for i in range(df.shape[0])]
df2    = pd.DataFrame(dummy)
df2.sum(0)
Out[7]:
salt             21650
anis               115
ajowan               0
allspice            46
blackpepper        747
cardamon             0
cayennepepper     1590
chili             1503
cinnamon           526
cloves              29
coriander           60
cumin              299
fenugreek           18
galangal             8
ginger             252
juniper              9
mace                 0
mustardseeds        35
nutmeg              80
paprika           1371
saffron             92
sichuanpepper        0
turmeric            33
whitepepper        146
dtype: int64
So now we have now transformed our initial messy recipe DataFrame to something which is much easier to handle. Namely, every column contains Boolean indicators for the presence of individual spices for each recipe represented in the rows.
In [15]:

Out[15]:
salt             21650
anis               115
ajowan               0
allspice            46
blackpepper        747
cardamon             0
cayennepepper     1590
chili             1503
cinnamon           526
cloves              29
coriander           60
cumin              299
fenugreek           18
galangal             8
ginger             252
juniper              9
mace                 0
mustardseeds        35
nutmeg              80
paprika           1371
saffron             92
sichuanpepper        0
turmeric            33
whitepepper        146
dtype: int64

Results

The aggregate sum of the spices shows that the most common ingredient is salt (let's consider it as a spice for now). Not surprisingly. But more important for the computation of the joint count matrix we need to take into consideration that some spices are not even used one single time across the recipe database. Because these contain no information I remove these columns from the DataFrame.
To visualize these results, I plot individual spices occurences on a log scale using horizontal bars. This gives us an intuitive understanding of the frequency of spice usage.
In [38]:
empty_columns = df2.columns[df2.sum(0) == 0]
df2 = df2.drop(empty_columns,axis=1)
print(df2.sum(0))
y=np.log10(df2.sum(0).sort_values())
y.plot.barh()
plt.xlabel("log10(occurence)")
plt.show()
salt             21650
anis               115
allspice            46
blackpepper        747
cayennepepper     1590
chili             1503
cinnamon           526
cloves              29
coriander           60
cumin              299
fenugreek           18
galangal             8
ginger             252
juniper              9
mustardseeds        35
nutmeg              80
paprika           1371
saffron             92
turmeric            33
whitepepper        146
dtype: int64
My initial question was about understanding the joint usage statistics of different spice combinations.
Once the empty columns are dropped, we can now compute a joint count matrix with the following matrix multiplication
M = C'* C
which will return a positive definite matrix.
In [23]:
columns = df2.astype(int).values
count_matrix = np.matmul(columns.transpose(),columns)

plt.figure(figsize=(20,10))
plt.imshow(count_matrix)
plt.yticks(range(len(df2.columns)),df2.columns)
plt.xticks(range(len(df2.columns)),df2.columns,rotation=90)
plt.colorbar()
plt.show()
One problem in this result is that the count matrix is completely dominated by the frequent use of salt in the recipes. This precludes us to the appreciate the co-occurence of other mixtures, which are contained in the count matrix but not visible.
To remedy this, I will normalize the Boolean DataFrame with the occurence frequency of individual spices before computing the joint count matrix. This is similar to convert a covariance matrix to a correlation matrix. Mathematically this corresponds to weighing each column with the inverse of its sum with
M = (CD)' (C*D)
where D is diagonal matrix that contains the inverse of the number of times a spices is used. This basically transforms a covariance matrix to a correlation matrix.
In [25]:
D   = np.diag(columns.sum(0) ** -.5);
CC  = columns.dot(D);
CCC = CC.transpose().dot(CC);
#(C*D)'*(C*D)
plt.figure(figsize=(20,10))
plt.imshow(CCC ** 2)
plt.yticks(range(len(df2.columns)),df2.columns)
plt.xticks(range(len(df2.columns)),df2.columns,rotation=90)
plt.clim(0,.05)
plt.colorbar()
plt.show()