# What's this about?

In this small project, I analyze culinary data with the aim of characterizing statistical relationships between different spices. I ask which spices are most often used together in cooking, with the specific aim of finding meaningful combinations.

Ultimately, this boils down to characterizing co-occurrences of spices in individual recipes. One cheap way of doing this is to build, for each spice, a binary vector that codes its presence in each recipe. From these binary vectors I then compute a joint count matrix, a symmetric matrix whose size equals the number of spices I am interested in.
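To make the idea concrete, here is a minimal sketch on made-up data. The spice names and the 4 x 3 binary matrix `C` are hypothetical and are not taken from the recipe database.

```python
import numpy as np

# Toy data (hypothetical): 4 recipes (rows) x 3 spices (columns).
spices = ["salt", "cumin", "ginger"]
C = np.array([
    [1, 1, 0],  # recipe 0: salt, cumin
    [1, 0, 0],  # recipe 1: salt
    [1, 1, 1],  # recipe 2: salt, cumin, ginger
    [0, 0, 1],  # recipe 3: ginger
])

# Joint count matrix: entry (i, j) counts the recipes that contain
# both spice i and spice j; the diagonal holds the per-spice totals.
M = C.T @ C
print(M)
```

The matrix is symmetric, and its diagonal equals the column sums of `C`, that is, the number of recipes in which each spice appears.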

# Methods

I will be using Pickle to load the database, regular expressions to find the occurrences of spices in the recipes, and NumPy for the matrix multiplication needed to build the joint count matrix. The Pandas DataFrame will be the main workhorse for storing and manipulating the data, and Matplotlib will visualize the results. Finally, `os.path` is used to locate the home directory.

In [2]:

```
import pickle
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt
from os.path import expanduser
```

In [3]:

```
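home = expanduser("~")
# Path to the pickled recipe database
path = ''.join([home,'/Documents/Code/Python/JupyterNotebooks/SpiceStatistics/RecipeFinal'])
with open(path, 'rb') as f:  # file() no longer exists in Python 3; pickles are read in binary mode
    data = pickle.load(f)
data = data.ingredient_name  # extract the ingredients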
print(type(data))
print("There are {} recipes found in this database".format(len(data)))
```

# Approach

In this project, I will be focusing only on a predetermined subset of spices that I prepared as a Python list.

To find the presence of different spices in the ingredients of a given recipe, I first create a regular expression that matches these spices.

Pandas DataFrame objects offer powerful tools to convert an existing DataFrame (of ingredients) into another DataFrame (of spice occurrences) while keeping track of the matches. I achieve this by creating regular expressions with named groups. This comes in extremely handy, because the DataFrame generates its columns from the group names of the individual regular expression patterns.

In the following, I create a regular expression string that matches the spices of interest. Each alternative is a named group: the name given inside the angle brackets of `(?P<name>...)` identifies the group.

In [4]:

```
pattern = ['salt', 'anis', 'ajowan', 'allspice', 'black pepper', 'cardamon', 'cayenne pepper', 'chili', 'cinnamon', 'cloves', 'coriander', 'cumin', 'fenugreek', 'galangal', 'ginger', 'juniper', 'mace', 'mustard seeds', 'nutmeg', 'paprika', 'saffron', 'sichuan pepper', 'turmeric', 'white pepper'];
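def str2REgroup(x):
    # Named group; spaces are removed because group names must be valid identifiers
    return '(?P<' + x.replace(" ", "") + '>' + x + ')'
R = "|".join(map(str2REgroup, pattern))  # join with "|" so no empty alternative trails the pattern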
print(R)
```
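As a small illustration of how named groups turn into columns, here is a sketch on hypothetical ingredient strings. The series `ingredients` and the three-spice `toy_pattern` are invented for this example and are not part of the database.

```python
import pandas as pd

# Hypothetical ingredient strings for a single recipe.
ingredients = pd.Series(["1 tsp salt", "1 onion", "ground black pepper", "olive oil"])

# Three named groups joined by "|"; the group names become the column names.
toy_pattern = r"(?P<salt>salt)|(?P<blackpepper>black pepper)|(?P<cumin>cumin)"
matches = ingredients.str.extract(toy_pattern, expand=True)
print(matches)

# A spice counts as present if any ingredient string matched its group.
present = matches.notna().any().astype(int)
print(present)
```

Each row of `matches` holds the matched text in the column of the group that matched and `NaN` elsewhere, so reducing over rows gives one presence flag per spice.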

In [5]:

```
df = pd.DataFrame(data)
df.head(5)
```

Out[5]:

In [6]:

```
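# notna() turns matches into booleans before reducing, so any() does not depend on object-dtype truthiness
df2 = df.iloc[2,:].str.extract(R, expand=False).notna().any().astype(int)
print('Here is the occurrence of spices in recipe index 2')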
df2
```

Out[6]:

In [7]:

```
# One boolean row per recipe: True where any ingredient matched the spice's named group
dummy = [df.iloc[i,:].str.extract(R, expand=False).notna().any() for i in range(df.shape[0])]
df2 = pd.DataFrame(dummy)
df2.sum(0)
```

Out[7]:

# Results

Not surprisingly, the aggregate sum shows that the most common ingredient is salt (let's consider it a spice for now). More important for the computation of the joint count matrix, some spices are not used even once across the recipe database. Because these columns contain no information, I remove them from the DataFrame.

To visualize these results, I plot the occurrences of individual spices on a log scale using horizontal bars. This gives an intuitive picture of how frequently each spice is used.

In [38]:

```
empty_columns = df2.columns[df2.sum(0) == 0]
df2 = df2.drop(empty_columns,axis=1)
print(df2.sum(0))
y=np.log10(df2.sum(0).sort_values())
y.plot.barh()
plt.xlabel("log10(occurrence)")
plt.show()
```

My initial question was about the joint usage statistics of different spice combinations.

Now that the empty columns are dropped, the joint count matrix can be computed with the following matrix multiplication

$$M = C^\top C$$

where $C$ is the binary recipe-by-spice matrix. The result is a symmetric, positive semidefinite matrix (positive definite only if the columns of $C$ are linearly independent).

In [23]:

```
columns = df2.astype(int).values
count_matrix = np.matmul(columns.transpose(),columns)
plt.figure(figsize=(20,10))
plt.imshow(count_matrix)
plt.yticks(range(len(df2.columns)),df2.columns)
plt.xticks(range(len(df2.columns)),df2.columns,rotation=90)
plt.colorbar()
plt.show()
```

One problem with this result is that the count matrix is completely dominated by the frequent use of salt in the recipes. This prevents us from appreciating the co-occurrence of other mixtures, which are contained in the count matrix but not visible.

To remedy this, I normalize the Boolean DataFrame by the occurrence frequency of individual spices before computing the joint count matrix. This is similar to converting a covariance matrix to a correlation matrix. Mathematically, it corresponds to weighting each column by the inverse square root of its sum:

$$M = (CD)^\top (CD)$$

where $D$ is a diagonal matrix that contains the inverse square root of the number of times each spice is used, so that each joint count is divided by $\sqrt{n_i n_j}$, with $n_i$ the number of recipes that use spice $i$.

In [25]:

```
# D holds the inverse square root of each spice's count on its diagonal
D = np.diag(columns.sum(0) ** -.5)
CC = columns.dot(D)            # C*D: scale each column
CCC = CC.transpose().dot(CC)   # (C*D)'*(C*D)
plt.figure(figsize=(20,10))
plt.imshow(CCC ** 2)
plt.yticks(range(len(df2.columns)),df2.columns)
plt.xticks(range(len(df2.columns)),df2.columns,rotation=90)
plt.clim(0,.05)
plt.colorbar()
plt.show()
```
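A quick way to check this normalization is to try it on a toy matrix. The sketch below uses a hypothetical 4 x 3 binary matrix `C` (not the recipe data) and compares the matrix product with the element-wise form, joint count divided by the square root of the product of the two individual counts.

```python
import numpy as np

# Toy binary matrix (hypothetical): 4 recipes x 3 spices.
C = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
])

counts = C.sum(axis=0).astype(float)   # how often each spice occurs
D = np.diag(counts ** -0.5)            # inverse square roots on the diagonal
M_norm = (C @ D).T @ (C @ D)           # (CD)'(CD)

# Same quantity written element-wise: count_ij / sqrt(count_i * count_j).
M = C.T @ C
M_alt = M / np.sqrt(np.outer(counts, counts))

print(np.round(M_norm, 3))
```

After this scaling the diagonal is exactly 1 for every spice, so a frequent spice such as salt no longer dominates the color scale.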

## To be continued:

- Compare DataFrame performance against looping through the recipes with re.findall. The DataFrame approach obviously adds overhead, but I would like to know how much.
- Make a little jQuery animation to select a spice and update the height of the bars that represent the joint probability of occurrence.
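For the first item, a plain-Python baseline could look roughly like the sketch below. The `recipes` list, the short `spices` list, and the helper `count_with_findall` are all hypothetical stand-ins; the real comparison would run on the actual database and the full spice list.

```python
import re
import timeit
from collections import Counter

# Hypothetical recipes, each a list of ingredient strings.
recipes = [
    ["1 tsp salt", "ground cumin", "olive oil"],
    ["salt", "fresh ginger", "cumin seeds"],
    ["sugar", "butter"],
]
spices = ["salt", "cumin", "ginger"]
spice_re = re.compile("|".join(re.escape(s) for s in spices))

def count_with_findall(recipes):
    """Count in how many recipes each spice appears (at most once per recipe)."""
    counts = Counter()
    for recipe in recipes:
        found = set()
        for ingredient in recipe:
            found.update(spice_re.findall(ingredient))
        counts.update(found)
    return counts

counts = count_with_findall(recipes)
print(counts)

# Rough timing of the loop; the number of repetitions is arbitrary.
elapsed = timeit.timeit(lambda: count_with_findall(recipes), number=1000)
print("findall loop: {:.4f} s for 1000 runs".format(elapsed))
```

Timing the DataFrame version with the same `timeit` call on the same input would give the overhead I am after.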