I’m not going to copy the Wikipedia article on AI here. I roughly know what it is and I assume you do too.

If you’re into philosophising about AI, that’s cool but this is a practical guide (mainly written for myself but you’re welcome to tag along).

I’m probably going to do some coding, and I’ll use some mathematical formulas along the way. I’ll explain my thought process every step of the way.

Cool. My aim is to be able to gather a LOT of knowledge about this topic in the timespan of one year.

I won’t just rehash theory here. I will try to find interesting datasets and solve real life problems.

Which language should I use? 

There are so many: C#, C++, Python, Matlab, Java, Lisp, Erlang,…

Most people on Quora seem to agree it heavily depends on the goal of the project, but generally they seem to prefer Python.

Another person checked which languages were used by the top contestants in the Google AI Challenge. The winners seem to be Java, C++ and Python, in that order.

Python has a lot of cool libraries like NumPy, SciPy and PyBrain. Matlab is proprietary and expensive. C++ is very fast but low-level, so slow to write. Java is very general-purpose but has lots of boilerplate and is a bit clumsy in some aspects, such as passing closures. Erlang seems to be good for parallel processes, not so much for computationally expensive tasks.

From everything I’ve read so far, it seems the logical choice would be either Python or C++.

And from this I gather it’s best not to write your own implementation for a neural network but to use existing libraries like Theano, Caffe or others.

According to this ranking Python is the second most popular language right now.

For now, I’ll stick with Python and check out some libraries. I might switch to C++ later on if needed. I’ve used Matlab a fair bit in my studies so might occasionally use that as well.

What should I read?

I need some basics first. Let’s see. This list of deep learning topics is quite intimidating.

Maybe I should set a goal for myself, otherwise I might get demotivated.

It seems deep learning is a new buzzword: on Google Trends it’s growing very quickly.

According to this Nvidia article, the field of AI has been progressing ridiculously fast since 2015.

until recently neural networks were all but shunned by the AI research community. They had been around since the earliest days of AI, and had produced very little in the way of “intelligence.” The problem was even the most basic neural networks were very computationally intensive, it just wasn’t a practical approach. Still, a small heretical research group led by Geoffrey Hinton at the University of Toronto kept at it, finally parallelizing the algorithms for supercomputers to run and proving the concept, but it wasn’t until GPUs were deployed in the effort that the promise was realized.

It seems deep neural networks just use more layers and more processing power and slightly different algorithms:

as the network is getting tuned or “trained” it’s coming up with wrong answers —  a lot. What it needs is training. It needs to see hundreds of thousands, even millions of images, until the weightings of the neuron inputs are tuned so precisely that it gets the answer right practically every time — fog or no fog, sun or rain. It’s at that point that the neural network has taught itself what a stop sign looks like; or your mother’s face in the case of Facebook; or a cat, which is what Andrew Ng did in 2012 at Google.

Ng’s breakthrough was to take these neural networks, and essentially make them huge, increase the layers and the neurons, and then run massive amounts of data through the system to train it. In Ng’s case it was images from 10 million YouTube videos. Ng put the “deep” in deep learning, which describes all the layers in these neural networks.

OK. I’ll shift my focus to deep learning, more specifically deep neural networks. It seems deep learning is mostly about neural networks anyway: on the Wikipedia page there’s only one non-neural-network algorithm mentioned, multilayer kernel machines.

Here’s a chart of deep learning software. Almost all of these are in Python or C++ which confirms our earlier suspicions.

This is a cool page with only the best of the best deep learning papers. That’s assuming more citations means better.

Still quite intimidating. I’ll start by reading the Wiki page on Deep Learning.

One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction.

To keep it simple for now: a feature or attribute simply means a specific input to the algorithm. For example if you’re processing an image – let’s say to classify it as a banana or a non-banana – the features would be all the pixels of the image. If you have a 100×100 pixel image, you’d have an input of 10,000 features (also called 10,000-dimensional) – actually 30,000 because you have 3 color values per pixel but let’s ignore that for now.
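Here’s a quick sketch of that counting in plain Python. The all-yellow image and the names (`WIDTH`, `grey_features`, `colour_features`) are made up for illustration; a real image would come from a file.

```python
WIDTH, HEIGHT, CHANNELS = 100, 100, 3

# A made-up 100x100 image where every pixel is the same (R, G, B) yellow.
image = [[(255, 225, 53) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# One feature per pixel: average the three colour values into a grey value.
grey_features = [sum(pixel) / CHANNELS for row in image for pixel in row]

# One feature per colour value: keep R, G and B as separate inputs.
colour_features = [value for row in image for pixel in row for value in pixel]

print(len(grey_features))    # 10000
print(len(colour_features))  # 30000
```

Either way the algorithm just sees one long flat list of numbers; the 2D layout of the picture is gone.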

From the page on feature learning:

Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurement is usually complex, redundant, and highly variable.

In our 10,000 pixel picture of a banana, there are going to be a lot of redundant pixels, which just means you don’t need all of them. In other words, the input space is far from optimal for the task of recognising a banana.

  1. You don’t need all the pixels, you can probably remove a lot of them
  2. Pixels lying next to each other are probably related (if a pixel is yellow, there’s a high chance its neighbours will be yellow as well, except on the edges of the banana)
  3. You probably don’t need the noise (high frequency information) to recognise the banana (for example the brown patches on the banana or small background features), but unfortunately all this info is included in the 10,000 pixel representation
  4. So you can probably transform the pixels to a completely different coordinate system that’s more suitable for the job. You could for example transform the image to the frequency spectrum using a Fourier transform and then remove the 50% highest-frequency dimensions. This would leave you with only 5,000 remaining attributes instead of 10,000.

Intermezzo: What’s the Frequency Domain?

This is important to grasp and I will probably use it a lot in the future so I need to explain.

Pretty much any signal can be represented in 2 ways. The normal, intuitive way is spatial. This is how we perceive everyday life. This can be 1D (sound waves), 2D (images), 3D (real life) or higher dimensional.

Banana in spatial domain

The special, weird way is frequency. It basically contains a value for each possible frequency present in the image.

Banana in frequency domain

Note that the above image contains exactly the same info as the banana image. You can do the inverse Fourier transform and reconstruct the banana exactly.
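A small sketch to check that round trip, assuming NumPy is available. I don’t have the banana photo in code form, so `image` is a random 100×100 greyscale stand-in, and “exactly” here means up to floating-point rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((100, 100))           # stand-in for a greyscale photo

spectrum = np.fft.fft2(image)            # spatial domain -> frequency domain
restored = np.fft.ifft2(spectrum).real   # frequency domain -> spatial domain

print(spectrum.shape)                    # one complex value per pixel
print(np.allclose(image, restored))      # the original image comes back
```

The frequency-domain version holds complex numbers, which is why pictures of it usually only show the magnitude.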

Let’s take a simpler example: a one dimensional sound wave signal (in this case it’s not really spatial but temporal – the x-axis represents time)

Sound wave with one frequency

In this case the signal is a pure sine wave, meaning it only has 1 frequency. If you do a Fourier transform to the frequency domain you end up with just a single frequency value different from zero. All other frequencies contain a value of zero:

Left: temporal domain, right: frequency domain
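To see that single spike for myself, here’s a rough sketch using a deliberately naive discrete Fourier transform in plain Python (a real project would use a library FFT instead). The 50 hertz tone, the 200 samples per second and the one-second length are my own assumptions, chosen so that bin number k corresponds to exactly k hertz.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform: one complex value per frequency bin."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

SAMPLE_RATE = 200  # samples per second (assumed)
FREQ = 50          # frequency of the pure sine wave, in hertz (assumed)

# One second of a pure sine wave.
signal = [math.sin(2 * math.pi * FREQ * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

spectrum = dft(signal)

# Only look at bins up to half the sample rate; for a real-valued signal
# the upper half of the spectrum mirrors the lower half.
magnitudes = [abs(c) / len(signal) for c in spectrum[: SAMPLE_RATE // 2 + 1]]
nonzero = [k for k, m in enumerate(magnitudes) if m > 1e-6]

print(nonzero)  # [50]
```

Every bin except the one at 50 hertz is zero (up to rounding), which is the single spike in the right-hand plot.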

If I combine multiple frequencies in one signal (I can just add all the sine waves together):

Adding multiple sine waves

This results in 3 different non-zero frequencies in the frequency domain: 50 hertz, 100 hertz and 150 hertz. (note: 50 hertz simply means 50 times per second).

Earlier we talked about removing 50% of the frequencies in an image. We can do something similar for this sound wave: we can simply remove the highest frequency (150 hertz), which removes 33% of all the data. This is a low-pass filter: it eliminates the noise (depending on how much data you remove – in this case you will also have removed actual data, not just noise).

After removing this one frequency, we can transform the sound wave back from the frequency domain to the temporal domain. We’ll see that the resulting signal doesn’t look the same anymore – the high frequency is gone and the signal looks “smoothed”.
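The whole sound wave experiment can be sketched with NumPy’s FFT functions. The sample rate of 1000 samples per second, the one-second duration and the 125 hertz cutoff are assumptions of mine; any cutoff between 100 and 150 hertz would do the same job.

```python
import numpy as np

SAMPLE_RATE = 1000                         # samples per second (assumed)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # one second of time stamps

# Add three sine waves together: 50, 100 and 150 hertz.
signal = sum(np.sin(2 * np.pi * f * t) for f in (50, 100, 150))

# Temporal domain -> frequency domain (rfft is the FFT for real-valued input).
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / SAMPLE_RATE)

# Frequencies that carry a non-negligible amount of the signal.
present = freqs[np.abs(spectrum) > 1.0]
print(present)                             # the three frequencies we put in

# Low-pass filter: zero everything above the cutoff, which drops 150 hertz.
CUTOFF = 125                               # hertz (assumed)
filtered = spectrum.copy()
filtered[freqs > CUTOFF] = 0

# Frequency domain -> temporal domain.
smoothed = np.fft.irfft(filtered, n=len(signal))

# The result should match a signal built from only 50 and 100 hertz.
expected = sum(np.sin(2 * np.pi * f * t) for f in (50, 100))
print(np.allclose(smoothed, expected))
```

Because the 150 hertz component here is a clean sine wave and not random noise, the filter removes it completely and leaves the other two untouched.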

What happens if we remove the 50% highest frequencies from the banana picture?

After transforming back to the spatial domain by applying an inverse Fourier transform, we get a smoothed banana!

Removing the high frequencies in the banana picture
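Here’s a sketch of how that smoothing could be done with NumPy. I’m not claiming this is how the picture above was produced: the helper `low_pass`, the blurry blob standing in for the banana and the noise level are all my own assumptions. “Highest frequencies” is interpreted as the coefficients furthest from zero frequency in the 2D spectrum.

```python
import numpy as np

def low_pass(image, keep_fraction=0.5):
    """Keep roughly the lowest `keep_fraction` of frequencies, zero the rest."""
    spectrum = np.fft.fft2(image)
    # Frequency of every coefficient along each axis, in cycles per pixel.
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)       # distance from zero frequency
    cutoff = np.quantile(radius, keep_fraction)
    mask = radius <= cutoff                    # True for the frequencies we keep
    smoothed = np.fft.ifft2(spectrum * mask).real
    return smoothed, mask

def rms(values):
    """Root mean square, used here as a simple error measure."""
    return float(np.sqrt(np.mean(values ** 2)))

# Stand-in for the banana: a smooth blob plus fine-grained random noise.
rng = np.random.default_rng(0)
y, x = np.mgrid[0:100, 0:100]
clean = np.exp(-((x - 50) ** 2 + (y - 50) ** 2) / (2 * 15 ** 2))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

smoothed, mask = low_pass(noisy, keep_fraction=0.5)

print(mask.mean())                                  # fraction of frequencies kept
print(rms(noisy - clean), rms(smoothed - clean))    # error before and after
```

The smooth blob lives almost entirely in the low frequencies, so it survives, while about half of the noise is thrown away with the high frequencies.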

Now back to our regularly programmed schedule!

PROBLEM: We don’t know which features are good banana-recognising features.

some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. — Pedro Domingos

In ye olden days, banana-recognition experts would manually pick some features because they knew the problem inside-out – “expert knowledge”.

Ideally we’d like to automate this. We want our algorithm to learn how to choose a good feature set, preferably even independent of the problem at hand – we want to also be able to use the same feature-choosing-algorithm for apple recognition or pomegranate recognition.


Expert knowledge is still important right now:

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng

But that’s exactly what deep learning is trying to solve:

One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction.


This is big! It would mean a machine can not only learn a task (using the features), it would also learn to choose the features themselves – to learn how to learn!

We want the features to be non-redundant and informative. We want as much “information” as possible in the feature set.

We’re not going to formally define “information”, but let’s say it differs from task to task which information will be relevant. For some tasks the “roundness” might be important, for others the “color” or “edges” or “spikiness”.

I know these sound random and arbitrary but that’s because they are. Generally it’s very hard to define what these features actually represent in the image, especially when you use a general purpose feature extraction method (versus letting experts choose the features).

Some examples of these general methods: principal component analysis (PCA), kernel PCA, independent component analysis (ICA). Using these we could for example reduce an image of 10,000 pixels to an input feature set of 70 values. Hopefully we capture most of the useful and relevant information in the original image with these 70 values.
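As a rough sketch of the PCA case, this is one way to go from 10,000 pixels to 70 values using only NumPy’s SVD. The helpers `pca_fit` and `pca_transform` are hypothetical names of mine, and the 100 random “images” are a placeholder: random pixels have no structure, so on this data the 70 values won’t capture much, but the mechanics are the same as for real photos.

```python
import numpy as np

def pca_fit(data, n_components):
    """Return the mean and the top principal directions of `data` (rows = samples)."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(data, mean, components):
    """Project samples onto the principal directions."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
images = rng.random((100, 10_000))   # 100 made-up images of 10,000 pixels each

mean, components = pca_fit(images, n_components=70)
features = pca_transform(images, mean, components)

print(images.shape)     # (100, 10000)
print(features.shape)   # (100, 70)
```

Each of the 70 new features is a weighted combination of all 10,000 pixels, which is part of why it’s hard to say what a single feature “means”.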

New features can be obtained by combining the original attributes (the pixels from the image) in one way or another. A simple example would be the “darkness” of a greyscale image, which you could obtain by taking the average of all the pixels. This would be 1 attribute you could use as input for the algorithm.
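A tiny sketch of that single feature in plain Python. I’m assuming the usual convention where 0 is black and 255 is white, in which case the plain average is really the brightness and the darkness is its complement; the function names and the 2×2 example image are made up.

```python
def mean_intensity(image):
    """Average of all pixel values in a greyscale image (a list of rows)."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)

def darkness(image, max_value=255):
    """0.0 for an all-white image, 1.0 for an all-black one (assumes 0 = black)."""
    return 1 - mean_intensity(image) / max_value

# A 2x2 image: top row black, bottom row white.
image = [
    [0, 0],
    [255, 255],
]

print(mean_intensity(image))  # 127.5
print(darkness(image))        # 0.5
```

Four pixels go in and one number comes out, which is feature extraction in its simplest form.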