Jekyll2018-04-17T02:06:40+00:00/SilverthreadA tech blog devoted to Data Science, full of articles and handy cheat sheets any Data Scientist might make use of. It covers topics from basic usage of Pandas, Python and Matplotlib to Deep Networks and Machine Learning. A collection of helpful, straightforward and easy-to-follow articles.
Migration2018-04-15T00:00:00+00:002018-04-15T00:00:00+00:00/2018/04/15/travelling_without_moving<p>A blog has travelled without moving to a <a href="https://quicksilver0.github.io/">new location</a></p>
<p>Without moving… how’s that possible? Check out <a href="https://youtu.be/zZD99X_S7SA">the video</a>.
<!--more--></p>Prediction using NLP and Keras Neural Net2018-01-22T00:00:00+00:002018-01-22T00:00:00+00:00/tutorial/2018/01/22/NLP_with_Keras<p>This Notebook focuses on NLP techniques combined with Keras-built Neural Networks. The idea is to complete an end-to-end project and to understand, through hands-on practice, the best approaches to processing text for Neural Networks. The tutorial shows how to prepare the data for a Neural Network with Keras and how to actually implement and run it.</p>
<p><strong>Project description:</strong> predict whether a film review is positive or negative. The dataset is a set of IMDb reviews labeled as positive/negative.</p>
<p>It is inspired by the <a href="https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/">Deep Learning for Natural Language Processing crash course</a> by Dr. Jason Brownlee.
<!--more--></p>
<h3 id="import-libraries">Import libraries</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">asarray</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">zeros</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span><span class="p">,</span> <span class="n">TfidfVectorizer</span>
<span class="kn">import</span> <span class="nn">gc</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="kn">from</span> <span class="nn">keras.layers.embeddings</span> <span class="kn">import</span> <span class="n">Embedding</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Convolution1D</span><span class="p">,</span> <span class="n">Conv1D</span><span class="p">,</span> <span class="n">Flatten</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">MaxPooling1D</span>
<span class="kn">from</span> <span class="nn">keras.callbacks</span> <span class="kn">import</span> <span class="n">TensorBoard</span>
</code></pre></div></div>
<h3 id="get-timestamps">Get Timestamps</h3>
<p>Define a function to display the time spent:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span>
<span class="k">def</span> <span class="nf">display_time_spent</span><span class="p">():</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="n">time_spent</span> <span class="o">=</span> <span class="p">(</span><span class="n">end_time</span> <span class="o">-</span> <span class="n">start_time</span><span class="p">)</span>
<span class="n">hours</span><span class="p">,</span> <span class="n">remainder</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">(</span><span class="n">time_spent</span><span class="o">.</span><span class="n">seconds</span><span class="p">,</span> <span class="mi">3600</span><span class="p">)</span>
<span class="n">minutes</span><span class="p">,</span> <span class="n">seconds</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">(</span><span class="n">remainder</span><span class="p">,</span> <span class="mi">60</span><span class="p">)</span>
<span class="n">duration_formatted</span> <span class="o">=</span> <span class="s">'</span><span class="si">%</span><span class="s">d:</span><span class="si">%02</span><span class="s">d:</span><span class="si">%02</span><span class="s">d'</span> <span class="o">%</span> <span class="p">(</span><span class="n">hours</span><span class="p">,</span> <span class="n">minutes</span><span class="p">,</span> <span class="n">seconds</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Wall time: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">duration_formatted</span><span class="p">))</span>
</code></pre></div></div>
<p>Put this at the start of the block execution</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">start_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
</code></pre></div></div>
<p>Put this at the end of the block execution</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">display_time_spent</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Wall time: 0:00:00
</code></pre></div></div>
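<p>As a side note, the helper above depends on a module-level <code class="highlighter-rouge">start_time</code>. A context-manager variant (my sketch, not part of the original notebook) keeps the start time local, so you cannot forget to set it:</p>

```python
import datetime
from contextlib import contextmanager

@contextmanager
def timed(label='Wall time'):
    # Record the start on entry, print the elapsed time on exit,
    # even if the wrapped block raises.
    start = datetime.datetime.now()
    try:
        yield
    finally:
        spent = datetime.datetime.now() - start
        hours, rem = divmod(spent.seconds, 3600)
        minutes, seconds = divmod(rem, 60)
        print('%s: %d:%02d:%02d' % (label, hours, minutes, seconds))
```

<p>Usage: wrap any block in <code class="highlighter-rouge">with timed(): ...</code> instead of pairing the two calls by hand.</p>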
<h3 id="read-in-data">Read in Data</h3>
<p>The dataset is a collection of 1000 positive and 1000 negative IMDb reviews.
Can be <a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz">downloaded here.</a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Define a function to read the file and return as a string</span>
<span class="k">def</span> <span class="nf">read_file</span><span class="p">(</span><span class="nb">file</span><span class="p">):</span>
<span class="c"># A with-block closes the file even after the return</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">return</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="c"># Read in positive reviews</span>
<span class="n">positive_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'nlp_keras_embedding/data/txt_sentoken/pos/cv*.txt'</span><span class="p">)</span>
<span class="n">positive_reviews_list</span> <span class="o">=</span> <span class="p">[</span> <span class="n">read_file</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">positive_files</span> <span class="p">]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">positive_reviews_list</span><span class="p">)</span>
<span class="n">reviews_positive_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span><span class="s">'review'</span><span class="p">:</span> <span class="n">positive_reviews_list</span><span class="p">,</span> <span class="s">'label'</span><span class="p">:</span> <span class="n">labels</span><span class="p">})</span>
<span class="c"># Read in negative reviews</span>
<span class="n">negative_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'nlp_keras_embedding/data/txt_sentoken/neg/cv*.txt'</span><span class="p">)</span>
<span class="n">negative_reviews_list</span> <span class="o">=</span> <span class="p">[</span> <span class="n">read_file</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">negative_files</span> <span class="p">]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">negative_reviews_list</span><span class="p">)</span>
<span class="n">reviews_negative_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span><span class="s">'review'</span><span class="p">:</span> <span class="n">negative_reviews_list</span><span class="p">,</span> <span class="s">'label'</span><span class="p">:</span> <span class="n">labels</span><span class="p">})</span>
<span class="c"># Concatenate the dataframes into one</span>
<span class="n">reviews_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">reviews_positive_df</span><span class="p">,</span><span class="n">reviews_negative_df</span><span class="p">],</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">reviews_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
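<p>A quick sanity check (my addition; the snippet uses a toy stand-in frame so it runs on its own, but on the real data the same call should report 1000 reviews per class):</p>

```python
import pandas as pd

# Toy stand-in for reviews_df; in the notebook, call value_counts()
# on the real concatenated frame instead.
reviews_df = pd.DataFrame({'review': ['good movie', 'bad movie'],
                           'label': [1, 0]})

# Count samples per class; with the full dataset this should be 1000 each.
counts = reviews_df['label'].value_counts()
print(counts.to_dict())
```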
<h3 id="split-the-dataset">Split the dataset</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labels</span> <span class="o">=</span> <span class="n">reviews_df</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">reviews_df</span><span class="p">[</span><span class="s">'review'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Split dataset</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</code></pre></div></div>
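<p>A small hedge worth considering, not used in the original: passing <code class="highlighter-rouge">stratify=labels</code> keeps the 50/50 class balance in both splits. A toy illustration:</p>

```python
from sklearn.model_selection import train_test_split

# 20 toy samples, perfectly balanced labels.
texts = ['t%d' % i for i in range(20)]
labels = [0] * 10 + [1] * 10

# With stratify, the 10% test split keeps the class ratio:
# 2 test samples means exactly one of each class.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.1, random_state=4, stratify=labels)
print(sorted(y_te))
```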
<h3 id="vectorize-and-preprocess-text">Vectorize and Preprocess text</h3>
<p>Build a bag-of-words matrix representation and fit the vectorizer on the training dataset.
I’ve experimented with different vectorizers; <em>binary</em> counts turn out to show the best results with NNs.</p>
<p>Notice the word-preprocessing options passed to CountVectorizer:</p>
<ul>
<li>the token pattern requires a minimum of 3 characters (letters, digits, or the "-" and "_" signs)</li>
<li>English stopwords are filtered out</li>
<li>n-grams are generated from 1 to 3 tokens</li>
<li>a word must occur at least 3 times to be kept (min_df=3)</li>
</ul>
<p>CountVectorizer also lowercases the text automatically.</p>
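<p>To see what these options do in miniature, here is the same configuration (minus <code class="highlighter-rouge">min_df</code>, since the toy corpus is tiny) on two made-up reviews:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "reviews"; stopwords like "the", "was", "not" are dropped,
# then unigrams and bigrams are built from what remains.
docs = ['the film was great great fun', 'the film was not great']
vec = CountVectorizer(binary=True, ngram_range=(1, 2),
                      token_pattern=r'(?u)\b[a-z0-9\-\_]{3,}\b',
                      stop_words='english')
matrix = vec.fit_transform(docs)

# binary=True caps every cell at 1 even though "great" occurs twice.
print(sorted(vec.vocabulary_))
```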
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># fit vectorizer</span>
<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">binary</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">ngram_range</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="n">token_pattern</span><span class="o">=</span><span class="s">'(?u)</span><span class="se">\\</span><span class="s">b[a-z0-9</span><span class="err">\</span><span class="s">-</span><span class="err">\</span><span class="s">_]{3,}</span><span class="se">\\</span><span class="s">b'</span><span class="p">,</span> <span class="n">stop_words</span><span class="o">=</span><span class="s">'english'</span><span class="p">)</span>
<span class="c">#vectorizer = CountVectorizer(binary=True, min_df=3, ngram_range=(1,3), token_pattern='(?u)\\b[a-z0-9\-\_][a-z0-9\-\_]+\\b')</span>
<span class="c">#vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1,3))</span>
<span class="c"># tokenize and build vocab</span>
<span class="n">vectorizer</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c"># summarize</span>
<span class="c">#print(vectorizer.vocabulary_)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CountVectorizer(analyzer='word', binary=True, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=3,
ngram_range=(1, 3), preprocessor=None, stop_words='english',
strip_accents=None, token_pattern='(?u)\\b[a-z0-9\\-\\_]{3,}\\b',
tokenizer=None, vocabulary=None)
</code></pre></div></div>
<p>Transform the train and test datasets to sparse matrices, and then to dense arrays. (We need array form in order to feed the Keras NN layer.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train_vec</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_train_arr</span> <span class="o">=</span> <span class="n">X_train_vec</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span>
<span class="c"># summarize encoded vector</span>
<span class="k">print</span><span class="p">(</span><span class="n">X_train_arr</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1800, 34402)
</code></pre></div></div>
<p>Test dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_test_vec</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">X_test_arr</span> <span class="o">=</span> <span class="n">X_test_vec</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span>
<span class="c"># summarize encoded vector</span>
<span class="k">print</span><span class="p">(</span><span class="n">X_test_arr</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(200, 34402)
</code></pre></div></div>
<p>So we have quite large arrays to feed into our Neural Network: the train dataset is 1800 rows (samples) of 34k features! Let’s see further how the network learns from this sparse representation.</p>
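<p>A back-of-envelope estimate (my addition) of why these dense arrays are heavy: <code class="highlighter-rouge">toarray()</code> produces int64 cells, 8 bytes each, while the sparse matrix stores only the nonzero entries.</p>

```python
# Dense int64 term-document matrix: rows x features x 8 bytes per cell.
rows, features = 1800, 34402
bytes_dense = rows * features * 8
print('%.1f MiB' % (bytes_dense / 2**20))  # roughly 472 MiB
```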
<p>Get the size of the vocabulary. This length will be used as the input dimension of our NN model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_words</span> <span class="o">=</span> <span class="n">X_train_arr</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<h3 id="define-a-model">Define a Model</h3>
<p>Define Neural Network Architecture and compile.</p>
<p>A cursory exploration ended up with a two-layer architecture of 100 neurons each, with 0.1 dropout regularization, showing the best results.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># define NN model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">n_words</span><span class="p">,),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.1</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.1</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">))</span>
<span class="c"># compile NN network</span>
<span class="n">tensorBoardCallback</span> <span class="o">=</span> <span class="n">TensorBoard</span><span class="p">(</span><span class="n">log_dir</span><span class="o">=</span><span class="s">'./logs'</span><span class="p">,</span> <span class="n">write_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'binary_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
</code></pre></div></div>
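<p>As a sanity check, the trainable parameters of this architecture can be counted by hand (this should agree with <code class="highlighter-rouge">model.summary()</code>); nearly all of the weight sits in the first layer:</p>

```python
# Hand-count of trainable parameters, assuming n_words = 34402
# as reported by the vectorizer above.
n_words = 34402
dense1 = n_words * 100 + 100   # weights + biases of the first Dense(100)
dense2 = 100 * 100 + 100       # second Dense(100)
out = 100 * 1 + 1              # Dense(1) sigmoid output
total = dense1 + dense2 + out
print(total)  # 3,450,501 parameters, ~99.7% of them in dense1
```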
<h3 id="train-and-evaluate">Train and evaluate</h3>
<p>Fit the model and evaluate.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">start_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="c"># fit network</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_arr</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">tensorBoardCallback</span><span class="p">],</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">display_time_spent</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch 1/10
- 3s - loss: 0.4847 - acc: 0.7694
Epoch 2/10
- 3s - loss: 0.0281 - acc: 0.9956
Epoch 3/10
- 3s - loss: 0.0023 - acc: 1.0000
Epoch 4/10
- 3s - loss: 8.0274e-04 - acc: 1.0000
Epoch 5/10
- 3s - loss: 4.5116e-04 - acc: 1.0000
Epoch 6/10
- 3s - loss: 2.7223e-04 - acc: 1.0000
Epoch 7/10
- 3s - loss: 1.6565e-04 - acc: 1.0000
Epoch 8/10
- 3s - loss: 1.2852e-04 - acc: 1.0000
Epoch 9/10
- 3s - loss: 8.2725e-05 - acc: 1.0000
Epoch 10/10
- 3s - loss: 5.5967e-05 - acc: 1.0000
Wall time: 0:01:26
</code></pre></div></div>
<p>Evaluate:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">start_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="c"># evaluate</span>
<span class="n">loss</span><span class="p">,</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">X_test_arr</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test Accuracy: </span><span class="si">%</span><span class="s">f'</span> <span class="o">%</span> <span class="p">(</span><span class="n">acc</span><span class="o">*</span><span class="mi">100</span><span class="p">))</span>
<span class="n">display_time_spent</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Test Accuracy: 91.000000
Wall time: 0:00:00
</code></pre></div></div>
<h3 id="wrap-up">Wrap up</h3>
<p>Nice! A simple Neural Network with just 2 layers of 100 neurons each gives very good results when predicting from a relatively large bag of words!</p>
<p>So far, this tutorial explains in full:</p>
<ul>
<li>text preprocessing and dataset preparation</li>
<li>NN model implementation with Keras</li>
<li>prediction</li>
</ul>
<p>Let me know your ideas, additions and approaches to this problem. I’d be happy to hear from you and answer any questions.</p>Filter Spam with Machine Learning2017-12-06T00:00:00+00:002017-12-06T00:00:00+00:00/tutorial/2017/12/06/ML_to_Filter_Spam<p>This guide contains a very simple, yet powerful Machine Learning technique to filter spam!</p>
<p>Multinomial Naive Bayes with minimal preprocessing yields incredible results! I was quite amazed by its ability to learn from text.</p>
<p>Please note, this technique is derived from the <a href="https://www.coursera.org/learn/python-text-mining/home/welcome">Applied Text Mining in Python</a> course by the University of Michigan, which I highly recommend to anyone in Computer Science and Machine Learning.</p>
<p>Ok, let’s dive in.
<!--more--></p>
<h3 id="read-data">Read data</h3>
<p>First we need data. You can get <a href="https://github.com/SilverSurfer0/SilverSurfer0.github.io/blob/master/data/spam.csv">spam.csv here</a>. The file contains labeled data to train our model on. Usually this is data labeled manually by humans or users.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">spam_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'spam.csv'</span><span class="p">)</span>
<span class="n">spam_data</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">spam_data</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span><span class="o">==</span><span class="s">'spam'</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span>
<span class="n">spam_data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 Go until jurong point, crazy.. Available only ... 0
1 Ok lar... Joking wif u oni... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... 1
3 U dun say so early hor... U c already then say... 0
4 Nah I don't think he goes to usf, he lives aro... 0
5 FreeMsg Hey there darling it's been 3 week's n... 1
6 Even my brother is not like to speak with me. ... 0
7 As per your request 'Melle Melle (Oru Minnamin... 0
8 WINNER!! As a valued network customer you have... 1
9 Had your mobile 11 months or more? U R entitle... 1
</code></pre></div></div>
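<p>The <code class="highlighter-rouge">np.where</code> call above is just a vectorized version of a plain-Python mapping; an equivalent sketch:</p>

```python
# Map string labels to 1 (spam) / 0 (ham),
# as np.where(spam_data['target'] == 'spam', 1, 0) does in one shot.
raw = ['ham', 'spam', 'ham', 'spam']
target = [1 if label == 'spam' else 0 for label in raw]
print(target)  # [0, 1, 0, 1]
```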
<p>The next step would be splitting the datasets into Train and Test portions, using <code class="highlighter-rouge">train_test_split</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">spam_data</span><span class="p">[</span><span class="s">'text'</span><span class="p">],</span>
<span class="n">spam_data</span><span class="p">[</span><span class="s">'target'</span><span class="p">],</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="preprocess-data">Preprocess data</h3>
<p>Now we need to preprocess the data into a so-called term-document matrix. It is a two-dimensional matrix where each row is a sample and each token (word) becomes a feature (a column). The values are the number of occurrences of each token in each sample.</p>
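<p>To make the idea concrete, here is a hand-rolled term-document matrix for two toy messages (conceptually what CountVectorizer builds, minus all its options):</p>

```python
# Build a term-document matrix by hand: one row per message,
# one column per vocabulary term, cells are occurrence counts.
docs = ['free prize call now', 'call me now now']
vocab = sorted({token for doc in docs for token in doc.split()})
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]
print(vocab)   # ['call', 'free', 'me', 'now', 'prize']
print(matrix)  # [[1, 1, 0, 1, 1], [1, 0, 1, 2, 0]]
```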
<p><code class="highlighter-rouge">sklearn</code> provides a tool, <code class="highlighter-rouge">CountVectorizer</code>, which essentially does all the heavy lifting of preprocessing: it tokenizes, transforms the data to a matrix, and also comes with a bunch of other useful parameters, such as presenting features as <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> of words or filtering out stopwords.</p>
<p>Ok, great, all we need is to run it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">()</span> <span class="c"># Instantiate vectorizer</span>
<span class="n">X_vect_matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span> <span class="c"># Generate term document matrix of tokens</span>
</code></pre></div></div>
<h3 id="train-the-model">Train the model</h3>
<p>We have the matrix! We can now train the Multinomial Bayes Model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">MultinomialNB</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="n">bayes_model</span> <span class="o">=</span> <span class="n">MultinomialNB</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span> <span class="c"># Instantiate Multinomial Naive Bayes with slight smoothing parameter 0.1</span>
<span class="n">bayes_model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_vect_matrix</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> <span class="c"># Train Bayes</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
</code></pre></div></div>
<p>Few lines of code and our model is trained.</p>
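<p>What the <code class="highlighter-rouge">alpha=0.1</code> parameter does, shown with made-up counts (my illustration): it is Lidstone smoothing, which keeps the likelihood of a never-seen word above zero instead of zeroing out the whole class posterior.</p>

```python
# Smoothed word likelihood for class c:
#   P(w|c) = (count(w, c) + alpha) / (total_tokens(c) + alpha * vocab_size)
# Toy, hypothetical numbers - not taken from the spam dataset.
alpha, vocab_size = 0.1, 1000
count_w, total_c = 0, 5000        # a word never seen in class c
p_unseen = (count_w + alpha) / (total_c + alpha * vocab_size)
print(p_unseen)  # small but strictly positive
```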
<p>Now we can see the result! (Remember that X_test must be transformed into a term-document matrix in the same way.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predictions</span> <span class="o">=</span> <span class="n">bayes_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">))</span> <span class="c"># Obtain predictions using X_test transformed to a term document matrix.</span>
<span class="n">auc</span> <span class="o">=</span> <span class="n">roc_auc_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">predictions</span><span class="p">)</span> <span class="c"># Measure roc_auc</span>
<span class="k">print</span><span class="p">(</span><span class="n">auc</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.972081218274
</code></pre></div></div>
<p><strong>97.2%!</strong> Spectacular, isn’t it? With just a few lines of code.</p>
<p>Yes, text preprocessing usually requires much more effort, and even some ingenuity to engineer new features, in classical Machine Learning. Yet Naive Bayes handles such tasks really well.</p>
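<p>As a quick extension (not part of the original walkthrough), the smoothing parameter <code class="highlighter-rouge">alpha</code> could be tuned with a cross-validated grid search; the sketch below uses a tiny made-up corpus purely for illustration:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the post this would be X_train / y_train.
docs = ["great film", "loved the movie", "wonderful acting", "great story",
        "terrible film", "hated the movie", "awful acting", "boring story"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so the grid search re-fits both.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
grid = GridSearchCV(pipe, {"multinomialnb__alpha": [0.1, 0.5, 1.0]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
```

On a real dataset the same pattern applies unchanged; only <code class="highlighter-rouge">docs</code> and <code class="highlighter-rouge">labels</code> would come from the train split.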
<p>Good Luck:)</p>This guide contains a very simple yet powerful Machine Learning technique to filter spam! Multinomial Naive Bayes with minimal preprocessing yields incredible results! I was quite amazed by its efficiency at learning from text. Please note, this technique is derived from the Applied Text Mining in Python course by the University of Michigan, which I highly recommend to anyone in Computer Science and Machine Learning. OK, let’s dive in.text mining - HackerNews Buzz-headlines2017-11-29T00:00:00+00:002017-11-29T00:00:00+00:00/project/2017/11/29/headlines_popularity<p>This project is an exploration of headline frequency and popularity.</p>
<p>The dataset is a collection of headlines from <a href="https://hn.premii.com/">HackerNews</a> portal gathered for period 2006-2015.</p>
<p><em>Note:</em> <a href="#hn-graph">to see final result scroll to the end</a></p>
<p>Let’s dive into an exploration of the so-called metaparameters and find out:</p>
<ul>
<li>How does popularity correlate with headline length?</li>
<li>How does popularity vary across quarterly periods?</li>
<li>How does overall popularity on the website change over time?</li>
<li>What <em>buzzphrases</em> appear in the headlines, and how do they change over time?</li>
</ul>
<p>Intuitively we can tell why these particular articles were popular - that most probably depends on the content, relevance and the author. Still, let’s look for some unobvious correlations and then visualize the most popular headlines!
<!--more--></p>
<h3 id="dataset">Dataset</h3>
<p>The dataset was compiled by Arnaud Drizard using the Hacker News API, and can be found <a href="https://github.com/arnauddri/hn">here</a>. The file contains 1,553,934 entries, is 171 MB uncompressed, and uses the following column titles:
<code class="highlighter-rouge">id</code>, <code class="highlighter-rouge">created_at</code>, <code class="highlighter-rouge">created_at_i</code>, <code class="highlighter-rouge">author</code>, <code class="highlighter-rouge">points</code>, <code class="highlighter-rouge">url_hostname</code>, <code class="highlighter-rouge">num_comments</code>, <code class="highlighter-rouge">title</code></p>
<p>For the sake of this mission only 4 columns are chosen and renamed appropriately:</p>
<ul>
<li><code class="highlighter-rouge">submission_time</code> – when the story was submitted.</li>
<li><code class="highlighter-rouge">upvotes</code> – number of upvotes the submission got.</li>
<li><code class="highlighter-rouge">url</code> – the base domain of the submission.</li>
<li><code class="highlighter-rouge">headline</code> – the headline of the submission. Users can edit this, and it doesn’t have to match the headline of the original article.</li>
</ul>
<h3 id="read-in-preprocess">Read-in, Preprocess</h3>
<p>Read in the data and preprocess it for text mining.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'stories.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'id'</span><span class="p">,</span> <span class="s">'submission_time'</span><span class="p">,</span> <span class="s">'posix_time'</span><span class="p">,</span> <span class="s">'author'</span><span class="p">,</span> <span class="s">'upvotes'</span><span class="p">,</span> <span class="s">'url'</span><span class="p">,</span> <span class="s">'comments_num'</span><span class="p">,</span> <span class="s">'headline'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s">'submission_time'</span><span class="p">,</span> <span class="s">'upvotes'</span><span class="p">,</span> <span class="s">'url'</span><span class="p">,</span> <span class="s">'headline'</span><span class="p">]]</span>
</code></pre></div></div>
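<p>An alternative (a sketch, assuming the CSV has no header row) is to let <code class="highlighter-rouge">read_csv</code> do the naming and column selection in one call via <code class="highlighter-rouge">names</code> and <code class="highlighter-rouge">usecols</code>; the in-memory sample below stands in for <code class="highlighter-rouge">stories.csv</code>:</p>

```python
import io
import pandas as pd

cols = ['id', 'submission_time', 'posix_time', 'author',
        'upvotes', 'url', 'comments_num', 'headline']

# Tiny in-memory stand-in for stories.csv, used only for illustration.
csv_data = io.StringIO(
    "1,2015-01-01T10:00:00Z,1420106400,alice,5,example.com,2,Some headline\n"
)

# Name all columns up front and keep only the four we need.
df = pd.read_csv(csv_data, header=None, names=cols,
                 usecols=['submission_time', 'upvotes', 'url', 'headline'])
print(df.columns.tolist())
```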
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<p>OK, now we can see what the data is about.</p>
<p>Here is info on the dataframe.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1553933 entries, 0 to 1553932
Data columns (total 4 columns):
submission_time 1553933 non-null object
upvotes 1553933 non-null int64
url 1459198 non-null object
headline 1550599 non-null object
dtypes: int64(1), object(3)
memory usage: 47.4+ MB
</code></pre></div></div>
<p>It is rather large, 1.5M entries, which makes it interesting to explore.</p>
<p>Now, clean it up with dropna() so that we have a tidy dataset. We can afford such luxury, since the dataset is rich and filling in nulls would not add much value in this case.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1455871
</code></pre></div></div>
<p>Roughly a hundred thousand records have been removed, which is acceptable.</p>
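<p>As an optional sanity check (not in the original), the per-column null counts can be inspected before dropping, to see which columns the removed rows came from; the toy frame below stands in for the real dataset:</p>

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({'url': ['a.com', None, 'b.com'],
                   'headline': ['x', 'y', None]})

# Number of missing values per column.
nulls = df.isnull().sum()
print(nulls)
```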
<h4 id="core-preprocessing">Core preprocessing</h4>
<p>For further processing the <code class="highlighter-rouge">nltk</code> library is essential; it is especially appreciated for analyzing larger datasets.</p>
<p>The steps are:</p>
<ul>
<li>lowercase</li>
<li>tokenize</li>
<li>strip punctuation tokens</li>
<li>remove ‘stopwords’</li>
<li>lemmatize</li>
<li>join the cleaned tokens back into a single phrase-string</li>
</ul>
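<p>Before reaching for <code class="highlighter-rouge">nltk</code>, the same steps can be sketched with plain Python (a simplified stand-in: a tiny hand-made stopword list, naive whitespace tokenization, and no real lemmatization):</p>

```python
# Minimal dependency-free sketch of the preprocessing steps above.
STOPWORDS = {'the', 'a', 'an', 'of', 'in', 'is', 'to'}  # toy subset

def preprocess_sketch(headline):
    tokens = headline.lower().split()           # lowercase + naive tokenize
    words = [w for w in tokens if w.isalpha()]  # drop tokens with punctuation/digits
    return ' '.join(w for w in words if w not in STOPWORDS)

print(preprocess_sketch('The Rise of Python in 2015!'))  # rise python
```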
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="k">def</span> <span class="nf">preprocess_headline</span><span class="p">(</span><span class="n">headline</span><span class="p">):</span>
<span class="n">headline</span> <span class="o">=</span> <span class="n">headline</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
<span class="c"># tokenize (smart split)</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">word_tokenize</span><span class="p">(</span><span class="n">headline</span><span class="p">)</span>
<span class="c"># Stripping of punctuation symbols:</span>
<span class="n">words_tokenized_nopunct</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">tokens</span> <span class="k">if</span> <span class="n">w</span><span class="o">.</span><span class="n">isalpha</span><span class="p">()]</span>
<span class="c"># Clean of stopwords. Note transformation to set, rather than using a list. Supposedly gives performance boost.</span>
<span class="n">stopwords</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">nltk</span><span class="o">.</span><span class="n">corpus</span><span class="o">.</span><span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">))</span>
<span class="n">words_except_stop</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words_tokenized_nopunct</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">]</span>
<span class="c"># Lemmatize (smart-stemming of the words)</span>
<span class="n">WNlemma</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">WordNetLemmatizer</span><span class="p">()</span>
<span class="n">words_lemmatized</span> <span class="o">=</span> <span class="p">[</span><span class="n">WNlemma</span><span class="o">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">words_except_stop</span><span class="p">]</span>
<span class="c"># Gather back into a single phrase-string</span>
<span class="n">preprocessed_headline</span> <span class="o">=</span> <span class="s">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words_lemmatized</span><span class="p">)</span>
<span class="c">#preprocessed_headline = ' '.join(words_tokenized_nopunct)</span>
<span class="c">#print(preprocessed_headline)</span>
<span class="k">return</span> <span class="n">preprocessed_headline</span>
</code></pre></div></div>
<p>At this point, creating a copy of the original dataset is a good decision, so that we continue the transformation on the copy!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
</code></pre></div></div>
<p>And now applying preprocessing of the headlines.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span><span class="p">[</span><span class="s">'processed_headline'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="p">[</span><span class="s">'headline'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">preprocess_headline</span><span class="p">)</span>
</code></pre></div></div>
<p>The processing might take a few minutes to transform 1.5M records. By the end we get nicely processed headlines in a new column.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<p>Great, so far we have phrases consisting of normalized words: all of them lowercased, in the same base form, and with no punctuation. Now we can work with the column further and count similar phrases!</p>
<p>Next we can extract so-called noun phrases from the headlines. Noun phrases are much better for our exploration than single words: for example, ‘steve jobs’ occurs in headlines as a noun phrase, whereas if we aimed for single words, we would only get ‘steve’ or ‘jobs’ to count.</p>
<p>So, let’s derive noun phrases and put them all in one huge list!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">textblob</span> <span class="kn">import</span> <span class="n">TextBlob</span>
<span class="n">noun_phrase_list</span> <span class="o">=</span> <span class="p">[</span> <span class="nb">list</span><span class="p">(</span><span class="n">TextBlob</span><span class="p">(</span><span class="n">processed_headline</span><span class="p">)</span><span class="o">.</span><span class="n">noun_phrases</span><span class="p">)</span> <span class="k">for</span> <span class="n">processed_headline</span> <span class="ow">in</span> <span class="n">dataset_df</span><span class="p">[</span><span class="s">'processed_headline'</span><span class="p">]</span> <span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">noun_phrase_list</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[['business advice'], ['note superfish'], []]
</code></pre></div></div>
<p>We now have a list of lists; let’s flatten it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">noun_phrase_flat</span> <span class="o">=</span> <span class="p">[</span> <span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">noun_phrase_list</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span> <span class="p">]</span>
</code></pre></div></div>
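<p>The nested list comprehension above works fine; an equivalent standard-library idiom is <code class="highlighter-rouge">itertools.chain.from_iterable</code>:</p>

```python
from itertools import chain

# Hypothetical nested list standing in for noun_phrase_list.
nested = [['business advice'], ['note superfish'], []]
flat = list(chain.from_iterable(nested))
print(flat)  # ['business advice', 'note superfish']
```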
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">noun_phrase_flat</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['business advice',
'note superfish',
'php uk conference diversity scholarship programme']
</code></pre></div></div>
<p>Looks great!</p>
<p>A powerful technique is to feed the list into a <code class="highlighter-rouge">Counter</code> from the <code class="highlighter-rouge">collections</code> module. A <code class="highlighter-rouge">Counter</code> is basically a dictionary with extra methods for counting, such as <code class="highlighter-rouge">most_common</code>!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="n">counter_collection</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">noun_phrase_flat</span><span class="p">)</span>
<span class="n">counter_collection</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('show hn', 8360),
('open source', 2021),
('social medium', 1691),
('social network', 1354),
('steve job', 1296),
('big data', 1204),
('silicon valley', 951),
('small business', 686),
('new york', 658),
('combinator bookmarklet', 656),
('mobile apps', 533),
('google glass', 533),
('mobile phone', 416),
('google chrome', 408),
('mobile app', 400),
('hacker news', 391),
('new way', 388),
('bill gate', 366),
('app store', 363),
('search engine', 343)]
</code></pre></div></div>
<p>Cool! However, we can see that the topmost phrase is ‘show hn’, which stands for ‘Show Hacker News’ - a conventional prefix users add when submitting their own projects. It carries no topical information, so let’s clean it out.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrases_top_20</span> <span class="o">=</span> <span class="n">counter_collection</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">21</span><span class="p">)[</span><span class="mi">1</span><span class="p">:]</span>
</code></pre></div></div>
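<p>An equivalent approach (a small variation, not from the original) is to delete the key outright, since <code class="highlighter-rouge">Counter</code> supports dictionary-style deletion:</p>

```python
from collections import Counter

# Toy counter standing in for counter_collection.
counts = Counter({'show hn': 8360, 'open source': 2021, 'big data': 1204})
del counts['show hn']          # drop the uninformative prefix phrase
print(counts.most_common(2))   # [('open source', 2021), ('big data', 1204)]
```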
<h4 id="get-distribution">Get distribution</h4>
<p>Let’s visualize distribution of top 20 frequent phrases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">rslt_dist_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">phrases_top_20</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">(</span><span class="s">'Phrase'</span><span class="p">,</span><span class="s">'freq'</span><span class="p">))</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'Phrase'</span><span class="p">)</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'ggplot'</span><span class="p">)</span>
<span class="n">bars</span> <span class="o">=</span> <span class="n">rslt_dist_df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">rot</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Top Frequent Headline buzzPhrases for 10 years in HackerNews'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">40</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Headlines_popularity_files/Headlines_popularity_28_0.png" alt="png" /></p>
<p>Great! We have a clear picture of buzzphrases in HackerNews headlines over 10 years!</p>
<h3 id="most-frequent-domains">Most frequent domains</h3>
<p>Let’s visualize the most frequent domains. Some addresses appear as <code class="highlighter-rouge">subdomain.domain.com</code>, which we convert to <code class="highlighter-rouge">domain.com</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># Define convert-function subdomain.domain -> domain. We can apply it onto series values.</span>
<span class="k">def</span> <span class="nf">unify_domains</span><span class="p">(</span><span class="n">value</span><span class="p">):</span>
<span class="n">url</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
<span class="n">subdom_dom_match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">'.*</span><span class="err">\</span><span class="s">.(</span><span class="err">\</span><span class="s">w+</span><span class="err">\</span><span class="s">.(?:</span><span class="err">\</span><span class="s">w{2,3}||(?!co.uk))$)'</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
<span class="n">is_gb</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">'.*</span><span class="err">\</span><span class="s">.co</span><span class="err">\</span><span class="s">.uk'</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">subdom_dom_match</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">is_gb</span><span class="p">:</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">subdom_dom_match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">url</span>
<span class="n">domain_series</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="p">[</span><span class="s">'url'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">unify_domains</span><span class="p">)</span>
</code></pre></div></div>
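<p>A regex-free alternative (a sketch with a hard-coded special case for <code class="highlighter-rouge">.co.uk</code>, not a general public-suffix solution) is to split on dots and keep the trailing labels:</p>

```python
def base_domain(url):
    """Keep the last two labels, or three for .co.uk-style hosts."""
    parts = str(url).split('.')
    keep = 3 if parts[-2:] == ['co', 'uk'] else 2
    return '.'.join(parts[-keep:]) if len(parts) >= keep else str(url)

print(base_domain('blog.example.com'))   # example.com
print(base_domain('news.bbc.co.uk'))     # bbc.co.uk
```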
<p>Get top frequent domains</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_domains</span> <span class="o">=</span> <span class="n">domain_series</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()[:</span><span class="mi">10</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">top_domains</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>blogspot.com 36807
github.com 30312
techcrunch.com 26609
nytimes.com 24125
youtube.com 22029
google.com 16303
wordpress.com 15531
medium.com 12991
arstechnica.com 12336
wired.com 11070
Name: url, dtype: int64
</code></pre></div></div>
<h3 id="frequency-by-hour-of-the-day">Frequency by hour of the day</h3>
<p>Which hours of the day are most prolific for publishing?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dateutil</span>
<span class="c"># Getting submission distribution per hour</span>
<span class="k">def</span> <span class="nf">parse_hours</span><span class="p">(</span><span class="n">value</span><span class="p">):</span>
<span class="n">datetime_val</span> <span class="o">=</span> <span class="n">dateutil</span><span class="o">.</span><span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">value</span><span class="p">))</span>
<span class="c">#hour = value.hour</span>
<span class="n">hour</span> <span class="o">=</span> <span class="n">datetime_val</span><span class="o">.</span><span class="n">hour</span>
<span class="k">return</span> <span class="n">hour</span>
<span class="n">hour_series</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">parse_hours</span><span class="p">)</span>
</code></pre></div></div>
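<p>A vectorized alternative (usually much faster than a row-wise <code class="highlighter-rouge">apply</code> on 1.5M rows) uses the pandas <code class="highlighter-rouge">.dt</code> accessor; the toy series below stands in for the real column:</p>

```python
import pandas as pd

# Toy timestamps standing in for dataset_df['submission_time'].
times = pd.Series(['2014-05-29T08:23:46Z', '2014-05-29T16:01:02Z'])
hour_series = pd.to_datetime(times).dt.hour
print(hour_series.tolist())  # [8, 16]
```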
<p>Display publish distribution over hours</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hour_dist</span> <span class="o">=</span> <span class="n">hour_series</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">hour_dist</span><span class="p">[:</span><span class="mi">5</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>16 95123
17 94090
15 92098
18 89685
14 85719
Name: submission_time, dtype: int64
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dist_sorted</span> <span class="o">=</span> <span class="n">hour_dist</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
<span class="n">matplotlib</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'ggplot'</span><span class="p">)</span>
<span class="n">bars</span> <span class="o">=</span> <span class="n">dist_sorted</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">rot</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="n">width</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Publish frequency during day (UTC time) for 2006-2015'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="Headlines_popularity_files/Headlines_popularity_38_0.png" alt="png" /></p>
<p>Fair enough - the most frequent publishing happened in the afternoon-to-evening hours (UTC).</p>
<h3 id="popularity-to-headline-length">Popularity to Headline length</h3>
<p>Is there such a correlation?</p>
<p>For that we need an additional column holding the length of each headline.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span><span class="p">[</span><span class="s">'headline_length'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="p">[</span><span class="s">'headline'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">)</span>
<span class="n">dataset_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<p>Checking Pearson’s coefficient:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span>
</code></pre></div></div>
<p>A coefficient below 0.25 is generally considered insignificant, suggesting there is no meaningful linear correlation.</p>
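<p>For a single pair of columns, <code class="highlighter-rouge">Series.corr</code> computes the same Pearson coefficient directly (toy data below, constructed to have no linear relationship):</p>

```python
import pandas as pd

length = pd.Series([10, 20, 30, 40])
upvotes = pd.Series([3, 1, 4, 2])     # deliberately unrelated to length
print(length.corr(upvotes))           # Pearson correlation, here 0.0
```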
<p>Let’s visualize a scatterplot to see the shape and possible clustering.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset_df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s">'headline_length'</span><span class="p">,</span><span class="s">'upvotes'</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Headlines_popularity_files/Headlines_popularity_45_0.png" alt="png" /></p>
<p>It is very evident that articles with <em>headlines of 80 characters and more are unpopular</em> (with a few exceptions).</p>
<p>Actually, the boundary is so salient that it hints at some underlying reason - quite possibly a hard limit on title length.</p>
<h3 id="popularity-over-time">Popularity over time</h3>
<p>Let’s visualize how the overall popularity of articles changes over time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span>
<span class="n">dataset_df</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">dataset_df</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">popularity_per_quaters</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s">'QS'</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'submission_time'</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()[</span><span class="s">'upvotes'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">popularity_per_quaters</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Articles popularity over time'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Headlines_popularity_files/Headlines_popularity_50_0.png" alt="png" /></p>
<p>More and more articles got upvoted over time - a good illustration of HackerNews’ growing popularity.</p>
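<p>The quarterly aggregation pattern in miniature (toy data; <code class="highlighter-rouge">'QS'</code> buckets rows into quarter-start periods):</p>

```python
import pandas as pd

df = pd.DataFrame({
    'submission_time': pd.to_datetime(['2015-01-15', '2015-02-10', '2015-04-01']),
    'upvotes': [1, 2, 3],
})
# Sum upvotes per calendar quarter, keyed on the timestamp column.
per_quarter = df.resample('QS', on='submission_time')['upvotes'].sum()
print(per_quarter)
```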
<h3 id="buzzphrases">Buzzphrases</h3>
<p>It would be interesting to see how <code class="highlighter-rouge">buzzphrase</code> trends evolve over time. Which phrases appeared most frequently in headlines over the years? This gives an insight into which topics were most popular in the Hacker community, then and until recently.</p>
<p>First, contemplate the design of how to present it - an exciting step, with so many variations: let imagination flow. Pictures pop up and acquire shape as you think about the graph’s goal: “What would it tell a reader?” A good idea is to sketch it with a pencil on paper.</p>
<p><strong>Details:</strong></p>
<p>I have the design in my mind and a sketch on paper. The idea is to draw top barplots for each period (quarters or half-years) through the whole span (2006-2015). On top of that, buzzphrase trends are presented with lines. As you can imagine, it is not an easy task to make this visually appealing: too many lines get intertwined. Thus animation is applied: initially patches and lines are gray and not much distinguished, and the animation lights up a group of patches and the corresponding lines upon mouse hover! Sounds cool - that should work great.</p>
<p>From a technical perspective, the following steps are to be implemented:</p>
<ul>
<li>Add a dataframe column with the corresponding time period</li>
<li>Add a dataframe column with the month period (lines will be drawn per month to keep them smooth)</li>
<li>Get the top Buzzphrases and their aggregated frequency for each period; the aggregated frequency will be presented per month</li>
<li>Get the top Buzzphrases frequencies per month</li>
<li>Plot bars: one bar per period, showing the mean of the top Buzzphrases</li>
<li>Plot lines: one for each top Buzzphrase presented</li>
<li>Apply animation to highlight the group of patches for a certain period together with the related buzzphrase trend lines</li>
<li>Add annotations to display the actual phrases</li>
<li>Add annotations for interesting points</li>
<li>Focus on the color palette and make it visually appealing:
<ul>
<li>initially bars and lines are gray and thin</li>
<li>upon hover (or click?) a certain period is highlighted, with its buzzphrase lines thickened and highlighted across all periods!</li>
</ul>
</li>
</ul>
<p>Each step has its technical caveats. We have a design and a plan, and can start making it real.</p>
<h4 id="add-periods">Add periods</h4>
<p>Create the indices <code class="highlighter-rouge">quarter_periods</code> and <code class="highlighter-rouge">month_periods</code>, for further sampling per quarter/month.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Making a copy for manipulations</span>
<span class="n">df_buzz</span> <span class="o">=</span> <span class="n">dataset_df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c"># resetting an index, to save orig_index numbers, just in case it might be useful</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'index'</span><span class="p">:</span><span class="s">'orig_index'</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Create quarter periods.</span>
<span class="c"># To achieve it: create a copy of time column, then set it as index,</span>
<span class="c"># and then convert this index to quarters via to_period attribute.</span>
<span class="n">df_buzz</span><span class="p">[</span><span class="s">'submission_time_ind'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_buzz</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'submission_time_ind'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span> <span class="o">=</span> <span class="n">df_buzz</span><span class="o">.</span><span class="n">to_period</span><span class="p">(</span><span class="s">'Q'</span><span class="p">,</span> <span class="n">copy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># then we reset index, and rename the column to what it presents: quarter_periods</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'submission_time_ind'</span><span class="p">:</span><span class="s">'quarter_periods'</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Create month periods.</span>
<span class="c"># To achieve it: create a copy of time column, then set it as index,</span>
<span class="c"># and then convert this index to month via to_period attribute.</span>
<span class="n">df_buzz</span><span class="p">[</span><span class="s">'submission_time_ind'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_buzz</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'submission_time_ind'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span> <span class="o">=</span> <span class="n">df_buzz</span><span class="o">.</span><span class="n">to_period</span><span class="p">(</span><span class="s">'M'</span><span class="p">,</span> <span class="n">copy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># then we reset index, and rename the column to what it presents: month</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'submission_time_ind'</span><span class="p">:</span><span class="s">'month_periods'</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># Setting a multiindex of quarters and months</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'quarter_periods'</span><span class="p">,</span> <span class="s">'month_periods'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df_buzz</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<p>Looks good. Creating the multiindex involved a number of manipulations, and takes some computational power on a 1.5M-row dataset.</p>
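<p>As a side note, the same two period columns can be derived more directly with pandas’ <code class="highlighter-rouge">.dt.to_period</code> accessor, avoiding the set-index/reset-index round trips. A minimal sketch with toy data (the frame and values here are hypothetical stand-ins for <code class="highlighter-rouge">dataset_df</code>):</p>

```python
import pandas as pd

# Toy frame standing in for dataset_df (hypothetical values)
df = pd.DataFrame({
    'headline': ['a', 'b', 'c'],
    'submission_time': pd.to_datetime(['2010-02-01', '2010-05-15', '2011-01-03']),
})

# Derive both period columns directly from the datetime column
df['quarter_periods'] = df['submission_time'].dt.to_period('Q')
df['month_periods'] = df['submission_time'].dt.to_period('M')

# Same multiindex as before, without the index round trips
df = df.set_index(['quarter_periods', 'month_periods'])
print(df.index.tolist())
```

<p>Either way produces the same multiindex; the accessor version is just shorter.</p>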
<h4 id="get-period-lists">Get period lists</h4>
<p>The next step is to create datasets of top Buzzphrases per quarter and month periods, suitable for plotting.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># here a copy of dataset is created to perform manipulations</span>
<span class="n">df_buzz_reind</span> <span class="o">=</span> <span class="n">df_buzz</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c">#df_buzz_reind.reset_index(inplace=True)</span>
<span class="n">df_buzz_reind</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'submission_time'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_buzz_reind</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<p>Pick the period to explore: between 2010 and 2015.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_buzz_reind</span> <span class="o">=</span> <span class="n">df_buzz_reind</span><span class="p">[(</span><span class="n">df_buzz_reind</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span><span class="o">></span><span class="s">'2010-01-01'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df_buzz_reind</span><span class="p">[</span><span class="s">'submission_time'</span><span class="p">]</span><span class="o"><</span><span class="s">'2015-01-01'</span><span class="p">)]</span>
</code></pre></div></div>
<h4 id="getting-nice-lists-of-quarter-periods-and-months-periods">Getting nice lists of quarter periods and month periods</h4>
<h5 id="quarter-periods">Quarter periods</h5>
<p>First, get the <strong>quarter periods</strong> lists.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarters</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">df_buzz_reind</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">get_level_values</span><span class="p">(</span><span class="s">'quarter_periods'</span><span class="p">)))</span>
<span class="n">quarters_series</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">quarters</span><span class="p">)</span>
<span class="n">quarters_series_sorted</span> <span class="o">=</span> <span class="n">quarters_series</span><span class="o">.</span><span class="n">sort_values</span><span class="p">()</span>
</code></pre></div></div>
<p>Get the quarter period list and the quarter period names list, both sorted in ascending order.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarter_periods_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">quarters_series_sorted</span><span class="p">)</span>
<span class="n">quarter_names_list</span> <span class="o">=</span> <span class="n">quarters_series_sorted</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarter_periods_list</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Period('2010Q1', 'Q-DEC'),
Period('2010Q2', 'Q-DEC'),
Period('2010Q3', 'Q-DEC'),
Period('2010Q4', 'Q-DEC'),
Period('2011Q1', 'Q-DEC')]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarter_names_list</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['2010Q1',
'2010Q2',
'2010Q3',
'2010Q4',
'2011Q1',
'2011Q2',
'2011Q3',
'2011Q4',
'2012Q1',
'2012Q2',
'2012Q3',
'2012Q4',
'2013Q1',
'2013Q2',
'2013Q3',
'2013Q4',
'2014Q1',
'2014Q2',
'2014Q3',
'2014Q4']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarters_list</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">quarter_periods_list</span><span class="p">,</span><span class="n">quarter_names_list</span><span class="p">))</span>
<span class="n">quarters_list</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(Period('2010Q1', 'Q-DEC'), '2010Q1'),
(Period('2010Q2', 'Q-DEC'), '2010Q2'),
(Period('2010Q3', 'Q-DEC'), '2010Q3'),
(Period('2010Q4', 'Q-DEC'), '2010Q4'),
(Period('2011Q1', 'Q-DEC'), '2011Q1')]
</code></pre></div></div>
<p>Good, we now have our lists!</p>
<ul>
<li><code class="highlighter-rouge">quarter_periods_list</code></li>
<li><code class="highlighter-rouge">quarter_names_list</code></li>
</ul>
<p>And a version zipped into a list of (period, period_name) tuples: <code class="highlighter-rouge">quarters_list</code></p>
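<p>As a cross-check, the same sorted lists can be generated without touching the index at all, using <code class="highlighter-rouge">pd.period_range</code>; a minimal sketch:</p>

```python
import pandas as pd

# Generate the sorted quarter lists directly, without touching the dataframe index
quarter_periods_list = list(pd.period_range('2010Q1', '2014Q4', freq='Q'))
quarter_names_list = [str(p) for p in quarter_periods_list]
quarters_list = list(zip(quarter_periods_list, quarter_names_list))

print(len(quarters_list))  # 20 quarters in 2010-2014
```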
<h4 id="get-most-frequent-buzzphrases-for-each-period">Get most frequent buzzphrases for each period</h4>
<p>We are building a dictionary of { period_name : top 5 frequent phrases with frequencies }. It involves sampling the dataframe for each period and extracting phrases with their frequencies.</p>
<p>Earlier in this project we already extracted the topmost buzzphrases across the whole 10-year span! So let’s gather those steps into a single function that can be applied to each period to build the dict.</p>
<p>Define a function that extracts and returns the top frequent phrases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_top_freq_phrases</span><span class="p">(</span><span class="n">processed_headlines_series</span><span class="p">):</span>
<span class="c"># Getting noun phrase list</span>
<span class="n">noun_phrase_list</span> <span class="o">=</span> <span class="p">[</span> <span class="nb">list</span><span class="p">(</span><span class="n">TextBlob</span><span class="p">(</span><span class="n">processed_headline</span><span class="p">)</span><span class="o">.</span><span class="n">noun_phrases</span><span class="p">)</span> <span class="k">for</span> <span class="n">processed_headline</span> <span class="ow">in</span> <span class="n">processed_headlines_series</span> <span class="p">]</span>
<span class="c"># Flattening the list</span>
<span class="n">noun_phrase_flat</span> <span class="o">=</span> <span class="p">[</span> <span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">noun_phrase_list</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span> <span class="p">]</span>
<span class="c"># Converting a list to a collection</span>
<span class="n">counter_collection</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">noun_phrase_flat</span><span class="p">)</span>
<span class="c"># Remove 'show hn' phrase as unnecessary for the stats</span>
<span class="k">del</span> <span class="n">counter_collection</span><span class="p">[</span><span class="s">'show hn'</span><span class="p">]</span>
<span class="c"># Finally obtain tuples of the top 5 most common phrases</span>
<span class="n">top_phrases</span> <span class="o">=</span> <span class="n">counter_collection</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="k">return</span> <span class="n">top_phrases</span><span class="p">,</span> <span class="n">counter_collection</span> <span class="c"># return the top frequent phrases and the full counter</span>
</code></pre></div></div>
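<p>The counting part of the function is plain <code class="highlighter-rouge">collections.Counter</code>; a toy run (with pre-extracted phrases standing in for the TextBlob output) illustrates what it returns:</p>

```python
from collections import Counter

# Pre-extracted noun phrases standing in for the TextBlob output (toy data)
noun_phrase_flat = ['big data', 'big data', 'show hn', 'elon musk', 'big data', 'elon musk']

counter_collection = Counter(noun_phrase_flat)
del counter_collection['show hn']  # drop the boilerplate phrase, as above
top_phrases = counter_collection.most_common(5)
print(top_phrases)  # [('big data', 3), ('elon musk', 2)]
```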
<p>We can now go through the different periods in the dataframe and build a dictionary of the top frequent phrases per period.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">quarters_top_phrases_dict</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">period</span><span class="p">,</span> <span class="n">period_name</span> <span class="ow">in</span> <span class="n">quarters_list</span><span class="p">:</span>
<span class="n">df_period</span> <span class="o">=</span> <span class="n">df_buzz_reind</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">period</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">period_name</span><span class="p">)</span>
<span class="n">top_phrases</span><span class="p">,</span> <span class="n">collection_for_period</span> <span class="o">=</span> <span class="n">get_top_freq_phrases</span><span class="p">(</span><span class="n">df_period</span><span class="p">[</span><span class="s">'processed_headline'</span><span class="p">])</span>
<span class="c">#print(top_phrases)</span>
<span class="c">#print(type(collection_for_period))</span>
<span class="n">quarters_top_phrases_dict</span><span class="p">[</span><span class="n">period_name</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">top_phrases</span><span class="p">,</span> <span class="n">collection_for_period</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2010Q1
2010Q2
2010Q3
2010Q4
2011Q1
2011Q2
2011Q3
2011Q4
2012Q1
2012Q2
2012Q3
2012Q4
2013Q1
2013Q2
2013Q3
2013Q4
2014Q1
2014Q2
2014Q3
2014Q4
</code></pre></div></div>
<p>Perfect, a dictionary with the top phrases and frequencies for each quarter is gathered.</p>
<p>Now let’s get one list of all the top phrases encountered, which will allow us to build a line graph for each.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_phrases_uniq</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="n">phrase_freq</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">quarters_top_phrases_dict</span><span class="o">.</span><span class="n">values</span><span class="p">()</span> <span class="k">for</span> <span class="n">phrase_freq</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">[</span><span class="mi">0</span><span class="p">]]))</span>
<span class="n">top_phrases_uniq</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['steve job',
'world news',
'triunfo del amor capitulo',
'google glass',
'angry bird',
'combinator bookmarklet',
'open source',
'ipad mini',
'google instant',
'google buzz',
'mobile app',
'new ipad',
'flappy bird',
'social network',
'artificial intelligence',
'social medium',
'google nexus',
'website need',
'real estate',
'big data',
'silicon valley',
'window phone',
'aaron swartz',
'net neutrality',
'reina del sur capitulo',
'elon musk']
</code></pre></div></div>
<p>Perfect!</p>
<p>Now, let’s obtain a list of frequencies for each phrase! Each list will be used as series data to plot a line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrase_series_dict</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">phrase</span> <span class="ow">in</span> <span class="n">top_phrases_uniq</span><span class="p">:</span>
<span class="n">phrase_series_dict</span><span class="p">[</span><span class="n">phrase</span><span class="p">]</span><span class="o">=</span><span class="p">[</span> <span class="n">quarters_top_phrases_dict</span><span class="p">[</span><span class="n">period_name</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="n">phrase</span><span class="p">]</span> <span class="k">if</span> <span class="n">phrase</span> <span class="ow">in</span> <span class="n">quarters_top_phrases_dict</span><span class="p">[</span><span class="n">period_name</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="k">else</span> <span class="s">'null'</span> <span class="k">for</span> <span class="n">period_name</span> <span class="ow">in</span> <span class="n">quarter_names_list</span><span class="p">]</span>
</code></pre></div></div>
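<p>The <code class="highlighter-rouge">'null'</code> string placeholders suit the JavaScript chart, but if you also want to plot these series with pandas, a hypothetical conversion to a DataFrame with real NaNs could look like this (toy values stand in for the real dict):</p>

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the structures built above (hypothetical values)
quarter_names_list = ['2010Q1', '2010Q2', '2010Q3']
phrase_series_dict = {'big data': [5, 'null', 9], 'elon musk': ['null', 3, 8]}

# Replace the 'null' placeholder with NaN so pandas treats gaps as missing data
freq_df = pd.DataFrame(phrase_series_dict, index=quarter_names_list)
freq_df = freq_df.replace('null', np.nan).astype(float)
print(freq_df)
```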
<p>Now, for convenience, let’s print the value series for each phrase!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">phrase</span> <span class="ow">in</span> <span class="n">top_phrases_uniq</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">phrase</span><span class="p">,</span> <span class="n">phrase_series_dict</span><span class="p">[</span><span class="n">phrase</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>steve job [18, 78, 32, 39, 70, 37, 93, 396, 56, 49, 39, 49, 32, 29, 39, 30, 38, 20, 8, 17]
world news ['null', 'null', 'null', 'null', 1, 1, 5, 81, 47, 7, 'null', 'null', 'null', 1, 'null', 'null', 1, 'null', 'null', 'null']
triunfo del amor capitulo ['null', 'null', 'null', 1, 9, 73, 'null', 'null', 'null', 1, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
google glass ['null', 'null', 'null', 'null', 'null', 'null', 'null', 1, 5, 17, 14, 7, 95, 125, 57, 55, 43, 64, 17, 16]
angry bird ['null', 1, 4, 14, 45, 25, 15, 18, 20, 6, 11, 1, 3, 5, 2, 4, 5, 3, 'null', 2]
combinator bookmarklet [4, 5, 2, 9, 23, 40, 61, 37, 33, 42, 72, 66, 86, 62, 65, 26, 4, 5, 3, 1]
open source [68, 51, 58, 64, 55, 58, 61, 71, 94, 89, 109, 81, 90, 103, 94, 112, 106, 104, 82, 110]
ipad mini ['null', 'null', 'null', 'null', 'null', 'null', 'null', 2, 4, 4, 13, 53, 7, 4, 1, 2, 'null', 'null', 'null', 3]
google instant ['null', 'null', 32, 13, 4, 'null', 'null', 1, 2, 1, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
google buzz [51, 2, 'null', 'null', 'null', 'null', 'null', 3, 'null', 'null', 'null', 'null', 'null', 1, 1, 'null', 'null', 'null', 'null', 'null']
mobile app [3, 3, 4, 8, 8, 6, 15, 12, 14, 22, 23, 26, 35, 26, 24, 36, 33, 12, 39, 24]
new ipad ['null', 1, 1, 'null', 5, 2, 1, 2, 115, 20, 3, 2, 1, 1, 1, 1, 'null', 'null', 1, 'null']
flappy bird ['null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 42, 11, 5, 2]
social network [35, 39, 49, 77, 58, 54, 76, 58, 63, 79, 68, 45, 63, 38, 53, 42, 41, 30, 31, 38]
artificial intelligence [5, 8, 4, 10, 10, 6, 7, 8, 7, 11, 11, 17, 13, 8, 9, 22, 16, 14, 23, 47]
social medium [45, 39, 50, 56, 66, 87, 129, 97, 129, 115, 72, 90, 97, 101, 84, 59, 39, 41, 60, 40]
google nexus [32, 5, 'null', 8, 'null', 5, 3, 2, 'null', 1, 7, 13, 8, 2, 1, 6, 2, 'null', 1, 1]
website need ['null', 28, 1, 'null', 2, 'null', 1, 2, 2, 1, 'null', 2, 'null', 1, 2, 'null', 'null', 'null', 'null', 'null']
real estate [10, 8, 12, 6, 7, 12, 38, 65, 17, 15, 8, 7, 3, 1, 7, 5, 4, 7, 8, 6]
big data [5, 2, 9, 12, 19, 17, 26, 30, 66, 73, 70, 89, 116, 124, 115, 85, 77, 64, 69, 64]
silicon valley [10, 17, 11, 27, 34, 40, 41, 41, 42, 59, 45, 35, 42, 53, 45, 53, 77, 48, 59, 31]
window phone [6, 1, 18, 35, 30, 14, 16, 22, 25, 36, 15, 18, 20, 10, 11, 9, 14, 7, 4, 3]
aaron swartz ['null', 'null', 'null', 1, 'null', 1, 3, 1, 1, 1, 2, 'null', 122, 7, 7, 1, 8, 2, 2, 'null']
net neutrality [6, 13, 14, 12, 4, 7, 1, 4, 2, 5, 2, 'null', 1, 1, 7, 5, 33, 49, 21, 32]
reina del sur capitulo ['null', 'null', 'null', 'null', 3, 66, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
elon musk ['null', 3, 8, 'null', 2, 2, 2, 6, 2, 6, 10, 16, 24, 22, 36, 22, 8, 32, 16, 42]
</code></pre></div></div>
<p>Great! We now have data that shows the top phrases and their frequencies per quarter period!</p>
<h3 id="visualize-the-graph">Visualize the graph</h3>
<p><a id="hn-graph"></a></p>
<p>Since we’ve gathered the frequency data, it’s time to plot the graph.</p>
<p>That is quite a challenge: how to represent this serial data nicely, keeping in mind the following aims:</p>
<ul>
<li>Display the topmost headlines</li>
<li>Display the tendency of each headline throughout the 5 years</li>
</ul>
<p>Well, how about nice curved lines, each highlighted when hovered over? Adding some icons and annotations would be great as well.</p>
<p>After spending hours on research and trial, I found a great solution: <a href="https://www.highcharts.com/demo/spline-symbols"><strong>Highcharts.com!</strong></a></p>
<p>Highcharts are absolutely great, smooth and stylish, with so many capabilities! I found it much better than plotly or bokeh.</p>
<p>The only thing is that it is JavaScript-based. I copy-pasted the data series by hand and then crafted the chart in jsfiddle.net. You can <a href="http://jsfiddle.net/s45ng42a/2/">check the working code here</a>.</p>
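<p>Instead of copy-pasting by hand, the series could be exported to the JSON shape Highcharts expects with <code class="highlighter-rouge">json.dumps</code>, since Python <code class="highlighter-rouge">None</code> serializes to a JavaScript <code class="highlighter-rouge">null</code>. A sketch with toy values standing in for <code class="highlighter-rouge">phrase_series_dict</code>:</p>

```python
import json

# Toy stand-in for phrase_series_dict; real values come from the steps above
phrase_series_dict = {'big data': [5, 'null', 9], 'elon musk': ['null', 3, 8]}

# Highcharts takes a list of {name, data} objects; use None for gaps so that
# json.dumps emits a JavaScript null
series = [
    {'name': phrase, 'data': [None if v == 'null' else v for v in values]}
    for phrase, values in phrase_series_dict.items()
]
print(json.dumps(series))
```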
<hr />
<div id="container" style="min-width: 310px; height: 600px; margin: 0 auto"></div>
<hr />
<p>Steve Jobs’ heritage is by far the most popular. Aaron Swartz is a very interesting figure I mined from this project and got acquainted with: a very sad story of a genius young man who committed suicide in 2013 Q1 under the heavy burdens placed on him. Google Glass and Big Data peaked in popularity around mid-2013, then declined. Remember those Flappy Birds? That was fun. We can also see the popularity of Elon Musk and Artificial Intelligence rising together, seemingly unstoppably…</p>
<p>Ok, it was interesting to peek into those patterns, and moreover to explore the instruments for such endeavors.</p>
<p>All the best :)</p>This project is an exploration of headline frequency and popularity. The dataset is a collection of headlines from the HackerNews portal gathered for the period 2006-2015. Note: to see the final result, scroll to the end. Let’s dive into an exploration of the so-called metaparameters and find out: How does popularity correlate with headline length? What about popularity across quarter periods? How does overall popularity change over time on the website? What buzzphrases appeared in the headlines, and how did they change over time? Intuitively we can tell “why these particular articles were popular” - that most probably depends on the content, relevance and the author. Yet, let’s look for some unobvious correlations and then visualize the most popular headlines!ML framework with Kaggle Titanic competition2017-11-17T00:00:00+00:002017-11-17T00:00:00+00:00/tutorial/2017/11/17/ML_kaggle_titanic<p>Such a renowned <a href="https://www.kaggle.com/c/titanic">Kaggle competition</a>. Everyone into Machine Learning has tried to predict who is more likely to survive: a family man, a gentleman with an expensive ticket, or a child? Or maybe someone holding a <em>Royalty</em> title? Yes, there are quite a number of features for a machine to learn!</p>
<p>Let’s dive into this tutorial, which is more of a presentation of my top-notch Machine Learning framework so far! Yes! It’s the framework I’ve built with the courtesy of <a href="dataquest.io">dataquest.io</a>, spending quite some time to get my best competition results.</p>
<p><em>Btw, <a href="dataquest.io">dataquest.io</a> is a great platform and community for learning great things!</em></p>
<p><img src="/images/titanic_sinking.jpg" alt="title" />
<!--more--></p>
<p>A typical ML workflow consists of these steps:</p>
<ul>
<li><strong>Data exploration</strong>, to find patterns in the data</li>
<li><strong>Feature engineering and preprocessing</strong>, to create new features from those patterns or through pure experimentation</li>
<li><strong>Feature selection</strong>, to select the best subset of our current set of features</li>
<li><strong>Model selection/tuning</strong>, training a number of models with different hyperparameters to find the best performer.</li>
</ul>
<p>At the end of any cycle we can always submit predictions for the holdout dataset to see the results.</p>
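<p>For reference, the submission itself is just a two-column CSV keyed on <code class="highlighter-rouge">PassengerId</code>; a minimal sketch with hypothetical ids and predictions:</p>

```python
import pandas as pd

# Hypothetical holdout ids and model predictions, for illustration only
holdout_ids = [892, 893, 894]
predictions = [0, 1, 0]

# Kaggle expects exactly two columns: PassengerId and Survived
submission = pd.DataFrame({'PassengerId': holdout_ids, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)
print(submission.shape)  # (3, 2)
```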
<h2 id="data-exploration">Data Exploration</h2>
<p>The data is available on the Kaggle <a href="https://www.kaggle.com/c/titanic">Titanic competition page</a>.</p>
<p>A rule of thumb is to get acquainted with the domain. Reading the wiki page about the Titanic is not only fascinating, but can also benefit the competition directly, for example by giving the insight that infants were more likely to survive.</p>
<h3 id="imports-and-dataset-explorations">Imports and dataset explorations</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">train</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/train.csv'</span><span class="p">)</span>
<span class="n">holdout</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/test.csv'</span><span class="p">)</span>
<span class="n">train</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<p>The dataset looks simple enough upon inspection, and does not have many columns. The Kaggle competition page explains all the relevant columns.</p>
<p>Let’s see the survival rate by Sex.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sex_pivot</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">"Sex"</span><span class="p">,</span><span class="n">values</span><span class="o">=</span><span class="s">"Survived"</span><span class="p">)</span>
<span class="n">sex_pivot</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_8_0.png" alt="png" /></p>
<p>A very vivid barplot.</p>
<p>Now the survival rate by ticket class: 1, 2, 3, where 1st class is the most expensive.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pclass_pivot</span> <span class="o">=</span> <span class="n">train</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">"Pclass"</span><span class="p">,</span><span class="n">values</span><span class="o">=</span><span class="s">"Survived"</span><span class="p">)</span>
<span class="n">pclass_pivot</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_11_0.png" alt="png" /></p>
<p>The class disparity is clearly visible: the more expensive the class, the higher the survival rate.</p>
<p>What about age groups?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train</span><span class="p">[</span><span class="s">"Age"</span><span class="p">]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">survived</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="n">train</span><span class="p">[</span><span class="s">"Survived"</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">survived</span><span class="p">[</span><span class="s">"Age"</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_14_0.png" alt="png" /></p>
<p>We can see that infants were among the first to be saved, which makes sense if you think about it: small babies take the least space, and people tend to take all of them aboard. Other groups are less obvious. Even though the 20–40 range appears more frequently than 10-year-olds on the graph, perhaps there were simply very few 10-year-olds on the Titanic at all?</p>
<p>We should address these questions. <strong>Good feature engineering</strong> is the next step we are to perform.</p>
<h2 id="feature-engineering">Feature engineering</h2>
<p>This step is considered one of the most crucial: it is what can reward a scientist with cutting-edge prediction accuracy. This is where some good thinking is applied.</p>
<h3 id="clean-datasets">Clean datasets</h3>
<p>The very first step is easy but important: decide how to fill the null cells and make sure the data is consistent overall.</p>
<p>It turns out there are missing values in the Fare and Embarked columns. Let’s define a function to handle them.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">process_missing</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""Handle various missing values from the data set
Usage
------
holdout = process_missing(holdout)
"""</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Fare"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Fare"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">train</span><span class="p">[</span><span class="s">"Fare"</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Embarked"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Embarked"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"S"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
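<p>A quick sanity check of that logic on a toy frame (the values below are made up for illustration; the real notebook applies the function to the Kaggle train/holdout frames):</p>

```python
import pandas as pd

# Toy stand-in with one missing Fare and one missing Embarked
toy = pd.DataFrame({"Fare": [7.25, None, 71.28], "Embarked": ["S", None, "C"]})

# Same logic as process_missing above, applied to the toy frame
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())
toy["Embarked"] = toy["Embarked"].fillna("S")

print(toy.isnull().sum().sum())
# → 0
```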
<h3 id="create-new-features">Create new features</h3>
<p>A powerful technique in feature engineering is data binning.</p>
<p>A great example is binning age groups - i.e. assigning each person to a certain group based on their age. <code class="highlighter-rouge">pd.cut()</code> is a tool to use for that task.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">process_age</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""Process the Age column into pre-defined 'bins'
Usage
------
train = process_age(train)
"""</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Age"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Age"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="o">-</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">cut_points</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">,</span><span class="mi">18</span><span class="p">,</span><span class="mi">35</span><span class="p">,</span><span class="mi">60</span><span class="p">,</span><span class="mi">100</span><span class="p">]</span>
<span class="n">label_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Missing"</span><span class="p">,</span><span class="s">"Infant"</span><span class="p">,</span><span class="s">"Child"</span><span class="p">,</span><span class="s">"Teenager"</span><span class="p">,</span><span class="s">"Young Adult"</span><span class="p">,</span><span class="s">"Adult"</span><span class="p">,</span><span class="s">"Senior"</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Age_categories"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"Age"</span><span class="p">],</span><span class="n">cut_points</span><span class="p">,</span><span class="n">labels</span><span class="o">=</span><span class="n">label_names</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
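<p>To make the binning concrete, here is a small self-contained sketch (with made-up ages, not the actual dataset) of what <code class="highlighter-rouge">pd.cut()</code> produces with these cut points. Note that each bin is a right-closed interval, e.g. an age of exactly 5 lands in “Infant”:</p>

```python
import pandas as pd

ages = pd.Series([-0.5, 2, 10, 15, 25, 45, 70])  # -0.5 stands in for a missing age
cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Each age falls into the (left, right] interval it belongs to
categories = pd.cut(ages, cut_points, labels=label_names)
print(list(categories))
# → ['Missing', 'Infant', 'Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
```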
<p>At this step we can draw some deeper insights on the survival percentage for the age groups defined above. We can again use the powerful <code class="highlighter-rouge">df.pivot_table(index='colname1', values='colname2')</code>, which calculates the mean of Survived for each age group, giving us the survival percentage vividly. Better still, we can visualize it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Binning by age group</span>
<span class="n">train_copy_df</span> <span class="o">=</span> <span class="n">process_age</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
<span class="c"># Visualize</span>
<span class="n">age_pivot</span> <span class="o">=</span> <span class="n">train_copy_df</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'Age_categories'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">)</span>
<span class="n">age_pivot</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_23_0.png" alt="png" /></p>
<p>We can also bin Fare costs to groups.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">process_fare</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""Process the Fare column into pre-defined 'bins'
Usage
------
train = process_fare(train)
"""</span>
<span class="n">cut_points</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">12</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">1000</span><span class="p">]</span>
<span class="n">label_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">"0-12"</span><span class="p">,</span><span class="s">"12-50"</span><span class="p">,</span><span class="s">"50-100"</span><span class="p">,</span><span class="s">"100+"</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Fare_categories"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"Fare"</span><span class="p">],</span><span class="n">cut_points</span><span class="p">,</span><span class="n">labels</span><span class="o">=</span><span class="n">label_names</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
<p>Cabins. Looking at the cabin values, we can see that each starts with a specific letter, which might be important. Let’s create a feature with the cabin type (its starting letter), replacing NaN with ‘Unknown’. Here is the function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">process_cabin</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""Process the Cabin column into pre-defined 'bins'
Usage
------
train = process_cabin(train)
"""</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Cabin_type"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Cabin"</span><span class="p">]</span><span class="o">.</span><span class="nb">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Cabin_type"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Cabin_type"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"Unknown"</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'Cabin'</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
<p>Names contain titles, which can be very important. Let’s extract the titles and assign to certain groups.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">process_titles</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="s">"""Extract and categorize the title from the name column
Usage
------
train = process_titles(train)
"""</span>
<span class="n">titles</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"Mr"</span> <span class="p">:</span> <span class="s">"Mr"</span><span class="p">,</span>
<span class="s">"Mme"</span><span class="p">:</span> <span class="s">"Mrs"</span><span class="p">,</span>
<span class="s">"Ms"</span><span class="p">:</span> <span class="s">"Mrs"</span><span class="p">,</span>
<span class="s">"Mrs"</span> <span class="p">:</span> <span class="s">"Mrs"</span><span class="p">,</span>
<span class="s">"Master"</span> <span class="p">:</span> <span class="s">"Master"</span><span class="p">,</span>
<span class="s">"Mlle"</span><span class="p">:</span> <span class="s">"Miss"</span><span class="p">,</span>
<span class="s">"Miss"</span> <span class="p">:</span> <span class="s">"Miss"</span><span class="p">,</span>
<span class="s">"Capt"</span><span class="p">:</span> <span class="s">"Officer"</span><span class="p">,</span>
<span class="s">"Col"</span><span class="p">:</span> <span class="s">"Officer"</span><span class="p">,</span>
<span class="s">"Major"</span><span class="p">:</span> <span class="s">"Officer"</span><span class="p">,</span>
<span class="s">"Dr"</span><span class="p">:</span> <span class="s">"Officer"</span><span class="p">,</span>
<span class="s">"Rev"</span><span class="p">:</span> <span class="s">"Officer"</span><span class="p">,</span>
<span class="s">"Jonkheer"</span><span class="p">:</span> <span class="s">"Royalty"</span><span class="p">,</span>
<span class="s">"Don"</span><span class="p">:</span> <span class="s">"Royalty"</span><span class="p">,</span>
<span class="s">"Sir"</span> <span class="p">:</span> <span class="s">"Royalty"</span><span class="p">,</span>
<span class="s">"Countess"</span><span class="p">:</span> <span class="s">"Royalty"</span><span class="p">,</span>
<span class="s">"Dona"</span><span class="p">:</span> <span class="s">"Royalty"</span><span class="p">,</span>
<span class="s">"Lady"</span> <span class="p">:</span> <span class="s">"Royalty"</span>
<span class="p">}</span>
<span class="n">extracted_titles</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Name"</span><span class="p">]</span><span class="o">.</span><span class="nb">str</span><span class="o">.</span><span class="n">extract</span><span class="p">(</span><span class="s">' ([A-Za-z]+)</span><span class="err">\</span><span class="s">.'</span><span class="p">,</span><span class="n">expand</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">"Title"</span><span class="p">]</span> <span class="o">=</span> <span class="n">extracted_titles</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">titles</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
<p>There is a technique called <code class="highlighter-rouge">create dummies</code>, also known as one-hot encoding. The functions above create categories from values; this technique then creates a separate feature for each category, assigned 1 or 0 depending on whether a person falls into that category. It is a very effective technique for machine learning.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_dummies</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">column_name</span><span class="p">):</span>
<span class="s">"""Create Dummy Columns (One Hot Encoding) from a single Column
Usage
------
train = create_dummies(train,"Age")
"""</span>
<span class="n">dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">column_name</span><span class="p">],</span><span class="n">prefix</span><span class="o">=</span><span class="n">column_name</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">,</span><span class="n">dummies</span><span class="p">],</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
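<p>A quick illustration of what the function adds, on a toy column rather than the real dataset:</p>

```python
import pandas as pd

def create_dummies(df, column_name):
    # One dummy column per category, named <column>_<category>
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    return pd.concat([df, dummies], axis=1)

toy = pd.DataFrame({"Sex": ["male", "female", "female"]})
toy = create_dummies(toy, "Sex")
print(toy.columns.tolist())
# → ['Sex', 'Sex_female', 'Sex_male']
```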
<p>Now let’s define a function that applies a set of feature-extractions functions to a dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">feature_preprocessing</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">process_missing</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">process_age</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">process_fare</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">process_cabin</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">process_titles</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">column_name</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Age_categories'</span><span class="p">,</span> <span class="s">'Fare_categories'</span><span class="p">,</span> <span class="s">'Title'</span><span class="p">,</span> <span class="s">'Cabin_type'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">create_dummies</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">column_name</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
<p>Preprocess <code class="highlighter-rouge">train</code> and <code class="highlighter-rouge">holdout</code> datasets!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">feature_preprocessing</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">feature_preprocessing</span><span class="p">(</span><span class="n">holdout</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="more-feature-creation">More Feature Creation</h3>
<p>Another cycle of exploration and feature creation.</p>
<p>This time we pay attention to Parch and SibSp columns, which are all about family size of a passenger onboard.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">explore_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">"SibSp"</span><span class="p">,</span><span class="s">"Parch"</span><span class="p">,</span><span class="s">"Survived"</span><span class="p">]</span>
<span class="n">explore</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="n">explore_cols</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">explore</span><span class="p">[</span><span class="s">'familysize'</span><span class="p">]</span> <span class="o">=</span> <span class="n">explore</span><span class="p">[[</span><span class="s">"SibSp"</span><span class="p">,</span><span class="s">"Parch"</span><span class="p">]]</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">pivot</span> <span class="o">=</span> <span class="n">explore</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'familysize'</span><span class="p">,</span><span class="n">values</span><span class="o">=</span><span class="s">"Survived"</span><span class="p">)</span>
<span class="n">pivot</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">yticks</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">1</span><span class="p">));</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">();</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_37_0.png" alt="png" /></p>
<p>We can see that passengers with a moderate family size aboard were quite likely to survive, while those who had no family aboard had a survival chance of only about 30%.</p>
<p>Here we create a feature called <code class="highlighter-rouge">isalone</code> as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">isalone</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">df</span><span class="p">[</span><span class="s">'isalone'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">((</span><span class="n">df</span><span class="p">[</span><span class="s">'SibSp'</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'Parch'</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">isalone</span><span class="p">(</span><span class="n">train_preprocessed</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">isalone</span><span class="p">(</span><span class="n">holdout_preprocessed</span><span class="p">)</span>
</code></pre></div></div>
<p>Great.</p>
<p>We can also create separate features (<code class="highlighter-rouge">dummy</code> columns with 1 and 0) for passenger’s Pclass.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">create_dummies</span><span class="p">(</span><span class="n">train_preprocessed</span><span class="p">,</span><span class="s">'Pclass'</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">create_dummies</span><span class="p">(</span><span class="n">holdout_preprocessed</span><span class="p">,</span><span class="s">'Pclass'</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="drop-unnecessary-columns">Drop unnecessary columns</h4>
<p>We should get rid of the columns we used to derive new features from: Age, Pclass, Fare, SibSp, and Parch.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cols_to_drop</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Age'</span><span class="p">,</span> <span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Fare'</span><span class="p">,</span> <span class="s">'SibSp'</span><span class="p">,</span> <span class="s">'Parch'</span><span class="p">]</span>
<span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">train_preprocessed</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">cols_to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">holdout_preprocessed</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">cols_to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="postprocessing">Postprocessing</h4>
<ul>
<li>Ensure no columns with NA values remain in the dataset.</li>
<li>Select only columns with numerical data, an important preprocessing step for ML.</li>
<li>Apply min-max normalization if required.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">minmax_scale</span>
<span class="k">def</span> <span class="nf">only_numeric</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="n">include</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">number</span><span class="p">])</span>
<span class="k">return</span> <span class="n">df</span>
<span class="c">#minmaxnormalization here if required. If there are columns left with varying values.</span>
<span class="c">#train_rescaled = pd.DataFrame(minmax_scale(train[cols_to_rescale]), columns=cols_rescaled)</span>
</code></pre></div></div>
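<p>If min-max rescaling turns out to be needed (i.e. numeric columns with very different ranges remain after the steps above), it can be sketched like this, with an illustrative ‘Fare’ column of made-up values:</p>

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Illustrative values; the real frame comes from the preprocessing above
toy = pd.DataFrame({"Fare": [7.25, 71.28, 512.33]})
toy["Fare"] = minmax_scale(toy["Fare"])

# Every value now lies in [0, 1], with the min mapped to 0 and the max to 1
print(toy["Fare"].tolist())
```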
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">only_numeric</span><span class="p">(</span><span class="n">train_preprocessed</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">only_numeric</span><span class="p">(</span><span class="n">holdout_preprocessed</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="feature-selection">Feature selection</h2>
<p>RFECV (recursive feature elimination with cross-validation) automatically selects the best columns for a model of choice.</p>
<p>Example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">RFECV</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>
<span class="k">def</span> <span class="nf">select_features</span><span class="p">(</span><span class="n">instantiated_model</span><span class="p">,</span> <span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span><span class="p">):</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">RFECV</span><span class="p">(</span><span class="n">instantiated_model</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">)</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span><span class="p">)</span>
<span class="n">optimized_columns</span> <span class="o">=</span> <span class="n">all_X</span><span class="o">.</span><span class="n">columns</span><span class="p">[</span><span class="n">selector</span><span class="o">.</span><span class="n">support_</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="k">print</span><span class="p">(</span><span class="n">optimized_columns</span><span class="p">)</span>
<span class="k">return</span> <span class="n">optimized_columns</span>
</code></pre></div></div>
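<p>A runnable sketch of calling such a selector. The synthetic data and the LogisticRegression model here are stand-ins for illustration, not the notebook’s actual frames:</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Toy stand-in for the preprocessed frame: 5 informative features plus noise
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=1)
all_X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# Same pattern as select_features above, with an instantiated model passed in
selector = RFECV(LogisticRegression(max_iter=1000), cv=10, scoring="roc_auc")
selector.fit(all_X, y)
optimized_columns = all_X.columns[selector.support_].values
print(optimized_columns)  # the columns RFECV decided to keep
```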
<p><strong>Note</strong>: in practice the RFECV feature selector has not proven very useful; unfortunately, in my experience so far it worsens performance.</p>
<h4 id="diminish-collinearity">Diminish collinearity</h4>
<p>Collinearity happens when the values in one column can be derived from another. For example, the male/female dummy columns obviously complement each other: a person is either male or female. Collinearity effects can also span multiple columns.</p>
<p>To diminish collinearity we can simply drop one of the columns in a pack.</p>
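<p>One way to avoid creating the collinear complement in the first place is the <code class="highlighter-rouge">drop_first=True</code> option of <code class="highlighter-rouge">pd.get_dummies</code> (a toy sketch; the <code class="highlighter-rouge">create_dummies</code> helper above does not use it):</p>

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "male"]})

# drop_first=True keeps one fewer dummy per category, so the perfectly
# collinear complement column is never created
dummies = pd.get_dummies(toy["Sex"], prefix="Sex", drop_first=True)
print(dummies.columns.tolist())
# → ['Sex_male']
```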
<p>Which columns are collinear? We would first visualize the correlations between columns and then select which ones to drop. Here is a nice function that highlights collinear columns with deeper color:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Drop certain columns that complement each other</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="k">def</span> <span class="nf">plot_correlation_heatmap</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">corr</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span>
<span class="n">sns</span><span class="o">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="nb">bool</span><span class="p">)</span>
<span class="n">mask</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">triu_indices_from</span><span class="p">(</span><span class="n">mask</span><span class="p">)]</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">14</span><span class="p">))</span>
<span class="n">cmap</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">diverging_palette</span><span class="p">(</span><span class="mi">220</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">as_cmap</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">,</span> <span class="n">vmax</span><span class="o">=.</span><span class="mi">3</span><span class="p">,</span> <span class="n">center</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">linewidths</span><span class="o">=.</span><span class="mi">5</span><span class="p">,</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"shrink"</span><span class="p">:</span> <span class="o">.</span><span class="mi">5</span><span class="p">})</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_correlation_heatmap</span><span class="p">(</span><span class="n">train_preprocessed</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/Kaggle_titanic_publish_files/Kaggle_titanic_publish_54_0.png" alt="png" /></p>
<p>Ok, obviously <code class="highlighter-rouge">Sex_female/Sex_male</code> and <code class="highlighter-rouge">Title_Miss/Title_Mr/Title_Mrs</code> stand out on the heatmap. We’ll remove <code class="highlighter-rouge">Sex_female/male</code> completely and keep <code class="highlighter-rouge">Title_Mr/Mrs/Miss</code>, which are more nuanced.</p>
<p>Apart from that, we should drop one column from each of the remaining collinear groups:</p>
<ul>
<li>Cabin_type_F</li>
<li>Age_categories_Teenager</li>
<li>Fare_categories_12-50</li>
<li>Title_Master</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cols_to_drop</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Sex_male'</span><span class="p">,</span><span class="s">'Sex_female'</span><span class="p">,</span><span class="s">'Cabin_type_F'</span><span class="p">,</span><span class="s">'Age_categories_Teenager'</span><span class="p">,</span><span class="s">'Fare_categories_12-50'</span><span class="p">,</span><span class="s">'Title_Master'</span><span class="p">]</span>
<span class="n">train_preprocessed</span> <span class="o">=</span> <span class="n">train_preprocessed</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">cols_to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">holdout_preprocessed</span> <span class="o">=</span> <span class="n">holdout_preprocessed</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">cols_to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="model-selection-and-tuning">Model Selection and Tuning!</h2>
<p>Now this is the fun part! This is where we build (train) our models. It is always exciting to pick from the great algorithms out there and watch them perform well (at least sometimes!). This is where we see Machine Learning at work. Magic!</p>
<h3 id="split-datasets-for-ml">Split datasets for ML</h3>
<p>At this point we split the dataframe into all_X and all_y (features and labels). That’s what we’ll be feeding into the models.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span> <span class="o">=</span> <span class="n">train_preprocessed</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span><span class="s">'PassengerId'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">train_preprocessed</span><span class="p">[</span><span class="s">'Survived'</span><span class="p">]</span>
<span class="n">test_X</span> <span class="o">=</span> <span class="n">holdout_preprocessed</span>
</code></pre></div></div>
<p>Ensure at this point that the feature lists (column names) coincide, as some categories may be absent in the holdout dataset. We have to use only the columns that both datasets share.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Grab all available colnames from all_X to use as a column_list</span>
<span class="n">column_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">all_X</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Ensure every colname in the list is present in the holdout dataset. Choose only those that coincide.</span>
<span class="n">col_list_to_choose_from</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">test_X</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">colname</span> <span class="k">for</span> <span class="n">colname</span> <span class="ow">in</span> <span class="n">column_list</span> <span class="k">if</span> <span class="n">colname</span> <span class="ow">in</span> <span class="n">col_list_to_choose_from</span><span class="p">]</span> <span class="c"># This filtering is required because not all columns are the same in the holdout dataset</span>
</code></pre></div></div>
<p>We’ll use this shared column list from here on. It helps to avoid possible mismatches.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_X</span> <span class="o">=</span> <span class="n">all_X</span><span class="p">[</span><span class="n">columns</span><span class="p">]</span>
</code></pre></div></div>
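<p>The same column alignment can be illustrated with plain Python lists (the column names below are hypothetical, echoing the Titanic dummies; the idea is that the holdout set may lack a category seen in training):</p>

```python
# Hypothetical column lists: the holdout set is missing one dummy category.
train_columns = ['Pclass_1', 'Pclass_2', 'Pclass_3',
                 'Title_Mr', 'Title_Mrs', 'Title_Miss', 'Cabin_type_T']
holdout_columns = ['Pclass_1', 'Pclass_2', 'Pclass_3',
                   'Title_Mr', 'Title_Mrs', 'Title_Miss']

# Keep only columns present in both, preserving the training-set order
holdout_set = set(holdout_columns)
shared = [c for c in train_columns if c in holdout_set]
print(shared)  # 'Cabin_type_T' is dropped
```

<p>Using a <code class="highlighter-rouge">set</code> for the membership test keeps the filtering fast even with many columns.</p>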
<h3 id="gridsearch-best-hyperparameters-and-model">GridSearch best hyperparameters and model!</h3>
<p>Now we’ll create a function to do the heavy lifting of model selection and tuning. The function takes several algorithms, uses grid search to train every combination of their hyperparameters, and finds the best-performing model!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingClassifier</span>
</code></pre></div></div>
<h5 id="a-list-of-parameters">A list of parameters!</h5>
<p>We can achieve this by creating a list of dictionaries, that is, a list where each element is a dictionary. Each dictionary should contain:</p>
<ul>
<li>The name of the particular model</li>
<li>An estimator object for the model</li>
<li>A dictionary of hyperparameters that we’ll use for grid search.</li>
</ul>
<p>An example of such a list of dictionaries (the hyperparameters that grid search will explore) can look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Define a list of dictionaries, that describe models and their parameters we want gridsearch for the best param!</span>
<span class="n">list_of_dict_models</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s">'name'</span><span class="p">:</span><span class="s">'LogisticRegression'</span><span class="p">,</span>
<span class="s">'estimator'</span><span class="p">:</span><span class="n">LogisticRegression</span><span class="p">(),</span>
<span class="s">'hyperparameters'</span><span class="p">:</span>
<span class="p">{</span>
<span class="s">"solver"</span><span class="p">:</span> <span class="p">[</span><span class="s">"newton-cg"</span><span class="p">,</span> <span class="s">"lbfgs"</span><span class="p">,</span> <span class="s">"liblinear"</span><span class="p">],</span>
<span class="s">'C'</span><span class="p">:[</span><span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">],</span>
<span class="s">'tol'</span><span class="p">:[</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="s">'name'</span><span class="p">:</span><span class="s">'KNeighborsClassifier'</span><span class="p">,</span>
<span class="s">'estimator'</span><span class="p">:</span><span class="n">KNeighborsClassifier</span><span class="p">(),</span>
<span class="s">'hyperparameters'</span><span class="p">:</span>
<span class="p">{</span>
<span class="s">"n_neighbors"</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span>
<span class="s">"weights"</span><span class="p">:</span> <span class="p">[</span><span class="s">"distance"</span><span class="p">,</span> <span class="s">"uniform"</span><span class="p">],</span>
<span class="s">"algorithm"</span><span class="p">:</span> <span class="p">[</span><span class="s">"ball_tree"</span><span class="p">,</span> <span class="s">"kd_tree"</span><span class="p">,</span> <span class="s">"brute"</span><span class="p">],</span>
<span class="s">"p"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">]</span>
</code></pre></div></div>
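<p>It is worth keeping in mind how fast such grids grow. A small stdlib sketch enumerating the LogisticRegression grid above with <code class="highlighter-rouge">itertools.product</code>:</p>

```python
from itertools import product

# The LogisticRegression grid from above: grid search fits one model
# per combination of these values (times the number of CV folds).
grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'C': [0.6, 0.7, 0.8, 0.9, 1, 1.2, 1.5],
    'tol': [0.001, 0.01, 0.1, 0.5, 1, 5],
}

combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 3 * 7 * 6 = 126 candidate parameter sets
print(combos[0])    # {'solver': 'newton-cg', 'C': 0.6, 'tol': 0.001}
```

<p>With <code class="highlighter-rouge">cv=10</code>, grid search fits 126 × 10 = 1260 models for this estimator alone, so it pays to keep grids modest.</p>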
<p>Now that we have a list of model dictionaries, let’s create a function that grid searches through all of them and finds the best params!</p>
<p>It should return an updated, sorted list of dictionaries and also print its findings in a readable way.</p>
<p><strong>Note:</strong> it contains an RFECV feature selector, commented out though, as it proved not very useful last time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">select_model</span><span class="p">(</span><span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span><span class="p">,</span> <span class="n">list_of_dict_models</span><span class="p">):</span>
<span class="n">this_scoring</span> <span class="o">=</span> <span class="s">'accuracy'</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">list_of_dict_models</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Searching best params for {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="p">[</span><span class="s">'name'</span><span class="p">]))</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">model</span><span class="p">[</span><span class="s">'estimator'</span><span class="p">]</span>
<span class="c"># # First Recursive column selection for each model</span>
<span class="c"># if model['name'] not in ['KNeighborsClassifier']:</span>
<span class="c"># selector = RFECV(estimator, cv=10, scoring=this_scoring)</span>
<span class="c"># selector.fit(all_X, all_y)</span>
<span class="c"># optimized_columns = all_X.columns[selector.support_].values</span>
<span class="c"># model['optimized_columns'] = optimized_columns</span>
<span class="c"># print('col_list length: {}'.format(len(optimized_columns)))</span>
<span class="n">grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">estimator</span><span class="p">,</span> <span class="n">param_grid</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s">'hyperparameters'</span><span class="p">],</span> <span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="n">this_scoring</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">grid</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span><span class="p">)</span>
<span class="n">model</span><span class="p">[</span><span class="s">'best_hyperparameters'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grid</span><span class="o">.</span><span class="n">best_params_</span>
<span class="n">best_score</span> <span class="o">=</span> <span class="n">grid</span><span class="o">.</span><span class="n">best_score_</span>
<span class="n">model</span><span class="p">[</span><span class="s">'best_score'</span><span class="p">]</span> <span class="o">=</span> <span class="n">best_score</span>
<span class="n">model</span><span class="p">[</span><span class="s">'best_estimator'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grid</span><span class="o">.</span><span class="n">best_estimator_</span>
<span class="n">list_of_dict_models_sorted_best</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">list_of_dict_models</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'best_score'</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">best_model</span> <span class="o">=</span> <span class="n">list_of_dict_models_sorted_best</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span>
<span class="n">best_score_achieved</span> <span class="o">=</span> <span class="n">list_of_dict_models_sorted_best</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'best_score'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Best Model: {}, score: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">best_model</span><span class="p">,</span><span class="n">best_score_achieved</span><span class="p">))</span>
<span class="k">return</span> <span class="n">list_of_dict_models_sorted_best</span>
</code></pre></div></div>
<p>Run the search function and get back the sorted list of models.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">list_of_dict_models_sorted_best</span> <span class="o">=</span> <span class="n">select_model</span><span class="p">(</span><span class="n">all_X</span><span class="p">,</span> <span class="n">all_y</span><span class="p">,</span> <span class="n">list_of_dict_models</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Searching best params for LogisticRegression
Searching best params for KNeighborsClassifier
Best Model: LogisticRegression, score: 0.813692480359147
</code></pre></div></div>
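<p>The final selection step of <code class="highlighter-rouge">select_model</code> boils down to ranking the models by their cross-validated score. In plain Python, using the scores printed above:</p>

```python
# The scores reported by select_model above, reduced to plain data
results = [
    {'name': 'LogisticRegression', 'best_score': 0.813692480359147},
    {'name': 'KNeighborsClassifier', 'best_score': 0.8125701459034792},
]

# Pick the entry with the highest cross-validated accuracy
best = max(results, key=lambda m: m['best_score'])
print('Best Model: {}, score: {}'.format(best['name'], best['best_score']))
```

<p>Sorting the whole list instead of taking the max (as the function does) additionally keeps the runner-up models handy for inspection.</p>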
<p>Additionally, print the whole list of dictionaries (for exploration purposes).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">list_of_dict_models_sorted_best</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{'name': 'LogisticRegression', 'estimator': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False), 'hyperparameters': {'solver': ['newton-cg', 'lbfgs', 'liblinear'], 'C': [0.6, 0.7, 0.8, 0.9, 1, 1.2, 1.5], 'tol': [0.001, 0.01, 0.1, 0.5, 1, 5]}, 'best_hyperparameters': {'C': 0.6, 'solver': 'newton-cg', 'tol': 0.001}, 'best_score': 0.81369248035914699, 'best_estimator': LogisticRegression(C=0.6, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='newton-cg', tol=0.001,
verbose=0, warm_start=False)}, {'name': 'KNeighborsClassifier', 'estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'), 'hyperparameters': {'n_neighbors': range(1, 20, 2), 'weights': ['distance', 'uniform'], 'algorithm': ['ball_tree', 'kd_tree', 'brute'], 'p': [1, 2]}, 'best_hyperparameters': {'algorithm': 'brute', 'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}, 'best_score': 0.8125701459034792, 'best_estimator': KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=1,
weights='uniform')}]
</code></pre></div></div>
<h3 id="generate-output">Generate output</h3>
<p>A function that generates the final output for a Kaggle submission.
As input it takes a trained model, a list of column names, and an optional filename to save to.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">save_submission_file</span><span class="p">(</span><span class="n">best_trained_model</span><span class="p">,</span> <span class="n">colnames</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s">'latest_submission.csv'</span><span class="p">):</span>
<span class="n">test_y</span> <span class="o">=</span> <span class="n">best_trained_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">[</span><span class="n">colnames</span><span class="p">])</span>
<span class="n">submission</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">test_X</span><span class="p">[</span><span class="s">'PassengerId'</span><span class="p">],</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">test_y</span><span class="p">,</span><span class="n">name</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">)],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c">#submission.rename(columns={0:'Survived'}, inplace=True)</span>
<span class="c">#print(submission.head(3))</span>
<span class="n">submission</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
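<p>For reference, the submission layout that <code class="highlighter-rouge">save_submission_file</code> produces is a simple two-column CSV. A stdlib-only sketch with hypothetical predictions:</p>

```python
import csv
import io

# Hypothetical (PassengerId, Survived) pairs, mirroring the layout
# that save_submission_file writes out with pandas.
predictions = [(892, 0), (893, 1), (894, 0)]

buffer = io.StringIO()  # in-memory file; a real run would open(filename, 'w')
writer = csv.writer(buffer)
writer.writerow(['PassengerId', 'Survived'])
writer.writerows(predictions)

print(buffer.getvalue())
```

<p>Kaggle expects exactly this header and one row per holdout passenger, with no index column, which is why the function passes <code class="highlighter-rouge">index=False</code> to <code class="highlighter-rouge">to_csv</code>.</p>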
<p>Retrieve the best-performing model from the dictionaries (or a model of your choice).</p>
<p>Then we are ready to run the function and get the latest file for submission!</p>
<p>The dataset looks fine and has plenty of columns. Let’s grab the column list to select the proper data from test_X.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Alternatively just grab all available colnames from all_X to use as a column_list</span>
<span class="n">column_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">all_X</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="c">#column_list.append('PassengerId')</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">best_trained_model</span> <span class="o">=</span> <span class="n">list_of_dict_models_sorted_best</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">'best_estimator'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">save_submission_file</span><span class="p">(</span><span class="n">best_trained_model</span><span class="p">,</span> <span class="n">column_list</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s">'latest_submission.csv'</span><span class="p">)</span>
</code></pre></div></div>
<p>The output, ready for Kaggle submission, will be saved locally as <code class="highlighter-rouge">filename</code>!</p>
<h1 id="good-luck">Good Luck!</h1>
<p>Hope these techniques, as well as the framework, prove helpful in your endeavors. Have a great Machine Learning experience!</p>Such a renowned Kaggle competition. Everyone into Machine Learning has tried to predict who is more likely to survive: a family man, a gentleman with an expensive ticket, or a child? Or maybe someone holding a Royalty title? Yes, there is quite a number of features for a machine to learn! Let’s dive into this tutorial, which is more of a presentation of my best Machine Learning framework so far! Yes! The framework I’ve built with the courtesy of dataquest.io, spending quite some time getting my best competition results. Btw, dataquest.io is a great platform and a community to learn some great things!Word Frequency From a Text2017-10-04T00:00:00+00:002017-10-04T00:00:00+00:00/project/2017/10/04/word_frequency<p>Suppose you want to get the top frequent words from a text. This task quickly reveals its caveats: there are swarms of words, each with dozens of forms, all those n’t and ’s, and also commas and periods… All of these should be accounted for.</p>
<p>Luckily for us, very powerful tools exist for word processing and text mining - libraries that handle these tasks in the best way possible.</p>
<p>Get familiar with <strong>nltk</strong> - a powerful library for NLP (natural language processing)!</p>
<!--more-->
<p>Ok, so the aim is to get word frequencies. The workflow would be:</p>
<ul>
<li>imports - get libraries</li>
<li>Normalize - prepare words</li>
<li>Tokenize - smart word split</li>
<li>Lemmatize - smart processing to meaningful ‘universal’ words.</li>
<li>Get word frequencies!</li>
</ul>
<p>Follow these simple, yet powerful text mining steps.</p>
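<p>Before reaching for <code class="highlighter-rouge">nltk</code>, the whole workflow can be sketched in plain Python. This is a crude approximation - nltk handles tokenization and lemmatization far more robustly, and the tiny stop-word list here is made up for the example:</p>

```python
from collections import Counter

# A plain-Python approximation of the pipeline on one sentence from the book
text = "Alice was beginning to get very tired of sitting by her sister on the bank."
stop_words = {'was', 'to', 'get', 'very', 'of', 'by', 'her', 'on', 'the'}

tokens = text.lower().split()            # normalize + naive tokenize
words = [t.strip('.,') for t in tokens]  # crude punctuation stripping
words = [w for w in words if w and w not in stop_words]

print(Counter(words).most_common(3))
```

<p>Note how much manual work the naive version needs for punctuation and word forms - exactly the caveats the nltk steps below take care of.</p>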
<h4 id="imports">imports</h4>
<p>First import <code class="highlighter-rouge">nltk</code>; you will probably also need to run <code class="highlighter-rouge">nltk.download()</code> to get all the resources. That might take some time!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
</code></pre></div></div>
<p>Load the raw text of Alice in Wonderland as an example for our analysis.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alice</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">corpus</span><span class="o">.</span><span class="n">gutenberg</span><span class="o">.</span><span class="n">raw</span><span class="p">(</span><span class="s">'carroll-alice.txt'</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">alice</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>144395
</code></pre></div></div>
<h4 id="normalization">Normalization</h4>
<p>Normalization here simply means lowercasing all the words.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alice_lower</span> <span class="o">=</span> <span class="n">alice</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">alice_lower</span><span class="p">[:</span><span class="mi">393</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[alice's adventures in wonderland by lewis carroll 1865]
chapter i. down the rabbit-hole
alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversation?'
</code></pre></div></div>
<h4 id="tokenize">Tokenize</h4>
<p>Tokenization is a smart splitting of the text into its constituent parts: words, symbols, endings. <code class="highlighter-rouge">nltk</code> has a powerful tool, <code class="highlighter-rouge">word_tokenize</code>, which takes many different cases into consideration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alice_tokenized</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">word_tokenize</span><span class="p">(</span><span class="n">alice_lower</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">alice_tokenized</span><span class="p">[:</span><span class="mi">80</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['[', 'alice', "'s", 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', ']', 'chapter', 'i.', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'alice', "'without", 'pictures', 'or', 'conversation', '?']
</code></pre></div></div>
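<p>To appreciate what <code class="highlighter-rouge">word_tokenize</code> does, compare a naive <code class="highlighter-rouge">str.split</code> with even a simple regex tokenizer (a rough stand-in for illustration, not what nltk actually uses):</p>

```python
import re

# Why a real tokenizer matters: a naive split leaves punctuation glued to
# words, while the regex separates letters from everything else.
sentence = "'and what is the use of a book,' thought alice"

naive = sentence.split()
regex_tokens = re.findall(r"[a-z]+|[^\sa-z]", sentence)

print(naive[0])          # "'and" - quote stuck to the word
print(regex_tokens[:3])  # ["'", 'and', 'what']
```

<p>nltk additionally handles contractions, possessives, and other tricky cases, as seen in the <code class="highlighter-rouge">"'s"</code> token in the output above.</p>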
<h4 id="filter-stop-words">Filter Stop words</h4>
<p>There are a lot of filler words, such as ‘a’, ‘or’, ‘the’. By the way, ‘the’ is the most frequent word in English. Since we are interested in meaningful words, we shall filter out any stop words.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="n">stopWords</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">))</span>
<span class="n">alice_tokenized</span> <span class="o">=</span> <span class="p">[</span> <span class="n">token</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">alice_tokenized</span> <span class="k">if</span> <span class="n">token</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopWords</span> <span class="p">]</span>
</code></pre></div></div>
<h4 id="filter-punctuation">Filter punctuation</h4>
<p>Punctuation symbols are also among the most frequent tokens in a text. For our purposes we should filter them out as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alice_tokenized</span> <span class="o">=</span> <span class="p">[</span><span class="n">token</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">alice_tokenized</span> <span class="k">if</span> <span class="n">token</span><span class="o">.</span><span class="n">isalpha</span><span class="p">()]</span>
</code></pre></div></div>
<h4 id="lemmatize">Lemmatize</h4>
<p>Lemmatization is smart stemming. Stemming finds and transforms a word to its stem. For example, the stem of <code class="highlighter-rouge">universal</code> is <code class="highlighter-rouge">univers</code> - which is not very appealing to a human. Lemmatization, in contrast, produces meaningful words, so <code class="highlighter-rouge">universal</code> or <code class="highlighter-rouge">universally</code> would both be transformed to an actual word, <code class="highlighter-rouge">universal</code>!</p>
<p>See examples of stemming and lemmatization:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Stemming</span>
<span class="n">porter</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">PorterStemmer</span><span class="p">()</span>
<span class="n">alice_stemmed</span> <span class="o">=</span> <span class="p">[</span><span class="n">porter</span><span class="o">.</span><span class="n">stem</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">alice_tokenized</span><span class="p">]</span> <span class="c"># Stem each token</span>
<span class="k">print</span><span class="p">(</span><span class="n">alice_stemmed</span><span class="p">[:</span><span class="mi">80</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['alic', 'adventur', 'wonderland', 'lewi', 'carrol', 'chapter', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', 'use', 'book', 'thought', 'alic', 'pictur', 'convers', 'consid', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepi', 'stupid', 'whether', 'pleasur', 'make', 'would', 'worth', 'troubl', 'get', 'pick', 'daisi', 'suddenli', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close', 'noth', 'remark', 'alic', 'think', 'much', 'way', 'hear', 'rabbit', 'say', 'dear', 'oh', 'dear', 'shall', 'late', 'thought', 'afterward', 'occur', 'ought', 'wonder', 'time', 'seem', 'quit', 'natur', 'rabbit', 'actual', 'took', 'watch']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Lemmatization. Notice the difference.</span>
<span class="n">WNlemma</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">WordNetLemmatizer</span><span class="p">()</span>
<span class="n">alice_lemmatized</span> <span class="o">=</span> <span class="p">[</span><span class="n">WNlemma</span><span class="o">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">alice_tokenized</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">alice_lemmatized</span><span class="p">[:</span><span class="mi">80</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['alice', 'adventure', 'wonderland', 'lewis', 'carroll', 'chapter', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'picture', 'conversation', 'use', 'book', 'thought', 'alice', 'picture', 'conversation', 'considering', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making', 'would', 'worth', 'trouble', 'getting', 'picking', 'daisy', 'suddenly', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close', 'nothing', 'remarkable', 'alice', 'think', 'much', 'way', 'hear', 'rabbit', 'say', 'dear', 'oh', 'dear', 'shall', 'late', 'thought', 'afterwards', 'occurred', 'ought', 'wondered', 'time', 'seemed', 'quite', 'natural', 'rabbit', 'actually', 'took', 'watch']
</code></pre></div></div>
<h4 id="get-distribution">Get Distribution</h4>
<p><code class="highlighter-rouge">FreqDist</code> creates a dictionary mapping each word to its frequency.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">word_tokens_freqdist</span> <span class="o">=</span> <span class="n">nltk</span><span class="o">.</span><span class="n">FreqDist</span><span class="p">(</span><span class="n">alice_lemmatized</span><span class="p">)</span>
<span class="nb">list</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">word_tokens_freqdist</span><span class="o">.</span><span class="n">keys</span><span class="p">())[:</span><span class="mi">10</span><span class="p">])</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['alice',
'adventure',
'wonderland',
'lewis',
'carroll',
'chapter',
'beginning',
'get',
'tired',
'sitting']
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Get words with frequency >60 and length >5.</span>
<span class="n">freqwords</span> <span class="o">=</span> <span class="p">[</span><span class="s">'{}:{}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">w</span><span class="p">,</span><span class="n">word_tokens_freqdist</span><span class="p">[</span><span class="n">w</span><span class="p">])</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">word_tokens_freqdist</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">></span> <span class="mi">5</span> <span class="ow">and</span> <span class="n">word_tokens_freqdist</span><span class="p">[</span><span class="n">w</span><span class="p">]</span> <span class="o">></span> <span class="mi">60</span><span class="p">]</span>
<span class="n">freqwords</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['thought:76', 'little:128']
</code></pre></div></div>
<p>The FreqDist class also provides handy methods for retrieving the most frequent words directly from the dictionary of frequencies.</p>
<p>Getting top 10 frequent words from a text!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Get top 10 frequent words from a text</span>
<span class="n">top_freq_words</span> <span class="o">=</span> <span class="n">word_tokens_freqdist</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">top_freq_words</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('said', 462), ('alice', 396), ('little', 128), ('one', 100), ('would', 90), ('know', 90), ('could', 86), ('like', 86), ('went', 83), ('thing', 79)]
</code></pre></div></div>
<p>We can see the top frequent words with their occurrence counts in the text, presented as a list of tuples.</p>
<p>Cool!</p>
<p>These are some great basics of NLP. We were able to extract the most frequent meaningful words from the beloved tale Alice in Wonderland. <em>Said alice little one…</em></p>
<p>Hope it was helpful. Your comments and insights are very welcome.</p>Suppose you want to get the top frequent words from a text. This task quickly reveals the caveats: there are swarms of words, each with dozens of forms, all those n’t and ‘s, and also commas and periods… All these should be accounted for. Luckily for us, very powerful tools exist for word processing and text mining - libraries that handle these tasks in the best way possible. Get familiar with nltk - a powerful library for NLP (natural language processing)!ML Modelling Workflow Tutorial2017-09-26T00:00:00+00:002017-09-26T00:00:00+00:00/tutorial/2017/09/26/ML_Modelling<p>This is a solid and quick Machine Learning tutorial that walks through the steps of building the best prediction model. It gives an understanding of the process and provides code examples.</p>
<p>To save space, I’ll be using only Logistic Regression and Trees (either Gradient Boosted or Ensembles are good), although I encourage you to run different classifiers.</p>
<p>Originally built on a project I completed for the University of Michigan course <a href="https://www.coursera.org/learn/python-machine-learning/home/welcome">Applied Machine Learning</a>, which I highly recommend to anyone passionate about Data Science.
<!--more--></p>
<p><img src="/images/machine_learning/forest.jpg" alt="English Forest" title="In Machine Learning you may grow Forests!" /></p>
<p><em>The picture of an English forest is not accidental. In Machine Learning you can grow trees. Thousands of them!</em></p>
<h4 id="the-dataset">The dataset</h4>
<p>The dataset is relatively large - 60k rows with a dozen features. For discretion reasons I cannot include the dataset, as it is used for student scoring.</p>
<h5 id="a-typical-machine-learning-workflow-is-as-following">A typical Machine Learning workflow is as following:</h5>
<ul>
<li>get acquainted with the data</li>
<li>clean the datasets - deal with NaNs and trim to only the relevant data (classifiers can’t deal with NaNs and throw errors!)</li>
<li>merge datasets</li>
<li>decide on the features (ensure they introduce no data leakage) and preprocess them: convert to numbers, min-max transformation, polynomial transformation (for regression tasks)</li>
<li>===> The steps above - data preparation - take considerable effort and are an essential part of the process! <===</li>
<li>split Train dataset</li>
<li>understand the distribution of the target labels; it can be a good idea to train a dummy classifier for comparison</li>
<li>choose a relevant classifier and tune for the best parameters according to the scoring of your choice (AUC, precision, recall, accuracy, F1) via Grid Search</li>
<li>repeat the previous step for various classifiers of choice</li>
<li>===> Now we have the best model! <===</li>
<li>feed in (fit) the <em>whole</em> train dataset to our model, without splits!</li>
</ul>
<p>This workflow yields the best possible fitted model, with the best scoring output.</p>
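<p>The workflow above can be sketched end to end. The following is a minimal, hedged illustration on synthetic data (the real dataset cannot be shared), so all names and numbers here are stand-ins rather than the actual project code:</p>

```python
# Minimal end-to-end workflow sketch on synthetic stand-in data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in for the prepared (merged, NaN-free, numeric) dataset
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

# Split the train dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocess the features
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Dummy baseline reflecting the label distribution, for comparison
dummy = DummyClassifier(strategy='most_frequent').fit(X_train_s, y_train)
dummy_acc = dummy.score(X_test_s, y_test)

# Tune a classifier for the chosen scoring via Grid Search
grid = GridSearchCV(LogisticRegression(), {'C': [0.1, 1, 10]},
                    cv=3, scoring='roc_auc')
grid.fit(X_train_s, y_train)

auc_test = roc_auc_score(y_test, grid.decision_function(X_test_s))
print('best C:', grid.best_params_['C'], 'test AUC: {:.2f}'.format(auc_test))
```

The same skeleton applies to any classifier: only the estimator and the parameter grid change between runs.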
<h5 id="a-few-discoveries-while-testing-different-classifiers">A few discoveries while testing different classifiers:</h5>
<ul>
<li><strong>Logistic Regression</strong> is VERY powerful in its simplicity and speed. It outputs very <em>decent scores</em> (not the best) within seconds.</li>
<li><strong>Tree Ensembles and Gradient Boosted Trees</strong> are very good, but require resource capacity (powerful CPUs) and time to reach the <em>best scores</em>.</li>
<li><strong>Kernelized SVC</strong> turned out quite unusable, even <em>for a dataset with a relatively low number of features</em>. It requires tremendous resource capacity, and training can take hours. For example, tuning kernel=rbf clogged my i5 for 6 hours with <em>mediocre output</em>.</li>
</ul>
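<p>A rough way to see the speed difference for yourself is to time both classifiers on the same data. The sketch below uses synthetic data, so the absolute times are machine-dependent; only the relative order is the point:</p>

```python
# Illustrative timing comparison on synthetic data (not the project dataset);
# absolute numbers vary by machine, only the relative order matters.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)

t0 = time.time()
LogisticRegression().fit(X, y)          # fits in a fraction of a second
lr_seconds = time.time() - t0

t0 = time.time()
GradientBoostingClassifier(n_estimators=100).fit(X, y)  # noticeably slower
gbt_seconds = time.time() - t0

print('LogisticRegression: {:.3f}s, GradientBoosting: {:.3f}s'
      .format(lr_seconds, gbt_seconds))
```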
<h2 id="prepare-dataset-and-preprocess-features">Prepare dataset and Preprocess features</h2>
<p>Assuming the dataset is ready: trimmed, NaNs dealt with, required columns merged, and everything presented in numeric format.</p>
<p>Here are code snippets for minmax preprocessor and splits:</p>
<h4 id="fillna">fillna</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Example of converting all nans with negative number</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="splitter">Splitter</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Split dataset. This example assumes the last column to be the target label (represented as binary)</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="k">def</span> <span class="nf">split_dataset</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">split_dataset</span><span class="p">(</span><span class="n">train_df</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="min-max-scaler">min-max scaler</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Preprocess features via min-max scaler</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
<span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="finding-best-model">Finding Best Model</h2>
<p>This is the most exciting part! Now you’ll see classifiers and your models compete for the best scores. Enjoy and choose the best one :)</p>
<h3 id="logistic-regression-classifier">Logistic Regression classifier</h3>
<p>Let’s complete full workflow using Logistic Regression classifier.</p>
<p>First, tune the Logistic Regression classifier for the best params via Grid Search. Note that scoring is set to ‘roc_auc’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span>
<span class="n">grid_params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'C'</span><span class="p">:[</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">80</span><span class="p">],</span> <span class="s">'penalty'</span><span class="p">:</span> <span class="p">[</span><span class="s">'l1'</span><span class="p">,</span> <span class="s">'l2'</span><span class="p">]}</span>
<span class="c">#Note n_jobs=-1 parameter sets GridSearch to use all cores.</span>
<span class="n">grid_model_best</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">param_grid</span><span class="o">=</span><span class="n">grid_params</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">)</span> <span class="c"># cv=3 and scoring='accuracy' are default</span>
<span class="n">grid_model_best</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="model-performance-report">Model Performance Report</h4>
<p>Visualize possible performance metrics for a given model</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="c"># This a good function that prints out metrics for the model.</span>
<span class="k">def</span> <span class="nf">print_report_grid_search</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of the classifier on training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of the classifier on test set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="n">y_predicted</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Report</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_predicted</span><span class="p">,</span> <span class="n">target_names</span><span class="o">=</span><span class="p">[</span><span class="s">'non-compliant'</span><span class="p">,</span> <span class="s">'compliant'</span><span class="p">]))</span>
<span class="c"># Print out best parameters</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Grid best parameter (max. AUC): '</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best Train AUC score: '</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">best_score_</span><span class="p">)</span>
<span class="c"># Calculate and print out auc</span>
<span class="n">y_score_model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">decision_function</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">)</span>
<span class="n">roc_auc_model</span> <span class="o">=</span> <span class="n">roc_auc_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_score_model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Model best Test AUC score: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">roc_auc_model</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">print_report_grid_search</span><span class="p">(</span><span class="n">grid_model_best</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="best-params">Best params</h4>
<p>Great! We now have a solid working model, potentially the best one.</p>
<p>Write down best params and any other notes found for a given model.</p>
<ul>
<li>Features are min-max scaled</li>
<li>{‘C’: 35, ‘penalty’: ‘l1’}</li>
<li>Model best Test AUC score: 0.7861967572842491</li>
</ul>
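<p>With the best params recorded, a fresh classifier can be instantiated directly from the notes. A hedged sketch follows: the <code class="highlighter-rouge">solver='liblinear'</code> argument is my addition (recent scikit-learn versions need an l1-capable solver for <code class="highlighter-rouge">penalty='l1'</code>), and the training data here is a synthetic stand-in for the scaled features:</p>

```python
# Hypothetical sketch: rebuild the model from the recorded best params.
# solver='liblinear' is an assumption, required for penalty='l1'
# in newer scikit-learn releases; the data is a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train_scaled, y_train = make_classification(n_samples=200, n_features=12,
                                              random_state=0)
final_lr = LogisticRegression(C=35, penalty='l1', solver='liblinear')
final_lr.fit(X_train_scaled, y_train)
train_acc = final_lr.score(X_train_scaled, y_train)
```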
<h3 id="gradient-boosted-trees">Gradient Boosted Trees</h3>
<p>Trees, especially large forests (>1000 trees), require significant computational power and take some time.</p>
<p>Also note that the model is fitted with the non-normalized dataset. Building large forests of more than 1000 trees might clog the CPU for a few hours. Obviously, larger forests would give slightly better output.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Import and instantiate classifier </span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingClassifier</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GradientBoostingClassifier</span><span class="p">()</span>
<span class="c"># Define params</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">[{</span>
<span class="s">'n_estimators'</span><span class="p">:[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">300</span><span class="p">],</span>
<span class="s">'max_depth'</span><span class="p">:[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">],</span>
<span class="s">'random_state'</span><span class="p">:[</span><span class="mi">0</span><span class="p">],</span>
<span class="s">'learning_rate'</span><span class="p">:[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="s">'max_features'</span><span class="p">:</span> <span class="p">[</span><span class="s">'auto'</span><span class="p">]</span>
<span class="p">}]</span>
<span class="c"># Find model with best params for the scoring goal.</span>
<span class="c"># Note n_jobs=-1 parameter sets GridSearch to use all cores.</span>
<span class="n">grid_model_best</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">param_grid</span> <span class="o">=</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'roc_auc'</span><span class="p">)</span>
<span class="c"># Fit the model</span>
<span class="n">grid_model_best</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="model-performance-report-1">Model Performance Report</h4>
<p>Visualize possible performance metrics for a given model</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_curve</span><span class="p">,</span> <span class="n">auc</span>
<span class="c"># The function to print report differs from that of used with Logistic Regression. Particularly predict_proba function is used.</span>
<span class="k">def</span> <span class="nf">print_report_grid_search_trees</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of the classifier on a training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of the classifier on a test set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="n">y_predicted</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Report</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_predicted</span><span class="p">,</span> <span class="n">target_names</span><span class="o">=</span><span class="p">[</span><span class="s">'non-compliant'</span><span class="p">,</span> <span class="s">'compliant'</span><span class="p">]))</span>
<span class="c"># Print out best parameters</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Grid best parameter (max. AUC): '</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best Train AUC score: '</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">best_score_</span><span class="p">)</span>
<span class="c"># Calculate test AUC score</span>
<span class="n">y_probability</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">fpr_model</span><span class="p">,</span> <span class="n">tpr_model</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_probability</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">auc_score</span> <span class="o">=</span> <span class="n">auc</span><span class="p">(</span><span class="n">fpr_model</span><span class="p">,</span> <span class="n">tpr_model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Model best Test AUC score: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">auc_score</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">print_report_grid_search_trees</span><span class="p">(</span><span class="n">grid_model_best</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="best-params-1">Best params</h4>
<p>Even better! The trees’ output outperforms that of the Logistic model.</p>
<p>Latest best params for model with 3 folds (with latlon, with dates):</p>
<ul>
<li>no min-max feature scaling</li>
<li>{‘learning_rate’: 0.1, ‘max_depth’: 7, ‘max_features’: ‘auto’, ‘n_estimators’: 200, ‘random_state’: 0}</li>
<li>Model best Test AUC score: 0.8537540093641313</li>
</ul>
<h2 id="final-model">Final Model</h2>
<p>So, we have our model of choice: the Gradient Boosted Forest!</p>
<p>The last steps would be:</p>
<ul>
<li>fit the model with the full train dataset</li>
<li>use cross-validation for possibly even better output</li>
</ul>
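<p>The cross-validation step can be sketched with <code class="highlighter-rouge">cross_val_score</code>. Again, the data below is a synthetic stand-in and the tree count is kept small so the sketch runs quickly; in the real project the tuned parameters from the Grid Search would be used:</p>

```python
# Hedged sketch of cross-validating the final model on stand-in data
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=300, n_features=12,
                                       random_state=0)
model = GradientBoostingClassifier(random_state=0, n_estimators=50, max_depth=3)

# One roc_auc score per fold; their mean estimates generalization performance
scores = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
mean_auc = scores.mean()
```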
<h4 id="prepare-datasets-no-split">Prepare datasets, no split</h4>
<p>The train dataset is not split and is used as a whole.</p>
<p>The test dataset is used as a whole and obviously has no target labels.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">dont_split_dataset</span><span class="p">(</span><span class="n">train_df</span><span class="p">,</span> <span class="n">test_df</span><span class="p">):</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">train_df</span><span class="p">[</span><span class="n">train_df</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">train_df</span><span class="p">[</span><span class="n">train_df</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">test_df</span>
<span class="k">return</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">dont_split_dataset</span><span class="p">(</span><span class="n">train_df</span><span class="p">,</span> <span class="n">test_df</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="fit-the-final-model">Fit the Final Model</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#The Model Of Choice! Feed in the results.</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GradientBoostingClassifier</span><span class="p">(</span><span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">n_estimators</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">max_depth</span><span class="o">=</span><span class="mi">7</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<p><strong>Perfect! The model is ready.</strong></p>
<p>Display performance during training:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="c"># roc_curve and auc are used in the report below</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_curve</span><span class="p">,</span> <span class="n">auc</span>
<span class="c"># See performance on train dataset</span>
<span class="k">def</span> <span class="nf">print_train_report_tree_model</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Gradient Boosting classifier on training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="n">y_probability</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">fpr_model</span><span class="p">,</span> <span class="n">tpr_model</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">y_probability</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">roc_auc_model</span> <span class="o">=</span> <span class="n">auc</span><span class="p">(</span><span class="n">fpr_model</span><span class="p">,</span> <span class="n">tpr_model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Model Train AUC: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">roc_auc_model</span><span class="p">))</span>
<span class="n">print_train_report_tree_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="predict">Predict!</h4>
<p>That’s what a model is built for.</p>
<p>Feed it the test dataset and it will make predictions in no time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">target_labels</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="thanks">Thanks</h2>
<p>So far this was a quick introduction to (or refresher of) the Machine Learning workflow for building the best model, leaving out the initial dataset preparation (as that requires a separate approach).</p>
<p>Hope it was helpful! Any input and comments are welcome.</p>
<p>Happy modelling!</p>This is a solid and quick Machine Learning tutorial that walks through the steps of building the best prediction model. It gives an understanding of the process and provides code examples. To save space I’ll be using Logistic Regression and trees (either Gradient Boosted or ensembles are good), although I encourage you to run different classifiers. It was originally built on a project I completed for the University of Michigan course Applied Machine Learning, which I highly recommend to anyone passionate about Data Science.ML Model Evaluation cheatsheet2017-09-23T00:00:00+00:002017-09-23T00:00:00+00:00/cheatsheet/2017/09/23/ML_Choosing_best_algorithm<p>A handy cheatsheet on tools for model evaluation. It briefly explains key concepts and ends with the powerful GridSearch tool, providing code snippets.</p>
<!--more-->
<h4 id="the-cheatsheet-includes">The cheatsheet includes:</h4>
<p><a href="#dummy">Dummy Classifiers</a></p>
<p><a href="#confusion">Confusion Matrices</a> - example of binary confusion matrices</p>
<p>Evaluating Binary-Classification Models</p>
<ul>
<li><a href="#metrics"><strong>Metrics</strong></a> - metrics explained</li>
<li><a href="#AUC">AUC</a> - useful extra metric</li>
</ul>
<p>Evaluating Multiclass Models</p>
<ul>
<li><a href="#multi-confusion">MultiClass Confusion Matrices</a></li>
<li><a href="#multi-report">Multiclass Classification Report</a></li>
</ul>
<p><a href="#grid-search"><strong>Meet the Grid Search!</strong></a> - the essence of this cheatsheet</p>
<h2 id="dummy-classifiers-">Dummy Classifiers <a id="dummy"></a></h2>
<p>Dummy classifiers completely ignore the input data. They serve as a performance baseline for real models: the idea is to score better than a dummy.</p>
<p>Types of dummy classifiers:</p>
<ul>
<li>most_frequent - always predicts the most frequent label</li>
<li>stratified - makes random predictions following the class distribution of the training dataset</li>
<li>uniform - generates predictions uniformly at random</li>
<li>constant - always predicts a user-provided label (useful for F1-score evaluation)</li>
</ul>
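<p>A minimal self-contained sketch on toy data (the arrays below are made up for illustration): the <code class="highlighter-rouge">most_frequent</code> dummy always predicts the majority class, so its accuracy equals the majority-class fraction.</p>

```python
from sklearn.dummy import DummyClassifier
import numpy as np

# Hypothetical toy data: class 0 is the majority class (4 of 6 samples)
X = np.zeros((6, 1))               # features are irrelevant to a dummy
y = np.array([0, 0, 0, 0, 1, 1])

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
print(dummy.predict(X))            # always the majority class: [0 0 0 0 0 0]
print(dummy.score(X, y))           # accuracy = majority fraction = 4/6
```

Any real model should beat this baseline; if it does not, the model has learned nothing from the features.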
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.dummy</span> <span class="kn">import</span> <span class="n">DummyClassifier</span>
<span class="c"># Negative class (0) is most frequent</span>
<span class="n">dummy_majority</span> <span class="o">=</span> <span class="n">DummyClassifier</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="s">'most_frequent'</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># Therefore the dummy 'most_frequent' classifier always predicts class 0</span>
<span class="n">y_dummy_predictions</span> <span class="o">=</span> <span class="n">dummy_majority</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">y_dummy_predictions</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dummy_majority</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># SVC must be imported before use</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="confusion-matrices">Confusion Matrices<a id="confusion"></a></h2>
<p>Confusion matrices provide a visual insight into the distribution of True Negative, True Positive, False Negative and False Positive predictions.</p>
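<p>For orientation, here is a minimal self-contained sketch of how scikit-learn lays out the binary matrix (the labels are made up for illustration): rows are true labels, columns are predicted labels, i.e. <code class="highlighter-rouge">[[TN, FP], [FN, TP]]</code>.</p>

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: among true 0s one is misclassified (FP),
# among true 1s one is misclassified (FN)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[2 1]     <- TN=2, FP=1
#  [1 2]]    <- FN=1, TP=2
```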
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Confusion matrix with dummy classifier</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">confusion_matrix</span>
<span class="c"># Negative class (0) is most frequent</span>
<span class="n">dummy_majority</span> <span class="o">=</span> <span class="n">DummyClassifier</span><span class="p">(</span><span class="n">strategy</span> <span class="o">=</span> <span class="s">'most_frequent'</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_majority_predicted</span> <span class="o">=</span> <span class="n">dummy_majority</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">confusion</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_majority_predicted</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Most frequent class (dummy classifier)</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="n">confusion</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Confusion matrix with SVC</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm_predicted</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">confusion</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">svm_predicted</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Support vector machine classifier (linear kernel, C=1)</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="n">confusion</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="evaluating-binary-classification-models">Evaluating Binary-Classification Models</h2>
<h3 id="metrics-">Metrics <a id="metrics"></a></h3>
<p>Model performance can be evaluated not only in terms of accuracy: various other metrics can be taken into consideration or chosen as primary, according to business goals.</p>
<p><strong>Accuracy</strong> = (TP + TN) / (TP + TN + FP + FN) - the fraction of all predictions that are correct</p>
<p><strong>Precision</strong> = TP / (TP + FP) - how precise the positive predictions are. A good primary metric for customer-facing tasks, where False Positive results should be minimized.</p>
<p><strong>Recall</strong> = TP / (TP + FN) - also known as sensitivity, or True Positive Rate. Good for medical applications, where missed (False Negative) predictions should be minimized.</p>
<p><strong>F1</strong> = 2 * Precision * Recall / (Precision + Recall) - evaluates precision and recall equally (harmonic mean)</p>
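<p>As a sanity check, the formulas above can be verified on a tiny set of hypothetical predictions (labels chosen so that TP=3, FN=1, TN=2, FP=2):</p>

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]       # TP=3, FN=1, TN=2, FP=2

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 5/8 = 0.625
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/5 = 0.6
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # 2 * 0.6 * 0.75 / 1.35 = 2/3
```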
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">precision_score</span><span class="p">,</span> <span class="n">recall_score</span><span class="p">,</span> <span class="n">f1_score</span>
<span class="c"># Accuracy = (TP + TN) / (TP + TN + FP + FN)</span>
<span class="c"># Precision = TP / (TP + FP)</span>
<span class="c"># Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate</span>
<span class="c"># F1 = 2 * Precision * Recall / (Precision + Recall) </span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy: {:.2f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">tree_predicted</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Precision: {:.2f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">precision_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">tree_predicted</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Recall: {:.2f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">recall_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">tree_predicted</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'F1: {:.2f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">f1_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">tree_predicted</span><span class="p">)))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Combined report with all above metrics</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">tree_predicted</span><span class="p">,</span> <span class="n">target_names</span><span class="o">=</span><span class="p">[</span><span class="s">'not 1'</span><span class="p">,</span> <span class="s">'1'</span><span class="p">]))</span>
</code></pre></div></div>
<h3 id="area-under-the-curve-auc">Area under the Curve (AUC)<a id="AUC"></a></h3>
<p>The ROC curve shows the relation between the True Positive and False Positive rates.
The higher the AUC, the better the model (a higher True Positive Rate is achieved at each False Positive Rate).</p>
<p>The curve is built by sweeping through the thresholds of the model’s decision function.</p>
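<p>A minimal self-contained sketch of that threshold sweep, using made-up scores: <code class="highlighter-rouge">roc_curve</code> returns one (FPR, TPR) pair per threshold, and <code class="highlighter-rouge">auc</code> integrates the curve.</p>

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]          # hypothetical decision-function scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                     # 0.75 for these scores
```

AUC also equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one; here 3 of the 4 positive/negative pairs are ranked correctly, hence 0.75.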
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_curve</span><span class="p">,</span> <span class="n">auc</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">cm</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y_binary_imbalanced</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">([</span><span class="o">-</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">1.00</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">([</span><span class="o">-</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">1.01</span><span class="p">])</span>
<span class="k">for</span> <span class="n">g</span> <span class="ow">in</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.20</span><span class="p">,</span> <span class="mi">1</span><span class="p">]:</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="n">g</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_score_svm</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">decision_function</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">fpr_svm</span><span class="p">,</span> <span class="n">tpr_svm</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_score_svm</span><span class="p">)</span>
<span class="n">roc_auc_svm</span> <span class="o">=</span> <span class="n">auc</span><span class="p">(</span><span class="n">fpr_svm</span><span class="p">,</span> <span class="n">tpr_svm</span><span class="p">)</span>
<span class="n">accuracy_svm</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"gamma = {:.2f} accuracy = {:.2f} AUC = {:.2f}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">accuracy_svm</span><span class="p">,</span>
<span class="n">roc_auc_svm</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">fpr_svm</span><span class="p">,</span> <span class="n">tpr_svm</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s">'SVM (gamma = {:0.2f}, area = {:0.2f})'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">roc_auc_svm</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'False Positive Rate'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True Positive Rate (Recall)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">"lower right"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'ROC curve: (1-of-10 digits classifier)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axes</span><span class="p">()</span><span class="o">.</span><span class="n">set_aspect</span><span class="p">(</span><span class="s">'equal'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<h2 id="evaluating-multiclassification-models">Evaluating Multiclass Models</h2>
<h4 id="multiclass-confusion-matrix">Multiclass Confusion Matrix<a id="multi-confusion"></a></h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">load_digits</span><span class="p">()</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">dataset</span><span class="o">.</span><span class="n">target</span>
<span class="n">X_train_mc</span><span class="p">,</span> <span class="n">X_test_mc</span><span class="p">,</span> <span class="n">y_train_mc</span><span class="p">,</span> <span class="n">y_test_mc</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span> <span class="o">=</span> <span class="s">'linear'</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_mc</span><span class="p">,</span> <span class="n">y_train_mc</span><span class="p">)</span>
<span class="n">svm_predicted_mc</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_mc</span><span class="p">)</span>
<span class="n">confusion_mc</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test_mc</span><span class="p">,</span> <span class="n">svm_predicted_mc</span><span class="p">)</span>
<span class="n">df_cm</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">confusion_mc</span><span class="p">,</span>
<span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)],</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mf">5.5</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">df_cm</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'SVM Linear Kernel </span><span class="se">\n</span><span class="s">Accuracy:{0:.3f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test_mc</span><span class="p">,</span>
<span class="n">svm_predicted_mc</span><span class="p">)))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True label'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Predicted label'</span><span class="p">)</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span> <span class="o">=</span> <span class="s">'rbf'</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_mc</span><span class="p">,</span> <span class="n">y_train_mc</span><span class="p">)</span>
<span class="n">svm_predicted_mc</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_mc</span><span class="p">)</span>
<span class="n">confusion_mc</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test_mc</span><span class="p">,</span> <span class="n">svm_predicted_mc</span><span class="p">)</span>
<span class="n">df_cm</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">confusion_mc</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)],</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mf">5.5</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">df_cm</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'SVM RBF Kernel </span><span class="se">\n</span><span class="s">Accuracy:{0:.3f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test_mc</span><span class="p">,</span>
<span class="n">svm_predicted_mc</span><span class="p">)))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True label'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Predicted label'</span><span class="p">);</span>
</code></pre></div></div>
<h4 id="multi-class-classification-report">Multi-class classification report<a id="multi-report"></a></h4>
<p>Displays a combined metrics report for the model</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test_mc</span><span class="p">,</span> <span class="n">svm_predicted_mc</span><span class="p">))</span>
</code></pre></div></div>
<h2 id="meet-the-grid-search">Meet the Grid Search!<a id="grid-search"></a></h2>
<p>A powerful tool for finding the parameters that maximize a given score metric.</p>
<p>It simply sweeps through all possible parameter combinations, and thus allows you to choose the best model!</p>
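<p>To make “all possible combinations” concrete, here is a small sketch using <code class="highlighter-rouge">ParameterGrid</code> (the helper scikit-learn uses to enumerate grid candidates); the grid values below are hypothetical:</p>

```python
from sklearn.model_selection import ParameterGrid

grid_values = {'gamma': [0.01, 0.1, 1], 'C': [1, 10]}
combos = list(ParameterGrid(grid_values))

print(len(combos))   # 3 gammas x 2 Cs = 6 candidate models
for c in combos:
    print(c)         # each dict is one model configuration to fit and score
```

With cross-validation on top, the total number of fits is (number of combinations) x (number of CV folds), so grids grow expensive quickly.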
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">load_digits</span><span class="p">()</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">dataset</span><span class="o">.</span><span class="n">target</span> <span class="o">==</span> <span class="mi">1</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="s">'rbf'</span><span class="p">)</span>
<span class="n">grid_values</span> <span class="o">=</span> <span class="p">{</span><span class="s">'gamma'</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">]}</span>
<span class="c"># default metric to optimize over grid parameters: accuracy</span>
<span class="n">grid_clf_acc</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">param_grid</span> <span class="o">=</span> <span class="n">grid_values</span><span class="p">)</span>
<span class="n">grid_clf_acc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best parameter (max. accuracy): '</span><span class="p">,</span> <span class="n">grid_clf_acc</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best score (accuracy): '</span><span class="p">,</span> <span class="n">grid_clf_acc</span><span class="o">.</span><span class="n">best_score_</span><span class="p">)</span>
<span class="c"># alternative metric to optimize over grid parameters: AUC</span>
<span class="n">grid_clf_auc</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">param_grid</span> <span class="o">=</span> <span class="n">grid_values</span><span class="p">,</span> <span class="n">scoring</span> <span class="o">=</span> <span class="s">'roc_auc'</span><span class="p">)</span>
<span class="n">grid_clf_auc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_decision_fn_scores_auc</span> <span class="o">=</span> <span class="n">grid_clf_auc</span><span class="o">.</span><span class="n">decision_function</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test set AUC: '</span><span class="p">,</span> <span class="n">roc_auc_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_decision_fn_scores_auc</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best parameter (max. AUC): '</span><span class="p">,</span> <span class="n">grid_clf_auc</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grid best score (AUC): '</span><span class="p">,</span> <span class="n">grid_clf_auc</span><span class="o">.</span><span class="n">best_score_</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="evaluation-metrics-supported-for-model-selection">Evaluation metrics supported for model selection</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics.scorer</span> <span class="kn">import</span> <span class="n">SCORERS</span>
<span class="k">print</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">SCORERS</span><span class="o">.</span><span class="n">keys</span><span class="p">())))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
</code></pre></div></div>
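<p>Any scorer name from the list above can be passed to <code class="highlighter-rouge">cross_val_score</code> or <code class="highlighter-rouge">GridSearchCV</code> via the <code class="highlighter-rouge">scoring</code> parameter. A minimal sketch, reusing the digits dataset from the Grid Search example above:</p>

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, dataset.target == 1  # binary task: digit "1" vs the rest

clf = SVC(kernel='rbf', gamma=0.001)
# The same estimator, evaluated under two different scorers
acc = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print('Mean cross-validated accuracy: {:.3f}'.format(acc.mean()))
print('Mean cross-validated AUC: {:.3f}'.format(auc.mean()))
```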
<h2 id="thanks">Thanks!</h2>
<p>Hope the cheatsheet is helpful, especially the section on the Grid Search.</p>
<p><em>The article is based on handouts from the Coursera specialization <a href="https://www.coursera.org/specializations/data-science-python">Applied Data Science with Python</a> by the University of Michigan.</em></p>
<p>Have a great time!</p>A handy cheatsheet on tools for model evaluation. Briefly explains key concepts, and ends with the powerful Grid Search tool, providing code snippets.Machine Learning Algorithms Cheatsheet2017-09-23T00:00:00+00:002017-09-23T00:00:00+00:00/cheatsheet/2017/09/23/ML_Algorithms_cheatsheet<p>A useful cheatsheet of Machine Learning algorithms, with brief descriptions of their best applications along with code examples.</p>
<p>The cheatsheet lists various models as well as a few techniques (at the end) to complement model performance.</p>
<!--more-->
<p><strong>k-NN Classifier</strong></p>
<ul>
<li><a href="#k-NN-models">k-NN Classifier and Regressor</a> -aka- “k-Nearest Points Classifier”.</li>
</ul>
<p><strong>Linear models for Regression</strong></p>
<ul>
<li><a href="#linear-regressor">Linear Model Classifier and Regressor</a> -aka- “Weighted features Classifier”</li>
<li><a href="#ridge-regressor">Ridge Regression</a> - Linear regressor with regularization.</li>
<li><a href="#lasso-regressor">Lasso Regression</a> - regressor with sparse solution.</li>
</ul>
<p><strong>Linear models for Classification</strong></p>
<ul>
<li><a href="#logistic-regressor">Logistic Regression</a></li>
<li><a href="#linear-SVC">Linear Support Vector Machines</a> or SVC</li>
</ul>
<p><strong>Kernelized Vector Machines</strong></p>
<ul>
<li><a href="#kernelized-SVC">Kernelized SVC</a> - rbf and poly kernels</li>
</ul>
<p><strong>Decision Trees</strong></p>
<ul>
<li><a href="#decision-trees">Tree Classifier</a> - building, visualization and plotting important features</li>
</ul>
<p><strong>Techniques</strong></p>
<ul>
<li><a href="#min-max-scaler">Min-Max Scaler</a> - normalizer</li>
<li><a href="#poly-feature-expansion">Polynomial Feature Expansion technique</a> - feature magnifier</li>
<li><a href="#cross-validation">Cross Validation</a> - train model via several data splits (folds)</li>
</ul>
<h2 id="import-and-initializations">Import and initializations</h2>
<p>Import libraries, read in data, split data</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">notebook</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c">#Read in some dataframe with feature columns and a column of labels.</span>
<span class="n">fruits</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">fruits</span><span class="p">[[</span><span class="s">'height'</span><span class="p">,</span> <span class="s">'width'</span><span class="p">,</span> <span class="s">'mass'</span><span class="p">,</span> <span class="s">'color_score'</span><span class="p">]]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">fruits</span><span class="p">[</span><span class="s">'fruit_label'</span><span class="p">]</span>
<span class="c">#Split the data. By default the split ratio is 75%:25%</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="k-nn-models-">k-NN Models <a id="k-NN-models"></a></h2>
<p>The k-NN classifier is a simple and popular algorithm that can be used for both classification and regression. The algorithm builds decision boundaries between classes, and each prediction is based on the majority vote of the k nearest training points. The number of nearest points is set with the parameter <code class="highlighter-rouge">n_neighbors</code>.</p>
<p>The higher the <code class="highlighter-rouge">n_neighbors=k</code>, the simpler the model.</p>
<p><strong>Best to apply</strong>: predicting objects with a small number of features.</p>
<h3 id="k-nn-classifier">k-NN Classifier</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create classifier object</span>
<span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsClassifier</span>
<span class="c"># Note the n_neighbors parameter, which is key on how accurate the classifier would be.</span>
<span class="n">knn</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="c"># Train the classifier (fit the estimator) using the training data</span>
<span class="n">knn</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># Check the score</span>
<span class="n">knn</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># Estimate the accuracy of the classifier on future data, using the test data</span>
<span class="n">knn</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="c"># Use the trained k-NN classifier model to classify new, previously unseen objects.</span>
<span class="c"># The input must contain one value per training feature (here: height, width, mass, color_score).</span>
<span class="n">fruit_prediction</span> <span class="o">=</span> <span class="n">knn</span><span class="o">.</span><span class="n">predict</span><span class="p">([[</span><span class="mi">20</span><span class="p">,</span> <span class="mf">4.3</span><span class="p">,</span> <span class="mf">5.5</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">]])</span>
<span class="n">lookup_fruit_name</span><span class="p">[</span><span class="n">fruit_prediction</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
</code></pre></div></div>
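<p>A quick sketch of how <code class="highlighter-rouge">n_neighbors</code> controls model complexity (using sklearn's built-in iris dataset, since the fruit data is not bundled with sklearn): a small k fits the training set almost perfectly, while a large k gives a simpler, smoother model.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Sweep k: watch the train score drop as the model gets simpler
for k in [1, 5, 25]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k={:2d}  train acc: {:.3f}  test acc: {:.3f}'.format(
        k, knn.score(X_train, y_train), knn.score(X_test, y_test)))
```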
<h3 id="k-nn-regressor">k-NN Regressor</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsRegressor</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_R1</span><span class="p">,</span> <span class="n">y_R1</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">knnreg</span> <span class="o">=</span> <span class="n">KNeighborsRegressor</span><span class="p">(</span><span class="n">n_neighbors</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">knnreg</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared test score: {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">knnreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h2 id="linear-models-for-regression">Linear Models for Regression</h2>
<p>Linear models are simple and popular, and solve a wide variety of regression problems well. They tend to generalize better than k-NN.</p>
<p>Linear algorithms base their predictions on feature weights computed using different techniques. They can be controlled with l1 or l2 (linear or squared) regularization to improve generalization.</p>
<p>Regularization is a penalty applied to large weights.</p>
<h3 id="linear-regression-">Linear Regression <a id="linear-regressor"></a></h3>
<p>No regularization</p>
<p><strong>Best chosen</strong>: for datasets with a medium number of features.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Import parameters and datasets</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">adspy_shared_utilities</span> <span class="kn">import</span> <span class="n">load_crime_dataset</span>
<span class="c"># Communities and Crime dataset</span>
<span class="p">(</span><span class="n">X_crime</span><span class="p">,</span> <span class="n">y_crime</span><span class="p">)</span> <span class="o">=</span> <span class="n">load_crime_dataset</span><span class="p">()</span>
<span class="c"># Split</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_crime</span><span class="p">,</span> <span class="n">y_crime</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># Train the model</span>
<span class="n">linreg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># Print result prediction of crime rate per capita in the areas based on all features.</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Crime dataset'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'linear model intercept: {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="c"># Prints weights (coefficient) assigned to each feature</span>
<span class="k">print</span><span class="p">(</span><span class="s">'linear model coeff:</span><span class="se">\n</span><span class="s">{}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (test): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h3 id="ridge-regression-">Ridge Regression <a id="ridge-regressor"></a></h3>
<p>Linear Regression with regularization.</p>
<p>Parameters:</p>
<ul>
<li>alpha=1 - defines regularization level. Higher alpha = higher regularization.</li>
</ul>
<p>Requires feature normalization (min-max transformation to 0..1), fitted only on the training data to avoid data leakage.</p>
<p>Can be applied with polynomial feature expansion.</p>
<p><strong>Best chosen</strong>: works well with medium and smaller sized datasets with large number of features</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_crime</span><span class="p">,</span> <span class="n">y_crime</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">linridge</span> <span class="o">=</span> <span class="n">Ridge</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">20.0</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Crime dataset'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'ridge regression linear model intercept: {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linridge</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'ridge regression linear model coeff:</span><span class="se">\n</span><span class="s">{}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linridge</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linridge</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (test): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linridge</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Number of non-zero features: {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">linridge</span><span class="o">.</span><span class="n">coef_</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)))</span>
</code></pre></div></div>
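<p>The polynomial feature expansion mentioned above can be combined with Ridge in a pipeline, so the expansion and scaler are fit on the training data only. A sketch using sklearn's built-in diabetes dataset as a stand-in for the crime data:</p>

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Expand features to degree-2 polynomials, scale to 0..1, then fit Ridge
poly_ridge = make_pipeline(PolynomialFeatures(degree=2),
                           MinMaxScaler(),
                           Ridge(alpha=20.0))
poly_ridge.fit(X_train, y_train)
print('R-squared score (test): {:.3f}'.format(poly_ridge.score(X_test, y_test)))
```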
<h3 id="lasso-regression-">Lasso Regression <a id="lasso-regressor"></a></h3>
<p>Similar to Ridge Regression, but with L1 (linear) regularization applied, so individual weights can shrink exactly to 0, unlike with L2 regularization (squared weights).</p>
<p>Lasso Regression thus yields a “sparse solution”, i.e. keeps only the features of highest importance.</p>
<p>Controls:</p>
<ul>
<li>alpha=1 - defines regularization level. Higher alpha = higher regularization.</li>
</ul>
<p><strong>Best chosen</strong>: when dataset contains a few features with medium/large effect.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_crime</span><span class="p">,</span> <span class="n">y_crime</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">linlasso</span> <span class="o">=</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">max_iter</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Crime dataset'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'lasso regression linear model intercept: {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linlasso</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'lasso regression linear model coeff:</span><span class="se">\n</span><span class="s">{}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linlasso</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Non-zero features: {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">linlasso</span><span class="o">.</span><span class="n">coef_</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linlasso</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (test): {:.3f}</span><span class="se">\n</span><span class="s">'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linlasso</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Features with non-zero weight (sorted by absolute magnitude):'</span><span class="p">)</span>
<span class="c"># Sorts out and presents features by their magnitude</span>
<span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">sorted</span> <span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">X_crime</span><span class="p">),</span> <span class="n">linlasso</span><span class="o">.</span><span class="n">coef_</span><span class="p">)),</span>
<span class="n">key</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">e</span><span class="p">:</span> <span class="o">-</span><span class="nb">abs</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="mi">1</span><span class="p">])):</span>
<span class="k">if</span> <span class="n">e</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\t</span><span class="s">{}, {:.3f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">e</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div></div>
<h2 id="linear-models-for-classification">Linear models for Classification</h2>
<p>Linear models require fewer resources than kernelized SVC, and thus can be very powerful for larger datasets.</p>
<h3 id="logistic-regression-">Logistic Regression <a id="logistic-regressor"></a></h3>
<p>Despite its name, Logistic Regression is a classifier. Multiclass problems are handled by comparing each class against all the others (one-vs-rest), with a linear model performing the binary classification under the hood.</p>
<p>Controls:</p>
<ul>
<li>C parameter, stands for L2 regularization level. Higher C = less regularization.</li>
</ul>
<p><strong>Best Chosen</strong>: popular choice for classification even with large datasets</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="n">X_cancer</span><span class="p">,</span> <span class="n">y_cancer</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_cancer</span><span class="p">,</span> <span class="n">y_cancer</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Breast cancer dataset'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Logistic regression classifier on training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Logistic regression classifier on test set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
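<p>The effect of the C parameter can be seen by sweeping it, here sketched on sklearn's built-in breast cancer dataset (max_iter is raised so the solver converges on this unscaled data):</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, random_state=0)

# Lower C = stronger L2 regularization
for C in [0.01, 1, 100]:
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print('C={:6.2f}  train acc: {:.3f}  test acc: {:.3f}'.format(
        C, clf.score(X_train, y_train), clf.score(X_test, y_test)))
```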
<h3 id="linear-support-vector-machines-">Linear Support Vector Machines <a id="linear-SVC"></a></h3>
<p>Suitable for both binary and multiclass classification problems; binary one-vs-rest classification happens under the hood.
Controls:</p>
<ul>
<li>C parameter, stands for L2 regularization level. Higher C = less regularization.</li>
</ul>
<p><strong>Best chosen</strong>: relatively good with large datasets, fast prediction, sparse data</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="n">X_cancer</span><span class="p">,</span> <span class="n">y_cancer</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_cancer</span><span class="p">,</span> <span class="n">y_cancer</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Breast cancer dataset'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Linear SVC classifier on training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Linear SVC classifier on test set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h2 id="kernelized-support-vector-machines-">Kernelized Support Vector Machines <a id="kernelized-SVC"></a></h2>
<p>Implements different functions under the hood, called “kernels”. The default is the RBF kernel (Radial Basis Function).</p>
<p>Kernel examples: rbf, poly</p>
<p>Controls:</p>
<ul>
<li>gamma=1, the higher the gamma, the less generalization.</li>
<li>C, stands for L2 regularization level. Higher C = less regularization.</li>
</ul>
<p><strong>Best chosen</strong>: powerful classifiers, especially when supplemented with correct parameter tuning.</p>
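<p>For intuition, the RBF kernel is just exp(-gamma * ||x - z||^2): a similarity measure that decays with squared distance, where a higher gamma makes the similarity more local (hence less generalization). A minimal plain-Python sketch, not scikit-learn’s implementation:</p>

```python
import math

def rbf_kernel(x, z, gamma):
    # Similarity decays exponentially with squared distance;
    # gamma sets how fast it decays
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=1.0)   # ~0.135
rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=10.0)  # ~2e-9: far more local
```

With high gamma, only points very close to a training sample look similar to it, which produces tight, complex decision regions.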
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">adspy_shared_utilities</span> <span class="kn">import</span> <span class="n">plot_class_regions_for_classifier</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_D2</span><span class="p">,</span> <span class="n">y_D2</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># The default SVC kernel is radial basis function (RBF)</span>
<span class="n">plot_class_regions_for_classifier</span><span class="p">(</span><span class="n">SVC</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span>
<span class="s">'Support Vector Classifier: RBF kernel'</span><span class="p">)</span>
<span class="c"># Compare decision boundries with polynomial kernel, degree = 3</span>
<span class="n">plot_class_regions_for_classifier</span><span class="p">(</span><span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span> <span class="o">=</span> <span class="s">'poly'</span><span class="p">,</span> <span class="n">degree</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span>
<span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="n">X_train</span><span class="p">,</span>
<span class="n">y_train</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span>
<span class="s">'Support Vector Classifier: Polynomial kernel, degree = 3'</span><span class="p">)</span>
</code></pre></div></div>
<p>Example of SVC with min-max features preprocessed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
<span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Breast cancer dataset (normalized with MinMax scaling)'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'RBF-kernel SVC (with MinMax scaling) training set accuracy: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train_scaled</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'RBF-kernel SVC (with MinMax scaling) test set accuracy: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_scaled</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h3 id="decision-trees-">Decision Trees <a id="decision-trees"></a></h3>
<p>A Decision Tree Classifier builds a tree structure that splits on the most informative features first, ordering features from highest to lowest weight via a split game. Individual decision trees tend to overfit.</p>
<p>Parameters:</p>
<ul>
<li>max_depth - limits decision tree depth, to improve generalization and avoid overfitting</li>
</ul>
<p><strong>Best chosen</strong>: great for classification, especially when used in ensembles. Good with medium number of features.</p>
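<p>The split game can be sketched in plain Python: at each node the tree tries candidate thresholds on a feature and keeps the one minimizing weighted Gini impurity. The helper names below are illustrative, not scikit-learn’s internals:</p>

```python
def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    # Try each observed value as a threshold; return the (threshold, score)
    # pair with the lowest weighted impurity of the two sides
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

best_split([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])  # (2.0, 0.0)
```

Here the threshold 2.0 separates the two classes perfectly (impurity 0.0); a real tree repeats this over all features, recursively, down to max_depth.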
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_iris</span>
<span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeClassifier</span>
<span class="kn">from</span> <span class="nn">adspy_shared_utilities</span> <span class="kn">import</span> <span class="n">plot_decision_tree</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">load_iris</span><span class="p">()</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">iris</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">iris</span><span class="o">.</span><span class="n">target</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">DecisionTreeClassifier</span><span class="p">(</span><span class="n">max_depth</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Decision Tree classifier on training set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Accuracy of Decision Tree classifier on test set: {:.2f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h4 id="visualize-decision-trees">Visualize Decision Trees</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_decision_tree</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">iris</span><span class="o">.</span><span class="n">feature_names</span><span class="p">,</span> <span class="n">iris</span><span class="o">.</span><span class="n">target_names</span><span class="p">)</span>
</code></pre></div></div>
<h4 id="visualize-feature-importances">Visualize Feature Importances</h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">adspy_shared_utilities</span> <span class="kn">import</span> <span class="n">plot_feature_importances</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">4</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">80</span><span class="p">)</span>
<span class="n">plot_feature_importances</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">iris</span><span class="o">.</span><span class="n">feature_names</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Feature importances: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">feature_importances_</span><span class="p">))</span>
</code></pre></div></div>
<h2 id="techniques">Techniques</h2>
<p>Some techniques that complement different models:</p>
<ul>
<li>MinMaxScaler - normalizer</li>
<li>Polynomial Feature Expansion - magnifies features by adding interaction and higher-order terms</li>
<li>Cross Validation - evaluates the model over several train/test folds</li>
</ul>
<h3 id="minmax-scaler-">MinMax Scaler <a id="min-max-scaler"></a></h3>
<p>Normalizes features.</p>
<p>Best applied along with Regularized Linear Regression models (Ridge) and with Kernelized SVC.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_crime</span><span class="p">,</span> <span class="n">y_crime</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">X_train_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
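<p>Under the hood, min-max scaling maps each feature to [0, 1] using the minimum and range learned from the training data only; the same statistics are then reused for the test set. A plain-Python sketch of what fit_transform computes (illustrative helper names):</p>

```python
def minmax_fit(col):
    # Learn the feature's minimum and range from training data only
    lo = min(col)
    rng = (max(col) - lo) or 1.0  # guard against constant features
    return lo, rng

def minmax_transform(value, lo, rng):
    # Map a value into [0, 1] relative to the training range
    return (value - lo) / rng

train_col = [10.0, 20.0, 30.0]
lo, rng = minmax_fit(train_col)
scaled = [minmax_transform(v, lo, rng) for v in train_col]
# scaled == [0.0, 0.5, 1.0]; test data must reuse the SAME lo and rng
```

This is why the snippet above calls fit_transform on X_train but only transform on X_test: refitting on the test set would leak information.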
<h3 id="polynomial-feature-expansion-technique-">Polynomial feature expansion Technique <a id="poly-feature-expansion"></a></h3>
<p>Magnifies features by adding interaction and higher-order terms.</p>
<p>Use polynomial features in combination with regression that has a regularization penalty, like <strong>ridge regression</strong>. Applied on initial dataset.</p>
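<p>For intuition, a degree-2 expansion of two features x0, x1 produces the terms 1, x0, x1, x0^2, x0*x1, x1^2, which matches the column ordering PolynomialFeatures(degree=2) uses for two input features. A plain-Python sketch with an illustrative helper name:</p>

```python
def poly2(a, b):
    # Degree-2 expansion of two features:
    # bias term, linear terms, then all quadratic terms
    return [1.0, a, b, a * a, a * b, b * b]

poly2(2.0, 3.0)  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

The feature count grows quickly with degree and input width, which is one reason the expansion is usually paired with a regularized model such as ridge regression.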
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">PolynomialFeatures</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_F1</span><span class="p">,</span> <span class="n">y_F1</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">linreg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'linear model coeff (w): {}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'linear model intercept (b): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'R-squared score (test): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Now we transform the original input data to add</span><span class="se">\n\
</span><span class="s">polynomial features up to degree 2 (quadratic)</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">poly</span> <span class="o">=</span> <span class="n">PolynomialFeatures</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">X_F1_poly</span> <span class="o">=</span> <span class="n">poly</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_F1</span><span class="p">)</span>
<span class="c">#print('printing original X_F1 of len {}: {}'.format(len(X_F1),X_F1))</span>
<span class="c">#print('printing poly-transformed X_F1 of len {}: {}'.format(len(X_F1_poly)X_F1_poly))</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_F1_poly</span><span class="p">,</span> <span class="n">y_F1</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">linreg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2) linear model coeff (w):</span><span class="se">\n</span><span class="s">{}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2) linear model intercept (b): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2) R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2) R-squared score (test): {:.3f}</span><span class="se">\n</span><span class="s">'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Addition of many polynomial features often leads to</span><span class="se">\n\
</span><span class="s">overfitting, so we often use polynomial features in combination</span><span class="se">\n\
</span><span class="s">with regression that has a regularization penalty, like ridge</span><span class="se">\n\
</span><span class="s">regression.</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X_F1_poly</span><span class="p">,</span> <span class="n">y_F1</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">linreg</span> <span class="o">=</span> <span class="n">Ridge</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2 + ridge) linear model coeff (w):</span><span class="se">\n</span><span class="s">{}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">coef_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2 + ridge) linear model intercept (b): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">intercept_</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2 + ridge) R-squared score (training): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'(poly deg 2 + ridge) R-squared score (test): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">linreg</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)))</span>
</code></pre></div></div>
<h3 id="cross-validation-technique-">Cross Validation Technique <a id="cross-validation"></a></h3>
<p>Gives a more reliable performance estimate by evaluating the model on several splits (folds) of the dataset. The overall score is the mean of the per-fold scores.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">X_fruits_2d</span><span class="o">.</span><span class="n">as_matrix</span><span class="p">()</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">y_fruits_2d</span><span class="o">.</span><span class="n">as_matrix</span><span class="p">()</span>
<span class="n">cv_scores</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Cross-validation scores (3-fold):'</span><span class="p">,</span> <span class="n">cv_scores</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Mean cross-validation score (3-fold): {:.3f}'</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_scores</span><span class="p">)))</span>
</code></pre></div></div>
<h4 id="a-note-on-performing-cross-validation-for-more-advanced-scenarios">A note on performing cross-validation for more advanced scenarios.</h4>
<p>In some cases (e.g. when feature values have very different ranges), we’ve seen the need to scale or normalize the training and test sets before use with a classifier. The proper way to do cross-validation when you need to scale the data is <em>not</em> to scale the entire dataset with a single transform, since this indirectly leaks information about the whole dataset, including the test data, into the training data. Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately. The easiest way to do this in scikit-learn is to use <em>pipelines</em>. While pipelines are beyond the scope of this cheatsheet, further information is available in the scikit-learn documentation here:</p>
<p>http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html</p>
<p>or the Pipeline section in the recommended textbook: Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (O’Reilly Media).</p>
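<p>The per-fold principle above can be illustrated without pipelines: compute the scaler statistics on each fold’s training portion only, never on the held-out fold. A minimal plain-Python sketch (kfold_indices is an illustrative helper, not scikit-learn’s splitter):</p>

```python
def kfold_indices(n, k):
    # Split range(n) into k contiguous folds of (near-)equal size
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

data = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]
for test_fold in kfold_indices(len(data), 3):
    train = [data[i] for i in range(len(data)) if i not in test_fold]
    # Scaler statistics come from the training portion ONLY;
    # the held-out fold never influences them (no leakage)
    lo, hi = min(train), max(train)
    scaled_test = [(data[i] - lo) / (hi - lo) for i in test_fold]
```

A scikit-learn Pipeline packages exactly this discipline: when passed to cross_val_score, the scaler inside the pipeline is refit on each fold’s training data automatically.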
<h2 id="thanks">Thanks</h2>
<p>Great! Hope this cheatsheet was helpful.</p>
<p>Based on handouts from the Coursera specialization <a href="https://www.coursera.org/specializations/data-science-python">Applied Data Science with Python</a> by the University of Michigan.</p>
<p>Have a nice day ;)</p>A useful cheatsheet of Machine Learning Algorithms, with brief descriptions of best applications along with code examples. The cheatsheet lists various models as well as a few techniques (at the end) to complement model performance.NYC Best School Districts2017-09-21T00:00:00+00:002017-09-21T00:00:00+00:00/project/2017/09/21/NYC_schools_exploration<p>My NYC Schools Exploration project is complete, with interesting results and a great visual. I’ve produced quite a number of lines of code, all for the sake of finding out how good matplotlib can be at drawing visuals. Turns out it is quite capable!</p>
<p>Scroll to the very bottom to marvel at the data-mined results wrapped up in a nice visual!</p>
<p>This research addresses the following questions:</p>
<ul>
<li>Determining whether there’s a correlation between class size and SAT scores</li>
<li>Figuring out which neighborhoods have the best schools
<ul>
<li>
<p>In combination with a dataset containing property values for NY districts, we could find the least expensive neighborhoods that have good schools.</p>
<!--more-->
</li>
</ul>
</li>
<li>Investigating the differences between parent, teacher, and student responses to surveys.</li>
<li>Assigning scores to schools based on sat_score and other attributes.</li>
</ul>
<p>View the <a href="http://nbviewer.jupyter.org/github/SilverSurfer0/dataquest/blob/master/solutions/NYC_schools_exploration.ipynb#">full code and all artifacts on nbviewer</a>.</p>
<p><em>For readability, this blog post contains only the Findings.</em></p>
<h3 id="findings">Findings</h3>
<p><img src="/images/NYC_schools_exploration_files/NYC_schools_exploration_84_1.png" alt="png" /></p>
<p>Manhattan is by far the most expensive borough in NYC in terms of apartment prices, while its schools’ SAT scores are decent but not especially impressive.</p>
<p><strong>Now, the main question</strong>: which NYC School Districts are not so expensive to live in and still have decent schools according to SAT scores?</p>
<ol>
<li><strong>Brooklyn</strong> is by far the best according to the map. Apartment prices are at the lower end of the spectrum, and SAT scores are the highest.</li>
<li><strong>Staten Island</strong> seems to be a ‘school haven’! With the lowest real-estate prices, the schools in Staten Island hit the upper SAT score level!</li>
<li>Some districts in the <strong>Queens</strong> borough are a fairly good choice, with reasonable apartment prices and well-performing schools.</li>
</ol>
<p>Additionally: there is no clear correlation between how expensive a district is and school performance.</p>
<h3 id="wrap-up">Wrap up</h3>
<p><em>Thanks for checking out this article!</em></p>
<p>Additionally, I want to highlight that the averages for SAT scores and Property Values were calculated using trim_mean, which is a good way to ensure no skewing happens due to extremely low/high outliers (a trimmed mean basically drops the lowest and highest 10% of values).</p>
<p>It was more of an exploration of matplotlib, basemap capabilities and those shapefiles. I think the result is quite nice, though there is so much room for improvement.</p>
<p>It is absolutely true that such graphs can be built more easily with other libraries or even online tools, which would be much faster and could even provide interactivity.</p>
<p>Sources:</p>
<ul>
<li>http://brandonrose.org/pythonmap (thanks!)</li>
<li>http://sensitivecities.com/so-youd-like-to-make-a-map-using-python-EN.html#.Wb-9Z8gjFaR</li>
<li>http://shallowsky.com/blog/programming/plotting-election-data-basemap.html</li>
<li>https://data.cityofnewyork.us/Education/School-Districts/r8nu-ymqj/data</li>
</ul>My NYC Schools Exploration project is complete, with interesting results and a great visual. I’ve produced quite a number of lines of code, all for the sake of finding out how good matplotlib can be at drawing visuals. Turns out it is quite capable! Scroll to the very bottom to marvel at the data-mined results wrapped up in a nice visual! This research addresses the following questions: Determining whether there’s a correlation between class size and SAT scores Figuring out which neighborhoods have the best schools In combination with a dataset containing property values for NY districts, we could find the least expensive neighborhoods that have good schools.