Python vs R in 2019

Which is best for Machine Learning in the Cloud?

Working with two new languages has been intriguing. Being able to compare both while performing the same tasks has given me great insight into how intuitive each language is, and how well suited each is to Machine Learning overall.

That got me thinking: in a commercial setting, which one is best?

So I set out to find an answer.

Good old, dependable Python. It's been around for decades and, at the time of writing, sits at version 3.7.

Both have snazzy websites.

Admittedly, I’d never really heard much of R before I started my Machine Learning journey. I even considered R to be the new-kid-on-the-block… How wrong was I?!

In fact, R was conceived in the early '90s; its first stable release came in 2000.

R was developed primarily for statistical computing and graphics. Built largely on C and Fortran, its performance is (at least) comparable to that of Python.

There is a growing community around R, and some excellent libraries for Machine Learning.

So what are they like to work with?

There are some key differences to both languages which have a notable impact on how well each language is suited to Machine Learning. The most notable difference is arguably that of the available Data Types both have to offer.

Both languages have familiar scalar types catering for numbers, strings and logical values (booleans). However, since we're mostly dealing with row data — Data Frames in Machine Learning land — we're more interested in what each has to offer here.

Vectors consist of any number of scalars that must all be of the same type. R has vectors out-of-the-box, but Python relies on lists, which allow mixed types.

Matrices are used often in Machine Learning, particularly sparse matrices. In Python we can create a matrix with the help of Numpy. R can handle matrices natively.
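Neither snippet appears in the article, but a minimal sketch of both points (assuming NumPy is installed) might look like this:

```python
import numpy as np

# A plain Python list allows mixed types; an R vector would not.
mixed = [1, "two", True]

# A NumPy array is closer to an R vector: every element shares one dtype.
vec = np.array([1, 2, 3])

# NumPy also gives us matrices, which R handles natively via matrix().
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(vec.dtype, m.shape)
```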

Data Frames are structured tables of data, where data in each column is of the same type. The Pandas library brings Data Frames to Python, whereas R has this data type baked in.
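To make that concrete, here is a minimal pandas sketch (the column names are invented for illustration):

```python
import pandas as pd

# A pandas DataFrame: each column holds values of a single type,
# much like R's built-in data.frame.
df = pd.DataFrame({
    "country": ["France", "Spain", "Germany"],
    "age": [44, 27, 30],
})
print(df.dtypes)  # country is object (strings), age is an integer column
```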

Factors are very specific to the data science world. When dealing with categorical data like ["ketchup", "mayo", "mayo", "ketchup"] we often need to encode it numerically: [1, 2, 2, 1]. But we must be sure our computations don't treat a factor's value as numerically significant. That is to say, "mayo" is not greater than "ketchup". R gives us an elegant way to do this with factors. In Python, libraries are relied upon to deal with these nominal values.
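One such library option is pandas, whose Categorical type is a rough analogue of R's factors. A hedged sketch (note the integer codes are 0-based, unlike the 1-based example above):

```python
import pandas as pd

# Categorical stores the distinct levels once and encodes each value
# as an integer code, without implying any numeric ordering.
condiments = pd.Categorical(["ketchup", "mayo", "mayo", "ketchup"])
print(list(condiments.categories))  # ['ketchup', 'mayo']
print(list(condiments.codes))       # [0, 1, 1, 0]
```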

Python is clearly very capable, but it gets a lot of functionality from libraries. R is pre-compiled with these data types already inside. That means all libraries in R know about the same structures. There may also be performance implications, albeit small, when running computations on very large datasets.

In R we'll write a simple script to do the import. Note that we could use the native read.csv() to get the dataset, but the excellent readr library is more efficient;

And the equivalent one in Python;
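That Python script isn't reproduced here, but it presumably used pandas. A minimal sketch, with "Data.csv" as an invented placeholder filename (the sketch writes a tiny stand-in file so it can run on its own):

```python
import pandas as pd

# "Data.csv" is a placeholder; the article doesn't name its dataset.
# Create a tiny stand-in file so the sketch is self-contained.
with open("Data.csv", "w") as f:
    f.write("Country,Age\nFrance,44\nSpain,27\n")

dataset = pd.read_csv("Data.csv")
print(dataset.shape)  # rows x columns
```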

Now, in the interest of being fair (I noticed Python was considerably faster on second runs), we'll compare them with the time command over 10 iterations and take an average;
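The timing harness isn't shown in the article; a rough sketch of the approach might look like the following, where `sleep 0.1` stands in for the actual import script:

```shell
# Run a command 10 times and average the wall-clock time.
# "sleep 0.1" is a placeholder for the real script under test.
total=0
for i in $(seq 1 10); do
  start=$(date +%s.%N)
  sleep 0.1
  end=$(date +%s.%N)
  total=$(awk "BEGIN {print $total + $end - $start}")
done
avg=$(awk "BEGIN {printf \"%.3f\", $total / 10}")
echo "average: ${avg}s"
```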

Well that’s pretty close, both complete the task in around 1.38s. But as I mentioned before, Python seems to be slower on the first run by about 50% again.

Now the second column of this dataset is for country, which for some models will need to be encoded so we can work with it. Let’s perform this task in both languages;

In R we can do it like this;

And in Python we’ll use the excellent sklearn;
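The snippet itself was lost in extraction, but with sklearn this plausibly used LabelEncoder. A sketch with made-up country values standing in for the dataset's column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the article's country column.
df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# LabelEncoder maps each distinct label to an integer (alphabetical order).
encoder = LabelEncoder()
df["Country"] = encoder.fit_transform(df["Country"])
print(df["Country"].tolist())  # France=0, Germany=1, Spain=2
```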

Remember when we looked at data types? We saw that R has factors built in. Now, how easy was that to encode our country column?

Python can do this with the sklearn library. Let's see how they perform;

Hang on, the R script is quicker than the first run? What?

I ran this a few times, and it's not actually quicker; it's about the same. The difference in time is down to the gap between tests, general system usage and so on. My bad.

But the tests are relative to each other, so we can still compare. R essentially takes no extra time to encode the nominal column country. Python on the other hand does take a little extra time, around 0.506s in fact.

The bread-and-butter of Machine Learning, splitting data into a test and training set is next up in our comparison. It’s fairly typical to take 20% for a test set, so we’ll do that.

In R;

And Python;
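The Python snippet isn't reproduced here, but it likely used sklearn's train_test_split. A sketch over placeholder data, also showing that a fixed random_state makes the split reproducible:

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the dataset's rows.
X = list(range(10))

# 20% held out as a test set; random_state fixes the 'randomness'.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 train rows, 2 test rows

# The same seed yields the same split on a second run.
X_train2, X_test2 = train_test_split(X, test_size=0.2, random_state=42)
assert X_test == X_test2
```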

We're using libraries in both languages here. An important point to note is that the seed is set globally in R, whereas in Python it is more often passed to individual methods.

Setting the seed is crucial when splitting data: by using a seed we guarantee the same 'randomness' across multiple runs.

And the results;

There's nothing in it here: R only takes an extra 0.088s on average, while Python takes around 0.028s on average for the same task.

Both languages have many libraries that offer specific bundles of functionality. They are pretty hard to compare, so I won’t.

What we can see — despite including Scala which we’re not looking at here — is that the top Python packages contain more contributors and have more commits than the top R packages.

On the face of things that doesn’t tell us too much, but we can make several assumptions;

It would be really interesting to see the infographic above presented as year-on-year change. I suspect that we would see an exponentially increasing curve for R in terms of contributors and commits.

Both languages have interfaces into Tensorflow. It's important to note that Tensorflow itself is not written in Python, but it is most commonly used from Python.

We know that Visualisation is an incredibly important part of Data Science.

So how do Python and R measure up when it comes to quickly visualising data?

There are a plethora of charts, graphs and diagrams that can be generated from all manner of data. You can check out some awesome visualisations from contributors to both languages here;

My only experience so far is with matplotlib in Python and ggplot2 in R. But I found ggplot2 marginally easier and more intuitive to use.

Before we move on to other sections, there’s something I discovered about for loops in R;

I was most worried about this section when I set out, mainly because I had a hunch R wouldn't run anywhere at scale. However, it was surprising and interesting to see growing support for R in the cloud.

The Cloud ML product from Google seems to be built for use with Python. Most examples of training models in the cloud (that I can find) are provided in Python.

Another point to note is that R usage on Google Cloud seems to support only Tensorflow operations; I could not find any information about other R libraries. I also couldn't find anything about GPU support, but I suspect the Python paths will be better tested and utilised.

So it seems as though you can do what you like within reason, although you would expect ultimate performance to come from AWS-optimised containers.

Getting started and example pages still focus on Python though.

From a business perspective, when choosing something as fundamental as a language you would consider how easy it will be to hire and how much it’ll cost you. I set out to understand this from a recruitment perspective by trying to answer these questions;

Very quickly it became apparent that there is no distinction to be made here.

Why? Well simply because the commodity of a Data Scientist is not their skill in any given language, but more their aptitude with data and how to get the best out of it.

Bear in mind this survey is from 2016…

So clearly if you’re looking to employ a Data Scientist, (programming) language needn’t be a barrier.

Now, you didn’t need an algorithm to predict that there was going to be a section titled Conclusion, did you?

I set out here to answer a question;

We should probably expand on what a ‘commercial setting’ is.

As the Machine Learning landscape continues to explode at an alarming rate, a lot of companies are looking at it thinking we should be doing this.

But also there are data-rich startups without the baggage of maintaining dreaded legacy systems.

And finally, there are the companies already doing ML; perhaps they have been doing it for years, but they're looking to move to the cloud.

So it makes sense to consider the best choice for these three categories of company;

So here we go;

Those looking to incorporate ML into their workflow have perhaps the most difficult decision to make.

If you’re looking to train existing employees on how to pre-process data and feed models then it is worth considering how easy each language is to learn. R is strong here because it’s designed with statistics in mind.

Python on the other hand probably has better support in the cloud and more expertise available on places like Stack Overflow.

The choice also depends on how much you will be required to graph data; the plots from ggplot2 in R are report quality.

For most people though, I expect Python will be the most sensible choice; it's clearly the most popular option in the cloud. Established businesses will prefer Python as the less risky approach.

With a totally clean slate, startups can do what they want.

I see startups a bit like gambling on the Stock Exchange — with big risk comes big reward… sometimes.

You could say that choosing R is a high-risk manoeuvre. It may be phased out of the cloud for example, if adoption is low enough. On the other hand it could pay off if support becomes as embedded as Python and performance is better.

Python though is probably a safer choice, and that may be better for a startup. After all, isn’t starting a company risky enough?

For companies already doing ML it makes most sense to stick to what they’re using now.

Both languages will run on cloud, maybe providers support one better than the other but you can choose between them. Importantly, there is probably not enough performance difference to worry about.

The most important thing for your business will be continuity, using the same language with a different workflow.

All of the above is my take, based on my research and my own experience. I'd be interested to hear other perspectives: what would you use, and why?

Also if you have first hand experience in either or both languages then please share your experience.

Thanks for reading, and happy learning!
