Data Science for Startups: Data Pipelines

Source: TheDigitalArtist at pixabay.com. Part three of my ongoing series about building a data science discipline at a startup. You can find links to all of the posts in the introduction. Building data pipelines is a core component of data science at a st…

Launch with AI in 1 week or less

https://www.reddit.com/r/spaceporn/comments/81q31g/twin_engine_atlas_v_with_four_solid_rocket_motors/

Whether you’re a new startup or an existing business, here’s one way you can get an AI-enabled product or service into production in 1 week or less. An…

Friendlier data labelling using generated Google Forms

Manually labelling data is nobody’s favourite machine learning chore. You needn’t worry about asking others to help out, though, provided you can give them a pleasant tool for the task. Let me present to you: generated Google Forms using Google App Script!

Google App Script allows you to build automation between Google Apps

The usual way people label data is just by typing the labels into a spreadsheet. I would normally do this as well; however, in a recent task I needed to label paragraphs of text. Have you ever tried to read paragraphs of text in a spreadsheet? It’s hell! Luckily, whilst trying to figure out a way to make the labelling process less gruelling, I came across a way of auto-generating a form based on data in a spreadsheet document using Google App Script.

Nasty! Nobody wants to strain their eyes trying to read documents in spreadsheet cells!

Creating the script that will generate our Form

To get started we just jump into the App Script editor from within the Google Spreadsheet containing the data we want to gather labels for:

Opening the App Script editor from a Google Spreadsheet

Using App Script (pssst! it’s just JavaScript) we can read the spreadsheet data and send commands to other Google Apps (in this case, Google Forms).

What’s great about using Forms for labelling is that you can guarantee consistency in the user input by specifying the data input type. For example:

Number range:

form.addScaleItem()
.setTitle(dataToLabel)
.setBounds(1, 10)
.setRequired(true);

Binary label:

// A single checkbox the labeller either ticks or leaves blank.
var item = form.addCheckboxItem();
item.setTitle(dataToLabel)
  .setChoices([
    item.createChoice('Is a cat')
  ]);

Multi-class label:

// One choice per class; the labeller picks exactly one.
var item = form.addMultipleChoiceItem();
item.setTitle(dataToLabel)
  .setChoices([
    item.createChoice('Cats'),
    item.createChoice('Dogs'),
    item.createChoice('Fish')
  ]);

See the details for more input types in the App Script API docs (or just look at the different input types when manually creating a Google Form).

You can grab the script I have used to generate a Form for labelling text documents with numbers 0 to 10 from my Github:

ZackAkil/friendlier-data-labelling

After you have your script written (or copied and pasted), select your script’s entry-point function and run it! Warning: you’re probably going to have to jump through a few authorisation hoops the first time you do it.

Make sure to select the entry point function of the script before running.

Using the generated Form

After the script has run, you can head over to your Google Forms and there you should find a brand new Form! You can send the Form to whoever you want to do the labelling:

Finally you can send your labellers a convenient link to a familiar Google Form that they can use to carry out the labelling task.

Accessing the data labels

After the labelling is done, you can then just view the labels as a spreadsheet and export as a CSV:

It’s pretty straightforward to get the labels out as a CSV.

Hopefully this saves you a bit of headache in your future machine learning efforts!

The full script and dataset used in this article can be found on my Github:

ZackAkil/friendlier-data-labelling



The AGI/Deep Learning Connection

Artificial General Intelligence

An amazing course on AGI at MIT by Lex Fridman, one of my favourite lecturers, is about to begin (or might already have kicked off by the time this article is posted), so I felt like writing about the very topic I have been reading about for quite a few months now.

“Almost all young people working on Artificial Intelligence look around and say – What’s popular? Statistical learning. So I’ll do that. That’s exactly the way to kill yourself scientifically.”

– Marvin Minsky during his course called Society of Mind at MIT in 2011.

Marvin Minsky, the famous American cognitive scientist and co-founder of MIT’s AI Laboratory, never accepted an overly simple approach towards AGI, or towards replicating the functionality of the brain for that matter. But we still can’t deny the progress deep learning has brought to the field. The brain may well not do anything like gradient descent, but it is unfair to dismiss deep learning on that basis.

There is much about the brain that we cannot capture when trying to create a replica of it. Deep learning will very likely prove to be an essential component of truly intelligent machines, but probably not a sufficient one on its own.

The core idea behind Ben Goertzel’s work is something the vast majority of curious minds would have thought of long ago: programmatically designing human faculties as components of an AGI, using the concept of cognitive synergy, to create intelligence close to that of humans, if not on par with it. But no one had really brought that thought to reality until he did. There are many other great researchers worth mentioning, such as Marcus Hutter (AIXI) and Pei Wang (NARS), to name a few.

Despite its fair share of flaws, and probably without the funding available to tech giants like Google, Facebook and Tesla, OpenCog’s contribution towards creating AGI simply cannot be ignored.

Guess what? Even Ben’s work uses deep learning! There are numerous reasons why deep learning cannot be done away with. Geoff Hinton himself has said there is a need to rethink backpropagation, but no one can deny that it has been a huge success. Given its use in a plethora of applications, maybe someone will eventually come up with an even more sensible alternative to deep learning, and to backpropagation in particular.

People who haven’t watched the movie ‘Ex Machina’ should — 
i) Skip to the next paragraph
ii) Watch the movie very soon!

I vividly remember the scene where Ava is running towards her own creator Nathan to kill him. It does send chills through the viewer’s body.

Such things have been shown time and time again in various sci-fi movies, and have been echoed by very well-known people in the field who are warning us about AI. What if our creations become the cause of our death? This brings up the need to build coexistence and ‘benevolence’ into robots so that they don’t fear us. But then again, there have been strong claims that AI today hasn’t even reached the level of intelligence of a mouse.

I believe strongly in Ben Goertzel’s idea of the need for cognitive synergy. What’s that? It is the combination of different components, each designed to be intelligent, into a cognitive system in which they help each other carry out their respective tasks, so that the system can be called truly intelligent regardless of whether it has faced a particular situation in the real world before.

One might take the example of transfer learning, wherein just the last few layers are replaced and trained while most of the model remains the same.

One could also imagine using this for several tasks that need prediction. Based on the system’s requirements, a model could learn which situation calls for which type of final layers, so that each task is completed with greater efficiency than a human would manage (because if not, then umm… what’s the point?).
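To make the transfer-learning idea above concrete, here is a minimal sketch in Python with Keras. The choice of base network, the three classes, and the placeholder training data are my own illustrative assumptions, not anything from the original post.

import tensorflow as tf

# Load a network pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')

# Freeze the base: most of the model stays exactly as it was trained.
base.trainable = False

# Swap in new "last few layers" for our own task (here: 3 hypothetical classes).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# train_images / train_labels stand in for your own labelled data:
# model.fit(train_images, train_labels, epochs=5)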

Yann LeCun’s post about Sophia

I obviously cannot vouch for how real Sophia actually is, how close to intelligence this Twitter-using robot is, or how many of its functions it actually performs itself rather than someone on the developer team, but I do genuinely appreciate the idea and the effort put into implementing it, irrespective of the end product and its viability.

Now that I’ve mentioned Yann LeCun’s post, I would also like to refer to the famous debate between him and Gary Marcus (@GaryMarcus). One of his statements in response to Gary’s views, which I personally found the most important —

“Does AI need more innate machinery? The answer probably is yes, but the answer is also: not as much as Gary thinks.”

- Yann LeCun, FAIR and NYU

While Gary brought up some very interesting points for everyone to think about, there are moments when they start to sound extreme. The Facebook AI Director, on the other hand, was very calm and tackled all arguments very sensibly.

Ali Rahimi’s view on Gary Marcus’s Paper on drawbacks of Deep Learning

It is true that the things both of these highly respected individuals agree on are very fundamental problems with the current state of AI worldwide, but Yann LeCun’s stance on the debate topic was the more acceptable and logical one, while Gary Marcus was criticized for certain remarks that he made even in his recent papers.

One of the most well-received views on Gary’s paper, by Thomas Dietterich

I’m definitely not in any reputed position to comment, to throw opinions around, or to take sides on great thought processes that were born decades before I was, but consider these the thoughts of someone who has been closely following the work of the pioneers of the field.

I simply believe that if AGI can be created one day, then deep learning will definitely have a vital role to play in its functioning. Companies like OpenAI inspire me and many others to believe that putting hours into research, and into thinking about applications of AI, will yield amazing results that revolutionize the way we live our lives.

It will be something that combines different fields of study, such as neuroscience, philosophy, mathematics, physics and computer science, which together will contribute to a masterpiece that may prove to be the best creation of mankind.

Indeed, the field of AI is split into various groups that believe in different ways of approaching the problem of AGI, and none of them seems incorrect.

These groups (with symbolic, behavioral, or other approaches towards AGI), with their different ideologies and their fair shares of drawbacks, need to be combined in such a way that each nullifies the others’ drawbacks, which again brings us back to the idea of cognitive synergy. You see, none of the ideas can be ignored completely. Every approach, every attempt, every single line of code written towards a successful implementation of AGI is important.

At the same time, principles for the governance of AGI by nations should be laid down, and integrated into the machines themselves, so that they cannot be used for the wrong purposes. That is exactly what ethical AI is about.

Several questions like this loom over the concept of AGI and need to be answered in order to achieve even more substantial and ground-breaking outcomes in the area. Let’s hope that if AGI is possible, we are all on the right track towards it, and if not, that we at least eventually find the right way towards humans and robots coexisting to make the world a better place for us and for the generations to come!

I am sharing a couple of links and videos that people interested in learning about AGI should definitely check out –

Other articles that you might like — https://medium.com/@raksham_p

Stay tuned for more posts on Artificial Intelligence coming up very soon! 🙂



Overfitting vs. Underfitting: A Complete Example

Exploring and solving a fundamental data science problem

When you study data science you come to realize there are no truly complex ideas, just many simple building blocks combined together. A neural network may seem extremely advanced, but it’s really …

What’s the difference between data science, machine learning, and artificial intelligence?

When I introduce myself as a data scientist, I often get questions like “What’s the difference between that and machine learning?” or “Does that mean you work on artificial intelligence?” I’ve responded enough times that my answer easily qualifies for my “rule of three”:

When you’ve written the same code 3 times, write a function

When you’ve given the same in-person advice 3 times, write a blog post

— David Robinson (@drob) November 9, 2017

The fields do have a great deal of overlap, and there’s enough hype around each of them that the choice can feel like a matter of marketing. But they’re not interchangeable: most professionals in these fields have an intuitive understanding of how particular work could be classified as data science, machine learning, or artificial intelligence, even if it’s difficult to put into words.

So in this post, I’m proposing an oversimplified definition of the difference between the three fields:

  • Data science produces insights
  • Machine learning produces predictions
  • Artificial intelligence produces actions

To be clear, this isn’t a sufficient qualification: not everything that fits each definition is a part of that field. (A fortune teller makes predictions, but we’d never say that they’re doing machine learning!) These also aren’t a good way of determining someone’s role or job title (“Am I a data scientist?”), which is a matter of focus and experience. (This is true of any job description: I write as part of my job but I’m not a professional writer).

But I think this definition is a useful way to distinguish the three types of work, and to avoid sounding silly when you’re talking about it. It’s worth noting that I’m taking a descriptivist rather than a prescriptivist approach: I’m not interested in what these terms “should mean”, but rather how people in the field typically use them.

Data science produces insights

Data science is distinguished from the other two fields because its goal is an especially human one: to gain insight and understanding. Jeff Leek has an excellent definition of the types of insights that data science can achieve, including descriptive (“the average client has a 70% chance of renewing”), exploratory (“different salespeople have different rates of renewal”), and causal (“a randomized experiment shows that customers assigned to Alice are more likely to renew than those assigned to Bob”).

Again, not everything that produces insights qualifies as data science (the classic definition of data science is that it involves a combination of statistics, software engineering, and domain expertise). But we can use this definition to distinguish it from ML and AI. The main distinction is that in data science there’s always a human in the loop: someone is understanding the insight, seeing the figure, or benefitting from the conclusion. It would make no sense to say “Our chess-playing algorithm uses data science to choose its next move,” or “Google Maps uses data science to recommend driving directions”.

This definition of data science thus emphasizes:

  • Statistical inference
  • Data visualization
  • Experiment design
  • Domain knowledge
  • Communication

Data scientists might use simple tools: they could report percentages and make line graphs based on SQL queries. They could also use very complex methods: they might work with distributed data stores to analyze trillions of records, develop cutting-edge statistical techniques, and build interactive visualizations. Whatever they use, the goal is to gain a better understanding of their data.
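As a small, hedged illustration of the “simple tools” end of that spectrum (echoing the renewal examples earlier), here is a Python/pandas sketch; the column names and numbers are hypothetical.

import pandas as pd

# Hypothetical client records: which salesperson handled each account
# and whether the client renewed.
clients = pd.DataFrame({
    'salesperson': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'renewed':     [True,    True,    False, True,  False],
})

# Descriptive insight: the overall renewal rate.
print('Overall renewal rate:', clients['renewed'].mean())

# Exploratory insight: do different salespeople have different renewal rates?
print(clients.groupby('salesperson')['renewed'].mean())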

Machine learning produces predictions

I think of machine learning as the field of prediction: of “Given instance X with particular features, predict Y about it”. These predictions could be about the future (“predict whether this patient will go into sepsis”), but they also could be about qualities that aren’t immediately obvious to a computer (“predict whether this image has a bird in it”). Almost all Kaggle competitions qualify as machine learning problems: they offer some training data, and then see if competitors can make accurate predictions about new examples.

There’s plenty of overlap between data science and machine learning. For example, logistic regression can be used to draw insights about relationships (“the richer a user is the more likely they’ll buy our product, so we should change our marketing strategy”) and to make predictions (“this user has a 53% chance of buying our product, so we should suggest it to them”).
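To illustrate that dual use of logistic regression, here is a hedged Python/scikit-learn sketch; the single income feature and the tiny synthetic dataset are assumptions made purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: user income (in $1000s) and whether they bought the product.
income = np.array([[20], [35], [50], [65], [80], [95], [110], [125]])
bought = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(income, bought)

# Insight (data science): a positive coefficient suggests richer users
# are more likely to buy.
print('Coefficient on income:', model.coef_[0][0])

# Prediction (machine learning): probability that a specific user buys.
print('P(buy | income = 70k):', model.predict_proba([[70]])[0][1])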

Models like random forests have slightly less interpretability and are more likely to fit the “machine learning” description, and methods such as deep learning are notoriously challenging to explain. This could get in the way if your goal is to extract insights rather than make predictions. We could thus imagine a “spectrum” of data science and machine learning, with more interpretable models leaning towards the data science side and more “black box” models on the machine learning side.

[source](https://xkcd.com/1838/)

Most practitioners will switch back and forth between the two tasks very comfortably. I use both machine learning and data science in my work: I might fit a model on Stack Overflow traffic data to determine which users are likely to be looking for a job (machine learning), but then construct summaries and visualizations that examine why the model works (data science). This is an important way to discover flaws in your model, and to combat algorithmic bias. This is one reason that data scientists are often responsible for developing machine learning components of a product.

Artificial intelligence produces actions

Artificial intelligence is by far the oldest and the most widely recognized of these three designations, and as a result it’s the most challenging to define. The term is surrounded by a great deal of hype, thanks to researchers, journalists, and startups who are looking for money or attention.

When you’re fundraising, it’s AI
When you’re hiring, it’s ML
When you’re implementing, it’s linear regression
When you’re debugging, it’s printf()

— Baron Schwartz (@xaprb) November 15, 2017

This has led to a backlash that strikes me as unfortunate, since it means some work that probably should be called AI isn’t described as such. Some researchers have even complained about the AI effect: “AI is whatever we can’t do yet”.1 So what work can we fairly describe as AI?

One common thread in definitions of “artificial intelligence” is that an autonomous agent executes or recommends actions (e.g. Poole, Mackworth and Goebel 1998; Russell and Norvig 2003). Some systems I think should be described as AI include:

  • Game-playing algorithms (Deep Blue, AlphaGo)
  • Robotics and control theory (motion planning, walking a bipedal robot)
  • Optimization (Google Maps choosing a route)
  • Natural language processing (bots2)
  • Reinforcement learning

Again, we can see a lot of overlap with the other fields. Deep learning is particularly interesting for straddling the fields of ML and AI. The typical use case is training on data and then producing predictions, but it has shown enormous success in game-playing algorithms like AlphaGo. (This is in contrast to earlier game-playing systems, like Deep Blue, which focused more on exploring and optimizing the future solution space.)

But there are also distinctions. If I analyze some sales data and discover that clients from particular industries renew more than others (extracting an insight), the output is some numbers and graphs, not a particular action. (Executives might use those conclusions to change our sales strategy, but that action isn’t autonomous.) This means I’d describe my work as data science: it would be cringeworthy to say that I’m “using AI to improve our sales.”

please

please

please do not write that someone who trained an algorithm has “harnessed the power of AI”

— Dave Gershgorn (@davegershgorn) September 18, 2017

The difference between artificial intelligence and machine learning is a bit more subtle, and historically ML has often been considered a subfield of AI (computer vision in particular was a classic AI problem). But I think the ML field has largely “broken off” from AI, partly because of the backlash described above: most people who work on problems of prediction don’t like to describe themselves as AI researchers. (It helped that many important ML breakthroughs came from statistics, which had less of a presence in the rest of the AI field.) This means that if you can describe a problem as “predict X from Y,” I’d recommend avoiding the term AI completely.

by today’s definition, y=mx+b is an artificial intelligence bot that can tell you where a line is going

— Amy Hoy ✨ (@amyhoy) March 29, 2017

Case study: how would the three be used together?

Suppose we were building a self-driving car, and were working on the specific problem of stopping at stop signs. We would need skills drawn from all three of these fields.

  • Machine learning: The car has to recognize a stop sign using its cameras. We construct a dataset of millions of photos of streetside objects, and train an algorithm to predict which have stop signs in them.

  • Artificial intelligence: Once our car can recognize stop signs, it needs to decide when to take the action of applying the brakes. It’s dangerous to apply them too early or too late, and we need it to handle varying road conditions (for example, to recognize on a slippery road that it’s not slowing down quickly enough), which is a problem of control theory.

  • Data science: In street tests we find that the car’s performance isn’t good enough, with some false negatives in which it drives right by a stop sign. After analyzing the street test data, we gain the insight that the rate of false negatives depends on the time of day: it’s more likely to miss a stop sign before sunrise or after sunset. We realize that most of our training data included only objects in full daylight, so we construct a better dataset including nighttime images and go back to the machine learning step. A small sketch of this kind of analysis follows the list.
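Here is a minimal sketch of that data-science step in Python/pandas. The log format (a timestamp and a false_negative flag per stop-sign encounter) is a hypothetical stand-in for real street-test data.

import pandas as pd

# Hypothetical street-test log: one row per stop-sign encounter.
tests = pd.DataFrame({
    'time': pd.to_datetime(['2018-03-01 05:40', '2018-03-01 12:10',
                            '2018-03-01 18:55', '2018-03-02 07:30',
                            '2018-03-02 21:05', '2018-03-02 13:45']),
    'false_negative': [True, False, True, False, True, False],
})

# Insight: compare the miss rate in darkness (before ~7am or after ~6pm)
# with the miss rate in daylight.
tests['dark'] = (tests['time'].dt.hour < 7) | (tests['time'].dt.hour >= 18)
print(tests.groupby('dark')['false_negative'].mean())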

  1. It doesn’t help that AI is often conflated with general AI, capable of performing tasks across many different domains, or even superintelligent AI, which surpasses human intelligence. This sets unrealistic expectations for any system described as “AI”. 

  2. By “bots” here I’m referring to systems meant to interpret natural language and then respond in kind. This can be distinguished from text mining, where the goal is to extract insights (data science), or text classification, where the goal is to categorize documents (machine learning).

Why soft skills are requisite in data science?

Communication came into existence with human civilization. At that time, communication was limited to talking, to expressing feelings face to face. Needs were fewer, so there were fewer ways of communicating; when tribes existed, people communicated their needs and feelings through simple actions. As the human […]