AI – Collecting Data

Up to 80% of the work in an Artificial Intelligence project is about Collecting Data. This involves questions such as:

  • What data is Required?
  • What data is Available?
  • How to Select the data?
  • How to Collect the data?
  • How to Clean the data?
  • How to Prepare the data?
  • How to Use the data?

What is Data?

Data can be many things. For Artificial Intelligence, it must be a collection of facts:

  • Numbers: Prices. Dates.
  • Measurements: Size. Height. Weight.
  • Words: Names and Places.
  • Observations: Counting Cars.
  • Descriptions: It is cold.

Intelligence Needs Data

Human intelligence needs data:

A real estate broker needs data about sold houses to estimate prices.

Artificial intelligence needs data:

A computer program also needs data to estimate prices.

Storing Data

The most common data to collect are Numbers and Measurements.

Often data are stored in arrays representing the relationship between values.
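
As a minimal sketch of this idea, the example below stores house sizes and prices in two parallel Python lists and pairs them up. The numbers are hypothetical values chosen for illustration, and the names sizes_m2, prices, and houses are assumptions, not part of the original material.

```python
# Hypothetical values for illustration only.
sizes_m2 = [50, 70, 90, 110, 130]                        # house sizes in square meters
prices = [150_000, 210_000, 270_000, 330_000, 390_000]   # matching sale prices

# Pair each size with its price so the relationship is explicit.
houses = list(zip(sizes_m2, prices))

# Derive a new value from the stored pairs: price per square meter.
price_per_m2 = [price / size for size, price in houses]

for (size, price), rate in zip(houses, price_per_m2):
    print(f"{size} m2 -> {price} ({rate:.0f} per m2)")
```

Keeping the two lists the same length and in the same order is what lets position i in one list describe the same house as position i in the other.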

This table contains house prices versus size:


Quantitative vs. Qualitative

Quantitative data are numerical:

  • 55 cars
  • 15 meters
  • 35 children

Qualitative data are descriptive:

  • It is cold
  • It is long
  • It was fun
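
One rough way to tell the two kinds apart in a program is by value type: numbers are treated as quantitative and text as qualitative. The sketch below assumes that simple rule, and the helper is_quantitative is a hypothetical name introduced for illustration. Real datasets often need more care, for example numbers stored as text.

```python
# Mixed observations: counts and measurements alongside descriptions.
observations = [55, "It is cold", 15.0, "It is long", 35, "It was fun"]

def is_quantitative(value):
    """Treat ints and floats as quantitative (assumed rule for this sketch)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

quantitative = [v for v in observations if is_quantitative(v)]
qualitative = [v for v in observations if not is_quantitative(v)]

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```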

Census or Sampling

A Census is when we collect data for every member of a group.

A Sample is when we collect data for some members of a group.

If we wanted to know how many Americans smoke cigarettes, we could ask every person in the US (a census), or we could ask 10 000 people (a sample).

A census is Accurate, but hard to do. A sample is Less Accurate, but easier to do.
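
The trade-off can be sketched with a simulated population. The example below assumes a made-up population of 100 000 people and an assumed smoking rate of about 15%. Neither figure comes from real statistics. It compares the rate found by asking everyone with the rate found by asking 10 000 randomly chosen people.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: True means the person smokes (assumed rate ~15%).
population = [rng.random() < 0.15 for _ in range(100_000)]

# Census: ask every member of the population.
census_rate = sum(population) / len(population)

# Sample: ask 10 000 members chosen at random, without replacement.
sample = rng.sample(population, 10_000)
sample_rate = sum(sample) / len(sample)

print(f"census: {census_rate:.3f}  sample: {sample_rate:.3f}")
```

The sample rate usually lands close to the census rate, but not exactly on it, while requiring only a tenth of the answers.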

Sampling Terms

A Population is the group of individuals (or objects) we want to collect information from.

A Census is information about every individual in a population.

A Sample is information about a part of the population (chosen to represent the whole).

Random Samples

In order for a sample to represent a population, it must be collected randomly.

A Random Sample is a sample where every member of the population has an equal chance of appearing in the sample.
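
As a small illustration, Python's standard random.sample function draws a sample without replacement in which every member is equally likely to be picked. The sketch below uses a hypothetical population of 20 member IDs. It repeats the draw many times to check that each member is included about equally often.

```python
import random
from collections import Counter

rng = random.Random(0)          # fixed seed so the run is repeatable
population = list(range(20))    # 20 hypothetical member IDs
sample_size = 5

# One simple random sample, drawn without replacement.
one_sample = rng.sample(population, sample_size)
print("One random sample:", one_sample)

# Repeat the draw many times and count how often each member is picked.
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(population, sample_size))

# With equal chances, each member should appear in about
# sample_size / len(population) = 25% of the draws.
inclusion_rates = {member: counts[member] / trials for member in population}
print("Lowest rate:", min(inclusion_rates.values()))
print("Highest rate:", max(inclusion_rates.values()))
```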

Sampling Bias

Sampling Bias occurs when samples are collected in such a way that some individuals are less (or more) likely to be included in the sample than others.
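
The effect can be sketched with a made-up population of two equally large groups. Everything in the example below is an assumption for illustration: the groups, the hours of sleep, and the 9-to-1 selection weights. It compares the average from a random sample with the average from a sample that reaches one group far more often.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 5 000 day-shift and 5 000 night-shift workers.
# Assumed for illustration: day shift sleeps 8 hours, night shift 6 hours.
population = [("day", 8.0)] * 5_000 + [("night", 6.0)] * 5_000
true_mean = sum(hours for _, hours in population) / len(population)

# Random sample: every member has the same chance of being selected.
random_sample = rng.sample(population, 1_000)
random_mean = sum(hours for _, hours in random_sample) / len(random_sample)

# Biased sample: a survey held at noon reaches day-shift workers
# 9 times as often. (random.choices draws with replacement.)
weights = [9 if shift == "day" else 1 for shift, _ in population]
biased_sample = rng.choices(population, weights=weights, k=1_000)
biased_mean = sum(hours for _, hours in biased_sample) / len(biased_sample)

print(f"true mean:   {true_mean:.2f}")
print(f"random mean: {random_mean:.2f}")
print(f"biased mean: {biased_mean:.2f}")
```

The random sample stays near the true average, while the biased sample drifts toward the over-represented group, and collecting more data the same way does not remove that drift.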