Publisher: Valentina Porcu. Category: Science and new technologies. Language: English. Year of publication: 2017

Number of pages: 148


Excerpt from the ebook Learning Python for data mining - Valentina Porcu

Contents

Foreword

Installing Python

Editor and Integrated development environments

Differences between Python2 and Python3

Working directory

Using Terminal

Chapter 1

1.1 Objects in Python

1.2 Reserved terms for the system and names

1.3 Enter comments in the code

1.4 Types of data

1.5 File format

1.6 Operators

1.7 Indentation

1.8 Quotation marks

Chapter 2

2.1 Numbers

2.2 Container objects

Tuples

Lists

Dictionaries

Sets

Strings

Files

2.3 Immutability

2.4 Converting formats

Chapter 3

3.1 Functions

3.1.1 Some predefined built-in functions

Obtaining information about a function

3.2 Create your own functions

3.3 Saving your own modules and files

Chapter 4

4.1 Conditional instructions

4.1.1 if

4.1.2 if-else

4.1.3 elif

4.2 Loops

4.2.1 for

4.2.2 while

4.2.3 continue and break

4.2.4 range()

4.3 Extend our functions with conditional instructions

4.4 map() and filter() functions

4.5 The lambda function

4.6 Scoping

Chapter 5

5.1 Object Oriented Programming

5.2 Modules

5.3 Methods

5.4 List comprehension

5.5 Regular Expressions

5.6 User input

5.7 Errors and Exceptions

Chapter 6

6.1 Importing files

6.2 .csv format

6.3 From the web

6.4 In JSON

6.5 Other formats

Chapter 7

7.1 Libraries for data mining

7.2 pandas

7.2.1 pandas: Series

7.2.2 pandas: dataframes

7.2.3 pandas: importing and exporting data

7.2.4 pandas: data manipulation

7.2.5 pandas: missing values

7.2.6 pandas: merging two datasets

7.2.7 pandas: basic statistics

Chapter 8

8.1 SciPy

8.2 Numpy

8.2.1 Numpy - generating random numbers and seeds

Chapter 9

9.1 Matplotlib

Chapter 10

10.1 scikit-learn

Managing dates

Data sources

Conclusions

Foreword

Python is an interpreted, interactive, and object-oriented language. It comes with a large library of functions, is extensible, since new modules can easily be created, and is available for all major operating systems. For these and other reasons, it is also one of the most widely used programming languages for data mining and machine learning.

My goal is to accompany a reader who is starting to study this programming language, guiding her through the basic concepts and then moving on to data mining. We will begin by explaining how to install Python, how to use Python and its structures, and which tools are best suited to a data analyst's work, and then switch to an introduction to data mining packages. The book is in any case an introduction. Its aim is not, for instance, to fully explain topics such as machine learning or statistics with this programming language, which would take at least two or three times as many pages as this entire book. The aim is to provide guidance from the first programming steps with Python, through the import and manipulation of datasets, to some examples of data analysis.

To be more precise, in the Getting Started section, we will run through some basic installation concepts, the tools available for programming in Python, the differences between Python2 and Python3, and how to set up a work folder.

In Chapter 1, we will begin with some basic concepts: creating objects, entering comments, the words reserved for the system, and the various types of operators that are part of the grammar of this programming language.

In Chapter 2, we will carry on with the basic Python structures, such as tuples, lists, dictionaries, sets, strings, and files, and learn how to create and convert them.

In Chapter 3, we will see the basics of creating small functions and how to save them.

Chapter 4 deals with conditional instructions, which allow us to extend the power of a function, as well as with some important functions.

In Chapter 5, we will continue with some basic concepts related to object-oriented programming, the concepts of module and method, and error handling.

Chapter 6 is dedicated to importing files and some of the basic features involved. We will see how to open and edit text files, files in .csv format, and files in various other formats.

Chapters 7 to 10 deal with Python's most important data mining packages: NumPy and SciPy for mathematical functions and random data generation, pandas for dataframe management and data import, Matplotlib for drawing charts, and scikit-learn for machine learning. With regard to scikit-learn, we will limit ourselves to giving a basic idea of the code for the various algorithms, without going into the details of the various techniques, given the complexity of the subject.

Finally, in Conclusions, we will summarize the topics and concepts of the book and look at the management of dates and at some data sources for our tests with Python.

This book is intended for those who want to approach the Python programming language from a data analysis perspective. After the introduction to Python's basic concepts, we will therefore focus on the packages most used for data analysis. To download the code, explore some topics in more depth, or find more information about the practical part, you can visit my website, Datawiring.me. From the site homepage you can also subscribe to my newsletter to keep track of news about the code and the latest posts.

Given the introductory nature of the course, the advice is in any case to write the code manually in order to get familiar with it and be able to handle it, especially for readers who have just begun programming.

Installing Python

Python can be easily installed from https://www.python.org/downloads/ in either version 2 or 3. It is usually preinstalled on Unix-like systems, so on a Mac or a Linux machine we can simply open a terminal and type "python".

From the python.org website, simply download the most suitable version for your operating system and proceed with installation following the on-screen instructions.

Editor and Integrated development environments

There are many ways to use a programming language such as Python. We can simply write our first lines in the terminal: once the language is installed, if necessary (depending on the operating system, some version of Python may already be included), we open a terminal window and type its name.

Writing code this way may, however, prove somewhat uncomfortable when it comes to doing more than a few examples. For this reason, we can use a text editor or an IDE (Integrated Development Environment). This way, we simply write code scripts, save them with the .py extension, and later run them to verify that they work correctly.

There are many free and paid editors, which differ in completeness, scalability, and ease of use. Among the most used editors are Sublime Text, TextWrangler, Notepad++ (for Windows), and TextMate (for Mac). But we can also use a simple text editor.

As for integrated development environments, or IDEs, those used for Python include, for instance, Wingware, Komodo, PyCharm, and Emacs, but there are really lots of them. These tools provide features that simplify work, such as auto-completion, auto-editing and auto-indentation, integrated documentation, syntax highlighting, code folding (the ability to hide some pieces of code while you work on other parts), and support for debugging.

Spyder (which is included in Anaconda) and Jupyter are the most used in data science, along with Canopy. A useful tool for Jupyter is nbviewer, available at http://nbviewer.jupyter.org, which allows the exchange of Jupyter's .ipynb files and can also be linked to GitHub.

As for Anaconda, a very useful tool as it also includes Jupyter, it can be downloaded for our operating system from this link. The list of resources installed with Anaconda (over 100 packages for data mining, maths, data analysis and algebra) can be viewed by opening a terminal window and typing:

conda list

Part of the resources installed with Anaconda

We can program in Python through one or more of these tools, depending on our habits and what we want to do. Spyder and Jupyter, both available once Anaconda is installed, are very common for data mining. These tools can also be installed and used individually (e.g. Jupyter can be tried from this link), but installing Anaconda makes work easier, as it provides us with a whole host of tools and packages.

Spyder Home Screen

Example of an open script in Jupyter

Python code can then be run directly from the terminal, or saved as a .py file and then run from these other editors. What tells us we are in the Python shell is the ">>>" symbol at the beginning of the prompt.

To best follow the examples in this book, I recommend installing Anaconda from the Continuum.io website and using Jupyter. Anaconda automatically installs a set of packages and modules that we will use later, so we will not have to install them one by one from the terminal.

Anaconda's main screen

Differences between Python2 and Python3

Python is released in two different versions, Python2 and Python3. Python2 was first released in 2000 (currently the latest release is 2.7), and it is expected to be supported until 2020. It is the historical and most complete version.

Python3 was released in 2008 (the current version is 3.6). There are many libraries for Python3, but not all of them have yet been ported from Python2 to this release.

The two versions are very similar but feature some differences, for example with regard to mathematical operations:

Python 2.7

5/2

2

# Python2 performs integer division, discarding the decimal part

Python 3.5.2

5/2

2.5

To get the decimal result in Python2, we have to make at least one of the operands a decimal number, as follows:

5.0/2

2.5

# or like this

5/2.0

2.5

# or specifying we are talking about a decimal (float)

float(5)/2

2.5

To make the two versions of Python behave alike, you can also import from a module called __future__, which makes some Python3 features available in Python2.

from __future__ import division

5/2

2.5
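The difference can also be checked from the Python3 side, where / always returns a decimal result and // performs floor division. The following is only a minimal sketch, written for Python3; the variable names are illustrative and do not come from the book.

```python
# Python3: "/" is true division, "//" is floor division
true_div = 5 / 2       # decimal result
floor_div = 5 // 2     # integer result, like "/" between integers in Python2

# floor division rounds towards negative infinity, not towards zero
neg_floor = -5 // 2

# int() truncates towards zero instead
truncated = int(-5 / 2)

print(true_div, floor_div, neg_floor, truncated)
```

With negative numbers, floor division and truncation give different results, which is worth remembering when porting code between the two versions.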

For a closer look at the differences between the two versions of Python, you can access this online resource.

What is the difference between the two versions, and why choose one or the other? Python2 represents the best-established and most stable version, while Python3 represents the future of the language, although in some respects the two versions do not coincide. In the first part of this text we will always try to highlight the differences between the two versions. From Chapter 7 onwards, in the section on data mining packages, we will use Python3.

Working directory

Before we start working, we set the working directory on our computer. Setting up a working directory means setting up a home for our scripts and our files, where Python will automatically look when we ask it to import a file or run a script. To find out what our working directory is, simply type this in the Python shell:

import os

os.getcwd()

'~/valentinaporcu'

# to change the working directory, we use the following notation, inserting the new directory in parentheses

os.chdir("/~/Python_script")

# then let's check if it has been correctly changed

os.getcwd()

'~/Python_script'
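The same steps can be tried safely on any machine with the sketch below. It is only an illustration under stated assumptions: a temporary folder created with tempfile stands in for a folder such as Python_script, and the variable names are hypothetical.

```python
import os
import tempfile

# remember where we started so that we can go back at the end
start_dir = os.getcwd()

# create a temporary folder to act as the new working directory
# (a stand-in for a folder such as Python_script)
new_dir = tempfile.mkdtemp()

# change the working directory and check that it has changed
os.chdir(new_dir)
current_dir = os.getcwd()
print(current_dir)

# "~" is not expanded by os.chdir(); os.path.expanduser() returns the home folder
home_dir = os.path.expanduser("~")
print(home_dir)

# return to the original working directory and remove the temporary folder
os.chdir(start_dir)
os.rmdir(new_dir)
```

Note that os.chdir() expects a path that really exists, so a path containing "~" has to be expanded first.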

Setting up a working directory means that when we import a file that is in our working directory, we can simply type its name and extension within quotation marks, in this format:

"file_name.extension"

For instance:

"dataframe_data_collection1.csv"

Python will check directly whether there is a file with that name inside that folder and will import it. The same applies when we save a file from Python: Python will automatically put it in that folder. Even when we run a Python script, as we will see, we first have to move, from the terminal, to the folder where the script is located (the working directory or another one).

If we want to import a file that is not in the working directory but elsewhere on our computer or on the web, we can still do so, this time by entering the full file address:

"complete_address/file_name.extension"

For instance:

"/Users/vp/Downloads/dataframe_data1.csv"

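The difference between a bare file name and a full address can be sketched as follows. This is a hedged illustration, not the book's code: the folder is a temporary one created for the example, and the small file written into it is a placeholder.

```python
import os
import tempfile

# hypothetical folder and file, created here only for the example
folder = tempfile.mkdtemp()
file_name = "dataframe_data1.csv"

# build the full address from the folder and the file name
full_path = os.path.join(folder, file_name)

# write a tiny file so that there is something to find
with open(full_path, "w") as f:
    f.write("a,b\n1,2\n")

# the full address works from any working directory
found_by_full_path = os.path.exists(full_path)

# the bare file name works only if the folder is the working directory
start_dir = os.getcwd()
os.chdir(folder)
found_by_name = os.path.exists(file_name)
os.chdir(start_dir)

print(found_by_full_path, found_by_name)

# clean up the example file and folder
os.remove(full_path)
os.rmdir(folder)
```

Using os.path.join() rather than typing the separators by hand keeps the path valid on different operating systems.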
Using Terminal

Let us see how to run Python scripts. First, let us open a terminal window.

In the terminal we see the dollar symbol ($), not the Python shell symbol (>>>). We can view the list of our folders and files with the ls command.

At this point, we can move to the Python_test folder, for example:

cd Python_test

In the Python_test folder I have just moved to, I find my Python scripts, that is, the .py files, which I can run by typing:

python test.py

test.py is the name of the script I am going to run.
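As an illustration, a script such as test.py might look like the sketch below. The contents are hypothetical (the book does not show them here); the function name and the message are assumptions made for the example.

```python
# hypothetical contents of test.py

def main():
    # build the message, print it and return it so it can be reused
    message = "Hello from test.py"
    print(message)
    return message


# this block runs only when the file is executed as a script,
# for example with: python test.py
if __name__ == "__main__":
    main()
```

Guarding the call with the __name__ check lets the same file be run from the terminal or imported from another script without side effects.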

Chapter 1

Introductory notions

1.1 Objects in Python

In Python, any item is considered an object, that is, a container in which to place our data. In Python there are many types of objects: tuples, lists, sets and dictionaries, for example, are called containers. All processing in Python is based on objects.

Each object in Python is distinguished by three properties:

a name

a type

an ID

Object names consist only of alphanumeric characters and underscores, that is, the characters A-Z, a-z, 0-9, and _. The type is the kind of object, such as string, numeric, or Boolean. The ID is a number that uniquely identifies our object.

Objects remain in the computer's memory and can be retrieved. When they are no longer needed, a garbage collection mechanism frees the memory they occupy.
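The three properties can be inspected with the built-in functions type() and id(). The sketch below is a minimal illustration; the names a, b and c and their values are assumptions made for the example.

```python
# an object with a name (a), a type and an ID
a = [1, 2, 3]
print(type(a))
print(id(a))   # an integer; the value changes from run to run

# a second name bound to the same object shares its ID
b = a
same_object = id(a) == id(b)

# an equal but separately created list is a different object
c = [1, 2, 3]
different_object = id(a) != id(c)

print(same_object, different_object)
```

Two names can therefore refer to one object, while two objects with equal contents keep distinct IDs.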

1.2 Reserved terms for the system and names

Python has a set of words that are reserved for the system and cannot be used by users as names for objects or functions. Such words are:

and as assert break class continue def del elif else except exec False finally for from global if import in is lambda None not or pass print raise return True try while with yield

These words cannot be used as names for our objects (exec and print are reserved only in Python2; in Python3 they are ordinary functions). Object names in Python must follow some rules:

they must begin with a letter or an underscore _

they must contain only letters, numbers, and underscores

they are case sensitive, so a test object is not the same as a TEST object or a Test object
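These rules can be checked programmatically with the standard keyword module and the str.isidentifier() method. The following is a small sketch for Python3; the helper is_valid_name and the candidate names are hypothetical, chosen only for illustration.

```python
import keyword

# the reserved words of the interpreter we are running
print(keyword.kwlist)

# candidate names (illustrative)
candidates = ["test", "_test2", "2test", "my-name", "class"]

def is_valid_name(name):
    # valid if it has the shape of an identifier and is not a reserved word
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_name(name) for name in candidates}
print(results)

# names are case sensitive: these are three different names
test, TEST, Test = 1, 2, 3
```

Printing keyword.kwlist is also a quick way to see the exact list of reserved words for the installed version of Python.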

1.3 Enter comments in the code

In Python, any text preceded by the # symbol is not read by the program as code, but is ignored. This is very useful for commenting the code and picking it up again later. Comments can be written either on the line above the code they refer to or beside it.

# comment no. 1

print("Hello World") # comment no. 2

To write a comment on multiple lines, we can also use triple quotation marks, like this:

"""

comment line 1

comment line 2

comment line 3

"""
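Strictly speaking, text between triple quotation marks is a string that is simply not assigned to anything; when it is the first statement of a function, Python stores it as the function's documentation. The sketch below illustrates both kinds of annotation; the function greet is a hypothetical example.

```python
def greet(name):
    """Return a short greeting.

    A triple-quoted string placed first in a function is stored
    as its documentation (docstring).
    """
    # an ordinary comment: ignored by the interpreter
    return "Hello " + name

print(greet("World"))

# unlike a # comment, the docstring can be read back at run time
print(greet.__doc__)
```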

1.4 Types of data

Python data can be of various types. We can summarize them in the table below:

To know what type an object is, we can always use the type() function:

# we create an x object

type(x)

<class 'int'>

# a y object

type(y)

<class 'float'>

# and a z object

type(z)

<class 'str'>
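A self-contained version of this check is sketched below for Python3. The values assigned to x, y and z are illustrative assumptions, chosen only so that the three objects have the types shown above.

```python
# illustrative values: an integer, a decimal number and a string
x = 10
y = 2.5
z = "hello"

print(type(x))
print(type(y))
print(type(z))

# isinstance() tests the type directly and returns a Boolean
x_is_int = isinstance(x, int)
z_is_number = isinstance(z, (int, float))
print(x_is_int, z_is_number)
```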

1.5 File format

Once you have created a script in Python, you need to save it with a .py extension. Typically, when it comes to complex scripts, we will write the script in an editor and then run it. A .py script can be written with any of the editors we have seen, even a normal text editor, and then saved with the .py extension.

1.6 Operators

In Python we find a series of operators, divided into several groups:

arithmetic

assignment

comparison

logical

bitwise

membership

identity

Besides these operators, there is also a hierarchy (operator precedence) that determines the order in which they are evaluated.
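The effect of this hierarchy can be seen in a few short expressions. The sketch below is only an illustration; the variable names are arbitrary.

```python
# multiplication is evaluated before addition
a = 2 + 3 * 4

# parentheses change the order of evaluation
b = (2 + 3) * 4

# exponentiation groups from right to left: 2 ** (3 ** 2)
c = 2 ** 3 ** 2

# exponentiation is evaluated before the unary minus: -(2 ** 2)
d = -2 ** 2

# arithmetic comes before comparison, which comes before "not" and "and"
e = 1 + 2 > 2 and not False

print(a, b, c, d, e)
```

When in doubt, adding parentheses makes the intended order explicit and costs nothing.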

Mathematical operators

When we open Python, the simplest thing we can do is use it to perform math operations, for which we use mathematical operators:

Let us open Python and try some examples of mathematical operations:

10+7

17

15-2

13

2*3

6

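10/2

5

# result in Python2; Python3 returns 5.0

3**3

27

10/3

3

# result in Python2 (integer division); Python3 returns 3.3333333333333335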

25//7

3
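Floor division is often used together with the remainder operator %. The sketch below, written for Python3, shows how the two are related; the variable names are illustrative.

```python
# floor division and remainder of the same pair of numbers
quotient = 25 // 7
remainder = 25 % 7

# divmod() returns both values at once as a tuple
both = divmod(25, 7)

# the original number can be rebuilt from quotient and remainder
rebuilt = quotient * 7 + remainder

# in Python3, "/" returns a float even when the result is whole
exact = 10 / 2

print(quotient, remainder, both, rebuilt, exact)
```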

Comparison and membership operators

In Python we also have some comparison operators or comparators.

Operator    Description
>           greater than
<           less than
==          equal to
>=          greater than or equal to
<=          less than or equal to
!=          not equal to
is          identity
is not      non-identity
in          exists in
not in      does not exist in

These operators are used to test relationships between objects. Let us see some examples:

# we create two objects

# let us verify if x is greater than y

x > y

False

# the output is a Boolean value that tells us that x is not greater than y

# let us see if x is less than y

x < y

True

# this time the answer is affirmative

# we create another z object with the same value as x

# let us verify with the equality operator if z is equal to x

z == x

True

# even in this case the output is positive

# let us verify if z is different from y

z != y

True

# we create a tuple

# and verify if the number 2 is in the tuple

2 in v1

True

# let us verify if element 8 is NOT in tuple v1

8 not in v1

True

# let us verify if element 7 is NOT in tuple v1

7 not in v1

False
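The checks above can be reproduced in one runnable piece. The values below are assumptions chosen only to be consistent with the outputs shown (x smaller than y, z equal to x, and a tuple containing 2 and 7 but not 8); they are not the book's values.

```python
# illustrative values consistent with the results shown above
x = 5
y = 8
z = x
v1 = (1, 2, 7)

checks = {
    "x > y": x > y,
    "x < y": x < y,
    "z == x": z == x,
    "z != y": z != y,
    "z is x": z is x,          # identity: both names refer to the same object
    "2 in v1": 2 in v1,
    "8 not in v1": 8 not in v1,
    "7 not in v1": 7 not in v1,
}

for expression, result in checks.items():
    print(expression, "->", result)
```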

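If we compare text strings, Python does not count the characters: it compares the strings lexicographically, character by character, according to the Unicode code point of each character. So in this case the </> symbol means "does string1 come after (or before) string2 in this ordering?" For instance:

"valentina" > "laura"

True

The result is True because "v" comes after "l". In Python3 we cannot use < or > to compare strings with numbers, because we would get an error (TypeError).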
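In Python3, the < and > operators compare strings character by character by Unicode code point, not by length, and this can be verified directly. The sketch below is for Python3; the example strings other than "valentina" and "laura" are illustrative.

```python
# "v" comes after "l", so the first string is greater
first = "valentina" > "laura"

# a shorter string can still be greater: "z" comes after "v"
second = "zoe" > "valentina"

# comparing lengths is a different question and needs len()
by_length = len("zoe") > len("valentina")

# uppercase letters come before lowercase ones
case_matters = "Zoe" < "anna"

# ordering a string against a number raises TypeError in Python3
try:
    "10" > 5
    mixed = "no error"
except TypeError:
    mixed = "TypeError"

print(first, second, by_length, case_matters, mixed)
```

To compare a numeric string with a number, it has to be converted first, for example with int("10") > 5.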

Bitwise operators