pandas

Pandas 筆記

今天我想分享我學的 pandas 筆記,我覺得對於分大數據很有幫助,他有點像 mysql。請先下載 module 或 package:pip install pandas`

Part1 基本語法

連線 CSV 檔案

可以去 github 找相關 csv 檔案,然後去 raw 檔案做連線

1
2
3
4
5
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
tips = pd.read_csv(url)

print(tips)

pandas 會回傳 DataFrame 格式。<class 'pandas.core.frame.DataFrame'>
下面是印出所有資料表內容。

1
2
3
4
5
6
7
8
9
10
11
12
     total_bill   tip     sex smoker   day    time  size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

Head and Tail

  • head(N): 是查前面 N 筆資料

print(tips.head(10))

output:

1
2
3
4
5
6
7
8
9
10
11
   total_bill   tip     sex smoker  day    time  size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
  • tail(N): 是查最後 N 筆資料

print(tips.tail(10))

output

1
2
3
4
5
6
7
8
9
10
11
  total_bill   tip     sex smoker   day    time  size
234 15.53 3.00 Male Yes Sat Dinner 2
235 10.07 1.25 Male No Sat Dinner 2
236 12.60 1.00 Male Yes Sat Dinner 2
237 32.83 1.17 Male Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

Shape method

shape 語法是用來查有多少欄與列

tips.shape

output:
(244, 7)

column

columns 顯示欄表的名稱

print(tips.columns)

output:
Index([‘total_bill’, ‘tip’, ‘sex’, ‘smoker’, ‘day’, ‘time’, ‘size’], dtype=’object’)

describe (statistics value)

describe 會顯示出所有有統計的直,如果表內容有字就會顯示出來。

print(tips.describe())

output:

1
2
3
4
5
6
7
8
9
       total_bill         tip        size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000

顯示特定欄位

請在 tips 裡面輸入你要顯示欄資訊

tips['sex']

output:

1
2
3
4
5
6
7
8
9
10
11
12
0      Female
1 Male
2 Male
3 Male
4 Female
...
239 Male
240 Female
241 Male
242 Male
243 Female
Name: sex, Length: 244, dtype: object

Unique 直,獨一無二

print(tips['sex'].unique())

[‘Female’ ‘Male’]

sort 排序

單欄位

print(tips['total_bill'].sort_values())

output:

1
2
3
4
5
6
7
8
9
10
11
12
67      3.07
92 5.75
111 7.25
172 7.25
149 7.51
...
182 45.35
156 48.17
59 48.27
212 48.33
170 50.81
Name: total_bill, Length: 244, dtype: float64

多欄位排序數字大小排序

顯示欄名稱 和排序大小

tips[["total_bill","tip"]].sort_values("total_bill")

output:

1
2
3
4
5
6
7
8
9
10
11
12
     total_bill    tip
67 3.07 1.00
92 5.75 1.00
111 7.25 1.00
172 7.25 5.15
149 7.51 2.00
.. ... ...
182 45.35 3.50
156 48.17 5.00
59 48.27 6.73
212 48.33 9.00
170 50.81 10.00

條件 condition

single

print(tips[tips['smoker'] == 'No'])

output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
   total_bill   tip     sex smoker   day    time  size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
235 10.07 1.25 Male No Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

[151 rows x 7 columns]

多重條件,可以用 And &or |

tips[(tips['smoker'] == 'Yes') & (tips['sex']=='Male')]

output:

1
2
3
4
5
6
7
8
9
10
11
12
13
  total_bill    tip   sex smoker   day    time  size
56 38.01 3.00 Male Yes Sat Dinner 4
58 11.24 1.76 Male Yes Sat Dinner 2
60 20.29 3.21 Male Yes Sat Dinner 2
61 13.81 2.00 Male Yes Sat Dinner 2
62 11.02 1.98 Male Yes Sat Dinner 2
63 18.29 3.76 Male Yes Sat Dinner 4
69 15.01 2.09 Male Yes Sat Dinner 2
76 17.92 3.08 Male Yes Sat Dinner 2
80 19.44 3.00 Male Yes Thur Lunch 2
83 32.68 5.00 Male Yes Thur Lunch 2
90 28.97 3.00 Male Yes Fri Dinner 2
95 40.17 4.73 Male Yes Fri Dinner 4

tips[(tips['smoker'] == 'Yes') | (tips['sex']=='Male')]

Example:

列出 day 欄位獨一無二的項目

print(tips['day'].unique())

output:

1
['Sun' 'Sat' 'Thur' 'Fri']

列出多個特定的 column

tips[['sex','tip']]

output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
        sex   tip
0 Female 1.01
1 Male 1.66
2 Male 3.50
3 Male 3.31
4 Female 3.61
.. ... ...
239 Male 5.92
240 Female 2.00
241 Male 2.00
242 Male 1.75
243 Female 3.00

[244 rows x 2 columns]

選出在周日用餐且小費給 5 元以上的資料

print(tips[(tips["day"]=="Sun") & (tips["tip"]>=5)])

output:

1
2
3
4
5
6
7
8
9
10
11
12
13
    total_bill   tip     sex smoker  day    time  size
11 35.26 5.00 Female No Sun Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
46 22.23 5.00 Male No Sun Dinner 2
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
185 20.69 5.00 Male No Sun Dinner 5

groupby()等語法

groupby()把資料做分組再進去做統計

tips.groupby('time').size()

output:

1
2
3
4
time
Dinner 176
Lunch 68
dtype: int64

mean()平均值

在 groupby 之後並選定 column 後,使用 mean()可以算出各類別該 column 的平均值

tips.groupby('time')['total_bill'].median()

output:

1
2
3
4
time
Dinner 18.390
Lunch 15.965
Name: total_bill, dtype: float64

Part2 real world Example titanic 例

這是新的專案以 titanic 為例

1
2
3
4
5
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)
print(df)
print("="*50)
print(df.columns)

output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
   survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female NaN 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
.. ... ... ... ... ... ...
886 man True NaN Southampton no True
887 woman False B Southampton yes True
888 woman False NaN Southampton no False
889 man True C Cherbourg yes True
890 man True NaN Queenstown no True

[891 rows x 15 columns]

==================================================

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
'alive', 'alone'],
dtype='object')

最貴的票價是多少?

df['fare'].max()

output:
512.3292

最年輕的乘客與最老的乘客差幾歲?

df["age"].min()

output:
0.42

乘客平均年齡為幾歲?

df["age"].mean()

output:
29.69911764705882

每個乘船處各有多少乘客上船?

df.groupby("embark_town").size()

output:

男性、女性、孩童倖存者各有多少人?

df[df["survived"]==1].groupby("who").size()

1
2
3
4
5
who
child 49
man 88
woman 205
dtype: int64

男人、女人、兒童存活率為多少?

total = df.groupby('who').size() > print(total)

output:

1
2
3
4
5
who
child 83
man 537
woman 271
dtype: int64

alive = df[df["survived"] == 1].groupby('who').size() > print(alive)

output:

1
2
3
4
5
who
child 49
man 88
woman 205
dtype: int64

ratio = alive / total > print(ratio*100)

output:

1
2
3
4
5
who
child 59.036145
man 16.387337
woman 75.645756
dtype: float64

reference