Python Pandas
Pandas is one of the most widely used Python libraries for data analysis. Its two core data structures are Series and DataFrame. A Series is one-dimensional: a single labeled column of values. A DataFrame is two-dimensional: a table of rows and columns.
To install Pandas, run "pip install pandas".
Series Data Structure
Series is a one-dimensional array-like object that can hold data of any type. It is similar to a column in a table.
In [2]:
import pandas as pd
import numpy as np
pd.__version__
Out[2]:
'2.2.3'
In [3]:
obj=pd.Series([1,"John",3.5,"Hey"])
obj
Out[3]:
0       1
1    John
2     3.5
3     Hey
dtype: object
In [4]:
obj.values
Out[4]:
array([1, 'John', 3.5, 'Hey'], dtype=object)
In [5]:
obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2
Out[5]:
a       1
b    John
c     3.5
d     Hey
dtype: object
In [6]:
obj2["b"]
Out[6]:
'John'
In [7]:
obj2.index
Out[7]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [8]:
score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}
names=pd.Series(score) # Convert to Series
names
Out[8]:
Jane    90
Bill    80
Elon    85
Tom     75
Tim     95
dtype: int64
In [9]:
names["Tim"]
Out[9]:
95
In [10]:
names[names>=85]
Out[10]:
Jane    90
Elon    85
Tim     95
dtype: int64
In [11]:
names["Tom"]=60
names
Out[11]:
Jane    90
Bill    80
Elon    85
Tom     60
Tim     95
dtype: int64
In [12]:
names[names<=80]=83
names
Out[12]:
Jane    90
Bill    83
Elon    85
Tom     83
Tim     95
dtype: int64
In [13]:
"Tom" in names
Out[13]:
True
In [14]:
names/10
Out[14]:
Jane    9.0
Bill    8.3
Elon    8.5
Tom     8.3
Tim     9.5
dtype: float64
In [15]:
names**2
Out[15]:
Jane    8100
Bill    6889
Elon    7225
Tom     6889
Tim     9025
dtype: int64
In [16]:
names.isnull()
Out[16]:
Jane    False
Bill    False
Elon    False
Tom     False
Tim     False
dtype: bool
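The `names` Series above has no missing values, so `isnull()` returns False everywhere. The following is a minimal sketch of what the method reports when a value is actually missing; the small `scores` Series and the extra label "Ann" are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical scores; "Ann" is deliberately absent from the data.
scores = pd.Series({"Jane": 90, "Bill": 83, "Tim": 95})

# Reindexing with a label that does not exist introduces a NaN for it.
with_gap = scores.reindex(["Jane", "Bill", "Tim", "Ann"])

missing_mask = with_gap.isnull()     # True where the value is missing
n_missing = int(missing_mask.sum())  # number of missing entries
present = with_gap.dropna()          # keep only the non-missing entries

print(missing_mask)
print(n_missing)
print(present)
```

Summing the boolean mask is a common way to count missing entries, since True counts as 1.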
Working with Series Data Structure
In [18]:
games=pd.read_csv("https://raw.githubusercontent.com/TirendazAcademy/PANDAS-TUTORIAL/refs/heads/main/DataSets/vgsalesGlobale.csv")
In [19]:
games.head()
Out[19]:
| Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
In [20]:
games.dtypes
Out[20]:
Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object
In [21]:
games.Genre.describe()
Out[21]:
count      16598
unique        12
top       Action
freq        3316
Name: Genre, dtype: object
In [22]:
games.Genre.value_counts()
Out[22]:
Genre
Action          3316
Sports          2346
Misc            1739
Role-Playing    1488
Shooter         1310
Adventure       1286
Racing          1249
Platform         886
Simulation       867
Fighting         848
Strategy         681
Puzzle           582
Name: count, dtype: int64
In [23]:
games.Genre.value_counts(normalize=True)
Out[23]:
Genre
Action          0.199783
Sports          0.141342
Misc            0.104772
Role-Playing    0.089649
Shooter         0.078925
Adventure       0.077479
Racing          0.075250
Platform        0.053380
Simulation      0.052235
Fighting        0.051090
Strategy        0.041029
Puzzle          0.035064
Name: proportion, dtype: float64
In [24]:
type(games.Genre.value_counts())
Out[24]:
pandas.core.series.Series
In [26]:
games.Genre.unique()
Out[26]:
array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc', 'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure', 'Strategy'], dtype=object)
In [27]:
games.Genre.nunique()
Out[27]:
12
In [28]:
pd.crosstab(games.Genre, games.Year)
Out[28]:
Year | 1980.0 | 1981.0 | 1982.0 | 1983.0 | 1984.0 | 1985.0 | 1986.0 | 1987.0 | 1988.0 | 1989.0 | ... | 2009.0 | 2010.0 | 2011.0 | 2012.0 | 2013.0 | 2014.0 | 2015.0 | 2016.0 | 2017.0 | 2020.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Genre | |||||||||||||||||||||
Action | 1 | 25 | 18 | 7 | 1 | 2 | 6 | 2 | 2 | 2 | ... | 272 | 226 | 239 | 266 | 148 | 186 | 255 | 119 | 1 | 0 |
Adventure | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 141 | 154 | 108 | 58 | 60 | 75 | 54 | 34 | 0 | 0 |
Fighting | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | ... | 53 | 40 | 50 | 29 | 20 | 23 | 21 | 14 | 0 | 0 |
Misc | 4 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 207 | 201 | 184 | 38 | 42 | 41 | 39 | 18 | 0 | 0 |
Platform | 0 | 3 | 5 | 5 | 1 | 4 | 6 | 2 | 4 | 3 | ... | 29 | 31 | 37 | 12 | 37 | 10 | 14 | 10 | 0 | 0 |
Puzzle | 0 | 2 | 3 | 1 | 3 | 4 | 0 | 0 | 1 | 5 | ... | 79 | 45 | 43 | 11 | 3 | 8 | 6 | 0 | 0 | 0 |
Racing | 0 | 1 | 2 | 0 | 3 | 0 | 1 | 0 | 1 | 0 | ... | 84 | 57 | 65 | 30 | 16 | 27 | 19 | 20 | 0 | 0 |
Role-Playing | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 2 | ... | 103 | 103 | 95 | 78 | 71 | 91 | 78 | 40 | 2 | 0 |
Shooter | 2 | 10 | 5 | 1 | 3 | 1 | 4 | 2 | 1 | 1 | ... | 91 | 81 | 94 | 48 | 59 | 47 | 34 | 32 | 0 | 0 |
Simulation | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 123 | 82 | 56 | 18 | 18 | 11 | 15 | 9 | 0 | 1 |
Sports | 1 | 4 | 2 | 1 | 2 | 1 | 3 | 4 | 2 | 3 | ... | 184 | 186 | 122 | 54 | 53 | 55 | 62 | 38 | 0 | 0 |
Strategy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 65 | 53 | 46 | 15 | 19 | 8 | 17 | 10 | 0 | 0 |
12 rows × 39 columns
In [29]:
games.Global_Sales.describe()
Out[29]:
count    16598.000000
mean         0.537441
std          1.555028
min          0.010000
25%          0.060000
50%          0.170000
75%          0.470000
max         82.740000
Name: Global_Sales, dtype: float64
In [31]:
print(games.Global_Sales.mean())
print(games.Global_Sales.median())
print(games.Global_Sales.std())
print(games.Global_Sales.max())
0.5374406555006628
0.17
1.5550279355699124
82.74
In [32]:
games.Global_Sales.value_counts()
Out[32]:
Global_Sales
0.02    1071
0.03     811
0.04     645
0.05     632
0.01     618
        ...
5.01       1
5.05       1
5.07       1
5.11       1
3.16       1
Name: count, Length: 623, dtype: int64
In [35]:
games.Year.plot(kind="hist")
Out[35]:
<Axes: ylabel='Frequency'>
In [36]:
games.Year.plot(kind="box")
Out[36]:
<Axes: >
In [37]:
games.Year.plot(kind="kde")
Out[37]:
<Axes: ylabel='Density'>
In [39]:
games.Genre.value_counts().plot(kind="bar")
Out[39]:
<Axes: xlabel='Genre'>
DataFrame Data Structure
DataFrame is a two-dimensional data structure that can hold data of different types. It is similar to a table with rows and columns.
In [40]:
data={"name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],
"score":[90,80,85,75,95,60,65],
"sport":["Wrestling","Football","Skiing","Swimming","Tennis",
"Karete","Surfing"],
"sex":["M","M","M","M","F","F","F"]}
df=pd.DataFrame(data)
df
Out[40]:
| name | score | sport | sex |
---|---|---|---|---|
0 | Bill | 90 | Wrestling | M |
1 | Tom | 80 | Football | M |
2 | Tim | 85 | Skiing | M |
3 | John | 75 | Swimming | M |
4 | Alex | 95 | Tennis | F |
5 | Vanessa | 60 | Karete | F |
6 | Kate | 65 | Surfing | F |
In [41]:
df=pd.DataFrame(data,columns=["name","sport","sex","score"])
df
Out[41]:
| name | sport | sex | score |
---|---|---|---|---|
0 | Bill | Wrestling | M | 90 |
1 | Tom | Football | M | 80 |
2 | Tim | Skiing | M | 85 |
3 | John | Swimming | M | 75 |
4 | Alex | Tennis | F | 95 |
5 | Vanessa | Karete | F | 60 |
6 | Kate | Surfing | F | 65 |
In [42]:
df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"],
index=["one","two","three","four","five","six","seven"])
df
Out[42]:
| name | sport | gender | score | age |
---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | NaN |
two | Tom | Football | NaN | 80 | NaN |
three | Tim | Skiing | NaN | 85 | NaN |
four | John | Swimming | NaN | 75 | NaN |
five | Alex | Tennis | NaN | 95 | NaN |
six | Vanessa | Karete | NaN | 60 | NaN |
seven | Kate | Surfing | NaN | 65 | NaN |
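In the cell above, "gender" and "age" are not keys of the dictionary, so pandas fills those columns with NaN, and the dictionary's "sex" key is left out because it is not listed in `columns`. The sketch below reproduces this on a shortened, hypothetical version of the data and shows one possible way to keep the values under a new column name with `rename`.

```python
import pandas as pd

# Shortened, hypothetical version of the dictionary used above.
data = {"name": ["Bill", "Tom"], "sex": ["M", "M"], "score": [90, 80]}

# "gender" is not a key of data, so the whole column is NaN;
# "sex" is a key but is not requested, so it is dropped.
df = pd.DataFrame(data, columns=["name", "gender", "score"])
print(df)

# To keep the values under a different name, build the frame first
# and rename the column afterwards.
renamed = pd.DataFrame(data).rename(columns={"sex": "gender"})
print(renamed)
```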
In [43]:
df["sport"]
Out[43]:
one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object
In [44]:
my_columns=["name","sport"]
df[my_columns]
Out[44]:
| name | sport |
---|---|---|
one | Bill | Wrestling |
two | Tom | Football |
three | Tim | Skiing |
four | John | Swimming |
five | Alex | Tennis |
six | Vanessa | Karete |
seven | Kate | Surfing |
In [45]:
df.sport
Out[45]:
one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object
In [46]:
df.loc[["one"]]
Out[46]:
| name | sport | gender | score | age |
---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | NaN |
In [47]:
df.loc[["one","two"]]
Out[47]:
| name | sport | gender | score | age |
---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | NaN |
two | Tom | Football | NaN | 80 | NaN |
In [48]:
df["age"]=18
In [49]:
df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"],
index=["one","two","three","four","five","six","seven"])
values=[18,19,20,18,17,17,18]
df["age"]=values
df
Out[49]:
| name | sport | gender | score | age |
---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | 18 |
two | Tom | Football | NaN | 80 | 19 |
three | Tim | Skiing | NaN | 85 | 20 |
four | John | Swimming | NaN | 75 | 18 |
five | Alex | Tennis | NaN | 95 | 17 |
six | Vanessa | Karete | NaN | 60 | 17 |
seven | Kate | Surfing | NaN | 65 | 18 |
In [50]:
df["pass"]=df.score>=70
df
Out[50]:
| name | sport | gender | score | age | pass |
---|---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | 18 | True |
two | Tom | Football | NaN | 80 | 19 | True |
three | Tim | Skiing | NaN | 85 | 20 | True |
four | John | Swimming | NaN | 75 | 18 | True |
five | Alex | Tennis | NaN | 95 | 17 | True |
six | Vanessa | Karete | NaN | 60 | 17 | False |
seven | Kate | Surfing | NaN | 65 | 18 | False |
In [51]:
del df["pass"]
df
Out[51]:
| name | sport | gender | score | age |
---|---|---|---|---|---|
one | Bill | Wrestling | NaN | 90 | 18 |
two | Tom | Football | NaN | 80 | 19 |
three | Tim | Skiing | NaN | 85 | 20 |
four | John | Swimming | NaN | 75 | 18 |
five | Alex | Tennis | NaN | 95 | 17 |
six | Vanessa | Karete | NaN | 60 | 17 |
seven | Kate | Surfing | NaN | 65 | 18 |
In [52]:
scores={"Math":{"A":85,"B":90,"C":95}, "Physics":{"A":90,"B":80,"C":75}}
In [53]:
scores_df=pd.DataFrame(scores)
scores_df
Out[53]:
| Math | Physics |
---|---|---|
A | 85 | 90 |
B | 90 | 80 |
C | 95 | 75 |
In [54]:
scores_df.T
Out[54]:
| A | B | C |
---|---|---|---|
Math | 85 | 90 | 95 |
Physics | 90 | 80 | 75 |
In [55]:
scores_df.index.name="name"
scores_df.columns.name="lesson"
scores_df
Out[55]:
lesson | Math | Physics |
---|---|---|
name | ||
A | 85 | 90 |
B | 90 | 80 |
C | 95 | 75 |
In [56]:
scores_df.values
Out[56]:
array([[85, 90],
       [90, 80],
       [95, 75]])
In [57]:
scores_index=scores_df.index
In [58]:
scores_index[1]="Jack"
scores_index
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[58], line 1
----> 1 scores_index[1]="Jack"
      2 scores_index

File ~/work/AI/blog/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:5371, in Index.__setitem__(self, key, value)
   5369 @final
   5370 def __setitem__(self, key, value) -> None:
-> 5371     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations
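An Index object is immutable, so a single label cannot be assigned in place. Two common ways to relabel a row instead are sketched below: `rename` with a mapping, or replacing the whole index at once. The variable names `renamed`, `labels` and `relabeled` are illustrative only.

```python
import pandas as pd

scores_df = pd.DataFrame({"Math": {"A": 85, "B": 90, "C": 95},
                          "Physics": {"A": 90, "B": 80, "C": 75}})

# Option 1: rename returns a new DataFrame with label "B" replaced.
renamed = scores_df.rename(index={"B": "Jack"})

# Option 2: edit a plain list of labels, then assign the whole index.
labels = scores_df.index.tolist()
labels[1] = "Jack"
relabeled = scores_df.copy()
relabeled.index = labels

print(renamed)
print(relabeled)
```

Both approaches leave `scores_df` itself unchanged, because `rename` returns a new object and the second option works on a copy.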
Indexing & Selection & Filtering
In [59]:
import numpy as np
In [61]:
obj=pd.Series(np.arange(5),
index=["a","b","c","d","e"])
obj
Out[61]:
a    0
b    1
c    2
d    3
e    4
dtype: int64
In [62]:
obj["c"]
Out[62]:
2
In [63]:
obj.iloc[2]
Out[63]:
2
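On a Series with string labels, an integer key is ambiguous between "position" and "label", and recent pandas versions deprecate the positional reading of plain brackets. A minimal sketch of the explicit alternatives, using `.iloc` for positions and `.loc` for labels:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

by_position = obj.iloc[2]     # third element, regardless of its label
by_label = obj.loc["c"]       # element whose label is "c"
last = obj.iloc[-1]           # negative positions count from the end
several = obj.iloc[[0, 2]]    # a list of positions returns a Series

print(by_position, by_label, last)
print(several)
```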
In [64]:
obj[0:3]
Out[64]:
a    0
b    1
c    2
dtype: int64
In [65]:
obj[["a","c"]]
Out[65]:
a    0
c    2
dtype: int64
In [66]:
obj.iloc[[0,2]]
Out[66]:
a    0
c    2
dtype: int64
In [67]:
obj[obj<2]
Out[67]:
a    0
b    1
dtype: int64
In [68]:
obj["a":"c"]
Out[68]:
a    0
b    1
c    2
dtype: int64
In [69]:
obj["b":"c"]=5
obj
Out[69]:
a    0
b    5
c    5
d    3
e    4
dtype: int64
In [70]:
data=pd.DataFrame(
np.arange(16).reshape(4,4),
index=["London","Paris",
"Berlin","Istanbul"],
columns=["one","two","three","four"])
data
Out[70]:
| one | two | three | four |
---|---|---|---|---|
London | 0 | 1 | 2 | 3 |
Paris | 4 | 5 | 6 | 7 |
Berlin | 8 | 9 | 10 | 11 |
Istanbul | 12 | 13 | 14 | 15 |
In [71]:
data["two"]
Out[71]:
London       1
Paris        5
Berlin       9
Istanbul    13
Name: two, dtype: int64
In [72]:
data[["one","two"]]
Out[72]:
| one | two |
---|---|---|
London | 0 | 1 |
Paris | 4 | 5 |
Berlin | 8 | 9 |
Istanbul | 12 | 13 |
In [73]:
data[:3]
Out[73]:
| one | two | three | four |
---|---|---|---|---|
London | 0 | 1 | 2 | 3 |
Paris | 4 | 5 | 6 | 7 |
Berlin | 8 | 9 | 10 | 11 |
In [74]:
data[data["four"]>5]
Out[74]:
| one | two | three | four |
---|---|---|---|---|
Paris | 4 | 5 | 6 | 7 |
Berlin | 8 | 9 | 10 | 11 |
Istanbul | 12 | 13 | 14 | 15 |
In [75]:
data[data<5]=0
data
Out[75]:
| one | two | three | four |
---|---|---|---|---|
London | 0 | 0 | 0 | 0 |
Paris | 0 | 5 | 6 | 7 |
Berlin | 8 | 9 | 10 | 11 |
Istanbul | 12 | 13 | 14 | 15 |
In [76]:
data.iloc[1]
Out[76]:
one      0
two      5
three    6
four     7
Name: Paris, dtype: int64
In [77]:
data.iloc[1,[1,2,3]]
Out[77]:
two      5
three    6
four     7
Name: Paris, dtype: int64
In [78]:
data.loc["Paris",["one","two"]]
Out[78]:
one    0
two    5
Name: Paris, dtype: int64
In [79]:
data.loc[:"Paris","four"]
Out[79]:
London    0
Paris     7
Name: four, dtype: int64
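Note that the label slice `:"Paris"` includes the "Paris" row itself. Label-based slices with `.loc` include the end label, while position-based slices with `.iloc` exclude the end position, as ordinary Python slices do. A short sketch of the difference, rebuilding the city table with its original values:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["London", "Paris", "Berlin", "Istanbul"],
                    columns=["one", "two", "three", "four"])

# Label slice: the end label "Paris" is included.
by_label = data.loc[:"Paris", "four"]

# Position slice: the end position 1 is excluded, so only row 0 remains.
by_position = data.iloc[:1, 3]

# The positional equivalent of the label slice needs an end of 2.
same_as_label = data.iloc[:2, 3]

print(by_label)
print(by_position)
```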
In [80]:
toy_data=pd.Series(np.arange(5),
index=["a","b","c",
"d","e"])
toy_data
Out[80]:
a    0
b    1
c    2
d    3
e    4
dtype: int64
In [81]:
toy_data.iloc[-1]
Out[81]:
4
Useful Methods
In [82]:
s=pd.Series([1,2,3,4],
index=["a","b","c","d"])
s
Out[82]:
a    1
b    2
c    3
d    4
dtype: int64
In [83]:
s2=s.reindex(["b","d","a","c","e"])
s2
Out[83]:
b    2.0
d    4.0
a    1.0
c    3.0
e    NaN
dtype: float64
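Reindexing with a label that is absent from the original index ("e" here) produces NaN, which is why the integer values are converted to floats. If a default value makes sense for the data, `reindex` accepts a `fill_value` argument; the choice of 0 below is an assumption made only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Without fill_value the new label "e" becomes NaN and the dtype is float.
with_nan = s.reindex(["b", "d", "a", "c", "e"])

# With fill_value the new label gets 0 and the values stay integers.
with_zero = s.reindex(["b", "d", "a", "c", "e"], fill_value=0)

print(with_nan)
print(with_zero)
```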
In [84]:
s3=pd.Series(["blue","yellow","purple"],
index=[0,2,4])
s3
Out[84]:
0      blue
2    yellow
4    purple
dtype: object
In [85]:
s3.reindex(range(6),method="ffill")
Out[85]:
0      blue
1      blue
2    yellow
3    yellow
4    purple
5    purple
dtype: object
In [86]:
df=pd.DataFrame(np.arange(9).reshape(3,3),
index=["a","c","d"],
columns=["Tim","Tom","Kate"])
df
Out[86]:
| Tim | Tom | Kate |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
In [87]:
df2=df.reindex(["d","c","b","a"])
df2
Out[87]:
| Tim | Tom | Kate |
---|---|---|---|
d | 6.0 | 7.0 | 8.0 |
c | 3.0 | 4.0 | 5.0 |
b | NaN | NaN | NaN |
a | 0.0 | 1.0 | 2.0 |
In [88]:
names=["Kate","Tim","Tom"]
df.reindex(columns=names)
Out[88]:
| Kate | Tim | Tom |
---|---|---|---|
a | 2 | 0 | 1 |
c | 5 | 3 | 4 |
d | 8 | 6 | 7 |
In [89]:
df.loc[["c","d","a"]]
Out[89]:
| Tim | Tom | Kate |
---|---|---|---|
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
a | 0 | 1 | 2 |
In [90]:
s=pd.Series(np.arange(5.),
index=["a","b","c","d","e"])
s
Out[90]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [91]:
new_s=s.drop("b")
new_s
Out[91]:
a    0.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [92]:
s.drop(["c","d"])
Out[92]:
a    0.0
b    1.0
e    4.0
dtype: float64
In [93]:
data=pd.DataFrame(np.arange(16).reshape(4,4),
index=["Kate","Tim",
"Tom","Alex"],
columns=list("ABCD"))
data
Out[93]:
| A | B | C | D |
---|---|---|---|---|
Kate | 0 | 1 | 2 | 3 |
Tim | 4 | 5 | 6 | 7 |
Tom | 8 | 9 | 10 | 11 |
Alex | 12 | 13 | 14 | 15 |
In [94]:
data.drop(["Kate","Tim"])
Out[94]:
| A | B | C | D |
---|---|---|---|---|
Tom | 8 | 9 | 10 | 11 |
Alex | 12 | 13 | 14 | 15 |
In [95]:
data.drop("A",axis=1)
Out[95]:
| B | C | D |
---|---|---|---|
Kate | 1 | 2 | 3 |
Tim | 5 | 6 | 7 |
Tom | 9 | 10 | 11 |
Alex | 13 | 14 | 15 |
In [96]:
data.drop("Kate",axis=0)
Out[96]:
| A | B | C | D |
---|---|---|---|---|
Tim | 4 | 5 | 6 | 7 |
Tom | 8 | 9 | 10 | 11 |
Alex | 12 | 13 | 14 | 15 |
In [97]:
data
Out[97]:
| A | B | C | D |
---|---|---|---|---|
Kate | 0 | 1 | 2 | 3 |
Tim | 4 | 5 | 6 | 7 |
Tom | 8 | 9 | 10 | 11 |
Alex | 12 | 13 | 14 | 15 |
In [98]:
data.mean(axis="index")
Out[98]:
A    6.0
B    7.0
C    8.0
D    9.0
dtype: float64
In [100]:
data.mean(axis="columns")
Out[100]:
Kate     1.5
Tim      5.5
Tom      9.5
Alex    13.5
dtype: float64
In [101]:
data.mean(axis=None)
Out[101]:
7.5
Arithmetic Operations
In [102]:
s1=pd.Series(np.arange(4),
index=["a","c","d","e"])
s2=pd.Series(np.arange(5),
index=["a","c","e","f","g"])
In [103]:
print(s1)
print(s2)
a    0
c    1
d    2
e    3
dtype: int64
a    0
c    1
e    2
f    3
g    4
dtype: int64
In [104]:
s1+s2
Out[104]:
a    0.0
c    2.0
d    NaN
e    5.0
f    NaN
g    NaN
dtype: float64
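The sum is aligned on the union of the two indexes, so any label present in only one Series ("d", "f", "g") yields NaN. A minimal sketch of the `add` method with `fill_value=0`, which treats a label missing on one side as 0 instead:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(4), index=["a", "c", "d", "e"])
s2 = pd.Series(np.arange(5), index=["a", "c", "e", "f", "g"])

plain = s1 + s2                    # NaN where a label is missing on one side
filled = s1.add(s2, fill_value=0)  # missing side is treated as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for counts it usually is, for measurements it may hide genuinely missing values.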
In [105]:
df1=pd.DataFrame(
np.arange(6).reshape(2,3),
columns=list("ABC"),
index=["Tim","Tom"])
df2=pd.DataFrame(
np.arange(9).reshape(3,3),
columns=list("ACD"),
index=["Tim","Kate","Tom"])
In [106]:
print(df1)
print(df2)
     A  B  C
Tim  0  1  2
Tom  3  4  5
      A  C  D
Tim   0  1  2
Kate  3  4  5
Tom   6  7  8
In [107]:
df1+df2
Out[107]:
| A | B | C | D |
---|---|---|---|---|
Kate | NaN | NaN | NaN | NaN |
Tim | 0.0 | NaN | 3.0 | NaN |
Tom | 9.0 | NaN | 12.0 | NaN |
In [108]:
df1.add(df2,fill_value=0)
Out[108]:
| A | B | C | D |
---|---|---|---|---|
Kate | 3.0 | NaN | 4.0 | 5.0 |
Tim | 0.0 | 1.0 | 3.0 | 2.0 |
Tom | 9.0 | 4.0 | 12.0 | 8.0 |
In [109]:
df1
Out[109]:
| A | B | C |
---|---|---|---|
Tim | 0 | 1 | 2 |
Tom | 3 | 4 | 5 |
In [111]:
1/df1
Out[111]:
| A | B | C |
---|---|---|---|
Tim | inf | 1.00 | 0.5 |
Tom | 0.333333 | 0.25 | 0.2 |
In [112]:
df1/2
Out[112]:
| A | B | C |
---|---|---|---|
Tim | 0.0 | 0.5 | 1.0 |
Tom | 1.5 | 2.0 | 2.5 |
In [113]:
s=df2.iloc[1]
s
Out[113]:
A    3
C    4
D    5
Name: Kate, dtype: int64
In [114]:
df2
Out[114]:
| A | C | D |
---|---|---|---|
Tim | 0 | 1 | 2 |
Kate | 3 | 4 | 5 |
Tom | 6 | 7 | 8 |
In [115]:
df2-s
Out[115]:
| A | C | D |
---|---|---|---|
Tim | -3 | -3 | -3 |
Kate | 0 | 0 | 0 |
Tom | 3 | 3 | 3 |
In [116]:
s2=df2["A"]
s2
Out[116]:
Tim     0
Kate    3
Tom     6
Name: A, dtype: int64
In [117]:
df2.sub(s2,axis="index")
Out[117]:
| A | C | D |
---|---|---|---|
Tim | 0 | 1 | 2 |
Kate | 0 | 1 | 2 |
Tom | 0 | 1 | 2 |
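The two subtractions above differ in how the Series is matched against the DataFrame. By default the Series index is aligned with the column labels and the operation is repeated for every row; with `axis="index"` the Series index is aligned with the row labels instead. The sketch below puts both side by side and also shows what happens when a column is subtracted without the `axis` argument.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list("ACD"),
                   index=["Tim", "Kate", "Tom"])

row = df2.iloc[1]   # labels A, C, D: matches the columns
col = df2["A"]      # labels Tim, Kate, Tom: matches the rows

by_row = df2 - row                    # row subtracted from every row
by_col = df2.sub(col, axis="index")   # column subtracted from every column

# Without axis="index", the labels Tim/Kate/Tom are matched against the
# columns A/C/D; nothing overlaps, so every cell is NaN.
mismatched = df2 - col

print(by_row)
print(by_col)
print(mismatched)
```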
In [118]:
df2
Out[118]:
| A | C | D |
---|---|---|---|
Tim | 0 | 1 | 2 |
Kate | 3 | 4 | 5 |
Tom | 6 | 7 | 8 |
Applying a Function
In [119]:
df=pd.DataFrame(
np.random.randn(4,3),
columns=list("ABC"),
index=["Kim","Susan","Tim","Tom"])
df
Out[119]:
| A | B | C |
---|---|---|---|
Kim | 2.554629 | -1.113764 | 0.968447 |
Susan | 0.596522 | -0.653082 | 0.068941 |
Tim | -1.996567 | -1.629866 | 1.012815 |
Tom | -0.250421 | -1.260170 | 0.384344 |
In [120]:
np.abs(df)
Out[120]:
| A | B | C |
---|---|---|---|
Kim | 2.554629 | 1.113764 | 0.968447 |
Susan | 0.596522 | 0.653082 | 0.068941 |
Tim | 1.996567 | 1.629866 | 1.012815 |
Tom | 0.250421 | 1.260170 | 0.384344 |
In [121]:
f=lambda x:x.max()-x.min()
In [122]:
df.apply(f)
Out[122]:
A    4.551196
B    0.976784
C    0.943874
dtype: float64
In [123]:
df.apply(f,axis=1)
Out[123]:
Kim      3.668393
Susan    1.249605
Tim      3.009383
Tom      1.644514
dtype: float64
In [124]:
def f(x):
    return x**2
In [125]:
df.apply(f)
Out[125]:
| A | B | C |
---|---|---|---|
Kim | 6.526127 | 1.240471 | 0.937890 |
Susan | 0.355839 | 0.426516 | 0.004753 |
Tim | 3.986281 | 2.656463 | 1.025795 |
Tom | 0.062710 | 1.588027 | 0.147720 |
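`apply` passes each column (or each row, with `axis=1`) to the function as a Series. When the function returns a single number the result is a Series; when it returns a Series of the same length the result is a DataFrame. A sketch of both cases follows, together with the vectorized expression that gives the same squares without `apply`. The seeded generator is an assumption added here so that the numbers are reproducible; it is not part of the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((4, 3)),
                  columns=list("ABC"),
                  index=["Kim", "Susan", "Tim", "Tom"])

def spread(x):
    """Return the range (max minus min) of a Series."""
    return x.max() - x.min()

col_spread = df.apply(spread)          # one value per column
row_spread = df.apply(spread, axis=1)  # one value per row

squared_apply = df.apply(lambda x: x**2)  # element-wise via apply
squared_vec = df**2                       # same result, vectorized

print(col_spread)
print(row_spread)
```

For simple arithmetic the vectorized form is usually shorter and faster; `apply` is more useful when the function needs the whole row or column.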
Sorting & Ranking
In [3]:
s=pd.Series(range(5),
index=["e","d","a","b","c"])
s
Out[3]:
e    0
d    1
a    2
b    3
c    4
dtype: int64
In [4]:
s.sort_index()
Out[4]:
a    2
b    3
c    4
d    1
e    0
dtype: int64
In [8]:
df=pd.DataFrame(
np.arange(12).reshape(3,4),
index=["two","one","three"],
columns=["d","a","b","c"])
df
Out[8]:
| d | a | b | c |
---|---|---|---|---|
two | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
three | 8 | 9 | 10 | 11 |
In [11]:
df.sort_index()
Out[11]:
| d | a | b | c |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 8 | 9 | 10 | 11 |
two | 0 | 1 | 2 | 3 |
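`sort_index` sorts by labels. The sketch below extends the same idea to sorting the column labels, sorting by values, and ranking. The small Series `points` is a hypothetical example chosen so that it contains a tie.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["two", "one", "three"],
                  columns=["d", "a", "b", "c"])

by_columns = df.sort_index(axis=1)                  # sort the column labels
by_values = df.sort_values(by="a", ascending=False) # sort rows by column "a"

# Hypothetical values with a tie between the two 7s.
points = pd.Series([7, -5, 7, 4])
avg_rank = points.rank()                  # ties share the average rank
first_rank = points.rank(method="first")  # ties broken by order of appearance

print(by_columns)
print(by_values)
print(avg_rank)
print(first_rank)
```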