Operating Systems Chapter 10: File systems

2017 Spring. Instructor: Joonho Kwon, Data Science Laboratory, PNU

These slides are based on the slides prepared by Anthony Joseph.

How do we Hide I/O Latency?

• Blocking Interface: “Wait”
  - When requesting data (e.g., read() system call), put the process to sleep until data is ready
  - When writing data (e.g., write() system call), put the process to sleep until the device is ready for data
• Non-blocking Interface: “Don’t Wait”
  - Returns quickly from a read or write request with the count of bytes successfully transferred
  - Read may return nothing, write may write nothing
• Asynchronous Interface: “Tell Me Later”
  - When requesting data, take a pointer to the user’s buffer and return immediately; later the kernel fills the buffer and notifies the user
  - When sending data, take a pointer to the user’s buffer and return immediately; later the kernel takes the data and notifies the user

I/O & Storage Layers: Operations, Entities and Interface

• Application / Service
• High Level I/O: streams
• Low Level I/O: handles
• Syscall: registers
• File System: descriptors (we are here …)
  - file_open, file_read, … on struct file * & void *
• I/O Driver: Commands and Data Transfers
• Disks, Flash, Controllers, DMA

Recall: C Low level I/O

• Operations on File Descriptors, as OS object representing the state of a file
  - User has a “handle” on the descriptor

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int open (const char *filename, int flags [, mode_t mode])
int creat (const char *filename, mode_t mode)
int close (int filedes)

• flags is a bit vector of:
  - Access modes (Rd, Wr, …)
  - Open Flags (Create, …)
  - Operating modes (Appends, …)
• mode is a bit vector of Permission Bits:
  - User|Group|Other X R|W|X

http://www.gnu.org/software/libc/manual/html_node/Opening-and-Closing-Files.html

Recall: C Low Level Operations

ssize_t read (int filedes, void *buffer, size_t maxsize)
  - returns bytes read, 0 => EOF, -1 => error
ssize_t write (int filedes, const void *buffer, size_t size)
  - returns bytes written
off_t lseek (int filedes, off_t offset, int whence)
  - set the file offset
    * if whence == SEEK_SET: set file offset to “offset”
    * if whence == SEEK_CUR: set file offset to current location + “offset”
    * if whence == SEEK_END: set file offset to file size + “offset”
int fsync (int filedes)
  - wait for I/O of filedes to finish and commit to disk
void sync (void)
  - wait for ALL I/O to finish and commit to disk

• When write returns, data is on its way to disk and can be read, but it may not actually be permanent!


Building a File System

• File System: Layer of OS that transforms block interface of disks (or other block devices) into Files, Directories, etc.
• File System Components
  - Naming: Interface to find files by name, not by blocks
  - Disk Management: Collecting disk blocks into files
  - Protection: Layers to keep data secure
  - Reliability/Durability: Keeping files durable despite crashes, media failures, attacks, etc.

Recall: User vs. System View of a File

• User’s view:
  - Durable Data Structures
• System’s view (system call interface):
  - Collection of Bytes (UNIX)
  - Doesn’t matter to system what kind of data structures you want to store on disk!
• System’s view (inside OS):
  - Collection of blocks (a block is a logical transfer unit, while a sector is the physical transfer unit)
  - Block size ≥ sector size; in UNIX, block size is 4KB

Recall: Translating from User to System View

• What happens if user says: give me bytes 2 to 12?
  - Fetch block corresponding to those bytes
  - Return just the correct portion of the block
• What about: write bytes 2 to 12?
  - Fetch block
  - Modify portion
  - Write out block
• Everything inside File System is in whole size blocks
  - For example, getc(), putc() ⇒ buffers something like 4096 bytes, even if interface is one byte at a time
• From now on, file is a collection of blocks


Disk Management Policies (1/2)

• Basic entities on a disk:
  - File: user-visible group of blocks arranged sequentially in logical space
  - Directory: user-visible index mapping names to files
• Access disk as linear array of sectors. Two Options:
  - Identify sectors as vectors [cylinder, surface, sector], sorted in cylinder-major order; not used anymore
  - Logical Block Addressing (LBA): every sector has an integer address from zero up to max number of sectors
• Controller translates from address ⇒ physical position
  - First case: OS/BIOS must deal with bad sectors
  - Second case: hardware shields OS from structure of disk

Recall: Disk Management Policies (2/2)

• Need way to track free disk blocks
  - Link free blocks together ⇒ too slow today
  - Use bitmap to represent free space on disk
• Need way to structure files: File Header
  - Track which blocks belong at which offsets within the logical file structure
  - Optimize placement of files’ disk blocks to match access and usage patterns


Designing a File System …

• What factors are critical to the design choices?
• Durable data store => it’s all on disk
• (Hard) Disks Performance !!!
  - Maximize sequential access, minimize seeks
• Open before Read/Write
  - Can perform protection checks and look up where the actual file resources are, in advance
• Size is determined as they are used !!!
  - Can write (or read zeros) to expand the file
  - Start small and grow, need to make room
• Organized into directories
  - What data structure (on disk) for that?
• Need to allocate / free blocks
  - Such that access remains efficient

File system

• Data Structure
• Functions

These slides are based on materials by Professor Kern Koh (고건) of Seoul National University, provided through the OLC Center.

Terminology

[Figure: memory (OS) holding pages and buffers (buf); disk holding sectors and blocks]

• sector
  - smallest addressable unit in hardware
• block
  - multiple of disk sector (512B on x86)
  - kernel performs all I/O in terms of blocks
• buffer
  - place in main memory where block is stored
• page
  - multiple of block (4KB on x86)
• segment
  - a chunk of a buffer, contiguous in memory
  - its size is smaller than block

Kernel Data Structure for File

[Figure: Process 1 and Process 2, each with a PCB; the CPU and memory; an FCB for the File. Legend: each box is either a table (data structure) or an object (hardware or software).]

Meta-data for a File

Information kernel needs for a file:
• owner (eg Clinton)
• protection (eg rwx r-- r--)
• device (eg disk)
• content (eg sector address)
• device driver routines (eg read(), open())
• accessing where now (eg offset)
• ….

Contiguous allocation

[Figure: a disk with files stored contiguously and holes between them]

• Poor Insert/Delete
• Free Space Management
  - Compaction
  - Wasted Space

Scattered allocation

• Contents of file FA may be stored on disk non-contiguously, in units of disk sectors

Why not contiguous allocation?
• (O) fast, if R/W whole content sequentially
  - use for swap, device copy, …
• (X) space management
  - many small holes (useless)
  - external fragmentation

File metadata

• Kernel maintains metadata for each file
• File metadata includes pointers to data sectors
• Open() retrieves metadata from disk to main memory
  - But not contents: they are too big !!
• This in-memory metadata has pointers to data sectors
• Now, you can reach any data sector through in-core metadata
  - “offset” = byte position to R/W next

[Figure sequence: a file's metadata on disk pointing to its scattered content sectors; after open(), a copy of the metadata in main memory points to the same sectors]

Processes Sharing a File

• Three processes (PA, PB, PC) share the same file FX
  - three copies of file metadata in memory
• If PB updates metadata, broadcast to all copies
  - risky (inconsistency)
  - inefficient
• Copying file metadata is expensive
  - keep copies to a minimum
  - share a single copy whenever possible
• What about file offset?
  - Cannot be shared: private copy per process

Split Metadata for file

• Systemwide information: “inode” table
  - owner, protection information, device, pointer to file content, device driver routines
  - All processes share a single copy in memory (these fields are not updated frequently)
• Private information: file table
  - offset
  - Let each process have a private copy, since processes access different parts

Two data structures for each file

• (system) file table: private information
  - process private info, per-process data (PA and PB each have their own offset)
  - next byte position to r/w
• inode table: systemwide information
  - all the rest of the info
  - shared info, single copy globally
  - less frequently changed information

struct inode

struct inode {
        char    i_flag;
        char    i_count;        /* reference count */
        int     i_dev;          /* device where inode resides */
        int     i_number;       /* i number, 1-to-1 with device address */
        int     i_mode;
        char    i_nlink;        /* directory entries */
        char    i_uid;          /* owner */
        char    i_gid;          /* group of owner */
        char    i_size0;        /* most significant of size */
        char    *i_size1;       /* least sig */
        int     i_addr[8];      /* sector addresses constituting file */
        int     i_lastr;        /* last logical block read (for read-ahead) */
} inode[NINODE];

struct file

/*
 * One file structure is allocated for each open/creat/pipe call.
 * Main use is to hold the read/write offset
 */
struct file {
        char    f_flag;
        char    f_count;        /* reference count */
        int     f_inode;        /* pointer to inode structure */
        char    *f_offset[2];   /* read/write character pointer */
} file[NFILE];

/* flags */
#define FREAD   01
#define FWRITE  02
#define FPIPE   04

• file: offset
• inode: all the rest of the info

Disk Space for ...

• Space for inodes: inode size is fixed
• Space for data blocks: file data size varies

Space for inode in Disk (each inode has a fixed size)

• inode 0, inode 1, inode 2, …
• i-number: ordinal number of inode in disk
• If I know (disk, i-number), I can access file content
  - (disk name, i-number) ⇒ inode ⇒ content

Sharing files

Example
• (Case-1) who, grep: pipe file
  - % who | grep
  - share inode (pipe file)
  - do not share offset
  - who writes a sequence of bytes into the pipe, grep reads it
• (Case-2) parent/child (sh, vi): tty file
  - % vi
  - share inode (tty file)
  - share offset


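Sharing files

[Figure: (system) file table and inode table. For the pipe ($ grep|who), who and grep each have their own file table entry (offset), both pointing to the inode of the pipe file. For the process group ($ vi), sh and vi share one file table entry (offset) pointing to the inode of the tty device. A third process, game, has its own entry (offset) pointing to the inode of the game file.]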

Device switch table

• 2-dim array which maps (device name, operation name) => device driver routine
• devswtab[]: one row per device, with columns open, close, read, write, ioctl
  - e.g., a read entry read_lp holds the starting address of the read_lp() routine
• device independence (above: file, below: device)

Actually, a one-dimensional array of structs, not a two-dimensional array

struct cdevsw {
        int     (*d_open)();
        int     (*d_close)();
        int     (*d_read)();
        int     (*d_write)();
        int     (*d_sgtty)();
} cdevsw[];

• Each device (device 1, device 2, device 3, …) has one entry holding (*d_open)(), (*d_close)(), (*d_read)(), (*d_write)(), (*d_sgtty)(), corresponding to open, close, read, write, ioctl
• e.g., the d_read slot of one device points to read_lp


Kernel tables after open(/a/b) (1/3 to 3/3)

[Figure sequence: (1) the inode table holds the in-core inodes of /, a and b, with b's device name; (2) a (system) file table entry holds the offset and points to b's inode; (3) u_ofile in PA's user structure has entries 0 1 2 3 4, and fd = 4 points to that file table entry.]

File descriptor table (or open file table)

• An array in struct user (u_ofile[] array)
  - per process open file information
  - an entry is added whenever a program calls open(), creat()
• fd = open(“/a/b”, …)
  - (1) pathname of the file to open
  - (2) kernel opens file
  - (3) file descriptor is returned
• fd is an integer (“file descriptor”), starts from 0, 1, 2 ..
  - 0, 1, 2 reserved for standard (input/output/error) file
• fd is used as an index into
  - u_ofile[] array (file descriptor table, open file table)
  - starting point to access file (points to system file table)

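Kernel data structure for file

[Figure: process PA's user structure (per process) holds u_ofile[], indexed by fd (0 1 2 3 4). Each entry points to a (system) file table entry holding an offset, which points to an inode table entry; the inode leads through devswtab to the driver routine and the device.]

• u_ofile[] is the open file table, or file descriptor table
  - “file handle” extends this notion to the network; it is also the Windows name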

Kernel Data Structure

[Figure, repeated on the next three slides: Process 1's user structure → file table entry (offset) → inode → devswtab → read( ); CPU; file FX]

(System) file table

• One entry for each open/creat/pipe
• may be shared (if offset is shared)
• content
  - offset
  - counter (number of processes sharing this entry)
  - pointer to inode table
  - r/w/p flag

Inode table

• Changed less frequently (than offset)
• includes most of the information
• content (while in disk)
  - protection mode
  - owner
  - size
  - time
  - array of pointers to disk data blocks

In core Inode

• content (while in disk)
  - protection mode, owner, size, time, array of pointers to disk blocks
• plus (at load time)
  - counter (number of processes sharing)
  - device name (major/minor device number)
  - i-number (location of inode in disk)
  - status (locked, mount point, …)

Pointer array within inode

• Now, you can reach any data block through the in-core inode
• These pointers are stored in an array within the inode
  - direct 0 … direct 9: sector addresses of data blocks
  - single indirect
  - double indirect
  - triple indirect
• Fast for small files, slower for big files


Offset vs Disk Block (timesharing application)

[Figure: the inode's pointer array (direct 0 … direct 9, single indirect, double indirect, triple indirect), each entry holding a sector address (e.g., 57821) of a data block. Offsets marked on the figure: ~1KB at direct 0, ~2KB at direct 1, ~9KB at direct 8, ~109KB at single indirect, ~10109KB at double indirect.]

Linux

• 1st to 12th pointers: direct pointers
• 13th pointer: indirect pointer
• 14th pointer: doubly indirect pointer
• 15th pointer: triply indirect pointer
• Max 4096 GB file data if
  - block address: 32 bits
  - block size: 4096 bytes


Directory file

• A directory file is also a file
• content: (file name, i-number) pairs

  File name   i-number
  “a”         7
  “b”         1
  “bin”       3
  “dev”       772

• i-number = 3 ⇒ 3rd inode in disk ⇒ data blocks
• Q: file name: limit on number of chars?
• Q: # of files: limit?

Kernel tables before open(/a/b) (1/3 to 3/3)

[Figure sequence: process PA's user structure, the (system) file table, and the inode table, which initially holds only the inode of /. The data blocks of / hold the directory entries (a 7, bin 11, x 8). The inode of a is then read, and its data blocks hold the entries (b 3, usr 21, y 6).]

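open(“/a/b”) (1/2)

[Figure: / is a directory whose inode (i) points to data blocks holding the entries (a 7, x 6, y 8, bin 11, dev 40). /a is a directory whose data blocks hold (w 7, u 6, b 8, ch 11, temp 40). /a/b is a regular file whose inode points to data blocks holding the content of this file.]

open(“/a/b”) (2/2)

[Figure: inode 0 is / and its data blocks hold (a 7, bin 11, dev 40) ⇒ inode of “a”, whose data blocks hold (b 8, u 6, ch 11) ⇒ inode of “b” ⇒ file b’s data blocks.]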

open(“/a/b”, …) (1/2)

• Kernel system call open( ) scans pathname /a/b
• 1st: root directory file “/”
  - get inode 0 in disk inode space
  - read data blocks of root directory file (a 7, bin 11, dev 40)
  - search for file name “a”
  - get corresponding i-number for file “a”
• 2nd: “a” file
  - get inode of “a” from disk (also directory file)
  - get data blocks of directory file “a”
  - search for file name “b”
  - get corresponding i-number for file “b”

open(“/a/b”, …) (2/2)

• Kernel system call open( ) scans pathname /a/b (cont’d)
• file “b”:
  - read inode of “b” from disk (regular file)
  - pathname ends here
• set up kernel data structures for file “b”
  - insert inode into in-core inode table
  - new entry in system file table (offset ← 0)
  - new entry in u_ofile[] in user
  - return file descriptor
• open( ) is done

Kernel tables after open(/a/b) (1/4 to 4/4)

[Figure sequence: (1) the inode table now holds the in-core inodes of /, a and b, with b's device name; (2) a new (system) file table entry holds the offset and points to b's inode; (3) a new u_ofile entry, fd = 4, in PA's user structure points to that file table entry; (4) fd = 4 is returned.]

• Once you have fd, you can access b’s inode after only 3 memory accesses

open(“/a/b”, …) again

• open() is very expensive: disk accesses
  - once (open or creat) is enough
  - translate (pathname => fd) once, save it
• do not use pathname in subsequent calls
  - read( ), write( ), close( ), …
• use file descriptor instead
  - read(fd, ... ), write(fd, ... ),

C functions for file

• Wait a minute …
  - I used printf(), scanf(), getchar() …. But never used read(), write() before …?
  - I used FILE * …. But never used fd (file descriptor) before …?
• Right, most people use library functions
  - And the library then invokes system calls
  - Remember? Library cannot perform I/O directly ….
  - library functions are in my address space (user)

System calls for files

• creat()
• open(), close()
• read(), write()
• lseek(): move offset
• stat(): get inode content
• All others are library functions
  - eg scanf(), gets(), getchar(), …..

System call vs. Library call

• in kernel: system call
  - read(), takes fd
• in a.out (user): library call
  - scanf() (format), getchar() (char), gets() (string): tty files
  - fscanf() (format), fgetc() (char), fgets() (string), fread() (any number): all files, take FILE * (struct in lib)

FILE vs fd

[Figure: the user a.out contains my code (main( ), add( ), sub( )) and the library (fopen( ), printf( )). The library's FILE struct holds a local buffer (count, buf pointer, buf) and a file descriptor. A system call (trap( )) enters the kernel, where fd indexes u_ofile (0 1 2 3 4) in user, leading to the (system) file table (offset), the inode table (/, a, b) and the data blocks.]

• When the local buffer (in FILE) becomes empty, the read() system call fills this buffer again (on output, write() empties it)

Example: open

1. my a.out calls library fopen(“/a/b/c”)
2. fopen() creates struct FILE for /a/b/c
3. library invokes system call open(“/a/b/c”)
   - kernel sets up tables (inode, user, .., u_ofile[])
   - kernel returns file descriptor fd
   - fopen() saves fd in FILE (for future use)
   - fopen() returns FILE *
4. my a.out saves FILE * (for future use)
5. all future use passes this FILE * to library calls (e.g., fgetc())

Example: getchar()

#include "syscalls.h"

/* Library function: copied into my a.out.
 * buf, bufp and n are the data structure kept inside the library. */
int getchar(void)
{
    static char buf[BUFSIZ];    /* library local buffer */
    static char *bufp = buf;    /* pointer to the next character */
    static int n = 0;           /* counter: characters left in buf */

    /* Is library local buffer empty? */
    if (n == 0) {
        /* Yes, invoke read() system call & fill up local buffer */
        n = read(0, buf, sizeof(buf));    /* system call */
        bufp = buf;
    }
    /* Return a character; the test is >= 0 so that the last
     * buffered character is not dropped. */
    return (--n >= 0) ? (unsigned char) *bufp++ : EOF;
}

Functions for file handling

• So, you usually use the library …
  - printf() for formatting (such as %s, %d)
  - getchar() for performance ….
• But all library I/O functions end up asking for a system call
  - (Library functions are “user” code & cannot do I/O directly)
  - They are front-ends and provide you with convenience, performance …
• Many library functions may exist
  - But there’s only one system call for read()

Summary

• File System:
  - Transforms blocks into Files and Directories
  - Optimize for access and usage patterns
  - Maximize sequential access, allow efficient random access
• File (and directory) defined by header, called “inode”
• File Allocation Table (FAT) Scheme
  - Linked-list approach
  - Very widely used: Cameras, USB drives, SD cards
  - Simple to implement, but poor performance and no security
• Look at actual file access patterns: many small files, but large files take up all the space!

Q&A

