Device Power State

This post covers only native PCI power management, where the power management (PM) registers of a device are programmed through its PCI configuration space. Other kinds of PM, such as runtime PM and platform PM through firmware (for example ACPI), are not covered.

What is Device Power State?

In general, power management is a feature allowing one to save energy by putting devices into states in which they draw less power (low-power states) at the price of reduced functionality or performance.

PCI devices may be put into low-power states in two ways, by using the device capabilities introduced by the PCI Bus Power Management Interface Specification, or with the help of platform firmware, such as an ACPI BIOS. In the first approach, that is referred to as the native PCI power management (native PCI PM) in what follows, the device power state is changed as a result of writing a specific value into one of its standard configuration registers. The second approach requires the platform firmware to provide special methods that may be used by the kernel to change the device’s power state.

Devices supporting the native PCI PM usually can generate wakeup signals called Power Management Events (PMEs) to let the kernel know about external events requiring the device to be active. After receiving a PME the kernel is supposed to put the device that sent it into the full-power state. However, the PCI Bus Power Management Interface Specification doesn’t define any standard method of delivering the PME from the device to the CPU and the operating system kernel. It is assumed that the platform firmware will perform this task and therefore, even though a PCI device is set up to generate PMEs, it also may be necessary to prepare the platform firmware for notifying the CPU of the PMEs coming from the device (e.g. by generating interrupts).

What is Native PCI PM?

The implementation of the PCI PM Spec is optional for conventional PCI devices, but it is mandatory for PCI Express devices. If a device supports the PCI PM Spec, it has an 8 byte power management capability field in its PCI configuration space. This field is used to describe and control the standard features related to the native PCI power management.
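
As a rough illustration of how that 8-byte structure is located, the sketch below walks the capability list of a configuration space image in user space. The names find_pm_cap() and cfg_read16() and all macro names are made up for this example; the offsets (status register at 0x06, capabilities pointer at 0x34, capability ID 0x01) are the standard PCI ones. Inside the kernel the same lookup is done by pci_find_capability().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI configuration space offsets and values. */
#define CFG_STATUS          0x06  /* Status register (16 bit) */
#define CFG_STATUS_CAP_LIST 0x10  /* Status bit 4: capability list present */
#define CFG_CAP_PTR         0x34  /* Offset of the first capability */
#define CAP_ID_PM           0x01  /* Capability ID of power management */

/* Layout of the 8-byte power management capability. */
#define PM_CAP_ID    0  /* 1 byte: capability ID (0x01) */
#define PM_CAP_NEXT  1  /* 1 byte: offset of the next capability */
#define PM_CAP_PMC   2  /* 2 bytes: power management capabilities */
#define PM_CAP_PMCSR 4  /* 2 bytes: control/status register */
#define PM_CAP_BSE   6  /* 1 byte: bridge support extensions */
#define PM_CAP_DATA  7  /* 1 byte: data register */

/*
 * Walk the capability list in a 256-byte configuration space image and
 * return the offset of the PM capability, or 0 if the device has none.
 */
static uint8_t find_pm_cap(const uint8_t cfg[256])
{
	uint8_t pos;
	int guard = 48; /* upper bound on list length, protects against loops */

	assert(cfg != NULL);

	if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST))
		return 0;

	pos = cfg[CFG_CAP_PTR] & 0xfc; /* low two bits are reserved */
	while (pos >= 0x40 && guard-- > 0) {
		/* The 8-byte structure must fit below offset 0x100. */
		if (cfg[pos + PM_CAP_ID] == CAP_ID_PM && pos <= 0xf8)
			return pos;
		pos = cfg[pos + PM_CAP_NEXT] & 0xfc;
	}
	return 0;
}

/* Read a little-endian 16-bit register; 'off' must be at most 0xfe. */
static uint16_t cfg_read16(const uint8_t cfg[256], uint8_t off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}
```

On Linux, an image to feed into find_pm_cap() can be read from /sys/bus/pci/devices/<BDF>/config; root privileges are needed to read beyond the first 64 bytes.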

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses (B0-B3). The higher the number, the less power is drawn by the device or bus in that state. However, the higher the number, the longer the latency for the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification. The first one is D3hot, referred to as the software accessible D3, because devices can be programmed to go into it. The second one, D3cold, is the state that PCI devices are in when the supply voltage (Vcc) is removed from them. It is not possible to program a PCI device to go into D3cold, although there may be a programmable interface for putting the bus the device is on into a state in which Vcc is removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold, regardless of whether or not it implements the PCI PM Spec. In addition to that, if the PCI PM Spec is implemented by the device, it must support D3hot as well as D0. The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the supported low-power states (except for D3cold). While in D1-D3hot the standard configuration registers of the device must be accessible to software (i.e. the device is required to respond to PCI configuration accesses), although its I/O and memory spaces are then disabled. This allows the device to be programmatically put into D0. Thus the kernel can switch the device back and forth between D0 and the supported low-power states (except for D3cold) and the possible power state transitions the device can undergo are the following:

Current State | New State
D0            | D1, D2, D3
D1            | D2, D3
D2            | D3
D1, D2, D3    | D0
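
The transition table can be written down as a small check. This is only a user-space sketch: the enum and the function name pm_transition_allowed() are invented for the example, and it does not model whether a particular device actually implements the optional D1 and D2 states.

```c
#include <assert.h>
#include <stdbool.h>

/* Software-visible device power states; D3cold is excluded on purpose. */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT, PM_STATE_COUNT };

/*
 * Return true if software may move the device directly from 'from' to 'to':
 * any deeper state can be entered from a shallower one, and every
 * low-power state can return to D0.
 */
static bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	assert(from < PM_STATE_COUNT && to < PM_STATE_COUNT);

	if (to == PM_D0)
		return from != PM_D0;  /* D1, D2, D3 -> D0 */
	return to > from;              /* D0 -> D1/D2/D3, D1 -> D2/D3, D2 -> D3 */
}
```

Going from D2 to D1, for example, is rejected: the device has to return to D0 first.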

The transition from D3cold to D0 occurs when the supply voltage is provided to the device (i.e. power is restored). In that case the device returns to D0 with a full power-on reset sequence and the power-on defaults are restored to the device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs while in any power state (D0-D3), but they are not required to be capable of generating PMEs from all supported power states. In particular, the capability of generating PMEs from D3cold is optional and depends on the presence of additional voltage (3.3Vaux) allowing the device to remain sufficiently active to generate a wakeup signal.

What does the implementation look like?

This is a typical implementation. Drivers can slightly change the order of the operations in the implementation, ignore some operations or add more driver specific operations in it, but drivers should do something like this overall.

 

A reference implementation -1

The legacy power management implementation works together with the device power states.


The kernel callbacks below are built when CONFIG_PM_SLEEP is enabled in the kernel. They are called after executing the following command:

# echo mem > /sys/power/state

The whole host system then goes into a sleep state and wakes up only after a wakeup event is generated, e.g. when you press the power button manually.


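#ifdef CONFIG_PM_SLEEP

/*
 * Sketch only: struct driver_priv and driver_isr stand for the driver's
 * private data and interrupt handler, and error handling is omitted.
 */
static int driver_suspend(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct driver_priv *priv = pci_get_drvdata(pdev);

	rtnl_lock();

	/* Device driver specific operations, e.g. quiescing the hardware */

	/* Disable IRQ */
	free_irq(pdev->irq, priv);

	/* If using MSI, disable MSI */
	pci_disable_msi(pdev);

	pci_save_state(pdev);
	pci_enable_wake(pdev, PCI_D3hot, true);

	/* Disable IO/bus master/irq router */
	pci_disable_device(pdev);
	pci_set_power_state(pdev, PCI_D3hot);

	rtnl_unlock();
	return 0;
}

static int driver_resume(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct driver_priv *priv = pci_get_drvdata(pdev);

	rtnl_lock();

	pci_set_power_state(pdev, PCI_D0);
	pci_restore_state(pdev);

	/* The device's irq may have changed, the driver should take care */
	pci_enable_device(pdev);
	pci_set_master(pdev);

	/* If using MSI, the device's vector may have changed */
	pci_enable_msi(pdev);
	request_irq(pdev->irq, driver_isr, IRQF_SHARED, "driver", priv);

	/* Device driver specific operations */

	rtnl_unlock();
	return 0;
}

static const struct dev_pm_ops driver_pm_ops = {
	.suspend = driver_suspend,
	.resume = driver_resume,
};

#endif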


/* PCIe - interface structure */
static struct pci_driver driver_driver = {
	.name = driver_driver_name,
	.id_table = driver_pci_tbl,
	.probe = driver_probe,
	.remove = driver_remove,
#ifdef CONFIG_PM_SLEEP
	.driver = {
		.name = "driver",
		.pm = &driver_pm_ops,
	},
#endif
};	/* closing brace sits outside the #ifdef so both configurations compile */

 

Implement the .suspend() and .resume() callbacks and bind them to the driver through a "struct dev_pm_ops" variable.

 

A reference implementation -2

 

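In the implementation above, the kernel function pci_set_power_state() takes care of writing to the device's PCI configuration space. A driver can also program the power management control/status register (PMCSR) itself, as the following function from the Broadcom bnx2x driver does.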

 

int bnx2x_set_power_state(struct bnx2x *bp, pci_power_t state)
{
	u16 pmcsr;

	/* If there is no power capability, silently succeed */
	if (!bp->pdev->pm_cap) {
		BNX2X_DEV_INFO("No power capability. Breaking.\n");
		return 0;
	}

	pci_read_config_word(bp->pdev, bp->pdev->pm_cap + PCI_PM_CTRL, &pmcsr);

	switch (state) {
	case PCI_D0:
		pci_write_config_word(bp->pdev, bp->pdev->pm_cap + PCI_PM_CTRL,
				      ((pmcsr & ~PCI_PM_CTRL_STATE_MASK) |
				       PCI_PM_CTRL_PME_STATUS));

		if (pmcsr & PCI_PM_CTRL_STATE_MASK)
			/* delay required during transition out of D3hot */
			msleep(20);
		break;

	case PCI_D3hot:
		/* If there are other clients above don't
		   shut down the power */
		if (atomic_read(&bp->pdev->enable_cnt) != 1)
			return 0;
		/* Don't shut down the power for emulation and FPGA */
		if (CHIP_REV_IS_SLOW(bp))
			return 0;

		pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
		pmcsr |= 3;

		if (bp->wol)
			pmcsr |= PCI_PM_CTRL_PME_ENABLE;

		pci_write_config_word(bp->pdev, bp->pdev->pm_cap + PCI_PM_CTRL,
				      pmcsr);

		/* No more memory access after this point until
		* device is brought back to D0.
		*/
		break;

	default:
		dev_err(&bp->pdev->dev, "Can't support state = %d\n", state);
		return -EINVAL;
	}
	return 0;

}
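
The register arithmetic in the function above can be tried out in user space. The sketch below is an assumption-laden simplification, not driver code: pmcsr_for_state() is a made-up name, the "wake" flag plays the role of bp->wol, and the function only computes the value that would be written to PMCSR. The bit values match PCI_PM_CTRL_STATE_MASK, PCI_PM_CTRL_PME_ENABLE and PCI_PM_CTRL_PME_STATUS in the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PMCSR bit definitions, with the same values as the kernel's pci_regs.h. */
#define PM_CTRL_STATE_MASK 0x0003 /* power state field, 0 = D0 ... 3 = D3hot */
#define PM_CTRL_PME_ENABLE 0x0100 /* PME generation enabled */
#define PM_CTRL_PME_STATUS 0x8000 /* PME pending; writing 1 clears it */

/*
 * Compute the value to write to PMCSR to request power state 'state'
 * (0..3), given the value currently read from the register.
 */
static uint16_t pmcsr_for_state(uint16_t pmcsr, unsigned int state, bool wake)
{
	assert(state <= 3);

	pmcsr &= ~PM_CTRL_STATE_MASK;   /* drop the old state field */
	pmcsr |= (uint16_t)state;       /* request the new state */

	if (state == 0)
		pmcsr |= PM_CTRL_PME_STATUS; /* acknowledge any pending PME */
	else if (wake)
		pmcsr |= PM_CTRL_PME_ENABLE; /* arm wakeup, e.g. Wake-on-LAN */

	return pmcsr;
}
```

For a device currently in D3hot (PMCSR reads 0x0003), requesting D0 yields 0x8000; for a device in D0 (0x0000), requesting D3hot with wakeup armed yields 0x0103.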

 

The function is called in the following situations:

  1. From every place in the driver where a disconnect is required, i.e. recovery has failed and a power cycle is needed: bnx2x_set_power_state(bp, PCI_D3hot);
  2. From the probe function, to bring the device to full power: bnx2x_set_power_state(bp, PCI_D0);
  3. From the remove function, before removing and unregistering the device: bnx2x_set_power_state(bp, PCI_D3hot);

The kernel documentation (https://www.kernel.org/doc/html/latest/power/pci.html) says:

  • D0 and D3hot can be programmed.
  • The D1 and D2 states are device specific and optional (it is not very clear when a driver should request a change to D1 or D2).
  • D0uninitialized and D3cold cannot be programmed: D3cold is the state with power removed, and D0uninitialized is the state the device enters on its own once power is restored or the device is reset.

PM layer helper functions:

There are many PM layer helper functions. The ones below are those most often used in the suspend and resume callbacks:
pci_save_state(pdev);
pci_enable_device(pdev);
pci_disable_device(pdev);
pci_wake_from_d3(pdev, true);
pci_set_power_state(pdev, PCI_D3hot);
pci_restore_state(pdev);
pci_load_saved_state(pdev, mhi_pdev->pci_state);

Testing: 

Devices supporting native PCI PM can usually generate wakeup signals, called Power Management Events (PMEs), to let the kernel know about external events that require the device to be active. After receiving a PME, the kernel is supposed to put the device that sent it into the full-power state. Tools to use for debugging: lspci, setpci

How does PCI device initialization work?

The PCI subsystem's first task related to device power management is to prepare the device for power management and initialize the fields of struct pci_dev used for this purpose.  This happens in two functions defined in drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

 

The below prints are from pci_pm_init()


[fedora@localhost ~]$ lspci |grep <PCI ID>

<PCI ID> Ethernet controller: Intel Technology Device 0000 (rev 01)


[fedora@localhost ~]$ dmesg |grep <PCI ID>

.....

...........

...................

[   38.963312] pci <pci id>: supports D1 D2

[   38.967568] pci <pci id>  PME# supported from D0 D1 D2 D3hot D3cold

......

............

..................

[fedora@localhost ~]$

What does pci_pm_init() do?

  • The first of these functions checks if the device supports native PCI PM and, if that's the case, the offset of its power management capability structure in the configuration space is stored in the pm_cap field of the device's struct pci_dev object.
  • Next, the function checks which PCI low-power states are supported by the device and from which low-power states the device can generate native PCI PMEs.
  • The power management fields of the device's struct pci_dev and the struct device embedded in it are updated accordingly and the generation of PMEs by the device is disabled.
  • The second function checks if the device can be prepared to signal wakeup with the help of the platform firmware, such as the ACPI BIOS.

 At this point the device is ready for power management.
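
The "supports D1 D2" and "PME# supported from ..." lines in the dmesg output above come from decoding the PMC register of the PM capability. The following user-space sketch shows that decoding under a few assumptions: the function names pmc_states() and pmc_pme_states() and the helper append_if() are invented for this example, while the bit values match the PCI_PM_CAP_* definitions in the kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* PMC register bits, with the same values as the kernel's pci_regs.h. */
#define PM_CAP_D1         0x0200 /* D1 state supported */
#define PM_CAP_D2         0x0400 /* D2 state supported */
#define PM_CAP_PME_D0     0x0800 /* PME# can be signalled from D0 */
#define PM_CAP_PME_D1     0x1000
#define PM_CAP_PME_D2     0x2000
#define PM_CAP_PME_D3HOT  0x4000
#define PM_CAP_PME_D3COLD 0x8000

/* Append 'name' to 'buf' if 'cond' is set and the result still fits. */
static void append_if(char *buf, size_t len, int cond, const char *name)
{
	if (cond && strlen(buf) + strlen(name) < len)
		strcat(buf, name);
}

/* List the optional low-power states advertised in PMC, e.g. " D1 D2". */
static void pmc_states(uint16_t pmc, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	append_if(buf, len, pmc & PM_CAP_D1, " D1");
	append_if(buf, len, pmc & PM_CAP_D2, " D2");
}

/* List the states from which the device can signal PME#. */
static void pmc_pme_states(uint16_t pmc, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	append_if(buf, len, pmc & PM_CAP_PME_D0, " D0");
	append_if(buf, len, pmc & PM_CAP_PME_D1, " D1");
	append_if(buf, len, pmc & PM_CAP_PME_D2, " D2");
	append_if(buf, len, pmc & PM_CAP_PME_D3HOT, " D3hot");
	append_if(buf, len, pmc & PM_CAP_PME_D3COLD, " D3cold");
}
```

A PMC value with all of these bits set, such as 0xfe03, decodes to the same two lists that appear in the log lines quoted earlier.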

References:

PCI Power Management

Driver References:

The .suspend and .resume callbacks are based on the Modem Host Interface (MHI) PCI controller driver.
The set-power-state function is based on the Broadcom 10-Gigabit Ethernet driver (bnx2x).
